chore: add basic eval #766

Merged
OrKoN merged 1 commit into main from orkon/eval
Jan 14, 2026

OrKoN commented Jan 14, 2026

This PR adds a tool based on the Node test runner that runs a loop to see which tools a model chooses for a given prompt. The expectations are encoded for each prompt. Run `npm run eval` to get results. Currently, only Gemini is supported, and it needs an API key.
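
As a rough illustration of what "expectations encoded for each prompt" could look like, here is a minimal sketch. The `Scenario` and `ToolCall` shapes, the `runScenario` and `checkScenario` helpers, and the tool name in the stub are hypothetical and are not taken from `scripts/eval_gemini.ts`. The model is replaced by a synchronous stub so the example runs without an API key; a real model call would be asynchronous.

```typescript
// Hypothetical shapes; the real scenario format in scripts/eval_scenarios/ may differ.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface Scenario {
  prompt: string;
  maxTurns: number;
  // Tool names the model is expected to call, in order.
  expectedTools: string[];
}

// A model is anything that, given the prompt and the calls made so far,
// returns the next tool call, or null when it is done.
type Model = (prompt: string, history: ToolCall[]) => ToolCall | null;

// Runs the loop and collects the tool calls the model chose.
function runScenario(scenario: Scenario, model: Model): ToolCall[] {
  const history: ToolCall[] = [];
  for (let turn = 0; turn < scenario.maxTurns; turn++) {
    const call = model(scenario.prompt, history);
    if (call === null) break;
    history.push(call);
  }
  return history;
}

// Compares the chosen tool names against the scenario's expectation.
function checkScenario(scenario: Scenario, calls: ToolCall[]): boolean {
  const names = calls.map(c => c.name);
  return (
    names.length === scenario.expectedTools.length &&
    names.every((name, i) => name === scenario.expectedTools[i])
  );
}

// Stub standing in for a real model: makes one call, then stops.
const stubModel: Model = (_prompt, history) =>
  history.length === 0
    ? {name: 'navigate_page', args: {url: 'https://example.com'}}
    : null;

const scenario: Scenario = {
  prompt: 'Open example.com',
  maxTurns: 3,
  expectedTools: ['navigate_page'],
};

const calls = runScenario(scenario, stubModel);
console.log(checkScenario(scenario, calls) ? 'pass' : 'fail');
```

Keeping the expectation next to the prompt means each scenario file is self-describing, and the check can be wrapped in a test case per scenario.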

Comment thread scripts/eval_gemini.ts Outdated
Comment thread scripts/eval_gemini.ts Outdated
Comment thread scripts/eval_gemini.ts Outdated
Comment thread scripts/eval_gemini.ts Outdated
OrKoN force-pushed the orkon/eval branch 2 times, most recently from 65d24ca to 33e5460 January 14, 2026 10:31
OrKoN enabled auto-merge January 14, 2026 10:32
Comment thread scripts/eval_scenarios/navigation_test.ts Outdated
Comment thread scripts/eval_scenarios/performance_test.ts Outdated
Comment thread scripts/eval_gemini.ts Outdated
Lightning00Blade left a comment

LGTM with some minor comments

OrKoN added this pull request to the merge queue Jan 14, 2026
Merged via the queue into main with commit 257b994 Jan 14, 2026
20 checks passed
OrKoN deleted the orkon/eval branch January 14, 2026 14:19
github-merge-queue bot pushed a commit that referenced this pull request Jan 19, 2026
This PR implements the watchdog process architecture for the telemetry
system. It moves the `ClearcutSender` execution to a dedicated child
process, ensuring that events—especially shutdown events—are reliably
transmitted even if the main server process terminates abruptly.

Added an e2e test that runs the server, checks the log file, confirms
that the telemetry logs exist, and verifies that the watchdog process is
correctly killed after sending the shutdown event once the main process
is killed.

**Implementation Roadmap:**
This is the fourth in a series of PRs designed to implement the
telemetry system:
1. **CLI & Opt-out Mechanism (Merged, #757):**
    * Added `--usage-statistics` flag and transparency logging.
2. **Logger Scaffolding & Integration (Merged, #758):**
    * **`ClearcutLogger`**: Implemented the main logging entry point.
    * **One-way Data Flow**: Integrated `logToolInvocation` and
      `logServerStart` hooks into `main.ts` to capture events.
    * **`ClearcutSender`**: Introduced a transport abstraction.
    * **Type Definitions**: Added TypeScript definitions for the telemetry
      Protocol Buffer messages.
3. **Persistence Layer (Merged, #766):**
    * **`FilePersistence`**: Implemented a local file-based state manager to
      persist the `lastActive` timestamp.
    * **Daily Active Logic**: Integrated persistence into `ClearcutLogger`
      to automatically detect and log `daily_active` events (with
      `days_since_last_active` calculation) via `logDailyActiveIfNeeded`.
4. **Watchdog Process Architecture (This PR):**
    * **`WatchdogClient`**: Added a client-side wrapper to spawn and
      communicate with the watchdog process via `stdin`.
    * **`watchdog/main.ts`**: Created the entry point for the watchdog
      process. It listens for IPC messages and uses `ClearcutSender` to
      transmit events.
    * **Reliable Shutdown**: The watchdog monitors the parent process and
      guarantees a `shutdown` event is sent when the parent exits or crashes
      (detecting `stdin` closure).
    * **Refactoring**: Moved `ClearcutSender` to the `watchdog` directory
      and updated `ClearcutLogger` to delegate event sending to the
      `WatchdogClient`.
5. **Transport, Batching & Retries (Next):**
    * Finalize `ClearcutSender` with actual HTTP transport logic, including
      event batching and exponential backoff retries.
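
The `stdin`-closure mechanism from item 4 can be sketched roughly as follows. This is not the code from `watchdog/main.ts`: the `WatchdogMessage` shape, the `LineSource` interface, `runWatchdog`, and the line-delimited JSON framing are all assumptions made for illustration. In a real watchdog, a thin adapter over the process's `stdin` would play the role of `LineSource`; here an in-memory fake is used so the sketch runs on its own.

```typescript
// Hypothetical message shape; the real IPC format is not shown in this text.
interface WatchdogMessage {
  type: string;
  payload?: unknown;
}

// Minimal view of a line-oriented input with a close notification.
interface LineSource {
  on(event: 'line' | 'close', listener: (line: string) => void): unknown;
}

type Send = (message: WatchdogMessage) => void;

// Forwards each JSON line to `send`. When the input closes (the parent
// exited or crashed), sends a `shutdown` event unless one was already sent.
function runWatchdog(source: LineSource, send: Send): void {
  let shutdownSent = false;

  source.on('line', (line: string) => {
    let message: WatchdogMessage;
    try {
      message = JSON.parse(line) as WatchdogMessage;
    } catch {
      return; // Ignore malformed lines in this sketch.
    }
    if (message.type === 'shutdown') shutdownSent = true;
    send(message);
  });

  source.on('close', () => {
    if (shutdownSent) return;
    shutdownSent = true;
    send({type: 'shutdown'});
  });
}

// In-memory source used to exercise the sketch without a real child process.
class FakeSource implements LineSource {
  private listeners: {[event: string]: Array<(line: string) => void>} = {};

  on(event: 'line' | 'close', listener: (line: string) => void): void {
    (this.listeners[event] ??= []).push(listener);
  }

  emit(event: 'line' | 'close', line = ''): void {
    for (const listener of this.listeners[event] ?? []) listener(line);
  }
}

const sent: WatchdogMessage[] = [];
const source = new FakeSource();
runWatchdog(source, message => sent.push(message));
source.emit('line', '{"type":"server_start"}');
source.emit('close'); // Simulates the parent process dying.
console.log(sent.map(m => m.type).join(','));
```

Because the operating system closes the pipe when the parent dies for any reason, the child can react even when the parent had no chance to run its own shutdown code.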
github-merge-queue bot pushed a commit that referenced this pull request Jan 27, 2026
… by default (#805)

This PR completes the telemetry system by implementing the transport
layer for `ClearcutSender`. It enables actual HTTP communication with
the Clearcut backend, handling event batching, rate limiting, and
reliable delivery, including robust shutdown handling.

**Key Changes:**
* **HTTP Transport**: Implemented `fetch`-based transport sending `POST`
  requests to the Clearcut HTTP server.
* **Event Batching**: Events are now buffered and flushed periodically
  (default: 15 minutes) or on shutdown.
* **Reliability & Rate Limiting**:
    * **Server-Side Backoff**: Respects `next_request_wait_millis` from
      server responses to handle rate limiting dynamically.
    * **Transient Error Retries**: Failed requests (5xx, 429) result in
      events being requeued for the next flush.
    * **Request Timeouts**: Enforced 30s timeout on requests to prevent
      hanging processes.
    * **Session Rotation**: Automatically rotates session IDs every 24
      hours.
* **Safety & Stability**:
    * **Buffer Overflow Protection**: Caps the buffer at 1000 events to
      prevent memory leaks, dropping oldest events if necessary.
    * **Optimistic Removal**: Prevents race conditions and duplicate events
      during shutdown by optimistically removing events from the buffer before
      sending.
* **Testing Improvements**:
    * **E2E Robustness**: Updated E2E tests to use a mock web server instead
      of relying on the logger to log specific lines.
**Implementation Roadmap:**
These changes finalize the planned telemetry architecture:
1. **CLI & Opt-out Mechanism (Merged, #757)**
2. **Logger Scaffolding & Integration (Merged, #758)**
3. **Persistence Layer (Merged, #766)**
4. **Watchdog Process Architecture (Merged, #769)**
5. **Transport, Batching & Retries (This PR):**
    * Finalized `ClearcutSender` with HTTP transport, batching, and
      server-directed backoff strategies.
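
One way to read "server-directed backoff" is sketched below. How the real sender combines the default interval with `next_request_wait_millis` is not stated in this text, so taking the larger of the two is an assumption, as are the `LogResponse` type, the `nextFlushDelayMs` helper, and the tolerance for a string-encoded value.

```typescript
const DEFAULT_FLUSH_INTERVAL_MS = 15 * 60 * 1000; // 15 minutes, per the PR text.

// Hypothetical subset of a server response; the field may arrive as a
// number or as a numeric string depending on the JSON encoding.
interface LogResponse {
  next_request_wait_millis?: number | string;
}

// Assumption: never flush sooner than the server asks, and never sooner
// than the default interval.
function nextFlushDelayMs(
  response: LogResponse,
  defaultMs: number = DEFAULT_FLUSH_INTERVAL_MS,
): number {
  const raw = Number(response.next_request_wait_millis ?? 0);
  const serverWait = Number.isFinite(raw) && raw > 0 ? raw : 0;
  return Math.max(defaultMs, serverWait);
}

console.log(nextFlushDelayMs({next_request_wait_millis: 3_600_000}));
```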

---------

Co-authored-by: Alex Rudenko <alexrudenko@chromium.org>
2 participants