# BrowserOS Eval
Evaluation framework for BrowserOS browser automation agents. Runs tasks from standard datasets (WebVoyager, Mind2Web, AGI SDK / REAL Bench, WebArena-Infinity, WebBench), captures trajectories with screenshots, and grades results automatically.
## Prerequisites

- BrowserOS binary at `/Applications/BrowserOS.app` (macOS), or `BROWSEROS_BINARY` pointing at it
- Bun runtime
- API keys for your LLM provider (and `CLAUDE_CODE_OAUTH_TOKEN` if you use `performance_grader`)
- Python 3.10+ with `agisdk` for AGI SDK / REAL Bench grading. Set `BROWSEROS_EVAL_PYTHON` if your default `python3` is older.
## Quick Start

```sh
cd apps/eval
cp .env.example .env.development
# Edit .env.development with your keys, then:
bun run eval
```
Opens the eval dashboard at http://localhost:9900 in config mode. From there: load a preset, edit settings, click Run.
## CLI mode

```sh
bun run eval -c configs/legacy/browseros-agent-weekly.json
bun run eval suite --config configs/legacy/browseros-agent-weekly.json --publish r2
```
These run immediately; the dashboard is still available at http://localhost:9900 for live progress.
The `suite` command is the workflow-compatible full loop: execute tasks, run graders, write artifacts, and optionally publish to R2. The old `-c` form remains supported during migration.
```sh
bun run eval run --config configs/legacy/browseros-agent-weekly.json
bun run eval suite --suite configs/suites/agisdk-daily-10.json --variant kimi-fireworks --publish r2
bun run eval grade --run results/browseros-agent-weekly/2026-04-29-1430
bun run eval publish --run results/browseros-agent-weekly/2026-04-29-1430 --target r2
```
Config files live in two groups:

```
configs/legacy/   # Complete EvalConfig files used by older workflows and the dashboard
configs/suites/   # Suite definitions; model/provider comes from CLI flags or env
```
Suite mode takes model settings from CLI flags first, then env:

```sh
EVAL_VARIANT=kimi-fireworks \
EVAL_AGENT_PROVIDER=openai-compatible \
EVAL_AGENT_MODEL=accounts/fireworks/models/kimi-k2p5 \
EVAL_AGENT_API_KEY=$FIREWORKS_API_KEY \
EVAL_AGENT_BASE_URL=https://api.fireworks.ai/inference/v1 \
bun run eval suite --suite configs/suites/agisdk-daily-10.json --publish r2
```
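That flags-first, env-second precedence can be sketched as a small resolver. This is an illustrative sketch only (the function name and `AgentSettings` shape are assumptions, not the runner's actual API):

```typescript
// Hypothetical sketch of suite-mode model resolution:
// CLI flags win, then the EVAL_AGENT_* environment variables.
interface AgentSettings {
  provider?: string;
  model?: string;
  apiKey?: string;
  baseUrl?: string;
}

function resolveAgentSettings(
  flags: AgentSettings,
  env: Record<string, string | undefined>
): AgentSettings {
  return {
    provider: flags.provider ?? env["EVAL_AGENT_PROVIDER"],
    model: flags.model ?? env["EVAL_AGENT_MODEL"],
    apiKey: flags.apiKey ?? env["EVAL_AGENT_API_KEY"],
    baseUrl: flags.baseUrl ?? env["EVAL_AGENT_BASE_URL"],
  };
}
```

With this ordering, a `--variant` run can pin one field on the command line while the rest comes from the environment.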
## Suites and variants

A suite is what we run: the task dataset, graders, worker count, timeout, and browser settings. For example, `agisdk-daily-10` means "run these 10 AGI SDK tasks and grade them with `agisdk_state_diff`."

A variant is the model setup we are testing on that suite. `EVAL_VARIANT` is just the human-readable name for that setup; the actual model connection still comes from `EVAL_AGENT_PROVIDER`, `EVAL_AGENT_MODEL`, `EVAL_AGENT_API_KEY`, and `EVAL_AGENT_BASE_URL`.
This lets us run the same suite against multiple model setups without copying the benchmark config:

- `agisdk-daily-10` + `kimi-fireworks`
- `agisdk-daily-10` + `claude-opus`
- `agisdk-daily-10` + `clado-action-000159`
For orchestrator-executor suites, there can also be an executor model/backend. The `EVAL_AGENT_*` vars describe the main agent or orchestrator; the optional `EVAL_EXECUTOR_*` or `CLADO_ACTION_*` vars describe the delegated executor.
## Agent types

| Type | Description |
|---|---|
| `single` | Single LLM agent driven by the BrowserOS tool loop (CDP) |
| `orchestrator-executor` | High-level orchestrator + per-step executor (LLM or Clado visual model) |
| `claude-code` | External Claude Code CLI driven through BrowserOS MCP |
### Single agent

```json
{
  "agent": {
    "type": "single",
    "provider": "openai-compatible",
    "model": "moonshotai/kimi-k2.5",
    "apiKey": "OPENROUTER_API_KEY",
    "baseUrl": "https://openrouter.ai/api/v1",
    "supportsImages": true
  }
}
```
### Orchestrator-Executor

The orchestrator works with any LLM provider. The executor can be another LLM, or the Clado action visual model, which takes screenshots and predicts click/type/scroll coordinates.

```json
{
  "agent": {
    "type": "orchestrator-executor",
    "orchestrator": {
      "provider": "openai-compatible",
      "model": "accounts/fireworks/models/kimi-k2p5",
      "apiKey": "FIREWORKS_API_KEY",
      "baseUrl": "https://api.fireworks.ai/inference/v1"
    },
    "executor": {
      "provider": "clado-action",
      "model": "Qwen3.5-35B-A3B-action-000159-merged",
      "apiKey": "",
      "baseUrl": "https://clado-ai--clado-browseros-action-000159-merged-actionmod-f4a6ef.modal.run"
    }
  }
}
```
### Claude Code

Claude Code runs as an external `claude -p` subprocess. The eval runner passes a task-scoped MCP config that points Claude Code at the active worker's BrowserOS MCP endpoint, while the eval capture layer still saves messages, screenshots, trajectory metadata, and grader outputs.

```json
{
  "agent": {
    "type": "claude-code",
    "model": "opus"
  }
}
```

```sh
BROWSEROS_EVAL_PYTHON=/path/to/python3 bun run eval run --config configs/legacy/claude-code-agisdk-real.json
bun run eval suite --config configs/legacy/claude-code-agisdk-real.json --publish r2
```
## Graders

| Name | Description |
|---|---|
| `performance_grader` | Multi-axis grader running on the Claude Agent SDK (uses its own credentials via `CLAUDE_CODE_OAUTH_TOKEN`) |
| `agisdk_state_diff` | AGI SDK / REAL Bench environment state-diff grader (deterministic) |
| `infinity_state` | WebArena-Infinity verifier-script grader (deterministic) |
Set `graders` in your config to override the per-task `graders` field from the dataset:

```json
"graders": ["performance_grader"]
```
## Configuration reference

### API keys

The `apiKey` field supports two formats:

- Env var name: `"OPENAI_API_KEY"` is resolved from `.env.development` at runtime
- Direct value: `"sk-xxxxx"` is used as-is (not recommended)
### Environment variables

| Variable | Used for |
|---|---|
| `EVAL_AGENT_PROVIDER`, `EVAL_AGENT_MODEL`, `EVAL_AGENT_API_KEY`, `EVAL_AGENT_BASE_URL`, `EVAL_AGENT_SUPPORTS_IMAGES` | Suite variant model selection |
| `FIREWORKS_API_KEY`, `OPENROUTER_API_KEY`, `ANTHROPIC_API_KEY`, and other provider-specific keys | Config-file or provider-backed model calls |
| `EVAL_EXECUTOR_MODEL`, `EVAL_EXECUTOR_API_KEY`, `EVAL_EXECUTOR_BASE_URL` | Suite-mode orchestrator executor override |
| `CLADO_ACTION_MODEL`, `CLADO_ACTION_API_KEY`, `CLADO_ACTION_BASE_URL` | Clado executor defaults |
| `BROWSEROS_BINARY` | BrowserOS binary path in CI/local smoke runs |
| `BROWSEROS_SERVER_URL` | Optional grader MCP URL override |
| `BROWSEROS_EVAL_PYTHON` | Optional Python interpreter for JSON graders such as `agisdk_state_diff` |
| `WEBARENA_INFINITY_DIR` | Local WebArena-Infinity checkout for Infinity tasks |
| `NOPECHA_API_KEY` | CAPTCHA solver extension |
| `EVAL_R2_ACCOUNT_ID`, `EVAL_R2_ACCESS_KEY_ID`, `EVAL_R2_SECRET_ACCESS_KEY`, `EVAL_R2_BUCKET`, `EVAL_R2_CDN_BASE_URL` | R2 upload and viewer URL |
### Supported providers

| Provider | `provider` value | Requires `baseUrl` |
|---|---|---|
| OpenAI | `openai` | No |
| Anthropic | `anthropic` | No |
| Google | `google` | No |
| Azure OpenAI | `azure` | Yes |
| AWS Bedrock | `bedrock` | No |
| OpenRouter | `openrouter` | No |
| Fireworks, Together, etc. | `openai-compatible` | Yes |
| Ollama | `ollama` | No |
| Clado Action (executor only) | `clado-action` | Yes |
## R2 publishing

`suite --config ... --publish r2` and `publish --target r2` upload the run artifacts plus `viewer.html` to the viewer-compatible R2 layout:

```sh
export EVAL_R2_ACCOUNT_ID=...
export EVAL_R2_ACCESS_KEY_ID=...
export EVAL_R2_SECRET_ACCESS_KEY=...
export EVAL_R2_BUCKET=browseros-eval
export EVAL_R2_CDN_BASE_URL=https://eval.browseros.com
```

`EVAL_R2_CDN_BASE_URL` must be a public R2 custom domain, r2.dev URL, or Worker URL. Do not set it to the private `*.r2.cloudflarestorage.com` S3 API endpoint.

Published runs are available at `EVAL_R2_CDN_BASE_URL/viewer.html?run=<run-id>`.
## BrowserOS infrastructure

```json
"browseros": {
  "server_url": "http://127.0.0.1:9110",
  "base_cdp_port": 9010,
  "base_server_port": 9110,
  "base_extension_port": 9310,
  "load_extensions": false,
  "headless": false
}
```

Each worker gets its own Chrome instance. Worker N uses `base_port + N` for CDP and server ports.
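The per-worker port math is a fixed offset from each base port. A sketch, using the base values from the config above (the helper itself is illustrative, not the runner's code):

```typescript
// Worker N gets base port + N in each port family (CDP, server, extension).
interface WorkerPorts {
  cdp: number;
  server: number;
  extension: number;
}

function portsForWorker(
  workerIndex: number,
  base = { cdp: 9010, server: 9110, extension: 9310 } // defaults from the config above
): WorkerPorts {
  return {
    cdp: base.cdp + workerIndex,
    server: base.server + workerIndex,
    extension: base.extension + workerIndex,
  };
}
```

So with 3 workers on the defaults, worker 0 uses server port 9110, worker 1 uses 9111, and worker 2 uses 9112.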
## Execution settings

| Field | Description | Default |
|---|---|---|
| `num_workers` | Parallel workers (each gets its own Chrome) | 1 |
| `timeout_ms` | Per-task timeout in ms | 1800000 (30 min) |
| `restart_server_per_task` | Restart Chrome between tasks (cleaner state, slower) | false |
## Datasets

| File | Tasks | Description |
|---|---|---|
| `agisdk-daily-10.jsonl` | 10 | Daily AGI SDK / REAL Bench subset |
| `webvoyager.jsonl` | 643 | Full WebVoyager benchmark |
| `mind2web.jsonl` | 300 | Online-Mind2Web |
| `webbench-{0,1,2}of4-50.jsonl` | 50 each | WebBench shards (50-task subsets) |
| `agisdk-real-smoke.jsonl` | 1 | AGI SDK / REAL Bench smoke task |
| `agisdk-real.jsonl` | 36 | AGI SDK / REAL Bench (action-only tasks) |
| `webarena-infinity-hard-50.jsonl` | 50 | WebArena-Infinity hard set |
| `browsecomp-medium-hard-50.jsonl` | 50 | BrowseComp medium-hard |
| `browsecomp-very-hard-50.jsonl` | 50 | BrowseComp very-hard |
Task format (JSONL, one per line):

```json
{
  "query_id": "Amazon--0",
  "dataset": "webvoyager",
  "query": "Search an Xbox Wireless controller with green color and rated above 4 stars.",
  "graders": ["performance_grader"],
  "start_url": "https://www.amazon.com/",
  "metadata": { "original_task_id": "Amazon--0", "website": "Amazon" }
}
```
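Since each line is an independent JSON object, reading a dataset is line-by-line parsing. A minimal sketch (the field names follow the format above; the validation rules and function name are illustrative):

```typescript
// Shape of one dataset record, per the task format above.
interface EvalTask {
  query_id: string;
  dataset: string;
  query: string;
  graders?: string[];
  start_url?: string;
  metadata?: Record<string, unknown>;
}

// Parse JSONL text into tasks, skipping blank lines and rejecting
// records that lack the required identifying fields.
function parseTasks(jsonl: string): EvalTask[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => {
      const task = JSON.parse(line) as EvalTask;
      if (!task.query_id || !task.query) {
        throw new Error(`invalid task record: ${line}`);
      }
      return task;
    });
}
```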
## Output

Results are saved to `output_dir`:

```
results/
  browseros-agent-weekly/
    2026-04-29-1430/
      Amazon--0/
        attempt.json         # Stable attempt summary for viewer/reporting
        metadata.json        # Task result, timing, grader scores
        grades.json          # Compact grader results
        messages.jsonl       # Full message log
        grader-artifacts/    # Grader-specific inputs/outputs/stderr
        screenshots/
          001.png            # Step-by-step screenshots
          002.png
      summary.json           # Aggregate pass rates
```
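The aggregate pass rates in `summary.json` can be recomputed from the per-task results. A sketch under the assumption that each task result reduces to a boolean pass/fail (the `success` field name here is illustrative, not the actual `metadata.json` schema):

```typescript
// Minimal per-task result: did the task pass its graders?
interface TaskResult {
  success: boolean;
}

// Fraction of tasks that passed; an empty run counts as 0.
function passRate(results: TaskResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.success).length;
  return passed / results.length;
}
```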
R2 publishing preserves the task files under `runs/<run-id>/...`, writes `runs/<run-id>/manifest.json`, and uploads `viewer.html` at the bucket root. The viewer URL is `EVAL_R2_CDN_BASE_URL/viewer.html?run=<run-id>`.
## R2 viewer manifest

`runs/<run-id>/manifest.json` is the source of truth for the public viewer. New manifests include `schemaVersion: 2`, and each task includes explicit artifact paths:

```json
{
  "schemaVersion": 2,
  "runId": "agisdk-real-smoke-2026-04-30-0000",
  "tasks": [
    {
      "queryId": "agisdk-dashdish-10",
      "paths": {
        "metadata": "tasks/agisdk-dashdish-10/metadata.json",
        "messages": "tasks/agisdk-dashdish-10/messages.jsonl",
        "grades": "tasks/agisdk-dashdish-10/grades.json",
        "trace": "tasks/agisdk-dashdish-10/trace.jsonl",
        "screenshots": "tasks/agisdk-dashdish-10/screenshots",
        "graderArtifacts": "tasks/agisdk-dashdish-10/grader-artifacts"
      }
    }
  ]
}
```
The static viewer uses `task.paths` when present. Older uploaded runs without `schemaVersion` or `task.paths` still work through the legacy inferred layout: `runs/<run-id>/<task-id>/metadata.json`, `messages.jsonl`, and `screenshots/<n>.png`.

Manifest paths are stable artifact locations, not a guarantee that every optional artifact exists for every task. For example, `attempt.json`, `trace.jsonl`, or grader artifact directories may be absent when that artifact was not produced by the run.
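A viewer-side sketch of that fallback, with paths expressed relative to the `runs/<run-id>/` prefix (the types and helper are assumed shapes; the actual viewer code may differ):

```typescript
// Subset of a manifest task entry: v2 manifests carry explicit paths,
// legacy manifests carry only the task id.
interface ManifestTask {
  queryId: string;
  paths?: { metadata?: string };
}

// Resolve a task's metadata location relative to runs/<run-id>/:
// prefer the explicit schemaVersion-2 path, else infer the legacy
// <task-id>/metadata.json layout.
function metadataPath(task: ManifestTask): string {
  return task.paths?.metadata ?? `${task.queryId}/metadata.json`;
}
```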
## Troubleshooting

**BrowserOS not found:** the runner expects `/Applications/BrowserOS.app/Contents/MacOS/BrowserOS`. Set `BROWSEROS_BINARY` to override.

**Port conflicts:** each worker uses `base_port + workerIndex`, so 3 workers on base 9110 use ports 9110, 9111, and 9112. Stop other BrowserOS instances first.

**API key not resolving:** if your config has `"apiKey": "OPENAI_API_KEY"`, ensure that env var is set in `.env.development`.

**Tasks timing out:** increase `timeout_ms`. The default is 30 minutes.

**Headless vs headed:** set `"headless": false` to watch Chrome in real time.