# BrowserOS Eval

Evaluation framework for benchmarking BrowserOS browser automation agents. Runs tasks from standard datasets (WebVoyager, Mind2Web), captures trajectories with screenshots, and grades results automatically.
## Prerequisites

- BrowserOS binary installed at `/Applications/BrowserOS.app` (macOS)
- Bun runtime
- API keys for your chosen LLM provider and grader model
## Quick Start

### 1. Set up environment

```bash
cd apps/eval
```

Edit `.env.development` and add your API keys:

```bash
# Pick ONE provider for the orchestrator (whichever you have access to)
OPENAI_API_KEY=sk-xxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxx
FIREWORKS_API_KEY=fw_xxxxx
GOOGLE_API_KEY=AIza-xxxxx

# For grading results (OpenRouter recommended — gives access to many models)
OPENROUTER_API_KEY=sk-or-v1-xxxxx
```

### 2. Launch the dashboard

```bash
bun run eval
```

Opens the Eval Dashboard at http://localhost:9900 in config mode.
### 3. Configure and run

From the dashboard:

- **Load a preset** — select from the dropdown or click **Load File** to import a config JSON
- **Edit settings** — change agent type, provider, model, API keys, dataset, workers, and timeouts
- **Save Config** — export your configuration for reuse
- **Click Run** — starts the evaluation with live progress
### Alternative: Run from CLI

```bash
bun run eval -c configs/orchestrator-executor-clado-test.json
```

Runs immediately. The dashboard is still available at http://localhost:9900 for live progress.
## Agent Types

### Orchestrator-Executor with Clado

The recommended architecture for visual model evals. Two tiers:

- **Orchestrator** — an LLM that plans and issues high-level instructions
- **Executor** — the Clado Action visual model that takes screenshots and predicts click/type/scroll coordinates

The orchestrator works with any LLM provider. Pick whichever you have access to:
**OpenAI orchestrator**

```json
{
  "agent": {
    "type": "orchestrator-executor",
    "orchestrator": {
      "provider": "openai",
      "model": "gpt-4o",
      "apiKey": "OPENAI_API_KEY"
    },
    "executor": {
      "provider": "clado-action",
      "model": "qwen3-vl-30b-a3b-instruct",
      "apiKey": "",
      "baseUrl": "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run"
    }
  },
  "dataset": "../data/webvoyager_e2e_test.jsonl",
  "output_dir": "../results/oe-clado-openai",
  "num_workers": 3,
  "browseros": {
    "server_url": "http://127.0.0.1:9110",
    "base_cdp_port": 9010,
    "base_server_port": 9110,
    "base_extension_port": 9310,
    "headless": true
  },
  "grader_api_key_env": "OPENROUTER_API_KEY",
  "grader_base_url": "https://openrouter.ai/api/v1",
  "grader_model": "openai/gpt-4.1",
  "timeout_ms": 1200000
}
```
**Anthropic orchestrator**

```json
"orchestrator": {
  "provider": "anthropic",
  "model": "claude-sonnet-4-20250514",
  "apiKey": "ANTHROPIC_API_KEY"
}
```
**Google orchestrator**

```json
"orchestrator": {
  "provider": "google",
  "model": "gemini-2.0-flash",
  "apiKey": "GOOGLE_API_KEY"
}
```
**Fireworks orchestrator (OpenAI-compatible)**

```json
"orchestrator": {
  "provider": "openai-compatible",
  "model": "accounts/fireworks/models/kimi-k2p5",
  "apiKey": "FIREWORKS_API_KEY",
  "baseUrl": "https://api.fireworks.ai/inference/v1"
}
```
The executor config stays the same across all orchestrator providers — it always uses the Clado action model.
### Other Agent Types

| Type | Description | Example config |
|---|---|---|
| `single` | Single LLM agent via Gemini CLI + MCP | `webvoyager-test.json` |
| `tool-loop` | AI SDK tool loop, connects via CDP | `tool-loop-test.json` |
| `gemini-computer-use` | Google native computer use API | `gemini-computer-use.json` |
| `yutori-navigator` | Yutori N1 visual model | `yutori-navigator.json` |
## Configuration Reference

### API keys

The `apiKey` field supports two formats:

- **Env var name:** `"OPENAI_API_KEY"` — resolved from `.env.development` at runtime
- **Direct value:** `"sk-xxxxx"` — used as-is (not recommended; prefer env vars)
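The resolution rule above can be sketched as follows. This is a minimal illustration, not the framework's actual implementation; in particular, the "looks like an env var name" heuristic is an assumption.

```typescript
// Sketch of the apiKey resolution rule: ALL_CAPS values are treated as
// env var names and looked up; anything else is used as a literal key.
function resolveApiKey(
  apiKey: string,
  env: Record<string, string | undefined>,
): string {
  const looksLikeEnvVarName = /^[A-Z][A-Z0-9_]*$/.test(apiKey);
  if (looksLikeEnvVarName) {
    const value = env[apiKey];
    if (!value) throw new Error(`Env var ${apiKey} is not set`);
    return value;
  }
  return apiKey; // direct value, used as-is
}
```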
### Supported providers

| Provider | `provider` value | Requires `baseUrl` |
|---|---|---|
| OpenAI | `openai` | No |
| Anthropic | `anthropic` | No |
| Google | `google` | No |
| Azure OpenAI | `azure` | Yes |
| AWS Bedrock | `bedrock` | No (uses `region`, `accessKeyId`, `secretAccessKey`) |
| OpenRouter | `openrouter` | No |
| Fireworks, Together, etc. | `openai-compatible` | Yes |
| Ollama | `ollama` | No |
| Clado Action (executor only) | `clado-action` | Yes |
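The `baseUrl` requirement in the table can be checked up front. The set below is copied from the table; the validation helper itself is an illustrative sketch, not the framework's code.

```typescript
// Providers that require a baseUrl, per the table above.
const PROVIDERS_REQUIRING_BASE_URL = new Set([
  "azure",
  "openai-compatible",
  "clado-action",
]);

// Hypothetical helper: fail fast if a config omits a required baseUrl.
function validateProviderConfig(provider: string, baseUrl?: string): void {
  if (PROVIDERS_REQUIRING_BASE_URL.has(provider) && !baseUrl) {
    throw new Error(`Provider "${provider}" requires a baseUrl`);
  }
}
```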
### BrowserOS infrastructure

```json
"browseros": {
  "server_url": "http://127.0.0.1:9110",
  "base_cdp_port": 9010,
  "base_server_port": 9110,
  "base_extension_port": 9310,
  "load_extensions": false,
  "headless": true
}
```

Each worker gets its own Chrome instance. Worker N uses `base_port + N` for CDP and server ports. `base_extension_port` is still reserved as a legacy BrowserOS launch argument for compatibility with Chromium builds that still pass it.
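The per-worker port scheme can be sketched like this (an illustrative helper; the function and field names are assumptions, not the framework's API):

```typescript
// Worker N is offset from each base port by N, so workers never share
// a CDP, server, or extension port.
interface WorkerPorts {
  cdpPort: number;
  serverPort: number;
  extensionPort: number;
}

function portsForWorker(
  baseCdpPort: number,
  baseServerPort: number,
  baseExtensionPort: number,
  workerIndex: number,
): WorkerPorts {
  return {
    cdpPort: baseCdpPort + workerIndex,
    serverPort: baseServerPort + workerIndex,
    extensionPort: baseExtensionPort + workerIndex,
  };
}
```

With the default bases, worker 2 would land on 9012/9112/9312.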
### Execution settings

| Field | Description | Default |
|---|---|---|
| `num_workers` | Parallel workers (each gets its own Chrome) | `1` |
| `timeout_ms` | Per-task timeout in ms | `900000` (15 min) |
| `restart_server_per_task` | Restart Chrome between tasks (cleaner state, slower) | `false` |
### Grading

Results are auto-graded after each task. The grader uses an LLM judge.

| Field | Description |
|---|---|
| `grader_model` | Model used for grading (e.g., `openai/gpt-4.1`) |
| `grader_api_key_env` | Env var name for the grader API key |
| `grader_base_url` | API endpoint (e.g., `https://openrouter.ai/api/v1`) |
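As a hedged illustration, a judge request against an OpenAI-compatible endpoint such as `grader_base_url` could be assembled like this. The prompt wording and function name are assumptions, not the framework's actual grader.

```typescript
// Build a chat-completions request body for an LLM judge.
// Illustrative only: the real grader's prompt and schema may differ.
function buildGraderRequest(
  model: string,
  task: string,
  trajectorySummary: string,
) {
  return {
    model,
    messages: [
      {
        role: "system",
        content:
          "You judge whether a browser agent completed the task. Reply PASS or FAIL.",
      },
      {
        role: "user",
        content: `Task: ${task}\n\nTrajectory:\n${trajectorySummary}`,
      },
    ],
  };
}
```

The body would then be POSTed to `${grader_base_url}/chat/completions` with the key resolved from `grader_api_key_env`.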
## Datasets

| File | Tasks | Description |
|---|---|---|
| `webvoyager_e2e_test.jsonl` | 10 | WebVoyager test subset (quick smoke test) |
| `webvoyager.jsonl` | 643 | Full WebVoyager benchmark |
| `mind2web_e2e_test.jsonl` | 10 | Mind2Web test subset |
| `mind2web.jsonl` | 300 | Full Mind2Web benchmark |
Task format (JSONL, one task per line):

```json
{
  "query_id": "Amazon--0",
  "dataset": "webvoyager",
  "query": "Search an Xbox Wireless controller with green color and rated above 4 stars.",
  "graders": ["webvoyager_grader", "fara_combined"],
  "start_url": "https://www.amazon.com/",
  "metadata": { "original_task_id": "Amazon--0", "website": "Amazon" }
}
```
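For illustration, tasks in this format can be loaded with a small JSONL parser. This is a sketch: the `EvalTask` type mirrors the fields above but is not the framework's actual type.

```typescript
// Shape of one task line, mirroring the example above (illustrative).
interface EvalTask {
  query_id: string;
  dataset: string;
  query: string;
  graders: string[];
  start_url: string;
  metadata?: Record<string, unknown>;
}

// Parse a JSONL string: one JSON object per non-empty line.
function parseTasks(jsonl: string): EvalTask[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as EvalTask);
}
```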
## Output

Results are saved to `output_dir`:

```
results/
  oe-clado-openai/
    Amazon--0/
      metadata.json       # Task result, timing, grader scores
      messages.jsonl      # Full message log
      screenshots/
        001.png           # Step-by-step screenshots
        002.png
    summary.json          # Aggregate pass rates
```
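The aggregate pass rate in `summary.json` amounts to a simple fold over per-task results. A hedged sketch follows; the `TaskResult` shape is an assumption, not the framework's real schema.

```typescript
// Minimal per-task result (illustrative, not the actual metadata.json schema).
interface TaskResult {
  queryId: string;
  passed: boolean;
}

// Fraction of tasks that passed; 0 for an empty run.
function passRate(results: TaskResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.passed).length;
  return passed / results.length;
}
```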
## Troubleshooting

- **BrowserOS not found:** the runner expects `/Applications/BrowserOS.app/Contents/MacOS/BrowserOS`. Make sure it's installed.
- **Port conflicts:** each worker uses `base_port + workerIndex`; 3 workers on base 9110 use ports 9110, 9111, and 9112. Stop other BrowserOS instances first.
- **API key not resolving:** if your config has `"apiKey": "OPENAI_API_KEY"`, ensure that env var is set in `.env.development`.
- **Tasks timing out:** increase `timeout_ms`. The default is 15 minutes; complex tasks may need 20+ minutes.
- **Headless vs headed:** set `"headless": false` to watch Chrome in real time. Useful for debugging.