BrowserOS Eval
Evaluation framework for benchmarking BrowserOS browser automation agents. Runs tasks from standard datasets (WebVoyager, Mind2Web), captures trajectories with screenshots, and grades results automatically.
Prerequisites
- BrowserOS binary installed at /Applications/BrowserOS.app (macOS)
- Bun runtime
- API keys for your chosen LLM provider and grader model
Quick Start
1. Set up environment
cd apps/eval
Edit .env.development and add your API keys:
# Pick ONE provider for the orchestrator (whichever you have access to)
OPENAI_API_KEY=sk-xxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxx
FIREWORKS_API_KEY=fw_xxxxx
GOOGLE_API_KEY=AIza-xxxxx
# For grading results (OpenRouter recommended — gives access to many models)
OPENROUTER_API_KEY=sk-or-v1-xxxxx
2. Launch the dashboard
bun run eval
Opens the Eval Dashboard at http://localhost:9900 in config mode.
3. Configure and run
From the dashboard:
- Load a preset — select from the dropdown or click Load File to import a config JSON
- Edit settings — change agent type, provider, model, API keys, dataset, workers, timeouts
- Save Config — export your configuration for reuse
- Click Run — starts the evaluation with live progress
Alternative: Run from CLI
bun run eval -c configs/orchestrator-executor-clado-test.json
Runs immediately. Dashboard still available at http://localhost:9900 for live progress.
Agent Types
Orchestrator-Executor with Clado
The recommended architecture for visual model evals. Two tiers:
- Orchestrator — An LLM that plans and issues high-level instructions
- Executor — The Clado Action visual model that takes screenshots and predicts click/type/scroll coordinates
The orchestrator works with any LLM provider. Pick whichever you have access to:
OpenAI orchestrator
{
"agent": {
"type": "orchestrator-executor",
"orchestrator": {
"provider": "openai",
"model": "gpt-4o",
"apiKey": "OPENAI_API_KEY"
},
"executor": {
"provider": "clado-action",
"model": "qwen3-vl-30b-a3b-instruct",
"apiKey": "",
"baseUrl": "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run"
}
},
"dataset": "../data/webvoyager_e2e_test.jsonl",
"output_dir": "../results/oe-clado-openai",
"num_workers": 3,
"browseros": {
"server_url": "http://127.0.0.1:9110",
"base_cdp_port": 9010,
"base_server_port": 9110,
"base_extension_port": 9310,
"headless": true
},
"grader_api_key_env": "OPENROUTER_API_KEY",
"grader_base_url": "https://openrouter.ai/api/v1",
"grader_model": "openai/gpt-4.1",
"timeout_ms": 1200000
}
Anthropic orchestrator
"orchestrator": {
"provider": "anthropic",
"model": "claude-sonnet-4-20250514",
"apiKey": "ANTHROPIC_API_KEY"
}
Google orchestrator
"orchestrator": {
"provider": "google",
"model": "gemini-2.0-flash",
"apiKey": "GOOGLE_API_KEY"
}
Fireworks orchestrator (OpenAI-compatible)
"orchestrator": {
"provider": "openai-compatible",
"model": "accounts/fireworks/models/kimi-k2p5",
"apiKey": "FIREWORKS_API_KEY",
"baseUrl": "https://api.fireworks.ai/inference/v1"
}
The executor config stays the same across all orchestrator providers — it always uses the Clado action model.
Other Agent Types
| Type | Description | Example config |
|---|---|---|
| single | Single LLM agent via Gemini CLI + MCP | webvoyager-test.json |
| tool-loop | AI SDK tool loop, connects via CDP | tool-loop-test.json |
| gemini-computer-use | Google native computer use API | gemini-computer-use.json |
| yutori-navigator | Yutori N1 visual model | yutori-navigator.json |
Configuration Reference
API keys
The apiKey field supports two formats:
- Env var name: "OPENAI_API_KEY" — resolved from .env.development at runtime
- Direct value: "sk-xxxxx" — used as-is (not recommended; prefer env vars)
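The resolution rule above can be sketched as follows. This is a hypothetical helper, not the framework's actual API, and the UPPER_SNAKE_CASE heuristic is an assumption about how env var names are distinguished from literal keys:

```typescript
// Sketch: values that look like env var names (UPPER_SNAKE_CASE) are read
// from the environment; anything else is treated as a literal API key.
function resolveApiKey(apiKey: string): string {
  const looksLikeEnvVar = /^[A-Z][A-Z0-9_]*$/.test(apiKey);
  if (looksLikeEnvVar) {
    const value = process.env[apiKey];
    if (!value) throw new Error(`Env var ${apiKey} is not set in .env.development`);
    return value;
  }
  return apiKey; // direct value, used as-is
}
```

Note that real provider keys ("sk-xxxxx", "fw_xxxxx") contain lowercase letters or dashes, so they fall through to the direct-value branch.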
Supported providers
| Provider | provider value | Requires baseUrl |
|---|---|---|
| OpenAI | openai | No |
| Anthropic | anthropic | No |
| Google | google | No |
| Azure OpenAI | azure | Yes |
| AWS Bedrock | bedrock | No (uses region, accessKeyId, secretAccessKey) |
| OpenRouter | openrouter | No |
| Fireworks, Together, etc. | openai-compatible | Yes |
| Ollama | ollama | No |
| Clado Action (executor only) | clado-action | Yes |
BrowserOS infrastructure
"browseros": {
"server_url": "http://127.0.0.1:9110",
"base_cdp_port": 9010,
"base_server_port": 9110,
"base_extension_port": 9310,
"load_extensions": false,
"headless": true
}
Each worker gets its own Chrome instance; worker N uses the corresponding base port + N for its CDP and server ports. base_extension_port is reserved as a legacy BrowserOS launch argument, kept for compatibility with Chromium builds that still pass it.
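The port arithmetic can be illustrated with a small sketch (portsForWorker is a hypothetical helper; the defaults mirror the config above):

```typescript
// Port assignment per worker: worker N gets base + N for each channel.
interface WorkerPorts {
  cdp: number;
  server: number;
  extension: number;
}

function portsForWorker(
  n: number,
  base = { cdp: 9010, server: 9110, extension: 9310 }
): WorkerPorts {
  return {
    cdp: base.cdp + n,
    server: base.server + n,
    extension: base.extension + n,
  };
}
```

So with num_workers: 3, workers 0..2 bind server ports 9110, 9111, and 9112.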
Execution settings
| Field | Description | Default |
|---|---|---|
| num_workers | Parallel workers (each gets its own Chrome) | 1 |
| timeout_ms | Per-task timeout in ms | 900000 (15 min) |
| restart_server_per_task | Restart Chrome between tasks (cleaner state, slower) | false |
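One common way to enforce a per-task timeout like timeout_ms is a Promise.race wrapper; this sketch is illustrative only, not the framework's documented mechanism:

```typescript
// Sketch: reject a task promise if it has not settled within timeoutMs.
function withTimeout<T>(task: Promise<T>, timeoutMs: number): Promise<T> {
  let timer!: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`task timed out after ${timeoutMs} ms`)),
      timeoutMs
    );
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([task, timeout]).finally(() => clearTimeout(timer));
}
```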
Grading
Results are auto-graded after each task. The grader uses an LLM judge.
| Field | Description |
|---|---|
| grader_model | Model for grading (e.g., openai/gpt-4.1) |
| grader_api_key_env | Env var name for grader API key |
| grader_base_url | API endpoint (e.g., https://openrouter.ai/api/v1) |
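An LLM judge of this kind typically POSTs a chat-completion request to the configured endpoint. The sketch below only builds such a request body; the prompt wording and helper name are illustrative assumptions, not the framework's actual grader:

```typescript
// Sketch of a request body for an OpenAI-compatible /chat/completions
// endpoint (e.g., OpenRouter), asking a model to judge a task outcome.
interface GraderRequest {
  model: string;
  messages: { role: "system" | "user"; content: string }[];
}

function buildGraderRequest(
  model: string,
  query: string,
  finalAnswer: string
): GraderRequest {
  return {
    model,
    messages: [
      { role: "system", content: "You are a strict evaluator. Reply PASS or FAIL." },
      { role: "user", content: `Task: ${query}\nAgent's final answer: ${finalAnswer}` },
    ],
  };
}
```

The body would be sent to `${grader_base_url}/chat/completions` with the key resolved from grader_api_key_env in the Authorization header.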
Datasets
| File | Tasks | Description |
|---|---|---|
| webvoyager_e2e_test.jsonl | 10 | WebVoyager test subset (quick smoke test) |
| webvoyager.jsonl | 643 | Full WebVoyager benchmark |
| mind2web_e2e_test.jsonl | 10 | Mind2Web test subset |
| mind2web.jsonl | 300 | Full Mind2Web benchmark |
Task format (JSONL, one per line):
{
"query_id": "Amazon--0",
"dataset": "webvoyager",
"query": "Search an Xbox Wireless controller with green color and rated above 4 stars.",
"graders": ["webvoyager_grader", "fara_combined"],
"start_url": "https://www.amazon.com/",
"metadata": { "original_task_id": "Amazon--0", "website": "Amazon" }
}
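A dataset in this format can be loaded with a few lines; parseTasks and the EvalTask interface are hypothetical names, typed from the fields shown above:

```typescript
// Minimal JSONL loader: one JSON task object per non-blank line.
interface EvalTask {
  query_id: string;
  dataset: string;
  query: string;
  graders: string[];
  start_url: string;
  metadata?: Record<string, unknown>;
}

function parseTasks(jsonl: string): EvalTask[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0) // skip blank lines
    .map((line) => JSON.parse(line) as EvalTask);
}
```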
Output
Results are saved to output_dir:
results/
oe-clado-openai/
Amazon--0/
metadata.json # Task result, timing, grader scores
messages.jsonl # Full message log
screenshots/
001.png # Step-by-step screenshots
002.png
summary.json # Aggregate pass rates
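The aggregate pass rate in summary.json could be derived from the per-task results along these lines; the field names here are assumptions about the shape of metadata.json, not documented guarantees:

```typescript
// Sketch: fold per-task pass/fail results into an aggregate summary.
interface TaskResult {
  query_id: string;
  passed: boolean;
}

function summarize(results: TaskResult[]): {
  total: number;
  passed: number;
  pass_rate: number;
} {
  const passed = results.filter((r) => r.passed).length;
  return {
    total: results.length,
    passed,
    pass_rate: results.length ? passed / results.length : 0,
  };
}
```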
Troubleshooting
BrowserOS not found: Expects /Applications/BrowserOS.app/Contents/MacOS/BrowserOS. Make sure it's installed.
Port conflicts: Each worker uses base_port + workerIndex. 3 workers on base 9110 → ports 9110, 9111, 9112. Stop other BrowserOS instances first.
API key not resolving: If your config has "apiKey": "OPENAI_API_KEY", ensure the env var is set in .env.development.
Tasks timing out: Increase timeout_ms. Default is 15 minutes; complex tasks may need 20+ minutes.
Headless vs headed: Set "headless": false to watch Chrome in real-time. Useful for debugging.