BrowserOS Eval
Evaluation framework for benchmarking BrowserOS browser automation agents. Runs tasks from standard datasets (WebVoyager, Mind2Web), captures trajectories with screenshots, and grades results automatically.
Prerequisites
- BrowserOS binary installed at /Applications/BrowserOS.app (macOS)
- Bun runtime
- API keys for your chosen LLM provider and grader model
Quick Start
1. Set up environment
cd apps/eval
Edit .env.development and add your API keys:
# Pick ONE provider for the orchestrator (whichever you have access to)
OPENAI_API_KEY=sk-xxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxx
FIREWORKS_API_KEY=fw_xxxxx
GOOGLE_API_KEY=AIza-xxxxx
# For grading results (OpenRouter recommended — gives access to many models)
OPENROUTER_API_KEY=sk-or-v1-xxxxx
2. Launch the dashboard
bun run eval
Opens the Eval Dashboard at http://localhost:9900 in config mode.
3. Configure and run
From the dashboard:
- Load a preset — select from the dropdown or click Load File to import a config JSON
- Edit settings — change agent type, provider, model, API keys, dataset, workers, timeouts
- Save Config — export your configuration for reuse
- Click Run — starts the evaluation with live progress
Alternative: Run from CLI
bun run eval -c configs/orchestrator-executor-clado-test.json
The evaluation starts immediately. The dashboard remains available at http://localhost:9900 for live progress.
Agent Types
Orchestrator-Executor with Clado
The recommended architecture for visual model evals. Two tiers:
- Orchestrator — An LLM that plans and issues high-level instructions
- Executor — The Clado Action visual model that takes screenshots and predicts click/type/scroll coordinates
The orchestrator works with any LLM provider. Pick whichever you have access to:
OpenAI orchestrator
{
  "agent": {
    "type": "orchestrator-executor",
    "orchestrator": {
      "provider": "openai",
      "model": "gpt-4o",
      "apiKey": "OPENAI_API_KEY"
    },
    "executor": {
      "provider": "clado-action",
      "model": "qwen3-vl-30b-a3b-instruct",
      "apiKey": "",
      "baseUrl": "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run"
    }
  },
  "dataset": "../data/webvoyager_e2e_test.jsonl",
  "output_dir": "../results/oe-clado-openai",
  "num_workers": 3,
  "browseros": {
    "server_url": "http://127.0.0.1:9110",
    "base_cdp_port": 9010,
    "base_server_port": 9110,
    "base_extension_port": 9310,
    "headless": true
  },
  "grader_api_key_env": "OPENROUTER_API_KEY",
  "grader_base_url": "https://openrouter.ai/api/v1",
  "grader_model": "openai/gpt-4.1",
  "timeout_ms": 1200000
}
Anthropic orchestrator
"orchestrator": {
  "provider": "anthropic",
  "model": "claude-sonnet-4-20250514",
  "apiKey": "ANTHROPIC_API_KEY"
}
Google orchestrator
"orchestrator": {
  "provider": "google",
  "model": "gemini-2.0-flash",
  "apiKey": "GOOGLE_API_KEY"
}
Fireworks orchestrator (OpenAI-compatible)
"orchestrator": {
  "provider": "openai-compatible",
  "model": "accounts/fireworks/models/kimi-k2p5",
  "apiKey": "FIREWORKS_API_KEY",
  "baseUrl": "https://api.fireworks.ai/inference/v1"
}
The executor config stays the same across all orchestrator providers — it always uses the Clado action model.
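Because only the orchestrator block changes between providers, the agent section of a config can be assembled programmatically. The following is a minimal TypeScript sketch, not code from the eval package: the `buildAgentConfig` helper, the preset keys, and the type names are hypothetical, while the field values are copied from the examples above.

```typescript
// Hypothetical types; field names mirror the JSON examples above.
interface ModelConfig {
  provider: string;
  model: string;
  apiKey: string;
  baseUrl?: string;
}

interface AgentConfig {
  type: "orchestrator-executor";
  orchestrator: ModelConfig;
  executor: ModelConfig;
}

// The executor block is identical for every orchestrator provider.
const CLADO_EXECUTOR: ModelConfig = {
  provider: "clado-action",
  model: "qwen3-vl-30b-a3b-instruct",
  apiKey: "",
  baseUrl:
    "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run",
};

// Orchestrator blocks copied from the examples above; the keys are illustrative.
const ORCHESTRATORS: Record<string, ModelConfig> = {
  openai: { provider: "openai", model: "gpt-4o", apiKey: "OPENAI_API_KEY" },
  anthropic: {
    provider: "anthropic",
    model: "claude-sonnet-4-20250514",
    apiKey: "ANTHROPIC_API_KEY",
  },
  google: {
    provider: "google",
    model: "gemini-2.0-flash",
    apiKey: "GOOGLE_API_KEY",
  },
  fireworks: {
    provider: "openai-compatible",
    model: "accounts/fireworks/models/kimi-k2p5",
    apiKey: "FIREWORKS_API_KEY",
    baseUrl: "https://api.fireworks.ai/inference/v1",
  },
};

// Combine a named orchestrator preset with the fixed Clado executor.
function buildAgentConfig(preset: string): AgentConfig {
  const orchestrator = ORCHESTRATORS[preset];
  if (!orchestrator) {
    throw new Error(`Unknown orchestrator preset: ${preset}`);
  }
  return {
    type: "orchestrator-executor",
    orchestrator,
    executor: { ...CLADO_EXECUTOR },
  };
}

// Print the "agent" section for the Anthropic orchestrator.
console.log(JSON.stringify({ agent: buildAgentConfig("anthropic") }, null, 2));
```

The printed object can be pasted into a config file alongside the dataset, browseros, and grader fields shown in the OpenAI example.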
Other Agent Types
| Type | Description | Example config |
|---|---|---|
| single | Single LLM agent via Gemini CLI + MCP | webvoyager-test.json |
| tool-loop | AI SDK tool loop, connects via CDP | tool-loop-test.json |
| gemini-computer-use | Google native computer use API | gemini-computer-use.json |
| yutori-navigator | Yutori N1 visual model | yutori-navigator.json |
Configuration Reference
API keys
The apiKey field supports two formats:
- Env var name: "OPENAI_API_KEY" is resolved from .env.development at runtime
- Direct value: "sk-xxxxx" is used as-is (not recommended; prefer env vars)
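One way to picture the two formats is a small resolver. This is a sketch only: the README does not say how the eval package tells an env var name from a literal key, so the all-caps pattern check below is an assumption, and `resolveApiKey` is a hypothetical name.

```typescript
// Assumption: an all-caps identifier such as OPENAI_API_KEY is an env var name.
const ENV_NAME_PATTERN = /^[A-Z][A-Z0-9_]*$/;

// Hypothetical resolver; the eval package's actual lookup logic may differ.
function resolveApiKey(
  value: string,
  env: Record<string, string | undefined>,
): string {
  if (ENV_NAME_PATTERN.test(value)) {
    const resolved = env[value];
    if (resolved === undefined || resolved === "") {
      throw new Error(`Environment variable ${value} is not set`);
    }
    return resolved;
  }
  // Anything else (for example "sk-xxxxx" or an empty string) is used as-is.
  return value;
}

const exampleEnv = { OPENAI_API_KEY: "sk-from-env" };
console.log(resolveApiKey("OPENAI_API_KEY", exampleEnv)); // prints "sk-from-env"
console.log(resolveApiKey("sk-xxxxx", exampleEnv)); // prints "sk-xxxxx"
```

Failing loudly on an unset variable, rather than silently passing the name through as a key, makes a missing entry in .env.development easier to spot.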
Supported providers
| Provider | provider value | Requires baseUrl |
|---|---|---|
| OpenAI | openai | No |
| Anthropic | anthropic | No |
| Google | google | No |
| Azure OpenAI | azure | Yes |
| AWS Bedrock | bedrock | No (uses region, accessKeyId, secretAccessKey) |
| OpenRouter | openrouter | No |
| Fireworks, Together, etc. | openai-compatible | Yes |
| Ollama | ollama | No |
| Clado Action (executor only) | clado-action | Yes |
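The baseUrl requirement in the table can be checked before a run starts. The sketch below is illustrative and not part of the eval package: `validateProvider` and `ProviderBlock` are hypothetical names, and the provider sets are transcribed from the table above.

```typescript
// Provider values listed in the table above.
const KNOWN_PROVIDERS = new Set([
  "openai",
  "anthropic",
  "google",
  "azure",
  "bedrock",
  "openrouter",
  "openai-compatible",
  "ollama",
  "clado-action",
]);

// Providers marked "Yes" in the "Requires baseUrl" column.
const REQUIRES_BASE_URL = new Set(["azure", "openai-compatible", "clado-action"]);

interface ProviderBlock {
  provider: string;
  model: string;
  apiKey?: string;
  baseUrl?: string;
}

// Hypothetical validator: returns a list of problems, empty when none are found.
function validateProvider(block: ProviderBlock): string[] {
  const problems: string[] = [];
  if (!KNOWN_PROVIDERS.has(block.provider)) {
    problems.push(`unknown provider "${block.provider}"`);
  }
  if (REQUIRES_BASE_URL.has(block.provider) && !block.baseUrl) {
    problems.push(`provider "${block.provider}" requires baseUrl`);
  }
  return problems;
}

// A Fireworks block without baseUrl is reported as a problem.
console.log(
  validateProvider({
    provider: "openai-compatible",
    model: "accounts/fireworks/models/kimi-k2p5",
    apiKey: "FIREWORKS_API_KEY",
  }),
);
```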
BrowserOS infrastructure
"browseros": {
  "server_url": "http://127.0.0.1:9110",
  "base_cdp_port": 9010,
  "base_server_port": 9110,
  "base_extension_port": 9310,
  "load_extensions": false,
  "headless": true
}
Each worker gets its own Chrome instance. Worker N (zero-indexed) uses base port + N for each of the CDP, server, and extension ports, so worker 2 with the values above listens on 9012, 9112, and 9312.
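The port arithmetic can be sketched as follows. This is an illustration, not the eval package's code: `portsForWorker` is a hypothetical name, and the sketch assumes workers are zero-indexed, which matches the 9110, 9111, 9112 example in Troubleshooting.

```typescript
// Base ports as they appear in the "browseros" config block.
interface BasePorts {
  base_cdp_port: number;
  base_server_port: number;
  base_extension_port: number;
}

interface WorkerPorts {
  cdp: number;
  server: number;
  extension: number;
}

// Hypothetical helper: each port is its base value plus the worker index.
function portsForWorker(base: BasePorts, workerIndex: number): WorkerPorts {
  if (!Number.isInteger(workerIndex) || workerIndex < 0) {
    throw new RangeError(`Invalid worker index: ${workerIndex}`);
  }
  return {
    cdp: base.base_cdp_port + workerIndex,
    server: base.base_server_port + workerIndex,
    extension: base.base_extension_port + workerIndex,
  };
}

const basePorts: BasePorts = {
  base_cdp_port: 9010,
  base_server_port: 9110,
  base_extension_port: 9310,
};

// Three workers occupy three consecutive ports in each range.
for (let i = 0; i < 3; i++) {
  console.log(i, portsForWorker(basePorts, i));
}
```

With these base values the three port ranges stay 100 apart, so they would only collide with each other at 100 or more workers.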
Execution settings
| Field | Description | Default |
|---|---|---|
| num_workers | Parallel workers (each gets its own Chrome) | 1 |
| timeout_ms | Per-task timeout in ms | 900000 (15 min) |
| restart_server_per_task | Restart Chrome between tasks (cleaner state, slower) | false |
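To see how the defaults combine with a partial config, here is a small sketch. The `withExecutionDefaults` helper and `ExecutionSettings` type are hypothetical; only the field names and default values come from the table above.

```typescript
interface ExecutionSettings {
  num_workers: number;
  timeout_ms: number;
  restart_server_per_task: boolean;
}

// Default values as listed in the table above.
const EXECUTION_DEFAULTS: ExecutionSettings = {
  num_workers: 1,
  timeout_ms: 900_000, // 15 minutes
  restart_server_per_task: false,
};

// Hypothetical helper: fields present in the config override the defaults.
function withExecutionDefaults(
  partial: Partial<ExecutionSettings>,
): ExecutionSettings {
  return { ...EXECUTION_DEFAULTS, ...partial };
}

// Only num_workers is overridden; the other two fields keep their defaults.
const settings = withExecutionDefaults({ num_workers: 3 });
console.log(settings);
console.log(`timeout: ${settings.timeout_ms / 60_000} min`); // prints "timeout: 15 min"
```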
Grading
Results are auto-graded after each task. The grader uses an LLM judge.
| Field | Description |
|---|---|
| grader_model | Model for grading (e.g., openai/gpt-4.1) |
| grader_api_key_env | Env var name for grader API key |
| grader_base_url | API endpoint (e.g., https://openrouter.ai/api/v1) |
Datasets
| File | Tasks | Description |
|---|---|---|
| webvoyager_e2e_test.jsonl | 10 | WebVoyager test subset (quick smoke test) |
| webvoyager.jsonl | 643 | Full WebVoyager benchmark |
| mind2web_e2e_test.jsonl | 10 | Mind2Web test subset |
| mind2web.jsonl | 300 | Full Mind2Web benchmark |
Task format (JSONL, one JSON object per line; shown here across multiple lines for readability):
{
  "query_id": "Amazon--0",
  "dataset": "webvoyager",
  "query": "Search an Xbox Wireless controller with green color and rated above 4 stars.",
  "graders": ["webvoyager_grader", "fara_combined"],
  "start_url": "https://www.amazon.com/",
  "metadata": { "original_task_id": "Amazon--0", "website": "Amazon" }
}
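A dataset file in this format can be read with a few lines of TypeScript. The sketch below is an illustration, not the eval package's loader: `parseTasks` and `EvalTask` are hypothetical names, and the interface covers only the fields visible in the example above.

```typescript
// Fields taken from the example task above; real files may carry more.
interface EvalTask {
  query_id: string;
  dataset: string;
  query: string;
  graders: string[];
  start_url: string;
  metadata: Record<string, unknown>;
}

// Parse JSONL text: one JSON object per non-empty line.
function parseTasks(jsonl: string): EvalTask[] {
  return jsonl
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line, i) => {
      const task = JSON.parse(line) as EvalTask;
      // Minimal sanity check on the two fields every task needs.
      if (typeof task.query_id !== "string" || typeof task.query !== "string") {
        throw new Error(`Task ${i + 1} is missing query_id or query`);
      }
      return task;
    });
}

// The example task from above, written on a single line as it would be in the file.
const sampleLine =
  '{"query_id":"Amazon--0","dataset":"webvoyager","query":"Search an Xbox Wireless controller with green color and rated above 4 stars.","graders":["webvoyager_grader","fara_combined"],"start_url":"https://www.amazon.com/","metadata":{"original_task_id":"Amazon--0","website":"Amazon"}}';

const tasks = parseTasks(sampleLine + "\n");
console.log(tasks.length, tasks[0].query_id); // prints "1 Amazon--0"
```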
Output
Results are saved to output_dir:
results/
  oe-clado-openai/
    Amazon--0/
      metadata.json      # Task result, timing, grader scores
      messages.jsonl     # Full message log
      screenshots/
        001.png          # Step-by-step screenshots
        002.png
    summary.json         # Aggregate pass rates
Troubleshooting
BrowserOS not found: Expects /Applications/BrowserOS.app/Contents/MacOS/BrowserOS. Make sure it's installed.
Port conflicts: Each worker uses base_port + workerIndex. 3 workers on base 9110 → ports 9110, 9111, 9112. Stop other BrowserOS instances first.
API key not resolving: If your config has "apiKey": "OPENAI_API_KEY", ensure the env var is set in .env.development.
Tasks timing out: Increase timeout_ms. Default is 15 minutes; complex tasks may need 20+ minutes.
Headless vs headed: Set "headless": false to watch Chrome in real-time. Useful for debugging.