BrowserOS Eval
Evaluation framework for benchmarking BrowserOS browser automation agents. Runs tasks from standard datasets (WebVoyager, Mind2Web), captures trajectories with screenshots, and grades results automatically.
Prerequisites
- BrowserOS binary installed at /Applications/BrowserOS.app (macOS)
- Bun runtime
- API keys for your chosen LLM provider and grader model
Quick Start
1. Set up environment
cd apps/eval
Edit .env.development and add your API keys:
# Pick ONE provider for the orchestrator (whichever you have access to)
OPENAI_API_KEY=sk-xxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxx
FIREWORKS_API_KEY=fw_xxxxx
GOOGLE_API_KEY=AIza-xxxxx
# For grading results (OpenRouter recommended — gives access to many models)
OPENROUTER_API_KEY=sk-or-v1-xxxxx
2. Launch the dashboard
bun run eval
Opens the Eval Dashboard at http://localhost:9900 in config mode.
3. Configure and run
From the dashboard:
- Load a preset — select from the dropdown or click Load File to import a config JSON
- Edit settings — change agent type, provider, model, API keys, dataset, workers, timeouts
- Save Config — export your configuration for reuse
- Click Run — starts the evaluation with live progress
Alternative: Run from CLI
bun run eval -c configs/orchestrator-executor-clado-test.json
Runs immediately. Dashboard still available at http://localhost:9900 for live progress.
Agent Types
Orchestrator-Executor with Clado
The recommended architecture for visual model evals. Two tiers:
- Orchestrator — An LLM that plans and issues high-level instructions
- Executor — The Clado Action visual model that takes screenshots and predicts click/type/scroll coordinates
The orchestrator works with any LLM provider. Pick whichever you have access to:
OpenAI orchestrator
{
"agent": {
"type": "orchestrator-executor",
"orchestrator": {
"provider": "openai",
"model": "gpt-4o",
"apiKey": "OPENAI_API_KEY"
},
"executor": {
"provider": "clado-action",
"model": "qwen3-vl-30b-a3b-instruct",
"apiKey": "",
"baseUrl": "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run"
}
},
"dataset": "../data/webvoyager_e2e_test.jsonl",
"output_dir": "../results/oe-clado-openai",
"num_workers": 3,
"browseros": {
"server_url": "http://127.0.0.1:9110",
"base_cdp_port": 9010,
"base_server_port": 9110,
"base_extension_port": 9310,
"headless": true
},
"grader_api_key_env": "OPENROUTER_API_KEY",
"grader_base_url": "https://openrouter.ai/api/v1",
"grader_model": "openai/gpt-4.1",
"timeout_ms": 1200000
}
Anthropic orchestrator
"orchestrator": {
"provider": "anthropic",
"model": "claude-sonnet-4-20250514",
"apiKey": "ANTHROPIC_API_KEY"
}
Google orchestrator
"orchestrator": {
"provider": "google",
"model": "gemini-2.0-flash",
"apiKey": "GOOGLE_API_KEY"
}
Fireworks orchestrator (OpenAI-compatible)
"orchestrator": {
"provider": "openai-compatible",
"model": "accounts/fireworks/models/kimi-k2p5",
"apiKey": "FIREWORKS_API_KEY",
"baseUrl": "https://api.fireworks.ai/inference/v1"
}
The executor config stays the same across all orchestrator providers — it always uses the Clado action model.
Other Agent Types
| Type | Description | Example config |
|---|---|---|
| single | Single LLM agent via Gemini CLI + MCP | webvoyager-test.json |
| tool-loop | AI SDK tool loop, connects via CDP | tool-loop-test.json |
| gemini-computer-use | Google native computer use API | gemini-computer-use.json |
| yutori-navigator | Yutori N1 visual model | yutori-navigator.json |
Configuration Reference
API keys
The apiKey field supports two formats:
- Env var name: "OPENAI_API_KEY" — resolved from .env.development at runtime
- Direct value: "sk-xxxxx" — used as-is (not recommended; prefer env vars)
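The resolution rule above can be sketched as follows. This is a hypothetical helper, not the framework's actual API, and the UPPER_SNAKE_CASE heuristic is an assumption about how env var names are distinguished from literal keys:

```typescript
// Sketch: values that look like env var names (UPPER_SNAKE_CASE) are read
// from the environment; anything else is treated as a literal API key.
function resolveApiKey(apiKey: string): string {
  const looksLikeEnvVar = /^[A-Z][A-Z0-9_]*$/.test(apiKey);
  if (looksLikeEnvVar) {
    const value = process.env[apiKey];
    if (!value) throw new Error(`Env var ${apiKey} is not set in .env.development`);
    return value;
  }
  return apiKey; // direct value, used as-is
}
```

Note that real provider keys ("sk-xxxxx", "fw_xxxxx") contain lowercase letters or dashes, so they fall through to the direct-value branch.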
Supported providers
| Provider | provider value | Requires baseUrl |
|---|---|---|
| OpenAI | openai | No |
| Anthropic | anthropic | No |
| Google | google | No |
| Azure OpenAI | azure | Yes |
| AWS Bedrock | bedrock | No (uses region, accessKeyId, secretAccessKey) |
| OpenRouter | openrouter | No |
| Fireworks, Together, etc. | openai-compatible | Yes |
| Ollama | ollama | No |
| Clado Action (executor only) | clado-action | Yes |
BrowserOS infrastructure
"browseros": {
"server_url": "http://127.0.0.1:9110",
"base_cdp_port": 9010,
"base_server_port": 9110,
"base_extension_port": 9310,
"load_extensions": false,
"headless": true
}
Each worker gets its own Chrome instance; worker N uses the corresponding base port + N for its CDP and server ports. base_extension_port is reserved as a legacy BrowserOS launch argument, kept for compatibility with Chromium builds that still pass it.
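The port arithmetic can be illustrated with a small sketch (portsForWorker is a hypothetical helper; the defaults mirror the config above):

```typescript
// Port assignment per worker: worker N gets base + N for each channel.
interface WorkerPorts {
  cdp: number;
  server: number;
  extension: number;
}

function portsForWorker(
  n: number,
  base = { cdp: 9010, server: 9110, extension: 9310 }
): WorkerPorts {
  return {
    cdp: base.cdp + n,
    server: base.server + n,
    extension: base.extension + n,
  };
}
```

So with num_workers: 3, workers 0..2 bind server ports 9110, 9111, and 9112.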
Execution settings
| Field | Description | Default |
|---|---|---|
| num_workers | Parallel workers (each gets its own Chrome) | 1 |
| timeout_ms | Per-task timeout in ms | 900000 (15 min) |
| restart_server_per_task | Restart Chrome between tasks (cleaner state, slower) | false |
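One common way to enforce a per-task timeout like timeout_ms is a Promise.race wrapper; this sketch is illustrative only, not the framework's documented mechanism:

```typescript
// Sketch: reject a task promise if it has not settled within timeoutMs.
function withTimeout<T>(task: Promise<T>, timeoutMs: number): Promise<T> {
  let timer!: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`task timed out after ${timeoutMs} ms`)),
      timeoutMs
    );
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([task, timeout]).finally(() => clearTimeout(timer));
}
```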
Grading
Results are auto-graded after each task. The grader uses an LLM judge.
| Field | Description |
|---|---|
| grader_model | Model for grading (e.g., openai/gpt-4.1) |
| grader_api_key_env | Env var name for grader API key |
| grader_base_url | API endpoint (e.g., https://openrouter.ai/api/v1) |
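An LLM judge of this kind typically POSTs a chat-completion request to the configured endpoint. The sketch below only builds such a request body; the prompt wording and helper name are illustrative assumptions, not the framework's actual grader:

```typescript
// Sketch of a request body for an OpenAI-compatible /chat/completions
// endpoint (e.g., OpenRouter), asking a model to judge a task outcome.
interface GraderRequest {
  model: string;
  messages: { role: "system" | "user"; content: string }[];
}

function buildGraderRequest(
  model: string,
  query: string,
  finalAnswer: string
): GraderRequest {
  return {
    model,
    messages: [
      { role: "system", content: "You are a strict evaluator. Reply PASS or FAIL." },
      { role: "user", content: `Task: ${query}\nAgent's final answer: ${finalAnswer}` },
    ],
  };
}
```

The body would be sent to `${grader_base_url}/chat/completions` with the key resolved from grader_api_key_env in the Authorization header.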
Datasets
| File | Tasks | Description |
|---|---|---|
| webvoyager_e2e_test.jsonl | 10 | WebVoyager test subset (quick smoke test) |
| webvoyager.jsonl | 643 | Full WebVoyager benchmark |
| mind2web_e2e_test.jsonl | 10 | Mind2Web test subset |
| mind2web.jsonl | 300 | Full Mind2Web benchmark |
Task format (JSONL, one per line):
{
"query_id": "Amazon--0",
"dataset": "webvoyager",
"query": "Search an Xbox Wireless controller with green color and rated above 4 stars.",
"graders": ["webvoyager_grader", "fara_combined"],
"start_url": "https://www.amazon.com/",
"metadata": { "original_task_id": "Amazon--0", "website": "Amazon" }
}
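A dataset in this format can be loaded with a few lines; parseTasks and the EvalTask interface are hypothetical names, typed from the fields shown above:

```typescript
// Minimal JSONL loader: one JSON task object per non-blank line.
interface EvalTask {
  query_id: string;
  dataset: string;
  query: string;
  graders: string[];
  start_url: string;
  metadata?: Record<string, unknown>;
}

function parseTasks(jsonl: string): EvalTask[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0) // skip blank lines
    .map((line) => JSON.parse(line) as EvalTask);
}
```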
Output
Results are saved to output_dir:
results/
oe-clado-openai/
Amazon--0/
metadata.json # Task result, timing, grader scores
messages.jsonl # Full message log
screenshots/
001.png # Step-by-step screenshots
002.png
summary.json # Aggregate pass rates
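The aggregate pass rate in summary.json could be derived from the per-task results along these lines; the field names here are assumptions about the shape of metadata.json, not documented guarantees:

```typescript
// Sketch: fold per-task pass/fail results into an aggregate summary.
interface TaskResult {
  query_id: string;
  passed: boolean;
}

function summarize(results: TaskResult[]): {
  total: number;
  passed: number;
  pass_rate: number;
} {
  const passed = results.filter((r) => r.passed).length;
  return {
    total: results.length,
    passed,
    pass_rate: results.length ? passed / results.length : 0,
  };
}
```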
Troubleshooting
BrowserOS not found: Expects /Applications/BrowserOS.app/Contents/MacOS/BrowserOS. Make sure it's installed.
Port conflicts: Each worker uses base_port + workerIndex. 3 workers on base 9110 → ports 9110, 9111, 9112. Stop other BrowserOS instances first.
API key not resolving: If your config has "apiKey": "OPENAI_API_KEY", ensure the env var is set in .env.development.
Tasks timing out: Increase timeout_ms. Default is 15 minutes; complex tasks may need 20+ minutes.
Headless vs headed: Set "headless": false to watch Chrome in real-time. Useful for debugging.