BrowserOS Eval

Evaluation framework for benchmarking BrowserOS browser automation agents. Runs tasks from standard datasets (WebVoyager, Mind2Web), captures trajectories with screenshots, and grades results automatically.

Prerequisites

  • BrowserOS binary installed at /Applications/BrowserOS.app (macOS)
  • Bun runtime
  • API keys for your chosen LLM provider and grader model

Quick Start

1. Set up environment

cd apps/eval

Edit .env.development and add your API keys:

# Pick ONE provider for the orchestrator (whichever you have access to)
OPENAI_API_KEY=sk-xxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxx
FIREWORKS_API_KEY=fw_xxxxx
GOOGLE_API_KEY=AIza-xxxxx

# For grading results (OpenRouter recommended — gives access to many models)
OPENROUTER_API_KEY=sk-or-v1-xxxxx

2. Launch the dashboard

bun run eval

Opens the Eval Dashboard at http://localhost:9900 in config mode.

3. Configure and run

From the dashboard:

  1. Load a preset — select from the dropdown or click Load File to import a config JSON
  2. Edit settings — change agent type, provider, model, API keys, dataset, workers, timeouts
  3. Save Config — export your configuration for reuse
  4. Click Run — starts the evaluation with live progress

Alternative: Run from CLI

bun run eval -c configs/orchestrator-executor-clado-test.json

Runs the evaluation immediately. The dashboard is still available at http://localhost:9900 for live progress.

Agent Types

Orchestrator-Executor with Clado

The recommended architecture for visual model evals. Two tiers:

  • Orchestrator — An LLM that plans and issues high-level instructions
  • Executor — The Clado Action visual model that takes screenshots and predicts click/type/scroll coordinates

The orchestrator works with any LLM provider. Pick whichever you have access to:

OpenAI orchestrator

{
  "agent": {
    "type": "orchestrator-executor",
    "orchestrator": {
      "provider": "openai",
      "model": "gpt-4o",
      "apiKey": "OPENAI_API_KEY"
    },
    "executor": {
      "provider": "clado-action",
      "model": "qwen3-vl-30b-a3b-instruct",
      "apiKey": "",
      "baseUrl": "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run"
    }
  },
  "dataset": "../data/webvoyager_e2e_test.jsonl",
  "output_dir": "../results/oe-clado-openai",
  "num_workers": 3,
  "browseros": {
    "server_url": "http://127.0.0.1:9110",
    "base_cdp_port": 9010,
    "base_server_port": 9110,
    "base_extension_port": 9310,
    "headless": true
  },
  "grader_api_key_env": "OPENROUTER_API_KEY",
  "grader_base_url": "https://openrouter.ai/api/v1",
  "grader_model": "openai/gpt-4.1",
  "timeout_ms": 1200000
}

Anthropic orchestrator

"orchestrator": {
  "provider": "anthropic",
  "model": "claude-sonnet-4-20250514",
  "apiKey": "ANTHROPIC_API_KEY"
}

Google orchestrator

"orchestrator": {
  "provider": "google",
  "model": "gemini-2.0-flash",
  "apiKey": "GOOGLE_API_KEY"
}

Fireworks orchestrator (OpenAI-compatible)

"orchestrator": {
  "provider": "openai-compatible",
  "model": "accounts/fireworks/models/kimi-k2p5",
  "apiKey": "FIREWORKS_API_KEY",
  "baseUrl": "https://api.fireworks.ai/inference/v1"
}

The executor config stays the same across all orchestrator providers — it always uses the Clado action model.
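As a rough illustration of that point, the TypeScript sketch below swaps the orchestrator block while leaving the executor block alone. The `ModelConfig` and `AgentConfig` types and the `withOrchestrator` helper are hypothetical names made up for this example; they only mirror the JSON fields shown above and are not taken from the eval's source.

```typescript
// Hypothetical types mirroring the config fields shown in this README.
interface ModelConfig {
  provider: string;
  model: string;
  apiKey: string;
  baseUrl?: string;
}

interface AgentConfig {
  type: "orchestrator-executor";
  orchestrator: ModelConfig;
  executor: ModelConfig;
}

// Return a copy of the agent config with a different orchestrator.
// The executor is copied unchanged.
function withOrchestrator(agent: AgentConfig, orchestrator: ModelConfig): AgentConfig {
  return {
    ...agent,
    orchestrator: { ...orchestrator },
    executor: { ...agent.executor },
  };
}

const openaiAgent: AgentConfig = {
  type: "orchestrator-executor",
  orchestrator: { provider: "openai", model: "gpt-4o", apiKey: "OPENAI_API_KEY" },
  executor: {
    provider: "clado-action",
    model: "qwen3-vl-30b-a3b-instruct",
    apiKey: "",
    baseUrl: "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run",
  },
};

// Same executor, Anthropic orchestrator.
const anthropicAgent = withOrchestrator(openaiAgent, {
  provider: "anthropic",
  model: "claude-sonnet-4-20250514",
  apiKey: "ANTHROPIC_API_KEY",
});
```

In practice you would make the same change by editing the "orchestrator" object in the config JSON or in the dashboard.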

Other Agent Types

| Type | Description | Example config |
| --- | --- | --- |
| single | Single LLM agent via Gemini CLI + MCP | webvoyager-test.json |
| tool-loop | AI SDK tool loop, connects via CDP | tool-loop-test.json |
| gemini-computer-use | Google native computer use API | gemini-computer-use.json |
| yutori-navigator | Yutori N1 visual model | yutori-navigator.json |

Configuration Reference

API keys

The apiKey field supports two formats:

  • Env var name: "OPENAI_API_KEY" — resolved from .env.development at runtime
  • Direct value: "sk-xxxxx" — used as-is (not recommended, prefer env vars)
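One way to tell the two formats apart is sketched below in TypeScript. This README does not show how the eval actually decides, so the rule here (a value made only of uppercase letters, digits, and underscores is treated as an env var name) and the `resolveApiKey` helper are assumptions for illustration.

```typescript
// Assumption: env var names look like OPENAI_API_KEY (uppercase, digits, underscores).
const ENV_NAME_PATTERN = /^[A-Z][A-Z0-9_]*$/;

// Resolve an apiKey config value against an environment map.
// `env` is passed in explicitly so the function is easy to test;
// a real caller might pass the process environment.
function resolveApiKey(
  value: string,
  env: Record<string, string | undefined>,
): string {
  if (ENV_NAME_PATTERN.test(value)) {
    const resolved = env[value];
    if (resolved === undefined || resolved === "") {
      throw new Error(`Environment variable ${value} is not set`);
    }
    return resolved;
  }
  // Anything else (including an empty string) is used as-is.
  return value;
}
```

For example, `resolveApiKey("OPENAI_API_KEY", { OPENAI_API_KEY: "sk-test" })` returns `"sk-test"`, while `resolveApiKey("sk-xxxxx", {})` returns the literal value.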

Supported providers

| Provider | provider value | Requires baseUrl |
| --- | --- | --- |
| OpenAI | openai | No |
| Anthropic | anthropic | No |
| Google | google | No |
| Azure OpenAI | azure | Yes |
| AWS Bedrock | bedrock | No (uses region, accessKeyId, secretAccessKey) |
| OpenRouter | openrouter | No |
| Fireworks, Together, etc. | openai-compatible | Yes |
| Ollama | ollama | No |
| Clado Action (executor only) | clado-action | Yes |
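The provider rules above can be turned into a small pre-flight check. The TypeScript sketch below is not part of the eval; `checkProvider` is a hypothetical helper that encodes only what the list of providers states (which values exist and which ones need a baseUrl).

```typescript
// Provider values listed in this README.
const KNOWN_PROVIDERS = new Set<string>([
  "openai", "anthropic", "google", "azure", "bedrock",
  "openrouter", "openai-compatible", "ollama", "clado-action",
]);

// Providers marked as requiring baseUrl.
const REQUIRES_BASE_URL = new Set<string>(["azure", "openai-compatible", "clado-action"]);

// Return a list of human-readable problems; an empty list means the block looks fine.
function checkProvider(cfg: { provider: string; baseUrl?: string }): string[] {
  const problems: string[] = [];
  if (!KNOWN_PROVIDERS.has(cfg.provider)) {
    problems.push(`unknown provider "${cfg.provider}"`);
  }
  if (REQUIRES_BASE_URL.has(cfg.provider) && !cfg.baseUrl) {
    problems.push(`provider "${cfg.provider}" requires baseUrl`);
  }
  return problems;
}
```

Running this over the orchestrator and executor blocks before a long evaluation can catch a missing baseUrl early.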

BrowserOS infrastructure

"browseros": {
  "server_url": "http://127.0.0.1:9110",
  "base_cdp_port": 9010,
  "base_server_port": 9110,
  "base_extension_port": 9310,
  "load_extensions": false,
  "headless": true
}

Each worker gets its own Chrome instance. Worker N (counting from 0) uses the base port + N for each of the CDP, server, and extension ports, so with the bases above worker 2 uses 9012, 9112, and 9312.
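The port arithmetic can be written out as a short TypeScript sketch. The `portsForWorker` helper and its return shape are hypothetical; the field names come from the "browseros" block above, and worker indices are assumed to start at 0, as in the Troubleshooting example.

```typescript
interface BrowserOSPortConfig {
  base_cdp_port: number;
  base_server_port: number;
  base_extension_port: number;
}

interface WorkerPorts {
  cdp: number;
  server: number;
  extension: number;
}

// Each port is its base plus the zero-based worker index.
function portsForWorker(cfg: BrowserOSPortConfig, workerIndex: number): WorkerPorts {
  if (!Number.isInteger(workerIndex) || workerIndex < 0) {
    throw new Error(`workerIndex must be a non-negative integer, got ${workerIndex}`);
  }
  return {
    cdp: cfg.base_cdp_port + workerIndex,
    server: cfg.base_server_port + workerIndex,
    extension: cfg.base_extension_port + workerIndex,
  };
}

const bases: BrowserOSPortConfig = {
  base_cdp_port: 9010,
  base_server_port: 9110,
  base_extension_port: 9310,
};

// Ports for three workers (indices 0, 1, 2).
const allPorts = [0, 1, 2].map((i) => portsForWorker(bases, i));
```

When choosing base ports, leave a gap of at least num_workers between them so the three ranges do not overlap.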

Execution settings

| Field | Description | Default |
| --- | --- | --- |
| num_workers | Parallel workers (each gets its own Chrome) | 1 |
| timeout_ms | Per-task timeout in ms | 900000 (15 min) |
| restart_server_per_task | Restart Chrome between tasks (cleaner state, slower) | false |
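To make the defaults concrete, the TypeScript sketch below fills in any execution setting a config omits. `withExecutionDefaults` is a hypothetical helper, not the eval's own loader; the default values are the ones documented above.

```typescript
interface ExecutionSettings {
  num_workers: number;
  timeout_ms: number;
  restart_server_per_task: boolean;
}

// Defaults as documented in this README.
const EXECUTION_DEFAULTS: ExecutionSettings = {
  num_workers: 1,
  timeout_ms: 900_000, // 15 minutes
  restart_server_per_task: false,
};

// Fill in missing (or undefined) fields with the documented defaults.
function withExecutionDefaults(partial: Partial<ExecutionSettings>): ExecutionSettings {
  return {
    num_workers: partial.num_workers ?? EXECUTION_DEFAULTS.num_workers,
    timeout_ms: partial.timeout_ms ?? EXECUTION_DEFAULTS.timeout_ms,
    restart_server_per_task:
      partial.restart_server_per_task ?? EXECUTION_DEFAULTS.restart_server_per_task,
  };
}

// A config that only raises the worker count and the timeout (20 minutes).
const settings = withExecutionDefaults({ num_workers: 3, timeout_ms: 20 * 60 * 1000 });
```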

Grading

Results are auto-graded after each task. The grader uses an LLM judge.

| Field | Description |
| --- | --- |
| grader_model | Model for grading (e.g., openai/gpt-4.1) |
| grader_api_key_env | Env var name for grader API key |
| grader_base_url | API endpoint (e.g., https://openrouter.ai/api/v1) |
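The TypeScript sketch below shows how these three fields could combine into a request against an OpenAI-compatible chat completions endpoint, which is what OpenRouter exposes. It only builds the request and does not send it. The `buildGraderRequest` helper and the judge prompt are invented for illustration; the eval's real grader prompts and request code are not shown in this README.

```typescript
interface GraderSettings {
  grader_model: string;
  grader_api_key_env: string;
  grader_base_url: string;
}

interface GraderRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Build (but do not send) a chat completions request for an LLM judge.
// The prompt text is a placeholder, not the eval's actual grading prompt.
function buildGraderRequest(
  settings: GraderSettings,
  env: Record<string, string | undefined>,
  task: string,
  agentAnswer: string,
): GraderRequest {
  const apiKey = env[settings.grader_api_key_env];
  if (!apiKey) {
    throw new Error(`${settings.grader_api_key_env} is not set`);
  }
  const baseUrl = settings.grader_base_url.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${baseUrl}/chat/completions`,
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: settings.grader_model,
      messages: [
        { role: "system", content: "Decide whether the agent completed the task. Reply PASS or FAIL." },
        { role: "user", content: `Task: ${task}\nAgent answer: ${agentAnswer}` },
      ],
    }),
  };
}
```

Note that grader_api_key_env holds the name of an environment variable, not the key itself, so the key is looked up at request time.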

Datasets

| File | Tasks | Description |
| --- | --- | --- |
| webvoyager_e2e_test.jsonl | 10 | WebVoyager test subset (quick smoke test) |
| webvoyager.jsonl | 643 | Full WebVoyager benchmark |
| mind2web_e2e_test.jsonl | 10 | Mind2Web test subset |
| mind2web.jsonl | 300 | Full Mind2Web benchmark |

Task format (JSONL, one task object per line; pretty-printed here for readability):

{
  "query_id": "Amazon--0",
  "dataset": "webvoyager",
  "query": "Search an Xbox Wireless controller with green color and rated above 4 stars.",
  "graders": ["webvoyager_grader", "fara_combined"],
  "start_url": "https://www.amazon.com/",
  "metadata": { "original_task_id": "Amazon--0", "website": "Amazon" }
}
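A custom dataset can be sanity-checked before a run with a small parser like the TypeScript sketch below. The `EvalTask` interface mirrors the fields in the example above. The `parseTasks` helper and the choice of which fields to treat as required (query_id, query, start_url) are assumptions, since the eval's own schema is not shown here.

```typescript
interface EvalTask {
  query_id: string;
  dataset: string;
  query: string;
  graders: string[];
  start_url: string;
  metadata: Record<string, unknown>;
}

// Parse JSONL text into tasks: one JSON object per non-empty line.
function parseTasks(jsonl: string): EvalTask[] {
  const tasks: EvalTask[] = [];
  jsonl.split("\n").forEach((line, index) => {
    const trimmed = line.trim();
    if (trimmed === "") return; // skip blank lines
    const raw = JSON.parse(trimmed) as Partial<EvalTask>;
    // Assumed minimum: these three fields must be strings.
    for (const field of ["query_id", "query", "start_url"] as const) {
      if (typeof raw[field] !== "string") {
        throw new Error(`line ${index + 1}: missing string field "${field}"`);
      }
    }
    tasks.push(raw as EvalTask);
  });
  return tasks;
}
```

Because each task must fit on a single line, a pretty-printed object like the one above has to be collapsed to one line before it goes into a .jsonl file.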

Output

Results are saved to output_dir:

results/
  oe-clado-openai/
    Amazon--0/
      metadata.json         # Task result, timing, grader scores
      messages.jsonl         # Full message log
      screenshots/
        001.png              # Step-by-step screenshots
        002.png
    summary.json             # Aggregate pass rates

Troubleshooting

BrowserOS not found: The eval expects the binary at /Applications/BrowserOS.app/Contents/MacOS/BrowserOS. Make sure BrowserOS is installed there.

Port conflicts: Each worker uses base_port + workerIndex. 3 workers on base 9110 → ports 9110, 9111, 9112. Stop other BrowserOS instances first.

API key not resolving: If your config has "apiKey": "OPENAI_API_KEY", ensure the env var is set in .env.development.

Tasks timing out: Increase timeout_ms. Default is 15 minutes; complex tasks may need 20+ minutes.

Headless vs headed: Set "headless": false to watch Chrome in real-time. Useful for debugging.