Files
BrowserOS/packages/browseros-agent/apps/eval
shivammittal274 7ee8dedd53 chore(eval): drop the 60-char truncation on grader expected/actual values
Some criteria check long strings (job descriptions, post bodies, etc.) —
truncating to 60 chars hides exactly the bytes you need to diff. The
viewer's reasoning area already has max-height + scroll + word-break so
long content scrolls; nothing renders worse for being full-length.
2026-04-30 02:08:30 +05:30
..

BrowserOS Eval

License: AGPL v3

Evaluation framework for BrowserOS browser automation agents. Runs tasks from standard datasets (WebVoyager, Mind2Web, AGI SDK / REAL Bench, WebArena-Infinity, WebBench), captures trajectories with screenshots, and grades results automatically.

Prerequisites

  • BrowserOS binary at /Applications/BrowserOS.app (macOS) or BROWSEROS_BINARY pointing at it
  • Bun runtime
  • API keys for your LLM provider (and CLAUDE_CODE_OAUTH_TOKEN if you use performance_grader)

Quick Start

cd apps/eval
# Edit .env.development with your keys, then:
bun run eval

Opens the eval dashboard at http://localhost:9900 in config mode. From there: load a preset, edit settings, click Run.

CLI mode

bun run eval -c configs/browseros-agent-weekly.json

Runs immediately. Dashboard still available at http://localhost:9900 for live progress.

Agent types

Type Description
single Single LLM agent driven by the BrowserOS tool loop (CDP)
orchestrator-executor High-level orchestrator + per-step executor (LLM or Clado visual model)

Single agent

{
  "agent": {
    "type": "single",
    "provider": "openai-compatible",
    "model": "moonshotai/kimi-k2.5",
    "apiKey": "OPENROUTER_API_KEY",
    "baseUrl": "https://openrouter.ai/api/v1",
    "supportsImages": true
  }
}

Orchestrator-Executor

The orchestrator works with any LLM provider. The executor can be another LLM, or the Clado action visual model that takes screenshots and predicts click/type/scroll coordinates.

{
  "agent": {
    "type": "orchestrator-executor",
    "orchestrator": {
      "provider": "openai-compatible",
      "model": "accounts/fireworks/models/kimi-k2p5",
      "apiKey": "FIREWORKS_API_KEY",
      "baseUrl": "https://api.fireworks.ai/inference/v1"
    },
    "executor": {
      "provider": "clado-action",
      "model": "Qwen3.5-35B-A3B-action-000159-merged",
      "apiKey": "",
      "baseUrl": "https://clado-ai--clado-browseros-action-000159-merged-actionmod-f4a6ef.modal.run"
    }
  }
}

Graders

Name Description
performance_grader Multi-axis grader running on Claude Agent SDK (uses its own credentials via CLAUDE_CODE_OAUTH_TOKEN)
agisdk_state_diff AGI SDK / REAL Bench environment state-diff grader (deterministic)
infinity_state WebArena-Infinity verifier-script grader (deterministic)

Set graders in your config to override the per-task graders field from the dataset:

"graders": ["performance_grader"]

Configuration reference

API keys

The apiKey field supports two formats:

  • Env var name: "OPENAI_API_KEY" — resolved from .env.development at runtime
  • Direct value: "sk-xxxxx" — used as-is (not recommended)

Supported providers

Provider provider value Requires baseUrl
OpenAI openai No
Anthropic anthropic No
Google google No
Azure OpenAI azure Yes
AWS Bedrock bedrock No
OpenRouter openrouter No
Fireworks, Together, etc. openai-compatible Yes
Ollama ollama No
Clado Action (executor only) clado-action Yes

BrowserOS infrastructure

"browseros": {
  "server_url": "http://127.0.0.1:9110",
  "base_cdp_port": 9010,
  "base_server_port": 9110,
  "base_extension_port": 9310,
  "load_extensions": false,
  "headless": true
}

Each worker gets its own Chrome instance. Worker N uses base_port + N for CDP and server ports.

Execution settings

Field Description Default
num_workers Parallel workers (each gets its own Chrome) 1
timeout_ms Per-task timeout in ms 1800000 (30 min)
restart_server_per_task Restart Chrome between tasks (cleaner state, slower) false

Datasets

File Tasks Description
webvoyager.jsonl 643 Full WebVoyager benchmark
mind2web.jsonl 300 Online-Mind2Web
webbench-{0,1,2}of4-50.jsonl 50 each WebBench shards (50-task subsets)
agisdk-real.jsonl 40 AGI SDK / REAL Bench (action-only tasks)
webarena-infinity-hard-50.jsonl 50 WebArena-Infinity hard set
browsecomp-medium-hard-50.jsonl 50 BrowseComp medium-hard
browsecomp-very-hard-50.jsonl 50 BrowseComp very-hard

Task format (JSONL, one per line):

{
  "query_id": "Amazon--0",
  "dataset": "webvoyager",
  "query": "Search an Xbox Wireless controller with green color and rated above 4 stars.",
  "graders": ["performance_grader"],
  "start_url": "https://www.amazon.com/",
  "metadata": { "original_task_id": "Amazon--0", "website": "Amazon" }
}

Output

Results are saved to output_dir:

results/
  browseros-agent-weekly/
    2026-04-29-1430/
      Amazon--0/
        metadata.json         # Task result, timing, grader scores
        messages.jsonl         # Full message log
        screenshots/
          001.png              # Step-by-step screenshots
          002.png
      summary.json             # Aggregate pass rates

Troubleshooting

BrowserOS not found: Expects /Applications/BrowserOS.app/Contents/MacOS/BrowserOS. Set BROWSEROS_BINARY to override.

Port conflicts: Each worker uses base_port + workerIndex. 3 workers on base 9110 → ports 9110, 9111, 9112. Stop other BrowserOS instances first.

API key not resolving: If your config has "apiKey": "OPENAI_API_KEY", ensure the env var is set in .env.development.

Tasks timing out: Increase timeout_ms. Default is 30 minutes.

Headless vs headed: Set "headless": false to watch Chrome in real time.