# BrowserOS Eval

Evaluation framework for benchmarking BrowserOS browser automation agents. Runs tasks from standard datasets (WebVoyager, Mind2Web), captures trajectories with screenshots, and grades results automatically.
## Prerequisites

- BrowserOS binary installed at `/Applications/BrowserOS.app` (macOS)
- Bun runtime
- API keys for your chosen LLM provider and grader model
## Quick Start

### 1. Set up environment

```bash
cd apps/eval
```

Edit `.env.development` and add your API keys:

```bash
# Pick ONE provider for the orchestrator (whichever you have access to)
OPENAI_API_KEY=sk-xxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxx
FIREWORKS_API_KEY=fw_xxxxx
GOOGLE_API_KEY=AIza-xxxxx

# For grading results (OpenRouter recommended — gives access to many models)
OPENROUTER_API_KEY=sk-or-v1-xxxxx
```

### 2. Launch the dashboard

```bash
bun run eval
```

Opens the Eval Dashboard at http://localhost:9900 in config mode.
### 3. Configure and run

From the dashboard:

- **Load a preset** — select from the dropdown or click **Load File** to import a config JSON
- **Edit settings** — change agent type, provider, model, API keys, dataset, workers, and timeouts
- **Save Config** — export your configuration for reuse
- **Click Run** — starts the evaluation with live progress
### Alternative: Run from CLI

```bash
bun run eval -c configs/orchestrator-executor-clado-test.json
```

Runs immediately. The dashboard is still available at http://localhost:9900 for live progress.
## Agent Types

### Orchestrator-Executor with Clado

The recommended architecture for visual model evals. Two tiers:

- **Orchestrator** — an LLM that plans and issues high-level instructions
- **Executor** — the Clado Action visual model that takes screenshots and predicts click/type/scroll coordinates

The orchestrator works with any LLM provider. Pick whichever you have access to:
**OpenAI orchestrator**

```json
{
  "agent": {
    "type": "orchestrator-executor",
    "orchestrator": {
      "provider": "openai",
      "model": "gpt-4o",
      "apiKey": "OPENAI_API_KEY"
    },
    "executor": {
      "provider": "clado-action",
      "model": "qwen3-vl-30b-a3b-instruct",
      "apiKey": "",
      "baseUrl": "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run"
    }
  },
  "dataset": "../data/webvoyager_e2e_test.jsonl",
  "output_dir": "../results/oe-clado-openai",
  "num_workers": 3,
  "browseros": {
    "server_url": "http://127.0.0.1:9110",
    "base_cdp_port": 9010,
    "base_server_port": 9110,
    "base_extension_port": 9310,
    "headless": true
  },
  "grader_api_key_env": "OPENROUTER_API_KEY",
  "grader_base_url": "https://openrouter.ai/api/v1",
  "grader_model": "openai/gpt-4.1",
  "timeout_ms": 1200000
}
```
**Anthropic orchestrator**

```json
"orchestrator": {
  "provider": "anthropic",
  "model": "claude-sonnet-4-20250514",
  "apiKey": "ANTHROPIC_API_KEY"
}
```
**Google orchestrator**

```json
"orchestrator": {
  "provider": "google",
  "model": "gemini-2.0-flash",
  "apiKey": "GOOGLE_API_KEY"
}
```
**Fireworks orchestrator (OpenAI-compatible)**

```json
"orchestrator": {
  "provider": "openai-compatible",
  "model": "accounts/fireworks/models/kimi-k2p5",
  "apiKey": "FIREWORKS_API_KEY",
  "baseUrl": "https://api.fireworks.ai/inference/v1"
}
```
The executor config stays the same across all orchestrator providers — it always uses the Clado action model.
### Other Agent Types

| Type | Description | Example config |
|---|---|---|
| `single` | Single LLM agent via Gemini CLI + MCP | `webvoyager-test.json` |
| `tool-loop` | AI SDK tool loop, connects via CDP | `tool-loop-test.json` |
| `gemini-computer-use` | Google native computer use API | `gemini-computer-use.json` |
| `yutori-navigator` | Yutori N1 visual model | `yutori-navigator.json` |
## Configuration Reference

### API keys

The `apiKey` field supports two formats:

- **Env var name:** `"OPENAI_API_KEY"` — resolved from `.env.development` at runtime
- **Direct value:** `"sk-xxxxx"` — used as-is (not recommended; prefer env vars)
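The resolution rule above can be sketched as follows. This is a minimal illustration, not the framework's actual implementation; in particular, the "looks like an env var name" heuristic is an assumption.

```typescript
// Sketch of the apiKey resolution rule: ALL_CAPS values are treated as
// env var names and looked up; anything else is used as a literal key.
function resolveApiKey(
  apiKey: string,
  env: Record<string, string | undefined>,
): string {
  const looksLikeEnvVarName = /^[A-Z][A-Z0-9_]*$/.test(apiKey);
  if (looksLikeEnvVarName) {
    const value = env[apiKey];
    if (!value) throw new Error(`Env var ${apiKey} is not set`);
    return value;
  }
  return apiKey; // direct value, used as-is
}
```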
### Supported providers

| Provider | `provider` value | Requires `baseUrl` |
|---|---|---|
| OpenAI | `openai` | No |
| Anthropic | `anthropic` | No |
| Google | `google` | No |
| Azure OpenAI | `azure` | Yes |
| AWS Bedrock | `bedrock` | No (uses `region`, `accessKeyId`, `secretAccessKey`) |
| OpenRouter | `openrouter` | No |
| Fireworks, Together, etc. | `openai-compatible` | Yes |
| Ollama | `ollama` | No |
| Clado Action (executor only) | `clado-action` | Yes |
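The `baseUrl` requirement in the table can be checked up front. The set below is copied from the table; the validation helper itself is an illustrative sketch, not the framework's code.

```typescript
// Providers that require a baseUrl, per the table above.
const PROVIDERS_REQUIRING_BASE_URL = new Set([
  "azure",
  "openai-compatible",
  "clado-action",
]);

// Hypothetical helper: fail fast if a config omits a required baseUrl.
function validateProviderConfig(provider: string, baseUrl?: string): void {
  if (PROVIDERS_REQUIRING_BASE_URL.has(provider) && !baseUrl) {
    throw new Error(`Provider "${provider}" requires a baseUrl`);
  }
}
```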
### BrowserOS infrastructure

```json
"browseros": {
  "server_url": "http://127.0.0.1:9110",
  "base_cdp_port": 9010,
  "base_server_port": 9110,
  "base_extension_port": 9310,
  "load_extensions": false,
  "headless": true
}
```

Each worker gets its own Chrome instance. Worker N uses `base_port + N` for CDP and server ports. `base_extension_port` is still reserved as a legacy BrowserOS launch argument for compatibility with Chromium builds that still pass it.
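The per-worker port scheme can be sketched like this (an illustrative helper; the function and field names are assumptions, not the framework's API):

```typescript
// Worker N is offset from each base port by N, so workers never share
// a CDP, server, or extension port.
interface WorkerPorts {
  cdpPort: number;
  serverPort: number;
  extensionPort: number;
}

function portsForWorker(
  baseCdpPort: number,
  baseServerPort: number,
  baseExtensionPort: number,
  workerIndex: number,
): WorkerPorts {
  return {
    cdpPort: baseCdpPort + workerIndex,
    serverPort: baseServerPort + workerIndex,
    extensionPort: baseExtensionPort + workerIndex,
  };
}
```

With the default bases, worker 2 would land on 9012/9112/9312.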
### Execution settings

| Field | Description | Default |
|---|---|---|
| `num_workers` | Parallel workers (each gets its own Chrome) | `1` |
| `timeout_ms` | Per-task timeout in ms | `900000` (15 min) |
| `restart_server_per_task` | Restart Chrome between tasks (cleaner state, slower) | `false` |
### Grading

Results are auto-graded after each task. The grader uses an LLM judge.

| Field | Description |
|---|---|
| `grader_model` | Model used for grading (e.g., `openai/gpt-4.1`) |
| `grader_api_key_env` | Env var name for the grader API key |
| `grader_base_url` | API endpoint (e.g., `https://openrouter.ai/api/v1`) |
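As a hedged illustration, a judge request against an OpenAI-compatible endpoint such as `grader_base_url` could be assembled like this. The prompt wording and function name are assumptions, not the framework's actual grader.

```typescript
// Build a chat-completions request body for an LLM judge.
// Illustrative only: the real grader's prompt and schema may differ.
function buildGraderRequest(
  model: string,
  task: string,
  trajectorySummary: string,
) {
  return {
    model,
    messages: [
      {
        role: "system",
        content:
          "You judge whether a browser agent completed the task. Reply PASS or FAIL.",
      },
      {
        role: "user",
        content: `Task: ${task}\n\nTrajectory:\n${trajectorySummary}`,
      },
    ],
  };
}
```

The body would then be POSTed to `${grader_base_url}/chat/completions` with the key resolved from `grader_api_key_env`.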
## Datasets

| File | Tasks | Description |
|---|---|---|
| `webvoyager_e2e_test.jsonl` | 10 | WebVoyager test subset (quick smoke test) |
| `webvoyager.jsonl` | 643 | Full WebVoyager benchmark |
| `mind2web_e2e_test.jsonl` | 10 | Mind2Web test subset |
| `mind2web.jsonl` | 300 | Full Mind2Web benchmark |
Task format (JSONL, one task per line):

```json
{
  "query_id": "Amazon--0",
  "dataset": "webvoyager",
  "query": "Search an Xbox Wireless controller with green color and rated above 4 stars.",
  "graders": ["webvoyager_grader", "fara_combined"],
  "start_url": "https://www.amazon.com/",
  "metadata": { "original_task_id": "Amazon--0", "website": "Amazon" }
}
```
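For illustration, tasks in this format can be loaded with a small JSONL parser. This is a sketch: the `EvalTask` type mirrors the fields above but is not the framework's actual type.

```typescript
// Shape of one task line, mirroring the example above (illustrative).
interface EvalTask {
  query_id: string;
  dataset: string;
  query: string;
  graders: string[];
  start_url: string;
  metadata?: Record<string, unknown>;
}

// Parse a JSONL string: one JSON object per non-empty line.
function parseTasks(jsonl: string): EvalTask[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as EvalTask);
}
```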
## Output

Results are saved to `output_dir`:

```
results/
  oe-clado-openai/
    Amazon--0/
      metadata.json       # Task result, timing, grader scores
      messages.jsonl      # Full message log
      screenshots/
        001.png           # Step-by-step screenshots
        002.png
    summary.json          # Aggregate pass rates
```
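The aggregate pass rate in `summary.json` amounts to a simple fold over per-task results. A hedged sketch follows; the `TaskResult` shape is an assumption, not the framework's real schema.

```typescript
// Minimal per-task result (illustrative, not the actual metadata.json schema).
interface TaskResult {
  queryId: string;
  passed: boolean;
}

// Fraction of tasks that passed; 0 for an empty run.
function passRate(results: TaskResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.passed).length;
  return passed / results.length;
}
```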
## Troubleshooting

- **BrowserOS not found:** the runner expects `/Applications/BrowserOS.app/Contents/MacOS/BrowserOS`. Make sure it's installed.
- **Port conflicts:** each worker uses `base_port + workerIndex`; 3 workers on base 9110 use ports 9110, 9111, and 9112. Stop other BrowserOS instances first.
- **API key not resolving:** if your config has `"apiKey": "OPENAI_API_KEY"`, ensure that env var is set in `.env.development`.
- **Tasks timing out:** increase `timeout_ms`. The default is 15 minutes; complex tasks may need 20+ minutes.
- **Headless vs headed:** set `"headless": false` to watch Chrome in real time. Useful for debugging.