mirror of https://github.com/browseros-ai/BrowserOS.git synced 2026-05-13 15:46:22 +00:00

BrowserOS Eval

Evaluation framework for benchmarking BrowserOS browser automation agents. Runs tasks from standard datasets (WebVoyager, Mind2Web), captures trajectories with screenshots, and grades results automatically.

Prerequisites

BrowserOS binary installed at /Applications/BrowserOS.app (macOS)
Bun runtime
API keys for your chosen LLM provider and grader model

Quick Start

1. Set up environment

cd apps/eval

Edit .env.development and add your API keys:

# Pick ONE provider for the orchestrator (whichever you have access to)
OPENAI_API_KEY=sk-xxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxx
FIREWORKS_API_KEY=fw_xxxxx
GOOGLE_API_KEY=AIza-xxxxx

# For grading results (OpenRouter recommended — gives access to many models)
OPENROUTER_API_KEY=sk-or-v1-xxxxx

2. Launch the dashboard

bun run eval

This opens the Eval Dashboard at http://localhost:9900 in config mode, so nothing runs until you start an evaluation from the dashboard.

3. Configure and run

From the dashboard:

Load a preset — select from the dropdown or click Load File to import a config JSON
Edit settings — change agent type, provider, model, API keys, dataset, workers, timeouts
Save Config — export your configuration for reuse
Click Run — starts the evaluation with live progress

Alternative: Run from CLI

bun run eval -c configs/orchestrator-executor-clado-test.json

The evaluation starts immediately with the given config. The dashboard is still available at http://localhost:9900 for live progress.

Agent Types

Orchestrator-Executor with Clado

This is the recommended architecture for visual model evals. It has two tiers:

Orchestrator — An LLM that plans and issues high-level instructions
Executor — The Clado Action visual model that takes screenshots and predicts click/type/scroll coordinates

The orchestrator works with any LLM provider. Pick whichever you have access to:

OpenAI orchestrator

{
  "agent": {
    "type": "orchestrator-executor",
    "orchestrator": {
      "provider": "openai",
      "model": "gpt-4o",
      "apiKey": "OPENAI_API_KEY"
    },
    "executor": {
      "provider": "clado-action",
      "model": "qwen3-vl-30b-a3b-instruct",
      "apiKey": "",
      "baseUrl": "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run"
    }
  },
  "dataset": "../data/webvoyager_e2e_test.jsonl",
  "output_dir": "../results/oe-clado-openai",
  "num_workers": 3,
  "browseros": {
    "server_url": "http://127.0.0.1:9110",
    "base_cdp_port": 9010,
    "base_server_port": 9110,
    "base_extension_port": 9310,
    "headless": true
  },
  "grader_api_key_env": "OPENROUTER_API_KEY",
  "grader_base_url": "https://openrouter.ai/api/v1",
  "grader_model": "openai/gpt-4.1",
  "timeout_ms": 1200000
}

Anthropic orchestrator

"orchestrator": {
  "provider": "anthropic",
  "model": "claude-sonnet-4-20250514",
  "apiKey": "ANTHROPIC_API_KEY"
}

Google orchestrator

"orchestrator": {
  "provider": "google",
  "model": "gemini-2.0-flash",
  "apiKey": "GOOGLE_API_KEY"
}

Fireworks orchestrator (OpenAI-compatible)

"orchestrator": {
  "provider": "openai-compatible",
  "model": "accounts/fireworks/models/kimi-k2p5",
  "apiKey": "FIREWORKS_API_KEY",
  "baseUrl": "https://api.fireworks.ai/inference/v1"
}

The executor config stays the same across all orchestrator providers — it always uses the Clado action model.
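Because only the orchestrator block changes between providers, a variant config can be derived from a base one instead of being copied by hand. The sketch below is illustrative only: the `EvalConfig` types and the `withOrchestrator` helper are hypothetical names, not part of the eval package, and they model just the fields shown in the configs above.

```typescript
// Hypothetical types covering only the fields shown in the configs above.
interface OrchestratorConfig {
  provider: string;
  model: string;
  apiKey: string;
  baseUrl?: string;
}

interface AgentConfig {
  type: string;
  orchestrator: OrchestratorConfig;
  executor: Record<string, unknown>;
}

interface EvalConfig {
  agent: AgentConfig;
  [key: string]: unknown; // dataset, num_workers, browseros, grader_* ...
}

// Return a copy of `base` with a different orchestrator.
// The executor and every other field are carried over unchanged.
function withOrchestrator(
  base: EvalConfig,
  orchestrator: OrchestratorConfig,
): EvalConfig {
  return { ...base, agent: { ...base.agent, orchestrator } };
}

const openaiConfig: EvalConfig = {
  agent: {
    type: "orchestrator-executor",
    orchestrator: { provider: "openai", model: "gpt-4o", apiKey: "OPENAI_API_KEY" },
    executor: { provider: "clado-action", model: "qwen3-vl-30b-a3b-instruct" },
  },
  num_workers: 3,
};

const fireworksConfig = withOrchestrator(openaiConfig, {
  provider: "openai-compatible",
  model: "accounts/fireworks/models/kimi-k2p5",
  apiKey: "FIREWORKS_API_KEY",
  baseUrl: "https://api.fireworks.ai/inference/v1",
});
```

Serializing `fireworksConfig` with `JSON.stringify(fireworksConfig, null, 2)` would give a file that could be passed to `bun run eval -c`, assuming the remaining fields from the full example are filled in.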

Other Agent Types

Type	Description	Example config
`single`	Single LLM agent via Gemini CLI + MCP	`webvoyager-test.json`
`tool-loop`	AI SDK tool loop, connects via CDP	`tool-loop-test.json`
`gemini-computer-use`	Google native computer use API	`gemini-computer-use.json`
`yutori-navigator`	Yutori N1 visual model	`yutori-navigator.json`

Configuration Reference

API keys

The apiKey field supports two formats:

Env var name: "OPENAI_API_KEY" — resolved from .env.development at runtime
Direct value: "sk-xxxxx" — used as-is (not recommended, prefer env vars)
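One way to picture the two formats is the sketch below. It is not the framework's actual resolver: the rule used to tell the formats apart (an all-caps identifier is treated as an env var name) is an assumption made for illustration, and `resolveApiKey` is a hypothetical helper name.

```typescript
// Assumption: a value made only of A-Z, 0-9 and "_" is an env var name.
const ENV_VAR_NAME = /^[A-Z][A-Z0-9_]*$/;

function resolveApiKey(
  value: string,
  env: Record<string, string | undefined>,
): string {
  // Anything that does not look like an env var name is used as-is.
  if (!ENV_VAR_NAME.test(value)) return value;

  const resolved = env[value];
  if (resolved === undefined || resolved === "") {
    throw new Error(`Environment variable ${value} is not set`);
  }
  return resolved;
}

// Example with an explicit env object; in a Bun or Node script
// `process.env` could be passed instead.
const key = resolveApiKey("OPENAI_API_KEY", { OPENAI_API_KEY: "sk-example" });
```

Failing loudly on a missing variable is a deliberate choice in this sketch: it surfaces the "API key not resolving" case at startup rather than as an authentication error mid-run.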

Supported providers

Provider	`provider` value	Requires `baseUrl`
OpenAI	`openai`	No
Anthropic	`anthropic`	No
Google	`google`	No
Azure OpenAI	`azure`	Yes
AWS Bedrock	`bedrock`	No (uses `region`, `accessKeyId`, `secretAccessKey`)
OpenRouter	`openrouter`	No
Fireworks, Together, etc.	`openai-compatible`	Yes
Ollama	`ollama`	No
Clado Action (executor only)	`clado-action`	Yes
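The table above can be turned into a quick pre-flight check for a config. This is a minimal sketch under stated assumptions: `validateProvider` is a hypothetical helper, and it checks only the `baseUrl` column (it does not verify the Bedrock `region` or credential fields).

```typescript
// provider value -> whether baseUrl is required, copied from the table above.
const REQUIRES_BASE_URL = new Map<string, boolean>([
  ["openai", false],
  ["anthropic", false],
  ["google", false],
  ["azure", true],
  ["bedrock", false],
  ["openrouter", false],
  ["openai-compatible", true],
  ["ollama", false],
  ["clado-action", true],
]);

// Return a list of human-readable problems; an empty list means the block looks fine.
function validateProvider(cfg: { provider: string; baseUrl?: string }): string[] {
  const required = REQUIRES_BASE_URL.get(cfg.provider);
  if (required === undefined) {
    return [`unknown provider "${cfg.provider}"`];
  }
  if (required && !cfg.baseUrl) {
    return [`provider "${cfg.provider}" requires baseUrl`];
  }
  return [];
}

const problems = validateProvider({ provider: "openai-compatible" });
```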

BrowserOS infrastructure

"browseros": {
  "server_url": "http://127.0.0.1:9110",
  "base_cdp_port": 9010,
  "base_server_port": 9110,
  "base_extension_port": 9310,
  "load_extensions": false,
  "headless": true
}

Each worker gets its own Chrome instance. Worker N (counting from 0) uses base_cdp_port + N, base_server_port + N, and base_extension_port + N.
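The port arithmetic can be written out as a small function. This is a sketch of the scheme described here, not the framework's code; `portsForWorker` and the `WorkerPorts` shape are hypothetical names, and worker indices are assumed to start at 0 (consistent with the troubleshooting example of 3 workers on ports 9110 to 9112).

```typescript
// Field names follow the "browseros" config block above.
interface BrowserOSPortConfig {
  base_cdp_port: number;
  base_server_port: number;
  base_extension_port: number;
}

interface WorkerPorts {
  cdp: number;
  server: number;
  extension: number;
}

// Each port is its own base plus the zero-based worker index.
function portsForWorker(cfg: BrowserOSPortConfig, workerIndex: number): WorkerPorts {
  if (!Number.isInteger(workerIndex) || workerIndex < 0) {
    throw new RangeError(`workerIndex must be a non-negative integer, got ${workerIndex}`);
  }
  return {
    cdp: cfg.base_cdp_port + workerIndex,
    server: cfg.base_server_port + workerIndex,
    extension: cfg.base_extension_port + workerIndex,
  };
}

const ports: BrowserOSPortConfig = {
  base_cdp_port: 9010,
  base_server_port: 9110,
  base_extension_port: 9310,
};

// List the ports a 3-worker run would occupy.
const allocated = [0, 1, 2].map((i) => portsForWorker(ports, i));
```

With the bases shown above, the third worker (index 2) ends up on CDP port 9012, server port 9112, and extension port 9312.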

Execution settings

Field	Description	Default
`num_workers`	Parallel workers (each gets its own Chrome)	`1`
`timeout_ms`	Per-task timeout in ms	`900000` (15 min)
`restart_server_per_task`	Restart Chrome between tasks (cleaner state, slower)	`false`
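As an illustration of how these defaults combine with a partial config, here is a minimal sketch. The `withExecutionDefaults` and `minutesToMs` helpers are hypothetical; only the field names and default values come from the table above.

```typescript
interface ExecutionSettings {
  num_workers: number;
  timeout_ms: number;
  restart_server_per_task: boolean;
}

// Defaults as listed in the table above.
const EXECUTION_DEFAULTS: ExecutionSettings = {
  num_workers: 1,
  timeout_ms: 900_000, // 15 minutes
  restart_server_per_task: false,
};

// Convert minutes to the millisecond value timeout_ms expects.
function minutesToMs(minutes: number): number {
  return minutes * 60 * 1000;
}

// Fields present in `partial` win; missing fields fall back to the defaults.
function withExecutionDefaults(partial: Partial<ExecutionSettings>): ExecutionSettings {
  return { ...EXECUTION_DEFAULTS, ...partial };
}

const settings = withExecutionDefaults({
  num_workers: 3,
  timeout_ms: minutesToMs(20), // 1200000, as in the OpenAI example config
});
```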

Grading

Results are auto-graded after each task. The grader uses an LLM judge.

Field	Description
`grader_model`	Model for grading (e.g., `openai/gpt-4.1`)
`grader_api_key_env`	Env var name for grader API key
`grader_base_url`	API endpoint (e.g., `https://openrouter.ai/api/v1`)

Datasets

File	Tasks	Description
`webvoyager_e2e_test.jsonl`	10	WebVoyager test subset (quick smoke test)
`webvoyager.jsonl`	643	Full WebVoyager benchmark
`mind2web_e2e_test.jsonl`	10	Mind2Web test subset
`mind2web.jsonl`	300	Full Mind2Web benchmark

Task format (JSONL, one JSON object per line; shown pretty-printed here for readability):

{
  "query_id": "Amazon--0",
  "dataset": "webvoyager",
  "query": "Search an Xbox Wireless controller with green color and rated above 4 stars.",
  "graders": ["webvoyager_grader", "fara_combined"],
  "start_url": "https://www.amazon.com/",
  "metadata": { "original_task_id": "Amazon--0", "website": "Amazon" }
}
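For custom datasets, it can help to check a file against this shape before a run. The sketch below is an assumption-laden example rather than the framework's loader: the `EvalTask` interface mirrors the sample task above, treating every field as required is a guess, and `parseTasks` is a hypothetical helper.

```typescript
// Shape inferred from the sample task above.
interface EvalTask {
  query_id: string;
  dataset: string;
  query: string;
  graders: string[];
  start_url: string;
  metadata: Record<string, unknown>;
}

// Parse JSONL text into tasks, reporting the 1-based line number on failure.
function parseTasks(jsonl: string): EvalTask[] {
  const tasks: EvalTask[] = [];
  jsonl.split(/\r?\n/).forEach((line, i) => {
    if (line.trim() === "") return; // tolerate blank lines
    const raw = JSON.parse(line) as Partial<EvalTask>;
    for (const field of ["query_id", "dataset", "query", "start_url"] as const) {
      if (typeof raw[field] !== "string") {
        throw new Error(`line ${i + 1}: missing string field "${field}"`);
      }
    }
    if (!Array.isArray(raw.graders)) {
      throw new Error(`line ${i + 1}: "graders" must be an array`);
    }
    tasks.push(raw as EvalTask);
  });
  return tasks;
}

const sample =
  '{"query_id":"Amazon--0","dataset":"webvoyager","query":"Search an Xbox Wireless controller with green color and rated above 4 stars.","graders":["webvoyager_grader","fara_combined"],"start_url":"https://www.amazon.com/","metadata":{"website":"Amazon"}}\n';

const tasks = parseTasks(sample);
```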

Output

Results are saved to output_dir:

results/
  oe-clado-openai/
    Amazon--0/
      metadata.json         # Task result, timing, grader scores
      messages.jsonl         # Full message log
      screenshots/
        001.png              # Step-by-step screenshots
        002.png
    summary.json             # Aggregate pass rates
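When post-processing results with your own scripts, the layout above maps to paths like the following. This is a sketch inferred from the tree only: the helper names are hypothetical, and the three-digit, 1-based screenshot numbering is an assumption based on the `001.png` and `002.png` entries shown.

```typescript
// Screenshot file name for a 1-based step number, e.g. step 1 -> "001.png".
function screenshotName(step: number): string {
  return `${String(step).padStart(3, "0")}.png`;
}

// Paths for one task inside an output directory, following the tree above.
function taskPaths(outputDir: string, queryId: string) {
  const taskDir = `${outputDir}/${queryId}`;
  return {
    metadata: `${taskDir}/metadata.json`,
    messages: `${taskDir}/messages.jsonl`,
    screenshots: `${taskDir}/screenshots`,
    summary: `${outputDir}/summary.json`,
  };
}

const paths = taskPaths("results/oe-clado-openai", "Amazon--0");
const firstShot = `${paths.screenshots}/${screenshotName(1)}`;
```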

Troubleshooting

BrowserOS not found: The framework expects the binary at /Applications/BrowserOS.app/Contents/MacOS/BrowserOS. Make sure BrowserOS is installed there.

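Port conflicts: Each worker uses base_port + workerIndex, with workerIndex starting at 0. With 3 workers and base_server_port 9110, the server ports are 9110, 9111, and 9112; CDP and extension ports are offset from their own bases the same way. Stop other BrowserOS instances first.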

API key not resolving: If your config has "apiKey": "OPENAI_API_KEY", ensure the env var is set in .env.development.

Tasks timing out: Increase timeout_ms. The default is 900000 (15 minutes); complex tasks may need 20 minutes or more, for example 1200000.

Headless vs headed: Set "headless": false to watch Chrome in real-time. Useful for debugging.