BrowserOS Eval
Evaluation framework for benchmarking BrowserOS browser automation agents. Runs tasks from standard datasets (WebVoyager, Mind2Web), captures trajectories with screenshots, and grades results automatically.
Prerequisites
- BrowserOS binary installed at /Applications/BrowserOS.app (macOS)
- Bun runtime
- API keys for your chosen LLM provider and grader model
Quick Start
1. Set up environment
cd apps/eval
Edit .env.development and add your API keys:
# Pick ONE provider for the orchestrator (whichever you have access to)
OPENAI_API_KEY=sk-xxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxx
FIREWORKS_API_KEY=fw_xxxxx
GOOGLE_API_KEY=AIza-xxxxx
# For grading results (OpenRouter recommended — gives access to many models)
OPENROUTER_API_KEY=sk-or-v1-xxxxx
2. Launch the dashboard
bun run eval
Opens the Eval Dashboard at http://localhost:9900 in config mode.
3. Configure and run
From the dashboard:
- Load a preset — select from the dropdown or click Load File to import a config JSON
- Edit settings — change agent type, provider, model, API keys, dataset, workers, timeouts
- Save Config — export your configuration for reuse
- Click Run — starts the evaluation with live progress
Alternative: Run from CLI
bun run eval -c configs/orchestrator-executor-clado-test.json
The evaluation starts immediately. The dashboard is still available at http://localhost:9900 for live progress.
Agent Types
Orchestrator-Executor with Clado
The recommended architecture for visual model evals. Two tiers:
- Orchestrator — An LLM that plans and issues high-level instructions
- Executor — The Clado Action visual model that takes screenshots and predicts click/type/scroll coordinates
The orchestrator works with any LLM provider. Pick whichever you have access to:
OpenAI orchestrator
{
  "agent": {
    "type": "orchestrator-executor",
    "orchestrator": {
      "provider": "openai",
      "model": "gpt-4o",
      "apiKey": "OPENAI_API_KEY"
    },
    "executor": {
      "provider": "clado-action",
      "model": "qwen3-vl-30b-a3b-instruct",
      "apiKey": "",
      "baseUrl": "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run"
    }
  },
  "dataset": "../data/webvoyager_e2e_test.jsonl",
  "output_dir": "../results/oe-clado-openai",
  "num_workers": 3,
  "browseros": {
    "server_url": "http://127.0.0.1:9110",
    "base_cdp_port": 9010,
    "base_server_port": 9110,
    "base_extension_port": 9310,
    "headless": true
  },
  "grader_api_key_env": "OPENROUTER_API_KEY",
  "grader_base_url": "https://openrouter.ai/api/v1",
  "grader_model": "openai/gpt-4.1",
  "timeout_ms": 1200000
}
Anthropic orchestrator
"orchestrator": {
  "provider": "anthropic",
  "model": "claude-sonnet-4-20250514",
  "apiKey": "ANTHROPIC_API_KEY"
}
Google orchestrator
"orchestrator": {
  "provider": "google",
  "model": "gemini-2.0-flash",
  "apiKey": "GOOGLE_API_KEY"
}
Fireworks orchestrator (OpenAI-compatible)
"orchestrator": {
  "provider": "openai-compatible",
  "model": "accounts/fireworks/models/kimi-k2p5",
  "apiKey": "FIREWORKS_API_KEY",
  "baseUrl": "https://api.fireworks.ai/inference/v1"
}
The executor config stays the same across all orchestrator providers — it always uses the Clado action model.
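As a minimal sketch of that reuse, the snippet below builds the agent section for a different orchestrator while keeping the Clado executor block from the OpenAI example unchanged. The type names and the agentWith helper are hypothetical and exist only for illustration; they are not part of the eval framework.

```typescript
// Hypothetical types mirroring the JSON fields shown in the examples above.
interface ModelEndpoint {
  provider: string;
  model: string;
  apiKey: string;
  baseUrl?: string;
}

interface AgentSection {
  type: "orchestrator-executor";
  orchestrator: ModelEndpoint;
  executor: ModelEndpoint;
}

// The Clado executor block, copied from the OpenAI example config.
const cladoExecutor: ModelEndpoint = {
  provider: "clado-action",
  model: "qwen3-vl-30b-a3b-instruct",
  apiKey: "",
  baseUrl: "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run",
};

// Build the agent section for any orchestrator, reusing the same executor.
function agentWith(orchestrator: ModelEndpoint): AgentSection {
  return {
    type: "orchestrator-executor",
    orchestrator,
    executor: { ...cladoExecutor },
  };
}

const anthropicAgent = agentWith({
  provider: "anthropic",
  model: "claude-sonnet-4-20250514",
  apiKey: "ANTHROPIC_API_KEY",
});

// Print the section in the shape a config file would contain.
console.log(JSON.stringify({ agent: anthropicAgent }, null, 2));
```

Only the orchestrator argument changes between providers, which is the property the configs above rely on.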
Other Agent Types
| Type | Description | Example config |
|---|---|---|
| single | Single LLM agent via Gemini CLI + MCP | webvoyager-test.json |
| tool-loop | AI SDK tool loop, connects via CDP | tool-loop-test.json |
| gemini-computer-use | Google native computer use API | gemini-computer-use.json |
| yutori-navigator | Yutori N1 visual model | yutori-navigator.json |
Configuration Reference
API keys
The apiKey field supports two formats:
- Env var name: "OPENAI_API_KEY", resolved from .env.development at runtime
- Direct value: "sk-xxxxx", used as-is (not recommended; prefer env vars)
Supported providers
| Provider | provider value | Requires baseUrl |
|---|---|---|
| OpenAI | openai | No |
| Anthropic | anthropic | No |
| Google | google | No |
| Azure OpenAI | azure | Yes |
| AWS Bedrock | bedrock | No (uses region, accessKeyId, secretAccessKey) |
| OpenRouter | openrouter | No |
| Fireworks, Together, etc. | openai-compatible | Yes |
| Ollama | ollama | No |
| Clado Action (executor only) | clado-action | Yes |
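The baseUrl column can be turned into a quick pre-flight check before a run. The sketch below encodes the table as a lookup and reports missing baseUrl values. The checkProvider function and its message strings are hypothetical, and the framework's own validation may behave differently.

```typescript
// Whether each provider needs a baseUrl, taken from the table above.
const REQUIRES_BASE_URL = new Map<string, boolean>([
  ["openai", false],
  ["anthropic", false],
  ["google", false],
  ["azure", true],
  ["bedrock", false],
  ["openrouter", false],
  ["openai-compatible", true],
  ["ollama", false],
  ["clado-action", true],
]);

// Hypothetical subset of a provider block; only the fields checked here.
interface ProviderSettings {
  provider: string;
  baseUrl?: string;
}

// Return a list of human-readable problems; an empty list means the block looks fine.
function checkProvider(settings: ProviderSettings): string[] {
  const problems: string[] = [];
  const needsBaseUrl = REQUIRES_BASE_URL.get(settings.provider);
  if (needsBaseUrl === undefined) {
    problems.push(`Unknown provider "${settings.provider}"`);
  } else if (needsBaseUrl && !settings.baseUrl) {
    problems.push(`Provider "${settings.provider}" requires baseUrl`);
  }
  return problems;
}

console.log(checkProvider({ provider: "openai" }));
console.log(checkProvider({ provider: "azure" }));
```

Returning a list rather than throwing lets a caller collect problems from both the orchestrator and executor blocks before reporting them.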
BrowserOS infrastructure
"browseros": {
  "server_url": "http://127.0.0.1:9110",
  "base_cdp_port": 9010,
  "base_server_port": 9110,
  "base_extension_port": 9310,
  "load_extensions": false,
  "headless": true
}
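Each worker gets its own Chrome instance. Worker N (counting from 0) adds N to each base port, so its CDP, server, and extension ports are base_cdp_port + N, base_server_port + N, and base_extension_port + N.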
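The port arithmetic can be sketched as a small function. The portsForWorker name and the shape of its return value are hypothetical. The zero-based worker index is inferred from the Troubleshooting example, where 3 workers on base 9110 use ports 9110, 9111, and 9112.

```typescript
// The three base ports from the "browseros" config block.
interface BasePorts {
  base_cdp_port: number;
  base_server_port: number;
  base_extension_port: number;
}

// Compute the ports for one worker, assuming zero-based worker indices.
function portsForWorker(base: BasePorts, workerIndex: number) {
  return {
    cdp: base.base_cdp_port + workerIndex,
    server: base.base_server_port + workerIndex,
    extension: base.base_extension_port + workerIndex,
  };
}

const basePorts: BasePorts = {
  base_cdp_port: 9010,
  base_server_port: 9110,
  base_extension_port: 9310,
};

// List the ports that a 3-worker run would occupy.
for (let i = 0; i < 3; i++) {
  console.log(`worker ${i}`, portsForWorker(basePorts, i));
}
```

Listing the ports this way before a run is a quick way to see which ones must be free when num_workers is raised.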
Execution settings
| Field | Description | Default |
|---|---|---|
| num_workers | Parallel workers (each gets its own Chrome) | 1 |
| timeout_ms | Per-task timeout in ms | 900000 (15 min) |
| restart_server_per_task | Restart Chrome between tasks (cleaner state, slower) | false |
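To illustrate how these defaults combine with a partial config, here is a minimal sketch that fills in any omitted field. The withExecutionDefaults helper is hypothetical, and the framework may apply its defaults differently.

```typescript
// The three execution fields from the table above.
interface ExecutionSettings {
  num_workers: number;
  timeout_ms: number;
  restart_server_per_task: boolean;
}

// Defaults as documented in the table.
const EXECUTION_DEFAULTS: ExecutionSettings = {
  num_workers: 1,
  timeout_ms: 900_000, // 15 minutes
  restart_server_per_task: false,
};

// Fields present in the config override the defaults; omitted fields fall back.
function withExecutionDefaults(partial: Partial<ExecutionSettings>): ExecutionSettings {
  return { ...EXECUTION_DEFAULTS, ...partial };
}

// A config that only sets num_workers keeps the 15-minute timeout.
const effectiveSettings = withExecutionDefaults({ num_workers: 3 });
console.log(effectiveSettings);
```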
Grading
Results are auto-graded after each task. The grader uses an LLM judge.
| Field | Description |
|---|---|
| grader_model | Model for grading (e.g., openai/gpt-4.1) |
| grader_api_key_env | Env var name for grader API key |
| grader_base_url | API endpoint (e.g., https://openrouter.ai/api/v1) |
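The sketch below shows how these three fields could map onto a request target. It assumes the grader talks to an OpenAI-compatible chat completions endpoint under grader_base_url, which is how OpenRouter exposes its API. The graderTarget helper is hypothetical, and the framework's actual request path and prompt are not documented here.

```typescript
// The three grader fields from the table above.
interface GraderFields {
  grader_model: string;
  grader_api_key_env: string;
  grader_base_url: string;
}

// Resolve the API key from the environment and build the request target.
// Assumption: the endpoint is <grader_base_url>/chat/completions.
function graderTarget(cfg: GraderFields, env: Record<string, string | undefined>) {
  const apiKey = env[cfg.grader_api_key_env];
  if (!apiKey) {
    throw new Error(`Environment variable ${cfg.grader_api_key_env} is not set`);
  }
  const baseUrl = cfg.grader_base_url.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${baseUrl}/chat/completions`,
    model: cfg.grader_model,
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
  };
}

const graderExample = graderTarget(
  {
    grader_model: "openai/gpt-4.1",
    grader_api_key_env: "OPENROUTER_API_KEY",
    grader_base_url: "https://openrouter.ai/api/v1",
  },
  { OPENROUTER_API_KEY: "sk-or-v1-xxxxx" }, // placeholder key for illustration
);
console.log(graderExample.url);
```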
Datasets
| File | Tasks | Description |
|---|---|---|
| webvoyager_e2e_test.jsonl | 10 | WebVoyager test subset (quick smoke test) |
| webvoyager.jsonl | 643 | Full WebVoyager benchmark |
| mind2web_e2e_test.jsonl | 10 | Mind2Web test subset |
| mind2web.jsonl | 300 | Full Mind2Web benchmark |
Task format (JSONL, one per line):
{
  "query_id": "Amazon--0",
  "dataset": "webvoyager",
  "query": "Search an Xbox Wireless controller with green color and rated above 4 stars.",
  "graders": ["webvoyager_grader", "fara_combined"],
  "start_url": "https://www.amazon.com/",
  "metadata": { "original_task_id": "Amazon--0", "website": "Amazon" }
}
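A dataset file in this format can be read with a few lines of TypeScript. The sketch below is illustrative only: the EvalTask interface lists the fields from the sample task, and which of them are required is an assumption. The parseTasks helper is hypothetical and is not the framework's loader.

```typescript
// Fields taken from the sample task above; optionality is an assumption.
interface EvalTask {
  query_id: string;
  dataset: string;
  query: string;
  graders: string[];
  start_url: string;
  metadata: Record<string, unknown>;
}

// Parse JSONL text: one JSON object per non-empty line.
function parseTasks(jsonl: string): EvalTask[] {
  return jsonl
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line, index) => {
      const task = JSON.parse(line) as EvalTask;
      if (typeof task.query_id !== "string" || typeof task.query !== "string") {
        throw new Error(`Task ${index + 1} is missing query_id or query`);
      }
      return task;
    });
}

// Two tasks on two lines; the second one is made up for the example.
const sampleJsonl = [
  '{"query_id":"Amazon--0","dataset":"webvoyager","query":"Search an Xbox Wireless controller with green color and rated above 4 stars.","graders":["webvoyager_grader","fara_combined"],"start_url":"https://www.amazon.com/","metadata":{"original_task_id":"Amazon--0","website":"Amazon"}}',
  '{"query_id":"Example--1","dataset":"webvoyager","query":"Placeholder task.","graders":["webvoyager_grader"],"start_url":"https://example.com/","metadata":{}}',
].join("\n");

const parsedTasks = parseTasks(sampleJsonl);
console.log(parsedTasks.map((t) => t.query_id));
```

In the dataset file each task occupies a single line, unlike the pretty-printed sample above.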
Output
Results are saved to output_dir:
results/
  oe-clado-openai/
    Amazon--0/
      metadata.json     # Task result, timing, grader scores
      messages.jsonl    # Full message log
      screenshots/
        001.png         # Step-by-step screenshots
        002.png
    summary.json        # Aggregate pass rates
Troubleshooting
BrowserOS not found: Expects /Applications/BrowserOS.app/Contents/MacOS/BrowserOS. Make sure it's installed.
Port conflicts: Each worker uses base_port + workerIndex. 3 workers on base 9110 → ports 9110, 9111, 9112. Stop other BrowserOS instances first.
API key not resolving: If your config has "apiKey": "OPENAI_API_KEY", ensure the env var is set in .env.development.
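The lookup behind this can be pictured with the sketch below. It assumes a value is treated as an env var name when it consists only of uppercase letters, digits, and underscores. That rule and the resolveApiKey helper are hypothetical, and the framework may tell the two apiKey formats apart differently.

```typescript
// Assumption: values shaped like OPENAI_API_KEY are env var names.
const ENV_VAR_NAME = /^[A-Z][A-Z0-9_]*$/;

// Resolve an apiKey field against an environment map.
function resolveApiKey(value: string, env: Record<string, string | undefined>): string {
  if (!ENV_VAR_NAME.test(value)) {
    return value; // direct value (or empty string), used as-is
  }
  const resolved = env[value];
  if (!resolved) {
    throw new Error(`Environment variable ${value} is not set`);
  }
  return resolved;
}

// Placeholder environment for illustration.
const exampleEnv = { OPENAI_API_KEY: "sk-xxxxx" };

console.log(resolveApiKey("OPENAI_API_KEY", exampleEnv)); // looked up in the environment
console.log(resolveApiKey("sk-direct-value", exampleEnv)); // returned unchanged
```

Failing loudly when the variable is missing makes a misconfigured .env.development visible before any task starts.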
Tasks timing out: Increase timeout_ms. Default is 15 minutes; complex tasks may need 20+ minutes.
Headless vs headed: Set "headless": false to watch Chrome in real-time. Useful for debugging.