* **feat: add GitHub Copilot as OAuth-based LLM provider**

  Add GitHub Copilot as a second OAuth provider using the Device Code flow (RFC 8628). Users authenticate via github.com/login/device, and the server polls for token completion. Supports 25+ models through a single Copilot subscription. Key changes:
  - Device Code OAuth flow in token manager (poll with safety margin)
  - Custom fetch wrapper injecting Copilot headers + vision detection
  - Provider factory using createOpenAICompatible for Chat Completions API
  - Extension UI with template card, auto-create on auth, and disconnect

* **fix: address PR review comments for GitHub Copilot OAuth**
  - Validate device code response for error fields (GitHub can return 200 with error payload)
  - Store empty refreshToken instead of access token for GitHub tokens
  - Add closeButton to Toaster for dismissing device code toast

* **fix: add github-copilot to agent provider factory**

  The chat route uses a separate provider-factory.ts (agent layer) from the test-provider route (llm/provider.ts). Added createGitHubCopilotFactory to the agent factory so chat works with GitHub Copilot.

* **fix: add github-copilot to provider icons, models, and dialog**
  - Add Github icon from lucide-react to providerIcons map
  - Add 8 Copilot models (GPT-4o, Claude, Gemini, Grok) to models.ts
  - Add github-copilot to NewProviderDialog zod enum, validation skip, canTest check, and OAuth credential message

* **fix: reorder copilot models with free-tier models first**

  Put models available on Copilot Free at the top (gpt-4o, gpt-4.1, gpt-5-mini, claude-haiku-4.5, grok-code-fast-1), followed by premium models that require a paid Copilot subscription.

* **fix: set correct 64K context window for Copilot models**

  Copilot API enforces a 64K input token limit regardless of the underlying model's native context window. Updated all model entries and the default template to 64000 so compaction triggers correctly.

* **fix: use actual per-model prompt limits from Copilot /models API**

  Queried api.githubcopilot.com/models for real max_prompt_tokens values. GPT-4o/4.1 have 64K, Claude/gpt-5-mini have 128K, GPT-5.x have 272K. Also updated model list to match what's actually available on the API (e.g. claude-sonnet-4.6 instead of 4.5, added gpt-5.4/5.2-codex).

* **feat: resize images for Copilot using VS Code's algorithm**

  Large screenshots cause 413 errors on Copilot's API. Resize images following VS Code's approach: max 2048px longest side, 768px shortest side, re-encode as JPEG at 75% quality. Uses sharp for server-side image processing.

* **fix: address all Greptile P1 review comments**
  - Add .catch() on fire-and-forget pollDeviceCode to prevent unhandled rejection crashes (Node 15+)
  - Add deduplication guard (activeDeviceFlows Set) to prevent concurrent device code flows for the same provider
  - Add runtime validation of server response in frontend before calling window.open() and showing toast
  - Remove dead GITHUB_DEVICE_VERIFICATION constant from urls.ts

* **fix: upgrade biome to 2.4.8, fix all lint errors, and address review bugs**
  - Upgrade biome from 2.4.5 to 2.4.8 (matches CI) and migrate configs
  - Fix image resize: only re-encode when dimensions actually change
  - Fix device code polling: retry on transient network errors instead of aborting
  - Allow restarting device code flow (clear old flow instead of throwing 500)
  - Fix pre-existing noNonNullAssertion and noExplicitAny lint errors globally

* **fix: address Greptile P2 review — image resize and config guard**
  - Fix early-return guard: check max/min sides against their respective limits (MAX_LONG_SIDE/MAX_SHORT_SIDE) instead of both against SHORT
  - Preserve PNG alpha: detect hasAlpha and keep PNG format instead of unconditionally converting to lossy JPEG
  - Keep browserosId guard in resolveGitHubCopilotConfig consistent with ChatGPT Pro pattern (safety check that caller context is valid)

* **feat: update Copilot models to full list from pricing page, default to gpt-5-mini**

  Added all 23 models from GitHub Copilot pricing page. Ordered with free-tier models first (gpt-5-mini, claude-haiku-4.5), then premium. Changed default from gpt-4o to gpt-5-mini since it's unlimited on Pro plan and has 128K context (vs gpt-4o's 64K limit).
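The Device Code polling described in the first commit can be sketched as follows. This is an illustrative TypeScript sketch, not the actual BrowserOS token-manager code; the function and type names are assumptions:

```typescript
// Sketch of an RFC 8628 polling loop with a safety margin (hypothetical names).
type PollResult = { status: "pending" } | { status: "ok"; accessToken: string };

async function pollForToken(
  poll: () => Promise<PollResult>,
  intervalMs: number,
  expiresInMs: number,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<string> {
  // Stop a little before the device code expires (the "safety margin").
  const deadline = Date.now() + expiresInMs - 5_000;
  while (Date.now() < deadline) {
    const res = await poll();
    if (res.status === "ok") return res.accessToken;
    await sleep(intervalMs); // respect the server-specified polling interval
  }
  throw new Error("Device code expired before the user authorized");
}
```

Injecting `poll` and `sleep` keeps the loop testable without real network calls, which is also what makes the "retry on transient network errors" fix from a later commit easy to verify.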
# BrowserOS Eval
Evaluation framework for benchmarking BrowserOS browser automation agents. Runs tasks from standard datasets (WebVoyager, Mind2Web), captures trajectories with screenshots, and grades results automatically.
## Prerequisites

- BrowserOS binary installed at `/Applications/BrowserOS.app` (macOS)
- Bun runtime
- API keys for your chosen LLM provider and grader model
## Quick Start

### 1. Set up environment

```shell
cd apps/eval
```

Edit `.env.development` and add your API keys:

```shell
# Pick ONE provider for the orchestrator (whichever you have access to)
OPENAI_API_KEY=sk-xxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxx
FIREWORKS_API_KEY=fw_xxxxx
GOOGLE_API_KEY=AIza-xxxxx

# For grading results (OpenRouter recommended — gives access to many models)
OPENROUTER_API_KEY=sk-or-v1-xxxxx
```

### 2. Launch the dashboard

```shell
bun run eval
```

Opens the Eval Dashboard at http://localhost:9900 in config mode.
### 3. Configure and run
From the dashboard:
- Load a preset — select from the dropdown or click Load File to import a config JSON
- Edit settings — change agent type, provider, model, API keys, dataset, workers, timeouts
- Save Config — export your configuration for reuse
- Click Run — starts the evaluation with live progress
### Alternative: Run from CLI

```shell
bun run eval -c configs/orchestrator-executor-clado-test.json
```

Runs immediately. The dashboard is still available at http://localhost:9900 for live progress.
## Agent Types

### Orchestrator-Executor with Clado

The recommended architecture for visual model evals. Two tiers:
- Orchestrator — An LLM that plans and issues high-level instructions
- Executor — The Clado Action visual model that takes screenshots and predicts click/type/scroll coordinates
The orchestrator works with any LLM provider. Pick whichever you have access to:
#### OpenAI orchestrator

```json
{
  "agent": {
    "type": "orchestrator-executor",
    "orchestrator": {
      "provider": "openai",
      "model": "gpt-4o",
      "apiKey": "OPENAI_API_KEY"
    },
    "executor": {
      "provider": "clado-action",
      "model": "qwen3-vl-30b-a3b-instruct",
      "apiKey": "",
      "baseUrl": "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run"
    }
  },
  "dataset": "../data/webvoyager_e2e_test.jsonl",
  "output_dir": "../results/oe-clado-openai",
  "num_workers": 3,
  "browseros": {
    "server_url": "http://127.0.0.1:9110",
    "base_cdp_port": 9010,
    "base_server_port": 9110,
    "base_extension_port": 9310,
    "headless": true
  },
  "grader_api_key_env": "OPENROUTER_API_KEY",
  "grader_base_url": "https://openrouter.ai/api/v1",
  "grader_model": "openai/gpt-4.1",
  "timeout_ms": 1200000
}
```
#### Anthropic orchestrator

```json
"orchestrator": {
  "provider": "anthropic",
  "model": "claude-sonnet-4-20250514",
  "apiKey": "ANTHROPIC_API_KEY"
}
```

#### Google orchestrator

```json
"orchestrator": {
  "provider": "google",
  "model": "gemini-2.0-flash",
  "apiKey": "GOOGLE_API_KEY"
}
```

#### Fireworks orchestrator (OpenAI-compatible)

```json
"orchestrator": {
  "provider": "openai-compatible",
  "model": "accounts/fireworks/models/kimi-k2p5",
  "apiKey": "FIREWORKS_API_KEY",
  "baseUrl": "https://api.fireworks.ai/inference/v1"
}
```
The executor config stays the same across all orchestrator providers — it always uses the Clado action model.
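For example, combining the Anthropic orchestrator with the shared executor yields an `agent` block like the following. This sketch is assembled from the fragments above; it is not a separately shipped preset:

```json
"agent": {
  "type": "orchestrator-executor",
  "orchestrator": {
    "provider": "anthropic",
    "model": "claude-sonnet-4-20250514",
    "apiKey": "ANTHROPIC_API_KEY"
  },
  "executor": {
    "provider": "clado-action",
    "model": "qwen3-vl-30b-a3b-instruct",
    "apiKey": "",
    "baseUrl": "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run"
  }
}
```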
### Other Agent Types

| Type | Description | Example config |
|---|---|---|
| `single` | Single LLM agent via Gemini CLI + MCP | `webvoyager-test.json` |
| `tool-loop` | AI SDK tool loop, connects via CDP | `tool-loop-test.json` |
| `gemini-computer-use` | Google native computer use API | `gemini-computer-use.json` |
| `yutori-navigator` | Yutori N1 visual model | `yutori-navigator.json` |
## Configuration Reference

### API keys

The `apiKey` field supports two formats:

- **Env var name**: `"OPENAI_API_KEY"` — resolved from `.env.development` at runtime
- **Direct value**: `"sk-xxxxx"` — used as-is (not recommended; prefer env vars)
### Supported providers

| Provider | `provider` value | Requires `baseUrl` |
|---|---|---|
| OpenAI | `openai` | No |
| Anthropic | `anthropic` | No |
| Google | `google` | No |
| Azure OpenAI | `azure` | Yes |
| AWS Bedrock | `bedrock` | No (uses `region`, `accessKeyId`, `secretAccessKey`) |
| OpenRouter | `openrouter` | No |
| Fireworks, Together, etc. | `openai-compatible` | Yes |
| Ollama | `ollama` | No |
| Clado Action (executor only) | `clado-action` | Yes |
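Bedrock is the one provider that takes credential fields instead of `baseUrl`. A hypothetical orchestrator block might look like the following; the field names come from the table above, but the model id, region, and credential values are placeholder assumptions:

```json
"orchestrator": {
  "provider": "bedrock",
  "model": "anthropic.claude-sonnet-4-20250514-v1:0",
  "region": "us-east-1",
  "accessKeyId": "AKIA-xxxxx",
  "secretAccessKey": "xxxxx"
}
```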
### BrowserOS infrastructure

```json
"browseros": {
  "server_url": "http://127.0.0.1:9110",
  "base_cdp_port": 9010,
  "base_server_port": 9110,
  "base_extension_port": 9310,
  "load_extensions": false,
  "headless": true
}
```
Each worker gets its own Chrome instance. Worker N uses base_port + N for CDP, server, and extension ports.
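The port scheme can be sketched as a small helper. This is an illustrative assumption based on the description above, not the framework's own allocator:

```typescript
// Hypothetical sketch: worker N offsets each base port by N, so parallel
// Chrome instances never collide on CDP, server, or extension ports.
interface WorkerPorts { cdp: number; server: number; extension: number }

function portsForWorker(
  n: number,
  base = { cdp: 9010, server: 9110, extension: 9310 },
): WorkerPorts {
  return {
    cdp: base.cdp + n,
    server: base.server + n,
    extension: base.extension + n,
  };
}
```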
### Execution settings

| Field | Description | Default |
|---|---|---|
| `num_workers` | Parallel workers (each gets its own Chrome) | `1` |
| `timeout_ms` | Per-task timeout in ms | `900000` (15 min) |
| `restart_server_per_task` | Restart Chrome between tasks (cleaner state, slower) | `false` |
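These fields sit at the top level of the config, as in the full example earlier. A sketch of a run tuned for long, flaky tasks (values chosen for illustration):

```json
{
  "num_workers": 3,
  "timeout_ms": 1200000,
  "restart_server_per_task": true
}
```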
### Grading

Results are auto-graded after each task. The grader uses an LLM judge.

| Field | Description |
|---|---|
| `grader_model` | Model for grading (e.g., `openai/gpt-4.1`) |
| `grader_api_key_env` | Env var name for grader API key |
| `grader_base_url` | API endpoint (e.g., `https://openrouter.ai/api/v1`) |
## Datasets

| File | Tasks | Description |
|---|---|---|
| `webvoyager_e2e_test.jsonl` | 10 | WebVoyager test subset (quick smoke test) |
| `webvoyager.jsonl` | 643 | Full WebVoyager benchmark |
| `mind2web_e2e_test.jsonl` | 10 | Mind2Web test subset |
| `mind2web.jsonl` | 300 | Full Mind2Web benchmark |
Task format (JSONL, one object per line):

```json
{
  "query_id": "Amazon--0",
  "dataset": "webvoyager",
  "query": "Search an Xbox Wireless controller with green color and rated above 4 stars.",
  "graders": ["webvoyager_grader", "fara_combined"],
  "start_url": "https://www.amazon.com/",
  "metadata": { "original_task_id": "Amazon--0", "website": "Amazon" }
}
```
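Loading a task file in this format is a one-liner per line of JSONL. A minimal sketch; the interface mirrors the documented fields but is not the framework's own type:

```typescript
// Hypothetical task type, mirroring the JSONL fields documented above.
interface EvalTask {
  query_id: string;
  dataset: string;
  query: string;
  graders: string[];
  start_url: string;
  metadata?: Record<string, unknown>;
}

// Parse a JSONL string: one JSON object per non-empty line.
function parseTasks(jsonl: string): EvalTask[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim().length > 0) // skip blank lines
    .map((line) => JSON.parse(line) as EvalTask);
}
```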
## Output

Results are saved to `output_dir`:

```
results/
  oe-clado-openai/
    Amazon--0/
      metadata.json      # Task result, timing, grader scores
      messages.jsonl     # Full message log
      screenshots/
        001.png          # Step-by-step screenshots
        002.png
    summary.json         # Aggregate pass rates
```
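The aggregate pass rate in `summary.json` is, in essence, passed tasks over total tasks. A sketch of that aggregation; the per-task result shape here is an assumption, not the actual `metadata.json` schema:

```typescript
// Hypothetical per-task verdict, reduced from metadata.json grader scores.
interface TaskResult { query_id: string; passed: boolean }

// Fraction of tasks that passed; 0 for an empty run.
function passRate(results: TaskResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.passed).length / results.length;
}
```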
## Troubleshooting

- **BrowserOS not found**: the runner expects `/Applications/BrowserOS.app/Contents/MacOS/BrowserOS`. Make sure it's installed.
- **Port conflicts**: each worker uses `base_port + workerIndex`; 3 workers on base 9110 use ports 9110, 9111, and 9112. Stop other BrowserOS instances first.
- **API key not resolving**: if your config has `"apiKey": "OPENAI_API_KEY"`, ensure that env var is set in `.env.development`.
- **Tasks timing out**: increase `timeout_ms`. The default is 15 minutes; complex tasks may need 20+ minutes.
- **Headless vs headed**: set `"headless": false` to watch Chrome in real time. Useful for debugging.