BrowserOS Eval

Evaluation framework for benchmarking BrowserOS browser automation agents. Runs tasks from standard datasets (WebVoyager, Mind2Web), captures trajectories with screenshots, and grades results automatically.

Prerequisites

  • BrowserOS binary installed at /Applications/BrowserOS.app (macOS)
  • Bun runtime
  • API keys for your chosen LLM provider and grader model

Quick Start

1. Set up environment

cd apps/eval

Edit .env.development and add your API keys:

# Pick ONE provider for the orchestrator (whichever you have access to)
OPENAI_API_KEY=sk-xxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxx
FIREWORKS_API_KEY=fw_xxxxx
GOOGLE_API_KEY=AIza-xxxxx

# For grading results (OpenRouter recommended — gives access to many models)
OPENROUTER_API_KEY=sk-or-v1-xxxxx

2. Launch the dashboard

bun run eval

Opens the Eval Dashboard at http://localhost:9900 in config mode.

3. Configure and run

From the dashboard:

  1. Load a preset — select from the dropdown or click Load File to import a config JSON
  2. Edit settings — change agent type, provider, model, API keys, dataset, workers, timeouts
  3. Save Config — export your configuration for reuse
  4. Click Run — starts the evaluation with live progress

Alternative: Run from CLI

bun run eval -c configs/orchestrator-executor-clado-test.json

Runs immediately. Dashboard still available at http://localhost:9900 for live progress.

Agent Types

Orchestrator-Executor with Clado

The recommended architecture for visual model evals. Two tiers:

  • Orchestrator — An LLM that plans and issues high-level instructions
  • Executor — The Clado Action visual model that takes screenshots and predicts click/type/scroll coordinates

The orchestrator works with any LLM provider. Pick whichever you have access to:

OpenAI orchestrator

{
  "agent": {
    "type": "orchestrator-executor",
    "orchestrator": {
      "provider": "openai",
      "model": "gpt-4o",
      "apiKey": "OPENAI_API_KEY"
    },
    "executor": {
      "provider": "clado-action",
      "model": "qwen3-vl-30b-a3b-instruct",
      "apiKey": "",
      "baseUrl": "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run"
    }
  },
  "dataset": "../data/webvoyager_e2e_test.jsonl",
  "output_dir": "../results/oe-clado-openai",
  "num_workers": 3,
  "browseros": {
    "server_url": "http://127.0.0.1:9110",
    "base_cdp_port": 9010,
    "base_server_port": 9110,
    "base_extension_port": 9310,
    "headless": true
  },
  "grader_api_key_env": "OPENROUTER_API_KEY",
  "grader_base_url": "https://openrouter.ai/api/v1",
  "grader_model": "openai/gpt-4.1",
  "timeout_ms": 1200000
}

Anthropic orchestrator

"orchestrator": {
  "provider": "anthropic",
  "model": "claude-sonnet-4-20250514",
  "apiKey": "ANTHROPIC_API_KEY"
}

Google orchestrator

"orchestrator": {
  "provider": "google",
  "model": "gemini-2.0-flash",
  "apiKey": "GOOGLE_API_KEY"
}

Fireworks orchestrator (OpenAI-compatible)

"orchestrator": {
  "provider": "openai-compatible",
  "model": "accounts/fireworks/models/kimi-k2p5",
  "apiKey": "FIREWORKS_API_KEY",
  "baseUrl": "https://api.fireworks.ai/inference/v1"
}

The executor config stays the same across all orchestrator providers — it always uses the Clado action model.

Other Agent Types

Type                  Description                               Example config
single                Single LLM agent via Gemini CLI + MCP     webvoyager-test.json
tool-loop             AI SDK tool loop, connects via CDP        tool-loop-test.json
gemini-computer-use   Google native computer use API            gemini-computer-use.json
yutori-navigator      Yutori N1 visual model                    yutori-navigator.json

Configuration Reference

API keys

The apiKey field supports two formats:

  • Env var name: "OPENAI_API_KEY" — resolved from .env.development at runtime
  • Direct value: "sk-xxxxx" — used as-is (not recommended, prefer env vars)
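
For example, these two orchestrator blocks behave identically as long as OPENAI_API_KEY is set in .env.development (values are illustrative; prefer the first form):

"orchestrator": {
  "provider": "openai",
  "model": "gpt-4o",
  "apiKey": "OPENAI_API_KEY"
}

"orchestrator": {
  "provider": "openai",
  "model": "gpt-4o",
  "apiKey": "sk-xxxxx"
}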

Supported providers

Provider                       provider value      Requires baseUrl
OpenAI                         openai              No
Anthropic                      anthropic           No
Google                         google              No
Azure OpenAI                   azure               Yes
AWS Bedrock                    bedrock             No (uses region, accessKeyId, secretAccessKey)
OpenRouter                     openrouter          No
Fireworks, Together, etc.      openai-compatible   Yes
Ollama                         ollama              No
Clado Action (executor only)   clado-action        Yes
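
Providers not shown in the examples above follow the same block shape. As a sketch, an OpenRouter orchestrator would look like this (model name is illustrative):

"orchestrator": {
  "provider": "openrouter",
  "model": "openai/gpt-4.1",
  "apiKey": "OPENROUTER_API_KEY"
}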

BrowserOS infrastructure

"browseros": {
  "server_url": "http://127.0.0.1:9110",
  "base_cdp_port": 9010,
  "base_server_port": 9110,
  "base_extension_port": 9310,
  "load_extensions": false,
  "headless": true
}

Each worker gets its own Chrome instance. Worker N uses base_port + N for CDP, server, and extension ports.
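
For example, with the defaults above and num_workers set to 3:

Worker 0 → CDP 9010, server 9110, extension 9310
Worker 1 → CDP 9011, server 9111, extension 9311
Worker 2 → CDP 9012, server 9112, extension 9312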

Execution settings

Field                     Description                                             Default
num_workers               Parallel workers (each gets its own Chrome)             1
timeout_ms                Per-task timeout in ms                                  900000 (15 min)
restart_server_per_task   Restart Chrome between tasks (cleaner state, slower)    false
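
For example, a config that runs three parallel workers, allows 30 minutes per task, and restarts Chrome between tasks would set (values illustrative):

"num_workers": 3,
"timeout_ms": 1800000,
"restart_server_per_task": true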

Grading

Results are auto-graded after each task. The grader uses an LLM judge.

Field                Description
grader_model         Model for grading (e.g., openai/gpt-4.1)
grader_api_key_env   Env var name for the grader API key
grader_base_url      API endpoint (e.g., https://openrouter.ai/api/v1)
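
Put together, the grader settings from the full example config above look like this:

"grader_api_key_env": "OPENROUTER_API_KEY",
"grader_base_url": "https://openrouter.ai/api/v1",
"grader_model": "openai/gpt-4.1"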

Datasets

File                        Tasks   Description
webvoyager_e2e_test.jsonl   10      WebVoyager test subset (quick smoke test)
webvoyager.jsonl            643     Full WebVoyager benchmark
mind2web_e2e_test.jsonl     10      Mind2Web test subset
mind2web.jsonl              300     Full Mind2Web benchmark

Task format (JSONL, one per line):

{
  "query_id": "Amazon--0",
  "dataset": "webvoyager",
  "query": "Search an Xbox Wireless controller with green color and rated above 4 stars.",
  "graders": ["webvoyager_grader", "fara_combined"],
  "start_url": "https://www.amazon.com/",
  "metadata": { "original_task_id": "Amazon--0", "website": "Amazon" }
}

Output

Results are saved to output_dir:

results/
  oe-clado-openai/
    Amazon--0/
      metadata.json        # Task result, timing, grader scores
      messages.jsonl       # Full message log
      screenshots/
        001.png            # Step-by-step screenshots
        002.png
    summary.json           # Aggregate pass rates

Troubleshooting

BrowserOS not found: The eval runner expects the binary at /Applications/BrowserOS.app/Contents/MacOS/BrowserOS. Make sure BrowserOS is installed at that path.

Port conflicts: Each worker uses base_port + workerIndex. 3 workers on base 9110 → ports 9110, 9111, 9112. Stop other BrowserOS instances first.

API key not resolving: If your config has "apiKey": "OPENAI_API_KEY", ensure the env var is set in .env.development.

Tasks timing out: Increase timeout_ms. Default is 15 minutes; complex tasks may need 20+ minutes.

Headless vs headed: Set "headless": false to watch Chrome in real-time. Useful for debugging.
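
For example, to watch a run in a visible browser window, flip the flag in the browseros block (other values as in the example above):

"browseros": {
  "server_url": "http://127.0.0.1:9110",
  "base_cdp_port": 9010,
  "base_server_port": 9110,
  "base_extension_port": 9310,
  "load_extensions": false,
  "headless": false
}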