BrowserOS

mirror of https://github.com/browseros-ai/BrowserOS.git synced 2026-05-13 15:46:22 +00:00

Author	SHA1	Message	Date
shivammittal274	4a3b9ff294	feat: deterministic eval graders (AGI SDK + WebArena-Infinity) (#664 ) * feat: add deterministic eval graders (AGI SDK + WebArena-Infinity) Two new benchmark integrations with programmatic grading — no LLM judge. AGI SDK / REAL Bench (52 tasks): - 11 React/Next.js clones of consumer apps (DoorDash, Amazon, Gmail, etc.) - Grader navigates browser to /finish, extracts state diff from <pre> tag - Python verifier checks exact values via jmespath queries WebArena-Infinity (50 hard tasks): - 13 LLM-generated SaaS clones (Gmail, GitLab, Linear, Figma, etc.) - InfinityAppManager starts fresh app server per task per worker - Python verifier calls /api/state and asserts on JSON state Infrastructure: - GraderInput extended with mcpUrl + infinityAppUrl for parallel workers - Each worker gets isolated ports (no cross-worker state contamination) - CI workflow: pip install agisdk, clone webarena-infinity repo * chore: switch eval configs back to kimi-k2p5 * fix: register deterministic graders in pass rate calculation Add agisdk_state_diff and infinity_state to PASS_FAIL_GRADER_ORDER in both runner types and weekly report script, so scores show correctly in the dashboard. * chore: temp switch to opus 4.6 for eval run * chore: restore kimi-k2p5 as default eval config * ci: add timeout and continue-on-error for trend report step	2026-04-23 13:11:55 +05:30
Nikhil	9bdb2413ec	feat: clean-up - remove obsolete controller extension (#610 ) * refactor(server): remove obsolete controller extension backend * fix: address review feedback for PR #610	2026-03-27 17:01:04 -07:00
Dani Akash	b3003542d8	docs: overhaul READMEs across all major packages (#594 ) * docs: overhaul READMEs across all major packages - Root README: restructure with feature table, LLM provider table, comparison matrix, architecture map, and docs link - New: packages/browseros/README.md (Chromium fork build system) - New: apps/server/README.md (MCP server + agent loop) - New: packages/cdp-protocol/README.md (CDP type bindings) - Polish: agent-sdk (badges, prerequisites, multi-step example, links) - Polish: cli (badges, install section, MCP server section, links) - Polish: agent extension (badges, WXT mention, architecture context) - Polish: eval (badges, paper links) * fix: address review — consistent tool count and correct default port - CLI README: "54 MCP tools" → "53+ MCP tools" to match root and server docs - Agent SDK README: localhost:3000 → localhost:9100 to match documented default * docs: add detailed comparison links to How We Compare section * docs: update comparison table with verified competitor data Research all 5 competitors via official websites and docs: - Chrome: no AI agent, Gemini Nano only, MV3 weakening ad blocking - Brave: BYOM feature, local models via BYOM, Shields ad blocking, MV2+MV3 - Dia: Skills-based AI, no BYOK, cloud AI, acquired by Atlassian - Comet: full cloud-based agent, built-in ad blocking, extensions on desktop - Atlas: standalone Chromium browser with Agent Mode, 30-day cloud memory Renamed Arc/Dia column to just Dia (Arc is sunset). * docs: simplify comparison table with clean checkmarks and key differentiators * docs: update browseros-agent README — remove submodule note, add missing packages	2026-03-27 11:59:04 +05:30
shivammittal274	dde35ccbd5	feat: integrate models.dev for dynamic LLM provider/model data (#547 ) * feat: integrate models.dev for dynamic LLM provider/model data (#TKT-657) Replace hardcoded model lists with data sourced from models.dev so new providers and models appear automatically when the community adds them. - Add build script (scripts/generate-models.ts) that fetches models.dev/api.json and outputs a compact JSON with 10 providers and 520 models - Replace hardcoded MODELS_DATA (50 models) with dynamic models.dev lookups - Add searchable model combobox (Popover + Command) replacing plain Select dropdown - Enrich provider templates with models.dev metadata (context window, image support) - Keep chatgpt-pro, qwen-code, browseros, openai-compatible as hardcoded providers * fix: address review — remove ollama-cloud mapping, fix default models, remove dead code - Remove ollama from PROVIDER_MAP (ollama-cloud has cloud models, not local) - Add ollama to CUSTOM_PROVIDER_MODELS with empty list (users type custom IDs) - Update defaultModelIds to ones that exist in models.dev data: openrouter → anthropic/claude-sonnet-4.5 lmstudio → openai/gpt-oss-20b bedrock → anthropic.claude-sonnet-4-6 - Remove dead isCustomModel export - Regenerate models-dev-data.json (9 providers, 486 models) * fix: model suggestion list focus/dismiss behavior - List only opens when input is focused or user types - Clicking a model selects it and closes the list - Clicking outside (blur) dismisses the list - onMouseDown preventDefault on list items prevents blur race condition * refactor: extract ModelPickerList component with proper open/close UX - Collapsed state: Select-like trigger showing selected model + chevron - Expanded state: search input + scrollable filtered list, inline - Click outside or Escape to close, Enter to submit custom model - Extracted as separate component (reduces dialog nesting, testable) - No more setTimeout hacks for blur handling * chore: remove plan doc from repo	2026-03-25 02:41:07 +05:30
shivammittal274	65547c60c0	fix(eval): clean up eval configs and add test-clado-api script (#540 ) Consolidate 13 configs down to 7 with uniform settings: - 3 weekly (CI): browseros-agent, browseros-oe-agent, browseros-oe-clado - 4 test (local): test_gemini-computer-use, test_yutori-navigator, test_webvoyager, test_mind2web - All configs: headless=false, captcha block, full browseros ports, restart_server_per_task Deleted: debug-test, mind2web-test, tool-loop-test, orchestrator-executor-test, orchestrator-executor-clado-test, fireworks-minimax-m2, webvoyager-test Added: test-clado-api.ts script, browseros-oe-agent-weekly.json (OE with AI SDK executor)	2026-03-24 01:28:05 +05:30
shivammittal274	0babc05077	feat(eval): NopeCHA CAPTCHA solver integration (#537 ) * feat(eval): show mean score instead of pass/fail in report and viewer * feat(eval): integrate NopeCHA CAPTCHA solver into eval pipeline Add CAPTCHA detection and waiting so screenshots capture post-solve state. Run headed with xvfb on CI since headless breaks extension content scripts. - Add CaptchaWaiter module (detect reCAPTCHA/hCaptcha/Turnstile, poll until solved) - Add optional `captcha` config block to EvalConfigSchema - Wait for CAPTCHA solve before screenshot in single-agent and orchestrator-executor - Patch NopeCHA manifest with API key before launching workers - Fix CAPTCHA_EXT_DIR path (was pointing one level too high) - Remove --incognito (extensions don't run in incognito; fresh user-data-dir isolates) - CI: install xvfb, run headed via xvfb-run, pass NOPECHA_API_KEY secret	2026-03-24 00:14:16 +05:30
shivammittal274	f14942c6f9	feat(eval): show mean score instead of pass/fail in report and viewer (#534 )	2026-03-23 20:28:34 +05:30
Dani Akash	4928b7e84b	fix: no current window and sentry context (#531 ) * fix: error reporting and better breadcrumbs * fix: lint issues	2026-03-23 18:46:39 +05:30
shivammittal274	94a1a701f6	fix(eval): include browser context in agent prompt (#530 ) The eval's single-agent was passing raw task.query as the prompt, without browser context (active tab URL, title). The agent didn't know which page it was on, causing it to ask "which website?" instead of browsing. Use formatUserMessage() (same as chat-service.ts) to include browser context in the prompt. Re-export formatUserMessage from agent/tool-loop.	2026-03-23 17:42:03 +05:30
shivammittal274	70be5c5c21	fix(eval): log agent errors in task progress for CI visibility (#523 )	2026-03-21 23:33:19 +05:30
shivammittal274	f157436e7d	feat(eval): switch to Linux GitHub-hosted runner (#519 ) * feat(eval): switch to ubuntu-latest runner, add OE-Clado config - Switch workflow from self-hosted Mac Studio to ubuntu-latest - Install BrowserOS Linux .deb in CI (no self-hosted runner needed) - Add browseros-oe-clado-weekly.json config for orchestrator-executor - Fix report chart to show date+time (not just date) - Make BROWSEROS_BINARY configurable via env var * feat(eval): add NopeCHA captcha solver extension to eval runs - Auto-load NopeCHA extension in eval Chrome instances - Works in incognito + headless mode - CI workflow downloads NopeCHA before eval - extensions/ directory gitignored (downloaded at runtime) * feat(eval): per-config concurrency — different configs run in parallel * feat(eval): remove concurrency limit — all runs execute in parallel	2026-03-21 23:04:45 +05:30
shivammittal274	4e90b4561a	feat(eval): weekly eval pipeline with R2 uploads and trend dashboard (#516 ) * feat(eval): weekly eval pipeline with R2 uploads and trend dashboard Add infrastructure for running weekly evaluations and tracking score trends over time: - Auto-generated output dirs: results/{config-name}/{timestamp}/ Each eval run gets its own timestamped folder, nothing is overwritten. - upload-run.ts: uploads eval results to Cloudflare R2. Supports uploading a specific run or all un-uploaded runs for a config. - weekly-report.ts: generates an interactive HTML dashboard from R2 data. Config dropdown, trend chart with hover tooltips, searchable runs table. Groups runs by config name. - viewer.html: client-facing 3-column run viewer (task list, screenshots with autoplay, agent stream with messages.jsonl). Shows performance grader axis breakdown with per-axis scores. - browseros-agent-weekly.json: weekly benchmark config (kimi-k2p5, webbench-2of4-50, 10 workers, performance grader, headless). - eval-weekly.yml: GitHub Actions workflow with cron (Saturday 6am) and manual trigger. Runs on self-hosted Mac Studio runner. Concurrency group ensures only one eval runs at a time. - Dashboard updates: load previous runs, messages.jsonl viewer, grade badges show percentages, async stream loading. - Grader updates: timeout 30min, max turns 100, DOM content verification guidance for performance grader. * fix(eval): address Greptile review — injection, nested dirs, escaping - Fix script injection in eval-weekly.yml: pass github.event.inputs through env var instead of interpolating into shell - Fix /api/runs to enumerate nested results/{config}/{timestamp}/ dirs - Fix /api/load-run to allow single-slash run names (config/timestamp) - Add HTML escaping for R2-sourced values in weekly-report.ts - Escape axis names in viewer.html renderAxesBreakdown * fix(eval): fix biome lint — non-null assertion, template literals * fix(eval): fix biome errors — replace var with let, fix inner function declaration * fix(eval): address Greptile P2 issues - isRunDir: check all subdirs for metadata.json, not just first 3 - eval-runner: guard configPath for dashboard-driven runs (fallback to 'eval') - load-run: default unknown termination_reason to 'failed' not 'completed' * feat(eval): make BROWSEROS_BINARY configurable via env var	2026-03-21 22:12:52 +05:30
shivammittal274	720baaed3e	feat: add GitHub Copilot as OAuth LLM provider (#500 ) * feat: add GitHub Copilot as OAuth-based LLM provider Add GitHub Copilot as a second OAuth provider using the Device Code flow (RFC 8628). Users authenticate via github.com/login/device, and the server polls for token completion. Supports 25+ models through a single Copilot subscription. Key changes: - Device Code OAuth flow in token manager (poll with safety margin) - Custom fetch wrapper injecting Copilot headers + vision detection - Provider factory using createOpenAICompatible for Chat Completions API - Extension UI with template card, auto-create on auth, and disconnect * fix: address PR review comments for GitHub Copilot OAuth - Validate device code response for error fields (GitHub can return 200 with error payload) - Store empty refreshToken instead of access token for GitHub tokens - Add closeButton to Toaster for dismissing device code toast * fix: add github-copilot to agent provider factory The chat route uses a separate provider-factory.ts (agent layer) from the test-provider route (llm/provider.ts). Added createGitHubCopilotFactory to the agent factory so chat works with GitHub Copilot. * fix: add github-copilot to provider icons, models, and dialog - Add Github icon from lucide-react to providerIcons map - Add 8 Copilot models (GPT-4o, Claude, Gemini, Grok) to models.ts - Add github-copilot to NewProviderDialog zod enum, validation skip, canTest check, and OAuth credential message * fix: reorder copilot models with free-tier models first Put models available on Copilot Free at the top (gpt-4o, gpt-4.1, gpt-5-mini, claude-haiku-4.5, grok-code-fast-1), followed by premium models that require paid Copilot subscription. * fix: set correct 64K context window for Copilot models Copilot API enforces a 64K input token limit regardless of the underlying model's native context window. Updated all model entries and the default template to 64000 so compaction triggers correctly. * fix: use actual per-model prompt limits from Copilot /models API Queried api.githubcopilot.com/models for real max_prompt_tokens values. GPT-4o/4.1 have 64K, Claude/gpt-5-mini have 128K, GPT-5.x have 272K. Also updated model list to match what's actually available on the API (e.g. claude-sonnet-4.6 instead of 4.5, added gpt-5.4/5.2-codex). * feat: resize images for Copilot using VS Code's algorithm Large screenshots cause 413 errors on Copilot's API. Resize images following VS Code's approach: max 2048px longest side, 768px shortest side, re-encode as JPEG at 75% quality. Uses sharp for server-side image processing. * fix: address all Greptile P1 review comments - Add .catch() on fire-and-forget pollDeviceCode to prevent unhandled rejection crashes (Node 15+) - Add deduplication guard (activeDeviceFlows Set) to prevent concurrent device code flows for the same provider - Add runtime validation of server response in frontend before calling window.open() and showing toast - Remove dead GITHUB_DEVICE_VERIFICATION constant from urls.ts * fix: upgrade biome to 2.4.8, fix all lint errors, and address review bugs - Upgrade biome from 2.4.5 to 2.4.8 (matches CI) and migrate configs - Fix image resize: only re-encode when dimensions actually change - Fix device code polling: retry on transient network errors instead of aborting - Allow restarting device code flow (clear old flow instead of throwing 500) - Fix pre-existing noNonNullAssertion and noExplicitAny lint errors globally * fix: address Greptile P2 review — image resize and config guard - Fix early-return guard: check max/min sides against their respective limits (MAX_LONG_SIDE/MAX_SHORT_SIDE) instead of both against SHORT - Preserve PNG alpha: detect hasAlpha and keep PNG format instead of unconditionally converting to lossy JPEG - Keep browserosId guard in resolveGitHubCopilotConfig consistent with ChatGPT Pro pattern (safety check that caller context is valid) * feat: update Copilot models to full list from pricing page, default to gpt-5-mini Added all 23 models from GitHub Copilot pricing page. Ordered with free-tier models first (gpt-5-mini, claude-haiku-4.5), then premium. Changed default from gpt-4o to gpt-5-mini since it's unlimited on Pro plan and has 128K context (vs gpt-4o's 64K limit).	2026-03-20 02:33:09 +05:30
Dani Akash	d965698905	fix: biome & tsc setup across repo (#493 ) * fix: biome lint issues * fix: code quality workflow * fix: all lint issues * chore: test lefthook pre-commit hook * chore: test lefthook with agent file * chore: revert test comment from lefthook verification * feat: setup tsgo for typechecking agent * fix: typecheck cli command * fix: early return to prevent errors	2026-03-19 18:18:24 +05:30
shivammittal274	515ad44826	fix: resolve biome v2 config and lint errors (#471 ) Migrate `files.ignore` to `files.includes` for Biome v2 compatibility, fix forEach callback return value, unused variable, import ordering, and formatting violations.	2026-03-17 19:14:01 +05:30
shivammittal274	29056226bb	feat: add eval framework and coordinate-based input tools (#453 ) - Add hover_at, type_at, drag_at coordinate tools to server - Add hoverAt, typeAt, dragAt methods to Browser class - Export server internals (browser, tool-loop, registry) for eval imports - Copy eval app from enterprise repo with agents, graders, runner, dashboard - Nest eval-targets inside apps/eval - Adapt sessionExecutionDir → workingDir for current server API - Add biome ignore for dashboard HTML to prevent lint breaking onclick handlers	2026-03-16 23:12:23 +05:30

16 Commits