BrowserOS

mirror of https://github.com/browseros-ai/BrowserOS.git synced 2026-05-21 04:45:12 +00:00

Author	SHA1	Message	Date
shivammittal274	7ee8dedd53	chore(eval): drop the 60-char truncation on grader expected/actual values Some criteria check long strings (job descriptions, post bodies, etc.) — truncating to 60 chars hides exactly the bytes you need to diff. The viewer's reasoning area already has max-height + scroll + word-break so long content scrolls; nothing renders worse for being full-length.	2026-04-30 02:08:30 +05:30
shivammittal274	a3b5ef4da3	chore(eval): show every criterion in agisdk grader message, not just failures Listing only failures hid the bigger picture — when 1 of 4 criteria fails you still want to know which 3 passed and what was checked. Now the message is the full checklist, ✓/✗ per criterion, with expected vs actual on the failing lines. Examples: All 4 criteria passed. ✓ correct job title ✓ includes Java skill ✓ includes Spring Boot skill ✓ includes Angular skill 2 of 4 criteria failed: ✓ correct job title (softened) ✓ includes Java skill ✗ includes Spring Boot skill: expected True, got False ✗ includes Angular skill: expected True, got False	2026-04-30 02:08:07 +05:30
shivammittal274	3333728e4e	fix(eval): surface per-criterion descriptions in agisdk grader output The viewer's grader-reasoning pill was showing "Task not completed successfully." for every agisdk_state_diff failure. The rich data was actually available — agisdk's TaskConfig exposes a 'description' (e.g. "includes Spring Boot skill") and the JMESPath 'query' for each criterion, zip-aligned 1:1 with info['results'] — we just weren't extracting it. Now agisdk-evaluate.py emits per-criterion entries with description, query, expected_value, actual_value, and builds the message as a useful multi-line summary: 2 of 4 criteria failed: • includes Spring Boot skill: expected True, got False • includes Angular skill: expected True, got False The viewer's grader-reasoning area already has white-space: pre-wrap so the multi-line message renders correctly. The structured per_criterion fields are also stored under details.per_criterion in metadata.json for anyone who wants to grep R2 artifacts directly.	2026-04-30 02:06:51 +05:30
shivammittal274	72cbffe2bb	chore(eval): refresh test-clado-api script for new Clado contract Updated the local smoke-test to match the new Clado endpoint and response contract: - New action + health URLs (000159-merged checkpoint). - Drop the grounding-model branch (orchestrator-executor doesn't use it; the README David shared only documents the action model). - Health-check waits up to 6 minutes for cold start with a 30s warning so the operator knows it's spinning up. - Print every documented response field (action, x/y, text, key, direction, amount, drag start/end, time, final_answer, thinking, parse_error, inference_time_seconds). - Three-step run that exercises a click, a typing continuation with formatted history, and an end+final_answer probe.	2026-04-30 00:37:44 +05:30
shivammittal274	231bd6821d	fix(eval): pin agisdk version + exclude 4 invalid tasks (Phase 2 dataset hygiene) (#844) * chore(eval): pin agisdk version to prevent silent dataset drift `pip install agisdk` previously fetched whatever version pip resolved at CI time. If agisdk publishes a new version with changed task definitions or grader behavior, the weekly eval silently shifts under our feet — making "did the score move because of code or data?" unanswerable. Pin to agisdk==0.3.5 (the version we currently develop against). Bump intentionally with a documented re-baseline run. * fix(eval): exclude 4 more tasks identified by 8-trial never-passing audit After 8 trials across K2.5 + Opus 4.6 (Phase 1 and Phase 2), 5 tasks never passed. Per-task root-cause investigation via parallel deep-dive subagents flagged 4 of them as fundamentally unfixable in the eval pipeline as it stands; the 5th (`dashdish-5`) is a prompt-rule fix that stays in. - gocalendar-7: goal/grader contradiction. Goal says "move event to July 19, 10 AM"; grader expects `eventsDiff.updated.*.start == "2024-07-18T17:00Z"` (= July 18, 10 AM PDT — same day, 1 hour shift). Even after the Phase 2 HTML5 dnd dispatch fix correctly populates `eventsDiff.updated`, the values are July 19 (matching the goal), which the grader rejects. - staynb-5: grader hardcodes literal `'Oct 13 2025'` and `'Oct 23 2025'` year strings. The staynb date picker interprets bare "Oct 13" as the most-recent-past instance (currently 2024 since today is 2026), not 2025. No agent can produce a persisted date string containing 2025. - staynb-9: under-specified task. Goal says "maximum number of guests supported"; grader requires the very specific string "32 Guests, 16 Infants" — encoding UI knowledge (Adults+Children=Guests display, Infants render separately, per-category cap=16, Pets excluded) that isn't in the prompt. Even Opus 4.6 stopped at 16 across 3 trials. 
- opendining-3: grader requires `contains(booking.date, '2024-07-20')` but the React-controlled date textbox flakily no-ops on `fill`. 3/8 trial pass rate is essentially coin-flip noise driven by tool-fidelity variance rather than agent capability. Removing to reduce score noise; Phase 2 fill post-validate warning helps when it does work, but the task's signal-to-noise is too low for the eval set. Dataset goes from 40 -> 36 tasks. Total EXCLUDED_TASKS now 11 entries. Validated by 8-trial pass-record audit; deep-dive notes saved to plans/audits/.	2026-04-29 22:07:53 +05:30
shivammittal274	d9c254053e	refactor(eval): drop unused agents/graders, collapse registries (#847) * refactor(eval): drop unused agents/graders, collapse registries Sweep of dead code in the eval app: deleted gemini-computer-use and yutori-navigator agents, fara/webvoyager/mind2web graders, eight debug/analyze/test scripts, three stale planning docs, and the orphaned eval-targets/coordinate-click testbed. With two agents and three graders left, the Map-backed plugin registries were over-engineered — collapsed both into plain switches. Removed the now-dead GraderOptions plumbing (no remaining grader takes API keys), dropped grader_api_key_env/grader_base_url/grader_model from the schema and configs, and de-duped PASS_FAIL_GRADER_ORDER (was defined in three places). Replaced the URL-parsing extractCdpPort hack in single-agent and orchestrator-executor with workerIndex passed cleanly through AgentContext. README and --help text rewritten to match reality. Renamed configs/test_.json to test-.json for kebab-case consistency. Net: ~10,460 LOC removed across 60 files. Typecheck clean, all tests pass. * ci(eval): pull BrowserOS from rolling stable CDN URL The pinned v0.44.0.1 .deb on GitHub releases regressed on Linux — servers start but never become healthy. Switch to the canonical rolling URL at cdn.browseros.com/download/BrowserOS.deb so CI tracks the same stable channel users get from the marketing site.	2026-04-29 02:14:47 +05:30
shivammittal274	af48a2110c	feat(eval): Phase 1 — exclude broken tasks, freshen card dates, add grader leniency (#841) * fix(eval): exclude broken tasks + freshen expired card dates Two AGISDK tasks are unsolvable today for non-model reasons: - topwork-1: evals-topwork.vercel.app throws Minified React error #185 ("Maximum update depth exceeded") on every form submit. The page renders "Application error: a client-side exception has occurred" instead of saving. Whole-task failure, every model affected. - fly-unified-2: hardcodes Exp: 12/25 in both the goal text AND a jmespath grader criterion. Today is 2026-04, so the eval-site rejects the card. Freshening the goal alone leaves the grader expecting the original value; freshening both would require monkey-patching agisdk's TaskConfig at runtime — too fragile to maintain. Adds these to a new EXCLUDED_TASKS set alongside the existing EXCLUDED_WEBSITES (omnizon). Also adds freshen_goal_dates(): for AGISDK fly-unified tasks whose goal contains an `Exp: MM/YY` within 6 months of today (or past), rewrites it to a far-future date (12/30). This rescues fly-unified-5 (had Exp 12/25, no card-exp grader criterion) and protects fly-unified-4 (had Exp 06/26, 2 months from expiring) from the next eval run hitting the same trap. Dataset goes from 47 -> 45 tasks; 2 freshened. * feat(eval): add lenient-strings grader softening The agisdk grader compares jmespath-extracted values via strict equality. For tasks where the model adds harmless decoration to a free-text field (e.g. topwork-3 expects title "Full-Stack Developer" but model produces "Full-Stack Developer - Enterprise Microservices Platform"), this fails even though every other criterion would pass. Adds a substring fallback in the wrapper: a failed criterion is re-marked as a softened pass when both actual_value and expected_value are strings and the (stripped, lower-cased) expected_value is contained in the actual_value. Numbers/bools/dates/None stay strict. - Default-on. 
Set AGISDK_STRICT_STRINGS=1 to recover the strict score. - Softened criteria are tagged with `softened: true` in per_criterion output for transparency in run manifests. - Aggregate `pass`/`reward` are recomputed after softening. Expected to rescue 4 tasks in our 45-set: topwork-3, topwork-4 (both pure title-decoration), gomail-8 (grader contradicts goal), and networkin-6 (grader hardcodes profile id). * fix(eval): exclude 5 more tasks where pipeline (not agent) fails Extends EXCLUDED_TASKS to 7 entries based on the K2.5 + Opus 4.6 head-to-head deep-dive on the 2026-04-28 runs. The exclusion rule: remove a task only if it is unsolvable for any agent — either the task data is invalid, the eval site is broken, or the grader penalizes correct work. Tasks that fail because of our agent's tool fidelity (drag, custom-widget fill, click on React submit, etc.) STAY in — those are real capability gaps the team should see in the score. New exclusions: - fly-unified-9: goal references "Dec 18 2024 at 10:00" but the live eval site has only 2025 inventory and no 10:00 slot. Both models successfully booked the closest available flight and were penalized on a grader expectation that can never be met. - fly-unified-4: eval site stores wall-clock flight times as bare UTC (T08:00:00.000Z) while the grader expects them shifted by 8h (T16:00:00.000Z = 8 AM PST). Opus 4.6 completed the entire booking correctly. Eval-site TZ-storage bug. - gomail-8: goal says "Clear all emails from GitHub in the inbox", but criterion 3 expects exactly 1 email updated. Both K2.5 and Opus correctly cleared all 4 GitHub emails. Grader contradicts goal. - networkin-6: goal says "Choose a random person you haven't connected with"; grader hardcodes profilesDiff.updated."4".connectionGrade. Both models randomized correctly and missed id 4. Grader contradicts goal. - networkin-9: eval site's searchHistoryDiff doesn't record queries submitted via the autocomplete + Enter path. 
Opus 4.6 completed the task end-to-end (Stanford alum, connection request, message); only failed because the search-history criterion was never written server-side. Eval-site bug. Dataset goes from 45 -> 40 tasks. Score impact (same K2.5/Opus runs, recomputed against the cleaned 40-task denominator): K2.5: 21/45 (46.7%) -> 21/40 (52.5%) Opus 4.6: 28/45 (62.2%) -> 28/40 (70.0%) Δ: 15.6 pp -> 17.5 pp (real model gap, less pipeline noise)	2026-04-28 23:19:31 +05:30
shivammittal274	01d649da9a	feat(eval): bring deterministic graders to dev + drop omnizon (#824 ) * feat: deterministic eval graders (AGI SDK + WebArena-Infinity) (#664) * feat: add deterministic eval graders (AGI SDK + WebArena-Infinity) Two new benchmark integrations with programmatic grading — no LLM judge. AGI SDK / REAL Bench (52 tasks): - 11 React/Next.js clones of consumer apps (DoorDash, Amazon, Gmail, etc.) - Grader navigates browser to /finish, extracts state diff from <pre> tag - Python verifier checks exact values via jmespath queries WebArena-Infinity (50 hard tasks): - 13 LLM-generated SaaS clones (Gmail, GitLab, Linear, Figma, etc.) - InfinityAppManager starts fresh app server per task per worker - Python verifier calls /api/state and asserts on JSON state Infrastructure: - GraderInput extended with mcpUrl + infinityAppUrl for parallel workers - Each worker gets isolated ports (no cross-worker state contamination) - CI workflow: pip install agisdk, clone webarena-infinity repo * chore: switch eval configs back to kimi-k2p5 * fix: register deterministic graders in pass rate calculation Add agisdk_state_diff and infinity_state to PASS_FAIL_GRADER_ORDER in both runner types and weekly report script, so scores show correctly in the dashboard. * chore: temp switch to opus 4.6 for eval run * chore: restore kimi-k2p5 as default eval config * ci: add timeout and continue-on-error for trend report step * fix(eval): drop omnizon from AGISDK dataset (DMCA takedown) evals-omnizon.vercel.app returns HTTP 451 ("This content has been blocked for legal reasons / DMCA_TAKEDOWN"). All 5 omnizon-* tasks fail grading with "Failed to fetch /finish endpoint: JSON Parse error". Adds an EXCLUDED_WEBSITES set to the dataset builder and regenerates agisdk-real.jsonl (52 → 47 tasks). * fix(eval): correct Infinity port-assignment bugs Two related bugs in the Infinity eval runner that cause silent port collisions / fallbacks under parallel execution: 1. 
build-infinity-dataset.py emitted "app_port" but task-executor and the committed JSONL both read "app_base_port". Re-running the build script would silently make every task fall back to the 8000 default, ignoring per-app port assignments. Renamed the key to match. 2. task-executor derived workerIndex as `base_server_port - 9110`, but parallel-executor doesn't override base_server_port per worker — only server_url. Every worker computed workerIndex = 0, causing all parallel workers to spawn Infinity app servers on the same port. Threading workerIndex explicitly through TaskExecutor instead. Also drops an unused app_name parameter from load_tasks().	2026-04-27 21:35:43 +05:30
Nikhil	9bdb2413ec	feat: clean-up - remove obsolete controller extension (#610) * refactor(server): remove obsolete controller extension backend * fix: address review feedback for PR #610	2026-03-27 17:01:04 -07:00
shivammittal274	65547c60c0	fix(eval): clean up eval configs and add test-clado-api script (#540) Consolidate 13 configs down to 7 with uniform settings: - 3 weekly (CI): browseros-agent, browseros-oe-agent, browseros-oe-clado - 4 test (local): test_gemini-computer-use, test_yutori-navigator, test_webvoyager, test_mind2web - All configs: headless=false, captcha block, full browseros ports, restart_server_per_task Deleted: debug-test, mind2web-test, tool-loop-test, orchestrator-executor-test, orchestrator-executor-clado-test, fireworks-minimax-m2, webvoyager-test Added: test-clado-api.ts script, browseros-oe-agent-weekly.json (OE with AI SDK executor)	2026-03-24 01:28:05 +05:30
shivammittal274	f14942c6f9	feat(eval): show mean score instead of pass/fail in report and viewer (#534)	2026-03-23 20:28:34 +05:30
shivammittal274	f157436e7d	feat(eval): switch to Linux GitHub-hosted runner (#519) * feat(eval): switch to ubuntu-latest runner, add OE-Clado config - Switch workflow from self-hosted Mac Studio to ubuntu-latest - Install BrowserOS Linux .deb in CI (no self-hosted runner needed) - Add browseros-oe-clado-weekly.json config for orchestrator-executor - Fix report chart to show date+time (not just date) - Make BROWSEROS_BINARY configurable via env var * feat(eval): add NopeCHA captcha solver extension to eval runs - Auto-load NopeCHA extension in eval Chrome instances - Works in incognito + headless mode - CI workflow downloads NopeCHA before eval - extensions/ directory gitignored (downloaded at runtime) * feat(eval): per-config concurrency — different configs run in parallel * feat(eval): remove concurrency limit — all runs execute in parallel	2026-03-21 23:04:45 +05:30
shivammittal274	4e90b4561a	feat(eval): weekly eval pipeline with R2 uploads and trend dashboard (#516 ) * feat(eval): weekly eval pipeline with R2 uploads and trend dashboard Add infrastructure for running weekly evaluations and tracking score trends over time: - Auto-generated output dirs: results/{config-name}/{timestamp}/ Each eval run gets its own timestamped folder, nothing is overwritten. - upload-run.ts: uploads eval results to Cloudflare R2. Supports uploading a specific run or all un-uploaded runs for a config. - weekly-report.ts: generates an interactive HTML dashboard from R2 data. Config dropdown, trend chart with hover tooltips, searchable runs table. Groups runs by config name. - viewer.html: client-facing 3-column run viewer (task list, screenshots with autoplay, agent stream with messages.jsonl). Shows performance grader axis breakdown with per-axis scores. - browseros-agent-weekly.json: weekly benchmark config (kimi-k2p5, webbench-2of4-50, 10 workers, performance grader, headless). - eval-weekly.yml: GitHub Actions workflow with cron (Saturday 6am) and manual trigger. Runs on self-hosted Mac Studio runner. Concurrency group ensures only one eval runs at a time. - Dashboard updates: load previous runs, messages.jsonl viewer, grade badges show percentages, async stream loading. - Grader updates: timeout 30min, max turns 100, DOM content verification guidance for performance grader. 
* fix(eval): address Greptile review — injection, nested dirs, escaping - Fix script injection in eval-weekly.yml: pass github.event.inputs through env var instead of interpolating into shell - Fix /api/runs to enumerate nested results/{config}/{timestamp}/ dirs - Fix /api/load-run to allow single-slash run names (config/timestamp) - Add HTML escaping for R2-sourced values in weekly-report.ts - Escape axis names in viewer.html renderAxesBreakdown * fix(eval): fix biome lint — non-null assertion, template literals * fix(eval): fix biome errors — replace var with let, fix inner function declaration * fix(eval): address Greptile P2 issues - isRunDir: check all subdirs for metadata.json, not just first 3 - eval-runner: guard configPath for dashboard-driven runs (fallback to 'eval') - load-run: default unknown termination_reason to 'failed' not 'completed' * feat(eval): make BROWSEROS_BINARY configurable via env var	2026-03-21 22:12:52 +05:30
shivammittal274	720baaed3e	feat: add GitHub Copilot as OAuth LLM provider (#500 ) * feat: add GitHub Copilot as OAuth-based LLM provider Add GitHub Copilot as a second OAuth provider using the Device Code flow (RFC 8628). Users authenticate via github.com/login/device, and the server polls for token completion. Supports 25+ models through a single Copilot subscription. Key changes: - Device Code OAuth flow in token manager (poll with safety margin) - Custom fetch wrapper injecting Copilot headers + vision detection - Provider factory using createOpenAICompatible for Chat Completions API - Extension UI with template card, auto-create on auth, and disconnect * fix: address PR review comments for GitHub Copilot OAuth - Validate device code response for error fields (GitHub can return 200 with error payload) - Store empty refreshToken instead of access token for GitHub tokens - Add closeButton to Toaster for dismissing device code toast * fix: add github-copilot to agent provider factory The chat route uses a separate provider-factory.ts (agent layer) from the test-provider route (llm/provider.ts). Added createGitHubCopilotFactory to the agent factory so chat works with GitHub Copilot. * fix: add github-copilot to provider icons, models, and dialog - Add Github icon from lucide-react to providerIcons map - Add 8 Copilot models (GPT-4o, Claude, Gemini, Grok) to models.ts - Add github-copilot to NewProviderDialog zod enum, validation skip, canTest check, and OAuth credential message * fix: reorder copilot models with free-tier models first Put models available on Copilot Free at the top (gpt-4o, gpt-4.1, gpt-5-mini, claude-haiku-4.5, grok-code-fast-1), followed by premium models that require paid Copilot subscription. * fix: set correct 64K context window for Copilot models Copilot API enforces a 64K input token limit regardless of the underlying model's native context window. Updated all model entries and the default template to 64000 so compaction triggers correctly. 
* fix: use actual per-model prompt limits from Copilot /models API Queried api.githubcopilot.com/models for real max_prompt_tokens values. GPT-4o/4.1 have 64K, Claude/gpt-5-mini have 128K, GPT-5.x have 272K. Also updated model list to match what's actually available on the API (e.g. claude-sonnet-4.6 instead of 4.5, added gpt-5.4/5.2-codex). * feat: resize images for Copilot using VS Code's algorithm Large screenshots cause 413 errors on Copilot's API. Resize images following VS Code's approach: max 2048px longest side, 768px shortest side, re-encode as JPEG at 75% quality. Uses sharp for server-side image processing. * fix: address all Greptile P1 review comments - Add .catch() on fire-and-forget pollDeviceCode to prevent unhandled rejection crashes (Node 15+) - Add deduplication guard (activeDeviceFlows Set) to prevent concurrent device code flows for the same provider - Add runtime validation of server response in frontend before calling window.open() and showing toast - Remove dead GITHUB_DEVICE_VERIFICATION constant from urls.ts * fix: upgrade biome to 2.4.8, fix all lint errors, and address review bugs - Upgrade biome from 2.4.5 to 2.4.8 (matches CI) and migrate configs - Fix image resize: only re-encode when dimensions actually change - Fix device code polling: retry on transient network errors instead of aborting - Allow restarting device code flow (clear old flow instead of throwing 500) - Fix pre-existing noNonNullAssertion and noExplicitAny lint errors globally * fix: address Greptile P2 review — image resize and config guard - Fix early-return guard: check max/min sides against their respective limits (MAX_LONG_SIDE/MAX_SHORT_SIDE) instead of both against SHORT - Preserve PNG alpha: detect hasAlpha and keep PNG format instead of unconditionally converting to lossy JPEG - Keep browserosId guard in resolveGitHubCopilotConfig consistent with ChatGPT Pro pattern (safety check that caller context is valid) * feat: update Copilot models to full list from 
pricing page, default to gpt-5-mini Added all 23 models from GitHub Copilot pricing page. Ordered with free-tier models first (gpt-5-mini, claude-haiku-4.5), then premium. Changed default from gpt-4o to gpt-5-mini since it's unlimited on Pro plan and has 128K context (vs gpt-4o's 64K limit).	2026-03-20 02:33:09 +05:30
shivammittal274	515ad44826	fix: resolve biome v2 config and lint errors (#471) Migrate `files.ignore` to `files.includes` for Biome v2 compatibility, fix forEach callback return value, unused variable, import ordering, and formatting violations.	2026-03-17 19:14:01 +05:30
shivammittal274	29056226bb	feat: add eval framework and coordinate-based input tools (#453) - Add hover_at, type_at, drag_at coordinate tools to server - Add hoverAt, typeAt, dragAt methods to Browser class - Export server internals (browser, tool-loop, registry) for eval imports - Copy eval app from enterprise repo with agents, graders, runner, dashboard - Nest eval-targets inside apps/eval - Adapt sessionExecutionDir → workingDir for current server API - Add biome ignore for dashboard HTML to prevent lint breaking onclick handlers	2026-03-16 23:12:23 +05:30
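
The lenient-strings softening described in commit af48a2110c can be sketched roughly as follows. This is a minimal illustration, not the code in the repository: the function name `soften_criteria`, the `passed` field, and the guard against an empty expected string are assumptions. Only `expected_value`, `actual_value`, `softened`, and the `AGISDK_STRICT_STRINGS` variable come from the commit message.

```python
import os


def soften_criteria(per_criterion, strict=None):
    """Re-mark failed string criteria as softened passes (hypothetical helper).

    Each entry is assumed to be a dict with `expected_value`, `actual_value`
    and a boolean `passed` field; `passed` is an assumed name.
    """
    if strict is None:
        # Per the commit message, AGISDK_STRICT_STRINGS=1 restores strict scoring.
        strict = os.environ.get("AGISDK_STRICT_STRINGS") == "1"

    result = []
    for entry in per_criterion:
        entry = dict(entry)  # copy so the caller's data is not mutated
        expected = entry.get("expected_value")
        actual = entry.get("actual_value")
        if (
            not strict
            and not entry.get("passed")
            # Only strings are softened; numbers, bools and None stay strict.
            and isinstance(expected, str)
            and isinstance(actual, str)
            # Assumption: an empty expected string is never softened.
            and expected.strip()
            and expected.strip().lower() in actual.strip().lower()
        ):
            entry["passed"] = True
            entry["softened"] = True
        result.append(entry)

    # The aggregate pass is recomputed after softening.
    overall_pass = all(e.get("passed") for e in result)
    return result, overall_pass
```

With the topwork-3 example from the commit, an expected title of "Full-Stack Developer" would be softened to a pass against "Full-Stack Developer - Enterprise Microservices Platform", while a boolean criterion that failed would stay failed.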

16 Commits
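
For readers skimming the log, the `freshen_goal_dates()` behaviour described in commit af48a2110c (rewrite an `Exp: MM/YY` that is past or within 6 months of today to 12/30) could look something like the sketch below. The signature, the regular expression, the month-granularity comparison, and the treatment of exactly 6 months as "within" are all assumptions made for illustration; the commit message only gives the rule and the replacement value.

```python
import re
from datetime import date

# Assumed goal-text format, taken from the `Exp: MM/YY` wording in the commit.
EXP_PATTERN = re.compile(r"Exp: (\d{2})/(\d{2})")


def freshen_goal_dates(goal, today=None, horizon_months=6, replacement="12/30"):
    """Rewrite card expiry dates that are past or expire soon (sketch only)."""
    today = today or date.today()
    # Compare at month granularity: months since year 0, plus the horizon.
    cutoff = today.year * 12 + (today.month - 1) + horizon_months

    def _rewrite(match):
        month = int(match.group(1))
        year = 2000 + int(match.group(2))  # assumes two-digit years are 20YY
        expiry = year * 12 + (month - 1)
        if expiry <= cutoff:
            return f"Exp: {replacement}"
        return match.group(0)  # far enough in the future, leave untouched

    return EXP_PATTERN.sub(_rewrite, goal)
```

Under these assumptions, with `today=date(2026, 4, 28)`, both `Exp: 12/25` (already past) and `Exp: 06/26` (two months out) are rewritten to `Exp: 12/30`, matching the fly-unified-5 and fly-unified-4 cases mentioned in the commit.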