BrowserOS

mirror of https://github.com/browseros-ai/BrowserOS.git synced 2026-05-19 11:31:03 +00:00

Author	SHA1	Message	Date
shivammittal274	d383b5e344	feat(eval): add claude-generated run report artifact (#892) * feat(eval): add claude-generated run report artifact * fix(eval): install claude code cli for CI evals * fix(eval): bypass claude code tool permissions * Eval metrics configs (#932) * feat(eval): add agisdk comparison metrics configs * fix(eval): keep cdp crashes from aborting run	2026-05-04 21:09:06 +05:30
Nikhil	84a79ba0a1	feat: refactor eval pipeline workflow (#875) * feat(eval): add suite variant config bridge * feat(eval): add stable run artifacts * refactor(eval): add shared grader contract * feat(eval): persist grader artifacts * refactor(eval): rename runner layers * refactor(eval): add executor backend boundary * refactor(eval): split clado backend * feat(eval): add workflow compatible cli * feat(eval): add r2 publisher module * ci(eval): migrate weekly workflow to eval cli * docs(eval): document suite pipeline * chore(eval): verify pipeline refactor * fix: address review feedback for PR #875 * docs(eval): add env example * docs(eval): explain suites and variants * chore(eval): organize config layouts * chore(eval): colocate grader python evaluators	2026-04-29 17:21:02 -07:00
shivammittal274	df0f45dd29	Feat: eval debug dev ci (#869 ) * chore(eval): instrument server startup to root-cause dev CI health-check timeouts Three diagnostics + one config swap to investigate why the eval-weekly workflow has been failing on dev since 2026-04-25 with "Server health check timed out" (every worker, every retry). Background: - Last successful weekly eval on dev: 2026-04-18 (sha `f5a2b73`) - Since then, ~30 server commits landed including Lima/VM runtime, OpenClaw service, ACL system, ACP SDK — 108 server files changed, ~13K LOC added. - Server process spawns cleanly in CI (PID logged) but never binds /health within the 30s eval-side timeout. Static analysis finds no obvious blocker; we need runtime evidence. Changes: 1. apps/server/package.json — add `start:ci` script (no `--watch`). The default `start` uses `bun --watch` which forks a child process that watches every file in the import graph. Dev's graph is ~108 files larger than main's; on a cold CI runner the watcher setup is a plausible source of multi-second startup overhead. 2. apps/eval/src/runner/browseros-app-manager.ts: - Use `start:ci` when `process.env.CI` is set (true on GitHub-hosted runners by default), else `start`. - Capture per-worker server stderr to /tmp/browseros-server-logs/ instead of ignoring it. Without this we have no visibility into why the server is hung pre-/health. - Bump SERVER_HEALTH_TIMEOUT_MS 30s -> 90s. Dev's larger module graph may simply need more cold-start time on CI. 3. .github/workflows/eval-weekly.yml — upload the server logs dir as a workflow artifact (always, not just on success) so we can post-mortem any startup failure on the next run. 4. configs/agisdk-real-smoke.json — swap K2.5 from OpenRouter -> Fireworks (bypasses the OpenRouter per-key spend cap that has been eating recent runs) and drop num_workers 10 -> 4 (well below the Fireworks per-account TPM threshold that overwhelmed the original 2026-04-23 run). 
Plan: trigger the eval-weekly workflow on this branch with the agisdk config and observe (a) whether it gets past server startup, and (b) if it doesn't, what the captured server stderr says. * fix(eval): capture stdout too — pino logger writes to stdout, not stderr Previous diagnostic patch only redirected stderr; the captured per-worker log files came back as 0 bytes because the server uses pino which writes all log output to stdout (fd 1), not stderr (fd 2). Capture both into the same file. * fix(server): catch sync throw from OpenClaw constructor on Linux The container runtime constructor in OpenClawService throws synchronously on non-darwin platforms, e.g. GitHub Actions Linux runners. The existing .catch() on tryAutoStart() only handles async throws inside auto-start — the sync throw from configureOpenClawService(...) itself propagates up through Application.start() and crashes the process via index.ts:48 (process.exit(EXIT_CODES.GENERAL_ERROR)). This is what's been killing dev's eval-weekly CI: the server crashes in milliseconds, the eval client polls /health, gets nothing, times out. Fix: wrap the configureOpenClawService call in try/catch matching the existing .catch() intent (best-effort, don't crash). Server continues without OpenClaw on platforms where it can't initialize. Verified by reading captured server stdout from run 25123195126: Failed to start server: error: browseros-vm currently supports macOS only at buildContainerRuntime (container-runtime-factory.ts:54:11) at new OpenClawService (openclaw-service.ts:652:15) at configureOpenClawService (openclaw-service.ts:1527:19) at start (main.ts:127:5) * fix(server): defer OpenClaw chat client port lookup to request time apps/server/src/api/server.ts:149 was calling getOpenClawService().getPort() synchronously when constructing the OpenClawGatewayChatClient inside the createHttpServer object literal. 
On non-darwin platforms this throws via the OpenClawService constructor → buildContainerRuntime, escaping the try/catch added in `5cf7b765` (which only protected the configureOpenClawService call further down in main.ts). Every other getOpenClawService() reference in server.ts is already wrapped in an arrow function. This was the lone holdout. Make it lazy too: change the chat client constructor to take getHostPort: () => number instead of hostPort: number, evaluate it inside streamTurn at request time. Behavior on darwin is unchanged. This unblocks dev's eval-weekly CI on Linux runners where OpenClaw isn't available — the chat endpoint isn't exercised by the eval, so a deferred throw is acceptable. * fix(server): allow Linux to skip OpenClaw via BROWSEROS_SKIP_OPENCLAW=1 Earlier surgical fixes (try/catch in main.ts, lazy chat client port) didn't unblock dev's Linux CI — same throw kept reproducing. Whether this is bun caching stale stack frames or a missed eager call site, the safer move is to fix it at the root: make buildContainerRuntime never throw on Linux when the runner has explicitly opted out. Adds BROWSEROS_SKIP_OPENCLAW env check alongside the existing NODE_ENV=test escape hatch in container-runtime-factory.ts. When set, returns the existing UnsupportedPlatformTestRuntime stub — server boots normally, /health binds, any actual OpenClaw API call still fails loudly at request time. eval-weekly.yml sets the flag for the Linux runner. Darwin behavior and non-CI Linux behavior unchanged (without the flag they still throw). * feat(eval): align Clado action executor with new endpoint contract David Shan shared the updated Clado BrowserOS Action Model spec. Changes to match it: - Bump endpoint URL + model id to the 000159-merged checkpoint (clado-ai--clado-browseros-action-000159-merged-actionmod-f4a6ef) in browseros-oe-clado-weekly.json and the README example. - CLADO_REQUEST_TIMEOUT_MS 120s → 360s. 
Cold start can take ~5 min; the 2-min ceiling was failing every cold-start request. - Treat HTTP 200 with action=null / parse_error as an INVALID step instead of aborting the executor loop. The model can self-correct on the next call. Cap consecutive parse failures at 3 to avoid infinite loops. - Capture final_answer from end actions. Surface it in the observation back to the orchestrator so its task answer can use the model's declared result. - Add macOS Cmd-* key mappings (M-a, M-c, M-v, M-x → Meta+A/C/V/X). - Switch screenshot format from webp → png to match the documented "PNG or JPEG" contract. * chore(eval): refresh test-clado-api script for new Clado contract Updated the local smoke-test to match the new Clado endpoint and response contract: - New action + health URLs (000159-merged checkpoint). - Drop the grounding-model branch (orchestrator-executor doesn't use it; the README David shared only documents the action model). - Health-check waits up to 6 minutes for cold start with a 30s warning so the operator knows it's spinning up. - Print every documented response field (action, x/y, text, key, direction, amount, drag start/end, time, final_answer, thinking, parse_error, inference_time_seconds). - Three-step run that exercises a click, a typing continuation with formatted history, and an end+final_answer probe. * chore(eval): point clado weekly config at agisdk-real Switches the orchestrator-executor + Clado weekly config to run on the AGI SDK / REAL Bench task set with the deterministic agisdk_state_diff grader. Matches the orchestrator-executor smoke target (Fireworks K2.5 orchestrator + Clado action executor) we want to track week-over-week. * chore(eval): run clado weekly headless Default to headless so the weekly job (and local repros) don't pop ten visible Chrome windows. Set headless=false locally if you need to watch a worker. 
* fix(eval): address Greptile P1+P2 on server log fd handling P1: openSync was outside the mkdirSync try/catch, so a swallowed mkdir failure (e.g. unwritable custom BROWSEROS_SERVER_LOG_DIR) would leave the log directory missing and crash the server spawn with ENOENT. Move openSync into the same try block; fall back to /dev/null so spawn always succeeds. P2: the log fd was opened on every server start but never closed. Each restart attempt leaked one fd across all workers — over a long eval run that could exhaust the process fd limit. Track the fd on the manager and closeSync it in killApp() right after the server process exits (the child's dup keeps the file open until it exits, so we don't truncate output).	2026-04-30 01:33:49 +05:30
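The "defer OpenClaw chat client port lookup to request time" fix in the commit above can be illustrated with a minimal sketch. Everything here is hypothetical apart from the idea itself: `makePortLookup` stands in for `getOpenClawService().getPort()`, the class names, the port value, and the `/chat` path are invented for illustration, and `turnUrl` plays the role of `streamTurn`. It only shows why taking `getHostPort: () => number` instead of `hostPort: number` moves the throw from construction time to request time.

```typescript
// Hypothetical stand-in for getOpenClawService().getPort(): it throws on
// unsupported platforms, as the commit message describes for non-darwin hosts.
function makePortLookup(supported: boolean): () => number {
  return () => {
    if (!supported) {
      throw new Error("browseros-vm currently supports macOS only");
    }
    return 4000; // illustrative port, not taken from the source
  };
}

// Eager shape: the port must already be resolved when the client is built.
class EagerChatClient {
  readonly hostPort: number;
  constructor(hostPort: number) {
    this.hostPort = hostPort;
  }
}

// Lazy shape: the client stores the lookup and calls it once per request.
class LazyChatClient {
  readonly getHostPort: () => number;
  constructor(getHostPort: () => number) {
    this.getHostPort = getHostPort;
  }
  // Stand-in for streamTurn: the port is resolved only here.
  turnUrl(): string {
    return `http://127.0.0.1:${this.getHostPort()}/chat`;
  }
}

const unsupportedLookup = makePortLookup(false);

// Eager construction evaluates the lookup immediately and throws.
let eagerThrewAtConstruction = false;
try {
  new EagerChatClient(unsupportedLookup());
} catch {
  eagerThrewAtConstruction = true;
}

// Lazy construction succeeds; the throw is deferred to the first request.
const lazyClient = new LazyChatClient(unsupportedLookup);
let lazyThrewAtRequest = false;
try {
  lazyClient.turnUrl();
} catch {
  lazyThrewAtRequest = true;
}
```

With the lazy shape, server startup no longer depends on the lookup succeeding, so a platform that never exercises the chat endpoint never sees the error.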
shivammittal274	231bd6821d	fix(eval): pin agisdk version + exclude 4 invalid tasks (Phase 2 dataset hygiene) (#844 ) * chore(eval): pin agisdk version to prevent silent dataset drift `pip install agisdk` previously fetched whatever version pip resolved at CI time. If agisdk publishes a new version with changed task definitions or grader behavior, the weekly eval silently shifts under our feet — making "did the score move because of code or data?" unanswerable. Pin to agisdk==0.3.5 (the version we currently develop against). Bump intentionally with a documented re-baseline run. * fix(eval): exclude 4 more tasks identified by 8-trial never-passing audit After 8 trials across K2.5 + Opus 4.6 (Phase 1 and Phase 2), 5 tasks never passed. Per-task root-cause investigation via parallel deep-dive subagents flagged 4 of them as fundamentally unfixable in the eval pipeline as it stands; the 5th (`dashdish-5`) is a prompt-rule fix that stays in. - gocalendar-7: goal/grader contradiction. Goal says "move event to July 19, 10 AM"; grader expects `eventsDiff.updated.*.start == "2024-07-18T17:00Z"` (= July 18, 10 AM PDT — same day, 1 hour shift). Even after the Phase 2 HTML5 dnd dispatch fix correctly populates `eventsDiff.updated`, the values are July 19 (matching the goal), which the grader rejects. - staynb-5: grader hardcodes literal `'Oct 13 2025'` and `'Oct 23 2025'` year strings. The staynb date picker interprets bare "Oct 13" as the most-recent-past instance (currently 2024 since today is 2026), not 2025. No agent can produce a persisted date string containing 2025. - staynb-9: under-specified task. Goal says "maximum number of guests supported"; grader requires the very specific string "32 Guests, 16 Infants" — encoding UI knowledge (Adults+Children=Guests display, Infants render separately, per-category cap=16, Pets excluded) that isn't in the prompt. Even Opus 4.6 stopped at 16 across 3 trials. 
- opendining-3: grader requires `contains(booking.date, '2024-07-20')` but the React-controlled date textbox flakily no-ops on `fill`. 3/8 trial pass rate is essentially coin-flip noise driven by tool-fidelity variance rather than agent capability. Removing to reduce score noise; Phase 2 fill post-validate warning helps when it does work, but the task's signal-to-noise is too low for the eval set. Dataset goes from 40 -> 36 tasks. Total EXCLUDED_TASKS now 11 entries. Validated by 8-trial pass-record audit; deep-dive notes saved to plans/audits/.	2026-04-29 22:07:53 +05:30
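The gocalendar-7 contradiction in the commit above comes down to timezone arithmetic, which can be checked with a short sketch. The helper names are hypothetical; the only facts assumed beyond the commit message are that Pacific Daylight Time is UTC-7 and that the strings are ISO 8601 timestamps.

```typescript
const PDT_OFFSET_MS = -7 * 60 * 60 * 1000; // Pacific Daylight Time is UTC-7

// Convert a UTC timestamp to a PDT wall-clock string "YYYY-MM-DDTHH:mm".
function utcToPdtWallClock(isoUtc: string): string {
  const shifted = new Date(new Date(isoUtc).getTime() + PDT_OFFSET_MS);
  return shifted.toISOString().slice(0, 16);
}

// Convert a PDT wall-clock string "YYYY-MM-DDTHH:mm" to a UTC ISO timestamp.
function pdtWallClockToUtc(wallClock: string): string {
  const asIfUtc = new Date(`${wallClock}:00.000Z`).getTime();
  return new Date(asIfUtc - PDT_OFFSET_MS).toISOString();
}

// What the grader's expected value means in local time.
const graderExpectsLocal = utcToPdtWallClock("2024-07-18T17:00Z");

// What the task goal ("July 19, 10 AM") means in UTC.
const goalInUtc = pdtWallClockToUtc("2024-07-19T10:00");
```

Here `graderExpectsLocal` comes out as July 18 at 10:00 while `goalInUtc` falls on July 19 at 17:00Z, so an agent that follows the goal can never produce the value the grader checks for.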
Nikhil	1946ca0cf8	chore: clean up unused agent sdk (#855)	2026-04-28 17:21:46 -07:00
shivammittal274	d9c254053e	refactor(eval): drop unused agents/graders, collapse registries (#847) * refactor(eval): drop unused agents/graders, collapse registries Sweep of dead code in the eval app: deleted gemini-computer-use and yutori-navigator agents, fara/webvoyager/mind2web graders, eight debug/analyze/test scripts, three stale planning docs, and the orphaned eval-targets/coordinate-click testbed. With two agents and three graders left, the Map-backed plugin registries were over-engineered, so both were collapsed into plain switches. Removed the now-dead GraderOptions plumbing (no remaining grader takes API keys), dropped grader_api_key_env/grader_base_url/grader_model from the schema and configs, and de-duped PASS_FAIL_GRADER_ORDER (was defined in three places). Replaced the URL-parsing extractCdpPort hack in single-agent and orchestrator-executor with workerIndex passed cleanly through AgentContext. README and --help text rewritten to match reality. Renamed configs/test_*.json to test-*.json for kebab-case consistency. Net: ~10,460 LOC removed across 60 files. Typecheck clean, all tests pass. * ci(eval): pull BrowserOS from rolling stable CDN URL The pinned v0.44.0.1 .deb on GitHub releases regressed on Linux: servers start but never become healthy. Switch to the canonical rolling URL at cdn.browseros.com/download/BrowserOS.deb so CI tracks the same stable channel users get from the marketing site.	2026-04-29 02:14:47 +05:30
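The registry collapse described in the commit above might look roughly like the following sketch. The `Agent` interface, the `createAgent` function, and the `describe` method are hypothetical; `AgentContext`, `workerIndex`, and the two agent type names are borrowed from the commit message, but whether these are exactly the two remaining agents is an assumption.

```typescript
// Context handed to every agent; workerIndex is passed in explicitly rather
// than being parsed back out of a URL.
interface AgentContext {
  workerIndex: number;
}

// Hypothetical minimal agent surface, for illustration only.
interface Agent {
  describe(): string;
}

// A plain switch in place of a Map-backed plugin registry.
function createAgent(type: string, ctx: AgentContext): Agent {
  switch (type) {
    case "single-agent":
      return { describe: () => `single-agent on worker ${ctx.workerIndex}` };
    case "orchestrator-executor":
      return { describe: () => `orchestrator-executor on worker ${ctx.workerIndex}` };
    default:
      throw new Error(`Unknown agent type: ${type}`);
  }
}

const exampleAgent = createAgent("single-agent", { workerIndex: 2 });
```

With only a handful of cases, the switch keeps every supported type visible in one place and fails loudly on an unknown name.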
shivammittal274	01d649da9a	feat(eval): bring deterministic graders to dev + drop omnizon (#824 ) * feat: deterministic eval graders (AGI SDK + WebArena-Infinity) (#664) * feat: add deterministic eval graders (AGI SDK + WebArena-Infinity) Two new benchmark integrations with programmatic grading — no LLM judge. AGI SDK / REAL Bench (52 tasks): - 11 React/Next.js clones of consumer apps (DoorDash, Amazon, Gmail, etc.) - Grader navigates browser to /finish, extracts state diff from <pre> tag - Python verifier checks exact values via jmespath queries WebArena-Infinity (50 hard tasks): - 13 LLM-generated SaaS clones (Gmail, GitLab, Linear, Figma, etc.) - InfinityAppManager starts fresh app server per task per worker - Python verifier calls /api/state and asserts on JSON state Infrastructure: - GraderInput extended with mcpUrl + infinityAppUrl for parallel workers - Each worker gets isolated ports (no cross-worker state contamination) - CI workflow: pip install agisdk, clone webarena-infinity repo * chore: switch eval configs back to kimi-k2p5 * fix: register deterministic graders in pass rate calculation Add agisdk_state_diff and infinity_state to PASS_FAIL_GRADER_ORDER in both runner types and weekly report script, so scores show correctly in the dashboard. * chore: temp switch to opus 4.6 for eval run * chore: restore kimi-k2p5 as default eval config * ci: add timeout and continue-on-error for trend report step * fix(eval): drop omnizon from AGISDK dataset (DMCA takedown) evals-omnizon.vercel.app returns HTTP 451 ("This content has been blocked for legal reasons / DMCA_TAKEDOWN"). All 5 omnizon-* tasks fail grading with "Failed to fetch /finish endpoint: JSON Parse error". Adds an EXCLUDED_WEBSITES set to the dataset builder and regenerates agisdk-real.jsonl (52 → 47 tasks). * fix(eval): correct Infinity port-assignment bugs Two related bugs in the Infinity eval runner that cause silent port collisions / fallbacks under parallel execution: 1. 
build-infinity-dataset.py emitted "app_port" but task-executor and the committed JSONL both read "app_base_port". Re-running the build script would silently make every task fall back to the 8000 default, ignoring per-app port assignments. Renamed the key to match. 2. task-executor derived workerIndex as `base_server_port - 9110`, but parallel-executor doesn't override base_server_port per worker — only server_url. Every worker computed workerIndex = 0, causing all parallel workers to spawn Infinity app servers on the same port. Threading workerIndex explicitly through TaskExecutor instead. Also drops an unused app_name parameter from load_tasks().	2026-04-27 21:35:43 +05:30
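The second port-assignment bug in the commit above can be reproduced in a few lines. This is a sketch under assumptions: the `WorkerConfig` shape, the helper names, and the "app port = base port + worker index" formula are hypothetical; only the `base_server_port - 9110` derivation, the 8000 default, and the fact that parallel workers differ only in `server_url` come from the commit message.

```typescript
interface WorkerConfig {
  base_server_port: number;
  server_url: string;
}

const BASE_SERVER_PORT = 9110;

// Buggy derivation: relies on base_server_port differing per worker.
function derivedWorkerIndex(cfg: WorkerConfig): number {
  return cfg.base_server_port - BASE_SERVER_PORT;
}

// Assumed port formula: each worker offsets the app's base port by its index.
function infinityAppPort(appBasePort: number, workerIndex: number): number {
  return appBasePort + workerIndex;
}

// Parallel workers as described: only server_url is overridden per worker.
const workerConfigs: WorkerConfig[] = [0, 1, 2].map((i) => ({
  base_server_port: BASE_SERVER_PORT,
  server_url: `http://127.0.0.1:${BASE_SERVER_PORT + i}`,
}));

// Every derived index is 0, so all workers collide on the same app port.
const buggyPorts = workerConfigs.map((cfg) =>
  infinityAppPort(8000, derivedWorkerIndex(cfg)),
);

// Threading the index explicitly gives each worker its own port.
const fixedPorts = workerConfigs.map((_cfg, workerIndex) =>
  infinityAppPort(8000, workerIndex),
);
```

The explicit index removes the hidden dependency on a config field that the parallel executor never varies.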
shivammittal274	0babc05077	feat(eval): NopeCHA CAPTCHA solver integration (#537) * feat(eval): show mean score instead of pass/fail in report and viewer * feat(eval): integrate NopeCHA CAPTCHA solver into eval pipeline Add CAPTCHA detection and waiting so screenshots capture post-solve state. Run headed with xvfb on CI since headless breaks extension content scripts. - Add CaptchaWaiter module (detect reCAPTCHA/hCaptcha/Turnstile, poll until solved) - Add optional `captcha` config block to EvalConfigSchema - Wait for CAPTCHA solve before screenshot in single-agent and orchestrator-executor - Patch NopeCHA manifest with API key before launching workers - Fix CAPTCHA_EXT_DIR path (was pointing one level too high) - Remove --incognito (extensions don't run in incognito; fresh user-data-dir isolates) - CI: install xvfb, run headed via xvfb-run, pass NOPECHA_API_KEY secret	2026-03-24 00:14:16 +05:30
shivammittal274	026c6a03a3	feat(eval): auto-trigger eval on agent/tools changes pushed to main (#528)	2026-03-23 16:52:30 +05:30
shivammittal274	0f9d93058f	chore(eval): remove unused env vars from workflow (OPENROUTER, OPENAI) (#522)	2026-03-21 23:22:03 +05:30
shivammittal274	cafed57832	fix(eval): use CLAUDE_CODE_OAUTH_TOKEN for performance grader auth (#521)	2026-03-21 23:14:23 +05:30
shivammittal274	f157436e7d	feat(eval): switch to Linux GitHub-hosted runner (#519 ) * feat(eval): switch to ubuntu-latest runner, add OE-Clado config - Switch workflow from self-hosted Mac Studio to ubuntu-latest - Install BrowserOS Linux .deb in CI (no self-hosted runner needed) - Add browseros-oe-clado-weekly.json config for orchestrator-executor - Fix report chart to show date+time (not just date) - Make BROWSEROS_BINARY configurable via env var * feat(eval): add NopeCHA captcha solver extension to eval runs - Auto-load NopeCHA extension in eval Chrome instances - Works in incognito + headless mode - CI workflow downloads NopeCHA before eval - extensions/ directory gitignored (downloaded at runtime) * feat(eval): per-config concurrency — different configs run in parallel * feat(eval): remove concurrency limit — all runs execute in parallel	2026-03-21 23:04:45 +05:30
shivammittal274	4e90b4561a	feat(eval): weekly eval pipeline with R2 uploads and trend dashboard (#516 ) * feat(eval): weekly eval pipeline with R2 uploads and trend dashboard Add infrastructure for running weekly evaluations and tracking score trends over time: - Auto-generated output dirs: results/{config-name}/{timestamp}/ Each eval run gets its own timestamped folder, nothing is overwritten. - upload-run.ts: uploads eval results to Cloudflare R2. Supports uploading a specific run or all un-uploaded runs for a config. - weekly-report.ts: generates an interactive HTML dashboard from R2 data. Config dropdown, trend chart with hover tooltips, searchable runs table. Groups runs by config name. - viewer.html: client-facing 3-column run viewer (task list, screenshots with autoplay, agent stream with messages.jsonl). Shows performance grader axis breakdown with per-axis scores. - browseros-agent-weekly.json: weekly benchmark config (kimi-k2p5, webbench-2of4-50, 10 workers, performance grader, headless). - eval-weekly.yml: GitHub Actions workflow with cron (Saturday 6am) and manual trigger. Runs on self-hosted Mac Studio runner. Concurrency group ensures only one eval runs at a time. - Dashboard updates: load previous runs, messages.jsonl viewer, grade badges show percentages, async stream loading. - Grader updates: timeout 30min, max turns 100, DOM content verification guidance for performance grader. 
* fix(eval): address Greptile review — injection, nested dirs, escaping - Fix script injection in eval-weekly.yml: pass github.event.inputs through env var instead of interpolating into shell - Fix /api/runs to enumerate nested results/{config}/{timestamp}/ dirs - Fix /api/load-run to allow single-slash run names (config/timestamp) - Add HTML escaping for R2-sourced values in weekly-report.ts - Escape axis names in viewer.html renderAxesBreakdown * fix(eval): fix biome lint — non-null assertion, template literals * fix(eval): fix biome errors — replace var with let, fix inner function declaration * fix(eval): address Greptile P2 issues - isRunDir: check all subdirs for metadata.json, not just first 3 - eval-runner: guard configPath for dashboard-driven runs (fallback to 'eval') - load-run: default unknown termination_reason to 'failed' not 'completed' * feat(eval): make BROWSEROS_BINARY configurable via env var	2026-03-21 22:12:52 +05:30
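The "HTML escaping for R2-sourced values" fix in the commit above refers to a standard technique, sketched here. The `escapeHtml` name and the sample row are hypothetical and not taken from weekly-report.ts; the entity mapping is the usual one for text placed inside HTML element content or quoted attributes.

```typescript
// Escape the five characters that can break out of HTML text or attributes.
// The ampersand is replaced first so already-produced entities are not
// escaped a second time.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Hypothetical untrusted value, e.g. a config name read back from storage.
const untrustedConfigName = '<img src=x onerror="alert(1)">';

// Safe to interpolate into a report row after escaping.
const reportRow = `<td>${escapeHtml(untrustedConfigName)}</td>`;
```

Escaping at the point of interpolation keeps stored values unmodified while ensuring the generated dashboard treats them as text rather than markup.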

13 Commits