BrowserOS

mirror of https://github.com/browseros-ai/BrowserOS.git synced 2026-05-21 12:55:09 +00:00

Author	SHA1	Message	Date
Nikhil	26afb826c6	feat(eval): add viewer manifest contract (#878 ) * refactor(eval): canonicalize viewer manifest contract * refactor(eval): publish canonical viewer manifests * feat(eval): make r2 viewer use manifest artifact paths * fix(eval): keep weekly report compatible with viewer manifests * docs(eval): document r2 viewer manifest contract * chore: self-review fixes * fix: address review feedback for PR #878	2026-04-29 20:50:35 -07:00
shivammittal274	d9c254053e	refactor(eval): drop unused agents/graders, collapse registries (#847 ) * refactor(eval): drop unused agents/graders, collapse registries Sweep of dead code in the eval app: deleted gemini-computer-use and yutori-navigator agents, fara/webvoyager/mind2web graders, eight debug/analyze/test scripts, three stale planning docs, and the orphaned eval-targets/coordinate-click testbed. With two agents and three graders left, the Map-backed plugin registries were over-engineered — collapsed both into plain switches. Removed the now-dead GraderOptions plumbing (no remaining grader takes API keys), dropped grader_api_key_env/grader_base_url/grader_model from the schema and configs, and de-duped PASS_FAIL_GRADER_ORDER (was defined in three places). Replaced the URL-parsing extractCdpPort hack in single-agent and orchestrator-executor with workerIndex passed cleanly through AgentContext. README and --help text rewritten to match reality. Renamed configs/test_.json to test-.json for kebab-case consistency. Net: ~10,460 LOC removed across 60 files. Typecheck clean, all tests pass. * ci(eval): pull BrowserOS from rolling stable CDN URL The pinned v0.44.0.1 .deb on GitHub releases regressed on Linux — servers start but never become healthy. Switch to the canonical rolling URL at cdn.browseros.com/download/BrowserOS.deb so CI tracks the same stable channel users get from the marketing site.	2026-04-29 02:14:47 +05:30
shivammittal274	01d649da9a	feat(eval): bring deterministic graders to dev + drop omnizon (#824 ) * feat: deterministic eval graders (AGI SDK + WebArena-Infinity) (#664) * feat: add deterministic eval graders (AGI SDK + WebArena-Infinity) Two new benchmark integrations with programmatic grading — no LLM judge. AGI SDK / REAL Bench (52 tasks): - 11 React/Next.js clones of consumer apps (DoorDash, Amazon, Gmail, etc.) - Grader navigates browser to /finish, extracts state diff from <pre> tag - Python verifier checks exact values via jmespath queries WebArena-Infinity (50 hard tasks): - 13 LLM-generated SaaS clones (Gmail, GitLab, Linear, Figma, etc.) - InfinityAppManager starts fresh app server per task per worker - Python verifier calls /api/state and asserts on JSON state Infrastructure: - GraderInput extended with mcpUrl + infinityAppUrl for parallel workers - Each worker gets isolated ports (no cross-worker state contamination) - CI workflow: pip install agisdk, clone webarena-infinity repo * chore: switch eval configs back to kimi-k2p5 * fix: register deterministic graders in pass rate calculation Add agisdk_state_diff and infinity_state to PASS_FAIL_GRADER_ORDER in both runner types and weekly report script, so scores show correctly in the dashboard. * chore: temp switch to opus 4.6 for eval run * chore: restore kimi-k2p5 as default eval config * ci: add timeout and continue-on-error for trend report step * fix(eval): drop omnizon from AGISDK dataset (DMCA takedown) evals-omnizon.vercel.app returns HTTP 451 ("This content has been blocked for legal reasons / DMCA_TAKEDOWN"). All 5 omnizon-* tasks fail grading with "Failed to fetch /finish endpoint: JSON Parse error". Adds an EXCLUDED_WEBSITES set to the dataset builder and regenerates agisdk-real.jsonl (52 → 47 tasks). * fix(eval): correct Infinity port-assignment bugs Two related bugs in the Infinity eval runner that cause silent port collisions / fallbacks under parallel execution: 1. build-infinity-dataset.py emitted "app_port" but task-executor and the committed JSONL both read "app_base_port". Re-running the build script would silently make every task fall back to the 8000 default, ignoring per-app port assignments. Renamed the key to match. 2. task-executor derived workerIndex as `base_server_port - 9110`, but parallel-executor doesn't override base_server_port per worker — only server_url. Every worker computed workerIndex = 0, causing all parallel workers to spawn Infinity app servers on the same port. Threading workerIndex explicitly through TaskExecutor instead. Also drops an unused app_name parameter from load_tasks().	2026-04-27 21:35:43 +05:30
shivammittal274	f14942c6f9	feat(eval): show mean score instead of pass/fail in report and viewer (#534 )	2026-03-23 20:28:34 +05:30
shivammittal274	f157436e7d	feat(eval): switch to Linux GitHub-hosted runner (#519 ) * feat(eval): switch to ubuntu-latest runner, add OE-Clado config - Switch workflow from self-hosted Mac Studio to ubuntu-latest - Install BrowserOS Linux .deb in CI (no self-hosted runner needed) - Add browseros-oe-clado-weekly.json config for orchestrator-executor - Fix report chart to show date+time (not just date) - Make BROWSEROS_BINARY configurable via env var * feat(eval): add NopeCHA captcha solver extension to eval runs - Auto-load NopeCHA extension in eval Chrome instances - Works in incognito + headless mode - CI workflow downloads NopeCHA before eval - extensions/ directory gitignored (downloaded at runtime) * feat(eval): per-config concurrency — different configs run in parallel * feat(eval): remove concurrency limit — all runs execute in parallel	2026-03-21 23:04:45 +05:30
shivammittal274	4e90b4561a	feat(eval): weekly eval pipeline with R2 uploads and trend dashboard (#516 ) * feat(eval): weekly eval pipeline with R2 uploads and trend dashboard Add infrastructure for running weekly evaluations and tracking score trends over time: - Auto-generated output dirs: results/{config-name}/{timestamp}/ Each eval run gets its own timestamped folder, nothing is overwritten. - upload-run.ts: uploads eval results to Cloudflare R2. Supports uploading a specific run or all un-uploaded runs for a config. - weekly-report.ts: generates an interactive HTML dashboard from R2 data. Config dropdown, trend chart with hover tooltips, searchable runs table. Groups runs by config name. - viewer.html: client-facing 3-column run viewer (task list, screenshots with autoplay, agent stream with messages.jsonl). Shows performance grader axis breakdown with per-axis scores. - browseros-agent-weekly.json: weekly benchmark config (kimi-k2p5, webbench-2of4-50, 10 workers, performance grader, headless). - eval-weekly.yml: GitHub Actions workflow with cron (Saturday 6am) and manual trigger. Runs on self-hosted Mac Studio runner. Concurrency group ensures only one eval runs at a time. - Dashboard updates: load previous runs, messages.jsonl viewer, grade badges show percentages, async stream loading. - Grader updates: timeout 30min, max turns 100, DOM content verification guidance for performance grader. * fix(eval): address Greptile review — injection, nested dirs, escaping - Fix script injection in eval-weekly.yml: pass github.event.inputs through env var instead of interpolating into shell - Fix /api/runs to enumerate nested results/{config}/{timestamp}/ dirs - Fix /api/load-run to allow single-slash run names (config/timestamp) - Add HTML escaping for R2-sourced values in weekly-report.ts - Escape axis names in viewer.html renderAxesBreakdown * fix(eval): fix biome lint — non-null assertion, template literals * fix(eval): fix biome errors — replace var with let, fix inner function declaration * fix(eval): address Greptile P2 issues - isRunDir: check all subdirs for metadata.json, not just first 3 - eval-runner: guard configPath for dashboard-driven runs (fallback to 'eval') - load-run: default unknown termination_reason to 'failed' not 'completed' * feat(eval): make BROWSEROS_BINARY configurable via env var	2026-03-21 22:12:52 +05:30

6 Commits