Commit Graph

6 Commits

Author SHA1 Message Date
Nikhil
26afb826c6 feat(eval): add viewer manifest contract (#878)
* refactor(eval): canonicalize viewer manifest contract

* refactor(eval): publish canonical viewer manifests

* feat(eval): make r2 viewer use manifest artifact paths

* fix(eval): keep weekly report compatible with viewer manifests

* docs(eval): document r2 viewer manifest contract

* chore: self-review fixes

* fix: address review feedback for PR #878
2026-04-29 20:50:35 -07:00
shivammittal274
d9c254053e refactor(eval): drop unused agents/graders, collapse registries (#847)
* refactor(eval): drop unused agents/graders, collapse registries

Sweep of dead code in the eval app: deleted gemini-computer-use and
yutori-navigator agents, fara/webvoyager/mind2web graders, eight
debug/analyze/test scripts, three stale planning docs, and the orphaned
eval-targets/coordinate-click testbed.

With two agents and three graders left, the Map-backed plugin registries
were over-engineered — collapsed both into plain switches. Removed the
now-dead GraderOptions plumbing (no remaining grader takes API keys),
dropped grader_api_key_env/grader_base_url/grader_model from the schema
and configs, and de-duped PASS_FAIL_GRADER_ORDER (was defined in three
places). Replaced the URL-parsing extractCdpPort hack in single-agent
and orchestrator-executor with workerIndex passed cleanly through
AgentContext.

README and --help text rewritten to match reality. Renamed
configs/test_*.json to test-*.json for kebab-case consistency.

Net: ~10,460 LOC removed across 60 files. Typecheck clean, all tests
pass.

* ci(eval): pull BrowserOS from rolling stable CDN URL

The pinned v0.44.0.1 .deb on GitHub releases regressed on Linux —
servers start but never become healthy. Switch to the canonical rolling
URL at cdn.browseros.com/download/BrowserOS.deb so CI tracks the same
stable channel users get from the marketing site.
2026-04-29 02:14:47 +05:30
shivammittal274
01d649da9a feat(eval): bring deterministic graders to dev + drop omnizon (#824)
* feat: deterministic eval graders (AGI SDK + WebArena-Infinity) (#664)

* feat: add deterministic eval graders (AGI SDK + WebArena-Infinity)

Two new benchmark integrations with programmatic grading — no LLM judge.

AGI SDK / REAL Bench (52 tasks):
- 11 React/Next.js clones of consumer apps (DoorDash, Amazon, Gmail, etc.)
- Grader navigates browser to /finish, extracts state diff from <pre> tag
- Python verifier checks exact values via jmespath queries

WebArena-Infinity (50 hard tasks):
- 13 LLM-generated SaaS clones (Gmail, GitLab, Linear, Figma, etc.)
- InfinityAppManager starts fresh app server per task per worker
- Python verifier calls /api/state and asserts on JSON state

Infrastructure:
- GraderInput extended with mcpUrl + infinityAppUrl for parallel workers
- Each worker gets isolated ports (no cross-worker state contamination)
- CI workflow: pip install agisdk, clone webarena-infinity repo

* chore: switch eval configs back to kimi-k2p5

* fix: register deterministic graders in pass rate calculation

Add agisdk_state_diff and infinity_state to PASS_FAIL_GRADER_ORDER
in both runner types and weekly report script, so scores show correctly
in the dashboard.

* chore: temp switch to opus 4.6 for eval run

* chore: restore kimi-k2p5 as default eval config

* ci: add timeout and continue-on-error for trend report step

* fix(eval): drop omnizon from AGISDK dataset (DMCA takedown)

evals-omnizon.vercel.app returns HTTP 451 ("This content has been
blocked for legal reasons / DMCA_TAKEDOWN"). All 5 omnizon-* tasks
fail grading with "Failed to fetch /finish endpoint: JSON Parse error".

Adds an EXCLUDED_WEBSITES set to the dataset builder and regenerates
agisdk-real.jsonl (52 → 47 tasks).

* fix(eval): correct Infinity port-assignment bugs

Two related bugs in the Infinity eval runner that cause silent port
collisions / fallbacks under parallel execution:

1. build-infinity-dataset.py emitted "app_port" but task-executor and
   the committed JSONL both read "app_base_port". Re-running the build
   script would silently make every task fall back to the 8000 default,
   ignoring per-app port assignments. Renamed the key to match.

2. task-executor derived workerIndex as `base_server_port - 9110`, but
   parallel-executor doesn't override base_server_port per worker —
   only server_url. Every worker computed workerIndex = 0, causing all
   parallel workers to spawn Infinity app servers on the same port.
   Threading workerIndex explicitly through TaskExecutor instead.

Also drops an unused app_name parameter from load_tasks().
2026-04-27 21:35:43 +05:30
shivammittal274
f14942c6f9 feat(eval): show mean score instead of pass/fail in report and viewer (#534) 2026-03-23 20:28:34 +05:30
shivammittal274
f157436e7d feat(eval): switch to Linux GitHub-hosted runner (#519)
* feat(eval): switch to ubuntu-latest runner, add OE-Clado config

- Switch workflow from self-hosted Mac Studio to ubuntu-latest
- Install BrowserOS Linux .deb in CI (no self-hosted runner needed)
- Add browseros-oe-clado-weekly.json config for orchestrator-executor
- Fix report chart to show date+time (not just date)
- Make BROWSEROS_BINARY configurable via env var

* feat(eval): add NopeCHA captcha solver extension to eval runs

- Auto-load NopeCHA extension in eval Chrome instances
- Works in incognito + headless mode
- CI workflow downloads NopeCHA before eval
- extensions/ directory gitignored (downloaded at runtime)

* feat(eval): per-config concurrency — different configs run in parallel

* feat(eval): remove concurrency limit — all runs execute in parallel
2026-03-21 23:04:45 +05:30
shivammittal274
4e90b4561a feat(eval): weekly eval pipeline with R2 uploads and trend dashboard (#516)
* feat(eval): weekly eval pipeline with R2 uploads and trend dashboard

Add infrastructure for running weekly evaluations and tracking score
trends over time:

- Auto-generated output dirs: results/{config-name}/{timestamp}/
  Each eval run gets its own timestamped folder, nothing is overwritten.

- upload-run.ts: uploads eval results to Cloudflare R2. Supports
  uploading a specific run or all un-uploaded runs for a config.

- weekly-report.ts: generates an interactive HTML dashboard from R2
  data. Config dropdown, trend chart with hover tooltips, searchable
  runs table. Groups runs by config name.

- viewer.html: client-facing 3-column run viewer (task list,
  screenshots with autoplay, agent stream with messages.jsonl).
  Shows performance grader axis breakdown with per-axis scores.

- browseros-agent-weekly.json: weekly benchmark config (kimi-k2p5,
  webbench-2of4-50, 10 workers, performance grader, headless).

- eval-weekly.yml: GitHub Actions workflow with cron (Saturday 6am)
  and manual trigger. Runs on self-hosted Mac Studio runner.
  Concurrency group ensures only one eval runs at a time.

- Dashboard updates: load previous runs, messages.jsonl viewer,
  grade badges show percentages, async stream loading.

- Grader updates: timeout 30min, max turns 100, DOM content
  verification guidance for performance grader.

* fix(eval): address Greptile review — injection, nested dirs, escaping

- Fix script injection in eval-weekly.yml: pass github.event.inputs
  through env var instead of interpolating into shell
- Fix /api/runs to enumerate nested results/{config}/{timestamp}/ dirs
- Fix /api/load-run to allow single-slash run names (config/timestamp)
- Add HTML escaping for R2-sourced values in weekly-report.ts
- Escape axis names in viewer.html renderAxesBreakdown

* fix(eval): fix biome lint — non-null assertion, template literals

* fix(eval): fix biome errors — replace var with let, fix inner function declaration

* fix(eval): address Greptile P2 issues

- isRunDir: check all subdirs for metadata.json, not just first 3
- eval-runner: guard configPath for dashboard-driven runs (fallback to 'eval')
- load-run: default unknown termination_reason to 'failed' not 'completed'

* feat(eval): make BROWSEROS_BINARY configurable via env var
2026-03-21 22:12:52 +05:30