Files
BrowserOS/.github/workflows
shivammittal274 6305c87aa4 feat: add deterministic eval graders (AGI SDK + WebArena-Infinity)
Two new benchmark integrations with programmatic grading — no LLM judge.

AGI SDK / REAL Bench (52 tasks):
- 11 React/Next.js clones of consumer apps (DoorDash, Amazon, Gmail, etc.)
- Grader navigates browser to /finish, extracts state diff from <pre> tag
- Python verifier checks exact values via jmespath queries

WebArena-Infinity (50 hard tasks):
- 13 LLM-generated SaaS clones (Gmail, GitLab, Linear, Figma, etc.)
- InfinityAppManager starts fresh app server per task per worker
- Python verifier calls /api/state and asserts on JSON state

Infrastructure:
- GraderInput extended with mcpUrl + infinityAppUrl for parallel workers
- Each worker gets isolated ports (no cross-worker state contamination)
- CI workflow: pip install agisdk, clone webarena-infinity repo
2026-04-09 12:19:55 +05:30
..
2026-03-17 18:56:55 +05:30
2026-03-17 18:56:55 +05:30
2026-03-17 18:56:55 +05:30
2026-03-17 18:56:55 +05:30