Files
shivammittal274 4e90b4561a feat(eval): weekly eval pipeline with R2 uploads and trend dashboard (#516)
* feat(eval): weekly eval pipeline with R2 uploads and trend dashboard

Add infrastructure for running weekly evaluations and tracking score
trends over time:

- Auto-generated output dirs: results/{config-name}/{timestamp}/
  Each eval run gets its own timestamped folder, nothing is overwritten.

- upload-run.ts: uploads eval results to Cloudflare R2. Supports
  uploading a specific run or all un-uploaded runs for a config.

- weekly-report.ts: generates an interactive HTML dashboard from R2
  data. Config dropdown, trend chart with hover tooltips, searchable
  runs table. Groups runs by config name.

- viewer.html: client-facing 3-column run viewer (task list,
  screenshots with autoplay, agent stream with messages.jsonl).
  Shows performance grader axis breakdown with per-axis scores.

- browseros-agent-weekly.json: weekly benchmark config (kimi-k2p5,
  webbench-2of4-50, 10 workers, performance grader, headless).

- eval-weekly.yml: GitHub Actions workflow with cron (Saturday 6am)
  and manual trigger. Runs on self-hosted Mac Studio runner.
  Concurrency group ensures only one eval runs at a time.

- Dashboard updates: load previous runs, messages.jsonl viewer,
  grade badges show percentages, async stream loading.

- Grader updates: timeout 30min, max turns 100, DOM content
  verification guidance for performance grader.

* fix(eval): address Greptile review — injection, nested dirs, escaping

- Fix script injection in eval-weekly.yml: pass github.event.inputs
  through env var instead of interpolating into shell
- Fix /api/runs to enumerate nested results/{config}/{timestamp}/ dirs
- Fix /api/load-run to allow single-slash run names (config/timestamp)
- Add HTML escaping for R2-sourced values in weekly-report.ts
- Escape axis names in viewer.html renderAxesBreakdown

* fix(eval): fix biome lint — non-null assertion, template literals

* fix(eval): fix biome errors — replace var with let, fix inner function declaration

* fix(eval): address Greptile P2 issues

- isRunDir: check all subdirs for metadata.json, not just first 3
- eval-runner: guard configPath for dashboard-driven runs (fallback to 'eval')
- load-run: default unknown termination_reason to 'failed' not 'completed'

* feat(eval): make BROWSEROS_BINARY configurable via env var
2026-03-21 22:12:52 +05:30

31 lines
765 B
JSON
Vendored

{
"name": "@browseros/eval",
"version": "0.1.0",
"private": true,
"type": "module",
"scripts": {
"eval": "bun --env-file=.env.development run src/index.ts",
"typecheck": "tsc --noEmit"
},
"dependencies": {
"@anthropic-ai/claude-agent-sdk": "^0.2.63",
"@aws-sdk/client-s3": "^3.1014.0",
"@browseros/server": "workspace:*",
"@browseros/shared": "workspace:*",
"@google/gemini-cli-core": "^0.16.0",
"@google/genai": "1.30.0",
"@modelcontextprotocol/sdk": "^1.25.2",
"ai": "^6.0.94",
"hono": "^4.6.0",
"openai": "^4.0.0",
"sharp": "^0.34.5",
"uuid": "^9.0.0",
"zod": "^3.22.4"
},
"devDependencies": {
"@types/bun": "latest",
"@types/uuid": "^9.0.0",
"typescript": "^5.0.0"
}
}