mirror of
https://github.com/browseros-ai/BrowserOS.git
synced 2026-05-13 15:46:22 +00:00
* feat(eval): weekly eval pipeline with R2 uploads and trend dashboard
Add infrastructure for running weekly evaluations and tracking score
trends over time:
- Auto-generated output dirs: results/{config-name}/{timestamp}/
Each eval run gets its own timestamped folder, nothing is overwritten.
- upload-run.ts: uploads eval results to Cloudflare R2. Supports
uploading a specific run or all un-uploaded runs for a config.
- weekly-report.ts: generates an interactive HTML dashboard from R2
data. Config dropdown, trend chart with hover tooltips, searchable
runs table. Groups runs by config name.
- viewer.html: client-facing 3-column run viewer (task list,
screenshots with autoplay, agent stream with messages.jsonl).
Shows performance grader axis breakdown with per-axis scores.
- browseros-agent-weekly.json: weekly benchmark config (kimi-k2p5,
webbench-2of4-50, 10 workers, performance grader, headless).
- eval-weekly.yml: GitHub Actions workflow with cron (Saturday 6am)
and manual trigger. Runs on self-hosted Mac Studio runner.
Concurrency group ensures only one eval runs at a time.
- Dashboard updates: load previous runs, messages.jsonl viewer,
grade badges show percentages, async stream loading.
- Grader updates: timeout 30min, max turns 100, DOM content
verification guidance for performance grader.
* fix(eval): address Greptile review — injection, nested dirs, escaping
- Fix script injection in eval-weekly.yml: pass github.event.inputs
through env var instead of interpolating into shell
- Fix /api/runs to enumerate nested results/{config}/{timestamp}/ dirs
- Fix /api/load-run to allow single-slash run names (config/timestamp)
- Add HTML escaping for R2-sourced values in weekly-report.ts
- Escape axis names in viewer.html renderAxesBreakdown
* fix(eval): fix biome lint — non-null assertion, template literals
* fix(eval): fix biome errors — replace var with let, fix inner function declaration
* fix(eval): address Greptile P2 issues
- isRunDir: check all subdirs for metadata.json, not just first 3
- eval-runner: guard configPath for dashboard-driven runs (fallback to 'eval')
- load-run: default unknown termination_reason to 'failed' not 'completed'
* feat(eval): make BROWSEROS_BINARY configurable via env var