mirror of
https://github.com/browseros-ai/BrowserOS.git
synced 2026-05-13 15:46:22 +00:00
* feat(eval): weekly eval pipeline with R2 uploads and trend dashboard
Add infrastructure for running weekly evaluations and tracking score
trends over time:
- Auto-generated output dirs: results/{config-name}/{timestamp}/
Each eval run gets its own timestamped folder, nothing is overwritten.
- upload-run.ts: uploads eval results to Cloudflare R2. Supports
uploading a specific run or all un-uploaded runs for a config.
- weekly-report.ts: generates an interactive HTML dashboard from R2
data. Config dropdown, trend chart with hover tooltips, searchable
runs table. Groups runs by config name.
- viewer.html: client-facing 3-column run viewer (task list,
screenshots with autoplay, agent stream with messages.jsonl).
Shows performance grader axis breakdown with per-axis scores.
- browseros-agent-weekly.json: weekly benchmark config (kimi-k2p5,
webbench-2of4-50, 10 workers, performance grader, headless).
- eval-weekly.yml: GitHub Actions workflow with cron (Saturday 6am)
and manual trigger. Runs on self-hosted Mac Studio runner.
Concurrency group ensures only one eval runs at a time.
- Dashboard updates: load previous runs, messages.jsonl viewer,
grade badges show percentages, async stream loading.
- Grader updates: timeout 30min, max turns 100, DOM content
verification guidance for performance grader.
* fix(eval): address Greptile review — injection, nested dirs, escaping
- Fix script injection in eval-weekly.yml: pass github.event.inputs
through env var instead of interpolating into shell
- Fix /api/runs to enumerate nested results/{config}/{timestamp}/ dirs
- Fix /api/load-run to allow single-slash run names (config/timestamp)
- Add HTML escaping for R2-sourced values in weekly-report.ts
- Escape axis names in viewer.html renderAxesBreakdown
* fix(eval): fix biome lint — non-null assertion, template literals
* fix(eval): fix biome errors — replace var with let, fix inner function declaration
* fix(eval): address Greptile P2 issues
- isRunDir: check all subdirs for metadata.json, not just first 3
- eval-runner: guard configPath for dashboard-driven runs (fallback to 'eval')
- load-run: default unknown termination_reason to 'failed' not 'completed'
* feat(eval): make BROWSEROS_BINARY configurable via env var
31 lines
765 B
JSON
Vendored
31 lines
765 B
JSON
Vendored
{
|
|
"name": "@browseros/eval",
|
|
"version": "0.1.0",
|
|
"private": true,
|
|
"type": "module",
|
|
"scripts": {
|
|
"eval": "bun --env-file=.env.development run src/index.ts",
|
|
"typecheck": "tsc --noEmit"
|
|
},
|
|
"dependencies": {
|
|
"@anthropic-ai/claude-agent-sdk": "^0.2.63",
|
|
"@aws-sdk/client-s3": "^3.1014.0",
|
|
"@browseros/server": "workspace:*",
|
|
"@browseros/shared": "workspace:*",
|
|
"@google/gemini-cli-core": "^0.16.0",
|
|
"@google/genai": "1.30.0",
|
|
"@modelcontextprotocol/sdk": "^1.25.2",
|
|
"ai": "^6.0.94",
|
|
"hono": "^4.6.0",
|
|
"openai": "^4.0.0",
|
|
"sharp": "^0.34.5",
|
|
"uuid": "^9.0.0",
|
|
"zod": "^3.22.4"
|
|
},
|
|
"devDependencies": {
|
|
"@types/bun": "latest",
|
|
"@types/uuid": "^9.0.0",
|
|
"typescript": "^5.0.0"
|
|
}
|
|
}
|