Mirror of https://github.com/browseros-ai/BrowserOS.git (synced 2026-05-13 15:46:22 +00:00)
* feat(eval): weekly eval pipeline with R2 uploads and trend dashboard
Add infrastructure for running weekly evaluations and tracking score
trends over time:
- Auto-generated output dirs: results/{config-name}/{timestamp}/
Each eval run gets its own timestamped folder, so nothing is overwritten.
- upload-run.ts: uploads eval results to Cloudflare R2. Supports
uploading a specific run or all un-uploaded runs for a config.
- weekly-report.ts: generates an interactive HTML dashboard from R2
data. Config dropdown, trend chart with hover tooltips, searchable
runs table. Groups runs by config name.
- viewer.html: client-facing 3-column run viewer (task list,
screenshots with autoplay, agent stream with messages.jsonl).
Shows performance grader axis breakdown with per-axis scores.
- browseros-agent-weekly.json: weekly benchmark config (kimi-k2p5,
webbench-2of4-50, 10 workers, performance grader, headless).
- eval-weekly.yml: GitHub Actions workflow with cron (Saturday 6am)
and manual trigger. Runs on self-hosted Mac Studio runner.
Concurrency group ensures only one eval runs at a time.
- Dashboard updates: load previous runs, messages.jsonl viewer,
grade badges show percentages, async stream loading.
- Grader updates: timeout 30min, max turns 100, DOM content
verification guidance for performance grader.
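The auto-generated `results/{config-name}/{timestamp}/` layout described above can be sketched as a small helper; the function name `makeRunDir` and the exact timestamp format are assumptions for illustration, not the repo's actual code:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch: build a results/{config-name}/{timestamp}/ directory per eval run
// so no run overwrites another. Timestamp format is an assumption here.
function makeRunDir(resultsRoot: string, configName: string): string {
  // ISO timestamp with ":" and "." replaced so the path is filesystem-safe,
  // e.g. 2026-05-13T15-46-22-123Z
  const timestamp = new Date().toISOString().replace(/[:.]/g, "-");
  const runDir = path.join(resultsRoot, configName, timestamp);
  fs.mkdirSync(runDir, { recursive: true });
  return runDir;
}
```

Because the timestamp is part of the path, re-running the same config simply adds a sibling folder, which is what lets upload-run.ts detect "all un-uploaded runs for a config" later.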
* fix(eval): address Greptile review — injection, nested dirs, escaping
- Fix script injection in eval-weekly.yml: pass github.event.inputs
through env var instead of interpolating into shell
- Fix /api/runs to enumerate nested results/{config}/{timestamp}/ dirs
- Fix /api/load-run to allow single-slash run names (config/timestamp)
- Add HTML escaping for R2-sourced values in weekly-report.ts
- Escape axis names in viewer.html renderAxesBreakdown
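The HTML-escaping fixes for R2-sourced values and axis names amount to entity-encoding untrusted strings before interpolating them into markup. A minimal sketch (the helper name and entity set are assumptions, not the code in weekly-report.ts or viewer.html):

```typescript
// Sketch: escape a string sourced from R2 (or any untrusted input) before
// embedding it in generated HTML. "&" must be replaced first to avoid
// double-escaping the entities produced by the later replacements.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```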
* fix(eval): fix biome lint — non-null assertion, template literals
* fix(eval): fix biome errors — replace var with let, fix inner function declaration
* fix(eval): address Greptile P2 issues
- isRunDir: check all subdirs for metadata.json, not just first 3
- eval-runner: guard configPath for dashboard-driven runs (fallback to 'eval')
- load-run: default unknown termination_reason to 'failed' not 'completed'
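The isRunDir fix above (check all subdirectories for metadata.json rather than only the first few) might look like the following; this is a hedged sketch under the assumption that a run directory is identified by a metadata.json file, not the repo's actual implementation:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch: a config directory holds runs if ANY timestamped subdirectory
// contains metadata.json — scanning every subdir, not just the first few.
function isRunDir(configDir: string): boolean {
  return fs
    .readdirSync(configDir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .some((entry) =>
      fs.existsSync(path.join(configDir, entry.name, "metadata.json")),
    );
}
```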
* feat(eval): make BROWSEROS_BINARY configurable via env var
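Making BROWSEROS_BINARY configurable presumably means reading the env var with a fallback default. A minimal sketch — the function name and the fallback path shown are assumptions, not values from the repo:

```typescript
// Sketch: resolve the BrowserOS binary path from the BROWSEROS_BINARY env
// var, falling back to a default. The default path here is a placeholder
// assumption, not the project's actual default.
function resolveBrowserosBinary(
  env: Record<string, string | undefined> = process.env,
): string {
  return (
    env.BROWSEROS_BINARY ??
    "/Applications/BrowserOS.app/Contents/MacOS/BrowserOS"
  );
}
```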
27 lines
756 B
JSON
{
  "agent": {
    "type": "single",
    "provider": "openai-compatible",
    "model": "accounts/fireworks/models/kimi-k2p5",
    "apiKey": "FIREWORKS_API_KEY",
    "baseUrl": "https://api.fireworks.ai/inference/v1",
    "supportsImages": true
  },
  "dataset": "../data/webbench-2of4-50.jsonl",
  "num_workers": 10,
  "restart_server_per_task": true,
  "browseros": {
    "server_url": "http://127.0.0.1:9110",
    "base_cdp_port": 9010,
    "base_server_port": 9110,
    "base_extension_port": 9310,
    "load_extensions": false,
    "headless": true
  },
  "graders": ["performance_grader"],
  "grader_api_key_env": "OPENROUTER_API_KEY",
  "grader_base_url": "https://openrouter.ai/api/v1",
  "grader_model": "openai/gpt-4.1",
  "timeout_ms": 1800000
}