mirror of
https://github.com/browseros-ai/BrowserOS.git
synced 2026-05-13 15:46:22 +00:00
feat(eval): weekly eval pipeline with R2 uploads and trend dashboard (#516)
* feat(eval): weekly eval pipeline with R2 uploads and trend dashboard
Add infrastructure for running weekly evaluations and tracking score
trends over time:
- Auto-generated output dirs: results/{config-name}/{timestamp}/
Each eval run gets its own timestamped folder, nothing is overwritten.
- upload-run.ts: uploads eval results to Cloudflare R2. Supports
uploading a specific run or all un-uploaded runs for a config.
- weekly-report.ts: generates an interactive HTML dashboard from R2
data. Config dropdown, trend chart with hover tooltips, searchable
runs table. Groups runs by config name.
- viewer.html: client-facing 3-column run viewer (task list,
screenshots with autoplay, agent stream with messages.jsonl).
Shows performance grader axis breakdown with per-axis scores.
- browseros-agent-weekly.json: weekly benchmark config (kimi-k2p5,
webbench-2of4-50, 10 workers, performance grader, headless).
- eval-weekly.yml: GitHub Actions workflow with cron (Saturday 6am)
and manual trigger. Runs on self-hosted Mac Studio runner.
Concurrency group ensures only one eval runs at a time.
- Dashboard updates: load previous runs, messages.jsonl viewer,
grade badges show percentages, async stream loading.
- Grader updates: timeout 30min, max turns 100, DOM content
verification guidance for performance grader.
* fix(eval): address Greptile review — injection, nested dirs, escaping
- Fix script injection in eval-weekly.yml: pass github.event.inputs
through env var instead of interpolating into shell
- Fix /api/runs to enumerate nested results/{config}/{timestamp}/ dirs
- Fix /api/load-run to allow single-slash run names (config/timestamp)
- Add HTML escaping for R2-sourced values in weekly-report.ts
- Escape axis names in viewer.html renderAxesBreakdown
* fix(eval): fix biome lint — non-null assertion, template literals
* fix(eval): fix biome errors — replace var with let, fix inner function declaration
* fix(eval): address Greptile P2 issues
- isRunDir: check all subdirs for metadata.json, not just first 3
- eval-runner: guard configPath for dashboard-driven runs (fallback to 'eval')
- load-run: default unknown termination_reason to 'failed' not 'completed'
* feat(eval): make BROWSEROS_BINARY configurable via env var
This commit is contained in:
82
.github/workflows/eval-weekly.yml
vendored
Normal file
82
.github/workflows/eval-weekly.yml
vendored
Normal file
@@ -0,0 +1,82 @@
|
||||
name: Weekly Eval
|
||||
|
||||
on:
|
||||
schedule:
|
||||
# Every Saturday at 06:00 UTC
|
||||
- cron: '0 6 * * 6'
|
||||
workflow_dispatch:
|
||||
inputs:
|
||||
config:
|
||||
description: 'Eval config file (relative to apps/eval/)'
|
||||
required: false
|
||||
default: 'configs/browseros-agent-weekly.json'
|
||||
|
||||
concurrency:
|
||||
group: eval-runner
|
||||
cancel-in-progress: false
|
||||
|
||||
permissions:
|
||||
contents: read
|
||||
|
||||
jobs:
|
||||
eval:
|
||||
runs-on: self-hosted
|
||||
timeout-minutes: 360
|
||||
|
||||
steps:
|
||||
- name: Checkout
|
||||
uses: actions/checkout@v4
|
||||
|
||||
- name: Install Bun
|
||||
uses: oven-sh/setup-bun@v2
|
||||
with:
|
||||
bun-version: latest
|
||||
|
||||
- name: Install dependencies
|
||||
working-directory: packages/browseros-agent
|
||||
run: bun install
|
||||
|
||||
- name: Run eval
|
||||
working-directory: packages/browseros-agent/apps/eval
|
||||
env:
|
||||
FIREWORKS_API_KEY: ${{ secrets.FIREWORKS_API_KEY }}
|
||||
OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
|
||||
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
|
||||
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
|
||||
BROWSEROS_BINARY: ${{ secrets.BROWSEROS_BINARY }}
|
||||
EVAL_CONFIG: ${{ github.event.inputs.config || 'configs/browseros-agent-weekly.json' }}
|
||||
run: |
|
||||
echo "Running eval with config: $EVAL_CONFIG"
|
||||
bun run src/index.ts -c "$EVAL_CONFIG"
|
||||
|
||||
- name: Upload runs to R2
|
||||
if: success()
|
||||
working-directory: packages/browseros-agent/apps/eval
|
||||
env:
|
||||
EVAL_R2_ACCOUNT_ID: ${{ secrets.EVAL_R2_ACCOUNT_ID }}
|
||||
EVAL_R2_ACCESS_KEY_ID: ${{ secrets.EVAL_R2_ACCESS_KEY_ID }}
|
||||
EVAL_R2_SECRET_ACCESS_KEY: ${{ secrets.EVAL_R2_SECRET_ACCESS_KEY }}
|
||||
EVAL_R2_BUCKET: ${{ secrets.EVAL_R2_BUCKET }}
|
||||
EVAL_R2_CDN_BASE_URL: ${{ secrets.EVAL_R2_CDN_BASE_URL }}
|
||||
EVAL_CONFIG: ${{ github.event.inputs.config || 'configs/browseros-agent-weekly.json' }}
|
||||
run: |
|
||||
CONFIG_NAME=$(basename "$EVAL_CONFIG" .json)
|
||||
bun scripts/upload-run.ts "results/$CONFIG_NAME"
|
||||
|
||||
- name: Generate trend report
|
||||
if: success()
|
||||
working-directory: packages/browseros-agent
|
||||
env:
|
||||
EVAL_R2_ACCOUNT_ID: ${{ secrets.EVAL_R2_ACCOUNT_ID }}
|
||||
EVAL_R2_ACCESS_KEY_ID: ${{ secrets.EVAL_R2_ACCESS_KEY_ID }}
|
||||
EVAL_R2_SECRET_ACCESS_KEY: ${{ secrets.EVAL_R2_SECRET_ACCESS_KEY }}
|
||||
EVAL_R2_BUCKET: ${{ secrets.EVAL_R2_BUCKET }}
|
||||
EVAL_R2_CDN_BASE_URL: ${{ secrets.EVAL_R2_CDN_BASE_URL }}
|
||||
run: bun apps/eval/scripts/weekly-report.ts /tmp/eval-report.html
|
||||
|
||||
- name: Upload report as artifact
|
||||
if: success()
|
||||
uses: actions/upload-artifact@v4
|
||||
with:
|
||||
name: eval-report-${{ github.run_id }}
|
||||
path: /tmp/eval-report.html
|
||||
Reference in New Issue
Block a user