BrowserOS

mirror of https://github.com/browseros-ai/BrowserOS.git synced 2026-05-18 19:16:22 +00:00

Author	SHA1	Message	Date
shivammittal274	d383b5e344	feat(eval): add claude-generated run report artifact (#892) * feat(eval): add claude-generated run report artifact * fix(eval): install claude code cli for CI evals * fix(eval): bypass claude code tool permissions * Eval metrics configs (#932) * feat(eval): add agisdk comparison metrics configs * fix(eval): keep cdp crashes from aborting run	2026-05-04 21:09:06 +05:30
Nikhil	ab354d7dd7	fix(ci): restore PAT on actions/checkout for submodule fetch (#898) Without a token on actions/checkout, the action falls back to GITHUB_TOKEN, which has no access to the private internal-docs repo. Submodule clone fails with "repository not found". PAT is back on checkout. PR ops still use GITHUB_TOKEN via the GH_TOKEN env var on the run step. The bot-branch git push uses the credential helper set up by checkout (the PAT, which has Contents: Read and write).	2026-04-30 16:23:58 -07:00
Nikhil	0e779fa344	fix(ci): switch internal-docs sync to PR + auto-merge (#897) Direct push to dev fails the dev ruleset's "Require pull request" rule. Open a tiny PR from a bot branch and enable auto-merge (squash, 0 approvals required) instead. No bypass actor needed — the rule stays strict for everyone, including the bot. PR ops use GITHUB_TOKEN with explicit pull-requests: write permission. The cross-repo PAT is only used to rewrite the SSH submodule URL so internal-docs can be cloned over HTTPS.	2026-04-30 16:17:15 -07:00
Nikhil	1ff92c44b3	feat(internal-docs): scaffold private docs submodule, skills, sync action (#894 ) * feat(internal-docs): scaffold private docs submodule, skills, sync action Adds the OSS-side scaffolding for the internal-docs system: - /document-internal skill — drafts a 1-page feature/architecture/design doc from the current branch's diff, asks four sharp questions, enforces voice rules (no em dashes, banned filler words, 60-line cap on feature notes), then opens a PR to browseros-ai/internal-docs via a tmp clone. - /ask-internal skill — answers team-internal questions by greping internal-docs and the codebase, synthesizing with file:line citations, optionally executing surfaced commands with per-command confirmation, and drafting a new doc + PR if grep returns nothing useful. - .github/workflows/sync-internal-docs.yml — every 4 hours, bumps the submodule pointer on dev directly (no PR; relies on dev branch protection blocking force-push). Skips silently until the submodule is configured. Uses url.insteadOf to rewrite the SSH submodule URL to HTTPS-with-token for the bot, while keeping SSH the local default. - .claude/skills/document-internal/seeds/ — root README and three templates (feature-note, architecture-note, design-spec) ready to copy into the new internal-docs repo on rollout. Design spec: .llm/superpowers/specs/2026-04-30-internal-docs-submodule-design.md Manual prereqs (NOT in this PR — handled out-of-band): 1. Create private repo browseros-ai/internal-docs with branch protection on main. 2. Seed it with the contents of .claude/skills/document-internal/seeds/. 3. Create a bot account, mark as bypass actor on dev branch protection. 4. Add INTERNAL_DOCS_SYNC_TOKEN secret with repo + read access to internal-docs. 5. Once internal-docs exists, on a follow-up branch: git submodule add -b main git@github.com:browseros-ai/internal-docs.git .internal-docs 6. 
Send the team the one-time init snippet for their existing checkouts: git submodule update --init .internal-docs * fix(internal-docs): address Greptile review feedback - Workflow: rebase onto dev before push to handle non-fast-forward race; bump fetch-depth 1->50 so rebase has merge-base history. - Workflow: move INTERNAL_DOCS_SYNC_TOKEN into step env: per Actions credential-injection pattern, instead of inlining in the script body. - Skill (BASE bug): suppress git rev-parse stdout so SHA does not get captured into BASE alongside the literal 'dev'. Was breaking every downstream git log/diff call. - Skill (tmp clone): trap 'rm -rf "$TMP"' EXIT after mktemp so cleanup always runs, even if any subsequent step fails.	2026-04-30 15:04:08 -07:00
Nikhil	492f3fcdf2	feat(openclaw): prewarm ghcr image in vm (#887) * feat(openclaw): add gateway image inspection * feat(openclaw): pull gateway image from registry * refactor(vm): decouple readiness from image cache * refactor(openclaw): remove vm cache from runtime factory * feat(openclaw): detect current gateway image * feat(openclaw): prewarm vm runtime and reuse current gateway * feat(openclaw): prewarm runtime on server startup * refactor(vm): remove browseros image cache runtime * refactor(build-tools): remove openclaw tarball pipeline * chore: self-review fixes * fix(openclaw): suppress prewarm pull progress logs * fix(openclaw): address review feedback * fix(openclaw): resolve review findings * fix(dev): stop stale watch supervisors	2026-04-30 11:18:11 -07:00
Nikhil	cb0c0dd0c1	chore: simplify root test scripts (#886) * chore: simplify root test scripts * fix: avoid chained root test scripts * fix: update test workflow commands * fix: move app test commands into packages	2026-04-30 10:58:08 -07:00
Nikhil	84a79ba0a1	feat: refactor eval pipeline workflow (#875) * feat(eval): add suite variant config bridge * feat(eval): add stable run artifacts * refactor(eval): add shared grader contract * feat(eval): persist grader artifacts * refactor(eval): rename runner layers * refactor(eval): add executor backend boundary * refactor(eval): split clado backend * feat(eval): add workflow compatible cli * feat(eval): add r2 publisher module * ci(eval): migrate weekly workflow to eval cli * docs(eval): document suite pipeline * chore(eval): verify pipeline refactor * fix: address review feedback for PR #875 * docs(eval): add env example * docs(eval): explain suites and variants * chore(eval): organize config layouts * chore(eval): colocate grader python evaluators	2026-04-29 17:21:02 -07:00
Nikhil	c244462b29	fix: use Node 24 GitHub actions (#872)	2026-04-29 15:31:23 -07:00
Nikhil	ebf97f74f6	fix: bound VM agent cache smoke test (#870) * fix: bound VM agent cache smoke test * fix: address review feedback for PR #870	2026-04-29 13:43:37 -07:00
shivammittal274	df0f45dd29	Feat: eval debug dev ci (#869 ) * chore(eval): instrument server startup to root-cause dev CI health-check timeouts Three diagnostics + one config swap to investigate why the eval-weekly workflow has been failing on dev since 2026-04-25 with "Server health check timed out" (every worker, every retry). Background: - Last successful weekly eval on dev: 2026-04-18 (sha `f5a2b73`) - Since then, ~30 server commits landed including Lima/VM runtime, OpenClaw service, ACL system, ACP SDK — 108 server files changed, ~13K LOC added. - Server process spawns cleanly in CI (PID logged) but never binds /health within the 30s eval-side timeout. Static analysis finds no obvious blocker; we need runtime evidence. Changes: 1. apps/server/package.json — add `start:ci` script (no `--watch`). The default `start` uses `bun --watch` which forks a child process that watches every file in the import graph. Dev's graph is ~108 files larger than main's; on a cold CI runner the watcher setup is a plausible source of multi-second startup overhead. 2. apps/eval/src/runner/browseros-app-manager.ts: - Use `start:ci` when `process.env.CI` is set (true on GitHub-hosted runners by default), else `start`. - Capture per-worker server stderr to /tmp/browseros-server-logs/ instead of ignoring it. Without this we have no visibility into why the server is hung pre-/health. - Bump SERVER_HEALTH_TIMEOUT_MS 30s -> 90s. Dev's larger module graph may simply need more cold-start time on CI. 3. .github/workflows/eval-weekly.yml — upload the server logs dir as a workflow artifact (always, not just on success) so we can post-mortem any startup failure on the next run. 4. configs/agisdk-real-smoke.json — swap K2.5 from OpenRouter -> Fireworks (bypasses the OpenRouter per-key spend cap that has been eating recent runs) and drop num_workers 10 -> 4 (well below the Fireworks per-account TPM threshold that overwhelmed the original 2026-04-23 run). 
Plan: trigger the eval-weekly workflow on this branch with the agisdk config and observe (a) whether it gets past server startup, and (b) if it doesn't, what the captured server stderr says. * fix(eval): capture stdout too — pino logger writes to stdout, not stderr Previous diagnostic patch only redirected stderr; the captured per-worker log files came back as 0 bytes because the server uses pino which writes all log output to stdout (fd 1), not stderr (fd 2). Capture both into the same file. * fix(server): catch sync throw from OpenClaw constructor on Linux The container runtime constructor in OpenClawService throws synchronously on non-darwin platforms, e.g. GitHub Actions Linux runners. The existing .catch() on tryAutoStart() only handles async throws inside auto-start — the sync throw from configureOpenClawService(...) itself propagates up through Application.start() and crashes the process via index.ts:48 (process.exit(EXIT_CODES.GENERAL_ERROR)). This is what's been killing dev's eval-weekly CI: the server crashes in milliseconds, the eval client polls /health, gets nothing, times out. Fix: wrap the configureOpenClawService call in try/catch matching the existing .catch() intent (best-effort, don't crash). Server continues without OpenClaw on platforms where it can't initialize. Verified by reading captured server stdout from run 25123195126: Failed to start server: error: browseros-vm currently supports macOS only at buildContainerRuntime (container-runtime-factory.ts:54:11) at new OpenClawService (openclaw-service.ts:652:15) at configureOpenClawService (openclaw-service.ts:1527:19) at start (main.ts:127:5) * fix(server): defer OpenClaw chat client port lookup to request time apps/server/src/api/server.ts:149 was calling getOpenClawService().getPort() synchronously when constructing the OpenClawGatewayChatClient inside the createHttpServer object literal. 
On non-darwin platforms this throws via the OpenClawService constructor → buildContainerRuntime, escaping the try/catch added in `5cf7b765` (which only protected the configureOpenClawService call further down in main.ts). Every other getOpenClawService() reference in server.ts is already wrapped in an arrow function. This was the lone holdout. Make it lazy too: change the chat client constructor to take getHostPort: () => number instead of hostPort: number, evaluate it inside streamTurn at request time. Behavior on darwin is unchanged. This unblocks dev's eval-weekly CI on Linux runners where OpenClaw isn't available — the chat endpoint isn't exercised by the eval, so a deferred throw is acceptable. * fix(server): allow Linux to skip OpenClaw via BROWSEROS_SKIP_OPENCLAW=1 Earlier surgical fixes (try/catch in main.ts, lazy chat client port) didn't unblock dev's Linux CI — same throw kept reproducing. Whether this is bun caching stale stack frames or a missed eager call site, the safer move is to fix it at the root: make buildContainerRuntime never throw on Linux when the runner has explicitly opted out. Adds BROWSEROS_SKIP_OPENCLAW env check alongside the existing NODE_ENV=test escape hatch in container-runtime-factory.ts. When set, returns the existing UnsupportedPlatformTestRuntime stub — server boots normally, /health binds, any actual OpenClaw API call still fails loudly at request time. eval-weekly.yml sets the flag for the Linux runner. Darwin behavior and non-CI Linux behavior unchanged (without the flag they still throw). * feat(eval): align Clado action executor with new endpoint contract David Shan shared the updated Clado BrowserOS Action Model spec. Changes to match it: - Bump endpoint URL + model id to the 000159-merged checkpoint (clado-ai--clado-browseros-action-000159-merged-actionmod-f4a6ef) in browseros-oe-clado-weekly.json and the README example. - CLADO_REQUEST_TIMEOUT_MS 120s → 360s. 
Cold start can take ~5 min; the 2-min ceiling was failing every cold-start request. - Treat HTTP 200 with action=null / parse_error as an INVALID step instead of aborting the executor loop. The model can self-correct on the next call. Cap consecutive parse failures at 3 to avoid infinite loops. - Capture final_answer from end actions. Surface it in the observation back to the orchestrator so its task answer can use the model's declared result. - Add macOS Cmd-* key mappings (M-a, M-c, M-v, M-x → Meta+A/C/V/X). - Switch screenshot format from webp → png to match the documented "PNG or JPEG" contract. * chore(eval): refresh test-clado-api script for new Clado contract Updated the local smoke-test to match the new Clado endpoint and response contract: - New action + health URLs (000159-merged checkpoint). - Drop the grounding-model branch (orchestrator-executor doesn't use it; the README David shared only documents the action model). - Health-check waits up to 6 minutes for cold start with a 30s warning so the operator knows it's spinning up. - Print every documented response field (action, x/y, text, key, direction, amount, drag start/end, time, final_answer, thinking, parse_error, inference_time_seconds). - Three-step run that exercises a click, a typing continuation with formatted history, and an end+final_answer probe. * chore(eval): point clado weekly config at agisdk-real Switches the orchestrator-executor + Clado weekly config to run on the AGI SDK / REAL Bench task set with the deterministic agisdk_state_diff grader. Matches the orchestrator-executor smoke target (Fireworks K2.5 orchestrator + Clado action executor) we want to track week-over-week. * chore(eval): run clado weekly headless Default to headless so the weekly job (and local repros) don't pop ten visible Chrome windows. Set headless=false locally if you need to watch a worker. 
* fix(eval): address Greptile P1+P2 on server log fd handling P1: openSync was outside the mkdirSync try/catch, so a swallowed mkdir failure (e.g. unwritable custom BROWSEROS_SERVER_LOG_DIR) would leave the log directory missing and crash the server spawn with ENOENT. Move openSync into the same try block; fall back to /dev/null so spawn always succeeds. P2: the log fd was opened on every server start but never closed. Each restart attempt leaked one fd across all workers — over a long eval run that could exhaust the process fd limit. Track the fd on the manager and closeSync it in killApp() right after the server process exits (the child's dup keeps the file open until it exits, so we don't truncate output).	2026-04-30 01:33:49 +05:30
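The P1 and P2 fixes described in this commit can be sketched roughly as below. This is a minimal illustration, not the code in browseros-app-manager.ts: the names `openServerLogFd` and `LogFdTracker` and the log file naming are hypothetical, and only the pattern (mkdirSync and openSync in one try block, /dev/null fallback, closeSync after the server exits) comes from the commit message.

```typescript
import { closeSync, mkdirSync, openSync } from "node:fs";
import { join } from "node:path";

// Hypothetical helper: open an append-mode log fd for one worker.
// mkdirSync and openSync share a single try block, so any failure
// (for example an unwritable log directory) falls back to /dev/null
// and the later spawn call always receives a valid fd.
function openServerLogFd(logDir: string, workerIndex: number): number {
  try {
    mkdirSync(logDir, { recursive: true });
    return openSync(join(logDir, `server-worker-${workerIndex}.log`), "a");
  } catch {
    return openSync("/dev/null", "a");
  }
}

// Hypothetical tracker: remembers the current fd so it can be closed
// once the server process has exited, instead of leaking one fd per restart.
class LogFdTracker {
  private fd: number | null = null;

  open(logDir: string, workerIndex: number): number {
    this.close(); // never hold more than one fd per manager
    this.fd = openServerLogFd(logDir, workerIndex);
    return this.fd;
  }

  close(): void {
    if (this.fd !== null) {
      closeSync(this.fd);
      this.fd = null;
    }
  }

  get isOpen(): boolean {
    return this.fd !== null;
  }
}
```

In this sketch, a manager would call `open()` before spawning the server and `close()` from its kill path after the child exits. The child holds its own duplicate of the fd, so closing the parent's copy does not truncate the log.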
Nikhil	edfc5c751c	fix: align OpenClaw gateway image with VM cache (#868) * fix: load OpenClaw gateway image from VM cache * fix: use container port for OpenClaw ACP bridge * fix: address review feedback for PR #868	2026-04-29 12:11:00 -07:00
shivammittal274	231bd6821d	fix(eval): pin agisdk version + exclude 4 invalid tasks (Phase 2 dataset hygiene) (#844 ) * chore(eval): pin agisdk version to prevent silent dataset drift `pip install agisdk` previously fetched whatever version pip resolved at CI time. If agisdk publishes a new version with changed task definitions or grader behavior, the weekly eval silently shifts under our feet — making "did the score move because of code or data?" unanswerable. Pin to agisdk==0.3.5 (the version we currently develop against). Bump intentionally with a documented re-baseline run. * fix(eval): exclude 4 more tasks identified by 8-trial never-passing audit After 8 trials across K2.5 + Opus 4.6 (Phase 1 and Phase 2), 5 tasks never passed. Per-task root-cause investigation via parallel deep-dive subagents flagged 4 of them as fundamentally unfixable in the eval pipeline as it stands; the 5th (`dashdish-5`) is a prompt-rule fix that stays in. - gocalendar-7: goal/grader contradiction. Goal says "move event to July 19, 10 AM"; grader expects `eventsDiff.updated.*.start == "2024-07-18T17:00Z"` (= July 18, 10 AM PDT — same day, 1 hour shift). Even after the Phase 2 HTML5 dnd dispatch fix correctly populates `eventsDiff.updated`, the values are July 19 (matching the goal), which the grader rejects. - staynb-5: grader hardcodes literal `'Oct 13 2025'` and `'Oct 23 2025'` year strings. The staynb date picker interprets bare "Oct 13" as the most-recent-past instance (currently 2024 since today is 2026), not 2025. No agent can produce a persisted date string containing 2025. - staynb-9: under-specified task. Goal says "maximum number of guests supported"; grader requires the very specific string "32 Guests, 16 Infants" — encoding UI knowledge (Adults+Children=Guests display, Infants render separately, per-category cap=16, Pets excluded) that isn't in the prompt. Even Opus 4.6 stopped at 16 across 3 trials. 
- opendining-3: grader requires `contains(booking.date, '2024-07-20')` but the React-controlled date textbox flakily no-ops on `fill`. 3/8 trial pass rate is essentially coin-flip noise driven by tool-fidelity variance rather than agent capability. Removing to reduce score noise; Phase 2 fill post-validate warning helps when it does work, but the task's signal-to-noise is too low for the eval set. Dataset goes from 40 -> 36 tasks. Total EXCLUDED_TASKS now 11 entries. Validated by 8-trial pass-record audit; deep-dive notes saved to plans/audits/.	2026-04-29 22:07:53 +05:30
Nikhil	1946ca0cf8	chore: clean up unused agent sdk (#855)	2026-04-28 17:21:46 -07:00
Nikhil	85bb3f7b42	fix: avoid eager limactl resolution in server tests (#853)	2026-04-28 16:56:41 -07:00
shivammittal274	d9c254053e	refactor(eval): drop unused agents/graders, collapse registries (#847 ) * refactor(eval): drop unused agents/graders, collapse registries Sweep of dead code in the eval app: deleted gemini-computer-use and yutori-navigator agents, fara/webvoyager/mind2web graders, eight debug/analyze/test scripts, three stale planning docs, and the orphaned eval-targets/coordinate-click testbed. With two agents and three graders left, the Map-backed plugin registries were over-engineered — collapsed both into plain switches. Removed the now-dead GraderOptions plumbing (no remaining grader takes API keys), dropped grader_api_key_env/grader_base_url/grader_model from the schema and configs, and de-duped PASS_FAIL_GRADER_ORDER (was defined in three places). Replaced the URL-parsing extractCdpPort hack in single-agent and orchestrator-executor with workerIndex passed cleanly through AgentContext. README and --help text rewritten to match reality. Renamed configs/test_.json to test-.json for kebab-case consistency. Net: ~10,460 LOC removed across 60 files. Typecheck clean, all tests pass. * ci(eval): pull BrowserOS from rolling stable CDN URL The pinned v0.44.0.1 .deb on GitHub releases regressed on Linux — servers start but never become healthy. Switch to the canonical rolling URL at cdn.browseros.com/download/BrowserOS.deb so CI tracks the same stable channel users get from the marketing site.	2026-04-29 02:14:47 +05:30
shivammittal274	01d649da9a	feat(eval): bring deterministic graders to dev + drop omnizon (#824 ) * feat: deterministic eval graders (AGI SDK + WebArena-Infinity) (#664) * feat: add deterministic eval graders (AGI SDK + WebArena-Infinity) Two new benchmark integrations with programmatic grading — no LLM judge. AGI SDK / REAL Bench (52 tasks): - 11 React/Next.js clones of consumer apps (DoorDash, Amazon, Gmail, etc.) - Grader navigates browser to /finish, extracts state diff from <pre> tag - Python verifier checks exact values via jmespath queries WebArena-Infinity (50 hard tasks): - 13 LLM-generated SaaS clones (Gmail, GitLab, Linear, Figma, etc.) - InfinityAppManager starts fresh app server per task per worker - Python verifier calls /api/state and asserts on JSON state Infrastructure: - GraderInput extended with mcpUrl + infinityAppUrl for parallel workers - Each worker gets isolated ports (no cross-worker state contamination) - CI workflow: pip install agisdk, clone webarena-infinity repo * chore: switch eval configs back to kimi-k2p5 * fix: register deterministic graders in pass rate calculation Add agisdk_state_diff and infinity_state to PASS_FAIL_GRADER_ORDER in both runner types and weekly report script, so scores show correctly in the dashboard. * chore: temp switch to opus 4.6 for eval run * chore: restore kimi-k2p5 as default eval config * ci: add timeout and continue-on-error for trend report step * fix(eval): drop omnizon from AGISDK dataset (DMCA takedown) evals-omnizon.vercel.app returns HTTP 451 ("This content has been blocked for legal reasons / DMCA_TAKEDOWN"). All 5 omnizon-* tasks fail grading with "Failed to fetch /finish endpoint: JSON Parse error". Adds an EXCLUDED_WEBSITES set to the dataset builder and regenerates agisdk-real.jsonl (52 → 47 tasks). * fix(eval): correct Infinity port-assignment bugs Two related bugs in the Infinity eval runner that cause silent port collisions / fallbacks under parallel execution: 1. 
build-infinity-dataset.py emitted "app_port" but task-executor and the committed JSONL both read "app_base_port". Re-running the build script would silently make every task fall back to the 8000 default, ignoring per-app port assignments. Renamed the key to match. 2. task-executor derived workerIndex as `base_server_port - 9110`, but parallel-executor doesn't override base_server_port per worker — only server_url. Every worker computed workerIndex = 0, causing all parallel workers to spawn Infinity app servers on the same port. Threading workerIndex explicitly through TaskExecutor instead. Also drops an unused app_name parameter from load_tasks().	2026-04-27 21:35:43 +05:30
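The second port bug above comes down to deriving a per-worker port from an explicit worker index rather than from a shared config value. A minimal sketch, assuming the port is simply the task's `app_base_port` plus the worker index: the offset scheme, the `InfinityTask` shape, and the function name are hypothetical, while the `app_base_port` key and the 8000 default are taken from the commit message.

```typescript
// Hypothetical task shape: only the app_base_port key is from the commit message.
interface InfinityTask {
  app_base_port?: number;
}

// Default named in the commit message as the silent fallback.
const DEFAULT_APP_BASE_PORT = 8000;

// Assumed scheme: each worker offsets the task's base port by its own index,
// so parallel workers never spawn app servers on the same port.
function infinityAppPort(task: InfinityTask, workerIndex: number): number {
  if (workerIndex < 0 || workerIndex % 1 !== 0) {
    throw new RangeError(`workerIndex must be a non-negative integer, got ${workerIndex}`);
  }
  const basePort = task.app_base_port ?? DEFAULT_APP_BASE_PORT;
  return basePort + workerIndex;
}
```

Passing `workerIndex` in explicitly mirrors the fix: if every worker computed index 0 from a shared `base_server_port`, all of them would resolve to the same port.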
Nikhil	c656f6236c	feat: ship Lima template for BrowserOS VM (#787 ) * feat(build-tools): add Lima template for BrowserOS VM * feat(build-tools): remove build-disk pipeline and recipe directory Task 2 verification removed the scripts, recipe directory, workflow, and package scripts. Typecheck remains green here because manifest disk fields are removed in the next task, so the plan's expected missing-import failure does not apply yet. * feat(build-tools): rename VmManifest to AgentManifest, drop disk fields * feat(build): stage Lima template into server resources Verified local-resource staging with: bun scripts/build/server.ts --target=darwin-arm64 --ci. The template was copied to dist/prod/server/darwin-arm64/resources/vm/browseros-vm.yaml and included in the zip. bun run build:server:test still fails on the pre-existing R2 limactl resource with: The specified key does not exist. * docs(build-tools): Lima template dev loop + record D9 Updated the build-tools README in this worktree. Also recorded D9 in the canonical external spec file at /Users/shadowfax/llm/code/browseros-project/grove-ref/browseros-main/specs/decisions.md, which is outside this git checkout. * chore(build-tools): sweep orphaned references to retired disk pipeline * chore: self-review fixes	2026-04-22 17:17:12 -07:00
Nikhil	4d660874ad	feat: consolidate build tools package (#785) * feat(build-tools): scaffold package + cache dir helpers * feat(build-tools): manifest types + R2 helper * feat(build-tools): build-disk script with virt-customize + zstd * feat(build-tools): build-tarball script * feat(build-tools): emit-manifest + cache:sync * ci(build-tools): independent build-vm + build-agent workflows * chore: remove legacy container packages + workflows * fix: address review feedback for PR #785 * fix: stabilize VM build DNS in CI * fix: prioritize arm64 build workflows * fix: keep arm64 VM recipe simple * fix: set VM build DNS in apt command * fix: avoid guest DNS for VM package install * fix: limit VM PR checks to build-tools validation	2026-04-22 16:23:11 -07:00
Nikhil	819887a2c5	feat(vm-container): WS1 VM disk image pipeline (#783 ) * feat(vm-container): ship the WS1 VM disk image pipeline New Bun/TS workspace package @browseros/vm-container that produces a reproducible, versioned Debian 12 + Podman qcow2 disk image for arm64 and x64, and publishes it to Cloudflare R2 under vm/<version>/ with a per- version manifest.json and a latest.json pointer. - virt-customize-driven build with a git-tracked recipe DSL. - zstd-compressed artifacts; sha256 sidecars for compressed + uncompressed. - Public surface at @browseros/vm-container/schema exposes zod-validated VmManifest + R2 key helpers for WS4 to import; /download is a stub landing pad for WS4 to fill in. - Rollback on partial upload failure: any exception after the first successful put deletes all previously uploaded keys for that version. - GHA workflow build-vm-container.yml runs a matrix build per arch on native runners, an x64 Lima boot smoke test, and a gated publish job. - Full unit coverage for arch, r2-keys, manifest, recipe parser, and publish (rollback + happy path via aws-sdk-client-mock). * fix(vm-container): address review comments - Split buildDisk into prepareCustomizedDisk + finalizeArtifacts for testability. - Replace resolvePinnedSha's sentinel-prefix check with a positive sha256-hex regex test, switch base-image.ts placeholder to empty string. - Drop unused R2_VM_PREFIX from .env.example; document CDN_BASE_URL override precedence in README. - Replace SSH host-key explicit list in recipe with `ssh_host_` glob so .pub keys and future key types are also removed. - lima-boot: introduce BunRequestInit type for the unix fetch option and reject empty limactlPath loudly. - Extend publish test suite: mid-manifest-upload failure path verifies both arches' qcow+sha are rolled back and latest.json is never written. - Add missing tests: parseArch('ARM64') case-sensitivity rejection, composeVirtCustomizeArgv unresolved-substitution pass-through. 
fix(vm-container): pin a real Debian snapshot, switch verify to SHA-512, streaming download - Pin Debian base to bookworm/20260413-2447 with real SHA-512 values from upstream SHA512SUMS (the sentinel placeholder never corresponded to a real build). Debian cloud images only publish SHA512SUMS today, so switch base-image verification to SHA-512 throughout: rename BaseImage.sha256 → sha512, manifest field base_image_sha256 → base_image_sha512, base_image.sha256_url → sha512_url, debianSha256SumsUrl → debianSha512SumsUrl. Our own artifact hashes (compressed_sha256, uncompressed_sha256, recipe_sha256) stay SHA-256. - Fix downloadTo: previous Bun.write(dest, response) buffered the entire 300 MB response before writing (100% CPU, empty dir). Replace with a getReader() loop that streams chunks through Bun.file().writer(). - build CLI now auto-derives --version from today's date when omitted (defaults to YYYY.MM.DD-dev1); explicit --version still overrides. Broaden CALVER_REGEX to accept alphanumeric suffixes so -dev1/-rc1 tags are valid. New todayCalver() helper. - Update GHA workflow fallback to github.run_number (shorter) instead of run_id. * fix(vm-container): resolve copy-in paths against recipeDir after substitution The copy-in path resolver checked op.src.startsWith('/') before running the {placeholder} substitution, so an absolute-after-substitution path like {manifest_tmp} → /tmp/vm-dist/manifest-stub-arm64.json was treated as relative and joined against recipeDir, producing a nonexistent path. Check the substituted value for absoluteness via path.isAbsolute. * fix: address review comments for 0422-ws1_vm_disk_pipeline * fix(ci): repair vm-container workflow * fix(ci): expose vm build logs on failure * fix(vm-container): expose base_image_sha256 in manifest per PRD The published manifest contract (consumed by WS4) now uses base_image_sha256 as the PRD specified. 
Internally the build still verifies the downloaded Debian base against the pinned sha512 (that's what Debian actually signs in SHA512SUMS) — then hashes the same bytes as sha256 and records that in the manifest. One extra digest pass of a ~300 MB file; negligible. - manifest.json: base_image_sha256 replaces base_image_sha512; sha512_url removed (not needed — sha256 is the consumer-facing hash). - CLI: --base-image-sha256 override validates against the locally-computed sha256 after download. - BuildResult.baseImage gains sha256 alongside sha512. - Tests updated to the new field. The auth.json bug (reviewer #2) is resolved: the source file is recipe/auth.json and the recipe emits `copy-in auth.json:/etc/containers/` so libguestfs writes /etc/containers/auth.json. * ci(vm-container): fix supermin kernel-read + rename sha512 inputs to sha256 - Ubuntu 24.04 GHA runners ship /boot/vmlinuz-* as mode 0600, which blocks libguestfs's supermin appliance builder when virt-customize runs as a non-root user. Chmod 0644 before the build — canonical CI workaround. - Rename workflow_dispatch input base_image_sha512 → base_image_sha256 and CLI flag --base-image-sha512 → --base-image-sha256 to match the orchestrator's renamed override. * ci(vm-container): give runner KVM access + install passt for libguestfs The supermin fix got us past appliance-build, but virt-customize then hit "passt exited with status 1". The passt networking helper misbehaves when libguestfs falls back to TCG emulation, which happens because the runner user isn't in the kvm group even though /dev/kvm exists on the GHA host. - chmod 0666 /dev/kvm → libguestfs uses hardware acceleration, avoids TCG. - install passt explicitly so the networking helper is present and current. 
* ci(vm-container): disable passt to force libguestfs slirp fallback libguestfs 1.54+ prefers passt for guest networking, but the passt binary on GHA ubuntu-24.04 exits with status 1 when invoked from the appliance — an AppArmor/capability issue that doesn't surface a useful diagnostic. The reliable workaround is to remove passt so libguestfs picks QEMU's built-in user-mode SLIRP as the network backend. SLIRP is slower but functional and doesn't require escalated privileges.	2026-04-22 14:04:00 -07:00
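The version defaulting described in this commit (a `todayCalver()` helper and a broadened `CALVER_REGEX`) might look roughly like the sketch below. The regex, the use of UTC dates, and the `resolveVersion` wrapper are assumptions for illustration. The commit message only states the YYYY.MM.DD-dev1 default, that an explicit --version overrides it, and that alphanumeric suffixes such as -dev1 and -rc1 are accepted.

```typescript
// Assumed pattern: YYYY.MM.DD with an optional alphanumeric suffix.
const CALVER_REGEX = /^\d{4}\.\d{2}\.\d{2}(-[0-9A-Za-z]+)?$/;

function pad2(value: number): string {
  return (value < 10 ? "0" : "") + value;
}

// Builds the default version from a date. UTC is an assumption here;
// the real helper may use local time.
function todayCalver(now: Date = new Date(), suffix: string = "dev1"): string {
  const year = now.getUTCFullYear();
  const month = pad2(now.getUTCMonth() + 1);
  const day = pad2(now.getUTCDate());
  return `${year}.${month}.${day}-${suffix}`;
}

// Hypothetical wrapper: an explicit version wins, otherwise derive one from the date.
function resolveVersion(explicit?: string, now: Date = new Date()): string {
  const version = explicit ?? todayCalver(now);
  if (!CALVER_REGEX.test(version)) {
    throw new Error(`invalid calver version: ${version}`);
  }
  return version;
}
```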
Nikhil	114d5e3a9f	feat: add agent container tarball pipeline (#782) * feat: add agent container tarball pipeline * docs: add agent-container env sample * refactor: simplify agent container pipeline * fix: address review feedback for PR #782 * fix: emit clean matrix JSON in CI * fix: align agent container artifact paths	2026-04-22 13:14:27 -07:00
Nikhil	f5a2b7315c	fix: run all browseros-agent tests from root (#750) * fix: run full browseros-agent test suite * fix: stabilize server test reporting in CI * fix: address PR review feedback * refactor: extract server core test runner * refactor: group server tests by filesystem * fix: align CI suites with server test groups * fix: provision server env for all CI suites * fix: stabilize ci checks * fix: report real test counts in ci	2026-04-17 17:26:44 -07:00
Nikhil	d653883e99	fix(ci): add PR comment with test summary (#724) * fix(ci): add PR comment with test summary and block on failure Add a `comment` job to the test workflow that parses JUnit XML artifacts and posts a sticky PR comment showing pass/fail counts per suite, with failed test names listed in a collapsible section and a link to the run. Guards against fork PRs (read-only token) and stale overlapping runs (skips comment if PR head has moved past our SHA). * fix(ci): use payload SHA for staleness check, handle missing artifacts - Replace context.sha (merge commit SHA) with context.payload.pull_request.head.sha so the staleness guard compares the correct values and the comment actually gets posted - Add continue-on-error to download-artifact so cancelled runs gracefully fall through to the "no test results" message * fix(ci): show warning icon for zero-test suites instead of failure	2026-04-15 21:35:58 -07:00
Nikhil	20067d90c7	fix: stabilize root test suite and SDK browser context (#717) * fix: isolate ACL semantic tests from Bun teardown crash * fix: time out ACL semantic fixture subprocess * fix: run full root test suite and repair sdk browser context * fix: address PR review comments for 0415-fix_all_tests_and_issues * test: temporarily skip sdk suite * test: clarify sdk suite disable message	2026-04-15 17:28:01 -07:00
Dani Akash	452906d3ca	fix: first time run (#696 ) * fix: openclaw creation * fix: request formats * ci: extend code quality to dev	2026-04-14 12:29:53 +05:30
Nikhil	000429277d	fix: isolate server release packaging to ci mode (#629 ) * fix: relax compile-only release env requirements * refactor: add ci mode for server release builds	2026-03-31 20:57:44 -07:00
Nikhil	f0cbf77924	feat: add server release workflow (#627 ) * feat: add server release workflow * fix: address PR review comments for 0331-add_server_release_workflow * refactor: rework 0331-add_server_release_workflow based on feedback * refactor: rework 0331-add_server_release_workflow based on feedback	2026-03-31 17:37:06 -07:00
shivammittal274	565ce18eba	feat: add npm/npx distribution for BrowserOS CLI (#618 ) * feat(cli): skip self-update prompts for package manager installs Checks BROWSEROS_INSTALL_METHOD env var (npm, brew) and skips automatic update checks. Users should use their package manager's update mechanism. FormatNotice now shows the appropriate upgrade command based on install method. * feat(cli): add npm bin wrapper for browseros-cli * feat(cli): add npm postinstall script to download platform binary Downloads the correct platform binary from GitHub releases during npm install, verifies SHA256 checksums, and extracts to .binary directory. * feat(cli): add npm package metadata and README Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: move npm package files to correct monorepo path The bin wrapper and postinstall were created at apps/cli/npm/ instead of packages/browseros-agent/apps/cli/npm/. Moves them to the correct location. * style: use node: protocol for builtin module imports * feat(cli): add Makefile npm targets and release workflow npm publish step Adds npm-version and npm-publish Makefile targets for version sync. Adds Node.js setup and npm publish step to the release workflow. Adds npm/npx install instructions to release notes template. * fix(cli): fail on missing checksum entry and limit redirect depth - Abort if checksums.txt downloaded but archive entry is missing - Warn if checksums.txt itself failed to download - Cap redirect depth at 5 to prevent stack overflow on circular redirects * fix(cli): match install.sh checksum behavior — warn instead of abort The existing shell installer (install.sh) warns and continues when the checksum entry is missing from checksums.txt. Match that behavior in the npm postinstall to avoid unnecessary install failures. Both files come from the same GitHub release, so the checksum is a corruption check, not a strong security boundary. 
--------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 22:30:58 +05:30
Nikhil	ace9307878	feat: add browseros-cli self-updater (#605 ) * feat: add browseros-cli self-updater * fix: address review comments for 0327-cli_self_updater * fix: address PR review comments for 0327-cli_self_updater * fix: replace goreleaser with Makefile-based release build Remove .goreleaser.yml (required Pro license for monorepo field) and consolidate cross-compilation into `make release`. CI now uses the same Makefile target, fixing a bug where POSTHOG_API_KEY was missing from release ldflags. * fix: address critical self-updater bugs from code review - Fix SHA256 checksum mismatch: verify archive checksum before extraction instead of verifying extracted binary against archive hash (was always failing). Add VerifyChecksum() and integration test. - Fix JSON field name mismatch: TypeScript was emitting camelCase (publishedAt, archiveFormat) but Go expected snake_case (published_at, archive_format). Manifest parsing was silently broken. - Add decompression size limit (256 MB) to prevent zip/gzip bombs. - Don't update LastCheckedAt on transient errors so retry happens on next CLI invocation instead of waiting 24h.	2026-03-27 14:52:54 -07:00
Nikhil	42c3e8fe01	fix: standardize release names to "BrowserOS <Product> - vX.Y.Z" format (#604 ) Update workflow release titles for Extension, Agent SDK, and CLI to use consistent branding. Existing GitHub releases also renamed via gh CLI.	2026-03-27 13:17:56 -07:00
Nikhil	6c053a5f29	feat: upload CLI binaries to CDN and gate release to core team (#602 ) * feat: upload CLI binaries to CDN during release and gate workflow to core team - Extend scripts/build/cli/upload.ts with uploadCliRelease() that pushes archives + checksums to R2 under versioned (cli/v{VERSION}/) and latest (cli/latest/) paths, plus a version.txt for lightweight latest resolution - Update scripts/build/cli.ts entry point with --release/--version/--binaries-dir flags (existing no-args behavior preserved for upload:cli-installers) - Rewrite install.sh and install.ps1 to fetch from cdn.browseros.com instead of GitHub releases API — eliminates rate limits and API dependency - Add environment: release-core to release-cli.yml for core-team gating via GitHub environment protection rules - Add Bun setup + CDN upload step to the workflow between build and GitHub release * fix: address review feedback for PR #602 - Make loadProdEnv return empty map when .env.production is absent so pickEnv falls through to process.env in CI (Greptile P1) - Add semver format validation for version string in install.sh and install.ps1 to guard against malformed CDN responses - Pass inputs.version via env var instead of inline ${{ }} interpolation to prevent command injection in workflow shell	2026-03-27 11:47:31 -07:00
Nikhil	b7462aa042	fix(cli): move install instructions below What's Changed in release notes (#591 ) The installer block was appearing above the changelog. Reorder so What's Changed comes first and install instructions follow.	2026-03-26 18:16:23 -07:00
Nikhil	279b41fdc4	feat(cli): add install commands to GitHub release notes (#589 ) * feat(cli): add install commands to release notes * fix(cli): add install header to release workflow	2026-03-26 18:04:58 -07:00
shivammittal274	aa85907212	Feat/cli launch ready v2 (#582 ) * fix(cli): use full path for dist artifacts in release step * test: temporarily allow release workflow on any branch * fix(cli): restore main-only guard, remove goreleaser dependency Replaces GoReleaser (Pro-only monorepo feature) with plain go build. Tested: RC release created successfully on branch with all 6 binaries.	2026-03-27 01:28:04 +05:30
shivammittal274	c0578d0e53	Feat/cli launch ready v2 (#580 ) * fix(cli): update goreleaser tag_prefix to match browseros-cli-v* format * fix(cli): replace goreleaser with plain go build for releases GoReleaser free version cannot parse prefixed tags (browseros-cli-v*). monorepo.tag_prefix is a Pro-only feature. Replaced with direct go build + gh release create: - Builds all 6 targets with go build (verified locally) - Creates tar.gz/zip archives with checksums - Uses gh release create to publish - No external tool dependency	2026-03-27 01:12:25 +05:30
Dani Akash	48727750b4	fix: change CLI tag format from cli/v* to browseros-cli-v* (#578 ) GoReleaser free cannot parse slash-prefixed tags (cli/v0.0.1) as semver. Switch to browseros-cli-v0.0.1 format which is valid semver after stripping the prefix. Remove the monorepo config (GoReleaser Pro only).	2026-03-27 00:58:13 +05:30
shivammittal274	6773ce39da	ci(cli): manual dispatch release workflow (#574 ) * ci(cli): change release workflow to manual dispatch from main - Trigger via Actions UI with a version input (e.g. "0.1.0") - Only runs on main branch - Creates git tag cli/v<version> automatically - Then GoReleaser builds all 6 binaries and creates the GitHub Release * feat: add scoped release notes, changelog PR, and idempotent tags to CLI workflow - Add concurrency group to prevent parallel releases - Add scoped release notes from commits touching the CLI directory - Pass release notes to goreleaser via --release-notes flag - Make tag creation idempotent for safe re-runs - Tag the saved release SHA, not HEAD after branching - Add CHANGELOG.md and auto-update via PR with auto-merge - Add pull-requests: write permission --------- Co-authored-by: Dani Akash <DaniAkash@users.noreply.github.com>	2026-03-27 00:41:08 +05:30
Dani Akash	09406ea794	feat: add release workflow for agent extension (#572 ) * feat: add release workflow for agent extension Adds a workflow_dispatch workflow that builds the WXT extension, creates a .zip for sideloading, generates scoped release notes with contributors and PR links, creates a GitHub release with the zip attached, and opens an auto-merge PR to update CHANGELOG.md. * fix: correct API URL to api.browseros.com * fix: remove duplicate PR numbers and contributors from extension release notes Apply the same fixes from the agent-sdk workflow: - Skip PR number if already in commit subject (squash merges) - Remove custom Contributors section (GitHub auto-generates one) - Clean up unused variables * fix: use absolute path for extension zip in release upload * fix: wxt zip already builds, use correct output path - Remove separate build step since wxt zip runs the build internally - Fix zip path from .output/.zip to dist/-chrome.zip * fix: run codegen before wxt zip to generate graphql types	2026-03-27 00:29:47 +05:30
Dani Akash	1f00cbc9cc	feat: add release workflow for agent extension (#566 ) * feat: add release workflow for agent extension Adds a workflow_dispatch workflow that builds the WXT extension, creates a .zip for sideloading, generates scoped release notes with contributors and PR links, creates a GitHub release with the zip attached, and opens an auto-merge PR to update CHANGELOG.md. * fix: correct API URL to api.browseros.com * fix: remove duplicate PR numbers and contributors from extension release notes Apply the same fixes from the agent-sdk workflow: - Skip PR number if already in commit subject (squash merges) - Remove custom Contributors section (GitHub auto-generates one) - Clean up unused variables * fix: use absolute path for extension zip in release upload * fix: wxt zip already builds, use correct output path - Remove separate build step since wxt zip runs the build internally - Fix zip path from .output/.zip to dist/-chrome.zip	2026-03-27 00:23:04 +05:30
Dani Akash	422a829f5e	fix: remove duplicate PR numbers and contributors from release notes (#571 ) - Skip adding PR number if already present in the commit subject (squash merges include "(#123)" automatically) - Remove custom Contributors section since GitHub auto-generates one with avatars at the bottom of every release	2026-03-27 00:07:13 +05:30
Dani Akash	d79c2a4123	feat: create GitHub release with changelog on agent-sdk publish (#564 ) * feat: create GitHub release with changelog on agent-sdk publish After publishing to npm, the workflow now: - Tags the commit as agent-sdk-v<version> - Generates release notes from commits that modified the agent-sdk directory since the last agent-sdk release tag - Creates a GitHub release with those notes First release will show "Initial release" since no previous tag exists. * feat: update CHANGELOG.md on agent-sdk release Add a CHANGELOG.md for @browseros-ai/agent-sdk and update the release workflow to prepend a versioned entry with the release notes before creating the GitHub release. The changelog is committed to main automatically. * fix: address review issues in agent-sdk release workflow - Add explicit permissions: contents: write - Replace sed with head/tail for safe CHANGELOG insertion (fixes double-quote and backslash corruption in commit messages) - Handle empty release notes with "No notable changes." fallback - Make git tag idempotent for workflow reruns (2>/dev/null || true) * fix: use PR with auto-merge for changelog updates Direct push to main fails due to branch protection requiring PRs. Instead, create a branch, open a PR, and auto-merge via squash. * feat: add contributors and PR links to agent-sdk release notes Release notes now include PR numbers (linked automatically by GitHub), GitHub usernames for each commit author, and a contributors section at the bottom. All scoped to commits that modified the agent-sdk path. 
* fix: reorder release steps and fix tag/idempotency issues - Capture release SHA before any branching so the tag always points to the main commit that was built and published to npm - Reorder: generate notes → publish → tag/release → changelog PR (changelog is lowest-stakes, runs last) - Make tag push and release create idempotent for safe re-runs (fall back to gh release edit if release already exists) - Add || true to gh pr merge --auto in case auto-merge is not enabled - Explicit git checkout main before creating changelog branch * fix: explicit error handling for tag/release and contributor dedup - Replace silent || true guards with explicit checks that log what's happening (tag exists, remote tag exists, release exists) so errors are visible instead of swallowed - Fix contributor dedup: use grep -qw (word match) instead of grep -qF (substring match) so "dan" isn't excluded when "dansmith" exists * fix: exclude current version tag when finding previous release On re-runs, the current version's tag already exists on the remote, so PREV_TAG resolves to it and git log produces empty output. Filter it out so release notes are generated against the actual previous version. * ci: prevent concurrent agent-sdk release runs Add concurrency group so multiple dispatches queue instead of racing on the same tag/release/PR.	2026-03-26 23:38:14 +05:30
shivammittal274	e3d57e5347	feat(cli): production-ready CLI with auto-launch, install, and cross-platform builds (#555 ) * feat(cli): production-ready CLI with auto-launch, install, and cross-platform builds - init: accept URL argument and --auto flag for non-interactive setup - install: new command to download BrowserOS app for current platform - launch: auto-detect and launch BrowserOS when server is not running - discovery: prefer server.json (live) over config.yaml (may be stale) - errors: actionable messages guiding users to init/install - goreleaser: cross-platform builds for 6 targets (darwin/linux/windows × amd64/arm64) - ci: GitHub Actions workflow to release CLI binaries on cli/v* tag push * fix(cli): check health status code and add progress dots during launch - Health check in newClient() now verifies HTTP 200, not just no error - waitForServer prints dots during the 30s poll so users know it's working * refactor(cli): make launch an explicit command, remove auto-launch from newClient - launch: new explicit command to find and open BrowserOS app - launch: probes server.json, config, and common ports before launching - launch: if already running, reports URL instead of launching again - init --auto: uses port probing to find running servers - install --deb: errors on non-Linux instead of silently downloading DMG - error messages: guide users to launch/install/init explicitly - removed: auto-launch from newClient() — CLI never does something surprising * fix(cli): platform-native detection, launch, and install for all OSes Detection (isBrowserOSInstalled): - macOS: uses `open -Ra` to query Launch Services (no hardcoded paths) - Linux: checks /usr/bin/browseros (.deb), browseros.desktop, AppImage search - Windows: checks %LOCALAPPDATA%\BrowserOS\Application\BrowserOS.exe and HKCU/HKLM uninstall registry keys Launch (startBrowserOS): - macOS: `open -b com.browseros.BrowserOS` (bundle ID, not path) - Linux: `browseros` binary, AppImage, or `gtk-launch browseros` (fixed: was using xdg-open which opens by MIME type, not desktop files) - Windows: runs BrowserOS.exe from known Chromium per-user install path (fixed: was using `cmd /c start BrowserOS` which doesn't resolve) Install (runPostInstall): - macOS: hdiutil attach → cp -R to /Applications → hdiutil detach - Linux: chmod +x for AppImage, dpkg -i instruction for .deb - Windows: launches installer exe - --deb flag now errors on non-Linux platforms Removed auto-launch from newClient() — CLI never does surprising things. Sources verified from: - packages/browseros/build/common/context.py (binary names per platform) - packages/browseros/build/modules/package/linux.py (.deb structure, .desktop file) - packages/browseros/chromium_patches/chrome/install_static/chromium_install_modes.h (Windows base_app_name="BrowserOS", registry GUID, install paths) - /Applications/BrowserOS.app/Contents/Info.plist (bundle ID)	2026-03-26 23:12:55 +05:30
Dani Akash	392312f203	ci: only run PR title validation on open and edit (#565 ) Remove synchronize and reopened triggers since this workflow only validates the PR title, which doesn't change on new commits or reopen.	2026-03-26 23:06:11 +05:30
shivammittal274	0babc05077	feat(eval): NopeCHA CAPTCHA solver integration (#537 ) * feat(eval): show mean score instead of pass/fail in report and viewer * feat(eval): integrate NopeCHA CAPTCHA solver into eval pipeline Add CAPTCHA detection and waiting so screenshots capture post-solve state. Run headed with xvfb on CI since headless breaks extension content scripts. - Add CaptchaWaiter module (detect reCAPTCHA/hCaptcha/Turnstile, poll until solved) - Add optional `captcha` config block to EvalConfigSchema - Wait for CAPTCHA solve before screenshot in single-agent and orchestrator-executor - Patch NopeCHA manifest with API key before launching workers - Fix CAPTCHA_EXT_DIR path (was pointing one level too high) - Remove --incognito (extensions don't run in incognito; fresh user-data-dir isolates) - CI: install xvfb, run headed via xvfb-run, pass NOPECHA_API_KEY secret	2026-03-24 00:14:16 +05:30
shivammittal274	026c6a03a3	feat(eval): auto-trigger eval on agent/tools changes pushed to main (#528 )	2026-03-23 16:52:30 +05:30
Nikhil	3cc946ded8	fix(ci): report test pass/fail status on PRs (#520 ) The test workflow captured exit codes but never failed the job, so PR checks always showed green even when tests failed. Exit with the captured code in the summarize step so each suite properly reports pass/fail. Not a required check, so failures remain non-blocking. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 11:31:23 -07:00
shivammittal274	0f9d93058f	chore(eval): remove unused env vars from workflow (OPENROUTER, OPENAI) (#522 )	2026-03-21 23:22:03 +05:30
shivammittal274	cafed57832	fix(eval): use CLAUDE_CODE_OAUTH_TOKEN for performance grader auth (#521 )	2026-03-21 23:14:23 +05:30
shivammittal274	f157436e7d	feat(eval): switch to Linux GitHub-hosted runner (#519 ) * feat(eval): switch to ubuntu-latest runner, add OE-Clado config - Switch workflow from self-hosted Mac Studio to ubuntu-latest - Install BrowserOS Linux .deb in CI (no self-hosted runner needed) - Add browseros-oe-clado-weekly.json config for orchestrator-executor - Fix report chart to show date+time (not just date) - Make BROWSEROS_BINARY configurable via env var * feat(eval): add NopeCHA captcha solver extension to eval runs - Auto-load NopeCHA extension in eval Chrome instances - Works in incognito + headless mode - CI workflow downloads NopeCHA before eval - extensions/ directory gitignored (downloaded at runtime) * feat(eval): per-config concurrency — different configs run in parallel * feat(eval): remove concurrency limit — all runs execute in parallel	2026-03-21 23:04:45 +05:30
Nikhil	ba7892322b	ci: run BrowserOS test suites on PRs (#514 ) * ci: run browseros tests on pull requests * refactor: rework 0320-github_action_for_tests based on feedback * refactor: rework 0320-github_action_for_tests based on feedback * chore: add CI artifacts to .gitignore Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: remove mikepenz/action-junit-report to fix check suite misattribution The JUnit report action creates check runs that GitHub associates with the CLA check suite instead of the Tests check suite, causing test reports to appear under "CLA Assistant" in the PR checks UI. Remove the action and rely on job status + step summary + artifact upload for test result visibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 09:46:36 -07:00
shivammittal274	4e90b4561a	feat(eval): weekly eval pipeline with R2 uploads and trend dashboard (#516 ) * feat(eval): weekly eval pipeline with R2 uploads and trend dashboard Add infrastructure for running weekly evaluations and tracking score trends over time: - Auto-generated output dirs: results/{config-name}/{timestamp}/ Each eval run gets its own timestamped folder, nothing is overwritten. - upload-run.ts: uploads eval results to Cloudflare R2. Supports uploading a specific run or all un-uploaded runs for a config. - weekly-report.ts: generates an interactive HTML dashboard from R2 data. Config dropdown, trend chart with hover tooltips, searchable runs table. Groups runs by config name. - viewer.html: client-facing 3-column run viewer (task list, screenshots with autoplay, agent stream with messages.jsonl). Shows performance grader axis breakdown with per-axis scores. - browseros-agent-weekly.json: weekly benchmark config (kimi-k2p5, webbench-2of4-50, 10 workers, performance grader, headless). - eval-weekly.yml: GitHub Actions workflow with cron (Saturday 6am) and manual trigger. Runs on self-hosted Mac Studio runner. Concurrency group ensures only one eval runs at a time. - Dashboard updates: load previous runs, messages.jsonl viewer, grade badges show percentages, async stream loading. - Grader updates: timeout 30min, max turns 100, DOM content verification guidance for performance grader. 
* fix(eval): address Greptile review — injection, nested dirs, escaping - Fix script injection in eval-weekly.yml: pass github.event.inputs through env var instead of interpolating into shell - Fix /api/runs to enumerate nested results/{config}/{timestamp}/ dirs - Fix /api/load-run to allow single-slash run names (config/timestamp) - Add HTML escaping for R2-sourced values in weekly-report.ts - Escape axis names in viewer.html renderAxesBreakdown * fix(eval): fix biome lint — non-null assertion, template literals * fix(eval): fix biome errors — replace var with let, fix inner function declaration * fix(eval): address Greptile P2 issues - isRunDir: check all subdirs for metadata.json, not just first 3 - eval-runner: guard configPath for dashboard-driven runs (fallback to 'eval') - load-run: default unknown termination_reason to 'failed' not 'completed' * feat(eval): make BROWSEROS_BINARY configurable via env var	2026-03-21 22:12:52 +05:30


69 Commits