* chore(eval): instrument server startup to root-cause dev CI health-check timeouts
Three diagnostics + one config swap to investigate why the eval-weekly
workflow has been failing on dev since 2026-04-25 with "Server health
check timed out" (every worker, every retry).
Background:
- Last successful weekly eval on dev: 2026-04-18 (sha f5a2b73)
- Since then, ~30 server commits landed including Lima/VM runtime,
OpenClaw service, ACL system, ACP SDK — 108 server files changed,
~13K LOC added.
- Server process spawns cleanly in CI (PID logged) but never binds
/health within the 30s eval-side timeout. Static analysis finds no
obvious blocker; we need runtime evidence.
Changes:
1. apps/server/package.json — add `start:ci` script (no `--watch`).
The default `start` uses `bun --watch` which forks a child process
that watches every file in the import graph. Dev's graph is ~108
files larger than main's; on a cold CI runner the watcher setup is a
plausible source of multi-second startup overhead.
2. apps/eval/src/runner/browseros-app-manager.ts:
- Use `start:ci` when `process.env.CI` is set (true on
GitHub-hosted runners by default), else `start`.
- Capture per-worker server stderr to /tmp/browseros-server-logs/
instead of ignoring it. Without this we have no visibility into
why the server is hung pre-/health.
- Bump SERVER_HEALTH_TIMEOUT_MS 30s -> 90s. Dev's larger module
graph may simply need more cold-start time on CI.
3. .github/workflows/eval-weekly.yml — upload the server logs dir as a
workflow artifact (always, not just on success) so we can post-mortem
any startup failure on the next run.
4. configs/agisdk-real-smoke.json — swap K2.5 from OpenRouter ->
Fireworks (bypasses the OpenRouter per-key spend cap that has been
eating recent runs) and drop num_workers 10 -> 4 (well below the
Fireworks per-account TPM threshold that overwhelmed the original
2026-04-23 run).
Plan: trigger the eval-weekly workflow on this branch with the agisdk
config and observe (a) whether it gets past server startup, and
(b) if it doesn't, what the captured server stderr says.
* fix(eval): capture stdout too — pino logger writes to stdout, not stderr
Previous diagnostic patch only redirected stderr; the captured per-worker
log files came back as 0 bytes because the server uses pino which writes
all log output to stdout (fd 1), not stderr (fd 2). Capture both into
the same file.
* fix(server): catch sync throw from OpenClaw constructor on Linux
The container runtime constructor in OpenClawService throws synchronously
on non-darwin platforms, e.g. GitHub Actions Linux runners. The existing
.catch() on tryAutoStart() only handles async throws inside auto-start —
the sync throw from configureOpenClawService(...) itself propagates up
through Application.start() and crashes the process via index.ts:48
(process.exit(EXIT_CODES.GENERAL_ERROR)).
This is what's been killing dev's eval-weekly CI: the server crashes in
milliseconds, the eval client polls /health, gets nothing, times out.
Fix: wrap the configureOpenClawService call in try/catch matching the
existing .catch() intent (best-effort, don't crash). Server continues
without OpenClaw on platforms where it can't initialize.
Verified by reading captured server stdout from run 25123195126:
Failed to start server: error: browseros-vm currently supports macOS only
at buildContainerRuntime (container-runtime-factory.ts:54:11)
at new OpenClawService (openclaw-service.ts:652:15)
at configureOpenClawService (openclaw-service.ts:1527:19)
at start (main.ts:127:5)
* fix(server): defer OpenClaw chat client port lookup to request time
apps/server/src/api/server.ts:149 was calling getOpenClawService().getPort()
synchronously when constructing the OpenClawGatewayChatClient inside the
createHttpServer object literal. On non-darwin platforms this throws via
the OpenClawService constructor → buildContainerRuntime, escaping the
try/catch added in 5cf7b765 (which only protected the configureOpenClawService
call further down in main.ts).
Every other getOpenClawService() reference in server.ts is already wrapped
in an arrow function. This was the lone holdout. Make it lazy too: change
the chat client constructor to take getHostPort: () => number instead of
hostPort: number, evaluate it inside streamTurn at request time. Behavior
on darwin is unchanged.
This unblocks dev's eval-weekly CI on Linux runners where OpenClaw isn't
available — the chat endpoint isn't exercised by the eval, so a deferred
throw is acceptable.
* fix(server): allow Linux to skip OpenClaw via BROWSEROS_SKIP_OPENCLAW=1
Earlier surgical fixes (try/catch in main.ts, lazy chat client port) didn't
unblock dev's Linux CI — same throw kept reproducing. Whether this is bun
caching stale stack frames or a missed eager call site, the safer move is
to fix it at the root: make buildContainerRuntime never throw on Linux
when the runner has explicitly opted out.
Adds BROWSEROS_SKIP_OPENCLAW env check alongside the existing NODE_ENV=test
escape hatch in container-runtime-factory.ts. When set, returns the existing
UnsupportedPlatformTestRuntime stub — server boots normally, /health binds,
any actual OpenClaw API call still fails loudly at request time.
eval-weekly.yml sets the flag for the Linux runner. Darwin behavior and
non-CI Linux behavior unchanged (without the flag they still throw).
* feat(eval): align Clado action executor with new endpoint contract
David Shan shared the updated Clado BrowserOS Action Model spec.
Changes to match it:
- Bump endpoint URL + model id to the 000159-merged checkpoint
(clado-ai--clado-browseros-action-000159-merged-actionmod-f4a6ef)
in browseros-oe-clado-weekly.json and the README example.
- CLADO_REQUEST_TIMEOUT_MS 120s → 360s. Cold start can take ~5 min;
the 2-min ceiling was failing every cold-start request.
- Treat HTTP 200 with action=null / parse_error as an INVALID step
instead of aborting the executor loop. The model can self-correct
on the next call. Cap consecutive parse failures at 3 to avoid
infinite loops.
- Capture final_answer from end actions. Surface it in the observation
back to the orchestrator so its task answer can use the model's
declared result.
- Add macOS Cmd-* key mappings (M-a, M-c, M-v, M-x → Meta+A/C/V/X).
- Switch screenshot format from webp → png to match the documented
"PNG or JPEG" contract.
* chore(eval): refresh test-clado-api script for new Clado contract
Updated the local smoke-test to match the new Clado endpoint and
response contract:
- New action + health URLs (000159-merged checkpoint).
- Drop the grounding-model branch (orchestrator-executor doesn't
use it; the README David shared only documents the action model).
- Health-check waits up to 6 minutes for cold start with a 30s
warning so the operator knows it's spinning up.
- Print every documented response field (action, x/y, text, key,
direction, amount, drag start/end, time, final_answer, thinking,
parse_error, inference_time_seconds).
- Three-step run that exercises a click, a typing continuation
with formatted history, and an end+final_answer probe.
* chore(eval): point clado weekly config at agisdk-real
Switches the orchestrator-executor + Clado weekly config to run on the
AGI SDK / REAL Bench task set with the deterministic agisdk_state_diff
grader. Matches the orchestrator-executor smoke target (Fireworks K2.5
orchestrator + Clado action executor) we want to track week-over-week.
* chore(eval): run clado weekly headless
Default to headless so the weekly job (and local repros) don't pop ten
visible Chrome windows. Set headless=false locally if you need to watch
a worker.
* fix(eval): address Greptile P1+P2 on server log fd handling
P1: openSync was outside the mkdirSync try/catch, so a swallowed mkdir
failure (e.g. unwritable custom BROWSEROS_SERVER_LOG_DIR) would leave the
log directory missing and crash the server spawn with ENOENT. Move openSync
into the same try block; fall back to /dev/null so spawn always succeeds.
P2: the log fd was opened on every server start but never closed. Each
restart attempt leaked one fd across all workers — over a long eval run
that could exhaust the process fd limit. Track the fd on the manager and
closeSync it in killApp() right after the server process exits (the child's
dup keeps the file open until it exits, so we don't truncate output).
* refactor(eval): drop unused agents/graders, collapse registries
Sweep of dead code in the eval app: deleted gemini-computer-use and
yutori-navigator agents, fara/webvoyager/mind2web graders, eight
debug/analyze/test scripts, three stale planning docs, and the orphaned
eval-targets/coordinate-click testbed.
With two agents and three graders left, the Map-backed plugin registries
were over-engineered — collapsed both into plain switches. Removed the
now-dead GraderOptions plumbing (no remaining grader takes API keys),
dropped grader_api_key_env/grader_base_url/grader_model from the schema
and configs, and de-duped PASS_FAIL_GRADER_ORDER (was defined in three
places). Replaced the URL-parsing extractCdpPort hack in single-agent
and orchestrator-executor with workerIndex passed cleanly through
AgentContext.
README and --help text rewritten to match reality. Renamed
configs/test_*.json to test-*.json for kebab-case consistency.
Net: ~10,460 LOC removed across 60 files. Typecheck clean, all tests
pass.
* ci(eval): pull BrowserOS from rolling stable CDN URL
The pinned v0.44.0.1 .deb on GitHub releases regressed on Linux —
servers start but never become healthy. Switch to the canonical rolling
URL at cdn.browseros.com/download/BrowserOS.deb so CI tracks the same
stable channel users get from the marketing site.
- Add hover_at, type_at, drag_at coordinate tools to server
- Add hoverAt, typeAt, dragAt methods to Browser class
- Export server internals (browser, tool-loop, registry) for eval imports
- Copy eval app from enterprise repo with agents, graders, runner, dashboard
- Nest eval-targets inside apps/eval
- Adapt sessionExecutionDir → workingDir for current server API
- Add biome ignore for dashboard HTML to prevent lint breaking onclick handlers