mirror of
https://github.com/browseros-ai/BrowserOS.git
synced 2026-05-19 19:41:06 +00:00
* chore(eval): instrument server startup to root-cause dev CI health-check timeouts

Three diagnostics + one config swap to investigate why the eval-weekly workflow has been failing on dev since 2026-04-25 with "Server health check timed out" (every worker, every retry).

Background:
- Last successful weekly eval on dev: 2026-04-18 (sha f5a2b73)
- Since then, ~30 server commits landed including Lima/VM runtime, OpenClaw service, ACL system, ACP SDK: 108 server files changed, ~13K LOC added.
- Server process spawns cleanly in CI (PID logged) but never binds /health within the 30s eval-side timeout. Static analysis finds no obvious blocker; we need runtime evidence.

Changes:
1. apps/server/package.json: add `start:ci` script (no `--watch`). The default `start` uses `bun --watch` which forks a child process that watches every file in the import graph. Dev's graph is ~108 files larger than main's; on a cold CI runner the watcher setup is a plausible source of multi-second startup overhead.
2. apps/eval/src/runner/browseros-app-manager.ts:
   - Use `start:ci` when `process.env.CI` is set (true on GitHub-hosted runners by default), else `start`.
   - Capture per-worker server stderr to /tmp/browseros-server-logs/ instead of ignoring it. Without this we have no visibility into why the server is hung pre-/health.
   - Bump SERVER_HEALTH_TIMEOUT_MS 30s -> 90s. Dev's larger module graph may simply need more cold-start time on CI.
3. .github/workflows/eval-weekly.yml: upload the server logs dir as a workflow artifact (always, not just on success) so we can post-mortem any startup failure on the next run.
4. configs/agisdk-real-smoke.json: swap K2.5 from OpenRouter -> Fireworks (bypasses the OpenRouter per-key spend cap that has been eating recent runs) and drop num_workers 10 -> 4 (well below the Fireworks per-account TPM threshold that overwhelmed the original 2026-04-23 run).

Plan: trigger the eval-weekly workflow on this branch with the agisdk config and observe (a) whether it gets past server startup, and (b) if it doesn't, what the captured server stderr says.

* fix(eval): capture stdout too; pino logger writes to stdout, not stderr

Previous diagnostic patch only redirected stderr; the captured per-worker log files came back as 0 bytes because the server uses pino which writes all log output to stdout (fd 1), not stderr (fd 2). Capture both into the same file.

* fix(server): catch sync throw from OpenClaw constructor on Linux

The container runtime constructor in OpenClawService throws synchronously on non-darwin platforms, e.g. GitHub Actions Linux runners. The existing .catch() on tryAutoStart() only handles async throws inside auto-start. The sync throw from configureOpenClawService(...) itself propagates up through Application.start() and crashes the process via index.ts:48 (process.exit(EXIT_CODES.GENERAL_ERROR)).

This is what's been killing dev's eval-weekly CI: the server crashes in milliseconds, the eval client polls /health, gets nothing, times out.

Fix: wrap the configureOpenClawService call in try/catch matching the existing .catch() intent (best-effort, don't crash). Server continues without OpenClaw on platforms where it can't initialize.

Verified by reading captured server stdout from run 25123195126:

    Failed to start server: error: browseros-vm currently supports macOS only
        at buildContainerRuntime (container-runtime-factory.ts:54:11)
        at new OpenClawService (openclaw-service.ts:652:15)
        at configureOpenClawService (openclaw-service.ts:1527:19)
        at start (main.ts:127:5)

* fix(server): defer OpenClaw chat client port lookup to request time

apps/server/src/api/server.ts:149 was calling getOpenClawService().getPort() synchronously when constructing the OpenClawGatewayChatClient inside the createHttpServer object literal. On non-darwin platforms this throws via the OpenClawService constructor → buildContainerRuntime, escaping the try/catch added in 5cf7b765 (which only protected the configureOpenClawService call further down in main.ts).

Every other getOpenClawService() reference in server.ts is already wrapped in an arrow function. This was the lone holdout. Make it lazy too: change the chat client constructor to take getHostPort: () => number instead of hostPort: number, evaluate it inside streamTurn at request time. Behavior on darwin is unchanged.

This unblocks dev's eval-weekly CI on Linux runners where OpenClaw isn't available. The chat endpoint isn't exercised by the eval, so a deferred throw is acceptable.

* fix(server): allow Linux to skip OpenClaw via BROWSEROS_SKIP_OPENCLAW=1

Earlier surgical fixes (try/catch in main.ts, lazy chat client port) didn't unblock dev's Linux CI; the same throw kept reproducing. Whether this is bun caching stale stack frames or a missed eager call site, the safer move is to fix it at the root: make buildContainerRuntime never throw on Linux when the runner has explicitly opted out.

Adds BROWSEROS_SKIP_OPENCLAW env check alongside the existing NODE_ENV=test escape hatch in container-runtime-factory.ts. When set, returns the existing UnsupportedPlatformTestRuntime stub: server boots normally, /health binds, any actual OpenClaw API call still fails loudly at request time.

eval-weekly.yml sets the flag for the Linux runner. Darwin behavior and non-CI Linux behavior unchanged (without the flag they still throw).

* feat(eval): align Clado action executor with new endpoint contract

David Shan shared the updated Clado BrowserOS Action Model spec. Changes to match it:
- Bump endpoint URL + model id to the 000159-merged checkpoint (clado-ai--clado-browseros-action-000159-merged-actionmod-f4a6ef) in browseros-oe-clado-weekly.json and the README example.
- CLADO_REQUEST_TIMEOUT_MS 120s → 360s. Cold start can take ~5 min; the 2-min ceiling was failing every cold-start request.
- Treat HTTP 200 with action=null / parse_error as an INVALID step instead of aborting the executor loop. The model can self-correct on the next call. Cap consecutive parse failures at 3 to avoid infinite loops.
- Capture final_answer from end actions. Surface it in the observation back to the orchestrator so its task answer can use the model's declared result.
- Add macOS Cmd-* key mappings (M-a, M-c, M-v, M-x → Meta+A/C/V/X).
- Switch screenshot format from webp → png to match the documented "PNG or JPEG" contract.

* chore(eval): refresh test-clado-api script for new Clado contract

Updated the local smoke-test to match the new Clado endpoint and response contract:
- New action + health URLs (000159-merged checkpoint).
- Drop the grounding-model branch (orchestrator-executor doesn't use it; the README David shared only documents the action model).
- Health-check waits up to 6 minutes for cold start with a 30s warning so the operator knows it's spinning up.
- Print every documented response field (action, x/y, text, key, direction, amount, drag start/end, time, final_answer, thinking, parse_error, inference_time_seconds).
- Three-step run that exercises a click, a typing continuation with formatted history, and an end+final_answer probe.

* chore(eval): point clado weekly config at agisdk-real

Switches the orchestrator-executor + Clado weekly config to run on the AGI SDK / REAL Bench task set with the deterministic agisdk_state_diff grader. Matches the orchestrator-executor smoke target (Fireworks K2.5 orchestrator + Clado action executor) we want to track week-over-week.

* chore(eval): run clado weekly headless

Default to headless so the weekly job (and local repros) don't pop ten visible Chrome windows. Set headless=false locally if you need to watch a worker.
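The parse-failure handling described in the Clado executor commit above (an HTTP 200 with action=null or parse_error counts as an INVALID step, capped at 3 consecutive failures) could look roughly like the sketch below. The names `StepResult`, `StepOutcome`, and `makeParseFailureTracker` are hypothetical and not taken from the repository. Whether the third failure itself aborts the loop, or the one after it, is also an assumption here.

```typescript
// Hypothetical shape of one executor step; only the two fields the
// commit message mentions are modelled.
interface StepResult {
  action?: string | null
  parse_error?: string | null
}

type StepOutcome = 'valid' | 'invalid' | 'abort'

// The cap of 3 comes from the commit message; treating the third
// consecutive failure as the abort point is an assumption.
const MAX_CONSECUTIVE_PARSE_FAILURES = 3

function makeParseFailureTracker(limit = MAX_CONSECUTIVE_PARSE_FAILURES) {
  let consecutive = 0
  return (result: StepResult): StepOutcome => {
    const parsed = result.action != null && !result.parse_error
    if (parsed) {
      // Any good step resets the streak, so the model can self-correct.
      consecutive = 0
      return 'valid'
    }
    consecutive += 1
    return consecutive >= limit ? 'abort' : 'invalid'
  }
}

// Usage: feed every response through the tracker and stop on 'abort'.
const classify = makeParseFailureTracker()
const outcomes = [
  classify({ action: null, parse_error: 'no action tag' }),
  classify({ action: 'click' }),
  classify({ action: null }),
  classify({ action: null }),
  classify({ action: null }),
]
console.log(outcomes.join(', '))
```

Counting only consecutive failures, rather than total failures, keeps a long run alive when the model occasionally emits an unparseable step but recovers on the next call.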
* fix(eval): address Greptile P1+P2 on server log fd handling

P1: openSync was outside the mkdirSync try/catch, so a swallowed mkdir failure (e.g. unwritable custom BROWSEROS_SERVER_LOG_DIR) would leave the log directory missing and crash the server spawn with ENOENT. Move openSync into the same try block; fall back to /dev/null so spawn always succeeds.

P2: the log fd was opened on every server start but never closed. Each restart attempt leaked one fd across all workers; over a long eval run that could exhaust the process fd limit. Track the fd on the manager and closeSync it in killApp() right after the server process exits (the child's dup keeps the file open until it exits, so we don't truncate output).
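A minimal sketch of the P1 and P2 fixes described above, assuming a POSIX host where /dev/null exists. `openServerLogFd` and `LogFdTracker` are hypothetical names for illustration; the real logic lives in browseros-app-manager.ts and is not reproduced here.

```typescript
import { closeSync, mkdirSync, openSync } from 'node:fs'
import { join } from 'node:path'

// P1: mkdirSync and openSync share one try block, so a failed mkdir can
// no longer leave openSync pointing at a missing directory. On any
// failure we fall back to /dev/null so the spawn still gets a valid fd.
function openServerLogFd(logDir: string, workerId: number): number {
  try {
    mkdirSync(logDir, { recursive: true })
    return openSync(join(logDir, `server-worker-${workerId}.log`), 'a')
  } catch {
    return openSync('/dev/null', 'a')
  }
}

// P2: remember the fd so it can be closed once the server process exits.
class LogFdTracker {
  private fd: number | null = null

  get isOpen(): boolean {
    return this.fd !== null
  }

  open(logDir: string, workerId: number): number {
    this.close() // never hold more than one fd per manager
    this.fd = openServerLogFd(logDir, workerId)
    return this.fd
  }

  close(): void {
    if (this.fd === null) return
    try {
      closeSync(this.fd)
    } catch {
      /* fd was already closed; nothing to release */
    }
    this.fd = null
  }
}
```

The fd returned by `open` would be passed in the `stdio` array of the spawn call for both stdout and stderr, and `close` would be called from the kill path after the child has exited.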
293 lines
9.0 KiB
TypeScript
Vendored
/**
 * Smoke-test for the Clado BrowserOS Action endpoint.
 *
 * Health-checks the model, then runs up to three generate calls and
 * prints every field the new contract documents (action, coordinates,
 * text, key, direction, scroll/drag fields, wait, end+final_answer,
 * thinking, parse_error, inference_time_seconds).
 *
 * Usage:
 *   bun apps/eval/scripts/test-clado-api.ts [screenshot-path]
 *
 * If no screenshot path is given, captures one over MCP from a
 * running BrowserOS server (default http://127.0.0.1:9110, override
 * with BROWSEROS_URL).
 *
 * Cold start can take ~5 minutes; the script waits up to 6.
 */

import { readFile } from 'node:fs/promises'
import { resolve } from 'node:path'

const ACTION_URL =
  'https://clado-ai--clado-browseros-action-000159-merged-actionmod-f4a6ef.modal.run'
const ACTION_HEALTH_URL =
  'https://clado-ai--clado-browseros-action-000159-merged-actionmod-5e5033.modal.run'

const COLD_START_BUDGET_MS = 360_000 // 6 min; Clado cold start is ~5 min
const COLD_START_WARN_MS = 30_000

interface CladoResponse {
  action?: string | null
  thinking?: string | null
  raw_response?: string
  parse_error?: string | null
  inference_time_seconds?: number
  x?: number
  y?: number
  text?: string
  key?: string
  direction?: string
  amount?: number
  startX?: number
  startY?: number
  endX?: number
  endY?: number
  time?: number
  final_answer?: string | null
}

async function checkHealth(): Promise<boolean> {
  console.log(`\n--- Action model health ---`)
  console.log(` URL: ${ACTION_HEALTH_URL}`)
  console.log(
    ` Note: cold start can take ~5 min; waiting up to ${COLD_START_BUDGET_MS / 1000}s.`,
  )
  const start = performance.now()
  const warn = setTimeout(() => {
    console.log(
      ` ...still waiting (${COLD_START_WARN_MS / 1000}s in); model is likely cold-starting on Modal.`,
    )
  }, COLD_START_WARN_MS)

  try {
    const resp = await fetch(ACTION_HEALTH_URL, {
      signal: AbortSignal.timeout(COLD_START_BUDGET_MS),
    })
    const elapsed = ((performance.now() - start) / 1000).toFixed(2)
    const body = await resp.text()
    console.log(` Status: ${resp.status} (${elapsed}s)`)
    console.log(` Body: ${body.slice(0, 400)}`)
    return resp.ok
  } catch (err) {
    const elapsed = ((performance.now() - start) / 1000).toFixed(2)
    console.log(
      ` FAILED (${elapsed}s): ${err instanceof Error ? err.message : err}`,
    )
    return false
  } finally {
    clearTimeout(warn)
  }
}

async function generate(
  label: string,
  payload: Record<string, unknown>,
): Promise<CladoResponse | null> {
  console.log(`\n--- ${label} ---`)
  console.log(` URL: ${ACTION_URL}`)
  console.log(` Instruction: ${payload.instruction}`)
  console.log(
    ` Image size: ${((payload.image_base64 as string).length / 1024).toFixed(0)} KB (base64)`,
  )
  if (payload.history && payload.history !== 'None') {
    console.log(` History: ${payload.history}`)
  }

  const start = performance.now()
  let resp: Response
  try {
    resp = await fetch(ACTION_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
      signal: AbortSignal.timeout(COLD_START_BUDGET_MS),
    })
  } catch (err) {
    const elapsed = ((performance.now() - start) / 1000).toFixed(2)
    console.log(
      ` FAILED (${elapsed}s): ${err instanceof Error ? err.message : err}`,
    )
    return null
  }
  const elapsed = ((performance.now() - start) / 1000).toFixed(2)

  if (!resp.ok) {
    const body = await resp.text()
    console.log(` HTTP ${resp.status} ${resp.statusText} (${elapsed}s)`)
    console.log(` Body: ${body.slice(0, 400)}`)
    return null
  }

  const result = (await resp.json()) as CladoResponse
  console.log(` HTTP ${resp.status} (${elapsed}s)`)
  console.log(` action: ${result.action ?? 'null'}`)
  if (result.parse_error) {
    console.log(` parse_error: ${result.parse_error}`)
  }
  if (result.thinking) {
    const trimmed = result.thinking.replace(/\s+/g, ' ').trim()
    console.log(
      ` thinking: ${trimmed.slice(0, 240)}${trimmed.length > 240 ? '…' : ''}`,
    )
  }
  if (typeof result.x === 'number' || typeof result.y === 'number') {
    console.log(` x, y: ${result.x}, ${result.y}`)
  }
  if (typeof result.text === 'string')
    console.log(` text: ${result.text.slice(0, 120)}`)
  if (typeof result.key === 'string')
    console.log(` key: ${result.key}`)
  if (typeof result.direction === 'string')
    console.log(` direction: ${result.direction}`)
  if (typeof result.amount === 'number')
    console.log(` amount: ${result.amount}`)
  if (typeof result.startX === 'number' || typeof result.endX === 'number') {
    console.log(
      ` drag: (${result.startX}, ${result.startY}) → (${result.endX}, ${result.endY})`,
    )
  }
  if (typeof result.time === 'number')
    console.log(` time: ${result.time}s`)
  if (result.final_answer)
    console.log(` final_answer: ${result.final_answer.slice(0, 240)}`)
  if (typeof result.inference_time_seconds === 'number')
    console.log(` inference_time_seconds: ${result.inference_time_seconds}`)
  return result
}

async function loadScreenshot(path?: string): Promise<string> {
  if (path) {
    const resolved = resolve(path)
    console.log(`Loading screenshot: ${resolved}`)
    const data = await readFile(resolved)
    return data.toString('base64')
  }

  const serverUrl = process.env.BROWSEROS_URL || 'http://127.0.0.1:9110'
  console.log(
    `No screenshot path provided. Capturing from ${serverUrl} via MCP...`,
  )

  const { Client } = await import('@modelcontextprotocol/sdk/client/index.js')
  const { StreamableHTTPClientTransport } = await import(
    '@modelcontextprotocol/sdk/client/streamableHttp.js'
  )

  const client = new Client({ name: 'clado-test', version: '1.0.0' })
  const transport = new StreamableHTTPClientTransport(
    new URL(`${serverUrl}/mcp`),
    { requestInit: { headers: { 'X-BrowserOS-Source': 'sdk-internal' } } },
  )

  try {
    await client.connect(transport)
    const result = (await client.callTool({
      name: 'take_screenshot',
      arguments: { format: 'png', page: 1 },
    })) as { content: Array<{ type: string; data?: string }> }

    const image = result.content?.find((c) => c.type === 'image')
    if (!image?.data)
      throw new Error('No image data in take_screenshot response')

    console.log(
      `Captured screenshot (${(image.data.length / 1024).toFixed(0)} KB base64)`,
    )
    return image.data
  } finally {
    try {
      await transport.close()
    } catch {
      /* ignore */
    }
  }
}

function summarize(history: CladoResponse[]): string {
  if (history.length === 0) return 'None'
  return history
    .map((h) => {
      switch (h.action) {
        case 'click':
        case 'double_click':
        case 'right_click':
        case 'hover':
          return `${h.action}(${h.x}, ${h.y})`
        case 'type':
          return `type(${JSON.stringify(h.text ?? '')})`
        case 'press_key':
          return `press_key(${JSON.stringify(h.key ?? '')})`
        case 'scroll':
          return `scroll(${h.direction ?? 'down'})`
        case 'drag':
          return `drag(${h.startX},${h.startY} -> ${h.endX},${h.endY})`
        case 'wait':
          return `wait(${h.time ?? 1}s)`
        case 'end':
          return 'end()'
        default:
          return h.action ?? 'invalid'
      }
    })
    .join(' -> ')
}

async function main() {
  console.log('=== Clado action endpoint smoke test ===')

  const healthy = await checkHealth()
  if (!healthy) {
    console.log('\nHealth check failed. Exiting.')
    process.exit(1)
  }

  let imageBase64: string
  try {
    imageBase64 = await loadScreenshot(process.argv[2])
  } catch (err) {
    console.log(
      `\nFailed to load screenshot: ${err instanceof Error ? err.message : err}`,
    )
    console.log(
      'Pass a path: bun apps/eval/scripts/test-clado-api.ts path/to/screenshot.png',
    )
    process.exit(1)
  }

  const history: CladoResponse[] = []

  // Step 1: open task; let the model decide what to do.
  const step1 = await generate('Step 1: cold task', {
    instruction: 'Find the search bar and click it',
    image_base64: imageBase64,
    history: 'None',
  })
  if (step1?.action) history.push(step1)

  // Step 2: continuation with history, asks for typing.
  if (step1?.action) {
    const step2 = await generate('Step 2: with history', {
      instruction: 'Type "hello world" into the search bar',
      image_base64: imageBase64,
      history: summarize(history),
    })
    if (step2?.action) history.push(step2)
  }

  // Step 3: ask for end with a final answer to exercise that field.
  await generate('Step 3: ask for end+final_answer', {
    instruction:
      'You have completed the task. Reply with end() and final_answer="done".',
    image_base64: imageBase64,
    history: summarize(history),
  })

  console.log('\n=== Done ===')
}

main().catch((err) => {
  console.error('Fatal:', err)
  process.exit(1)
})