Compare commits

...

13 Commits

Author SHA1 Message Date
shivammittal274
7ee8dedd53 chore(eval): drop the 60-char truncation on grader expected/actual values
Some criteria check long strings (job descriptions, post bodies, etc.) —
truncating to 60 chars hides exactly the bytes you need to diff. The
viewer's reasoning area already has max-height + scroll + word-break so
long content scrolls; nothing renders worse for being full-length.
2026-04-30 02:08:30 +05:30
shivammittal274
a3b5ef4da3 chore(eval): show every criterion in agisdk grader message, not just failures
Listing only failures hid the bigger picture — when 1 of 4 criteria fails
you still want to know which 3 passed and what was checked. Now the
message is the full checklist, ✓/✗ per criterion, with expected vs actual
on the failing lines.

Examples:

  All 4 criteria passed.
  ✓ correct job title
  ✓ includes Java skill
  ✓ includes Spring Boot skill
  ✓ includes Angular skill

  2 of 4 criteria failed:
  ✓ correct job title (softened)
  ✓ includes Java skill
  ✗ includes Spring Boot skill: expected True, got False
  ✗ includes Angular skill: expected True, got False
2026-04-30 02:08:07 +05:30
shivammittal274
3333728e4e fix(eval): surface per-criterion descriptions in agisdk grader output
The viewer's grader-reasoning pill was showing "Task not completed
successfully." for every agisdk_state_diff failure. The rich data was
actually available — agisdk's TaskConfig exposes a 'description' (e.g.
"includes Spring Boot skill") and the JMESPath 'query' for each criterion,
zip-aligned 1:1 with info['results'] — we just weren't extracting it.

Now agisdk-evaluate.py emits per-criterion entries with description,
query, expected_value, actual_value, and builds the message as a useful
multi-line summary:

  2 of 4 criteria failed:
  • includes Spring Boot skill: expected True, got False
  • includes Angular skill: expected True, got False

The viewer's grader-reasoning area already has white-space: pre-wrap so
the multi-line message renders correctly. The structured per_criterion
fields are also stored under details.per_criterion in metadata.json for
anyone who wants to grep R2 artifacts directly.
2026-04-30 02:06:51 +05:30
shivammittal274
5c6fd34d3e fix(eval): address Greptile P1+P2 on server log fd handling
P1: openSync was outside the mkdirSync try/catch, so a swallowed mkdir
failure (e.g. unwritable custom BROWSEROS_SERVER_LOG_DIR) would leave the
log directory missing and crash the server spawn with ENOENT. Move openSync
into the same try block; fall back to /dev/null so spawn always succeeds.

P2: the log fd was opened on every server start but never closed. Each
restart attempt leaked one fd across all workers — over a long eval run
that could exhaust the process fd limit. Track the fd on the manager and
closeSync it in killApp() right after the server process exits (the child's
dup keeps the file open until it exits, so we don't truncate output).
2026-04-30 01:16:20 +05:30
shivammittal274
1a1220dff5 chore(eval): run clado weekly headless
Default to headless so the weekly job (and local repros) don't pop ten
visible Chrome windows. Set headless=false locally if you need to watch
a worker.
2026-04-30 00:37:45 +05:30
shivammittal274
dc98858cc3 chore(eval): point clado weekly config at agisdk-real
Switches the orchestrator-executor + Clado weekly config to run on the
AGI SDK / REAL Bench task set with the deterministic agisdk_state_diff
grader. Matches the orchestrator-executor smoke target (Fireworks K2.5
orchestrator + Clado action executor) we want to track week-over-week.
2026-04-30 00:37:45 +05:30
shivammittal274
72cbffe2bb chore(eval): refresh test-clado-api script for new Clado contract
Updated the local smoke-test to match the new Clado endpoint and
response contract:

- New action + health URLs (000159-merged checkpoint).
- Drop the grounding-model branch (orchestrator-executor doesn't
  use it; the README David shared only documents the action model).
- Health-check waits up to 6 minutes for cold start with a 30s
  warning so the operator knows it's spinning up.
- Print every documented response field (action, x/y, text, key,
  direction, amount, drag start/end, time, final_answer, thinking,
  parse_error, inference_time_seconds).
- Three-step run that exercises a click, a typing continuation
  with formatted history, and an end+final_answer probe.
2026-04-30 00:37:44 +05:30
shivammittal274
34fdf08521 feat(eval): align Clado action executor with new endpoint contract
David Shan shared the updated Clado BrowserOS Action Model spec.
Changes to match it:

- Bump endpoint URL + model id to the 000159-merged checkpoint
  (clado-ai--clado-browseros-action-000159-merged-actionmod-f4a6ef)
  in browseros-oe-clado-weekly.json and the README example.
- CLADO_REQUEST_TIMEOUT_MS 120s → 360s. Cold start can take ~5 min;
  the 2-min ceiling was failing every cold-start request.
- Treat HTTP 200 with action=null / parse_error as an INVALID step
  instead of aborting the executor loop. The model can self-correct
  on the next call. Cap consecutive parse failures at 3 to avoid
  infinite loops.
- Capture final_answer from end actions. Surface it in the observation
  back to the orchestrator so its task answer can use the model's
  declared result.
- Add macOS Cmd-* key mappings (M-a, M-c, M-v, M-x → Meta+A/C/V/X).
- Switch screenshot format from webp → png to match the documented
  "PNG or JPEG" contract.
2026-04-30 00:37:44 +05:30
shivammittal274
be6858d589 fix(server): allow Linux to skip OpenClaw via BROWSEROS_SKIP_OPENCLAW=1
Earlier surgical fixes (try/catch in main.ts, lazy chat client port) didn't
unblock dev's Linux CI — same throw kept reproducing. Whether this is bun
caching stale stack frames or a missed eager call site, the safer move is
to fix it at the root: make buildContainerRuntime never throw on Linux
when the runner has explicitly opted out.

Adds BROWSEROS_SKIP_OPENCLAW env check alongside the existing NODE_ENV=test
escape hatch in container-runtime-factory.ts. When set, returns the existing
UnsupportedPlatformTestRuntime stub — server boots normally, /health binds,
any actual OpenClaw API call still fails loudly at request time.

eval-weekly.yml sets the flag for the Linux runner. Darwin behavior and
non-CI Linux behavior unchanged (without the flag they still throw).
2026-04-29 23:18:59 +05:30
shivammittal274
33f68a0d74 fix(server): defer OpenClaw chat client port lookup to request time
apps/server/src/api/server.ts:149 was calling getOpenClawService().getPort()
synchronously when constructing the OpenClawGatewayChatClient inside the
createHttpServer object literal. On non-darwin platforms this throws via
the OpenClawService constructor → buildContainerRuntime, escaping the
try/catch added in 5cf7b765 (which only protected the configureOpenClawService
call further down in main.ts).

Every other getOpenClawService() reference in server.ts is already wrapped
in an arrow function. This was the lone holdout. Make it lazy too: change
the chat client constructor to take getHostPort: () => number instead of
hostPort: number, evaluate it inside streamTurn at request time. Behavior
on darwin is unchanged.

This unblocks dev's eval-weekly CI on Linux runners where OpenClaw isn't
available — the chat endpoint isn't exercised by the eval, so a deferred
throw is acceptable.
2026-04-29 23:10:48 +05:30
shivammittal274
5cf7b765d0 fix(server): catch sync throw from OpenClaw constructor on Linux
The container runtime constructor in OpenClawService throws synchronously
on non-darwin platforms, e.g. GitHub Actions Linux runners. The existing
.catch() on tryAutoStart() only handles async throws inside auto-start —
the sync throw from configureOpenClawService(...) itself propagates up
through Application.start() and crashes the process via index.ts:48
(process.exit(EXIT_CODES.GENERAL_ERROR)).

This is what's been killing dev's eval-weekly CI: the server crashes in
milliseconds, the eval client polls /health, gets nothing, times out.

Fix: wrap the configureOpenClawService call in try/catch matching the
existing .catch() intent (best-effort, don't crash). Server continues
without OpenClaw on platforms where it can't initialize.

Verified by reading captured server stdout from run 25123195126:
  Failed to start server: error: browseros-vm currently supports macOS only
    at buildContainerRuntime (container-runtime-factory.ts:54:11)
    at new OpenClawService (openclaw-service.ts:652:15)
    at configureOpenClawService (openclaw-service.ts:1527:19)
    at start (main.ts:127:5)
2026-04-29 22:57:03 +05:30
shivammittal274
5ed0879d31 fix(eval): capture stdout too — pino logger writes to stdout, not stderr
Previous diagnostic patch only redirected stderr; the captured per-worker
log files came back as 0 bytes because the server uses pino which writes
all log output to stdout (fd 1), not stderr (fd 2). Capture both into
the same file.
2026-04-29 22:44:07 +05:30
shivammittal274
e136094305 chore(eval): instrument server startup to root-cause dev CI health-check timeouts
Three diagnostics + one config swap to investigate why the eval-weekly
workflow has been failing on dev since 2026-04-25 with "Server health
check timed out" (every worker, every retry).

Background:
- Last successful weekly eval on dev: 2026-04-18 (sha f5a2b73)
- Since then, ~30 server commits landed including Lima/VM runtime,
  OpenClaw service, ACL system, ACP SDK — 108 server files changed,
  ~13K LOC added.
- Server process spawns cleanly in CI (PID logged) but never binds
  /health within the 30s eval-side timeout. Static analysis finds no
  obvious blocker; we need runtime evidence.

Changes:

1. apps/server/package.json — add `start:ci` script (no `--watch`).
   The default `start` uses `bun --watch` which forks a child process
   that watches every file in the import graph. Dev's graph is ~108
   files larger than main's; on a cold CI runner the watcher setup is a
   plausible source of multi-second startup overhead.

2. apps/eval/src/runner/browseros-app-manager.ts:
   - Use `start:ci` when `process.env.CI` is set (true on
     GitHub-hosted runners by default), else `start`.
   - Capture per-worker server stderr to /tmp/browseros-server-logs/
     instead of ignoring it. Without this we have no visibility into
     why the server is hung pre-/health.
   - Bump SERVER_HEALTH_TIMEOUT_MS 30s -> 90s. Dev's larger module
     graph may simply need more cold-start time on CI.

3. .github/workflows/eval-weekly.yml — upload the server logs dir as a
   workflow artifact (always, not just on success) so we can post-mortem
   any startup failure on the next run.

4. configs/agisdk-real-smoke.json — swap K2.5 from OpenRouter ->
   Fireworks (bypasses the OpenRouter per-key spend cap that has been
   eating recent runs) and drop num_workers 10 -> 4 (well below the
   Fireworks per-account TPM threshold that overwhelmed the original
   2026-04-23 run).

Plan: trigger the eval-weekly workflow on this branch with the agisdk
config and observe (a) whether it gets past server startup, and
(b) if it doesn't, what the captured server stderr says.
2026-04-29 22:34:32 +05:30
14 changed files with 407 additions and 152 deletions

View File

@@ -71,6 +71,9 @@ jobs:
NOPECHA_API_KEY: ${{ secrets.NOPECHA_API_KEY }}
BROWSEROS_BINARY: /usr/bin/browseros
WEBARENA_INFINITY_DIR: /tmp/webarena-infinity
# OpenClaw container runtime is macOS-only; opt the Linux runner
# into the no-op stub so the server can boot and the eval can run.
BROWSEROS_SKIP_OPENCLAW: '1'
EVAL_CONFIG: ${{ github.event.inputs.config || 'configs/browseros-agent-weekly.json' }}
run: |
echo "Running eval with config: $EVAL_CONFIG"
@@ -109,3 +112,11 @@ jobs:
with:
name: eval-report-${{ github.run_id }}
path: /tmp/eval-report.html
- name: Upload server stderr logs (for post-mortem on startup failures)
if: always()
uses: actions/upload-artifact@v4
with:
name: browseros-server-logs-${{ github.run_id }}
path: /tmp/browseros-server-logs/
if-no-files-found: ignore

View File

@@ -66,9 +66,9 @@ The orchestrator works with any LLM provider. The executor can be another LLM, o
},
"executor": {
"provider": "clado-action",
"model": "qwen3-vl-30b-a3b-instruct",
"model": "Qwen3.5-35B-A3B-action-000159-merged",
"apiKey": "",
"baseUrl": "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run"
"baseUrl": "https://clado-ai--clado-browseros-action-000159-merged-actionmod-f4a6ef.modal.run"
}
}
}

View File

@@ -2,13 +2,13 @@
"agent": {
"type": "single",
"provider": "openai-compatible",
"model": "moonshotai/kimi-k2.5",
"apiKey": "OPENROUTER_API_KEY",
"baseUrl": "https://openrouter.ai/api/v1",
"model": "accounts/fireworks/models/kimi-k2p5",
"apiKey": "FIREWORKS_API_KEY",
"baseUrl": "https://api.fireworks.ai/inference/v1",
"supportsImages": true
},
"dataset": "../data/agisdk-real.jsonl",
"num_workers": 10,
"num_workers": 4,
"restart_server_per_task": true,
"browseros": {
"server_url": "http://127.0.0.1:9110",

View File

@@ -9,12 +9,12 @@
},
"executor": {
"provider": "clado-action",
"model": "qwen3-vl-30b-a3b-instruct",
"model": "Qwen3.5-35B-A3B-action-000159-merged",
"apiKey": "",
"baseUrl": "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run"
"baseUrl": "https://clado-ai--clado-browseros-action-000159-merged-actionmod-f4a6ef.modal.run"
}
},
"dataset": "../data/webbench-2of4-50.jsonl",
"dataset": "../data/agisdk-real.jsonl",
"num_workers": 10,
"restart_server_per_task": true,
"browseros": {
@@ -23,11 +23,11 @@
"base_server_port": 9110,
"base_extension_port": 9310,
"load_extensions": false,
"headless": false
"headless": true
},
"captcha": {
"api_key_env": "NOPECHA_API_KEY"
},
"graders": ["performance_grader"],
"graders": ["agisdk_state_diff"],
"timeout_ms": 1800000
}

View File

@@ -81,13 +81,30 @@ def main():
reward_val = float(reward_val) if reward_val is not None else 0.0
results = info.get("results", [])
# `info["results"]` aligns 1:1 with `tc.task.evals` — zip them so we can
# surface the human-readable description and JMESPath query alongside
# the pass/fail. Without this the only feedback was a stringified dict.
evals = list(getattr(tc.task, "evals", []))
per_criterion = []
softened_count = 0
for r in results:
for idx, r in enumerate(results):
passed = bool(r[0])
detail = r[1] if len(r) > 1 else ""
entry: dict = {"passed": passed, "detail": str(detail)}
detail = r[1] if len(r) > 1 else {}
ev = evals[idx] if idx < len(evals) else None
actual_value = expected_value = None
if isinstance(detail, dict):
actual_value = detail.get("actual_value")
expected_value = detail.get("expected_value")
entry: dict = {
"passed": passed,
"description": getattr(ev, "description", "") or "",
"query": getattr(ev, "query", "") or "",
"expected_value": expected_value,
"actual_value": actual_value,
}
if not _STRICT and not passed and _soft_string_match(detail):
entry["passed"] = True
entry["softened"] = True
@@ -100,9 +117,43 @@ def main():
if all_pass and reward_val != 1.0:
reward_val = 1.0
out_message = str(message)
if softened_count and all_pass:
out_message = f"Task passed (with {softened_count} softened string criterion/criteria)."
# Build a useful message: list every criterion with a pass/fail icon
# so the viewer's grader pill shows the full check-list, not just
# failures. This becomes the `reasoning` shown in the viewer.
if not per_criterion:
# Defensive: agisdk returned no criteria — fall back to its message.
out_message = str(message)
else:
failures = [c for c in per_criterion if not c["passed"]]
if all_pass:
header = (
f"All {len(per_criterion)} criteria passed"
+ (
f" ({softened_count} softened)."
if softened_count
else "."
)
)
else:
header = (
f"{len(failures)} of {len(per_criterion)} criteria failed:"
)
lines = []
for c in per_criterion:
icon = "✓" if c["passed"] else "✗"
desc = c["description"] or c["query"] or "<unknown>"
soft = " (softened)" if c.get("softened") else ""
if c["passed"]:
lines.append(f"{icon} {desc}{soft}")
else:
exp_s = repr(c["expected_value"])
act_s = repr(c["actual_value"])
lines.append(
f"{icon} {desc}: expected {exp_s}, got {act_s}"
)
out_message = header + "\n" + "\n".join(lines)
print(
json.dumps(

View File

@@ -1,34 +1,73 @@
/**
* Test script for Clado API endpoints (grounding + action models)
* Smoke-test for the Clado BrowserOS Action endpoint.
*
* Health-checks the model, then runs a generate call and prints every
* field the new contract documents (action, coordinates, text, key,
* direction, scroll/drag fields, wait, end+final_answer, thinking,
* parse_error, raw_response).
*
* Usage:
* bun apps/eval/scripts/test-clado-api.ts [screenshot-path]
*
* If no screenshot provided, captures one from a running BrowserOS server.
* If no screenshot path is given, captures one over MCP from a
* running BrowserOS server (default http://127.0.0.1:9110, override
* with BROWSEROS_URL).
*
* Cold start can take ~5 minutes; the script waits up to 6.
*/
import { readFile } from 'node:fs/promises'
import { resolve } from 'node:path'
const ACTION_URL =
'https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run'
'https://clado-ai--clado-browseros-action-000159-merged-actionmod-f4a6ef.modal.run'
const ACTION_HEALTH_URL =
'https://clado-ai--clado-browseros-action-actionmodel-health.modal.run'
const GROUNDING_URL =
'https://clado-ai--clado-browseros-grounding-groundingmodel-generate.modal.run'
const GROUNDING_HEALTH_URL =
'https://clado-ai--clado-browseros-grounding-groundingmodel-health.modal.run'
'https://clado-ai--clado-browseros-action-000159-merged-actionmod-5e5033.modal.run'
async function checkHealth(name: string, url: string): Promise<boolean> {
console.log(`\n--- ${name} health check ---`)
console.log(` URL: ${url}`)
const COLD_START_BUDGET_MS = 360_000 // 6 min — Clado cold start is ~5 min
const COLD_START_WARN_MS = 30_000
interface CladoResponse {
action?: string | null
thinking?: string | null
raw_response?: string
parse_error?: string | null
inference_time_seconds?: number
x?: number
y?: number
text?: string
key?: string
direction?: string
amount?: number
startX?: number
startY?: number
endX?: number
endY?: number
time?: number
final_answer?: string | null
}
async function checkHealth(): Promise<boolean> {
console.log(`\n--- Action model health ---`)
console.log(` URL: ${ACTION_HEALTH_URL}`)
console.log(
` Note: cold start can take ~5 min; waiting up to ${COLD_START_BUDGET_MS / 1000}s.`,
)
const start = performance.now()
const warn = setTimeout(() => {
console.log(
` ...still waiting (${COLD_START_WARN_MS / 1000}s in) — model is likely cold-starting on Modal.`,
)
}, COLD_START_WARN_MS)
try {
const resp = await fetch(url, { signal: AbortSignal.timeout(30_000) })
const resp = await fetch(ACTION_HEALTH_URL, {
signal: AbortSignal.timeout(COLD_START_BUDGET_MS),
})
const elapsed = ((performance.now() - start) / 1000).toFixed(2)
const body = await resp.text()
console.log(` Status: ${resp.status} (${elapsed}s)`)
console.log(` Body: ${body.slice(0, 200)}`)
console.log(` Body: ${body.slice(0, 400)}`)
return resp.ok
} catch (err) {
const elapsed = ((performance.now() - start) / 1000).toFixed(2)
@@ -36,63 +75,34 @@ async function checkHealth(name: string, url: string): Promise<boolean> {
` FAILED (${elapsed}s): ${err instanceof Error ? err.message : err}`,
)
return false
} finally {
clearTimeout(warn)
}
}
async function testGenerate(
name: string,
url: string,
async function generate(
label: string,
payload: Record<string, unknown>,
): Promise<Record<string, unknown> | null> {
console.log(`\n--- ${name} generate ---`)
console.log(` URL: ${url}`)
): Promise<CladoResponse | null> {
console.log(`\n--- ${label} ---`)
console.log(` URL: ${ACTION_URL}`)
console.log(` Instruction: ${payload.instruction}`)
console.log(
` Image size: ${((payload.image_base64 as string).length / 1024).toFixed(0)} KB (base64)`,
` Image size: ${((payload.image_base64 as string).length / 1024).toFixed(0)} KB (base64)`,
)
if (payload.history) console.log(` History: ${payload.history}`)
if (payload.history && payload.history !== 'None') {
console.log(` History: ${payload.history}`)
}
const start = performance.now()
let resp: Response
try {
const resp = await fetch(url, {
resp = await fetch(ACTION_URL, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(payload),
signal: AbortSignal.timeout(120_000),
signal: AbortSignal.timeout(COLD_START_BUDGET_MS),
})
const elapsed = ((performance.now() - start) / 1000).toFixed(2)
if (!resp.ok) {
const body = await resp.text()
console.log(` FAILED: HTTP ${resp.status} (${elapsed}s)`)
console.log(` Body: ${body.slice(0, 400)}`)
return null
}
const result = (await resp.json()) as Record<string, unknown>
console.log(` Status: ${resp.status} (${elapsed}s)`)
console.log(` Action: ${result.action}`)
if (result.x !== null && result.x !== undefined)
console.log(` Coordinates: (${result.x}, ${result.y})`)
if (result.text)
console.log(` Text: ${(result.text as string).slice(0, 100)}`)
if (result.key) console.log(` Key: ${result.key}`)
if (result.inference_time_seconds)
console.log(` Inference: ${result.inference_time_seconds}s`)
// Show thinking if present
const raw = result.raw_response as string | undefined
if (raw) {
const thinkMatch = raw.match(/<thinking>([\s\S]*?)<\/thinking>/)
if (thinkMatch) {
const thinking = thinkMatch[1].trim()
console.log(
` Thinking: ${thinking.slice(0, 200)}${thinking.length > 200 ? '...' : ''}`,
)
}
}
return result
} catch (err) {
const elapsed = ((performance.now() - start) / 1000).toFixed(2)
console.log(
@@ -100,6 +110,50 @@ async function testGenerate(
)
return null
}
const elapsed = ((performance.now() - start) / 1000).toFixed(2)
if (!resp.ok) {
const body = await resp.text()
console.log(` HTTP ${resp.status} ${resp.statusText} (${elapsed}s)`)
console.log(` Body: ${body.slice(0, 400)}`)
return null
}
const result = (await resp.json()) as CladoResponse
console.log(` HTTP ${resp.status} (${elapsed}s)`)
console.log(` action: ${result.action ?? 'null'}`)
if (result.parse_error) {
console.log(` parse_error: ${result.parse_error}`)
}
if (result.thinking) {
const trimmed = result.thinking.replace(/\s+/g, ' ').trim()
console.log(
` thinking: ${trimmed.slice(0, 240)}${trimmed.length > 240 ? '…' : ''}`,
)
}
if (typeof result.x === 'number' || typeof result.y === 'number') {
console.log(` x, y: ${result.x}, ${result.y}`)
}
if (typeof result.text === 'string')
console.log(` text: ${result.text.slice(0, 120)}`)
if (typeof result.key === 'string')
console.log(` key: ${result.key}`)
if (typeof result.direction === 'string')
console.log(` direction: ${result.direction}`)
if (typeof result.amount === 'number')
console.log(` amount: ${result.amount}`)
if (typeof result.startX === 'number' || typeof result.endX === 'number') {
console.log(
` drag: (${result.startX}, ${result.startY}) → (${result.endX}, ${result.endY})`,
)
}
if (typeof result.time === 'number')
console.log(` time: ${result.time}s`)
if (result.final_answer)
console.log(` final_answer: ${result.final_answer.slice(0, 240)}`)
if (typeof result.inference_time_seconds === 'number')
console.log(` inference_time_seconds: ${result.inference_time_seconds}`)
return result
}
async function loadScreenshot(path?: string): Promise<string> {
@@ -110,10 +164,9 @@ async function loadScreenshot(path?: string): Promise<string> {
return data.toString('base64')
}
// Try to capture from a running BrowserOS server
const serverUrl = process.env.BROWSEROS_URL || 'http://127.0.0.1:9110'
console.log(
`No screenshot path provided. Trying to capture from ${serverUrl}...`,
`No screenshot path provided. Capturing from ${serverUrl} via MCP...`,
)
const { Client } = await import('@modelcontextprotocol/sdk/client/index.js')
@@ -134,82 +187,101 @@ async function loadScreenshot(path?: string): Promise<string> {
arguments: { format: 'png', page: 1 },
})) as { content: Array<{ type: string; data?: string }> }
const imageContent = result.content?.find((c) => c.type === 'image')
if (!imageContent?.data)
throw new Error('No image data in screenshot response')
const image = result.content?.find((c) => c.type === 'image')
if (!image?.data)
throw new Error('No image data in take_screenshot response')
console.log(
`Captured screenshot (${(imageContent.data.length / 1024).toFixed(0)} KB base64)`,
`Captured screenshot (${(image.data.length / 1024).toFixed(0)} KB base64)`,
)
return imageContent.data
return image.data
} finally {
try {
await transport.close()
} catch {}
} catch {
/* ignore */
}
}
}
function summarize(history: CladoResponse[]): string {
if (history.length === 0) return 'None'
return history
.map((h) => {
switch (h.action) {
case 'click':
case 'double_click':
case 'right_click':
case 'hover':
return `${h.action}(${h.x}, ${h.y})`
case 'type':
return `type(${JSON.stringify(h.text ?? '')})`
case 'press_key':
return `press_key(${JSON.stringify(h.key ?? '')})`
case 'scroll':
return `scroll(${h.direction ?? 'down'})`
case 'drag':
return `drag(${h.startX},${h.startY} -> ${h.endX},${h.endY})`
case 'wait':
return `wait(${h.time ?? 1}s)`
case 'end':
return 'end()'
default:
return h.action ?? 'invalid'
}
})
.join(' -> ')
}
async function main() {
const screenshotPath = process.argv[2]
console.log('=== Clado action endpoint smoke test ===')
console.log('=== Clado API Test ===\n')
// Health checks (parallel)
const [actionHealthy, groundingHealthy] = await Promise.all([
checkHealth('Action Model', ACTION_HEALTH_URL),
checkHealth('Grounding Model', GROUNDING_HEALTH_URL),
])
if (!actionHealthy && !groundingHealthy) {
console.log('\nBoth endpoints are down. Exiting.')
const healthy = await checkHealth()
if (!healthy) {
console.log('\nHealth check failed. Exiting.')
process.exit(1)
}
// Load screenshot
let imageBase64: string
try {
imageBase64 = await loadScreenshot(screenshotPath)
imageBase64 = await loadScreenshot(process.argv[2])
} catch (err) {
console.log(
`\nFailed to load screenshot: ${err instanceof Error ? err.message : err}`,
)
console.log(
'Provide a screenshot path: bun apps/eval/scripts/test-clado-api.ts path/to/screenshot.png',
'Pass a path: bun apps/eval/scripts/test-clado-api.ts path/to/screenshot.png',
)
process.exit(1)
}
const instruction = 'Click on the search button or search bar'
const history: CladoResponse[] = []
// Test grounding model
if (groundingHealthy) {
await testGenerate('Grounding Model', GROUNDING_URL, {
instruction,
// Step 1: open task — let the model decide what to do.
const step1 = await generate('Step 1: cold task', {
instruction: 'Find the search bar and click it',
image_base64: imageBase64,
history: 'None',
})
if (step1?.action) history.push(step1)
// Step 2: continuation with history, asks for typing.
if (step1?.action) {
const step2 = await generate('Step 2: with history', {
instruction: 'Type "hello world" into the search bar',
image_base64: imageBase64,
history: summarize(history),
})
} else {
console.log('\nSkipping grounding model (unhealthy)')
if (step2?.action) history.push(step2)
}
// Test action model (no history)
if (actionHealthy) {
const result = await testGenerate('Action Model (step 1)', ACTION_URL, {
instruction,
image_base64: imageBase64,
history: 'None',
})
// Test action model with history (simulate multi-turn)
if (result && result.action === 'click') {
await testGenerate('Action Model (step 2, with history)', ACTION_URL, {
instruction: 'Type "hello world" in the search bar',
image_base64: imageBase64,
history: `click(${result.x}, ${result.y})`,
})
}
} else {
console.log('\nSkipping action model (unhealthy)')
}
// Step 3: ask for end with a final answer to exercise that field.
await generate('Step 3: ask for end+final_answer', {
instruction:
'You have completed the task. Reply with end() and final_answer="done".',
image_base64: imageBase64,
history: summarize(history),
})
console.log('\n=== Done ===')
}

View File

@@ -31,7 +31,7 @@ const PAGE_SCOPED_TOOLS = new Set<string>([
])
interface CladoActionResponse {
action?: string
action?: string | null
x?: number
y?: number
text?: string
@@ -43,8 +43,11 @@ interface CladoActionResponse {
endY?: number
amount?: number
time?: number
final_answer?: string | null
inference_time_seconds?: number
raw_response?: string
thinking?: string | null
parse_error?: string | null
}
interface Viewport {
@@ -65,9 +68,14 @@ interface CladoAction {
endY?: number
amount?: number
time?: number
final_answer?: string
}
type RawActionPayload = Partial<CladoAction>
type RawActionPayload = Partial<Omit<CladoAction, 'final_answer'>> & {
final_answer?: string | null
}
const MAX_CONSECUTIVE_PARSE_FAILURES = 3
interface ActionPoint {
x: number
@@ -135,6 +143,8 @@ export class CladoActionExecutor {
const actionHistory: CladoAction[] = []
let predictionCalls = 0
const thinkingTrace: string[] = []
let consecutiveParseFailures = 0
let finalAnswer: string | undefined
let status: ExecutorResult['status'] = 'done'
let reason = 'Goal executed.'
@@ -209,6 +219,17 @@ export class CladoActionExecutor {
const predictedActions = this.parseActions(prediction)
if (predictedActions.length === 0) {
// Per Clado contract: HTTP 200 with action=null on parse failure.
// Count as an invalid step so the model can self-correct on the
// next call instead of dropping the trajectory.
consecutiveParseFailures++
const parseError =
prediction.parse_error ?? 'no parsable <answer> in raw_response'
actionHistory.push({
action: 'invalid',
text: `parse_error: ${parseError}`,
})
this.stepsUsed++
await this.callbacks.onStepFinish?.({
toolCalls: [
{
@@ -224,14 +245,21 @@ export class CladoActionExecutor {
output: {
prediction: this.summarizePrediction(prediction),
parsedActions: [],
parseError,
consecutiveParseFailures,
},
},
],
})
status = 'blocked'
reason = 'Clado action response did not contain a valid action.'
break
if (consecutiveParseFailures >= MAX_CONSECUTIVE_PARSE_FAILURES) {
status = 'blocked'
reason = `Clado returned ${consecutiveParseFailures} consecutive unparseable responses.`
break
}
continue
}
consecutiveParseFailures = 0
let requestedStop = false
const executionNotes: string[] = []
@@ -272,7 +300,12 @@ export class CladoActionExecutor {
actionHistory.push(predictedAction)
if (predictedAction.action === 'end') {
reason = 'Model requested end() and marked task complete.'
if (predictedAction.final_answer) {
finalAnswer = predictedAction.final_answer
reason = `Model requested end() with final_answer: ${predictedAction.final_answer.slice(0, 240)}`
} else {
reason = 'Model requested end() and marked task complete.'
}
requestedStop = true
break
}
@@ -327,6 +360,7 @@ export class CladoActionExecutor {
actions: actionHistory,
url: this.currentUrl,
thinkingTrace,
finalAnswer,
})
return {
@@ -440,6 +474,10 @@ export class CladoActionExecutor {
endY: typeof payload.endY === 'number' ? payload.endY : undefined,
amount: typeof payload.amount === 'number' ? payload.amount : undefined,
time: typeof payload.time === 'number' ? payload.time : undefined,
final_answer:
typeof payload.final_answer === 'string'
? payload.final_answer
: undefined,
}
}
@@ -578,7 +616,9 @@ export class CladoActionExecutor {
}
case 'end': {
return 'Model requested end().'
return action.final_answer
? `Model requested end() with final_answer: ${action.final_answer.slice(0, 240)}`
: 'Model requested end().'
}
default: {
@@ -588,9 +628,10 @@ export class CladoActionExecutor {
}
private async captureScreenshotBase64(signal?: AbortSignal): Promise<string> {
// Clado contract is PNG or JPEG; use PNG for lossless input.
const result = await this.runTool(
'take_screenshot',
{ format: 'webp', quality: 80 },
{ format: 'png' },
signal,
)
@@ -754,6 +795,11 @@ export class CladoActionExecutor {
'C-S-tab': 'Control+Shift+Tab',
'C-S-n': 'Control+Shift+N',
'C-down': 'Control+ArrowDown',
// macOS Cmd shortcuts (Meta in CDP).
'M-a': 'Meta+A',
'M-c': 'Meta+C',
'M-v': 'Meta+V',
'M-x': 'Meta+X',
'M-f4': 'Alt+F4',
}
return map[raw] ?? raw
@@ -841,7 +887,11 @@ export class CladoActionExecutor {
case 'wait':
return `${action.action}:${action.time ?? 1}`
case 'end':
-return 'end()'
+return action.final_answer
+  ? `end(${action.final_answer.slice(0, 32)})`
+  : 'end()'
case 'invalid':
return `invalid(${(action.text ?? '').slice(0, 40)})`
default:
return action.action
}
@@ -871,6 +921,8 @@ export class CladoActionExecutor {
return `wait(${Math.round(action.time ?? 1)}s)`
case 'end':
return 'end()'
+case 'invalid':
+return 'invalid()'
default:
return action.action
}
@@ -885,8 +937,9 @@ export class CladoActionExecutor {
actions: CladoAction[]
url: string
thinkingTrace: string[]
+finalAnswer?: string
}): string {
-const { status, reason, actions, url, thinkingTrace } = params
+const { status, reason, actions, url, thinkingTrace, finalAnswer } = params
const actionSummary =
actions.length === 0
? 'No actions were executed.'
@@ -907,6 +960,7 @@ export class CladoActionExecutor {
`Status: ${status}`,
`Reason: ${reason}`,
`URL: ${url || 'unknown'}`,
+finalAnswer ? `Final answer: ${finalAnswer}` : '',
'',
'Recent actions:',
actionSummary,
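The final_answer plumbing above reduces to two small pieces: accept an optional string from an untyped model payload, and thread it into the human-readable summaries. A minimal sketch under assumed types (the real Clado interfaces are richer):

```typescript
// Illustrative types; not the actual Clado action schema.
interface ParsedAction {
  action: string
  final_answer?: string
}

// Accept final_answer only when it is a string; anything else is treated
// as absent, matching the defensive parse in the diff above.
function parseAction(payload: Record<string, unknown>): ParsedAction {
  return {
    action: typeof payload.action === 'string' ? payload.action : 'invalid',
    final_answer:
      typeof payload.final_answer === 'string'
        ? payload.final_answer
        : undefined,
  }
}

// Truncate long answers for log lines, as the executor does (240 chars).
function describeEnd(action: ParsedAction): string {
  return action.final_answer
    ? `Model requested end() with final_answer: ${action.final_answer.slice(0, 240)}`
    : 'Model requested end().'
}
```

The key point is that a non-string `final_answer` (a number, an object) silently degrades to the plain `end()` message instead of crashing the summary builder.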

View File

@@ -5,4 +5,5 @@
export const DEFAULT_TIMEOUT_MS = 30 * 60 * 1000 // 30 minutes
export const SCREENSHOT_TIMEOUT_MS = 65_000 // 65s — ensures we get extension's error (60s)
export const MAX_ACTIONS_PER_DELEGATION = 15
-export const CLADO_REQUEST_TIMEOUT_MS = 120_000
+// Cold start can take ~5 minutes per Clado; 6 minutes leaves headroom.
+export const CLADO_REQUEST_TIMEOUT_MS = 360_000

View File

@@ -14,8 +14,11 @@
*/
import {
+closeSync,
existsSync,
+mkdirSync,
mkdtempSync,
+openSync,
readFileSync,
rmSync,
writeFileSync,
@@ -33,7 +36,17 @@ export interface EvalPorts {
const MAX_RESTART_ATTEMPTS = 3
const CDP_WAIT_TIMEOUT_MS = 30_000
-const SERVER_HEALTH_TIMEOUT_MS = 30_000
+// Bumped from 30s → 90s while debugging dev-CI startup. Dev's server module
+// graph is ~108 files larger than main's; cold-cache module load on a CI
+// runner can take much longer than the original 30s budget allowed.
+const SERVER_HEALTH_TIMEOUT_MS = 90_000
+// Where per-worker server output (stdout + stderr) is written. Captured
+// (rather than ignored) so eval-weekly.yml can upload these as workflow
+// artifacts on failure for post-mortem debugging. Path is also referenced
+// in the workflow's artifact upload step.
+const SERVER_LOG_DIR =
+  process.env.BROWSEROS_SERVER_LOG_DIR || '/tmp/browseros-server-logs'
const MONOREPO_ROOT = join(
dirname(fileURLToPath(import.meta.url)),
@@ -53,6 +66,7 @@ export class BrowserOSAppManager {
private ports: EvalPorts
private chromeProc: Subprocess | null = null
private serverProc: Subprocess | null = null
+private serverLogFd: number | null = null
private tempDir: string | null = null
private readonly workerIndex: number
private readonly loadExtensions: boolean
@@ -183,15 +197,36 @@ export class BrowserOSAppManager {
VITE_BROWSEROS_SERVER_PORT: String(server),
}
+// Capture both stdout and stderr to a per-worker file so we can
+// post-mortem startup hangs. The server uses pino which writes logs to
+// stdout by default — capturing stderr alone misses everything. The
+// eval-weekly workflow uploads /tmp/browseros-server-logs/ as a workflow
+// artifact on failure.
+// Open the per-worker log file under SERVER_LOG_DIR. If the directory
+// can't be created or the file can't be opened (e.g. unwritable custom
+// BROWSEROS_SERVER_LOG_DIR), fall back to /dev/null so spawn still works.
+const logPath = join(SERVER_LOG_DIR, `server-W${this.workerIndex}.log`)
+let logFd: number
+try {
+  mkdirSync(SERVER_LOG_DIR, { recursive: true })
+  logFd = openSync(logPath, 'a')
+} catch {
+  logFd = openSync('/dev/null', 'w')
+}
+this.serverLogFd = logFd
+// `start:ci` skips `--watch` (no file-watcher overhead in CI). Falls back
+// to the regular `start` script outside CI for the dev-watch experience.
+const startScript = process.env.CI ? 'start:ci' : 'start'
this.serverProc = spawn({
-cmd: ['bun', 'run', '--filter', '@browseros/server', 'start'],
+cmd: ['bun', 'run', '--filter', '@browseros/server', startScript],
cwd: MONOREPO_ROOT,
-stdout: 'ignore',
-stderr: 'ignore',
+stdout: logFd,
+stderr: logFd,
env: serverEnv,
})
console.log(
-` [W${this.workerIndex}] Server started (PID: ${this.serverProc.pid})`,
+` [W${this.workerIndex}] Server started (PID: ${this.serverProc.pid}, logs → ${logPath})`,
)
// --- Wait for Server Health ---
@@ -244,6 +279,18 @@ export class BrowserOSAppManager {
await this.killProcess(this.serverProc)
this.serverProc = null
+// Close the parent's copy of the server log fd. Child kept its own dup
+// until it exited above, so closing here doesn't truncate any output.
+// Without this we'd leak one fd per restart attempt across all workers.
+if (this.serverLogFd !== null) {
+  try {
+    closeSync(this.serverLogFd)
+  } catch {
+    // already closed or invalid — ignore
+  }
+  this.serverLogFd = null
+}
// Kill Chrome (graceful → force)
await this.killProcess(this.chromeProc)
this.chromeProc = null
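The log-fd lifecycle in this file follows a simple pattern: open an append-mode fd (falling back to `/dev/null` when the directory is unwritable), hand it to the spawned child, and close the parent's copy after the child exits. A standalone sketch of that pattern (paths and names here are illustrative, not the eval harness's actual helpers):

```typescript
import { closeSync, mkdirSync, openSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

// Open a per-worker log fd; fall back to /dev/null so the spawn still
// works when the configured log directory cannot be created or written.
function openLogFd(dir: string, name: string): number {
  try {
    mkdirSync(dir, { recursive: true })
    return openSync(join(dir, name), 'a')
  } catch {
    // Unwritable directory: keep the process alive, just drop the output.
    return openSync('/dev/null', 'w')
  }
}

// Close the parent's copy of the fd; the child keeps its own dup until it
// exits, so this never truncates output. Returns null for easy field reset.
function closeLogFd(fd: number | null): null {
  if (fd !== null) {
    try {
      closeSync(fd)
    } catch {
      // already closed or invalid; ignore
    }
  }
  return null
}
```

Calling `closeLogFd` in the teardown path is what prevents the one-fd-per-restart leak the diff comment describes.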

View File

@@ -9,6 +9,7 @@
},
"scripts": {
"start": "bun --watch --env-file=.env.development src/index.ts",
"start:ci": "bun --env-file=.env.development src/index.ts",
"build": "bun ../../scripts/build/server.ts --target=all",
"test": "bun run test:all",
"test:all": "bun run ./tests/__helpers__/run-test-group.ts all",

View File

@@ -146,7 +146,7 @@ export async function createHttpServer(config: HttpServerConfig) {
getVmName: () => VM_NAME,
},
openclawGatewayChat: new OpenClawGatewayChatClient(
-getOpenClawService().getPort(),
+() => getOpenClawService().getPort(),
async () => getOpenClawService().getGatewayToken(),
),
openclawProvisioner: {

View File

@@ -48,7 +48,14 @@ export function buildContainerRuntime(
): ContainerRuntime {
const platform = input.platform ?? process.platform
if (platform !== 'darwin') {
-if (process.env.NODE_ENV === 'test') {
+// BROWSEROS_SKIP_OPENCLAW=1 is the explicit opt-in for non-darwin hosts
+// (e.g. Linux CI runners) where OpenClaw can't actually run but the rest
+// of the server should still come up. Returns a no-op runtime — any
+// OpenClaw API call hitting it will fail loudly at request time.
+if (
+  process.env.NODE_ENV === 'test' ||
+  process.env.BROWSEROS_SKIP_OPENCLAW === '1'
+) {
return new UnsupportedPlatformTestRuntime(input.projectDir)
}
throw unsupportedPlatformError()

View File

@@ -37,7 +37,7 @@ export interface GatewayChatTurnInput {
export class OpenClawGatewayChatClient {
constructor(
-private readonly hostPort: number,
+private readonly getHostPort: () => number,
private readonly getToken: () => Promise<string>,
) {}
@@ -46,7 +46,7 @@ export class OpenClawGatewayChatClient {
): Promise<ReadableStream<AgentStreamEvent>> {
const token = await this.getToken()
const response = await fetch(
-`http://127.0.0.1:${this.hostPort}/v1/chat/completions`,
+`http://127.0.0.1:${this.getHostPort()}/v1/chat/completions`,
{
method: 'POST',
headers: {

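The `hostPort` → `getHostPort` change swaps a snapshot for a getter: the port is resolved on every request, so a client constructed before the OpenClaw service (re)starts still reaches the current port. A minimal sketch of the pattern with hypothetical class names:

```typescript
// Stand-in for the OpenClaw service, whose port can change on restart.
class Service {
  private port = 0
  setPort(p: number): void {
    this.port = p
  }
  getPort(): number {
    return this.port
  }
}

// Client takes a getter, not a number, so the port is read per request.
class ChatClient {
  constructor(private readonly getHostPort: () => number) {}
  endpoint(): string {
    return `http://127.0.0.1:${this.getHostPort()}/v1/chat/completions`
  }
}
```

Had the constructor taken `service.getPort()` directly, a client built before the service picked its port (or after a restart moved it) would keep dialing the stale number forever.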
View File

@@ -126,17 +126,28 @@ export class Application {
this.logStartupSummary()
startSkillSync()
-configureOpenClawService({
-  browserosServerPort: this.config.serverPort,
-  resourcesDir,
-  vmCache: this.vmCacheConfig(),
-})
-  .tryAutoStart()
-  .catch((err) =>
-    logger.warn('OpenClaw auto-start failed', {
-      error: err instanceof Error ? err.message : String(err),
-    }),
-  )
+// OpenClaw is best-effort — a failure here must not crash the server.
+// The container runtime constructor throws synchronously on non-darwin
+// (e.g. Linux CI runners), and the .catch() on tryAutoStart() only
+// handles async throws inside auto-start. Wrap both in try/catch so the
+// process keeps running even when OpenClaw can't initialize at all.
+try {
+  configureOpenClawService({
+    browserosServerPort: this.config.serverPort,
+    resourcesDir,
+    vmCache: this.vmCacheConfig(),
+  })
+    .tryAutoStart()
+    .catch((err) =>
+      logger.warn('OpenClaw auto-start failed', {
+        error: err instanceof Error ? err.message : String(err),
+      }),
+    )
+} catch (err) {
+  logger.warn('OpenClaw configuration failed, continuing without it', {
+    error: err instanceof Error ? err.message : String(err),
+  })
+}
metrics.log('http_server.started', { version: VERSION })
}