Compare commits


18 Commits

Author SHA1 Message Date
Nikhil Sonti
8cd7c985e3 fix: address review feedback for PR #891 2026-04-30 12:54:44 -07:00
Nikhil Sonti
548c9dd996 feat(dev): bootstrap setup from dev watch 2026-04-30 12:49:25 -07:00
Nikhil
d38b01a8c7 feat(dev): add guided cleanup and reset commands (#890)
* feat(dev): add guided cleanup and reset commands

* fix: address cleanup reset review feedback
2026-04-30 12:27:15 -07:00
Nikhil
ff36c8412b fix(dev): use run lock for watch cleanup (#889)
* fix(dev): use run lock for watch cleanup

* fix(dev): address watch lock review comments
2026-04-30 11:46:17 -07:00
Nikhil
fd5aba249b fix: stabilize OpenClaw gateway startup (#888)
* feat(server): add shared process lock helper

* feat(container): add container name reconciliation helpers

* feat(openclaw): serialize lifecycle across processes

* fix(openclaw): reconcile fixed gateway container startup

* test(openclaw): cover lifecycle race recovery

* fix(server): satisfy process lock error override

* fix(openclaw): address review feedback

* test(openclaw): align serialization mock with image check
2026-04-30 11:31:40 -07:00
Nikhil
492f3fcdf2 feat(openclaw): prewarm ghcr image in vm (#887)
* feat(openclaw): add gateway image inspection

* feat(openclaw): pull gateway image from registry

* refactor(vm): decouple readiness from image cache

* refactor(openclaw): remove vm cache from runtime factory

* feat(openclaw): detect current gateway image

* feat(openclaw): prewarm vm runtime and reuse current gateway

* feat(openclaw): prewarm runtime on server startup

* refactor(vm): remove browseros image cache runtime

* refactor(build-tools): remove openclaw tarball pipeline

* chore: self-review fixes

* fix(openclaw): suppress prewarm pull progress logs

* fix(openclaw): address review feedback

* fix(openclaw): resolve review findings

* fix(dev): stop stale watch supervisors
2026-04-30 11:18:11 -07:00
Nikhil
cb0c0dd0c1 chore: simplify root test scripts (#886)
* chore: simplify root test scripts

* fix: avoid chained root test scripts

* fix: update test workflow commands

* fix: move app test commands into packages
2026-04-30 10:58:08 -07:00
Dani Akash
8712f89f18 feat(agents): durable per-agent chat message queue + composer Stop (#880)
* feat(agents): durable per-agent chat message queue + composer Stop button

* fix(agents): tighten queue UI — smaller Stop, drop empty indicator, live drain attach

User feedback round 1 on the message-queue UX:

1) The Stop button matched the send/voice mics at h-10 w-10 with a
   solid destructive fill, which read as alarming. Shrunk to h-8 w-8,
   ghost variant with a soft destructive/10 background, smaller
   filled square glyph. Reads as a calm 'stop' affordance instead of
   a panic button.

2) The QueueItem's leading <QueueItemIndicator> dot was decorative
   only — no state, no interaction. Dropped it from QueuePanel along
   with the import; queue items now render as a clean preview line
   with the trailing X remove action.

3) When the server drained the queue and started the next turn, the
   chat panel didn't pick up the live stream until the user
   navigated away and back. The hook's resume effect previously
   only fired on agent change, not on listing-observed activeTurnId
   change. Surface activeTurnId from useHarnessAgents into
   useAgentConversation; effect now re-runs when the id changes,
   calls /chat/active, and attaches to the new turn — so a queued
   message starts streaming the moment the server drain pops it.

* fix(agents): don't reset streaming state from the resume effect's no-op paths

The Stop button was disappearing while the agent was actively
streaming, even though events were still flowing into the chat. Root
cause: the resume effect's `finally` block reset `streaming`,
`turnIdRef`, and `lastSeqRef` unconditionally — including on the
early-return paths (no active turn, or another mechanism already
owns the stream).

Sequence that triggered it:
  1) User sends a message → send() sets streamAbortRef + streaming=true
     and starts consuming the SSE.
  2) User enqueues another message → enqueue mutation invalidates the
     listing query.
  3) Listing refetches with the live activeTurnId → the resume
     effect re-fires (deps include activeTurnIdDep).
  4) attemptResume hits `if (streamAbortRef.current) return` because
     send() owns it.
  5) The finally clause fires anyway and calls setStreaming(false),
     clobbering the live state set by send(). The SSE consumer keeps
     running (refs are intact) so text keeps streaming, but the React
     flag is wrong, so the Stop button gates off.

Fix: track whether *this* run actually started a stream
(`weStartedStream`). The finally only resets state when it does.
Early-return / no-active-turn paths now leave streaming/turnIdRef/
lastSeqRef alone for whoever does own them.
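
A minimal sketch of the ownership guard this fix describes — the hook is simplified and names like `attemptResume` / the `owner` field (standing in for `streamAbortRef`) are illustrative, not the real API:

```typescript
// Simplified model of the resume effect's ownership guard: only the run
// that actually started a stream may reset the shared streaming state.
interface StreamState {
  streaming: boolean;
  owner: AbortController | null; // stands in for streamAbortRef
}

async function attemptResume(
  state: StreamState,
  fetchActiveTurn: () => Promise<string | null>,
  attach: (turnId: string, signal: AbortSignal) => Promise<void>,
): Promise<void> {
  let weStartedStream = false;
  try {
    if (state.owner) return;       // another mechanism (e.g. send()) owns it
    const turnId = await fetchActiveTurn();
    if (!turnId) return;           // no active turn: nothing to resume
    const controller = new AbortController();
    state.owner = controller;
    state.streaming = true;
    weStartedStream = true;
    await attach(turnId, controller.signal);
  } finally {
    // Early-return paths above fall through here too — but they leave the
    // shared state alone for whoever does own the stream.
    if (weStartedStream) {
      state.streaming = false;
      state.owner = null;
    }
  }
}
```
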

Also widens the Stop button's visibility (`canStop` prop on
ConversationInput) so it stays steady across the brief gap between
turns when a queue drain is mid-flight; the parent computes
`streaming || activeTurnId !== null || queue.length > 0`. The
visibility widening is independent of the streaming-state fix above
— both are now in place.

* revert: drop canStop widening — Stop only shows while streaming

Reverts the canStop prop on ConversationInput and the OR-with-queue
visibility from AgentCommandConversation. Stop is gated solely on
`streaming` again. Between turns (queue draining) the button stays
hidden — only the actively-streaming turn is interruptible from the
composer, which matches what the user actually expects.

* fix(agents): persist the kicking-off prompt on active turns so the resume placeholder isn't empty

When a queued message drained and started a new turn, the chat
panel's resume effect staged a placeholder turn with userText: ''
because the hook had no way to know what message kicked off the
turn — only the agent-side stream was visible, and the user bubble
above it was blank until the user navigated away and back (at which
point the session record's history loaded normally).

Fix: ActiveTurnRegistry.register now accepts an optional `prompt`
that's stashed on the turn and surfaced via describe() / the
ActiveTurnInfo response. AgentHarnessService.startTurn passes the
incoming message into register. /chat/active returns it. The chat
hook's resume effect uses active.prompt as the placeholder
turn's userText, so the user bubble shows the queued message text
the moment streaming begins. Falls back to '' for older clients
that haven't been refetched yet.
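
A hedged sketch of the register/describe plumbing this commit describes — the real `ActiveTurnRegistry` carries more state; only the prompt path is shown, and the shapes are assumptions from the message above:

```typescript
// Illustrative slice of the registry: register() stashes the kicking-off
// prompt; describe() surfaces it for the /chat/active response.
interface ActiveTurnInfo {
  turnId: string;
  prompt: string; // falls back to '' when the turn predates this change
}

class ActiveTurnRegistry {
  private turns = new Map<string, { prompt?: string }>();

  register(turnId: string, prompt?: string): void {
    this.turns.set(turnId, { prompt });
  }

  describe(turnId: string): ActiveTurnInfo | null {
    const turn = this.turns.get(turnId);
    if (!turn) return null;
    return { turnId, prompt: turn.prompt ?? "" };
  }
}
```

The chat hook would then use `describe(...).prompt` as the placeholder turn's `userText`, so the user bubble is populated the moment streaming begins.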

* fix(agents): always release streamAbortRef on resume cleanup, even when cancelled

Greptile P1 follow-up. The previous `weStartedStream` guard correctly
stopped the resume effect's no-op early-returns from clobbering an
in-flight `send()` stream — but it also stopped a *cancelled*
mid-stream resume from clearing its own `streamAbortRef`. When the
cleanup fires (e.g. the 5s listing poll captures a new queue-drain
turn id while the SSE for the prior turn is still finishing), the
next effect run hits the `if (streamAbortRef.current) return` guard
against the now-aborted controller and never reattaches, leaving
`streaming === true` with no live stream until the user navigates
away.

Split the finally block: always release `streamAbortRef` when we
owned the controller (so the next run can take over), but only
reset the streaming flag / turn id / lastSeq on a clean exit (the
new run will set those itself, so resetting on cancel would just
flicker).
2026-04-30 18:26:56 +05:30
Dani Akash
ba60bf466f feat(agents): rich command-center rows + home grid + dead-code sweep (#879)
* feat(agents): rich-info command center rows + pin/PATCH/adapter-health backbone

Splits AgentRowCard from a 271-line monolith into a shallow tree of
single-responsibility sub-components under `agent-row/`:

  AgentTile, AdapterHealthDot, PinToggle, AgentTitleRow,
  AgentSparkline, AgentSummaryChips, AgentLastMessage, CwdChip,
  AgentTokenSummary, AgentMetaRow, AgentErrorPanel, AgentActions

Adds the data each row consumes:

- pinned: boolean field on AgentDefinition + FileAgentStore.update
  + new PATCH /agents/:id route. useUpdateHarnessAgent mutation
  optimistically updates the listing cache so the star flips
  instantly; rolls back on error.
- Listing payload extended with lastUserMessage, cwd, tokens
  (cumulative + last7d shape — last7d zero-filled until the
  activity ledger lands), turnsByDay/failedByDay (zero-filled),
  lastError/lastErrorAt, activeTurnId. AcpxRuntime grows a
  getRowSnapshot() that reads cwd + cumulative tokens + last user
  message from the session record in one pass.
- Adapter health: in-memory AdapterHealthChecker probes
  `claude --version` / `codex --version` with a 2s timeout and
  caches results for 5 min. /adapters response carries
  { healthy, reason?, checkedAt }. Tile-corner dot exposes the
  state via HoverCard; openclaw inherits health from the gateway
  snapshot already on the page.
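
The optimistic pin flow in the first bullet can be sketched as follows — a generic cache stand-in rather than the real query-cache API, with illustrative names:

```typescript
// Hypothetical sketch of the optimistic pin toggle: flip the cached value
// immediately, then roll back to the snapshot if the PATCH fails.
interface AgentListing {
  id: string;
  pinned: boolean;
}

type Cache = Map<string, AgentListing>;

async function togglePin(
  cache: Cache,
  id: string,
  patch: (id: string, pinned: boolean) => Promise<void>,
): Promise<boolean> {
  const prev = cache.get(id);
  if (!prev) return false;
  cache.set(id, { ...prev, pinned: !prev.pinned }); // star flips instantly
  try {
    await patch(id, !prev.pinned); // PATCH /agents/:id
    return true;
  } catch {
    cache.set(id, prev); // roll back on error
    return false;
  }
}
```
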

Sub-components are pure: card itself owns no state. Sort order
becomes pinned-first, then recency. HoverCard is the workhorse for
keeping rows compact while exposing depth (full message, token
breakdown, daily turn list, error stack, adapter reason).

* refactor(agents): tighten command-center row design + cut redundant affordances

User feedback round 1:
1) Two green dots on the tile (health + liveness) was confusing. Health
   moves out of the tile entirely and surfaces as an inline 'Unavailable'
   chip in the model line — silent when the adapter is healthy, with a
   warning amber chip + HoverCard reason when not. The tile now shows
   one signal: liveness.
2) The last-user-message HoverCard wasn't telegraphing intent. Drop the
   HoverCard. The line is informational, italic, with a leading quote
   glyph so the row reads like a conversation snippet. To see the full
   message the user opens the chat (which is the action they want next
   anyway).
3) Resume + Chat were duplicate CTAs. Single primary action per row:
   Resume (filled, accent-orange, with a pulsing dot) replaces Chat
   when there's an active turn. Both navigate to /agents/:id but the
   row tells the user which action they're taking.
4) Tokens weren't visible because the row gated on last7d.requestCount,
   which is zero until the activity ledger ships. Switch to lifetime
   tokens (which we have today). Drop the '7d stats:' framing — talking
   about a window we can't compute would be misleading. The HoverCard
   surfaces input/output split + a footnote that per-window stats land
   in a follow-up.
5) CWD was rendering the server's own running directory, which is
   meaningless to users. Hide it from the row entirely. The cwd field
   still rides in the listing payload for future surfaces (chat panel,
   debug view) — only the row stops rendering it.

Aesthetic refinements while we're here:
- Whole card carries state, not just the tile: working rows get an
  accent-orange tinted border with a soft glow, error rows tint
  destructive, idle rows lift on hover.
- Pin star fades in on hover (group-hover) when unpinned and stays
  solid amber when pinned — keeps the rail calm by default.
- Tabular-nums on token figures so columns visually align across rows.
- Drop CwdChip and AdapterHealthDot files: no callers left.

* fix(agents): align row title flush-left whether pinned or not

Pin star moved from leading the title to trailing the badges, and
hidden from layout entirely (`hidden group-hover:inline-flex`) when
unpinned. The previous `opacity-0` rule kept the star reserving its
`size-6` slot, which left every unpinned title indented relative to
the model / preview / meta lines underneath it. Title now flushes
left in both states; pinned star stays solid amber so the signal
isn't hidden, and unpinned reveals an outline star on row hover for
the toggle affordance.

* fix(agents): keep pin-toggle slot reserved so row height is constant

Switching the unpinned star from `hidden group-hover:inline-flex`
to `opacity-0 group-hover:opacity-100`. The hidden/show variant was
collapsing the title row's height when the star wasn't rendered,
which made every card below visibly shift on hover. Always rendering
the button (with opacity-only visibility) keeps the row's vertical
metrics constant; the title still flushes left because the slot is
trailing, not leading.

Card hover effect (-translate-y + shadow-md) restored — the layout
shift wasn't coming from the card hover; it was the pin slot
appearing and disappearing.

* fix(agents): quieten row hover — border-tint only, no lift, no shadow

Drop the `-translate-y-px` and `hover:shadow-md` from the row card
plus the working-state inner ring. The translate + shadow grow
combination was visibly noisy as the cursor moved through the rail —
each row 'lifted' as you passed over it. Hover now just tints the
border in accent-orange/30; working and error states keep their
distinct border colours but no inner ring. Card height and shadow
stay constant in every state, so the rail reads as a calm vertical
list of cards.

* feat(home): rich Recent Agents grid + dead-code sweep

The /home Recent Agents grid was a placeholder shell. Every 'rich'
field on the card (lastMessage, lastMessageTimestamp, activitySummary,
currentTool, costUsd) was wired to undefined because AgentCommandHome
called `buildAgentCardData(agents, status?.status, undefined)` — the
dashboard arg has been hard-coded undefined since the harness
migration. Repointing the grid at `useHarnessAgents` + `useAgentAdapters`
gives every card the same enriched data the rail uses.

What the new card shows per agent:
  • Adapter glyph tile + liveness dot (working pulses; asleep is
    hollow; error is red)
  • Name + Working pill (when active)
  • Adapter · model · reasoning summary line, with an inline
    Unavailable chip + HoverCard reason when the adapter binary
    isn't on $PATH
  • Italic last-user-message preview (line-clamp-2, leading quote
    glyph) — same visual language as the rail
  • Footer: 'X ago' + state chip (Asleep / Attention) OR a Resume
    button (orange, with pulsing dot) when activeTurnId is non-null

Sort on the home grid is active-turn → recency. Pinning is NOT a
sort key here (and there's no pin indicator on the card) — pinning
belongs to the rail at /agents; the home page is action-oriented
and trusts active-turn + recency to surface the right agent.
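
The active-turn → recency order can be sketched as a comparator — field names here are assumptions, not the real listing shape:

```typescript
// Home-grid sort: agents with an active turn first, then most recent.
interface HomeAgent {
  id: string;
  activeTurnId: string | null;
  lastActivityAt: number; // epoch ms
}

function sortHomeGrid(agents: HomeAgent[]): HomeAgent[] {
  return [...agents].sort((a, b) => {
    const aActive = a.activeTurnId !== null ? 1 : 0;
    const bActive = b.activeTurnId !== null ? 1 : 0;
    if (aActive !== bActive) return bActive - aActive; // active turns first
    return b.lastActivityAt - a.lastActivityAt;        // then recency
  });
}
```

Note that, per the message above, `pinned` deliberately does not appear in this comparator — pinning only affects the rail's sort.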

Dead code removed:
  • useAgentDashboard.ts (96 lines, no callers; subscribed to the
    dead /claw/dashboard/stream from the OpenClaw-only era)
  • useAgentCardData.ts (the dashboard-merge shim; passed undefined
    every call so all enriched fields landed as undefined)
  • AgentCard.tsx (AgentCardExpanded replaced by HomeAgentCard;
    AgentCardCompact had no callers — the dock's compact mode was
    never used)
  • AgentCardData interface dropped from lib/agent-conversations/
    types.ts; the new card consumes HarnessAgent directly

Visual language stays continuous between rail and grid: same
<AgentTile>, same <LivenessDot>, same italic-quote message
preview, same orange Resume button with a pulsing dot.
2026-04-30 16:36:22 +05:30
Nikhil
26afb826c6 feat(eval): add viewer manifest contract (#878)
* refactor(eval): canonicalize viewer manifest contract

* refactor(eval): publish canonical viewer manifests

* feat(eval): make r2 viewer use manifest artifact paths

* fix(eval): keep weekly report compatible with viewer manifests

* docs(eval): document r2 viewer manifest contract

* chore: self-review fixes

* fix: address review feedback for PR #878
2026-04-29 20:50:35 -07:00
Nikhil
b2340c8afa refactor(eval): split orchestrated executor backends (#876)
* refactor(eval): split orchestrated executor backends

* fix(eval): address executor backend review comments
2026-04-29 18:02:32 -07:00
Felarof
790a270f47 Update README.md (#877) 2026-04-29 17:35:15 -07:00
Nikhil
84a79ba0a1 feat: refactor eval pipeline workflow (#875)
* feat(eval): add suite variant config bridge

* feat(eval): add stable run artifacts

* refactor(eval): add shared grader contract

* feat(eval): persist grader artifacts

* refactor(eval): rename runner layers

* refactor(eval): add executor backend boundary

* refactor(eval): split clado backend

* feat(eval): add workflow compatible cli

* feat(eval): add r2 publisher module

* ci(eval): migrate weekly workflow to eval cli

* docs(eval): document suite pipeline

* chore(eval): verify pipeline refactor

* fix: address review feedback for PR #875

* docs(eval): add env example

* docs(eval): explain suites and variants

* chore(eval): organize config layouts

* chore(eval): colocate grader python evaluators
2026-04-29 17:21:02 -07:00
Nikhil
6e3306f5e5 fix: make R2 uploads retryable (#874)
* fix: make R2 uploads retryable

* fix: address review feedback for PR #874
2026-04-29 16:43:33 -07:00
Nikhil
c244462b29 fix: use Node 24 GitHub actions (#872) 2026-04-29 15:31:23 -07:00
Nikhil
ebf97f74f6 fix: bound VM agent cache smoke test (#870)
* fix: bound VM agent cache smoke test

* fix: address review feedback for PR #870
2026-04-29 13:43:37 -07:00
Nikhil
561f2baf97 fix(eval): split AGISDK smoke and full configs (#871)
* fix(eval): split agisdk smoke and full configs

* fix(eval): default agisdk smoke to openrouter
2026-04-29 13:38:55 -07:00
shivammittal274
df0f45dd29 Feat: eval debug dev ci (#869)
* chore(eval): instrument server startup to root-cause dev CI health-check timeouts

Three diagnostics + one config swap to investigate why the eval-weekly
workflow has been failing on dev since 2026-04-25 with "Server health
check timed out" (every worker, every retry).

Background:
- Last successful weekly eval on dev: 2026-04-18 (sha f5a2b73)
- Since then, ~30 server commits landed including Lima/VM runtime,
  OpenClaw service, ACL system, ACP SDK — 108 server files changed,
  ~13K LOC added.
- Server process spawns cleanly in CI (PID logged) but never binds
  /health within the 30s eval-side timeout. Static analysis finds no
  obvious blocker; we need runtime evidence.

Changes:

1. apps/server/package.json — add `start:ci` script (no `--watch`).
   The default `start` uses `bun --watch` which forks a child process
   that watches every file in the import graph. Dev's graph is ~108
   files larger than main's; on a cold CI runner the watcher setup is a
   plausible source of multi-second startup overhead.

2. apps/eval/src/runner/browseros-app-manager.ts:
   - Use `start:ci` when `process.env.CI` is set (true on
     GitHub-hosted runners by default), else `start`.
   - Capture per-worker server stderr to /tmp/browseros-server-logs/
     instead of ignoring it. Without this we have no visibility into
     why the server is hung pre-/health.
   - Bump SERVER_HEALTH_TIMEOUT_MS 30s -> 90s. Dev's larger module
     graph may simply need more cold-start time on CI.

3. .github/workflows/eval-weekly.yml — upload the server logs dir as a
   workflow artifact (always, not just on success) so we can post-mortem
   any startup failure on the next run.

4. configs/agisdk-real-smoke.json — swap K2.5 from OpenRouter ->
   Fireworks (bypasses the OpenRouter per-key spend cap that has been
   eating recent runs) and drop num_workers 10 -> 4 (well below the
   Fireworks per-account TPM threshold that overwhelmed the original
   2026-04-23 run).

Plan: trigger the eval-weekly workflow on this branch with the agisdk
config and observe (a) whether it gets past server startup, and
(b) if it doesn't, what the captured server stderr says.

* fix(eval): capture stdout too — pino logger writes to stdout, not stderr

Previous diagnostic patch only redirected stderr; the captured per-worker
log files came back as 0 bytes because the server uses pino which writes
all log output to stdout (fd 1), not stderr (fd 2). Capture both into
the same file.
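
A sketch of that capture, assuming a Node/Bun-style spawn — the command and paths are illustrative; the point is that stdout and stderr share one file descriptor:

```typescript
// pino logs to stdout (fd 1), so both stdout and stderr are pointed at the
// same log file descriptor when spawning the server.
import { openSync } from "node:fs";
import { spawn } from "node:child_process";

// Build the stdio layout: ignore stdin, send stdout AND stderr to logFd.
function serverStdio(logFd: number): ["ignore", number, number] {
  return ["ignore", logFd, logFd];
}

function startServer(logPath: string) {
  const fd = openSync(logPath, "a");
  return spawn("bun", ["run", "start:ci"], { stdio: serverStdio(fd) });
}
```
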

* fix(server): catch sync throw from OpenClaw constructor on Linux

The container runtime constructor in OpenClawService throws synchronously
on non-darwin platforms, e.g. GitHub Actions Linux runners. The existing
.catch() on tryAutoStart() only handles async throws inside auto-start —
the sync throw from configureOpenClawService(...) itself propagates up
through Application.start() and crashes the process via index.ts:48
(process.exit(EXIT_CODES.GENERAL_ERROR)).

This is what's been killing dev's eval-weekly CI: the server crashes in
milliseconds, the eval client polls /health, gets nothing, times out.

Fix: wrap the configureOpenClawService call in try/catch matching the
existing .catch() intent (best-effort, don't crash). Server continues
without OpenClaw on platforms where it can't initialize.
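
The guard can be sketched as follows — `safeConfigure` is an illustrative name; the real code wraps the `configureOpenClawService(...)` call in `main.ts`:

```typescript
// Best-effort guard: a synchronous throw from the configure call must not
// reach Application.start() and exit the process.
function safeConfigure(
  configure: () => void,
  log: (msg: string) => void,
): boolean {
  try {
    configure(); // may throw synchronously (e.g. "supports macOS only")
    return true;
  } catch (err) {
    // Mirror the async .catch() intent: log and continue without OpenClaw.
    log(`OpenClaw unavailable, continuing without it: ${String(err)}`);
    return false;
  }
}
```
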

Verified by reading captured server stdout from run 25123195126:
  Failed to start server: error: browseros-vm currently supports macOS only
    at buildContainerRuntime (container-runtime-factory.ts:54:11)
    at new OpenClawService (openclaw-service.ts:652:15)
    at configureOpenClawService (openclaw-service.ts:1527:19)
    at start (main.ts:127:5)

* fix(server): defer OpenClaw chat client port lookup to request time

apps/server/src/api/server.ts:149 was calling getOpenClawService().getPort()
synchronously when constructing the OpenClawGatewayChatClient inside the
createHttpServer object literal. On non-darwin platforms this throws via
the OpenClawService constructor → buildContainerRuntime, escaping the
try/catch added in 5cf7b765 (which only protected the configureOpenClawService
call further down in main.ts).

Every other getOpenClawService() reference in server.ts is already wrapped
in an arrow function. This was the lone holdout. Make it lazy too: change
the chat client constructor to take getHostPort: () => number instead of
hostPort: number, evaluate it inside streamTurn at request time. Behavior
on darwin is unchanged.

This unblocks dev's eval-weekly CI on Linux runners where OpenClaw isn't
available — the chat endpoint isn't exercised by the eval, so a deferred
throw is acceptable.
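
The getter-based laziness can be sketched like this — class and method names are simplified stand-ins for the real chat client:

```typescript
// Take a port *getter* so the (possibly throwing) service lookup happens at
// request time, not when the client object is constructed.
class GatewayChatClient {
  constructor(private getHostPort: () => number) {}

  streamTurn(path: string): string {
    // Port is resolved only when a chat request actually arrives.
    return `http://127.0.0.1:${this.getHostPort()}${path}`;
  }
}
```
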

* fix(server): allow Linux to skip OpenClaw via BROWSEROS_SKIP_OPENCLAW=1

Earlier surgical fixes (try/catch in main.ts, lazy chat client port) didn't
unblock dev's Linux CI — same throw kept reproducing. Whether this is bun
caching stale stack frames or a missed eager call site, the safer move is
to fix it at the root: make buildContainerRuntime never throw on Linux
when the runner has explicitly opted out.

Adds BROWSEROS_SKIP_OPENCLAW env check alongside the existing NODE_ENV=test
escape hatch in container-runtime-factory.ts. When set, returns the existing
UnsupportedPlatformTestRuntime stub — server boots normally, /health binds,
any actual OpenClaw API call still fails loudly at request time.

eval-weekly.yml sets the flag for the Linux runner. Darwin behavior and
non-CI Linux behavior unchanged (without the flag they still throw).
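
The branch logic in the factory can be sketched as follows — the runtime values are placeholders for the real runtime objects, and only the decision described above is modeled:

```typescript
// Factory decision: darwin gets the real VM runtime; elsewhere, only an
// explicit opt-out (or test mode) gets the no-op stub so /health can bind.
type Runtime = "darwin-vm" | "stub";

function buildContainerRuntime(
  platform: string,
  env: Record<string, string | undefined>,
): Runtime {
  if (platform === "darwin") return "darwin-vm";
  if (env.BROWSEROS_SKIP_OPENCLAW === "1" || env.NODE_ENV === "test") {
    // Server boots normally; actual OpenClaw API calls still fail loudly
    // at request time.
    return "stub";
  }
  throw new Error("browseros-vm currently supports macOS only");
}
```
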

* feat(eval): align Clado action executor with new endpoint contract

David Shan shared the updated Clado BrowserOS Action Model spec.
Changes to match it:

- Bump endpoint URL + model id to the 000159-merged checkpoint
  (clado-ai--clado-browseros-action-000159-merged-actionmod-f4a6ef)
  in browseros-oe-clado-weekly.json and the README example.
- CLADO_REQUEST_TIMEOUT_MS 120s → 360s. Cold start can take ~5 min;
  the 2-min ceiling was failing every cold-start request.
- Treat HTTP 200 with action=null / parse_error as an INVALID step
  instead of aborting the executor loop. The model can self-correct
  on the next call. Cap consecutive parse failures at 3 to avoid
  infinite loops.
- Capture final_answer from end actions. Surface it in the observation
  back to the orchestrator so its task answer can use the model's
  declared result.
- Add macOS Cmd-* key mappings (M-a, M-c, M-v, M-x → Meta+A/C/V/X).
- Switch screenshot format from webp → png to match the documented
  "PNG or JPEG" contract.
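
The parse-failure tolerance above can be sketched as a loop guard — the response shape is an assumption from this message, not the documented Clado contract:

```typescript
// Treat action=null / parse_error as an INVALID step instead of aborting,
// but cap consecutive parse failures at 3 to avoid infinite loops.
interface CladoResponse {
  action: string | null;
  parse_error?: string;
}

const MAX_CONSECUTIVE_PARSE_FAILURES = 3;

function runSteps(responses: CladoResponse[]): { executed: number; aborted: boolean } {
  let consecutiveFailures = 0;
  let executed = 0;
  for (const res of responses) {
    if (res.action === null || res.parse_error) {
      consecutiveFailures++;
      if (consecutiveFailures >= MAX_CONSECUTIVE_PARSE_FAILURES) {
        return { executed, aborted: true }; // give up after 3 in a row
      }
      continue; // INVALID step; the model can self-correct on the next call
    }
    consecutiveFailures = 0; // any valid action resets the streak
    executed++;
  }
  return { executed, aborted: false };
}
```
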

* chore(eval): refresh test-clado-api script for new Clado contract

Updated the local smoke-test to match the new Clado endpoint and
response contract:

- New action + health URLs (000159-merged checkpoint).
- Drop the grounding-model branch (orchestrator-executor doesn't
  use it; the README David shared only documents the action model).
- Health-check waits up to 6 minutes for cold start with a 30s
  warning so the operator knows it's spinning up.
- Print every documented response field (action, x/y, text, key,
  direction, amount, drag start/end, time, final_answer, thinking,
  parse_error, inference_time_seconds).
- Three-step run that exercises a click, a typing continuation
  with formatted history, and an end+final_answer probe.

* chore(eval): point clado weekly config at agisdk-real

Switches the orchestrator-executor + Clado weekly config to run on the
AGI SDK / REAL Bench task set with the deterministic agisdk_state_diff
grader. Matches the orchestrator-executor smoke target (Fireworks K2.5
orchestrator + Clado action executor) we want to track week-over-week.

* chore(eval): run clado weekly headless

Default to headless so the weekly job (and local repros) don't pop ten
visible Chrome windows. Set headless=false locally if you need to watch
a worker.

* fix(eval): address Greptile P1+P2 on server log fd handling

P1: openSync was outside the mkdirSync try/catch, so a swallowed mkdir
failure (e.g. unwritable custom BROWSEROS_SERVER_LOG_DIR) would leave the
log directory missing and crash the server spawn with ENOENT. Move openSync
into the same try block; fall back to /dev/null so spawn always succeeds.

P2: the log fd was opened on every server start but never closed. Each
restart attempt leaked one fd across all workers — over a long eval run
that could exhaust the process fd limit. Track the fd on the manager and
closeSync it in killApp() right after the server process exits (the child's
dup keeps the file open until it exits, so we don't truncate output).
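
The fd lifecycle from P1+P2 can be sketched like this — the manager shape is illustrative, and the `/dev/null` fallback assumes a POSIX host:

```typescript
// openSync sits inside the same try as mkdirSync, so an unwritable log dir
// degrades to /dev/null instead of crashing the server spawn with ENOENT.
import { closeSync, mkdirSync, openSync } from "node:fs";
import { join } from "node:path";

class ServerLogManager {
  private logFd: number | null = null;

  open(dir: string, name: string): number {
    try {
      mkdirSync(dir, { recursive: true });
      this.logFd = openSync(join(dir, name), "a");
    } catch {
      // Unwritable dir: spawn must still succeed, so log to the void.
      this.logFd = openSync("/dev/null", "w");
    }
    return this.logFd;
  }

  // Called right after the server process exits; the child's dup of the fd
  // keeps the file open until then, so closing here can't truncate output.
  close(): void {
    if (this.logFd !== null) {
      closeSync(this.logFd);
      this.logFd = null;
    }
  }
}
```
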
2026-04-30 01:33:49 +05:30
224 changed files with 12563 additions and 6783 deletions


@@ -14,7 +14,7 @@ on:
     config:
       description: 'Eval config file (relative to apps/eval/)'
       required: false
-      default: 'configs/browseros-agent-weekly.json'
+      default: 'configs/legacy/browseros-agent-weekly.json'
 permissions:
   contents: read
@@ -62,33 +62,27 @@ jobs:
           curl -sL -o /tmp/nopecha.zip https://github.com/NopeCHALLC/nopecha-extension/releases/latest/download/chromium_automation.zip
           unzip -qo /tmp/nopecha.zip -d extensions/nopecha
-      - name: Run eval
+      - name: Run eval and publish to R2
         working-directory: packages/browseros-agent/apps/eval
         env:
           FIREWORKS_API_KEY: ${{ secrets.FIREWORKS_API_KEY }}
           OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
           CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
           NOPECHA_API_KEY: ${{ secrets.NOPECHA_API_KEY }}
-          BROWSEROS_BINARY: /usr/bin/browseros
-          WEBARENA_INFINITY_DIR: /tmp/webarena-infinity
-          EVAL_CONFIG: ${{ github.event.inputs.config || 'configs/browseros-agent-weekly.json' }}
-        run: |
-          echo "Running eval with config: $EVAL_CONFIG"
-          xvfb-run --auto-servernum --server-args="-screen 0 1440x900x24" bun run src/index.ts -c "$EVAL_CONFIG"
-      - name: Upload runs to R2
-        if: success()
-        working-directory: packages/browseros-agent/apps/eval
-        env:
           EVAL_R2_ACCOUNT_ID: ${{ secrets.EVAL_R2_ACCOUNT_ID }}
           EVAL_R2_ACCESS_KEY_ID: ${{ secrets.EVAL_R2_ACCESS_KEY_ID }}
           EVAL_R2_SECRET_ACCESS_KEY: ${{ secrets.EVAL_R2_SECRET_ACCESS_KEY }}
           EVAL_R2_BUCKET: ${{ secrets.EVAL_R2_BUCKET }}
           EVAL_R2_CDN_BASE_URL: ${{ secrets.EVAL_R2_CDN_BASE_URL }}
-          EVAL_CONFIG: ${{ github.event.inputs.config || 'configs/browseros-agent-weekly.json' }}
+          BROWSEROS_BINARY: /usr/bin/browseros
+          WEBARENA_INFINITY_DIR: /tmp/webarena-infinity
+          # OpenClaw container runtime is macOS-only; opt the Linux runner
+          # into the no-op stub so the server can boot and the eval can run.
+          BROWSEROS_SKIP_OPENCLAW: '1'
+          EVAL_CONFIG: ${{ github.event.inputs.config || 'configs/legacy/browseros-agent-weekly.json' }}
         run: |
-          CONFIG_NAME=$(basename "$EVAL_CONFIG" .json)
-          bun scripts/upload-run.ts "results/$CONFIG_NAME"
+          echo "Running eval with config: $EVAL_CONFIG"
+          xvfb-run --auto-servernum --server-args="-screen 0 1440x900x24" bun run src/index.ts suite --config "$EVAL_CONFIG" --publish r2
- name: Generate trend report
if: success()
@@ -109,3 +103,11 @@ jobs:
with:
name: eval-report-${{ github.run_id }}
path: /tmp/eval-report.html
- name: Upload server stderr logs (for post-mortem on startup failures)
if: always()
uses: actions/upload-artifact@v4
with:
name: browseros-server-logs-${{ github.run_id }}
path: /tmp/browseros-server-logs/
if-no-files-found: ignore


@@ -1,167 +0,0 @@
name: Publish VM Agent Cache
on:
workflow_dispatch:
inputs:
agent:
description: "Agent name from bundle.json"
required: true
type: string
default: openclaw
publish:
description: "Upload to R2 and merge manifest slice"
required: false
default: false
type: boolean
pull_request:
paths:
- "packages/browseros-agent/packages/build-tools/**"
- ".github/workflows/publish-vm-agent-cache.yml"
env:
BUN_VERSION: "1.3.6"
PKG_DIR: packages/browseros-agent/packages/build-tools
permissions:
contents: read
jobs:
check:
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v4
- uses: oven-sh/setup-bun@v2
with:
bun-version: ${{ env.BUN_VERSION }}
- working-directory: packages/browseros-agent
run: bun install --frozen-lockfile
- working-directory: packages/browseros-agent
run: bun run --filter @browseros/build-tools typecheck
- working-directory: packages/browseros-agent
run: bun run --filter @browseros/build-tools test
build:
needs: check
strategy:
fail-fast: false
matrix:
include:
- arch: arm64
runner: ubuntu-24.04-arm
- arch: x64
runner: ubuntu-24.04
runs-on: ${{ matrix.runner }}
steps:
- uses: actions/checkout@v4
- uses: oven-sh/setup-bun@v2
with:
bun-version: ${{ env.BUN_VERSION }}
- name: Install podman
run: |
sudo apt-get update
sudo apt-get install -y podman
- working-directory: packages/browseros-agent
run: bun install --frozen-lockfile
- name: Build tarball
working-directory: ${{ env.PKG_DIR }}
env:
AGENT: ${{ inputs.agent || 'openclaw' }}
OUT: ${{ github.workspace }}/dist/images
run: bun run build:tarball -- --agent "$AGENT" --arch "${{ matrix.arch }}" --output-dir "$OUT"
- uses: actions/upload-artifact@v4
with:
name: tarball-${{ inputs.agent || 'openclaw' }}-${{ matrix.arch }}
path: dist/images/
retention-days: 7
smoke:
needs: build
strategy:
fail-fast: false
matrix:
include:
- arch: arm64
runner: ubuntu-24.04-arm
- arch: x64
runner: ubuntu-24.04
runs-on: ${{ matrix.runner }}
steps:
- uses: actions/checkout@v4
- uses: oven-sh/setup-bun@v2
with:
bun-version: ${{ env.BUN_VERSION }}
- uses: actions/download-artifact@v4
with:
name: tarball-${{ inputs.agent || 'openclaw' }}-${{ matrix.arch }}
path: dist/images
- name: Install podman
run: |
sudo apt-get update
sudo apt-get install -y podman
- working-directory: packages/browseros-agent
run: bun install --frozen-lockfile
- name: Smoke test tarball
working-directory: ${{ env.PKG_DIR }}
env:
AGENT: ${{ inputs.agent || 'openclaw' }}
run: |
set -euo pipefail
tarball="$(find "$GITHUB_WORKSPACE/dist/images" -name "${AGENT}-*-${{ matrix.arch }}.tar.gz" -print -quit)"
if [ -z "$tarball" ]; then
echo "missing ${{ matrix.arch }} tarball artifact for ${AGENT}" >&2
exit 1
fi
bun run smoke:tarball -- --agent "$AGENT" --arch "${{ matrix.arch }}" --tarball "$tarball"
publish:
needs: [build, smoke]
if: ${{ github.event_name == 'workflow_dispatch' && inputs.publish == true }}
runs-on: ubuntu-24.04
environment: release
concurrency:
group: r2-manifest-publish
cancel-in-progress: false
steps:
- uses: actions/checkout@v4
- uses: oven-sh/setup-bun@v2
with:
bun-version: ${{ env.BUN_VERSION }}
- uses: actions/download-artifact@v4
with:
pattern: tarball-*
path: dist/images
merge-multiple: true
- working-directory: packages/browseros-agent
run: bun install --frozen-lockfile
- name: Upload tarballs to R2
working-directory: ${{ env.PKG_DIR }}
env:
R2_ACCOUNT_ID: ${{ secrets.R2_ACCOUNT_ID }}
R2_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }}
R2_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }}
R2_BUCKET: ${{ secrets.R2_BUCKET }}
run: |
set -euo pipefail
for file in "$GITHUB_WORKSPACE"/dist/images/*.tar.gz; do
base="$(basename "$file")"
bun run upload -- --file "$file" --key "vm/images/$base" --content-type "application/gzip" --sidecar-sha
done
- name: Merge agent slice into manifest
working-directory: ${{ env.PKG_DIR }}
env:
AGENT: ${{ inputs.agent || 'openclaw' }}
R2_ACCOUNT_ID: ${{ secrets.R2_ACCOUNT_ID }}
R2_ACCESS_KEY_ID: ${{ secrets.R2_ACCESS_KEY_ID }}
R2_SECRET_ACCESS_KEY: ${{ secrets.R2_SECRET_ACCESS_KEY }}
R2_BUCKET: ${{ secrets.R2_BUCKET }}
run: |
set -euo pipefail
mkdir -p dist/images
cp -R "$GITHUB_WORKSPACE"/dist/images/* dist/images/
bun run download -- --key vm/manifest.json --out dist/baseline-manifest.json
bun run emit-manifest -- \
--slice "agents:${AGENT}" \
--dist-dir dist \
--merge-from dist/baseline-manifest.json \
--out dist/manifest.json
bun run upload -- --file dist/manifest.json --key vm/manifest.json --content-type "application/json"


@@ -63,15 +63,15 @@ jobs:
junit_path: test-results/server-root.xml
needs_browser: false
- suite: agent
command: bun run test:agent
command: (cd apps/agent && bun run test)
junit_path: test-results/agent.xml
needs_browser: false
- suite: eval
command: bun run test:eval
command: (cd apps/eval && bun run test)
junit_path: test-results/eval.xml
needs_browser: false
- suite: build
command: bun run test:build
command: bun run ./scripts/run-bun-test.ts ./scripts/build
junit_path: test-results/build.xml
needs_browser: false


@@ -188,6 +188,21 @@ We'd love your help making BrowserOS better! See our [Contributing Guide](CONTRI
- [ungoogled-chromium](https://github.com/ungoogled-software/ungoogled-chromium) — BrowserOS uses some patches for enhanced privacy. Thanks to everyone behind this project!
- [The Chromium Project](https://www.chromium.org/) — at the core of BrowserOS, making it possible to exist in the first place.
## Citation
If you use BrowserOS in your research or project, please cite:
```bibtex
@software{browseros2025,
author = {Nithin Sonti and Nikhil Sonti and {BrowserOS-team}},
title = {BrowserOS: The open-source Agentic browser},
url = {https://github.com/browseros-ai/BrowserOS},
year = {2025},
publisher = {GitHub},
license = {AGPL-3.0},
}
```
## License
BrowserOS is open source under the [AGPL-3.0 license](LICENSE).


@@ -79,14 +79,15 @@ cp apps/server/.env.example apps/server/.env.development
cp apps/agent/.env.example apps/agent/.env.development
cp apps/server/.env.production.example apps/server/.env.production
# Install deps, generate agent code, and sync the VM cache
# Install deps and generate agent code
bun run dev:setup
# Start the full dev environment
bun run dev:watch
```
`dev:watch` exits when the VM cache manifest is missing, but setup stays in `dev:setup`.
`dev:watch` starts the server immediately. OpenClaw VM/image prewarm runs from
the server startup path and pulls the configured GHCR image on demand.
### Environment Variables
@@ -156,9 +157,14 @@ bun run build:server # Build production server resource artifacts and u
bun run build:agent # Build agent extension
# Test
bun run test # Run standard tests
bun run test:cdp # Run CDP-based tests
bun run test:integration # Run integration tests
bun run test # Run all tests
bun run test:all # Run all tests
bun run test:main # Run key server tools and integration tests
# App-specific test groups (from packages/browseros-agent)
cd apps/server && bun run test:tools
cd apps/server && bun run test:cdp
cd apps/server && bun run test:integration
# Quality
bun run lint # Check with Biome


@@ -1,136 +0,0 @@
import { Bot, Loader2, Wrench } from 'lucide-react'
import type { FC } from 'react'
import type { AgentCardData } from '@/lib/agent-conversations/types'
import { cn } from '@/lib/utils'
interface AgentCardProps {
agent: AgentCardData
onClick: () => void
active?: boolean
}
function formatTimestamp(timestamp?: number): string {
if (!timestamp) return 'No activity yet'
const diff = Date.now() - timestamp
const minutes = Math.floor(diff / 60000)
if (minutes < 1) return 'just now'
if (minutes < 60) return `${minutes}m ago`
const hours = Math.floor(minutes / 60)
if (hours < 24) return `${hours}h ago`
return `${Math.floor(hours / 24)}d ago`
}
function getStatusLabel(status: AgentCardData['status']): string {
if (status === 'working') return 'Working'
if (status === 'error') return 'Error'
return 'Ready'
}
function getStatusTone(status: AgentCardData['status']): string {
if (status === 'working') return 'bg-amber-500'
if (status === 'error') return 'bg-destructive'
return 'bg-emerald-500'
}
function formatCost(usd: number): string {
if (usd < 0.005) return `$${usd.toFixed(4)}`
return `$${usd.toFixed(2)}`
}
export const AgentCardExpanded: FC<AgentCardProps> = ({
agent,
onClick,
active,
}) => (
<button
type="button"
onClick={onClick}
className={cn(
'group flex min-h-32 w-full min-w-0 flex-col rounded-2xl border p-4 text-left shadow-sm transition-all duration-200',
active
? 'border-border/80 bg-card shadow-md ring-1 ring-[var(--accent-orange)]/20'
: 'border-border/60 bg-card/85 hover:border-border hover:bg-card hover:shadow-md',
)}
>
<div className="flex items-start justify-between gap-3">
<div className="flex min-w-0 items-center gap-3">
<div
className={cn(
'flex size-10 shrink-0 items-center justify-center rounded-xl',
active
? 'bg-[var(--accent-orange)]/10 text-[var(--accent-orange)]'
: 'bg-muted text-muted-foreground',
)}
>
<Bot className="size-5" />
</div>
<div className="min-w-0">
<div className="truncate font-semibold text-sm">{agent.name}</div>
<div className="truncate text-muted-foreground text-xs">
{agent.model ?? 'OpenClaw agent'}
</div>
</div>
</div>
<div className="flex items-center gap-2 rounded-full border border-border/60 bg-background/70 px-2.5 py-1 text-[11px] text-muted-foreground">
<span
className={cn('size-2 rounded-full', getStatusTone(agent.status))}
/>
<span>{getStatusLabel(agent.status)}</span>
</div>
</div>
<div className="mt-4 flex-1">
<p className="line-clamp-2 text-foreground/90 text-sm">
{agent.lastMessage ??
'Start a conversation to see recent work and summaries.'}
</p>
</div>
<div className="mt-4 space-y-1.5 text-muted-foreground text-xs">
<div className="flex items-center justify-between gap-3">
<span>{formatTimestamp(agent.lastMessageTimestamp)}</span>
{agent.costUsd ? (
<span className="tabular-nums opacity-70">
{formatCost(agent.costUsd)}
</span>
) : null}
</div>
{agent.status === 'working' && agent.currentTool ? (
<div className="flex items-center gap-1.5 text-[var(--accent-orange)]/70">
<Loader2 className="size-3 shrink-0 animate-spin" />
<span className="truncate">{agent.currentTool}</span>
</div>
) : agent.activitySummary ? (
<div className="flex items-center gap-1.5 text-muted-foreground/60">
<Wrench className="size-3 shrink-0" />
<span className="truncate">{agent.activitySummary}</span>
</div>
) : null}
</div>
</button>
)
export const AgentCardCompact: FC<AgentCardProps> = ({
agent,
onClick,
active,
}) => (
<button
type="button"
onClick={onClick}
className={cn(
'inline-flex items-center gap-2 rounded-full border px-3 py-2 text-sm transition-colors',
active
? 'border-border bg-card shadow-sm ring-1 ring-[var(--accent-orange)]/20'
: 'border-border/60 bg-card/85 text-foreground hover:border-border hover:bg-card',
)}
>
<span
className={cn(
'size-2 rounded-full',
active ? 'bg-[var(--accent-orange)]' : getStatusTone(agent.status),
)}
/>
<span className="truncate">{agent.name}</span>
</button>
)


@@ -1,70 +1,71 @@
import { Plus } from 'lucide-react'
import type { FC } from 'react'
import type { AgentCardData } from '@/lib/agent-conversations/types'
import type {
HarnessAdapterDescriptor,
HarnessAdapterHealth,
HarnessAgent,
HarnessAgentAdapter,
} from '@/entrypoints/app/agents/agent-harness-types'
import { cn } from '@/lib/utils'
import { AgentCardCompact, AgentCardExpanded } from './AgentCard'
import { HomeAgentCard } from './HomeAgentCard'
interface AgentCardDockProps {
agents: AgentCardData[]
agents: HarnessAgent[]
adapters: HarnessAdapterDescriptor[]
activeAgentId?: string
onSelectAgent: (agentId: string) => void
onCreateAgent?: () => void
compact?: boolean
}
function CreateAgentButton({
compact,
onCreateAgent,
}: {
compact?: boolean
onCreateAgent: () => void
}) {
function CreateAgentButton({ onCreateAgent }: { onCreateAgent: () => void }) {
return (
<button
type="button"
onClick={onCreateAgent}
className={cn(
'flex shrink-0 items-center justify-center gap-2 border border-dashed text-muted-foreground transition-colors hover:border-[var(--accent-orange)] hover:text-[var(--accent-orange)]',
compact
? 'rounded-full px-3 py-2 text-sm'
: 'min-h-32 rounded-2xl px-5 py-4',
'flex min-h-32 shrink-0 items-center justify-center gap-2 rounded-2xl border border-dashed px-5 py-4 text-muted-foreground transition-colors',
'hover:border-[var(--accent-orange)] hover:text-[var(--accent-orange)]',
)}
>
<Plus className={compact ? 'size-3.5' : 'size-5'} />
<span>{compact ? 'New' : 'Create agent'}</span>
<Plus className="size-5" />
<span>Create agent</span>
</button>
)
}
/**
* 3-column grid of HomeAgentCards plus a trailing "Create agent"
* tile. The previous `compact` mode (rendered a horizontal pill rail)
* had no callers and was dropped along with the legacy AgentCard.
*/
export const AgentCardDock: FC<AgentCardDockProps> = ({
agents,
adapters,
activeAgentId,
onSelectAgent,
onCreateAgent,
compact,
}) => {
if (agents.length === 0 && !onCreateAgent) return null
const Card = compact ? AgentCardCompact : AgentCardExpanded
const adapterHealth = new Map<HarnessAgentAdapter, HarnessAdapterHealth>()
for (const descriptor of adapters) {
if (descriptor.health) adapterHealth.set(descriptor.id, descriptor.health)
}
return (
<div
className={cn(
compact
? 'flex items-center gap-2 overflow-x-auto pb-1'
: 'grid gap-4 md:grid-cols-3',
)}
>
<div className="grid gap-4 md:grid-cols-3">
{agents.map((agent) => (
<Card
key={agent.agentId}
<HomeAgentCard
key={agent.id}
agent={agent}
active={agent.agentId === activeAgentId}
onClick={() => onSelectAgent(agent.agentId)}
adapter={agent.adapter}
adapterHealth={adapterHealth.get(agent.adapter) ?? null}
active={agent.id === activeAgentId}
onClick={() => onSelectAgent(agent.id)}
/>
))}
{onCreateAgent ? (
<CreateAgentButton compact={compact} onCreateAgent={onCreateAgent} />
<CreateAgentButton onCreateAgent={onCreateAgent} />
) : null}
</div>
)


@@ -2,6 +2,12 @@ import { ArrowLeft, Bot, Home } from 'lucide-react'
import { type FC, useEffect, useMemo, useRef } from 'react'
import { Navigate, useNavigate, useParams, useSearchParams } from 'react-router'
import { Button } from '@/components/ui/button'
import {
cancelHarnessTurn,
useEnqueueHarnessMessage,
useHarnessAgents,
useRemoveHarnessQueuedMessage,
} from '@/entrypoints/app/agents/useAgents'
import {
type AgentEntry,
getModelDisplayName,
@@ -15,6 +21,7 @@ import {
filterTurnsPersistedInHistory,
flattenHistoryPages,
} from './claw-chat-types'
import { QueuePanel } from './QueuePanel'
import { useAgentConversation } from './useAgentConversation'
import { useHarnessChatHistory } from './useHarnessChatHistory'
@@ -212,15 +219,33 @@ function AgentConversationController({
[historyMessages],
)
// Listing query feeds queue + active-turn state for this agent. We
// already poll it every 5s for the rail; reusing the same cache
// keeps cross-tab queue state in sync without a second poll.
const { harnessAgents } = useHarnessAgents()
const harnessAgent = harnessAgents.find((entry) => entry.id === agentId)
const queue = harnessAgent?.queue ?? []
const activeTurnId = harnessAgent?.activeTurnId ?? null
const { turns, streaming, send } = useAgentConversation(agentId, {
runtime: 'agent-harness',
sessionKey: null,
history: chatHistory,
activeTurnId,
onComplete: () => {
void harnessHistoryQuery.refetch()
},
onSessionKeyChange: () => {},
})
const enqueueMessage = useEnqueueHarnessMessage()
const removeQueuedMessage = useRemoveHarnessQueuedMessage()
const handleStop = () => {
void cancelHarnessTurn(agentId, {
turnId: activeTurnId ?? undefined,
reason: 'user pressed stop',
})
}
const visibleTurns = useMemo(
() => filterTurnsPersistedInHistory(turns, historyMessages),
[historyMessages, turns],
@@ -281,7 +306,15 @@ function AgentConversationController({
/>
<div className="border-border/50 border-t bg-background/88 px-4 py-3 backdrop-blur-md">
<div className="mx-auto max-w-3xl">
<div className="mx-auto max-w-3xl space-y-3">
{queue.length > 0 ? (
<QueuePanel
queue={queue}
onRemove={(messageId) =>
removeQueuedMessage.mutate({ agentId, messageId })
}
/>
) : null}
<ConversationInput
variant="conversation"
agents={agents}
@@ -296,14 +329,31 @@ function AgentConversationController({
name: a.name,
dataUrl: a.dataUrl,
}))
// When the agent already has an in-flight turn, route
// the new message into the durable queue instead of
// starting a parallel turn. Drains automatically as
// soon as the active turn ends.
if (streaming || activeTurnId) {
enqueueMessage.mutate({
agentId,
message: input.text,
attachments,
})
return
}
void send({ text: input.text, attachments, attachmentPreviews })
}}
onCreateAgent={() => navigate(createAgentPath)}
onStop={handleStop}
streaming={streaming}
disabled={disabled}
status="running"
attachmentsEnabled={true}
placeholder={`Message ${agentName}...`}
placeholder={
streaming
? `Type to queue another message for ${agentName}...`
: `Message ${agentName}...`
}
/>
</div>
</div>


@@ -1,18 +1,25 @@
import { Plus } from 'lucide-react'
import { type FC, useEffect, useState } from 'react'
import { type FC, useEffect, useMemo, useState } from 'react'
import { useNavigate } from 'react-router'
import { Button } from '@/components/ui/button'
import { Card, CardContent } from '@/components/ui/card'
import { Separator } from '@/components/ui/separator'
import type {
HarnessAdapterDescriptor,
HarnessAgent,
} from '@/entrypoints/app/agents/agent-harness-types'
import {
useAgentAdapters,
useHarnessAgents,
} from '@/entrypoints/app/agents/useAgents'
import type { AgentEntry } from '@/entrypoints/app/agents/useOpenClaw'
import { ImportDataHint } from '@/entrypoints/newtab/index/ImportDataHint'
import { SignInHint } from '@/entrypoints/newtab/index/SignInHint'
import { useActiveHint } from '@/entrypoints/newtab/index/useActiveHint'
import type { AgentCardData } from '@/lib/agent-conversations/types'
import { AgentCardDock } from './AgentCardDock'
import { useAgentCommandData } from './agent-command-layout'
import { ConversationInput } from './ConversationInput'
import { buildAgentCardData } from './useAgentCardData'
import { orderHomeAgents } from './home-agent-card.helpers'
function EmptyAgentsState({ onOpenAgents }: { onOpenAgents: () => void }) {
return (
@@ -38,11 +45,13 @@ function EmptyAgentsState({ onOpenAgents }: { onOpenAgents: () => void }) {
function RecentThreads({
activeAgentId,
agents,
adapters,
onOpenAgents,
onSelectAgent,
}: {
activeAgentId?: string | null
agents: AgentCardData[]
agents: HarnessAgent[]
adapters: HarnessAdapterDescriptor[]
onOpenAgents: () => void
onSelectAgent: (agentId: string) => void
}) {
@@ -68,6 +77,7 @@ function RecentThreads({
</div>
<AgentCardDock
agents={agents}
adapters={adapters}
activeAgentId={activeAgentId ?? undefined}
onSelectAgent={onSelectAgent}
onCreateAgent={onOpenAgents}
@@ -79,25 +89,32 @@ function RecentThreads({
export const AgentCommandHome: FC = () => {
const navigate = useNavigate()
const activeHint = useActiveHint()
const { agents, status } = useAgentCommandData()
// The conversation input still consumes the merged AgentEntry list
// from the layout context (handles legacy /claw/agents entries that
// haven't yet been backfilled into the harness store). The Recent
// Agents grid below reads the richer harness payload directly.
const { agents: legacyAgents, status } = useAgentCommandData()
const { harnessAgents } = useHarnessAgents()
const { adapters } = useAgentAdapters()
const [selectedAgentId, setSelectedAgentId] = useState<string | null>(null)
const cardData = buildAgentCardData(agents, status?.status, undefined)
const orderedAgents = useMemo(
() => orderHomeAgents(harnessAgents),
[harnessAgents],
)
useEffect(() => {
if (agents.length === 0) {
if (selectedAgentId) {
setSelectedAgentId(null)
}
if (legacyAgents.length === 0) {
if (selectedAgentId) setSelectedAgentId(null)
return
}
if (
!selectedAgentId ||
!agents.some((agent) => agent.agentId === selectedAgentId)
!legacyAgents.some((agent) => agent.agentId === selectedAgentId)
) {
setSelectedAgentId(agents[0].agentId)
setSelectedAgentId(legacyAgents[0].agentId)
}
}, [agents, selectedAgentId])
}, [legacyAgents, selectedAgentId])
const handleSend = (input: { text: string }) => {
if (!selectedAgentId) return
@@ -110,7 +127,7 @@ export const AgentCommandHome: FC = () => {
setSelectedAgentId(agent.agentId)
}
const selectedAgent = agents.find(
const selectedAgent = legacyAgents.find(
(agent) => agent.agentId === selectedAgentId,
)
const selectedAgentReady = selectedAgent
@@ -118,13 +135,15 @@ export const AgentCommandHome: FC = () => {
: false
const selectedAgentStatus =
selectedAgent?.source === 'agent-harness' ? 'running' : status?.status
const selectedCard =
cardData.find((agent) => agent.agentId === selectedAgentId) ?? cardData[0]
const selectedAgentName =
selectedAgent?.name ?? orderedAgents[0]?.name ?? 'your agent'
const hasAgents = legacyAgents.length > 0
return (
<div className="min-h-full px-4 py-6">
<div className="mx-auto flex w-full max-w-5xl flex-col gap-8">
{cardData.length > 0 ? (
{hasAgents ? (
<>
<div className="flex flex-col items-center gap-5 pt-[max(10vh,24px)] text-center">
<div className="space-y-3">
@@ -140,7 +159,7 @@ export const AgentCommandHome: FC = () => {
<div className="w-full max-w-3xl">
<ConversationInput
variant="home"
agents={agents}
agents={legacyAgents}
selectedAgentId={selectedAgentId}
onSelectAgent={handleSelectAgent}
onSend={handleSend}
@@ -151,7 +170,7 @@ export const AgentCommandHome: FC = () => {
attachmentsEnabled={false}
placeholder={
selectedAgentReady
? `Ask ${selectedCard?.name ?? 'your agent'} to handle a task...`
? `Ask ${selectedAgentName} to handle a task...`
: 'Agent runtime is not running...'
}
/>
@@ -162,7 +181,8 @@ export const AgentCommandHome: FC = () => {
<RecentThreads
activeAgentId={selectedAgentId}
agents={cardData}
agents={orderedAgents}
adapters={adapters}
onOpenAgents={() => navigate('/agents')}
onSelectAgent={(agentId) => navigate(`/home/agents/${agentId}`)}
/>


@@ -54,25 +54,40 @@ interface ConversationInputProps {
placeholder?: string
attachmentsEnabled?: boolean
variant?: 'home' | 'conversation'
/**
* When set, a Stop button surfaces to the left of the voice mic
* while `streaming === true`. Click cancels the active turn
* server-side via the chat-cancel endpoint. Absent → no Stop
* button (legacy behaviour for the home composer).
*/
onStop?: () => void
}
function InputActionButton({
disabled,
onClick,
streaming,
hasContent,
}: {
disabled: boolean
onClick: () => void
streaming: boolean
hasContent: boolean
}) {
// Show the spinner while streaming only when there's nothing to
// send — once the user types something, the icon flips back to the
// paper-plane so it reads as "queue this message" instead of
// "still working".
const showSpinner = streaming && !hasContent
return (
<Button
onClick={onClick}
size="icon"
disabled={disabled}
title={streaming && hasContent ? 'Queue message' : undefined}
className="h-10 w-10 flex-shrink-0 rounded-xl bg-primary text-primary-foreground hover:bg-primary/90"
>
{streaming ? (
{showSpinner ? (
<Loader2 className="h-5 w-5 animate-spin" />
) : (
<ArrowRight className="h-5 w-5" />
@@ -81,6 +96,22 @@ function InputActionButton({
)
}
function StopButton({ onStop }: { onStop: () => void }) {
return (
<Button
type="button"
size="icon"
variant="ghost"
onClick={onStop}
title="Stop current turn — queued messages will start next."
aria-label="Stop current turn"
className="h-8 w-8 flex-shrink-0 rounded-lg bg-destructive/10 text-destructive transition-colors hover:bg-destructive/15 hover:text-destructive"
>
<Square className="h-3.5 w-3.5 fill-current" />
</Button>
)
}
function VoiceButton({
isRecording,
isTranscribing,
@@ -299,6 +330,7 @@ export const ConversationInput: FC<ConversationInputProps> = ({
placeholder,
attachmentsEnabled = true,
variant = 'conversation',
onStop,
}) => {
const [input, setInput] = useState('')
const [selectedTabs, setSelectedTabs] = useState<chrome.tabs.Tab[]>([])
@@ -379,10 +411,17 @@ export const ConversationInput: FC<ConversationInputProps> = ({
}
const hasContent = input.trim().length > 0 || attachments.length > 0
// Queue-aware composers (the conversation panel passes `onStop`)
// accept input while streaming — the parent decides whether the
// submission opens a new turn or enqueues onto the active one.
// Surfaces without a Stop hook (home) keep the legacy behaviour
// and block input until the current turn finishes.
const queueAware = Boolean(onStop)
const handleSend = () => {
const text = input.trim()
if (disabled || isStaging || streaming) return
if (disabled || isStaging) return
if (streaming && !queueAware) return
if (!text && attachments.length === 0) return
onSend({ text, attachments })
setInput('')
@@ -512,6 +551,7 @@ export const ConversationInput: FC<ConversationInputProps> = ({
)}
/>
</div>
{streaming && onStop ? <StopButton onStop={onStop} /> : null}
<VoiceButton
isRecording={voice.isRecording}
isTranscribing={voice.isTranscribing}
@@ -529,12 +569,13 @@ export const ConversationInput: FC<ConversationInputProps> = ({
!!disabled ||
voice.isRecording ||
voice.isTranscribing ||
streaming
(streaming && !queueAware)
}
onClick={handleSend}
// Spinner stays the user-facing "agent is busy" hint; with the
// queue active we still spin while a turn is in flight.
streaming={streaming}
hasContent={hasContent}
/>
</div>
{voice.error ? (


@@ -0,0 +1,243 @@
import { Quote, TriangleAlert } from 'lucide-react'
import type { FC } from 'react'
import { Badge } from '@/components/ui/badge'
import {
HoverCard,
HoverCardContent,
HoverCardTrigger,
} from '@/components/ui/hover-card'
import { adapterLabel } from '@/entrypoints/app/agents/AdapterIcon'
import { formatRelativeTime } from '@/entrypoints/app/agents/agent-display.helpers'
import type {
HarnessAdapterHealth,
HarnessAgent,
HarnessAgentAdapter,
} from '@/entrypoints/app/agents/agent-harness-types'
import { AgentTile } from '@/entrypoints/app/agents/agent-row/AgentTile'
import {
firstNonBlankLine,
truncate,
} from '@/entrypoints/app/agents/agent-row/agent-row.helpers'
import type { AgentLiveness } from '@/entrypoints/app/agents/LivenessDot'
import { cn } from '@/lib/utils'
interface HomeAgentCardProps {
agent: HarnessAgent
adapter: HarnessAgentAdapter | 'unknown'
/** Per-adapter health snapshot, shared across cards rendering the
* same adapter. `null` when the /adapters response hasn't surfaced
* health yet (we treat that as healthy until proven otherwise). */
adapterHealth: HarnessAdapterHealth | null
/** Highlights the card with an accent ring; tells the user which
* agent the conversation input is bound to. */
active?: boolean
onClick: () => void
}
const PREVIEW_CHARS = 100
/**
* Grid-shaped card for the /home Recent agents section. Composition
* mirrors the rail's `AgentRowCard` but the layout is a vertical
* column sized for a 1/3-width tile rather than a full-width row.
*
* Reuses `<AgentTile>`, `<LivenessDot>`, `livenessDetail`,
* `formatRelativeTime`, `firstNonBlankLine`, `truncate`, and the
* inline `Unavailable` chip pattern so the visual language is
* continuous between rail and grid.
*/
export const HomeAgentCard: FC<HomeAgentCardProps> = ({
agent,
adapter,
adapterHealth,
active,
onClick,
}) => {
const status = agent.status ?? 'unknown'
const lastUsedAt = agent.lastUsedAt ?? null
const isWorking = status === 'working'
const isAsleep = status === 'asleep'
const isError = status === 'error'
const hasActiveTurn = Boolean(agent.activeTurnId)
return (
<button
type="button"
onClick={onClick}
className={cn(
'group flex min-h-32 w-full min-w-0 flex-col rounded-2xl border bg-card p-4 text-left shadow-sm transition-colors',
active && 'ring-1 ring-[var(--accent-orange)]/30',
isWorking
? 'border-[var(--accent-orange)]/40'
: isError
? 'border-destructive/30'
: 'border-border/60 hover:border-[var(--accent-orange)]/30',
)}
>
<div className="flex items-start gap-3">
<AgentTile adapter={adapter} status={status} lastUsedAt={lastUsedAt} />
<div className="min-w-0 flex-1">
<div className="flex items-center gap-1.5">
<span className="truncate font-semibold text-sm">
{displayName(agent)}
</span>
{isWorking && (
<Badge
variant="secondary"
className="ml-auto bg-amber-50 text-amber-900 hover:bg-amber-50"
>
Working
</Badge>
)}
</div>
<SummaryLine
adapter={adapter}
modelId={agent.modelId ?? null}
reasoningEffort={agent.reasoningEffort ?? null}
adapterHealth={adapterHealth}
/>
</div>
</div>
<LastMessage message={agent.lastUserMessage ?? null} />
<div className="mt-3 flex items-center justify-between gap-2 text-muted-foreground text-xs">
<span>{statusFootnote(status, lastUsedAt)}</span>
{hasActiveTurn ? (
<ResumeChip />
) : isAsleep ? (
<Badge variant="outline" className="text-muted-foreground">
Asleep
</Badge>
) : isError ? (
<ErrorChip lastError={agent.lastError ?? null} />
) : null}
</div>
</button>
)
}
const SummaryLine: FC<{
adapter: HarnessAgentAdapter | 'unknown'
modelId: string | null
reasoningEffort: string | null
adapterHealth: HarnessAdapterHealth | null
}> = ({ adapter, modelId, reasoningEffort, adapterHealth }) => {
const parts = [adapterLabel(adapter)]
if (modelId) parts.push(modelId)
if (reasoningEffort) parts.push(reasoningEffort)
const unhealthy = adapterHealth?.healthy === false
return (
<div
className={cn(
'mt-0.5 flex items-center gap-1.5 text-muted-foreground text-xs',
unhealthy && 'text-muted-foreground/70',
)}
>
<span className="truncate">{parts.join(' · ')}</span>
{unhealthy && (
<HoverCard openDelay={200}>
<HoverCardTrigger asChild>
<Badge
variant="outline"
className="h-5 cursor-default gap-1 border-amber-500/40 bg-amber-50 px-1.5 text-amber-900 hover:bg-amber-50"
>
<TriangleAlert className="size-2.5" />
<span className="font-normal">Unavailable</span>
</Badge>
</HoverCardTrigger>
<HoverCardContent side="right" className="w-72 text-sm">
<div className="font-medium">
{adapterLabel(adapter)} CLI not available
</div>
<div className="mt-1 text-muted-foreground text-xs">
{adapterHealth?.reason ??
'Adapter binary missing on $PATH. Install it from the adapter docs to use this agent.'}
</div>
</HoverCardContent>
</HoverCard>
)}
</div>
)
}
const LastMessage: FC<{ message: string | null }> = ({ message }) => {
if (!message) {
return (
<p className="mt-3 flex-1 text-muted-foreground/70 text-xs italic">
No messages yet. Start a chat
</p>
)
}
return (
<p className="mt-3 line-clamp-2 flex flex-1 items-start gap-1.5 text-foreground/85 text-sm italic leading-snug">
<Quote
className="mt-1 size-3 shrink-0 text-muted-foreground/60"
aria-hidden
/>
<span className="line-clamp-2">
{truncate(firstNonBlankLine(message), PREVIEW_CHARS)}
</span>
</p>
)
}
const ResumeChip: FC = () => (
<span className="inline-flex items-center gap-1.5 rounded-full bg-[var(--accent-orange)] px-2.5 py-0.5 font-medium text-[11px] text-white shadow-sm">
<span className="relative flex size-1.5">
<span className="absolute inline-flex h-full w-full animate-ping rounded-full bg-white/70 opacity-75" />
<span className="relative inline-flex size-1.5 rounded-full bg-white" />
</span>
Resume
</span>
)
const ErrorChip: FC<{ lastError: string | null }> = ({ lastError }) => {
if (!lastError) {
return <Badge variant="destructive">Attention</Badge>
}
return (
<HoverCard openDelay={200}>
<HoverCardTrigger asChild>
<Badge variant="destructive" className="cursor-default">
Attention
</Badge>
</HoverCardTrigger>
<HoverCardContent
side="left"
className="max-w-xs whitespace-pre-wrap font-mono text-xs"
>
{lastError}
</HoverCardContent>
</HoverCard>
)
}
/**
* Footer left side: relative time on every state EXCEPT working,
* which shows `now` (the dot is already pulsing — restating it as
* "Working" would duplicate the pill in the title row).
*/
function statusFootnote(
status: AgentLiveness,
lastUsedAt: number | null,
): string {
if (status === 'working') return 'now'
return formatRelativeTime(lastUsedAt)
}
const UUID_PATTERN =
/^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i
const OC_UUID_PATTERN =
/^oc-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i
function displayName(agent: HarnessAgent): string {
const name = agent.name?.trim()
const id = agent.id
if (!name || name === id) {
if (OC_UUID_PATTERN.test(id)) return id.slice(0, 11)
if (UUID_PATTERN.test(id)) return id.slice(0, 8)
return id
}
return name
}


@@ -0,0 +1,94 @@
import { ListPlus, X } from 'lucide-react'
import type { FC } from 'react'
import {
Queue,
QueueItem,
QueueItemAction,
QueueItemActions,
QueueItemAttachment,
QueueItemContent,
QueueItemFile,
QueueItemImage,
QueueList,
QueueSection,
QueueSectionContent,
QueueSectionLabel,
QueueSectionTrigger,
} from '@/components/ai-elements/queue'
import type {
HarnessQueuedMessage,
HarnessQueuedMessageAttachment,
} from '@/entrypoints/app/agents/agent-harness-types'
import { firstNonBlankLine } from '@/entrypoints/app/agents/agent-row/agent-row.helpers'
interface QueuePanelProps {
queue: HarnessQueuedMessage[]
onRemove: (messageId: string) => void
}
/**
* Renders the agent's pending message queue using the shared AI
* Elements `Queue` primitives. Caller is expected to gate render on
* `queue.length > 0` — when empty, this returns null so the panel
* disappears cleanly between turns.
*/
export const QueuePanel: FC<QueuePanelProps> = ({ queue, onRemove }) => {
if (queue.length === 0) return null
return (
<Queue>
<QueueSection>
<QueueSectionTrigger>
<QueueSectionLabel
count={queue.length}
label={queue.length === 1 ? 'queued message' : 'queued messages'}
icon={<ListPlus className="size-3.5" />}
/>
</QueueSectionTrigger>
<QueueSectionContent>
<QueueList>
{queue.map((entry) => (
<QueueItem key={entry.id}>
<div className="flex items-center gap-2">
<QueueItemContent>
{firstNonBlankLine(entry.message)}
</QueueItemContent>
<QueueItemActions>
<QueueItemAction
aria-label="Remove from queue"
onClick={() => onRemove(entry.id)}
>
<X className="size-3" />
</QueueItemAction>
</QueueItemActions>
</div>
{entry.attachments && entry.attachments.length > 0 ? (
<QueueItemAttachment>
{entry.attachments.map((attachment, idx) =>
renderAttachment(entry.id, attachment, idx),
)}
</QueueItemAttachment>
) : null}
</QueueItem>
))}
</QueueList>
</QueueSectionContent>
</QueueSection>
</Queue>
)
}
function renderAttachment(
messageId: string,
attachment: HarnessQueuedMessageAttachment,
idx: number,
) {
if (attachment.mediaType.startsWith('image/')) {
const src = `data:${attachment.mediaType};base64,${attachment.data}`
return <QueueItemImage key={`${messageId}-${idx}`} src={src} />
}
return (
<QueueItemFile key={`${messageId}-${idx}`}>
{attachment.mediaType}
</QueueItemFile>
)
}


@@ -0,0 +1,69 @@
import { describe, expect, it } from 'bun:test'
import type { HarnessAgent } from '@/entrypoints/app/agents/agent-harness-types'
import { orderHomeAgents } from './home-agent-card.helpers'
function agent(overrides: Partial<HarnessAgent>): HarnessAgent {
return {
id: overrides.id ?? 'agent-x',
name: overrides.name ?? overrides.id ?? 'agent-x',
adapter: overrides.adapter ?? 'codex',
permissionMode: 'approve-all',
sessionKey: `agent:${overrides.id ?? 'agent-x'}:main`,
createdAt: 1000,
updatedAt: 1000,
...overrides,
}
}
describe('orderHomeAgents', () => {
it('places active-turn agents before everyone else', () => {
const sorted = orderHomeAgents([
agent({ id: 'a', lastUsedAt: 5000 }),
agent({ id: 'b', lastUsedAt: 9000, activeTurnId: 'turn-1' }),
agent({ id: 'c', lastUsedAt: 7000 }),
])
expect(sorted.map((a) => a.id)).toEqual(['b', 'c', 'a'])
})
it('orders non-active agents by lastUsedAt desc', () => {
const sorted = orderHomeAgents([
agent({ id: 'old', lastUsedAt: 1000 }),
agent({ id: 'new', lastUsedAt: 9000 }),
agent({ id: 'mid', lastUsedAt: 5000 }),
])
expect(sorted.map((a) => a.id)).toEqual(['new', 'mid', 'old'])
})
it('puts the gateway `main` seed agent above other never-used agents', () => {
const sorted = orderHomeAgents([
agent({ id: 'oc-aaaaaa', lastUsedAt: null }),
agent({ id: 'main', lastUsedAt: null }),
agent({ id: 'oc-bbbbbb', lastUsedAt: null }),
])
expect(sorted.map((a) => a.id)).toEqual(['main', 'oc-aaaaaa', 'oc-bbbbbb'])
})
it('sends never-used agents to the bottom even when `main` is among them', () => {
const sorted = orderHomeAgents([
agent({ id: 'main', lastUsedAt: null }),
agent({ id: 'used', lastUsedAt: 5000 }),
])
expect(sorted.map((a) => a.id)).toEqual(['used', 'main'])
})
it('does NOT sort by pinned — pinned agents are treated like any other', () => {
const sorted = orderHomeAgents([
agent({ id: 'unpinned-recent', lastUsedAt: 9000, pinned: false }),
agent({ id: 'pinned-old', lastUsedAt: 1000, pinned: true }),
])
expect(sorted.map((a) => a.id)).toEqual(['unpinned-recent', 'pinned-old'])
})
it('falls back to id-stable ordering when lastUsedAt ties', () => {
const sorted = orderHomeAgents([
agent({ id: 'b', lastUsedAt: 5000 }),
agent({ id: 'a', lastUsedAt: 5000 }),
])
expect(sorted.map((a) => a.id)).toEqual(['a', 'b'])
})
})

View File


@@ -0,0 +1,42 @@
import type { HarnessAgent } from '@/entrypoints/app/agents/agent-harness-types'
/**
* Order for the /home Recent agents grid.
*
* 1. Active turn first — agents mid-turn float to the top so the
* Resume affordance is the first thing the user sees on /home.
* 2. The protected gateway-side `main` agent stays pinned-to-top in
* the never-used group on a fresh install (mirrors the rail).
* 3. Recency (`lastUsedAt` desc).
* 4. `id` tiebreaker for stability so the grid doesn't reshuffle on
* every 5-second poll.
*
* Pin is NOT a sort key. The home grid is action-oriented and trusts
* recency + active-turn to surface the right agent; pinning is an
* organisation tool that lives on the rail at /agents.
*/
export function orderHomeAgents(agents: HarnessAgent[]): HarnessAgent[] {
return [...agents].sort((a, b) => {
const aActive = a.activeTurnId != null
const bActive = b.activeTurnId != null
if (aActive !== bActive) return aActive ? -1 : 1
// Recency wins outright. Never-used agents (`lastUsedAt == null`)
// both fall to the same `-Infinity` bucket and the seed/id rules
// below decide their order — but a used agent always beats any
// never-used agent regardless of id.
const aValue = a.lastUsedAt ?? Number.NEGATIVE_INFINITY
const bValue = b.lastUsedAt ?? Number.NEGATIVE_INFINITY
if (aValue !== bValue) return bValue - aValue
// Inside the never-used (or exact-tie) group: pin the gateway
// `main` seed to the top of the group on a fresh install, then
// fall back to id-stable order so the grid doesn't reshuffle on
// every poll.
const aSeed = a.id === 'main' && a.lastUsedAt == null
const bSeed = b.id === 'main' && b.lastUsedAt == null
if (aSeed !== bSeed) return aSeed ? -1 : 1
return a.id.localeCompare(b.id)
})
}
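The four rules in the doc comment can be exercised standalone. This is a trimmed, self-contained mirror of the comparator above — the `MiniAgent` shape keeps only the fields the sort actually reads, not the full `HarnessAgent` record:

```typescript
// Minimal mirror of the /home ordering rules, for illustration only.
interface MiniAgent {
  id: string
  activeTurnId?: string | null
  lastUsedAt?: number | null
}

function orderSketch(agents: MiniAgent[]): MiniAgent[] {
  return [...agents].sort((a, b) => {
    // Rule 1: agents mid-turn float to the top.
    const aActive = a.activeTurnId != null
    const bActive = b.activeTurnId != null
    if (aActive !== bActive) return aActive ? -1 : 1
    // Rule 3: recency desc; never-used agents fall to -Infinity.
    const aValue = a.lastUsedAt ?? Number.NEGATIVE_INFINITY
    const bValue = b.lastUsedAt ?? Number.NEGATIVE_INFINITY
    if (aValue !== bValue) return bValue - aValue
    // Rule 2: inside the never-used tie group, the `main` seed leads.
    const aSeed = a.id === 'main' && a.lastUsedAt == null
    const bSeed = b.id === 'main' && b.lastUsedAt == null
    if (aSeed !== bSeed) return aSeed ? -1 : 1
    // Rule 4: id tiebreaker keeps the grid poll-stable.
    return a.id.localeCompare(b.id)
  })
}

const homeOrder = orderSketch([
  { id: 'b', lastUsedAt: 5000 },
  { id: 'main', lastUsedAt: null },
  { id: 'a', lastUsedAt: 5000, activeTurnId: 't1' },
  { id: 'z', lastUsedAt: null },
])
// active 'a' first, then used 'b', then seed 'main', then 'z'
```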

View File


@@ -1,53 +0,0 @@
import {
type AgentEntry,
getModelDisplayName,
type OpenClawStatus,
} from '@/entrypoints/app/agents/useOpenClaw'
import type { AgentCardData } from '@/lib/agent-conversations/types'
import type { AgentOverview } from './useAgentDashboard'
function resolveAgentStatus(
gatewayStatus: OpenClawStatus['status'] | undefined,
liveStatus: AgentOverview['status'] | undefined,
): AgentCardData['status'] {
// Gateway-level errors take precedence
if (gatewayStatus === 'error') return 'error'
if (gatewayStatus === 'starting') return 'working'
// Per-agent live status from the WS observer
if (liveStatus === 'working') return 'working'
if (liveStatus === 'error') return 'error'
return 'idle'
}
/**
* Build agent card display data by merging the raw agent entries from
* the gateway with enriched overview data from the dashboard API.
*
* Pure function — no hooks, no IndexedDB, no async.
*/
export function buildAgentCardData(
agents: AgentEntry[],
status: OpenClawStatus['status'] | undefined,
dashboard: AgentOverview[] | undefined,
): AgentCardData[] {
return agents.map((agent) => {
const overview = dashboard?.find((d) => d.agentId === agent.agentId)
return {
agentId: agent.agentId,
name: agent.name,
model: getModelDisplayName(agent.model),
status:
agent.source === 'agent-harness'
? 'idle'
: resolveAgentStatus(status, overview?.status),
lastMessage: overview?.latestMessage?.slice(0, 200) ?? undefined,
lastMessageTimestamp: overview?.latestMessageAt ?? undefined,
activitySummary: overview?.activitySummary ?? undefined,
currentTool: overview?.currentTool ?? undefined,
costUsd: overview?.totalCostUsd ?? undefined,
}
})
}

View File


@@ -36,6 +36,15 @@ interface UseAgentConversationOptions {
history?: OpenClawChatHistoryMessage[]
onComplete?: () => void
onSessionKeyChange?: (sessionKey: string) => void
/**
* Server-side active turn id, surfaced via the listing query. When
* this changes from null/<id> to a different non-null id while we
* aren't already streaming (e.g. the server just popped a queued
* message and started a new turn), the hook reattaches via
* /chat/active so the chat panel picks up the live stream without
* waiting for a remount.
*/
activeTurnId?: string | null
}
export function useAgentConversation(
@@ -211,31 +220,46 @@ export function useAgentConversation(
}
processEventRef.current = processAgentHarnessStreamEvent
// On mount (and whenever the agent changes), check whether the
// server has an in-flight turn for this agent and reattach to it.
// This is what makes the chat resilient across tab close/reopen,
// refresh, and navigation: the runtime call kept running on the
// server while we were away. Effect only depends on `agentId` —
// the event handler is read off a ref so this doesn't re-subscribe
// every render.
const activeTurnIdDep = options.activeTurnId ?? null
// On mount, on agent change, and whenever the listing reports a
// *new* active turn id, check whether the server has an in-flight
// turn for this agent and reattach to it. This catches three
// cases at once: the chat resilience flow (tab close/reopen),
// navigation between agents, AND queue drain (the server starts a
// new turn from a queued message → activeTurnId flips → attach).
useEffect(() => {
let cancelled = false
const abortController = new AbortController()
// Reference the dep inside the body so biome's exhaustive-deps
// rule sees it consumed; the value is just an "any non-null
// active turn id" trigger — the actual id we attach to comes
// from the fresh fetchActiveHarnessTurn call below.
void activeTurnIdDep
const attemptResume = async () => {
// Track whether *we* started a stream in this run. When the
// early-return paths fire (no active turn, or a `send()` /
// earlier resume already owns `streamAbortRef`), the finally
// block must NOT touch streaming/turnIdRef/lastSeqRef —
// otherwise we clobber the in-flight stream's state and the
// Stop button drops out mid-turn while events keep arriving.
let weStartedStream = false
try {
const active = await fetchActiveHarnessTurn(agentId)
if (cancelled || !active || active.status !== 'running') return
if (streamAbortRef.current) return // a fresh send already in flight
if (streamAbortRef.current) return // someone else already owns the stream
// Stage a placeholder turn so the streamed events have a row
// to render into. We don't have the user message text on
// resume; the assistant turn is what we're catching up on.
// to render into. The server now persists the kicking-off
// prompt on the active turn, so we render it as the user
// bubble immediately — no empty-bubble flicker when a queued
// message starts running.
setTurns((prev) => [
...prev,
{
id: crypto.randomUUID(),
userText: '',
userText: active.prompt ?? '',
parts: [],
done: false,
timestamp: active.startedAt,
@@ -247,6 +271,7 @@ export function useAgentConversation(
lastSeqRef.current = null
streamAbortRef.current = abortController
setStreaming(true)
weStartedStream = true
const response = await attachToHarnessTurn(agentId, {
turnId: active.turnId,
@@ -265,10 +290,20 @@ export function useAgentConversation(
// Resume is best-effort; transient errors fall back to the
// user starting a new turn manually.
} finally {
if (!cancelled) {
if (streamAbortRef.current === abortController) {
streamAbortRef.current = null
}
// Always release `streamAbortRef` if we owned it — even when
// the effect was cancelled mid-stream (a listing poll
// captured the next queue-drain turn id, for example). If we
// don't, the next effect run hits `if (streamAbortRef.current)
// return` against our now-aborted controller and never
// reattaches, leaving `streaming === true` with no live stream.
if (weStartedStream && streamAbortRef.current === abortController) {
streamAbortRef.current = null
}
// The other state (streaming flag, turn id, lastSeq) is the
// *current run's* lifecycle: only reset it on a clean exit.
// When `cancelled` is true the next run will set these
// itself, so resetting here would only cause a brief flicker.
if (!cancelled && weStartedStream) {
turnIdRef.current = null
lastSeqRef.current = null
setStreaming(false)
@@ -281,7 +316,7 @@ export function useAgentConversation(
cancelled = true
abortController.abort()
}
}, [agentId])
}, [agentId, activeTurnIdDep])
const send = async (input: string | SendInput) => {
const normalized: SendInput =
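The `finally`-block comments above describe an ownership rule worth isolating: the shared ref holds the live stream's `AbortController`, and a run may release it only if it both started the stream and still finds its own controller there. A self-contained sketch (names here are illustrative, not the hook's real internals):

```typescript
// Sketch of the stream-ownership release rule described above.
const streamRef: { current: AbortController | null } = { current: null }

function finishRun(mine: AbortController, weStarted: boolean) {
  // Even a cancelled run must release the ref it owns — otherwise the
  // next run hits a stale, already-aborted controller on its
  // `if (streamRef.current) return` guard and never reattaches.
  if (weStarted && streamRef.current === mine) {
    streamRef.current = null
  }
}

// Run 1 starts a stream, then gets cancelled (e.g. a listing poll
// reported a new activeTurnId and the effect re-ran).
const run1 = new AbortController()
streamRef.current = run1
run1.abort()
finishRun(run1, true)

// Run 2 can now take ownership instead of bailing on the stale ref.
const run2 = new AbortController()
const run2CanAttach = streamRef.current === null
if (run2CanAttach) streamRef.current = run2
```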

View File


@@ -1,95 +0,0 @@
import { useQuery, useQueryClient } from '@tanstack/react-query'
import { useEffect } from 'react'
import { useAgentServerUrl } from '@/lib/browseros/useBrowserOSProviders'
export interface AgentOverview {
agentId: string
status: 'working' | 'idle' | 'error' | 'unknown'
latestMessage: string | null
latestMessageAt: number | null
activitySummary: string | null
currentTool: string | null
totalCostUsd: number
sessionCount: number
}
export interface DashboardResponse {
agents: AgentOverview[]
summary: {
totalAgents: number
totalCostUsd: number
}
}
interface StatusEvent {
agentId: string
status: AgentOverview['status']
currentTool: string | null
error: string | null
timestamp: number
}
const DASHBOARD_QUERY_KEY = ['claw', 'dashboard']
export function useAgentDashboard(enabled: boolean) {
const { baseUrl, isLoading: urlLoading } = useAgentServerUrl()
const queryClient = useQueryClient()
const ready = enabled && Boolean(baseUrl) && !urlLoading
// Initial data load + periodic refresh as fallback
const query = useQuery<DashboardResponse>({
queryKey: [...DASHBOARD_QUERY_KEY, baseUrl],
queryFn: async () => {
const url = new URL('/claw/dashboard', baseUrl as string)
const response = await fetch(url.toString())
if (!response.ok) throw new Error('Failed to fetch dashboard')
return response.json()
},
enabled: ready,
})
// SSE subscription for real-time status patches
useEffect(() => {
if (!ready || !baseUrl) return
const streamUrl = new URL('/claw/dashboard/stream', baseUrl)
const eventSource = new EventSource(streamUrl.toString())
eventSource.addEventListener('snapshot', (event) => {
try {
const dashboard = JSON.parse(event.data) as DashboardResponse
queryClient.setQueryData([...DASHBOARD_QUERY_KEY, baseUrl], dashboard)
} catch {}
})
eventSource.addEventListener('status', (event) => {
try {
const status = JSON.parse(event.data) as StatusEvent
queryClient.setQueryData<DashboardResponse>(
[...DASHBOARD_QUERY_KEY, baseUrl],
(prev) => {
if (!prev) return prev
return {
...prev,
agents: prev.agents.map((agent) =>
agent.agentId === status.agentId
? {
...agent,
status: status.status,
currentTool: status.currentTool,
}
: agent,
),
}
},
)
} catch {}
})
return () => {
eventSource.close()
}
}, [ready, baseUrl, queryClient])
return query
}

View File


@@ -2,67 +2,87 @@ import { Loader2 } from 'lucide-react'
import { type FC, useMemo } from 'react'
import { AgentRowCard } from './AgentRowCard'
import { AgentsEmptyState } from './AgentsEmptyState'
import type { HarnessAgent, HarnessAgentAdapter } from './agent-harness-types'
import type {
HarnessAdapterDescriptor,
HarnessAgent,
HarnessAgentAdapter,
} from './agent-harness-types'
import type {
AgentAdapterHealth,
AgentRowData,
} from './agent-row/agent-row.types'
import type { AgentListItem } from './agents-page-types'
import type { AgentLiveness } from './LivenessDot'
interface AgentListProps {
agents: AgentListItem[]
/**
* Optional per-agent activity metadata. Keyed by `agentId`. Missing
* entries fall back to status='unknown' / lastUsedAt=null and the
* row renders an "unknown" dot. The server will populate this once
* the activity tracker ships; the page works without it.
*/
/** Optional per-agent activity metadata, keyed by `agentId`. */
activity?: Record<
string,
{ status: AgentLiveness; lastUsedAt: number | null }
>
/**
* Lookup table from harness agent id → adapter + reasoning effort,
* sourced from `useHarnessAgents`. Lets the row card render the
* correct adapter icon and chips for harness agents (legacy
* /claw/agents entries fall back to inferring from `runtimeLabel`).
*/
/** Lookup table from harness id → enriched agent record. */
harnessAgentLookup?: Map<string, HarnessAgent>
/** Adapter catalog (carries per-adapter health). */
adapters: HarnessAdapterDescriptor[]
loading: boolean
deletingAgentKey: string | null
onCreateAgent: () => void
onDeleteAgent: (agent: AgentListItem) => void
onPinToggle: (agent: AgentListItem, next: boolean) => void
}
export const AgentList: FC<AgentListProps> = ({
agents,
activity,
harnessAgentLookup,
adapters,
loading,
deletingAgentKey,
onCreateAgent,
onDeleteAgent,
onPinToggle,
}) => {
// Sort by recency: most recently used first; never-used agents drop
// to the bottom in id-stable order so the list doesn't reshuffle on
// every refresh. The pinned exception is the gateway's `main` agent
// when it's never been touched — keep it at the top so a fresh
// install has an obvious starting point.
const adapterHealth = useMemo(() => {
const map = new Map<HarnessAgentAdapter, AgentAdapterHealth>()
for (const adapter of adapters) {
if (adapter.health) {
map.set(adapter.id, {
healthy: adapter.health.healthy,
reason: adapter.health.reason,
})
}
}
return map
}, [adapters])
// Sort: pinned rows first, then most recently used, then never-used
// agents in id-stable order. The gateway's `main` agent stays
// pinned-to-top when never touched so a fresh install has an
// obvious starting point.
const ordered = useMemo(() => {
const withScore = agents.map((agent) => {
const lastUsedAt = activity?.[agent.agentId]?.lastUsedAt ?? null
return { agent, lastUsedAt }
const withMeta = agents.map((agent) => {
const harness = harnessAgentLookup?.get(agent.agentId)
return {
agent,
pinned: harness?.pinned ?? false,
lastUsedAt: activity?.[agent.agentId]?.lastUsedAt ?? null,
}
})
return withScore
return withMeta
.sort((a, b) => {
const aPinned = a.agent.agentId === 'main' && a.lastUsedAt === null
const bPinned = b.agent.agentId === 'main' && b.lastUsedAt === null
if (aPinned && !bPinned) return -1
if (!aPinned && bPinned) return 1
if (a.pinned !== b.pinned) return a.pinned ? -1 : 1
const aSeed = a.agent.agentId === 'main' && a.lastUsedAt === null
const bSeed = b.agent.agentId === 'main' && b.lastUsedAt === null
if (aSeed && !bSeed) return -1
if (!aSeed && bSeed) return 1
const aValue = a.lastUsedAt ?? -Infinity
const bValue = b.lastUsedAt ?? -Infinity
if (aValue !== bValue) return bValue - aValue
return a.agent.agentId.localeCompare(b.agent.agentId)
})
.map((entry) => entry.agent)
}, [activity, agents])
}, [activity, agents, harnessAgentLookup])
if (loading && agents.length === 0) {
return (
@@ -80,18 +100,23 @@ export const AgentList: FC<AgentListProps> = ({
<div className="grid gap-3">
{ordered.map((agent) => {
const harness = harnessAgentLookup?.get(agent.agentId)
const adapter: HarnessAgentAdapter | undefined =
const adapter: HarnessAgentAdapter | 'unknown' =
harness?.adapter ?? inferAdapterFromLabel(agent.runtimeLabel)
const data = buildRowData({
agent,
adapter,
harness,
activity: activity?.[agent.agentId],
adapterHealth:
adapterHealth.get(adapter as HarnessAgentAdapter) ?? null,
})
return (
<AgentRowCard
key={agent.key}
agent={agent}
status={activity?.[agent.agentId]?.status}
lastUsedAt={activity?.[agent.agentId]?.lastUsedAt}
adapter={adapter}
reasoningEffort={harness?.reasoningEffort ?? null}
onDelete={onDeleteAgent}
data={data}
deleting={deletingAgentKey === agent.key}
onDelete={onDeleteAgent}
onPinToggle={onPinToggle}
/>
)
})}
@@ -99,10 +124,53 @@ export const AgentList: FC<AgentListProps> = ({
)
}
function inferAdapterFromLabel(label: string): HarnessAgentAdapter | undefined {
function inferAdapterFromLabel(label: string): HarnessAgentAdapter | 'unknown' {
const lower = label?.toLowerCase()
if (lower === 'claude code') return 'claude'
if (lower === 'codex') return 'codex'
if (lower === 'openclaw') return 'openclaw'
return undefined
return 'unknown'
}
const ZERO_BUCKETS = (): number[] => Array.from({ length: 14 }, () => 0)
function buildRowData(input: {
agent: AgentListItem
adapter: HarnessAgentAdapter | 'unknown'
harness: HarnessAgent | undefined
activity: { status: AgentLiveness; lastUsedAt: number | null } | undefined
adapterHealth: AgentAdapterHealth | null
}): AgentRowData {
const { agent, adapter, harness, activity, adapterHealth } = input
return {
agent,
adapter,
modelLabel: deriveModelLabel(agent, harness),
reasoningEffort: harness?.reasoningEffort ?? null,
status: activity?.status ?? 'unknown',
lastUsedAt: activity?.lastUsedAt ?? harness?.lastUsedAt ?? null,
pinned: harness?.pinned ?? false,
cwd: harness?.cwd ?? null,
lastUserMessage: harness?.lastUserMessage ?? null,
tokens: harness?.tokens ?? null,
turnsByDay: harness?.turnsByDay ?? ZERO_BUCKETS(),
failedByDay: harness?.failedByDay ?? ZERO_BUCKETS(),
lastError: harness?.lastError ?? null,
lastErrorAt: harness?.lastErrorAt ?? null,
activeTurnId: harness?.activeTurnId ?? null,
adapterHealth,
}
}
function deriveModelLabel(
agent: AgentListItem,
harness: HarnessAgent | undefined,
): string | null {
// Prefer the agent rail's modelLabel when meaningful; harness's
// modelId is a stable identifier but the rail's `modelLabel`
// already maps to a friendly display string.
if (agent.modelLabel && agent.modelLabel !== 'default') {
return agent.modelLabel
}
return harness?.modelId ?? null
}

View File


@@ -1,270 +1,99 @@
import {
Copy,
Loader2,
MessageSquare,
MoreHorizontal,
Pencil,
RotateCcw,
Trash2,
} from 'lucide-react'
import type { FC } from 'react'
import { useNavigate } from 'react-router'
import { toast } from 'sonner'
import { Badge } from '@/components/ui/badge'
import { Button } from '@/components/ui/button'
import {
DropdownMenu,
DropdownMenuContent,
DropdownMenuItem,
DropdownMenuSeparator,
DropdownMenuTrigger,
} from '@/components/ui/dropdown-menu'
import {
Tooltip,
TooltipContent,
TooltipProvider,
TooltipTrigger,
} from '@/components/ui/tooltip'
import { cn } from '@/lib/utils'
import { AdapterIcon, adapterLabel } from './AdapterIcon'
import {
canDelete as canDeleteAgent,
canRename as canRenameAgent,
displayName,
formatRelativeTime,
workspaceLabel,
} from './agent-display.helpers'
import type { HarnessAgentAdapter } from './agent-harness-types'
import type { AgentListItem } from './agents-page-types'
import { type AgentLiveness, LivenessDot } from './LivenessDot'
import { AgentActions } from './agent-row/AgentActions'
import { AgentErrorPanel } from './agent-row/AgentErrorPanel'
import { AgentLastMessage } from './agent-row/AgentLastMessage'
import { AgentMetaRow } from './agent-row/AgentMetaRow'
import { AgentSummaryChips } from './agent-row/AgentSummaryChips'
import { AgentTile } from './agent-row/AgentTile'
import { AgentTitleRow } from './agent-row/AgentTitleRow'
import type {
AgentRowCallbacks,
AgentRowData,
} from './agent-row/agent-row.types'
interface AgentRowCardProps {
agent: AgentListItem
/**
* Per-agent extras the listing surface provides on top of the
* minimal `AgentListItem` shape. `lastUsedAt` survives server
* restart (sourced from acpx session record); `status` is in-memory
* server-side.
*/
status?: AgentLiveness
lastUsedAt?: number | null
/** Adapter the agent belongs to. Drives icon + label. */
adapter?: HarnessAgentAdapter
/** Reasoning effort chip (claude/codex/openclaw catalog). */
reasoningEffort?: string | null
/** Modeled directly off the inbound delete handler so the parent owns the dialog. */
onDelete: (agent: AgentListItem) => void
/** Whether THIS agent is mid-delete; renders a spinner in place of the trash icon. */
interface AgentRowCardProps extends AgentRowCallbacks {
data: AgentRowData
/** Whether THIS agent is mid-delete; renders a spinner in the menu. */
deleting?: boolean
}
/**
* Composition shell for the agent rail. Owns no state; sub-components
* each handle their own micro-state (error-panel collapse, etc.) and
* emit callbacks (delete, pin/unpin) for the page to act on.
*
* The whole card carries state — not just the tile — so the row's
* border subtly tells the user what's going on at a glance:
* working → accent-orange border with a soft glow
* error → destructive border
* idle → muted border, lifts on hover
*/
export const AgentRowCard: FC<AgentRowCardProps> = ({
agent,
status = 'unknown',
lastUsedAt,
adapter,
reasoningEffort,
onDelete,
data,
deleting,
onDelete,
onPinToggle,
}) => {
const navigate = useNavigate()
const adapterId = adapter ?? inferAdapterFromListItem(agent)
const workspace = workspaceLabel(agent)
const lastUsedLabel = formatRelativeTime(lastUsedAt ?? null)
const allowDelete = canDeleteAgent(agent)
const allowRename = canRenameAgent(agent)
const handleChat = () => navigate(`/agents/${agent.agentId}`)
const handleCopyId = async () => {
try {
await navigator.clipboard.writeText(agent.agentId)
toast.success('Agent id copied')
} catch {
toast.error('Could not copy agent id')
}
}
return (
<div
className={cn(
'group rounded-xl border border-border bg-card p-4 shadow-sm transition-all',
'hover:border-[var(--accent-orange)]/50 hover:shadow-sm',
// Layout-stable hover. No translate, no shadow change — both
// visibly perturb neighbouring rows. Only the border tint
// shifts on hover, and the rail's vertical rhythm stays
// exactly the same in every state.
'group rounded-xl border bg-card p-4 shadow-sm transition-colors',
data.status === 'working'
? 'border-[var(--accent-orange)]/40'
: data.status === 'error'
? 'border-destructive/40'
: 'border-border hover:border-[var(--accent-orange)]/30',
)}
>
<div className="flex items-start gap-4">
{/* Adapter tile + liveness dot in the corner. */}
<div className="relative shrink-0">
<div className="flex h-12 w-12 items-center justify-center rounded-xl bg-muted text-muted-foreground">
<AdapterIcon adapter={adapterId} className="h-6 w-6" />
</div>
<LivenessDot
status={status}
detail={livenessDetail(status, lastUsedAt)}
className="absolute -right-0.5 -bottom-0.5"
/>
</div>
<AgentTile
adapter={data.adapter}
status={data.status}
lastUsedAt={data.lastUsedAt}
/>
<div className="min-w-0 flex-1">
<div className="mb-1 flex items-center gap-2">
<span className="truncate font-semibold">{displayName(agent)}</span>
{status === 'working' && (
<Badge
variant="secondary"
className="bg-amber-50 text-amber-900 hover:bg-amber-50"
>
Working
</Badge>
)}
{status === 'asleep' && (
<Badge variant="outline" className="text-muted-foreground">
Asleep
</Badge>
)}
{status === 'error' && (
<Badge variant="destructive">Attention</Badge>
)}
</div>
<AgentTitleRow
agent={data.agent}
status={data.status}
pinned={data.pinned}
turnsByDay={data.turnsByDay}
failedByDay={data.failedByDay}
onPinToggle={(next) => onPinToggle(data.agent, next)}
/>
<div className="mb-2 flex flex-wrap items-center gap-1.5 text-xs">
<Badge variant="secondary" className="font-normal">
{adapterLabel(adapterId)}
</Badge>
{agent.modelLabel && agent.modelLabel !== 'default' && (
<Badge variant="outline" className="font-normal">
{agent.modelLabel}
</Badge>
)}
{reasoningEffort && reasoningEffort !== 'medium' && (
<Badge variant="outline" className="font-normal">
{reasoningEffort}
</Badge>
)}
</div>
<AgentSummaryChips
adapter={data.adapter}
modelLabel={data.modelLabel}
reasoningEffort={data.reasoningEffort}
adapterHealth={data.adapterHealth}
/>
<div className="flex flex-wrap items-center gap-2 text-muted-foreground text-xs">
<span>Last used {lastUsedLabel}</span>
{workspace && (
<>
<span aria-hidden>·</span>
<span className="truncate font-mono" title={workspace}>
{workspace}
</span>
</>
)}
</div>
<AgentLastMessage message={data.lastUserMessage} />
<AgentMetaRow lastUsedAt={data.lastUsedAt} tokens={data.tokens} />
{data.status === 'error' && data.lastError && (
<AgentErrorPanel
agentId={data.agent.agentId}
message={data.lastError}
errorAt={data.lastErrorAt}
/>
)}
</div>
<div className="flex shrink-0 items-center gap-2">
<Button variant="outline" size="sm" onClick={handleChat}>
<MessageSquare className="mr-1.5 h-3 w-3" />
Chat
</Button>
<DropdownMenu>
<DropdownMenuTrigger asChild>
<Button
variant="ghost"
size="icon"
aria-label={`More actions for ${displayName(agent)}`}
className="h-8 w-8"
>
<MoreHorizontal className="h-4 w-4" />
</Button>
</DropdownMenuTrigger>
<DropdownMenuContent align="end" className="w-44">
<DropdownMenuItem onSelect={() => void handleCopyId()}>
<Copy className="mr-2 h-3.5 w-3.5" />
Copy id
</DropdownMenuItem>
<RenameMenuItem disabled={!allowRename} />
<ResetHistoryMenuItem />
<DropdownMenuSeparator />
<DropdownMenuItem
onSelect={() => onDelete(agent)}
disabled={!allowDelete || deleting}
className="text-destructive focus:text-destructive"
>
{deleting ? (
<Loader2 className="mr-2 h-3.5 w-3.5 animate-spin" />
) : (
<Trash2 className="mr-2 h-3.5 w-3.5" />
)}
Delete
</DropdownMenuItem>
</DropdownMenuContent>
</DropdownMenu>
</div>
<AgentActions
agent={data.agent}
activeTurnId={data.activeTurnId}
deleting={deleting}
onDelete={onDelete}
/>
</div>
</div>
)
}
const RenameMenuItem: FC<{ disabled: boolean }> = ({ disabled }) => {
const item = (
<DropdownMenuItem disabled className="text-muted-foreground">
<Pencil className="mr-2 h-3.5 w-3.5" />
Rename
</DropdownMenuItem>
)
if (!disabled) return item
// Disabled but with a hint so users know it's coming, not broken.
return (
<TooltipProvider delayDuration={300}>
<Tooltip>
<TooltipTrigger asChild>
<span className="block w-full">{item}</span>
</TooltipTrigger>
<TooltipContent side="left" className="text-xs">
Rename coming soon
</TooltipContent>
</Tooltip>
</TooltipProvider>
)
}
const ResetHistoryMenuItem: FC = () => {
const item = (
<DropdownMenuItem disabled className="text-muted-foreground">
<RotateCcw className="mr-2 h-3.5 w-3.5" />
Reset history
</DropdownMenuItem>
)
return (
<TooltipProvider delayDuration={300}>
<Tooltip>
<TooltipTrigger asChild>
<span className="block w-full">{item}</span>
</TooltipTrigger>
<TooltipContent side="left" className="text-xs">
Reset history coming soon
</TooltipContent>
</Tooltip>
</TooltipProvider>
)
}
function inferAdapterFromListItem(
agent: AgentListItem,
): HarnessAgentAdapter | 'unknown' {
const label = agent.runtimeLabel?.toLowerCase()
if (label?.includes('claude')) return 'claude'
if (label?.includes('codex')) return 'codex'
if (label?.includes('openclaw')) return 'openclaw'
return 'unknown'
}
function livenessDetail(
status: AgentLiveness,
lastUsedAt: number | null | undefined,
): string | undefined {
if (lastUsedAt == null) return undefined
const diffMin = Math.floor((Date.now() - lastUsedAt) / 60_000)
if (status === 'idle') return `Idle for ${Math.max(0, diffMin)} min`
if (status === 'asleep') {
if (diffMin < 60) return `Asleep — quiet for ${diffMin} min`
const hr = Math.floor(diffMin / 60)
return `Asleep — quiet for ${hr} hr`
}
if (status === 'working') return 'Working on a turn'
if (status === 'error') return 'Attention — last turn failed'
return undefined
}

View File


@@ -44,6 +44,7 @@ import {
useCreateHarnessAgent,
useDeleteHarnessAgent,
useHarnessAgents,
useUpdateHarnessAgent,
} from './useAgents'
import { useOpenClawAgents, useOpenClawMutations } from './useOpenClaw'
@@ -76,6 +77,7 @@ export const AgentsPage: FC = () => {
} = useOpenClawAgents(openClawAgentsEnabled)
const createHarnessAgent = useCreateHarnessAgent()
const deleteHarnessAgent = useDeleteHarnessAgent()
const updateHarnessAgent = useUpdateHarnessAgent()
const {
setupOpenClaw,
createAgent: createOpenClawAgent,
@@ -342,12 +344,24 @@ export const AgentsPage: FC = () => {
agents={agentListItems}
activity={agentActivity}
harnessAgentLookup={harnessAgentLookup}
adapters={adapters}
loading={agentsLoading}
deletingAgentKey={deletingAgent ? deletingAgentKey : null}
onCreateAgent={() => setCreateOpen(true)}
onDeleteAgent={(agent) => {
void handleDelete(agent)
}}
onPinToggle={(agent, next) => {
// Optimistic mutation; harness-only — gateway-original
// OpenClaw entries are gated server-side via the harness
// backfill, so we only fire when the row maps to a
// harness agent record.
if (!harnessAgentLookup.has(agent.agentId)) return
updateHarnessAgent.mutate({
agentId: agent.agentId,
patch: { pinned: next },
})
}}
/>
<SetupOpenClawDialog

View File


@@ -1,4 +1,5 @@
import type { AgentListItem } from './agents-page-types'
import type { AgentLiveness } from './LivenessDot'
/**
* Display rules for the redesigned agent rows. Pure helpers — no React,
@@ -82,3 +83,25 @@ export function formatRelativeTime(epochMs: number | null): string {
const d = Math.floor(diff / ONE_DAY)
return d === 1 ? '1 day ago' : `${d} days ago`
}
/**
* Tooltip-friendly description of a row's current liveness state.
* Returns `undefined` when the state has nothing extra to add (e.g.
* `unknown` with no timestamp).
*/
export function livenessDetail(
status: AgentLiveness,
lastUsedAt: number | null | undefined,
): string | undefined {
if (lastUsedAt == null) return undefined
const diffMin = Math.floor((Date.now() - lastUsedAt) / 60_000)
if (status === 'idle') return `Idle for ${Math.max(0, diffMin)} min`
if (status === 'asleep') {
if (diffMin < 60) return `Asleep — quiet for ${diffMin} min`
const hr = Math.floor(diffMin / 60)
return `Asleep — quiet for ${hr} hr`
}
if (status === 'working') return 'Working on a turn'
if (status === 'error') return 'Attention — last turn failed'
return undefined
}
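The time math in the asleep branch above switches units at the hour mark: minutes until 60, then whole hours via `Math.floor`. A minimal mirror of just that bucketing, isolated for illustration (the real helper also takes the liveness status):

```typescript
// Mirror of the asleep-duration bucketing in livenessDetail above.
function quietLabel(diffMin: number): string {
  if (diffMin < 60) return `Asleep — quiet for ${diffMin} min`
  const hr = Math.floor(diffMin / 60)
  return `Asleep — quiet for ${hr} hr`
}

const underAnHour = quietLabel(45)
const overAnHour = quietLabel(150) // 150 min floors to 2 hr
```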

View File


@@ -56,6 +56,43 @@ export interface HarnessAgent {
* agents. Drives the recency sort and the "Last used X min ago" copy.
*/
lastUsedAt?: number | null
/** Pinned agents float to the top of the list. Defaults to `false`. */
pinned?: boolean
/** First non-blank line of the most recent user message; null if none. */
lastUserMessage?: string | null
/** Working directory the agent runs in; null when no session record yet. */
cwd?: string | null
/** Cumulative + 7-day rolling token usage; null when no record. */
tokens?: {
last7d: { input: number; output: number; requestCount: number }
cumulative: { input: number; output: number }
} | null
turnsByDay?: number[]
failedByDay?: number[]
lastError?: string | null
lastErrorAt?: number | null
/** When non-null, an in-flight turn this row can be resumed from. */
activeTurnId?: string | null
/** Persistent FIFO queue of messages waiting for this agent. */
queue?: HarnessQueuedMessage[]
}
export interface HarnessQueuedMessageAttachment {
mediaType: string
data: string
}
export interface HarnessQueuedMessage {
id: string
createdAt: number
message: string
attachments?: ReadonlyArray<HarnessQueuedMessageAttachment>
}
export interface HarnessAdapterHealth {
healthy: boolean
reason?: string
checkedAt: number
}
export interface HarnessAdapterDescriptor {
@@ -66,6 +103,7 @@ export interface HarnessAdapterDescriptor {
modelControl: 'runtime-supported' | 'best-effort'
models: Array<{ id: string; label: string; recommended?: boolean }>
reasoningEfforts: Array<{ id: string; label: string; recommended?: boolean }>
health?: HarnessAdapterHealth
}
export interface CreateHarnessAgentInput {

View File


@@ -0,0 +1,160 @@
import {
Copy,
Loader2,
MessageSquare,
MoreHorizontal,
Pencil,
RotateCcw,
Trash2,
} from 'lucide-react'
import type { FC } from 'react'
import { useNavigate } from 'react-router'
import { toast } from 'sonner'
import { Button } from '@/components/ui/button'
import {
DropdownMenu,
DropdownMenuContent,
DropdownMenuItem,
DropdownMenuSeparator,
DropdownMenuTrigger,
} from '@/components/ui/dropdown-menu'
import {
Tooltip,
TooltipContent,
TooltipProvider,
TooltipTrigger,
} from '@/components/ui/tooltip'
import {
canDelete as canDeleteAgent,
canRename as canRenameAgent,
displayName,
} from '../agent-display.helpers'
import type { AgentListItem } from '../agents-page-types'
interface AgentActionsProps {
agent: AgentListItem
activeTurnId: string | null
deleting?: boolean
onDelete: (agent: AgentListItem) => void
}
/**
* Single primary CTA per row: `Resume` (filled, accent-orange, with a
* pulsing dot) when an active turn exists; otherwise `Chat` (outline).
* Both navigate to the same place — the chat hook auto-attaches via
* `/chat/active` when there's a live turn — but the row signals which
* action the user is actually taking.
*/
export const AgentActions: FC<AgentActionsProps> = ({
agent,
activeTurnId,
deleting,
onDelete,
}) => {
const navigate = useNavigate()
const allowDelete = canDeleteAgent(agent)
const allowRename = canRenameAgent(agent)
const handleChat = () => navigate(`/agents/${agent.agentId}`)
const handleCopyId = async () => {
try {
await navigator.clipboard.writeText(agent.agentId)
toast.success('Agent id copied')
} catch {
toast.error('Could not copy agent id')
}
}
return (
<div className="flex shrink-0 items-center gap-1.5">
{activeTurnId ? (
<Button
variant="default"
size="sm"
onClick={handleChat}
className="gap-2 bg-[var(--accent-orange)] text-white shadow-sm hover:bg-[var(--accent-orange)]/90"
>
<span className="relative flex size-2">
<span className="absolute inline-flex h-full w-full animate-ping rounded-full bg-white/70 opacity-75" />
<span className="relative inline-flex size-2 rounded-full bg-white" />
</span>
Resume
</Button>
) : (
<Button variant="outline" size="sm" onClick={handleChat}>
<MessageSquare className="mr-1.5 size-3" />
Chat
</Button>
)}
<DropdownMenu>
<DropdownMenuTrigger asChild>
<Button
variant="ghost"
size="icon"
aria-label={`More actions for ${displayName(agent)}`}
className="size-8 text-muted-foreground hover:text-foreground"
>
<MoreHorizontal className="size-4" />
</Button>
</DropdownMenuTrigger>
<DropdownMenuContent align="end" className="w-44">
<DropdownMenuItem onSelect={() => void handleCopyId()}>
<Copy className="mr-2 size-3.5" />
Copy id
</DropdownMenuItem>
<ComingSoonItem
icon={Pencil}
label="Rename"
disabled={!allowRename}
/>
<ComingSoonItem icon={RotateCcw} label="Reset history" disabled />
<DropdownMenuSeparator />
<DropdownMenuItem
onSelect={() => onDelete(agent)}
disabled={!allowDelete || deleting}
className="text-destructive focus:text-destructive"
>
{deleting ? (
<Loader2 className="mr-2 size-3.5 animate-spin" />
) : (
<Trash2 className="mr-2 size-3.5" />
)}
Delete
</DropdownMenuItem>
</DropdownMenuContent>
</DropdownMenu>
</div>
)
}
interface ComingSoonItemProps {
icon: typeof Pencil
label: string
disabled: boolean
}
const ComingSoonItem: FC<ComingSoonItemProps> = ({
icon: Icon,
label,
disabled,
}) => {
const item = (
<DropdownMenuItem disabled className="text-muted-foreground">
<Icon className="mr-2 size-3.5" />
{label}
</DropdownMenuItem>
)
if (!disabled) return item
return (
<TooltipProvider delayDuration={300}>
<Tooltip>
<TooltipTrigger asChild>
<span className="block w-full">{item}</span>
</TooltipTrigger>
<TooltipContent side="left" className="text-xs">
{label} coming soon
</TooltipContent>
</Tooltip>
</TooltipProvider>
)
}

View File

@@ -0,0 +1,96 @@
import { AlertTriangle, ChevronDown } from 'lucide-react'
import { type FC, useEffect, useState } from 'react'
import { Button } from '@/components/ui/button'
import {
Collapsible,
CollapsibleContent,
CollapsibleTrigger,
} from '@/components/ui/collapsible'
import {
HoverCard,
HoverCardContent,
HoverCardTrigger,
} from '@/components/ui/hover-card'
import { cn } from '@/lib/utils'
import { truncate } from './agent-row.helpers'
interface AgentErrorPanelProps {
agentId: string
message: string
errorAt: number | null
}
const STORAGE_PREFIX = 'agent-row:lastErrorSeenAt:'
const PREVIEW_CHARS = 200
export const AgentErrorPanel: FC<AgentErrorPanelProps> = ({
agentId,
message,
errorAt,
}) => {
const storageKey = `${STORAGE_PREFIX}${agentId}`
// Open if we've never seen this `errorAt` for this agent. Once the
// user collapses the panel (or refreshes after seeing it), we mark
// it seen so it doesn't re-pop on every poll.
const [open, setOpen] = useState<boolean>(() => {
if (typeof window === 'undefined' || !errorAt) return true
const seen = Number(window.localStorage.getItem(storageKey) ?? 0)
return !Number.isFinite(seen) || errorAt > seen
})
useEffect(() => {
if (!open && errorAt && typeof window !== 'undefined') {
window.localStorage.setItem(storageKey, String(errorAt))
}
}, [open, errorAt, storageKey])
const preview = truncate(message, PREVIEW_CHARS)
const truncated = preview.length < message.length
return (
<Collapsible open={open} onOpenChange={setOpen} className="mt-3">
<div className="flex items-center justify-between rounded-md border border-destructive/30 bg-destructive/5 px-3 py-2">
<div className="flex items-center gap-2 font-medium text-destructive text-xs">
<AlertTriangle className="size-3.5" />
Last error
</div>
<CollapsibleTrigger asChild>
<Button
variant="ghost"
size="sm"
className="h-6 px-2 text-muted-foreground"
>
<span className="text-xs">{open ? 'hide' : 'show'}</span>
<ChevronDown
className={cn(
'ml-1 size-3 transition-transform',
open && 'rotate-180',
)}
/>
</Button>
</CollapsibleTrigger>
</div>
<CollapsibleContent>
<div className="mt-1 rounded-md border-destructive/30 border-x border-b bg-destructive/5 px-3 pb-2 text-xs">
{truncated ? (
<HoverCard openDelay={300}>
<HoverCardTrigger asChild>
<span className="cursor-default font-mono text-foreground/80">
{preview}
</span>
</HoverCardTrigger>
<HoverCardContent
side="bottom"
className="max-w-md whitespace-pre-wrap font-mono text-xs"
>
{message}
</HoverCardContent>
</HoverCard>
) : (
<span className="font-mono text-foreground/80">{message}</span>
)}
</div>
</CollapsibleContent>
</Collapsible>
)
}
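
The open-by-default rule above — stay open until the user has dismissed this particular `errorAt` — reduces to a small pure predicate over the stored marker. A sketch (the `shouldOpen` name is illustrative; the component inlines this in its `useState` initializer):

```typescript
// True when the panel should start open: no timestamp to compare
// against, an unparseable stored marker, or a stored marker that
// predates the current error.
function shouldOpen(errorAt: number | null, storedSeenAt: string | null): boolean {
  if (!errorAt) return true
  const seen = Number(storedSeenAt ?? 0)
  return !Number.isFinite(seen) || errorAt > seen
}
```

Collapsing the panel writes `String(errorAt)` back under the agent's key, so the same error never re-pops on a later poll.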

View File

@@ -0,0 +1,35 @@
import { Quote } from 'lucide-react'
import type { FC } from 'react'
import { firstNonBlankLine, truncate } from './agent-row.helpers'
interface AgentLastMessageProps {
message: string | null
}
const PREVIEW_CHARS = 110
/**
* Inline preview of the most recent user message. Renders as a quoted,
* italic line so the row reads like a conversation snippet rather than
* a label-and-value pair. No hover-card — opening the agent's chat is
* the canonical way to read the full message.
*/
export const AgentLastMessage: FC<AgentLastMessageProps> = ({ message }) => {
if (!message) {
return (
<p className="mt-1 text-muted-foreground/70 text-xs italic">
No messages yet · start a chat
</p>
)
}
const preview = truncate(firstNonBlankLine(message), PREVIEW_CHARS)
return (
<p className="mt-1.5 flex items-start gap-1.5 text-foreground/85 text-sm italic leading-snug">
<Quote
className="mt-1 size-3 shrink-0 text-muted-foreground/60"
aria-hidden
/>
<span className="truncate">{preview}</span>
</p>
)
}

View File

@@ -0,0 +1,37 @@
import type { FC } from 'react'
import { formatRelativeTime } from '../agent-display.helpers'
import { AgentTokenSummary } from './AgentTokenSummary'
import type { AgentTokenUsage } from './agent-row.types'
interface AgentMetaRowProps {
lastUsedAt: number | null
tokens: AgentTokenUsage | null
}
/**
* Bottom-of-row meta line. Intentionally sparse — last activity time
* and lifetime tokens. CWD is no longer surfaced here because the path
* the server happens to be running from isn't actionable; if a future
* surface needs the cwd (chat panel, debug view) it reads from the
* listing payload directly.
*/
export const AgentMetaRow: FC<AgentMetaRowProps> = ({ lastUsedAt, tokens }) => {
const lastUsedLabel = formatRelativeTime(lastUsedAt)
const tokensTotal =
(tokens?.cumulative.input ?? 0) + (tokens?.cumulative.output ?? 0)
const showTokens = tokensTotal > 0
return (
<div className="mt-2 flex flex-wrap items-center gap-x-2 text-muted-foreground text-xs">
<span>{lastUsedLabel}</span>
{showTokens && (
<>
<span aria-hidden className="text-muted-foreground/50">
·
</span>
<AgentTokenSummary tokens={tokens} />
</>
)}
</div>
)
}

View File

@@ -0,0 +1,92 @@
import type { FC } from 'react'
import {
HoverCard,
HoverCardContent,
HoverCardTrigger,
} from '@/components/ui/hover-card'
import { cn } from '@/lib/utils'
import { formatLocalDate, ROW_BAR_COUNT } from './agent-row.helpers'
interface AgentSparklineProps {
/** 14 entries, oldest → newest. Today's bucket is the last index. */
turnsByDay: number[]
/** Same length, same order. Failed turns counted separately. */
failedByDay: number[]
className?: string
}
const MIN_BAR_HEIGHT_PX = 2
const MAX_BAR_HEIGHT_PX = 18
export const AgentSparkline: FC<AgentSparklineProps> = ({
turnsByDay,
failedByDay,
className,
}) => {
if (turnsByDay.length === 0 || turnsByDay.every((n) => n === 0)) return null
const max = Math.max(1, ...turnsByDay)
return (
<HoverCard openDelay={250}>
<HoverCardTrigger asChild>
<div
role="img"
aria-label={`Last ${ROW_BAR_COUNT} days of activity`}
className={cn('flex h-5 items-end gap-px', className)}
>
{turnsByDay.map((count, idx) => {
const ratio = count / max
const height = Math.max(
MIN_BAR_HEIGHT_PX,
Math.round(ratio * MAX_BAR_HEIGHT_PX),
)
const isToday = idx === ROW_BAR_COUNT - 1
const failed = failedByDay[idx] ?? 0
return (
<div
// biome-ignore lint/suspicious/noArrayIndexKey: fixed-length sparkline buckets keyed by day position
key={`bar-${idx}`}
className={cn(
'w-1.5 rounded-sm',
count === 0
? 'bg-muted-foreground/15'
: failed > 0
? 'bg-destructive/50'
: 'bg-[var(--accent-orange)]/50',
isToday && 'ring-1 ring-foreground/30',
)}
style={{ height }}
/>
)
})}
</div>
</HoverCardTrigger>
<HoverCardContent side="left" className="w-56 text-xs">
<div className="mb-2 font-medium text-sm">Last 14 days</div>
<ul className="space-y-0.5">
{turnsByDay.map((count, idx) => {
const failed = failedByDay[idx] ?? 0
const dayLabel = formatLocalDate(idx)
return (
<li
// biome-ignore lint/suspicious/noArrayIndexKey: fixed-length list keyed by day position
key={`day-${idx}`}
className="flex items-center justify-between text-muted-foreground"
>
<span>{dayLabel}</span>
<span>
{count}
{failed > 0 && (
<span className="ml-1 text-destructive">
({failed} failed)
</span>
)}
</span>
</li>
)
})}
</ul>
</HoverCardContent>
</HoverCard>
)
}
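
The per-bar sizing above is a linear map from a day's turn count into a 2–18 px band, with the floor keeping zero-count days visible. Isolated as a pure function (the `barHeight` name is illustrative):

```typescript
const MIN_BAR_HEIGHT_PX = 2
const MAX_BAR_HEIGHT_PX = 18

// Height in px for one bucket, given the max count across all buckets.
// The Math.max(1, max) guard avoids division by zero on empty data,
// and the floor keeps zero-count days rendered at 2px.
function barHeight(count: number, max: number): number {
  const ratio = count / Math.max(1, max)
  return Math.max(MIN_BAR_HEIGHT_PX, Math.round(ratio * MAX_BAR_HEIGHT_PX))
}
```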

View File

@@ -0,0 +1,71 @@
import { TriangleAlert } from 'lucide-react'
import type { FC } from 'react'
import { Badge } from '@/components/ui/badge'
import {
HoverCard,
HoverCardContent,
HoverCardTrigger,
} from '@/components/ui/hover-card'
import { cn } from '@/lib/utils'
import { adapterLabel } from '../AdapterIcon'
import type { HarnessAgentAdapter } from '../agent-harness-types'
import type { AgentAdapterHealth } from './agent-row.types'
interface AgentSummaryChipsProps {
adapter: HarnessAgentAdapter | 'unknown'
modelLabel: string | null
reasoningEffort: string | null
/** When unhealthy, the adapter label dims and a warning chip appears. */
adapterHealth: AgentAdapterHealth | null
}
/**
* Adapter / model / reasoning summary line. Always rendered (so OpenClaw
* rows that fall back to defaults still expose what they're set up to do)
* and surfaces adapter-health *only when unhealthy* — keeping the calm
* default state silent and reserving visual noise for things the user
* needs to act on.
*/
export const AgentSummaryChips: FC<AgentSummaryChipsProps> = ({
adapter,
modelLabel,
reasoningEffort,
adapterHealth,
}) => {
const parts = [adapterLabel(adapter)]
if (modelLabel) parts.push(modelLabel)
if (reasoningEffort) parts.push(reasoningEffort)
const unhealthy = adapterHealth?.healthy === false
return (
<div
className={cn(
'flex items-center gap-1.5 text-muted-foreground text-xs',
unhealthy && 'text-muted-foreground/70',
)}
>
<span className="truncate">{parts.join(' · ')}</span>
{unhealthy && adapterHealth && (
<HoverCard openDelay={200}>
<HoverCardTrigger asChild>
<Badge
variant="outline"
className="h-5 cursor-default gap-1 border-amber-500/40 bg-amber-50 px-1.5 text-amber-900 hover:bg-amber-50"
>
<TriangleAlert className="size-2.5" />
<span className="font-normal">Unavailable</span>
</Badge>
</HoverCardTrigger>
<HoverCardContent side="right" className="w-72 text-sm">
<div className="font-medium">
{adapterLabel(adapter)} CLI not available
</div>
<div className="mt-1 text-muted-foreground text-xs">
{adapterHealth.reason ??
'Adapter binary missing on $PATH. Install it from the adapter docs to use this agent.'}
</div>
</HoverCardContent>
</HoverCard>
)}
</div>
)
}

View File

@@ -0,0 +1,37 @@
import type { FC } from 'react'
import { cn } from '@/lib/utils'
import { AdapterIcon } from '../AdapterIcon'
import { livenessDetail } from '../agent-display.helpers'
import type { HarnessAgentAdapter } from '../agent-harness-types'
import { type AgentLiveness, LivenessDot } from '../LivenessDot'
export interface AgentTileProps {
adapter: HarnessAgentAdapter | 'unknown'
status: AgentLiveness
lastUsedAt: number | null
}
/**
* Adapter glyph + a single liveness dot. Adapter health is no longer
* surfaced here — it lives as an inline pill inside `AgentSummaryChips`
* so the user isn't asked to disambiguate two dots on the same tile.
*/
export const AgentTile: FC<AgentTileProps> = ({
adapter,
status,
lastUsedAt,
}) => (
<div className="relative shrink-0">
<div className="flex h-12 w-12 items-center justify-center rounded-xl bg-muted text-muted-foreground">
<AdapterIcon adapter={adapter} className="h-6 w-6" />
</div>
<LivenessDot
status={status}
detail={livenessDetail(status, lastUsedAt)}
className={cn(
'absolute -right-0.5 -bottom-0.5',
status === 'working' && 'animate-pulse',
)}
/>
</div>
)

View File

@@ -0,0 +1,55 @@
import type { FC } from 'react'
import { Badge } from '@/components/ui/badge'
import { displayName } from '../agent-display.helpers'
import type { AgentListItem } from '../agents-page-types'
import type { AgentLiveness } from '../LivenessDot'
import { AgentSparkline } from './AgentSparkline'
import { PinToggle } from './PinToggle'
interface AgentTitleRowProps {
agent: AgentListItem
status: AgentLiveness
pinned: boolean
turnsByDay: number[]
failedByDay: number[]
onPinToggle: (next: boolean) => void
}
/**
* Title strip: name + status badge + (right-aligned) sparkline. The
* pin toggle sits trailing the title so the title always flushes left
* regardless of pin state — moving the star left of the title indents
* the row's first line off-axis from the model/preview/meta lines
* below it. When unpinned and not hovered, the toggle is removed from
* layout entirely so it reserves no space at all.
*/
export const AgentTitleRow: FC<AgentTitleRowProps> = ({
agent,
status,
pinned,
turnsByDay,
failedByDay,
onPinToggle,
}) => (
<div className="mb-1 flex items-center gap-2">
<span className="truncate font-semibold">{displayName(agent)}</span>
{status === 'working' && (
<Badge
variant="secondary"
className="bg-amber-50 text-amber-900 hover:bg-amber-50"
>
Working
</Badge>
)}
{status === 'asleep' && (
<Badge variant="outline" className="text-muted-foreground">
Asleep
</Badge>
)}
{status === 'error' && <Badge variant="destructive">Attention</Badge>}
<PinToggle pinned={pinned} onToggle={onPinToggle} />
<div className="ml-auto">
<AgentSparkline turnsByDay={turnsByDay} failedByDay={failedByDay} />
</div>
</div>
)

View File

@@ -0,0 +1,63 @@
import type { FC } from 'react'
import {
HoverCard,
HoverCardContent,
HoverCardTrigger,
} from '@/components/ui/hover-card'
import { Progress } from '@/components/ui/progress'
import { formatTokens } from './agent-row.helpers'
import type { AgentTokenUsage } from './agent-row.types'
interface AgentTokenSummaryProps {
tokens: AgentTokenUsage | null
}
/**
* Inline token total + a HoverCard breakdown. Surfaces lifetime tokens
* (the only window we can compute reliably from the session record).
* Per-window stats land in a follow-up once the activity ledger ships.
*/
export const AgentTokenSummary: FC<AgentTokenSummaryProps> = ({ tokens }) => {
if (!tokens) return null
const { input, output } = tokens.cumulative
const total = input + output
if (total === 0) return null
const inputPct = (input / total) * 100
return (
<HoverCard openDelay={200}>
<HoverCardTrigger asChild>
<span className="cursor-default text-muted-foreground tabular-nums transition-colors hover:text-foreground">
{formatTokens(total)} tokens
</span>
</HoverCardTrigger>
<HoverCardContent side="top" align="end" className="w-72 text-sm">
<div className="mb-3 flex items-center justify-between">
<span className="font-medium">Lifetime tokens</span>
<span className="text-muted-foreground text-xs tabular-nums">
{formatTokens(total)} total
</span>
</div>
<div className="space-y-2">
<div className="flex items-center justify-between text-xs">
<span className="text-muted-foreground">Input</span>
<span className="tabular-nums">{formatTokens(input)}</span>
</div>
<Progress value={inputPct} className="h-1.5" />
<div className="mt-2 flex items-center justify-between text-xs">
<span className="text-muted-foreground">Output</span>
<span className="tabular-nums">{formatTokens(output)}</span>
</div>
<Progress value={100 - inputPct} className="h-1.5" />
</div>
<p className="mt-3 border-t pt-2 text-muted-foreground text-xs leading-snug">
Cumulative across every turn this agent has run. Per-window stats
arrive in a future release.
</p>
</HoverCardContent>
</HoverCard>
)
}

View File

@@ -0,0 +1,60 @@
import { Star } from 'lucide-react'
import type { FC } from 'react'
import { Button } from '@/components/ui/button'
import {
Tooltip,
TooltipContent,
TooltipProvider,
TooltipTrigger,
} from '@/components/ui/tooltip'
import { cn } from '@/lib/utils'
interface PinToggleProps {
pinned: boolean
onToggle: (next: boolean) => void
}
/**
* Trailing star toggle. The button is *always rendered* — only its
* opacity changes between pinned/unpinned/hover states — so the title
* row's height is constant. Hiding the slot via `display: none` would
* collapse the row's vertical metrics on hover and shift every card
* below in the rail.
*
* Placement is trailing the title (after the status badge) so the
* title itself flushes left regardless of pin state — leading the
* row with the star would indent the title relative to the model /
* preview / meta lines beneath it.
*/
export const PinToggle: FC<PinToggleProps> = ({ pinned, onToggle }) => (
<TooltipProvider delayDuration={300}>
<Tooltip>
<TooltipTrigger asChild>
<Button
variant="ghost"
size="icon"
className={cn(
'size-6 text-muted-foreground transition-opacity hover:text-foreground',
pinned ? 'opacity-100' : 'opacity-0 group-hover:opacity-100',
)}
aria-pressed={pinned}
aria-label={pinned ? 'Unpin agent' : 'Pin agent'}
onClick={(event) => {
event.stopPropagation()
onToggle(!pinned)
}}
>
<Star
className={cn(
'size-3.5',
pinned && 'fill-amber-400 text-amber-500',
)}
/>
</Button>
</TooltipTrigger>
<TooltipContent side="top" className="text-xs">
{pinned ? 'Unpin' : 'Pin to top'}
</TooltipContent>
</Tooltip>
</TooltipProvider>
)

View File

@@ -0,0 +1,73 @@
import { describe, expect, it } from 'bun:test'
import {
firstNonBlankLine,
formatLocalDate,
formatTokens,
ROW_BAR_COUNT,
truncate,
} from './agent-row.helpers'
describe('formatTokens', () => {
it('renders zero / NaN as "0"', () => {
expect(formatTokens(0)).toBe('0')
expect(formatTokens(Number.NaN)).toBe('0')
})
it('renders sub-1K as integer', () => {
expect(formatTokens(142)).toBe('142')
})
it('renders K with one decimal under 10', () => {
expect(formatTokens(8_400)).toBe('8.4K')
})
it('drops the decimal at >=10K', () => {
expect(formatTokens(120_000)).toBe('120K')
})
it('renders M with one decimal under 10', () => {
expect(formatTokens(1_200_000)).toBe('1.2M')
})
})
describe('firstNonBlankLine', () => {
it('returns the first non-blank line', () => {
expect(firstNonBlankLine('\n\nhello\nworld')).toBe('hello')
})
it('skips USER_QUERY envelope tags', () => {
expect(firstNonBlankLine('<USER_QUERY>\nfix tests\n</USER_QUERY>')).toBe(
'fix tests',
)
})
it('falls back to the trimmed input when nothing matches', () => {
expect(firstNonBlankLine(' single ')).toBe('single')
})
})
describe('truncate', () => {
it('returns input unchanged when within limit', () => {
expect(truncate('hello', 10)).toBe('hello')
})
it('appends an ellipsis when over limit', () => {
expect(truncate('hello world', 6)).toBe('hello…')
})
})
describe('formatLocalDate', () => {
const today = new Date('2026-04-30T12:00:00Z')
it('labels today and yesterday explicitly', () => {
expect(formatLocalDate(ROW_BAR_COUNT - 1, today)).toBe('today')
expect(formatLocalDate(ROW_BAR_COUNT - 2, today)).toBe('yesterday')
})
it('returns a "Mon D" format for older days', () => {
const label = formatLocalDate(0, today)
// "Apr 17" or "Apr 17," depending on locale; just assert it
// contains a month abbreviation and a day number.
expect(label).toMatch(/[A-Za-z]+ \d+/)
})
})

View File

@@ -0,0 +1,64 @@
/**
* Pure formatters consumed by row sub-components. Kept distinct from
* `agent-display.helpers.ts` (page-level helpers) so the row internals
* have an obvious single home.
*/
const TOKEN_THRESHOLDS: Array<[number, string]> = [
[1_000_000, 'M'],
[1_000, 'K'],
]
/** `1.2M`, `820K`, `8.4K`, `142`, `0`. */
export function formatTokens(n: number): string {
if (!Number.isFinite(n) || n <= 0) return '0'
for (const [threshold, suffix] of TOKEN_THRESHOLDS) {
if (n >= threshold) {
const value = n / threshold
const decimal = value < 10 ? value.toFixed(1) : value.toFixed(0)
return `${decimal}${suffix}`
}
}
return String(Math.round(n))
}
const USER_QUERY_OPEN = /^<USER_QUERY>$/i
const USER_QUERY_CLOSE = /^<\/USER_QUERY>$/i
/**
* First non-blank line, with the BrowserOS user-system-prompt
* `<USER_QUERY>` envelope tags stripped so previews don't show
* structural noise.
*/
export function firstNonBlankLine(text: string): string {
const lines = text.split('\n').map((line) => line.trim())
for (const line of lines) {
if (!line) continue
if (USER_QUERY_OPEN.test(line) || USER_QUERY_CLOSE.test(line)) continue
return line
}
return text.trim()
}
/** Truncate to `max` characters, appending an ellipsis when cut. */
export function truncate(text: string, max: number): string {
if (text.length <= max) return text
return `${text.slice(0, max - 1).trimEnd()}…`
}
const SPARKLINE_DAYS = 14
/**
* "today" / "yesterday" / "Apr 17" — given an index 0..13 from
* oldest → newest. `today` defaults to `new Date()` so callers don't
* have to thread a clock through.
*/
export function formatLocalDate(idx: number, today: Date = new Date()): string {
if (idx === SPARKLINE_DAYS - 1) return 'today'
if (idx === SPARKLINE_DAYS - 2) return 'yesterday'
const offset = SPARKLINE_DAYS - 1 - idx
const date = new Date(today)
date.setDate(date.getDate() - offset)
return date.toLocaleDateString(undefined, { month: 'short', day: 'numeric' })
}
export const ROW_BAR_COUNT = SPARKLINE_DAYS

View File

@@ -0,0 +1,51 @@
import type { HarnessAgentAdapter } from '../agent-harness-types'
import type { AgentListItem } from '../agents-page-types'
import type { AgentLiveness } from '../LivenessDot'
/**
* Window-bounded token usage. Server returns `null` when no session
* record exists yet for the agent.
*/
export interface AgentTokenUsage {
last7d: { input: number; output: number; requestCount: number }
cumulative: { input: number; output: number }
}
export interface AgentAdapterHealth {
healthy: boolean
reason?: string
}
/**
* Everything an `AgentRowCard` needs to render. Mirrors the shape
* `useHarnessAgents` exposes; the page assembles one entry per row in
* `AgentList` and passes it down. Sub-components only see slices of
* this object — no prop drilling beyond two levels.
*/
export interface AgentRowData {
agent: AgentListItem
adapter: HarnessAgentAdapter | 'unknown'
modelLabel: string | null
reasoningEffort: string | null
status: AgentLiveness
lastUsedAt: number | null
pinned: boolean
cwd: string | null
lastUserMessage: string | null
tokens: AgentTokenUsage | null
/** 14 entries, oldest → newest. Today is the last index. */
turnsByDay: number[]
/** Same length and ordering as `turnsByDay`. */
failedByDay: number[]
lastError: string | null
lastErrorAt: number | null
/** When non-null, an in-flight turn this row can be resumed from. */
activeTurnId: string | null
/** Adapter-level health, shared across rows for the same adapter. */
adapterHealth: AgentAdapterHealth | null
}
export interface AgentRowCallbacks {
onDelete: (agent: AgentListItem) => void
onPinToggle: (agent: AgentListItem, next: boolean) => void
}

View File

@@ -8,6 +8,7 @@ import {
type HarnessAdapterDescriptor,
type HarnessAgent,
type HarnessAgentHistoryPage,
type HarnessQueuedMessage,
mapHarnessAgentToEntry,
} from './agent-harness-types'
import type { OpenClawStatus } from './useOpenClaw'
@@ -135,6 +136,63 @@ export function useCreateHarnessAgent() {
})
}
/**
* Apply a partial update to a harness agent. Used by the pin-toggle
* star and (eventually) the inline rename UI. Optimistically writes
* the patch into the listing query cache so the row updates instantly,
* then rolls back if the server rejects the change.
*/
export function useUpdateHarnessAgent() {
const { baseUrl, isLoading: urlLoading } = useAgentServerUrl()
const queryClient = useQueryClient()
return useMutation({
mutationFn: async (input: {
agentId: string
patch: { name?: string; pinned?: boolean }
}) => {
if (!baseUrl || urlLoading) {
throw new Error('BrowserOS agent server URL is not ready')
}
const data = await agentsFetch<{ agent: HarnessAgent }>(
baseUrl,
`/${encodeURIComponent(input.agentId)}`,
{
method: 'PATCH',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(input.patch),
},
)
return data.agent
},
onMutate: async ({ agentId, patch }) => {
const queryKey = [AGENT_QUERY_KEYS.agents, baseUrl]
await queryClient.cancelQueries({ queryKey })
const previous = queryClient.getQueryData<HarnessAgentsResponse>(queryKey)
if (!previous) return { previous: undefined }
queryClient.setQueryData<HarnessAgentsResponse>(queryKey, {
...previous,
agents: previous.agents.map((agent) =>
agent.id === agentId ? { ...agent, ...patch } : agent,
),
})
return { previous }
},
onError: (_err, _vars, context) => {
if (!context?.previous) return
queryClient.setQueryData(
[AGENT_QUERY_KEYS.agents, baseUrl],
context.previous,
)
},
onSettled: async () => {
await queryClient.invalidateQueries({
queryKey: [AGENT_QUERY_KEYS.agents],
})
},
})
}
export function useDeleteHarnessAgent() {
const { baseUrl, isLoading: urlLoading } = useAgentServerUrl()
const queryClient = useQueryClient()
@@ -206,6 +264,8 @@ export interface HarnessActiveTurnInfo {
lastSeq: number
startedAt: number
endedAt?: number
/** User message that kicked off the turn; null when not captured. */
prompt: string | null
}
/**
@@ -260,3 +320,145 @@ export async function fetchHarnessAgentHistory(
`/${encodeURIComponent(agentId)}/sessions/main/history`,
)
}
export interface EnqueueMessageInput {
message: string
attachments?: ReadonlyArray<unknown>
}
export async function enqueueHarnessMessage(
agentId: string,
input: EnqueueMessageInput,
): Promise<HarnessQueuedMessage> {
const baseUrl = await getAgentServerUrl()
const response = await fetch(
`${baseUrl}/agents/${encodeURIComponent(agentId)}/queue`,
{
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
message: input.message,
...(input.attachments && input.attachments.length > 0
? { attachments: input.attachments }
: {}),
}),
},
)
if (!response.ok) {
let message = `Request failed with status ${response.status}`
try {
const body = (await response.json()) as { error?: string }
if (body.error) message = body.error
} catch {}
throw new Error(message)
}
const body = (await response.json()) as { queued: HarnessQueuedMessage }
return body.queued
}
export async function removeHarnessQueuedMessage(
agentId: string,
messageId: string,
): Promise<{ removed: boolean }> {
const baseUrl = await getAgentServerUrl()
const response = await fetch(
`${baseUrl}/agents/${encodeURIComponent(agentId)}/queue/${encodeURIComponent(
messageId,
)}`,
{ method: 'DELETE' },
)
if (!response.ok) return { removed: false }
return (await response.json()) as { removed: boolean }
}
/**
* Optimistic enqueue: writes the new queued message into the listing
* cache immediately so the queue panel reflects the change without
* waiting for the next poll. Rolls back if the server rejects.
*/
export function useEnqueueHarnessMessage() {
const { baseUrl } = useAgentServerUrl()
const queryClient = useQueryClient()
return useMutation({
mutationFn: async (input: { agentId: string } & EnqueueMessageInput) =>
enqueueHarnessMessage(input.agentId, input),
onMutate: async (input) => {
const queryKey = [AGENT_QUERY_KEYS.agents, baseUrl]
await queryClient.cancelQueries({ queryKey })
const previous = queryClient.getQueryData<HarnessAgentsResponse>(queryKey)
if (!previous) return { previous: undefined }
const optimistic: HarnessQueuedMessage = {
id: `optimistic-${Math.random().toString(36).slice(2, 10)}`,
createdAt: Date.now(),
message: input.message,
}
queryClient.setQueryData<HarnessAgentsResponse>(queryKey, {
...previous,
agents: previous.agents.map((agent) =>
agent.id === input.agentId
? { ...agent, queue: [...(agent.queue ?? []), optimistic] }
: agent,
),
})
return { previous }
},
onError: (_err, _vars, context) => {
if (!context?.previous) return
queryClient.setQueryData(
[AGENT_QUERY_KEYS.agents, baseUrl],
context.previous,
)
},
onSettled: async () => {
await queryClient.invalidateQueries({
queryKey: [AGENT_QUERY_KEYS.agents],
})
},
})
}
/**
* Optimistic queue removal mirror of `useEnqueueHarnessMessage`.
*/
export function useRemoveHarnessQueuedMessage() {
const { baseUrl } = useAgentServerUrl()
const queryClient = useQueryClient()
return useMutation({
mutationFn: async (input: { agentId: string; messageId: string }) =>
removeHarnessQueuedMessage(input.agentId, input.messageId),
onMutate: async (input) => {
const queryKey = [AGENT_QUERY_KEYS.agents, baseUrl]
await queryClient.cancelQueries({ queryKey })
const previous = queryClient.getQueryData<HarnessAgentsResponse>(queryKey)
if (!previous) return { previous: undefined }
queryClient.setQueryData<HarnessAgentsResponse>(queryKey, {
...previous,
agents: previous.agents.map((agent) =>
agent.id === input.agentId
? {
...agent,
queue: (agent.queue ?? []).filter(
(entry) => entry.id !== input.messageId,
),
}
: agent,
),
})
return { previous }
},
onError: (_err, _vars, context) => {
if (!context?.previous) return
queryClient.setQueryData(
[AGENT_QUERY_KEYS.agents, baseUrl],
context.previous,
)
},
onSettled: async () => {
await queryClient.invalidateQueries({
queryKey: [AGENT_QUERY_KEYS.agents],
})
},
})
}
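
All three mutations above share one cache discipline: cancel in-flight queries, snapshot the listing, patch it in place, and restore the snapshot in `onError`. The list transform at the center of that is pure — a sketch with the `Agent` shape trimmed to the fields the patch touches (`patchAgent` is an illustrative name, not an export of the hooks file):

```typescript
interface Agent {
  id: string
  name?: string
  pinned?: boolean
}

// Pure version of the setQueryData updater used in onMutate: patch one
// agent by id, return unmatched entries by reference so React's
// reconciliation sees them as unchanged.
function patchAgent(
  agents: Agent[],
  agentId: string,
  patch: Partial<Pick<Agent, 'name' | 'pinned'>>,
): Agent[] {
  return agents.map((agent) =>
    agent.id === agentId ? { ...agent, ...patch } : agent,
  )
}
```

Because the transform never mutates its input, the `onError` rollback is just writing the pre-mutation snapshot back into the cache.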

View File

@@ -59,15 +59,3 @@ export interface AgentConversation {
createdAt: number
updatedAt: number
}
export interface AgentCardData {
agentId: string
name: string
model?: string
status: 'idle' | 'working' | 'error'
lastMessage?: string
lastMessageTimestamp?: number
activitySummary?: string
currentTool?: string
costUsd?: number
}

View File

@@ -9,6 +9,7 @@
"build": "bun run codegen && wxt build",
"build:dev": "bun --env-file=.env.development wxt build --mode development",
"zip": "wxt zip",
"test": "bun run ../../scripts/run-bun-test.ts ./apps/agent",
"compile": "bun --env-file=.env.development wxt prepare && tsgo --noEmit",
"lint": "bunx biome check",
"typecheck": "bun --env-file=.env.development wxt prepare && tsgo --noEmit",

View File

@@ -0,0 +1,51 @@
# Copy to .env.development for local eval runs.
# Provider keys used by existing config files.
OPENROUTER_API_KEY=
FIREWORKS_API_KEY=
ANTHROPIC_API_KEY=
OPENAI_API_KEY=
GOOGLE_GENERATIVE_AI_API_KEY=
# Claude Agent SDK token used by performance_grader.
CLAUDE_CODE_OAUTH_TOKEN=
# Suite-mode model selection.
EVAL_VARIANT=local
EVAL_AGENT_PROVIDER=openai-compatible
EVAL_AGENT_MODEL=
EVAL_AGENT_API_KEY=
EVAL_AGENT_BASE_URL=
EVAL_AGENT_SUPPORTS_IMAGES=true
# Optional suite-mode executor override for orchestrator suites.
EVAL_EXECUTOR_MODEL=
EVAL_EXECUTOR_API_KEY=
EVAL_EXECUTOR_BASE_URL=
# Clado visual action executor.
CLADO_ACTION_MODEL=
CLADO_ACTION_API_KEY=
CLADO_ACTION_BASE_URL=
# Backward-compatible alias used by older local scripts.
CLADO_ACTION_URL=
# BrowserOS runner.
BROWSEROS_BINARY=/Applications/BrowserOS.app/Contents/MacOS/BrowserOS
BROWSEROS_SERVER_URL=http://127.0.0.1:9110
BROWSEROS_SERVER_LOG_DIR=/tmp/browseros-server-logs
BROWSEROS_CONFIG_URL=
# Captcha solver extension.
NOPECHA_API_KEY=
# WebArena-Infinity.
WEBARENA_INFINITY_DIR=
INFINITY_APP_URL=
# R2 publishing and weekly report.
EVAL_R2_ACCOUNT_ID=
EVAL_R2_ACCESS_KEY_ID=
EVAL_R2_SECRET_ACCESS_KEY=
EVAL_R2_BUCKET=browseros-eval
EVAL_R2_CDN_BASE_URL=https://eval.browseros.com


@@ -14,6 +14,7 @@ Evaluation framework for BrowserOS browser automation agents. Runs tasks from st
```bash
cd apps/eval
cp .env.example .env.development
# Edit .env.development with your keys, then:
bun run eval
```
@@ -23,11 +24,55 @@ Opens the eval dashboard at `http://localhost:9900` in config mode. From there:
### CLI mode
```bash
bun run eval -c configs/browseros-agent-weekly.json
bun run eval -c configs/legacy/browseros-agent-weekly.json
bun run eval suite --config configs/legacy/browseros-agent-weekly.json --publish r2
```
Runs immediately. Dashboard still available at `http://localhost:9900` for live progress.
The `suite` command is the workflow-compatible full loop: execute tasks, run graders, write artifacts, and optionally publish to R2. The old `-c` form remains supported during migration.
```bash
bun run eval run --config configs/legacy/browseros-agent-weekly.json
bun run eval suite --suite configs/suites/agisdk-daily-10.json --variant kimi-fireworks --publish r2
bun run eval grade --run results/browseros-agent-weekly/2026-04-29-1430
bun run eval publish --run results/browseros-agent-weekly/2026-04-29-1430 --target r2
```
Config files live in two groups:
```txt
configs/legacy/ # Complete EvalConfig files used by older workflows and the dashboard
configs/suites/ # Suite definitions; model/provider comes from CLI flags or env
```
Suite mode takes model settings from CLI flags first, then env:
```bash
EVAL_VARIANT=kimi-fireworks \
EVAL_AGENT_PROVIDER=openai-compatible \
EVAL_AGENT_MODEL=accounts/fireworks/models/kimi-k2p5 \
EVAL_AGENT_API_KEY=$FIREWORKS_API_KEY \
EVAL_AGENT_BASE_URL=https://api.fireworks.ai/inference/v1 \
bun run eval suite --suite configs/suites/agisdk-daily-10.json --publish r2
```
### Suites and variants
A **suite** is what we run: the task dataset, graders, worker count, timeout, and browser settings. For example, `agisdk-daily-10` means "run these 10 AGI SDK tasks and grade them with `agisdk_state_diff`."
A **variant** is the model setup we are testing on that suite. `EVAL_VARIANT` is just the human-readable name for that setup. The actual model connection still comes from `EVAL_AGENT_PROVIDER`, `EVAL_AGENT_MODEL`, `EVAL_AGENT_API_KEY`, and `EVAL_AGENT_BASE_URL`.
This lets us run the same suite against multiple model setups without copying the benchmark config:
```txt
agisdk-daily-10 + kimi-fireworks
agisdk-daily-10 + claude-sonnet
agisdk-daily-10 + clado-action-000159
```
For `orchestrator-executor` suites, there can also be an executor model/backend. The `EVAL_AGENT_*` vars describe the main agent or orchestrator. The optional `EVAL_EXECUTOR_*` or `CLADO_ACTION_*` vars describe the delegated executor.
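The CLI-flags-first, then-env precedence described above can be sketched as follows. This is a minimal illustration, not the framework's actual code; the function name, option shape, and the `openai-compatible` default are assumptions.

```typescript
// Hypothetical sketch of suite-mode variant resolution: explicit CLI
// flags win, then the EVAL_AGENT_* env vars fill in the gaps.
interface AgentModelConfig {
  provider: string
  model: string
  apiKey?: string
  baseUrl?: string
}

function resolveVariantModel(
  cliFlags: Partial<AgentModelConfig>,
  env: Record<string, string | undefined>,
): AgentModelConfig {
  return {
    // Default provider is an assumption mirroring the .env example.
    provider: cliFlags.provider ?? env.EVAL_AGENT_PROVIDER ?? 'openai-compatible',
    model: cliFlags.model ?? env.EVAL_AGENT_MODEL ?? '',
    apiKey: cliFlags.apiKey ?? env.EVAL_AGENT_API_KEY,
    baseUrl: cliFlags.baseUrl ?? env.EVAL_AGENT_BASE_URL,
  }
}
```

The key point is that the suite file never names a model; the same suite id can be paired with any variant purely through flags or env.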
## Agent types
| Type | Description |
@@ -66,9 +111,9 @@ The orchestrator works with any LLM provider. The executor can be another LLM, o
    },
    "executor": {
      "provider": "clado-action",
      "model": "qwen3-vl-30b-a3b-instruct",
      "model": "Qwen3.5-35B-A3B-action-000159-merged",
      "apiKey": "",
      "baseUrl": "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run"
      "baseUrl": "https://clado-ai--clado-browseros-action-000159-merged-actionmod-f4a6ef.modal.run"
    }
  }
}
@@ -96,6 +141,20 @@ The `apiKey` field supports two formats:
- **Env var name**: `"OPENAI_API_KEY"` — resolved from `.env.development` at runtime
- **Direct value**: `"sk-xxxxx"` — used as-is (not recommended)
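The two `apiKey` formats could be resolved with logic along these lines. This is a sketch, not the framework's actual resolver; in particular, the all-caps-name heuristic for deciding "this is an env var name" is an assumption.

```typescript
// Sketch: resolve a config "apiKey" field that is either a direct key
// or the NAME of an env var (e.g. "OPENAI_API_KEY") loaded from
// .env.development. The all-caps-name check is an assumed heuristic.
function resolveApiKey(
  value: string,
  env: Record<string, string | undefined> = process.env,
): string {
  // Looks like an env var name: uppercase letters, digits, underscores.
  if (/^[A-Z][A-Z0-9_]*$/.test(value)) {
    const fromEnv = env[value]
    if (fromEnv) return fromEnv
  }
  // Otherwise (or if the env var is unset) use the value as-is.
  return value
}
```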
### Environment variables
| Variable | Used for |
|----------|----------|
| `EVAL_AGENT_PROVIDER`, `EVAL_AGENT_MODEL`, `EVAL_AGENT_API_KEY`, `EVAL_AGENT_BASE_URL`, `EVAL_AGENT_SUPPORTS_IMAGES` | Suite variant model selection |
| `FIREWORKS_API_KEY`, `OPENROUTER_API_KEY`, `ANTHROPIC_API_KEY`, provider-specific keys | Config-file or provider-backed model calls |
| `EVAL_EXECUTOR_MODEL`, `EVAL_EXECUTOR_API_KEY`, `EVAL_EXECUTOR_BASE_URL` | Suite-mode orchestrator executor override |
| `CLADO_ACTION_MODEL`, `CLADO_ACTION_API_KEY`, `CLADO_ACTION_BASE_URL` | Clado executor defaults |
| `BROWSEROS_BINARY` | BrowserOS binary path in CI/local smoke runs |
| `BROWSEROS_SERVER_URL` | Optional grader MCP URL override |
| `WEBARENA_INFINITY_DIR` | Local WebArena-Infinity checkout for Infinity tasks |
| `NOPECHA_API_KEY` | CAPTCHA solver extension |
| `EVAL_R2_ACCOUNT_ID`, `EVAL_R2_ACCESS_KEY_ID`, `EVAL_R2_SECRET_ACCESS_KEY`, `EVAL_R2_BUCKET`, `EVAL_R2_CDN_BASE_URL` | R2 upload and viewer URL |
### Supported providers
| Provider | `provider` value | Requires `baseUrl` |
@@ -110,6 +169,22 @@ The `apiKey` field supports two formats:
| Ollama | `ollama` | No |
| Clado Action (executor only) | `clado-action` | Yes |
### R2 publishing
`suite --config ... --publish r2` and `publish --target r2` upload the run artifacts plus `viewer.html` to the viewer-compatible R2 layout:
```bash
export EVAL_R2_ACCOUNT_ID=...
export EVAL_R2_ACCESS_KEY_ID=...
export EVAL_R2_SECRET_ACCESS_KEY=...
export EVAL_R2_BUCKET=browseros-eval
export EVAL_R2_CDN_BASE_URL=https://eval.browseros.com
```
`EVAL_R2_CDN_BASE_URL` must be a public R2 custom domain, `r2.dev` URL, or Worker URL. Do not set it to the private `*.r2.cloudflarestorage.com` S3 API endpoint.
Published runs are available at `EVAL_R2_CDN_BASE_URL/viewer.html?run=<run-id>`.
### BrowserOS infrastructure
```json
@@ -137,10 +212,12 @@ Each worker gets its own Chrome instance. Worker N uses `base_port + N` for CDP
| File | Tasks | Description |
|------|-------|-------------|
| `agisdk-daily-10.jsonl` | 10 | Daily AGI SDK / REAL Bench subset |
| `webvoyager.jsonl` | 643 | Full WebVoyager benchmark |
| `mind2web.jsonl` | 300 | Online-Mind2Web |
| `webbench-{0,1,2}of4-50.jsonl` | 50 each | WebBench shards (50-task subsets) |
| `agisdk-real.jsonl` | 40 | AGI SDK / REAL Bench (action-only tasks) |
| `agisdk-real-smoke.jsonl` | 1 | AGI SDK / REAL Bench smoke task |
| `agisdk-real.jsonl` | 36 | AGI SDK / REAL Bench (action-only tasks) |
| `webarena-infinity-hard-50.jsonl` | 50 | WebArena-Infinity hard set |
| `browsecomp-medium-hard-50.jsonl` | 50 | BrowseComp medium-hard |
| `browsecomp-very-hard-50.jsonl` | 50 | BrowseComp very-hard |
@@ -167,14 +244,47 @@ results/
  browseros-agent-weekly/
    2026-04-29-1430/
      Amazon--0/
        attempt.json         # Stable attempt summary for viewer/reporting
        metadata.json        # Task result, timing, grader scores
        grades.json          # Compact grader results
        messages.jsonl       # Full message log
        grader-artifacts/    # Grader-specific inputs/outputs/stderr
        screenshots/
          001.png            # Step-by-step screenshots
          002.png
      summary.json           # Aggregate pass rates
```
R2 publishing preserves the task files under `runs/<run-id>/...`, writes `runs/<run-id>/manifest.json`, and uploads `viewer.html` at the bucket root. The viewer URL is `EVAL_R2_CDN_BASE_URL/viewer.html?run=<run-id>`.
### R2 viewer manifest
`runs/<run-id>/manifest.json` is the source of truth for the public viewer. New manifests include `schemaVersion: 2` and each task includes explicit artifact paths:
```json
{
  "schemaVersion": 2,
  "runId": "agisdk-real-smoke-2026-04-30-0000",
  "tasks": [
    {
      "queryId": "agisdk-dashdish-10",
      "paths": {
        "metadata": "tasks/agisdk-dashdish-10/metadata.json",
        "messages": "tasks/agisdk-dashdish-10/messages.jsonl",
        "grades": "tasks/agisdk-dashdish-10/grades.json",
        "trace": "tasks/agisdk-dashdish-10/trace.jsonl",
        "screenshots": "tasks/agisdk-dashdish-10/screenshots",
        "graderArtifacts": "tasks/agisdk-dashdish-10/grader-artifacts"
      }
    }
  ]
}
```
The static viewer uses `task.paths` when present. Older uploaded runs without `schemaVersion` or `task.paths` still work through the legacy inferred layout: `runs/<run-id>/<task-id>/metadata.json`, `messages.jsonl`, and `screenshots/<n>.png`.
Manifest paths are stable artifact locations, not a guarantee that every optional artifact exists for every task. For example, `attempt.json`, `trace.jsonl`, or grader artifact directories may be absent when that artifact was not produced by the run.
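A viewer could resolve a task's artifact URL with the fallback behavior described above. This is a sketch with hypothetical names, not the actual viewer code.

```typescript
// Sketch: resolve a task's metadata URL under runs/<run-id>/, preferring
// explicit schemaVersion-2 "paths" entries and falling back to the legacy
// inferred layout (<task-id>/metadata.json) when they are absent.
interface ManifestTask {
  queryId: string
  paths?: { metadata?: string; messages?: string; screenshots?: string }
}

function metadataUrl(cdnBase: string, runId: string, task: ManifestTask): string {
  const rel = task.paths?.metadata ?? `${task.queryId}/metadata.json`
  return `${cdnBase}/runs/${runId}/${rel}`
}
```

A missing `paths` block is the only signal the viewer needs; no separate `schemaVersion` branch is required for this lookup.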
## Troubleshooting
**BrowserOS not found**: Expects `/Applications/BrowserOS.app/Contents/MacOS/BrowserOS`. Set `BROWSEROS_BINARY` to override.


@@ -7,8 +7,8 @@
    "baseUrl": "https://openrouter.ai/api/v1",
    "supportsImages": true
  },
  "dataset": "../data/agisdk-real.jsonl",
  "num_workers": 10,
  "dataset": "../../data/agisdk-real-smoke.jsonl",
  "num_workers": 1,
  "restart_server_per_task": true,
  "browseros": {
    "server_url": "http://127.0.0.1:9110",


@@ -0,0 +1,26 @@
{
  "agent": {
    "type": "single",
    "provider": "openai-compatible",
    "model": "accounts/fireworks/models/kimi-k2p5",
    "apiKey": "FIREWORKS_API_KEY",
    "baseUrl": "https://api.fireworks.ai/inference/v1",
    "supportsImages": true
  },
  "dataset": "../../data/agisdk-real.jsonl",
  "num_workers": 4,
  "restart_server_per_task": true,
  "browseros": {
    "server_url": "http://127.0.0.1:9110",
    "base_cdp_port": 9010,
    "base_server_port": 9110,
    "base_extension_port": 9310,
    "load_extensions": false,
    "headless": false
  },
  "captcha": {
    "api_key_env": "NOPECHA_API_KEY"
  },
  "graders": ["agisdk_state_diff"],
  "timeout_ms": 1800000
}


@@ -7,7 +7,7 @@
    "baseUrl": "https://openrouter.ai/api/v1",
    "supportsImages": true
  },
  "dataset": "../data/webbench-2of4-50.jsonl",
  "dataset": "../../data/webbench-2of4-50.jsonl",
  "num_workers": 10,
  "restart_server_per_task": true,
  "browseros": {


@@ -14,7 +14,7 @@
      "baseUrl": "https://api.fireworks.ai/inference/v1"
    }
  },
  "dataset": "../data/webbench-2of4-50.jsonl",
  "dataset": "../../data/webbench-2of4-50.jsonl",
  "num_workers": 10,
  "restart_server_per_task": true,
  "browseros": {


@@ -9,12 +9,12 @@
    },
    "executor": {
      "provider": "clado-action",
      "model": "qwen3-vl-30b-a3b-instruct",
      "model": "Qwen3.5-35B-A3B-action-000159-merged",
      "apiKey": "",
      "baseUrl": "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run"
      "baseUrl": "https://clado-ai--clado-browseros-action-000159-merged-actionmod-f4a6ef.modal.run"
    }
  },
  "dataset": "../data/webbench-2of4-50.jsonl",
  "dataset": "../../data/agisdk-real.jsonl",
  "num_workers": 10,
  "restart_server_per_task": true,
  "browseros": {
@@ -23,11 +23,11 @@
    "base_server_port": 9110,
    "base_extension_port": 9310,
    "load_extensions": false,
    "headless": false
    "headless": true
  },
  "captcha": {
    "api_key_env": "NOPECHA_API_KEY"
  },
  "graders": ["performance_grader"],
  "graders": ["agisdk_state_diff"],
  "timeout_ms": 1800000
}


@@ -7,7 +7,7 @@
    "baseUrl": "https://openrouter.ai/api/v1",
    "supportsImages": true
  },
  "dataset": "../data/webarena-infinity-hard-50.jsonl",
  "dataset": "../../data/webarena-infinity-hard-50.jsonl",
  "num_workers": 10,
  "restart_server_per_task": true,
  "browseros": {


@@ -5,7 +5,7 @@
    "model": "openai/gpt-4.1",
    "apiKey": "OPENROUTER_API_KEY"
  },
  "dataset": "../data/mind2web.jsonl",
  "dataset": "../../data/mind2web.jsonl",
  "num_workers": 5,
  "restart_server_per_task": true,
  "browseros": {


@@ -7,7 +7,7 @@
    "baseUrl": "https://api.fireworks.ai/inference/v1",
    "supportsImages": true
  },
  "dataset": "../data/webvoyager.jsonl",
  "dataset": "../../data/webvoyager.jsonl",
  "num_workers": 3,
  "restart_server_per_task": true,
  "browseros": {


@@ -0,0 +1,22 @@
{
  "id": "agisdk-daily-10",
  "dataset": "../../data/agisdk-daily-10.jsonl",
  "agent": {
    "type": "single"
  },
  "graders": ["agisdk_state_diff"],
  "workers": 1,
  "restartBrowserPerTask": true,
  "timeoutMs": 1800000,
  "browseros": {
    "server_url": "http://127.0.0.1:9110",
    "base_cdp_port": 9010,
    "base_server_port": 9110,
    "base_extension_port": 9310,
    "load_extensions": false,
    "headless": true
  },
  "captcha": {
    "api_key_env": "NOPECHA_API_KEY"
  }
}


@@ -0,0 +1,22 @@
{
  "id": "agisdk-real-smoke",
  "dataset": "../../data/agisdk-real-smoke.jsonl",
  "agent": {
    "type": "single"
  },
  "graders": ["agisdk_state_diff"],
  "workers": 1,
  "restartBrowserPerTask": true,
  "timeoutMs": 1800000,
  "browseros": {
    "server_url": "http://127.0.0.1:9110",
    "base_cdp_port": 9010,
    "base_server_port": 9110,
    "base_extension_port": 9310,
    "load_extensions": false,
    "headless": false
  },
  "captcha": {
    "api_key_env": "NOPECHA_API_KEY"
  }
}


@@ -0,0 +1,22 @@
{
  "id": "agisdk-real",
  "dataset": "../../data/agisdk-real.jsonl",
  "agent": {
    "type": "single"
  },
  "graders": ["agisdk_state_diff"],
  "workers": 1,
  "restartBrowserPerTask": true,
  "timeoutMs": 1800000,
  "browseros": {
    "server_url": "http://127.0.0.1:9110",
    "base_cdp_port": 9010,
    "base_server_port": 9110,
    "base_extension_port": 9310,
    "load_extensions": false,
    "headless": false
  },
  "captcha": {
    "api_key_env": "NOPECHA_API_KEY"
  }
}


@@ -0,0 +1,10 @@
{"query_id": "agisdk-dashdish-10", "dataset": "agisdk-real", "query": "Place an order from \"Souvla\" for a \"Medium Classic Cheeseburger\" and a \"Small Bacon Double Cheeseburger\" with \"Standard Delivery\" as the method with the default charged options.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-dashdish.vercel.app", "metadata": {"original_task_id": "dashdish-10", "website": "DashDish", "category": "agisdk-real", "additional": {"agisdk_task_id": "dashdish-10", "challenge_type": "action", "difficulty": "hard", "similar_to": "Doordash"}}}
{"query_id": "agisdk-fly-unified-5", "dataset": "agisdk-real", "query": "Find me the cheapest fare for a flight from Orlando to Milwaukee on December 5th, 2024 and book it.\nPassenger: John Doe\nDate of Birth: 01/01/1990\nSex: Male\nSeat Selection: No\nPayment: Credit Card (378342143523967), Exp: 12/30, Security Code: 420 Address: 123 Main St, San Francisco, CA, 94105, USA, Phone: 555-123-4567, Email: johndoe@example.com.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-fly-unified.vercel.app", "metadata": {"original_task_id": "fly-unified-5", "website": "Fly Unified", "category": "agisdk-real", "additional": {"agisdk_task_id": "fly-unified-5", "challenge_type": "retrieval-action", "difficulty": "medium", "similar_to": "United Airlines"}}}
{"query_id": "agisdk-udriver-10", "dataset": "agisdk-real", "query": "Order me a ride for 4pm, I'll be at the de Young muesum headed to the Waterbar, fanciest option possible please.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-udriver.vercel.app", "metadata": {"original_task_id": "udriver-10", "website": "UDriver", "category": "agisdk-real", "additional": {"agisdk_task_id": "udriver-10", "challenge_type": "action", "difficulty": "hard", "similar_to": "Uber"}}}
{"query_id": "agisdk-udriver-9", "dataset": "agisdk-real", "query": "Book me a ride from the thai restaurant I last took a ride to for later today at 2pm, I'll be at 333 Apartments on Fremont", "graders": ["agisdk_state_diff"], "start_url": "https://evals-udriver.vercel.app", "metadata": {"original_task_id": "udriver-9", "website": "UDriver", "category": "agisdk-real", "additional": {"agisdk_task_id": "udriver-9", "challenge_type": "retrieval-action", "difficulty": "hard", "similar_to": "Uber"}}}
{"query_id": "agisdk-topwork-4", "dataset": "agisdk-real", "query": "Create a job post for a UI/UX Designer with expertise in Figma, Sketch, and Adobe Creative Suite, including project details, timeline, and required skills (Wireframing, Prototyping, Responsive Design).", "graders": ["agisdk_state_diff"], "start_url": "https://evals-topwork.vercel.app", "metadata": {"original_task_id": "topwork-4", "website": "TopWork", "category": "agisdk-real", "additional": {"agisdk_task_id": "topwork-4", "challenge_type": "action", "difficulty": "medium", "similar_to": "Upwork"}}}
{"query_id": "agisdk-gocalendar-4", "dataset": "agisdk-real", "query": "Change the \"Team Check-In\" event on July 18, 2024, name to \"Project Kickoff\" and update the location to \"Zoom\"", "graders": ["agisdk_state_diff"], "start_url": "https://evals-gocalendar.vercel.app", "metadata": {"original_task_id": "gocalendar-4", "website": "GoCalendar", "category": "agisdk-real", "additional": {"agisdk_task_id": "gocalendar-4", "challenge_type": "action", "difficulty": "medium", "similar_to": "Google Calendar"}}}
{"query_id": "agisdk-staynb-6", "dataset": "agisdk-real", "query": "Find and book the stay with the best value for money (cheapest stay with the best reviews) for 1 day. For fields you don't know the answer for, just fill them in with anything of your choice.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-staynb.vercel.app", "metadata": {"original_task_id": "staynb-6", "website": "StayNB", "category": "agisdk-real", "additional": {"agisdk_task_id": "staynb-6", "challenge_type": "retrieval-action", "difficulty": "medium", "similar_to": "Airbnb"}}}
{"query_id": "agisdk-udriver-11", "dataset": "agisdk-real", "query": "I need to go from Pacific Catch on Chestnut back home to 333 Fremont now. If the fancy version is within ten dollars of the regular one, book that.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-udriver.vercel.app", "metadata": {"original_task_id": "udriver-11", "website": "UDriver", "category": "agisdk-real", "additional": {"agisdk_task_id": "udriver-11", "challenge_type": "action", "difficulty": "hard", "similar_to": "Uber"}}}
{"query_id": "agisdk-networkin-5", "dataset": "agisdk-real", "query": "Send a connection request to John Smith.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-networkin.vercel.app", "metadata": {"original_task_id": "networkin-5", "website": "Networkin", "category": "agisdk-real", "additional": {"agisdk_task_id": "networkin-5", "challenge_type": "action", "difficulty": "easy", "similar_to": "LinkedIn"}}}
{"query_id": "agisdk-zilloft-6", "dataset": "agisdk-real", "query": "Select a property listed in San Francisco as \"Condos\" within a price range under $300,000 and request a tour for tomorrow at 4:00 PM. Use these contact details: Name: Sarah Brown, Email: sarahbrown@example.com, Phone: 555-987-6543.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-zilloft.vercel.app", "metadata": {"original_task_id": "zilloft-6", "website": "Zilloft", "category": "agisdk-real", "additional": {"agisdk_task_id": "zilloft-6", "challenge_type": "action", "difficulty": "medium", "similar_to": "Zillow"}}}


@@ -0,0 +1 @@
{"query_id": "agisdk-dashdish-10", "dataset": "agisdk-real", "query": "Place an order from \"Souvla\" for a \"Medium Classic Cheeseburger\" and a \"Small Bacon Double Cheeseburger\" with \"Standard Delivery\" as the method with the default charged options.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-dashdish.vercel.app", "metadata": {"original_task_id": "dashdish-10", "website": "DashDish", "category": "agisdk-real", "additional": {"agisdk_task_id": "dashdish-10", "challenge_type": "action", "difficulty": "hard", "similar_to": "Doordash"}}}


@@ -5,6 +5,7 @@
  "type": "module",
  "scripts": {
    "eval": "bun --env-file=.env.development run src/index.ts",
    "test": "bun run ../../scripts/run-bun-test.ts ./apps/eval/tests",
    "typecheck": "tsc --noEmit"
  },
  "dependencies": {


@@ -1,34 +1,73 @@
/**
* Test script for Clado API endpoints (grounding + action models)
* Smoke-test for the Clado BrowserOS Action endpoint.
*
* Health-checks the model, then runs a generate call and prints every
* field the new contract documents (action, coordinates, text, key,
* direction, scroll/drag fields, wait, end+final_answer, thinking,
* parse_error, raw_response).
*
* Usage:
* bun apps/eval/scripts/test-clado-api.ts [screenshot-path]
*
* If no screenshot provided, captures one from a running BrowserOS server.
* If no screenshot path is given, captures one over MCP from a
* running BrowserOS server (default http://127.0.0.1:9110, override
* with BROWSEROS_URL).
*
* Cold start can take ~5 minutes; the script waits up to 6.
*/
import { readFile } from 'node:fs/promises'
import { resolve } from 'node:path'
const ACTION_URL =
'https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run'
'https://clado-ai--clado-browseros-action-000159-merged-actionmod-f4a6ef.modal.run'
const ACTION_HEALTH_URL =
'https://clado-ai--clado-browseros-action-actionmodel-health.modal.run'
const GROUNDING_URL =
'https://clado-ai--clado-browseros-grounding-groundingmodel-generate.modal.run'
const GROUNDING_HEALTH_URL =
'https://clado-ai--clado-browseros-grounding-groundingmodel-health.modal.run'
'https://clado-ai--clado-browseros-action-000159-merged-actionmod-5e5033.modal.run'
async function checkHealth(name: string, url: string): Promise<boolean> {
console.log(`\n--- ${name} health check ---`)
console.log(` URL: ${url}`)
const COLD_START_BUDGET_MS = 360_000 // 6 min — Clado cold start is ~5 min
const COLD_START_WARN_MS = 30_000
interface CladoResponse {
action?: string | null
thinking?: string | null
raw_response?: string
parse_error?: string | null
inference_time_seconds?: number
x?: number
y?: number
text?: string
key?: string
direction?: string
amount?: number
startX?: number
startY?: number
endX?: number
endY?: number
time?: number
final_answer?: string | null
}
async function checkHealth(): Promise<boolean> {
console.log(`\n--- Action model health ---`)
console.log(` URL: ${ACTION_HEALTH_URL}`)
console.log(
` Note: cold start can take ~5 min; waiting up to ${COLD_START_BUDGET_MS / 1000}s.`,
)
const start = performance.now()
const warn = setTimeout(() => {
console.log(
` ...still waiting (${COLD_START_WARN_MS / 1000}s in) — model is likely cold-starting on Modal.`,
)
}, COLD_START_WARN_MS)
try {
const resp = await fetch(url, { signal: AbortSignal.timeout(30_000) })
const resp = await fetch(ACTION_HEALTH_URL, {
signal: AbortSignal.timeout(COLD_START_BUDGET_MS),
})
const elapsed = ((performance.now() - start) / 1000).toFixed(2)
const body = await resp.text()
console.log(` Status: ${resp.status} (${elapsed}s)`)
console.log(` Body: ${body.slice(0, 200)}`)
console.log(` Body: ${body.slice(0, 400)}`)
return resp.ok
} catch (err) {
const elapsed = ((performance.now() - start) / 1000).toFixed(2)
@@ -36,63 +75,34 @@ async function checkHealth(name: string, url: string): Promise<boolean> {
` FAILED (${elapsed}s): ${err instanceof Error ? err.message : err}`,
)
return false
} finally {
clearTimeout(warn)
}
}
async function testGenerate(
name: string,
url: string,
async function generate(
label: string,
payload: Record<string, unknown>,
): Promise<Record<string, unknown> | null> {
console.log(`\n--- ${name} generate ---`)
console.log(` URL: ${url}`)
): Promise<CladoResponse | null> {
console.log(`\n--- ${label} ---`)
console.log(` URL: ${ACTION_URL}`)
console.log(` Instruction: ${payload.instruction}`)
console.log(
` Image size: ${((payload.image_base64 as string).length / 1024).toFixed(0)} KB (base64)`,
` Image size: ${((payload.image_base64 as string).length / 1024).toFixed(0)} KB (base64)`,
)
if (payload.history) console.log(` History: ${payload.history}`)
if (payload.history && payload.history !== 'None') {
console.log(` History: ${payload.history}`)
}
const start = performance.now()
let resp: Response
try {
const resp = await fetch(url, {
resp = await fetch(ACTION_URL, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(payload),
signal: AbortSignal.timeout(120_000),
signal: AbortSignal.timeout(COLD_START_BUDGET_MS),
})
const elapsed = ((performance.now() - start) / 1000).toFixed(2)
if (!resp.ok) {
const body = await resp.text()
console.log(` FAILED: HTTP ${resp.status} (${elapsed}s)`)
console.log(` Body: ${body.slice(0, 400)}`)
return null
}
const result = (await resp.json()) as Record<string, unknown>
console.log(` Status: ${resp.status} (${elapsed}s)`)
console.log(` Action: ${result.action}`)
if (result.x !== null && result.x !== undefined)
console.log(` Coordinates: (${result.x}, ${result.y})`)
if (result.text)
console.log(` Text: ${(result.text as string).slice(0, 100)}`)
if (result.key) console.log(` Key: ${result.key}`)
if (result.inference_time_seconds)
console.log(` Inference: ${result.inference_time_seconds}s`)
// Show thinking if present
const raw = result.raw_response as string | undefined
if (raw) {
const thinkMatch = raw.match(/<thinking>([\s\S]*?)<\/thinking>/)
if (thinkMatch) {
const thinking = thinkMatch[1].trim()
console.log(
` Thinking: ${thinking.slice(0, 200)}${thinking.length > 200 ? '...' : ''}`,
)
}
}
return result
} catch (err) {
const elapsed = ((performance.now() - start) / 1000).toFixed(2)
console.log(
@@ -100,6 +110,50 @@ async function testGenerate(
)
return null
}
const elapsed = ((performance.now() - start) / 1000).toFixed(2)
if (!resp.ok) {
const body = await resp.text()
console.log(` HTTP ${resp.status} ${resp.statusText} (${elapsed}s)`)
console.log(` Body: ${body.slice(0, 400)}`)
return null
}
const result = (await resp.json()) as CladoResponse
console.log(` HTTP ${resp.status} (${elapsed}s)`)
console.log(` action: ${result.action ?? 'null'}`)
if (result.parse_error) {
console.log(` parse_error: ${result.parse_error}`)
}
if (result.thinking) {
const trimmed = result.thinking.replace(/\s+/g, ' ').trim()
console.log(
` thinking: ${trimmed.slice(0, 240)}${trimmed.length > 240 ? '…' : ''}`,
)
}
if (typeof result.x === 'number' || typeof result.y === 'number') {
console.log(` x, y: ${result.x}, ${result.y}`)
}
if (typeof result.text === 'string')
console.log(` text: ${result.text.slice(0, 120)}`)
if (typeof result.key === 'string')
console.log(` key: ${result.key}`)
if (typeof result.direction === 'string')
console.log(` direction: ${result.direction}`)
if (typeof result.amount === 'number')
console.log(` amount: ${result.amount}`)
if (typeof result.startX === 'number' || typeof result.endX === 'number') {
console.log(
` drag: (${result.startX}, ${result.startY}) → (${result.endX}, ${result.endY})`,
)
}
if (typeof result.time === 'number')
console.log(` time: ${result.time}s`)
if (result.final_answer)
console.log(` final_answer: ${result.final_answer.slice(0, 240)}`)
if (typeof result.inference_time_seconds === 'number')
console.log(` inference_time_seconds: ${result.inference_time_seconds}`)
return result
}
async function loadScreenshot(path?: string): Promise<string> {
@@ -110,10 +164,9 @@ async function loadScreenshot(path?: string): Promise<string> {
return data.toString('base64')
}
// Try to capture from a running BrowserOS server
const serverUrl = process.env.BROWSEROS_URL || 'http://127.0.0.1:9110'
console.log(
`No screenshot path provided. Trying to capture from ${serverUrl}...`,
`No screenshot path provided. Capturing from ${serverUrl} via MCP...`,
)
const { Client } = await import('@modelcontextprotocol/sdk/client/index.js')
@@ -134,82 +187,101 @@ async function loadScreenshot(path?: string): Promise<string> {
arguments: { format: 'png', page: 1 },
})) as { content: Array<{ type: string; data?: string }> }
const imageContent = result.content?.find((c) => c.type === 'image')
if (!imageContent?.data)
throw new Error('No image data in screenshot response')
const image = result.content?.find((c) => c.type === 'image')
if (!image?.data)
throw new Error('No image data in take_screenshot response')
console.log(
`Captured screenshot (${(imageContent.data.length / 1024).toFixed(0)} KB base64)`,
`Captured screenshot (${(image.data.length / 1024).toFixed(0)} KB base64)`,
)
return imageContent.data
return image.data
} finally {
try {
await transport.close()
} catch {}
} catch {
/* ignore */
}
}
}
function summarize(history: CladoResponse[]): string {
if (history.length === 0) return 'None'
return history
.map((h) => {
switch (h.action) {
case 'click':
case 'double_click':
case 'right_click':
case 'hover':
return `${h.action}(${h.x}, ${h.y})`
case 'type':
return `type(${JSON.stringify(h.text ?? '')})`
case 'press_key':
return `press_key(${JSON.stringify(h.key ?? '')})`
case 'scroll':
return `scroll(${h.direction ?? 'down'})`
case 'drag':
return `drag(${h.startX},${h.startY} -> ${h.endX},${h.endY})`
case 'wait':
return `wait(${h.time ?? 1}s)`
case 'end':
return 'end()'
default:
return h.action ?? 'invalid'
}
})
.join(' -> ')
}
async function main() {
const screenshotPath = process.argv[2]
console.log('=== Clado action endpoint smoke test ===')
console.log('=== Clado API Test ===\n')
// Health checks (parallel)
const [actionHealthy, groundingHealthy] = await Promise.all([
checkHealth('Action Model', ACTION_HEALTH_URL),
checkHealth('Grounding Model', GROUNDING_HEALTH_URL),
])
if (!actionHealthy && !groundingHealthy) {
console.log('\nBoth endpoints are down. Exiting.')
const healthy = await checkHealth()
if (!healthy) {
console.log('\nHealth check failed. Exiting.')
process.exit(1)
}
// Load screenshot
let imageBase64: string
try {
imageBase64 = await loadScreenshot(screenshotPath)
imageBase64 = await loadScreenshot(process.argv[2])
} catch (err) {
console.log(
`\nFailed to load screenshot: ${err instanceof Error ? err.message : err}`,
)
console.log(
'Provide a screenshot path: bun apps/eval/scripts/test-clado-api.ts path/to/screenshot.png',
'Pass a path: bun apps/eval/scripts/test-clado-api.ts path/to/screenshot.png',
)
process.exit(1)
}
const instruction = 'Click on the search button or search bar'
const history: CladoResponse[] = []
// Test grounding model
if (groundingHealthy) {
await testGenerate('Grounding Model', GROUNDING_URL, {
instruction,
// Step 1: open task — let the model decide what to do.
const step1 = await generate('Step 1: cold task', {
instruction: 'Find the search bar and click it',
image_base64: imageBase64,
history: 'None',
})
if (step1?.action) history.push(step1)
// Step 2: continuation with history, asks for typing.
if (step1?.action) {
const step2 = await generate('Step 2: with history', {
instruction: 'Type "hello world" into the search bar',
image_base64: imageBase64,
history: summarize(history),
})
} else {
console.log('\nSkipping grounding model (unhealthy)')
if (step2?.action) history.push(step2)
}
// Test action model (no history)
if (actionHealthy) {
const result = await testGenerate('Action Model (step 1)', ACTION_URL, {
instruction,
image_base64: imageBase64,
history: 'None',
})
// Test action model with history (simulate multi-turn)
if (result && result.action === 'click') {
await testGenerate('Action Model (step 2, with history)', ACTION_URL, {
instruction: 'Type "hello world" in the search bar',
image_base64: imageBase64,
history: `click(${result.x}, ${result.y})`,
})
}
} else {
console.log('\nSkipping action model (unhealthy)')
}
// Step 3: ask for end with a final answer to exercise that field.
await generate('Step 3: ask for end+final_answer', {
instruction:
'You have completed the task. Reply with end() and final_answer="done".',
image_base64: imageBase64,
history: summarize(history),
})
console.log('\n=== Done ===')
}


@@ -1,349 +1,43 @@
#!/usr/bin/env bun
/**
* Upload eval runs to R2.
*
* Two modes:
* bun scripts/upload-run.ts results/browseros-agent-weekly/2026-03-21-1730
* → uploads that specific run
*
* bun scripts/upload-run.ts results/browseros-agent-weekly
* → finds all timestamped subfolders, uploads any not yet in R2
*
* Env vars: EVAL_R2_ACCOUNT_ID, EVAL_R2_ACCESS_KEY_ID, EVAL_R2_SECRET_ACCESS_KEY
* EVAL_R2_BUCKET (default: browseros-eval)
* EVAL_R2_CDN_BASE_URL (default: https://eval.browseros.com)
*/
import { readdir, readFile, stat } from 'node:fs/promises'
import { basename, dirname, extname, join } from 'node:path'
import {
GetObjectCommand,
PutObjectCommand,
S3Client,
} from '@aws-sdk/client-s3'
import {
loadR2ConfigFromEnv,
R2Publisher,
} from '../src/publishing/r2-publisher'
const CONCURRENCY = 20
const CONTENT_TYPES: Record<string, string> = {
'.json': 'application/json',
'.jsonl': 'application/x-ndjson',
'.png': 'image/png',
}
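The table above backs the per-file content type; the upload jobs fall back to a generic binary type for unknown extensions. A small sketch of that lookup (`contentTypeFor` is an illustrative helper name, not in the diff — the script does the same inline):

```typescript
// Map a file extension to its Content-Type, falling back to
// application/octet-stream for anything not in the table.
const CONTENT_TYPES: Record<string, string> = {
  '.json': 'application/json',
  '.jsonl': 'application/x-ndjson',
  '.png': 'image/png',
}

function contentTypeFor(ext: string): string {
  return CONTENT_TYPES[ext] ?? 'application/octet-stream'
}
```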
interface R2Config {
accountId: string
accessKeyId: string
secretAccessKey: string
bucket: string
cdnBaseUrl: string
}
function loadConfig(): R2Config {
const accountId = process.env.EVAL_R2_ACCOUNT_ID
const accessKeyId = process.env.EVAL_R2_ACCESS_KEY_ID
const secretAccessKey = process.env.EVAL_R2_SECRET_ACCESS_KEY
if (!accountId || !accessKeyId || !secretAccessKey) {
console.error(
'Missing required env vars: EVAL_R2_ACCOUNT_ID, EVAL_R2_ACCESS_KEY_ID, EVAL_R2_SECRET_ACCESS_KEY',
)
process.exit(1)
}
return {
accountId,
accessKeyId,
secretAccessKey,
bucket: process.env.EVAL_R2_BUCKET || 'browseros-eval',
cdnBaseUrl: (
process.env.EVAL_R2_CDN_BASE_URL || 'https://eval.browseros.com'
).replace(/\/+$/, ''),
}
}
function createClient(config: R2Config): S3Client {
return new S3Client({
region: 'auto',
endpoint: `https://${config.accountId}.r2.cloudflarestorage.com`,
credentials: {
accessKeyId: config.accessKeyId,
secretAccessKey: config.secretAccessKey,
},
})
}
async function upload(
client: S3Client,
bucket: string,
key: string,
body: Buffer,
contentType: string,
) {
await client.send(
new PutObjectCommand({
Bucket: bucket,
Key: key,
Body: body,
ContentType: contentType,
}),
)
}
async function collectFiles(dir: string): Promise<string[]> {
const files: string[] = []
const entries = await readdir(dir, { withFileTypes: true })
for (const entry of entries) {
const full = join(dir, entry.name)
if (entry.isDirectory()) {
files.push(...(await collectFiles(full)))
} else {
files.push(full)
}
}
return files
}
async function runPool<T>(
items: T[],
concurrency: number,
fn: (item: T) => Promise<void>,
) {
let i = 0
const workers = Array.from({ length: concurrency }, async () => {
while (i < items.length) {
const idx = i++
await fn(items[idx])
}
})
await Promise.all(workers)
}
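`runPool` relies on JavaScript's single-threaded event loop: the shared cursor `i` is only read and incremented between awaits, so workers never double-process an item and at most `concurrency` calls to `fn` are in flight. A self-contained usage sketch (the `demo` wrapper is illustrative):

```typescript
// Bounded-concurrency pool, same shape as runPool above: N workers pull the
// next index from a shared cursor until the items run out.
async function runPool<T>(
  items: T[],
  concurrency: number,
  fn: (item: T) => Promise<void>,
): Promise<void> {
  let i = 0
  const workers = Array.from({ length: concurrency }, async () => {
    while (i < items.length) {
      const idx = i++
      await fn(items[idx])
    }
  })
  await Promise.all(workers)
}

// Usage: process five "upload jobs" with at most 2 in flight at once.
async function demo(): Promise<number[]> {
  const done: number[] = []
  await runPool([1, 2, 3, 4, 5], 2, async (n) => {
    done.push(n)
  })
  return done
}
```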
// Check if a run has already been uploaded to R2
async function isUploaded(
client: S3Client,
bucket: string,
runId: string,
): Promise<boolean> {
try {
await client.send(
new GetObjectCommand({
Bucket: bucket,
Key: `runs/${runId}/manifest.json`,
}),
)
return true
} catch {
return false
}
}
// Detect if a directory is a run dir (has task subdirs with metadata.json)
// vs a config dir (has timestamped subdirs like 2026-03-21-1730/)
async function isRunDir(dir: string): Promise<boolean> {
const entries = await readdir(dir, { withFileTypes: true })
const subdirs = entries.filter((e) => e.isDirectory())
for (const subdir of subdirs) {
const metaPath = join(dir, subdir.name, 'metadata.json')
const metaStat = await stat(metaPath).catch(() => null)
if (metaStat?.isFile()) return true
}
return false
}
async function uploadSingleRun(
runDir: string,
runId: string,
r2Config: R2Config,
client: S3Client,
): Promise<void> {
const taskDirs = await readdir(runDir, { withFileTypes: true })
const taskEntries = taskDirs.filter((d) => d.isDirectory())
if (taskEntries.length === 0) {
console.warn(` No task subdirectories in ${runId}, skipping`)
return
}
const manifestTasks: Record<string, unknown>[] = []
const jobs: { key: string; filePath: string; contentType: string }[] = []
// Extract agent config from first task
let agentConfig: Record<string, unknown> | undefined
let dataset: string | undefined
for (const taskDir of taskEntries) {
const taskId = taskDir.name
const taskPath = join(runDir, taskId)
const metaPath = join(taskPath, 'metadata.json')
let meta: Record<string, unknown> = {}
try {
meta = JSON.parse(await readFile(metaPath, 'utf-8'))
} catch {
continue
}
if (!agentConfig && meta.agent_config)
agentConfig = meta.agent_config as Record<string, unknown>
if (!dataset && meta.dataset) dataset = meta.dataset as string
const files = await collectFiles(taskPath)
let screenshotCount = 0
for (const file of files) {
const relative = file.slice(taskPath.length + 1)
const ext = extname(file)
if (relative.startsWith('screenshots/') && ext === '.png')
screenshotCount++
jobs.push({
key: `runs/${runId}/${taskId}/${relative}`,
filePath: file,
contentType: CONTENT_TYPES[ext] || 'application/octet-stream',
})
}
manifestTasks.push({
queryId: meta.query_id || taskId,
query: meta.query || '',
startUrl: meta.start_url || '',
status:
meta.termination_reason === 'completed'
? 'completed'
: meta.termination_reason || 'unknown',
durationMs: meta.total_duration_ms || 0,
screenshotCount: (meta.screenshot_count as number) || screenshotCount,
graderResults: meta.grader_results || {},
})
}
if (manifestTasks.length === 0) {
console.warn(` No completed tasks in ${runId}, skipping`)
return
}
console.log(
` Uploading ${jobs.length} files across ${manifestTasks.length} tasks...`,
)
let uploaded = 0
await runPool(jobs, CONCURRENCY, async (job) => {
const body = await readFile(job.filePath)
await upload(client, r2Config.bucket, job.key, body, job.contentType)
uploaded++
if (uploaded % 50 === 0 || uploaded === jobs.length) {
console.log(` ${uploaded}/${jobs.length}`)
}
})
// Read summary.json if it exists
let summaryData: Record<string, unknown> | undefined
try {
summaryData = JSON.parse(
await readFile(join(runDir, 'summary.json'), 'utf-8'),
)
} catch {}
// Upload manifest
const manifest = {
runId,
uploadedAt: new Date().toISOString(),
agentConfig,
dataset,
summary: summaryData
? {
passRate: summaryData.passRate,
avgDurationMs: summaryData.avgDurationMs,
}
: undefined,
tasks: manifestTasks,
}
const manifestBody = Buffer.from(JSON.stringify(manifest, null, 2))
await upload(
client,
r2Config.bucket,
`runs/${runId}/manifest.json`,
manifestBody,
'application/json',
)
// Upload viewer.html to bucket root
const viewerPath = join(
import.meta.dir,
'..',
'src',
'dashboard',
'viewer.html',
)
const viewerBody = await readFile(viewerPath)
await upload(client, r2Config.bucket, 'viewer.html', viewerBody, 'text/html')
console.log(` Uploaded ${uploaded + 2} files`)
console.log(` ${r2Config.cdnBaseUrl}/viewer.html?run=${runId}`)
}
async function main() {
async function main(): Promise<void> {
const inputDir = process.argv[2]
if (!inputDir) {
console.error(
throw new Error(
'Usage:\n' +
' bun scripts/upload-run.ts results/config-name/2026-03-21-1730 (specific run)\n' +
' bun scripts/upload-run.ts results/config-name (all un-uploaded runs)',
)
process.exit(1)
}
const dirStat = await stat(inputDir).catch(() => null)
if (!dirStat?.isDirectory()) {
console.error(`Not a directory: ${inputDir}`)
process.exit(1)
}
const r2Config = loadConfig()
const client = createClient(r2Config)
if (await isRunDir(inputDir)) {
// Single run: results/config-name/2026-03-21-1730
const timestamp = basename(inputDir)
const configName = basename(dirname(inputDir))
const runId = `${configName}-${timestamp}`
console.log(`Uploading run: ${runId}`)
await uploadSingleRun(inputDir, runId, r2Config, client)
} else {
// Config dir: results/config-name/ — upload all un-uploaded runs
const configName = basename(inputDir)
const entries = await readdir(inputDir, { withFileTypes: true })
const runDirs = entries
.filter((e) => e.isDirectory())
.map((e) => e.name)
.sort()
if (runDirs.length === 0) {
console.error('No run subdirectories found')
process.exit(1)
}
console.log(
`Found ${runDirs.length} runs for config "${configName}", checking R2...`,
)
let uploadedCount = 0
for (const dir of runDirs) {
const runId = `${configName}-${dir}`
const alreadyUploaded = await isUploaded(client, r2Config.bucket, runId)
if (alreadyUploaded) {
console.log(` ${runId}: already uploaded, skipping`)
continue
}
console.log(` ${runId}: uploading...`)
await uploadSingleRun(join(inputDir, dir), runId, r2Config, client)
uploadedCount++
}
console.log(
`\nDone. Uploaded ${uploadedCount} new run(s), ${runDirs.length - uploadedCount} already in R2.`,
)
}
const publisher = new R2Publisher({ config: loadR2ConfigFromEnv() })
const result = await publisher.publishPath(inputDir)
for (const run of result.uploadedRuns) {
console.log(`Uploaded ${run.uploadedFiles} files for ${run.runId}`)
console.log(run.viewerUrl)
}
for (const runId of result.skippedRuns) {
console.log(`${runId}: already uploaded, skipping`)
}
console.log(
`Done. Uploaded ${result.uploadedRuns.length} run(s), skipped ${result.skippedRuns.length}.`,
)
}
main()
main().catch((error) => {
console.error(error instanceof Error ? error.message : String(error))
process.exit(1)
})

View File

@@ -24,45 +24,11 @@ import {
PutObjectCommand,
S3Client,
} from '@aws-sdk/client-s3'
interface ManifestTask {
queryId: string
query: string
status: string
durationMs: number
screenshotCount: number
graderResults: Record<string, { pass: boolean; score: number }>
}
interface Manifest {
runId: string
uploadedAt: string
agentConfig?: { type?: string; model?: string }
dataset?: string
summary?: { passRate?: number; avgDurationMs?: number }
tasks: ManifestTask[]
}
interface RunSummary {
runId: string
configName: string
date: string
avgScore: number
total: number
completed: number
failed: number
timeout: number
avgDurationMs: number
model: string
dataset: string
agentType: string
}
const PASS_FAIL_GRADER_ORDER = [
'agisdk_state_diff',
'infinity_state',
'performance_grader',
]
import {
buildRunSummaries,
type ReportManifest,
type RunSummary,
} from '../src/reporting/run-summary'
function requireEnv(name: string): string {
const value = process.env[name]
@@ -87,7 +53,7 @@ const client = new S3Client({
// Step 1: List all manifest.json files in runs/
console.log('Scanning R2 for eval runs...')
const manifests: Manifest[] = []
const manifests: ReportManifest[] = []
let continuationToken: string | undefined
do {
@@ -127,64 +93,9 @@ if (manifests.length === 0) {
}
// Step 2: Build run summaries
const runs: RunSummary[] = manifests
.map((m) => {
const total = m.tasks.length
const completed = m.tasks.filter((t) => t.status === 'completed').length
const failed = m.tasks.filter((t) => t.status === 'failed').length
const timeout = m.tasks.filter((t) => t.status === 'timeout').length
let scoredCount = 0
let scoreSum = 0
for (const task of m.tasks) {
if (!task.graderResults) continue
for (const name of PASS_FAIL_GRADER_ORDER) {
if (task.graderResults[name]) {
scoredCount++
scoreSum += task.graderResults[name].score ?? 0
break
}
}
}
const avgScore = scoredCount > 0 ? (scoreSum / scoredCount) * 100 : 0
const durations = m.tasks
.filter((t) => t.durationMs > 0)
.map((t) => t.durationMs)
const avgDurationMs =
durations.length > 0
? durations.reduce((a, b) => a + b, 0) / durations.length
: 0
const date = m.uploadedAt
? `${m.uploadedAt.split('T')[0]} ${m.uploadedAt.split('T')[1]?.slice(0, 5) || ''}`
: m.runId.slice(0, 15)
const model = m.agentConfig?.model || 'unknown'
const dataset = m.dataset || m.runId
const agentType = m.agentConfig?.type || 'unknown'
const configName = extractConfigName(m.runId)
return {
runId: m.runId,
configName,
date,
avgScore,
total,
completed,
failed,
timeout,
avgDurationMs,
model,
dataset,
agentType,
}
})
.sort((a, b) => a.date.localeCompare(b.date))
const runs: RunSummary[] = buildRunSummaries(manifests)
// Step 3: Identify unique config groups
// runId can be "ci-weekly" (old) or "ci-weekly-2026-03-21-1730" (timestamped)
// Extract config name by stripping the date-time suffix pattern
function escHtml(s: string): string {
return s
.replace(/&/g, '&amp;')
@@ -193,12 +104,6 @@ function escHtml(s: string): string {
.replace(/"/g, '&quot;')
}
function extractConfigName(runId: string): string {
// "browseros-agent-weekly-2026-03-21-1730" → "browseros-agent-weekly"
// "ci-weekly" → "ci-weekly" (no timestamp, old format)
return runId.replace(/-\d{4}-\d{2}-\d{2}-\d{4}$/, '')
}
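The anchored regex strips exactly one trailing `-YYYY-MM-DD-HHMM` timestamp, so legacy run IDs without a timestamp pass through unchanged. A standalone demonstration of the behavior:

```typescript
// Strip the "-YYYY-MM-DD-HHMM" suffix; old-format IDs are left untouched.
function extractConfigName(runId: string): string {
  return runId.replace(/-\d{4}-\d{2}-\d{2}-\d{4}$/, '')
}

console.log(extractConfigName('browseros-agent-weekly-2026-03-21-1730')) // → "browseros-agent-weekly"
console.log(extractConfigName('ci-weekly')) // → "ci-weekly"
```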
const configGroups = [...new Set(runs.map((r) => r.configName))]
const defaultConfig = configGroups.includes('ci-weekly')
? 'ci-weekly'

View File

@@ -1,113 +1,67 @@
import { randomUUID } from 'node:crypto'
import { MAX_ACTIONS_PER_DELEGATION } from '../../../../constants'
import { McpClient, type McpToolResult } from '../../../../utils/mcp-client'
import { sleep } from '../../../../utils/sleep'
import type {
ExecutorConfig,
ExecutorResult,
} from '../../../orchestrator-executor/types'
import type { ExecutorCallbacks } from '../../executor-backend'
import {
CLADO_REQUEST_TIMEOUT_MS,
MAX_ACTIONS_PER_DELEGATION,
} from '../../constants'
import { McpClient, type McpToolResult } from '../../utils/mcp-client'
import { sleep } from '../../utils/sleep'
import type { ExecutorCallbacks } from './executor'
import type { ExecutorConfig, ExecutorResult } from './types'
import {
extractCladoThinking,
formatCladoHistory,
getCladoActionSignature,
parseCladoActions,
summarizeCladoPrediction,
} from './clado-actions'
import {
normalizeCladoDirection,
normalizeCladoPressKey,
normalizeCladoScrollAmount,
prepareCladoToolArgs,
resolveCladoPoint,
} from './clado-browser-driver'
import { CladoActionClient } from './clado-client'
import {
CLADO_ACTION_PROVIDER,
type CladoAction,
type CladoActionPoint,
type CladoActionResponse,
type CladoViewport,
isCladoActionProvider,
} from './types'
const CLADO_ACTION_PROVIDER = 'clado-action'
const PAGE_SCOPED_TOOLS = new Set<string>([
'take_screenshot',
'evaluate_script',
'click',
'click_at',
'hover',
'hover_at',
'clear',
'fill',
'press_key',
'type_at',
'drag',
'drag_at',
'scroll',
'handle_dialog',
'select_option',
'navigate_page',
'close_page',
'wait_for',
])
interface CladoActionResponse {
action?: string
x?: number
y?: number
text?: string
key?: string
direction?: string
startX?: number
startY?: number
endX?: number
endY?: number
amount?: number
time?: number
inference_time_seconds?: number
raw_response?: string
}
interface Viewport {
width: number
height: number
}
interface CladoAction {
action: string
x?: number
y?: number
text?: string
key?: string
direction?: string
startX?: number
startY?: number
endX?: number
endY?: number
amount?: number
time?: number
}
type RawActionPayload = Partial<CladoAction>
interface ActionPoint {
x: number
y: number
}
const MAX_CONSECUTIVE_PARSE_FAILURES = 3
function asErrorMessage(error: unknown): string {
return error instanceof Error ? error.message : String(error)
}
function clampNormalized(value: number): number {
return Math.min(999, Math.max(0, Math.round(value)))
}
function isCladoProvider(provider: string): boolean {
return provider === CLADO_ACTION_PROVIDER
}
export class CladoActionExecutor {
private readonly mcpClient: McpClient
private readonly cladoClient: CladoActionClient
private readonly pageId: number
private callbacks: ExecutorCallbacks = {}
private stepsUsed = 0
private viewport: Viewport | null = null
private lastPoint: ActionPoint | null = null
private viewport: CladoViewport | null = null
private lastPoint: CladoActionPoint | null = null
private currentUrl = ''
constructor(
private readonly config: ExecutorConfig,
config: ExecutorConfig,
serverUrl: string,
readonly _windowId?: number,
readonly _tabId?: number,
initialPageId?: number,
) {
if (!isCladoProvider(config.provider)) {
if (!isCladoActionProvider(config.provider)) {
throw new Error(
`CladoActionExecutor requires provider="${CLADO_ACTION_PROVIDER}"`,
)
}
this.mcpClient = new McpClient(`${serverUrl}/mcp`)
this.cladoClient = new CladoActionClient({
baseUrl: config.baseUrl,
apiKey: config.apiKey,
})
this.pageId = initialPageId ?? 1
}
@@ -135,6 +89,8 @@ export class CladoActionExecutor {
const actionHistory: CladoAction[] = []
let predictionCalls = 0
const thinkingTrace: string[] = []
let consecutiveParseFailures = 0
let finalAnswer: string | undefined
let status: ExecutorResult['status'] = 'done'
let reason = 'Goal executed.'
@@ -155,7 +111,7 @@ export class CladoActionExecutor {
break
}
const historyForPrediction = this.formatHistory(actionHistory)
const historyForPrediction = formatCladoHistory(actionHistory)
const actionToolCallId = randomUUID()
const predictionInput = {
instruction,
@@ -177,7 +133,7 @@ export class CladoActionExecutor {
signal,
)
predictionCalls++
const thinking = this.extractThinking(prediction.raw_response)
const thinking = extractCladoThinking(prediction.raw_response)
if (thinking) {
const previous = thinkingTrace[thinkingTrace.length - 1]
if (previous !== thinking) {
@@ -207,8 +163,19 @@ export class CladoActionExecutor {
break
}
const predictedActions = this.parseActions(prediction)
const predictedActions = parseCladoActions(prediction)
if (predictedActions.length === 0) {
// Per Clado contract: HTTP 200 with action=null on parse failure.
// Count as an invalid step so the model can self-correct on the
// next call instead of dropping the trajectory.
consecutiveParseFailures++
const parseError =
prediction.parse_error ?? 'no parsable <answer> in raw_response'
actionHistory.push({
action: 'invalid',
text: `parse_error: ${parseError}`,
})
this.stepsUsed++
await this.callbacks.onStepFinish?.({
toolCalls: [
{
@@ -222,16 +189,23 @@ export class CladoActionExecutor {
toolCallId: actionToolCallId,
toolName: 'clado_action_predict',
output: {
prediction: this.summarizePrediction(prediction),
prediction: summarizeCladoPrediction(prediction),
parsedActions: [],
parseError,
consecutiveParseFailures,
},
},
],
})
status = 'blocked'
reason = 'Clado action response did not contain a valid action.'
break
if (consecutiveParseFailures >= MAX_CONSECUTIVE_PARSE_FAILURES) {
status = 'blocked'
reason = `Clado returned ${consecutiveParseFailures} consecutive unparseable responses.`
break
}
continue
}
consecutiveParseFailures = 0
let requestedStop = false
const executionNotes: string[] = []
@@ -257,7 +231,7 @@ export class CladoActionExecutor {
toolCallId: actionToolCallId,
toolName: 'clado_action_predict',
output: {
prediction: this.summarizePrediction(prediction),
prediction: summarizeCladoPrediction(prediction),
parsedActions: predictedActions,
executed: executionNotes,
},
@@ -272,7 +246,12 @@ export class CladoActionExecutor {
actionHistory.push(predictedAction)
if (predictedAction.action === 'end') {
reason = 'Model requested end() and marked task complete.'
if (predictedAction.final_answer) {
finalAnswer = predictedAction.final_answer
reason = `Model requested end() with final_answer: ${predictedAction.final_answer.slice(0, 240)}`
} else {
reason = 'Model requested end() and marked task complete.'
}
requestedStop = true
break
}
@@ -293,7 +272,7 @@ export class CladoActionExecutor {
toolCallId: actionToolCallId,
toolName: 'clado_action_predict',
output: {
prediction: this.summarizePrediction(prediction),
prediction: summarizeCladoPrediction(prediction),
parsedActions: predictedActions,
executed: executionNotes,
},
@@ -327,6 +306,7 @@ export class CladoActionExecutor {
actions: actionHistory,
url: this.currentUrl,
thinkingTrace,
finalAnswer,
})
return {
@@ -344,121 +324,12 @@ export class CladoActionExecutor {
actionHistory: CladoAction[],
signal?: AbortSignal,
): Promise<CladoActionResponse> {
if (!this.config.baseUrl) {
throw new Error('executor.baseUrl must be set for clado-action provider')
}
const requestController = new AbortController()
const onAbort = () => requestController.abort()
signal?.addEventListener('abort', onAbort, { once: true })
const timeoutHandle = setTimeout(() => {
requestController.abort()
}, CLADO_REQUEST_TIMEOUT_MS)
try {
const headers: Record<string, string> = {
'Content-Type': 'application/json',
}
if (this.config.apiKey) {
headers.Authorization = `Bearer ${this.config.apiKey}`
}
const response = await fetch(this.config.baseUrl, {
method: 'POST',
headers,
body: JSON.stringify({
instruction,
image_base64: imageBase64,
history: this.formatHistory(actionHistory),
}),
signal: requestController.signal,
})
if (!response.ok) {
const body = await response.text()
throw new Error(
`HTTP ${response.status} ${response.statusText}: ${body.slice(0, 400)}`,
)
}
return (await response.json()) as CladoActionResponse
} finally {
clearTimeout(timeoutHandle)
signal?.removeEventListener('abort', onAbort)
}
}
private parseActions(prediction: CladoActionResponse): CladoAction[] {
const actionFromField =
typeof prediction.action === 'string' ? prediction.action : null
const rawActions = this.parseActionsFromRawResponse(prediction.raw_response)
const primaryFromRaw = rawActions[0] ?? null
const mergedPrimary = {
...primaryFromRaw,
...prediction,
action: actionFromField ?? primaryFromRaw?.action,
}
const normalized: CladoAction[] = []
const primary = this.normalizeActionPayload(mergedPrimary)
if (primary) normalized.push(primary)
for (const candidate of rawActions.slice(1)) {
const parsed = this.normalizeActionPayload(candidate)
if (!parsed) continue
const prev = normalized[normalized.length - 1]
if (
!prev ||
this.getActionSignature(prev) !== this.getActionSignature(parsed)
) {
normalized.push(parsed)
}
}
return normalized
}
private normalizeActionPayload(
payload: RawActionPayload,
): CladoAction | null {
if (!payload.action || typeof payload.action !== 'string') {
return null
}
return {
action: payload.action,
x: typeof payload.x === 'number' ? payload.x : undefined,
y: typeof payload.y === 'number' ? payload.y : undefined,
text: typeof payload.text === 'string' ? payload.text : undefined,
key: typeof payload.key === 'string' ? payload.key : undefined,
direction:
typeof payload.direction === 'string' ? payload.direction : undefined,
startX: typeof payload.startX === 'number' ? payload.startX : undefined,
startY: typeof payload.startY === 'number' ? payload.startY : undefined,
endX: typeof payload.endX === 'number' ? payload.endX : undefined,
endY: typeof payload.endY === 'number' ? payload.endY : undefined,
amount: typeof payload.amount === 'number' ? payload.amount : undefined,
time: typeof payload.time === 'number' ? payload.time : undefined,
}
}
private parseActionsFromRawResponse(
rawResponse: string | undefined,
): RawActionPayload[] {
if (!rawResponse) return []
const matches = [
...rawResponse.matchAll(/<answer>\s*([\s\S]*?)\s*<\/answer>/gi),
]
const parsed: RawActionPayload[] = []
for (const match of matches) {
try {
parsed.push(JSON.parse(match[1]) as RawActionPayload)
} catch {
// ignore malformed answer blocks
}
}
return parsed
return this.cladoClient.requestActionPrediction({
instruction,
imageBase64,
actionHistory,
signal,
})
}
private async executeAction(
@@ -529,14 +400,14 @@ export class CladoActionExecutor {
}
case 'press_key': {
const key = this.normalizePressKey(action.key)
const key = normalizeCladoPressKey(action.key)
await this.runTool('press_key', { key }, signal)
return `Pressed key "${key}".`
}
case 'scroll': {
const direction = this.normalizeDirection(action.direction)
const amountPx = this.normalizeScrollAmount(action.amount)
const direction = normalizeCladoDirection(action.direction)
const amountPx = normalizeCladoScrollAmount(action.amount)
const ticks = Math.max(1, Math.round(amountPx / 120))
await this.runTool('scroll', { direction, amount: ticks }, signal)
@@ -578,7 +449,9 @@ export class CladoActionExecutor {
}
case 'end': {
return 'Model requested end().'
return action.final_answer
? `Model requested end() with final_answer: ${action.final_answer.slice(0, 240)}`
: 'Model requested end().'
}
default: {
@@ -588,9 +461,10 @@ export class CladoActionExecutor {
}
private async captureScreenshotBase64(signal?: AbortSignal): Promise<string> {
// Clado contract is PNG or JPEG; use PNG for lossless input.
const result = await this.runTool(
'take_screenshot',
{ format: 'webp', quality: 80 },
{ format: 'png' },
signal,
)
@@ -604,7 +478,7 @@ export class CladoActionExecutor {
return image.data
}
private async getViewport(signal?: AbortSignal): Promise<Viewport> {
private async getViewport(signal?: AbortSignal): Promise<CladoViewport> {
if (this.viewport) return this.viewport
try {
@@ -635,15 +509,9 @@ export class CladoActionExecutor {
normalizedX: number | undefined,
normalizedY: number | undefined,
signal?: AbortSignal,
): Promise<ActionPoint> {
): Promise<CladoActionPoint> {
const viewport = await this.getViewport(signal)
const nx = clampNormalized(normalizedX ?? 500)
const ny = clampNormalized(normalizedY ?? 500)
return {
x: Math.round((nx / 1000) * viewport.width),
y: Math.round((ny / 1000) * viewport.height),
}
return resolveCladoPoint(viewport, normalizedX, normalizedY)
}
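Both the removed inline math and the new `resolveCladoPoint` implement the same mapping: the model emits coordinates on a 0–999 normalized grid, which are clamped and scaled to viewport pixels, with missing values defaulting to 500 (viewport center). A standalone sketch under those assumptions (names adapted from the diff):

```typescript
interface Viewport {
  width: number
  height: number
}

// Clamp a model-emitted coordinate onto the 0–999 grid.
function clampNormalized(value: number): number {
  return Math.min(999, Math.max(0, Math.round(value)))
}

// Scale normalized coordinates to viewport pixels; missing values default
// to 500, i.e. the center of the viewport.
function resolvePoint(
  viewport: Viewport,
  normalizedX?: number,
  normalizedY?: number,
): { x: number; y: number } {
  const nx = clampNormalized(normalizedX ?? 500)
  const ny = clampNormalized(normalizedY ?? 500)
  return {
    x: Math.round((nx / 1000) * viewport.width),
    y: Math.round((ny / 1000) * viewport.height),
  }
}
```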
private async getCurrentUrl(signal?: AbortSignal): Promise<string> {
@@ -670,7 +538,7 @@ export class CladoActionExecutor {
throw new Error('aborted')
}
const toolArgs = this.prepareToolArgs(toolName, args)
const toolArgs = prepareCladoToolArgs(toolName, args, this.pageId)
try {
const raw = await this.mcpClient.callTool(toolName, toolArgs)
@@ -689,211 +557,22 @@ export class CladoActionExecutor {
}
}
private prepareToolArgs(
toolName: string,
args: Record<string, unknown>,
): Record<string, unknown> {
const prepared: Record<string, unknown> = { ...args }
if (
toolName === 'evaluate_script' &&
typeof prepared.function === 'string' &&
prepared.expression === undefined
) {
prepared.expression = this.toEvaluateExpression(prepared.function)
delete prepared.function
}
if (
toolName === 'click_at' &&
typeof prepared.dblClick === 'boolean' &&
prepared.clickCount === undefined
) {
prepared.clickCount = prepared.dblClick ? 2 : 1
delete prepared.dblClick
}
// Use fixed page ID for all page-scoped tools (single-page operation)
if (PAGE_SCOPED_TOOLS.has(toolName) && typeof prepared.page !== 'number') {
prepared.page = this.pageId
}
return prepared
}
private toEvaluateExpression(rawFunction: unknown): string {
const source = String(rawFunction).trim()
if (source.startsWith('() =>') || source.startsWith('async () =>')) {
return `(${source})()`
}
if (source.startsWith('function')) {
return `(${source})()`
}
return source
}
private normalizePressKey(key: string | undefined): string {
const raw = (key ?? '').trim()
if (!raw) throw new Error('press_key action missing key field')
const map: Record<string, string> = {
'C-a': 'Control+A',
'C-c': 'Control+C',
'C-v': 'Control+V',
'C-x': 'Control+X',
'C-z': 'Control+Z',
'C-y': 'Control+Y',
'C-s': 'Control+S',
'C-t': 'Control+T',
'C-w': 'Control+W',
'C-h': 'Control+H',
'C-f': 'Control+F',
'C-+': 'Control++',
'C--': 'Control+-',
'C-tab': 'Control+Tab',
'C-S-tab': 'Control+Shift+Tab',
'C-S-n': 'Control+Shift+N',
'C-down': 'Control+ArrowDown',
'M-f4': 'Alt+F4',
}
return map[raw] ?? raw
}
private normalizeDirection(
direction: string | undefined,
): 'up' | 'down' | 'left' | 'right' {
if (
direction === 'up' ||
direction === 'down' ||
direction === 'left' ||
direction === 'right'
) {
return direction
}
return 'down'
}
private normalizeScrollAmount(amount: number | undefined): number {
if (typeof amount !== 'number') return 500
if (amount <= 0) return 100
const clamped = Math.min(amount, 1000)
return Math.max(100, Math.round((clamped / 1000) * 900))
}
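`normalizeScrollAmount` maps the model's 0–1000 scroll amount into a 100–900 px range, and the executor then converts pixels into wheel ticks of roughly 120 px each (the `/ 120` step shown earlier in this diff). A worked sketch of the full pipeline (`scrollTicks` is an illustrative name for the inline tick math):

```typescript
// Map the model's 0–1000 scroll amount to 100–900 px: missing → 500 px,
// non-positive → 100 px, otherwise clamp to 1000 and scale by 900/1000.
function normalizeScrollAmount(amount: number | undefined): number {
  if (typeof amount !== 'number') return 500
  if (amount <= 0) return 100
  const clamped = Math.min(amount, 1000)
  return Math.max(100, Math.round((clamped / 1000) * 900))
}

// Convert the pixel amount to wheel ticks of ~120 px, at least one tick.
function scrollTicks(amount: number | undefined): number {
  return Math.max(1, Math.round(normalizeScrollAmount(amount) / 120))
}
```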
private summarizePrediction(
prediction: CladoActionResponse,
): Record<string, unknown> {
const preview =
typeof prediction.raw_response === 'string' &&
prediction.raw_response.length > 0
? prediction.raw_response.slice(0, 240)
: undefined
return {
action: prediction.action,
x: prediction.x,
y: prediction.y,
text: prediction.text,
key: prediction.key,
direction: prediction.direction,
startX: prediction.startX,
startY: prediction.startY,
endX: prediction.endX,
endY: prediction.endY,
amount: prediction.amount,
time: prediction.time,
inference_time_seconds: prediction.inference_time_seconds,
raw_response_preview: preview,
}
}
private extractThinking(rawResponse: string | undefined): string | undefined {
if (!rawResponse) return undefined
const matches = [
...rawResponse.matchAll(/<thinking>\s*([\s\S]*?)\s*<\/thinking>/gi),
]
if (matches.length === 0) return undefined
const merged = matches
.map((match) => match[1]?.replace(/\s+/g, ' ').trim() ?? '')
.filter((value) => value.length > 0)
.join(' ')
if (!merged) return undefined
return merged
}
private getActionSignature(action: CladoAction): string {
switch (action.action) {
case 'click':
case 'double_click':
case 'right_click':
case 'hover':
return `${action.action}:${action.x ?? 'x'}:${action.y ?? 'y'}`
case 'type':
return `${action.action}:${(action.text ?? '').slice(0, 16)}`
case 'press_key':
return `${action.action}:${action.key ?? 'key'}`
case 'scroll':
return `${action.action}:${action.direction ?? 'down'}:${action.amount ?? 500}`
case 'drag':
return `${action.action}:${action.startX}:${action.startY}:${action.endX}:${action.endY}`
case 'wait':
return `${action.action}:${action.time ?? 1}`
case 'end':
return 'end()'
default:
return action.action
}
}
private formatHistory(actions: CladoAction[]): string {
if (actions.length === 0) return 'None'
const parts = actions.map((action) => {
switch (action.action) {
case 'click':
case 'double_click':
case 'right_click':
case 'hover':
return `${action.action}(${Math.round(action.x ?? 500)}, ${Math.round(action.y ?? 500)})`
case 'type': {
const text = (action.text ?? '').replace(/'/g, "\\'")
return `type('${text}')`
}
case 'press_key':
return `press_key('${action.key ?? 'Enter'}')`
case 'scroll':
return `scroll(${action.direction ?? 'down'})`
case 'drag':
return `drag(${Math.round(action.startX ?? 500)},${Math.round(action.startY ?? 500)} -> ${Math.round(action.endX ?? 500)},${Math.round(action.endY ?? 500)})`
case 'wait':
return `wait(${Math.round(action.time ?? 1)}s)`
case 'end':
return 'end()'
default:
return action.action
}
})
return parts.join(' -> ')
}
private buildObservation(params: {
status: ExecutorResult['status']
reason: string
actions: CladoAction[]
url: string
thinkingTrace: string[]
finalAnswer?: string
}): string {
const { status, reason, actions, url, thinkingTrace } = params
const { status, reason, actions, url, thinkingTrace, finalAnswer } = params
const actionSummary =
actions.length === 0
? 'No actions were executed.'
: actions
.slice(-5)
.map(
(action, idx) => `${idx + 1}. ${this.getActionSignature(action)}`,
(action, idx) => `${idx + 1}. ${getCladoActionSignature(action)}`,
)
.join('\n')
const thinkingSummary =
@@ -907,6 +586,7 @@ export class CladoActionExecutor {
`Status: ${status}`,
`Reason: ${reason}`,
`URL: ${url || 'unknown'}`,
finalAnswer ? `Final answer: ${finalAnswer}` : '',
'',
'Recent actions:',
actionSummary,

View File

@@ -0,0 +1,191 @@
import type {
CladoAction,
CladoActionResponse,
RawCladoActionPayload,
} from './types'
/** Parses Clado's structured response plus any raw `<answer>` blocks into executable actions. */
export function parseCladoActions(
prediction: CladoActionResponse,
): CladoAction[] {
const actionFromField =
typeof prediction.action === 'string' ? prediction.action : null
const rawActions = parseCladoActionsFromRawResponse(prediction.raw_response)
const primaryFromRaw = rawActions[0] ?? null
const mergedPrimary = {
...primaryFromRaw,
...prediction,
action: actionFromField ?? primaryFromRaw?.action,
}
const normalized: CladoAction[] = []
const primary = normalizeCladoActionPayload(mergedPrimary)
if (primary) normalized.push(primary)
for (const candidate of rawActions.slice(1)) {
const parsed = normalizeCladoActionPayload(candidate)
if (!parsed) continue
const prev = normalized[normalized.length - 1]
if (
!prev ||
getCladoActionSignature(prev) !== getCladoActionSignature(parsed)
) {
normalized.push(parsed)
}
}
return normalized
}
export function normalizeCladoActionPayload(
payload: RawCladoActionPayload,
): CladoAction | null {
if (!payload.action || typeof payload.action !== 'string') {
return null
}
return {
action: payload.action,
x: typeof payload.x === 'number' ? payload.x : undefined,
y: typeof payload.y === 'number' ? payload.y : undefined,
text: typeof payload.text === 'string' ? payload.text : undefined,
key: typeof payload.key === 'string' ? payload.key : undefined,
direction:
typeof payload.direction === 'string' ? payload.direction : undefined,
startX: typeof payload.startX === 'number' ? payload.startX : undefined,
startY: typeof payload.startY === 'number' ? payload.startY : undefined,
endX: typeof payload.endX === 'number' ? payload.endX : undefined,
endY: typeof payload.endY === 'number' ? payload.endY : undefined,
amount: typeof payload.amount === 'number' ? payload.amount : undefined,
time: typeof payload.time === 'number' ? payload.time : undefined,
final_answer:
typeof payload.final_answer === 'string'
? payload.final_answer
: undefined,
}
}
export function parseCladoActionsFromRawResponse(
rawResponse: string | undefined,
): RawCladoActionPayload[] {
if (!rawResponse) return []
const matches = [
...rawResponse.matchAll(/<answer>\s*([\s\S]*?)\s*<\/answer>/gi),
]
const parsed: RawCladoActionPayload[] = []
for (const match of matches) {
try {
parsed.push(JSON.parse(match[1]) as RawCladoActionPayload)
} catch {
// Ignore malformed answer blocks so one bad block does not drop the whole prediction.
}
}
return parsed
}
export function extractCladoThinking(
rawResponse: string | undefined,
): string | undefined {
if (!rawResponse) return undefined
const matches = [
...rawResponse.matchAll(/<thinking>\s*([\s\S]*?)\s*<\/thinking>/gi),
]
if (matches.length === 0) return undefined
const merged = matches
.map((match) => match[1]?.replace(/\s+/g, ' ').trim() ?? '')
.filter((value) => value.length > 0)
.join(' ')
if (!merged) return undefined
return merged
}
export function summarizeCladoPrediction(
prediction: CladoActionResponse,
): Record<string, unknown> {
const preview =
typeof prediction.raw_response === 'string' &&
prediction.raw_response.length > 0
? prediction.raw_response.slice(0, 240)
: undefined
return {
action: prediction.action,
x: prediction.x,
y: prediction.y,
text: prediction.text,
key: prediction.key,
direction: prediction.direction,
startX: prediction.startX,
startY: prediction.startY,
endX: prediction.endX,
endY: prediction.endY,
amount: prediction.amount,
time: prediction.time,
inference_time_seconds: prediction.inference_time_seconds,
raw_response_preview: preview,
}
}
export function getCladoActionSignature(action: CladoAction): string {
switch (action.action) {
case 'click':
case 'double_click':
case 'right_click':
case 'hover':
return `${action.action}:${action.x ?? 'x'}:${action.y ?? 'y'}`
case 'type':
return `${action.action}:${(action.text ?? '').slice(0, 16)}`
case 'press_key':
return `${action.action}:${action.key ?? 'key'}`
case 'scroll':
return `${action.action}:${action.direction ?? 'down'}:${action.amount ?? 500}`
case 'drag':
return `${action.action}:${action.startX}:${action.startY}:${action.endX}:${action.endY}`
case 'wait':
return `${action.action}:${action.time ?? 1}`
case 'end':
return action.final_answer
? `end(${action.final_answer.slice(0, 32)})`
: 'end()'
case 'invalid':
return `invalid(${(action.text ?? '').slice(0, 40)})`
default:
return action.action
}
}
export function formatCladoHistory(actions: CladoAction[]): string {
if (actions.length === 0) return 'None'
const parts = actions.map((action) => {
switch (action.action) {
case 'click':
case 'double_click':
case 'right_click':
case 'hover':
return `${action.action}(${Math.round(action.x ?? 500)}, ${Math.round(action.y ?? 500)})`
case 'type': {
const text = (action.text ?? '').replace(/'/g, "\\'")
return `type('${text}')`
}
case 'press_key':
return `press_key('${action.key ?? 'Enter'}')`
case 'scroll':
return `scroll(${action.direction ?? 'down'})`
case 'drag':
return `drag(${Math.round(action.startX ?? 500)},${Math.round(action.startY ?? 500)} -> ${Math.round(action.endX ?? 500)},${Math.round(action.endY ?? 500)})`
case 'wait':
return `wait(${Math.round(action.time ?? 1)}s)`
case 'end':
return 'end()'
case 'invalid':
return 'invalid()'
default:
return action.action
}
})
return parts.join(' -> ')
}
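The consecutive-duplicate filtering inside `parseCladoActions` can be restated in isolation. This is a sketch with a hypothetical `MiniAction` shape, mirroring the signature comparison above, not code from the repository:

```typescript
// Minimal sketch: consecutive actions with identical signatures collapse,
// mirroring how parseCladoActions filters repeated raw <answer> blocks.
interface MiniAction {
  action: string
  x?: number
  y?: number
}

function miniSignature(a: MiniAction): string {
  return `${a.action}:${a.x ?? 'x'}:${a.y ?? 'y'}`
}

function dedupeConsecutive(actions: MiniAction[]): MiniAction[] {
  const out: MiniAction[] = []
  for (const candidate of actions) {
    const prev = out[out.length - 1]
    if (!prev || miniSignature(prev) !== miniSignature(candidate)) {
      out.push(candidate)
    }
  }
  return out
}

const deduped = dedupeConsecutive([
  { action: 'click', x: 10, y: 20 },
  { action: 'click', x: 10, y: 20 }, // dropped: same signature as previous
  { action: 'click', x: 30, y: 40 },
])
// deduped keeps the first and third actions
```

Note that only adjacent duplicates collapse: a click repeated after an intervening different action is preserved, which matches the `prev` comparison in the real parser.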


@@ -0,0 +1,123 @@
import {
CLADO_PAGE_SCOPED_TOOLS,
type CladoActionPoint,
type CladoViewport,
} from './types'
export function clampCladoNormalizedCoordinate(value: number): number {
return Math.min(999, Math.max(0, Math.round(value)))
}
/** Converts Clado's 0-1000 normalized coordinate space into BrowserOS viewport pixels. */
export function resolveCladoPoint(
viewport: CladoViewport,
normalizedX: number | undefined,
normalizedY: number | undefined,
): CladoActionPoint {
const nx = clampCladoNormalizedCoordinate(normalizedX ?? 500)
const ny = clampCladoNormalizedCoordinate(normalizedY ?? 500)
return {
x: Math.round((nx / 1000) * viewport.width),
y: Math.round((ny / 1000) * viewport.height),
}
}
/** Adapts Clado action tool arguments to the BrowserOS MCP tool argument contract. */
export function prepareCladoToolArgs(
toolName: string,
args: Record<string, unknown>,
pageId: number,
): Record<string, unknown> {
const prepared: Record<string, unknown> = { ...args }
if (
toolName === 'evaluate_script' &&
typeof prepared.function === 'string' &&
prepared.expression === undefined
) {
prepared.expression = toCladoEvaluateExpression(prepared.function)
delete prepared.function
}
if (
toolName === 'click_at' &&
typeof prepared.dblClick === 'boolean' &&
prepared.clickCount === undefined
) {
prepared.clickCount = prepared.dblClick ? 2 : 1
delete prepared.dblClick
}
if (
CLADO_PAGE_SCOPED_TOOLS.has(toolName) &&
typeof prepared.page !== 'number'
) {
prepared.page = pageId
}
return prepared
}
export function toCladoEvaluateExpression(rawFunction: unknown): string {
const source = String(rawFunction).trim()
if (source.startsWith('() =>') || source.startsWith('async () =>')) {
return `(${source})()`
}
if (source.startsWith('function')) {
return `(${source})()`
}
return source
}
export function normalizeCladoPressKey(key: string | undefined): string {
const raw = (key ?? '').trim()
if (!raw) throw new Error('press_key action missing key field')
const map: Record<string, string> = {
'C-a': 'Control+A',
'C-c': 'Control+C',
'C-v': 'Control+V',
'C-x': 'Control+X',
'C-z': 'Control+Z',
'C-y': 'Control+Y',
'C-s': 'Control+S',
'C-t': 'Control+T',
'C-w': 'Control+W',
'C-h': 'Control+H',
'C-f': 'Control+F',
'C-+': 'Control++',
'C--': 'Control+-',
'C-tab': 'Control+Tab',
'C-S-tab': 'Control+Shift+Tab',
'C-S-n': 'Control+Shift+N',
'C-down': 'Control+ArrowDown',
'M-a': 'Meta+A',
'M-c': 'Meta+C',
'M-v': 'Meta+V',
'M-x': 'Meta+X',
'M-f4': 'Alt+F4',
}
return map[raw] ?? raw
}
export function normalizeCladoDirection(
direction: string | undefined,
): 'up' | 'down' | 'left' | 'right' {
if (
direction === 'up' ||
direction === 'down' ||
direction === 'left' ||
direction === 'right'
) {
return direction
}
return 'down'
}
export function normalizeCladoScrollAmount(amount: number | undefined): number {
if (typeof amount !== 'number') return 500
if (amount <= 0) return 100
const clamped = Math.min(amount, 1000)
return Math.max(100, Math.round((clamped / 1000) * 900))
}
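As a sanity check on the mapping above, here is the 0-1000 normalized-to-pixel conversion restated standalone (the viewport size is an arbitrary example):

```typescript
// Standalone restatement of resolveCladoPoint: clamp into 0..999, then scale
// the normalized coordinate onto the viewport.
const clampNorm = (v: number): number =>
  Math.min(999, Math.max(0, Math.round(v)))

function toViewportPixels(
  viewport: { width: number; height: number },
  nx = 500,
  ny = 500,
): { x: number; y: number } {
  const cx = clampNorm(nx)
  const cy = clampNorm(ny)
  return {
    x: Math.round((cx / 1000) * viewport.width),
    y: Math.round((cy / 1000) * viewport.height),
  }
}

// Defaults (normalized 500, 500) land at the viewport center:
const center = toViewportPixels({ width: 1280, height: 800 })
// center is { x: 640, y: 400 }
```

Out-of-range inputs are clamped first, so a model emitting, say, 2000 or a negative coordinate still resolves to a point inside the viewport.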


@@ -0,0 +1,68 @@
import { CLADO_REQUEST_TIMEOUT_MS } from '../../../../constants'
import { formatCladoHistory } from './clado-actions'
import type { CladoAction, CladoActionResponse } from './types'
export interface CladoActionClientOptions {
baseUrl?: string
apiKey?: string
}
export interface CladoActionPredictionInput {
instruction: string
imageBase64: string
actionHistory: CladoAction[]
signal?: AbortSignal
}
/** Calls the Clado action model without exposing credentials in process arguments or artifacts. */
export class CladoActionClient {
constructor(private readonly options: CladoActionClientOptions) {}
async requestActionPrediction(
input: CladoActionPredictionInput,
): Promise<CladoActionResponse> {
if (!this.options.baseUrl) {
throw new Error('executor.baseUrl must be set for clado-action provider')
}
const requestController = new AbortController()
const onAbort = () => requestController.abort()
input.signal?.addEventListener('abort', onAbort, { once: true })
const timeoutHandle = setTimeout(() => {
requestController.abort()
}, CLADO_REQUEST_TIMEOUT_MS)
try {
const headers: Record<string, string> = {
'Content-Type': 'application/json',
}
if (this.options.apiKey) {
headers.Authorization = `Bearer ${this.options.apiKey}`
}
const response = await fetch(this.options.baseUrl, {
method: 'POST',
headers,
body: JSON.stringify({
instruction: input.instruction,
image_base64: input.imageBase64,
history: formatCladoHistory(input.actionHistory),
}),
signal: requestController.signal,
})
if (!response.ok) {
const body = await response.text()
throw new Error(
`HTTP ${response.status} ${response.statusText}: ${body.slice(0, 400)}`,
)
}
return (await response.json()) as CladoActionResponse
} finally {
clearTimeout(timeoutHandle)
input.signal?.removeEventListener('abort', onAbort)
}
}
}
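The timeout-plus-external-abort wiring in `requestActionPrediction` generalizes to a small helper. The following is a sketch of the same pattern (the `withTimeout` name and shape are illustrative, not an API from the codebase):

```typescript
// One AbortController fires on either an external signal or a timeout;
// both hooks are cleaned up in finally, as in the client above.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
  external?: AbortSignal,
): Promise<T> {
  const controller = new AbortController()
  const onAbort = () => controller.abort()
  external?.addEventListener('abort', onAbort, { once: true })
  const handle = setTimeout(() => controller.abort(), timeoutMs)
  try {
    return await run(controller.signal)
  } finally {
    clearTimeout(handle)
    external?.removeEventListener('abort', onAbort)
  }
}
```

The `finally` cleanup matters in both directions: clearing the timer prevents a stray abort after success, and removing the listener avoids leaking a reference on long-lived external signals.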


@@ -0,0 +1,56 @@
import type { ResolvedAgentConfig } from '@browseros/server/agent/types'
import type {
DelegationResult,
ExecutorBackend,
ExecutorCallbacks,
} from '../../executor-backend'
import { CladoActionExecutor } from './clado-action-executor'
export interface CladoExecutorBackendOptions {
configTemplate: ResolvedAgentConfig
serverUrl: string
initialPageId?: number
callbacks?: ExecutorCallbacks
}
/** Executes delegated goals through the Clado visual action model. */
export class CladoExecutorBackend implements ExecutorBackend {
readonly kind = 'clado'
private executor: CladoActionExecutor | null = null
constructor(private readonly options: CladoExecutorBackendOptions) {}
async execute(
instruction: string,
signal?: AbortSignal,
): Promise<DelegationResult> {
const executor = this.getExecutor()
const result = await executor.execute(instruction, signal)
return result
}
async close(): Promise<void> {
await this.executor?.close()
}
getTotalSteps(): number {
return this.executor?.getTotalSteps() ?? 0
}
private getExecutor(): CladoActionExecutor {
if (this.executor) return this.executor
this.executor = new CladoActionExecutor(
{
provider: this.options.configTemplate.provider,
model: this.options.configTemplate.model,
apiKey: this.options.configTemplate.apiKey ?? '',
baseUrl: this.options.configTemplate.baseUrl,
},
this.options.serverUrl,
this.options.initialPageId,
)
this.executor.setCallbacks(this.options.callbacks ?? {})
return this.executor
}
}


@@ -0,0 +1,78 @@
export const CLADO_ACTION_PROVIDER = 'clado-action'
export const CLADO_PAGE_SCOPED_TOOLS = new Set<string>([
'take_screenshot',
'evaluate_script',
'click',
'click_at',
'hover',
'hover_at',
'clear',
'fill',
'press_key',
'type_at',
'drag',
'drag_at',
'scroll',
'handle_dialog',
'select_option',
'navigate_page',
'close_page',
'wait_for',
])
export interface CladoActionResponse {
action?: string | null
x?: number
y?: number
text?: string
key?: string
direction?: string
startX?: number
startY?: number
endX?: number
endY?: number
amount?: number
time?: number
final_answer?: string | null
inference_time_seconds?: number
raw_response?: string
thinking?: string | null
parse_error?: string | null
}
export interface CladoViewport {
width: number
height: number
}
export interface CladoAction {
action: string
x?: number
y?: number
text?: string
key?: string
direction?: string
startX?: number
startY?: number
endX?: number
endY?: number
amount?: number
time?: number
final_answer?: string
}
export type RawCladoActionPayload = Partial<
Omit<CladoAction, 'final_answer'>
> & {
final_answer?: string | null
}
export interface CladoActionPoint {
x: number
y: number
}
export function isCladoActionProvider(provider: string): boolean {
return provider === CLADO_ACTION_PROVIDER
}


@@ -0,0 +1,60 @@
import type { ResolvedAgentConfig } from '@browseros/server/agent/types'
import type { Browser } from '@browseros/server/browser'
import type {
ExecutorBackend,
ExecutorBackendKind,
ExecutorCallbacks,
} from '../executor-backend'
import { CladoExecutorBackend } from './clado/clado-executor-backend'
import { isCladoActionProvider } from './clado/types'
import { ToolLoopExecutorBackend } from './tool-loop/tool-loop-executor-backend'
export interface CreateExecutorBackendOptions {
backendKind?: ExecutorBackendKind
provider?: string
configTemplate?: ResolvedAgentConfig
browser?: Browser | null
serverUrl?: string
windowId?: number
tabId?: number
initialPageId?: number
callbacks?: ExecutorCallbacks
executor?: ExecutorBackend
}
export function backendKindForProvider(provider: string): ExecutorBackendKind {
return isCladoActionProvider(provider) ? 'clado' : 'tool-loop'
}
/** Creates the backend used for one orchestrator delegation. */
export function createExecutorBackend(
options: CreateExecutorBackendOptions,
): ExecutorBackend {
if (options.executor) return options.executor
const kind =
options.backendKind ??
backendKindForProvider(
options.provider ?? options.configTemplate?.provider ?? '',
)
if (kind === 'clado') {
return new CladoExecutorBackend({
configTemplate: required(options.configTemplate, 'configTemplate'),
serverUrl: required(options.serverUrl, 'serverUrl'),
initialPageId: options.initialPageId,
callbacks: options.callbacks,
})
}
return new ToolLoopExecutorBackend({
configTemplate: required(options.configTemplate, 'configTemplate'),
browser: options.browser ?? null,
callbacks: options.callbacks,
})
}
function required<T>(value: T | undefined, name: string): T {
if (value === undefined) throw new Error(`${name} is required`)
return value
}
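The routing decision reduces to a one-line predicate. Sketched standalone, with the provider constant copied from `types.ts` above:

```typescript
// Standalone restatement of backendKindForProvider: the 'clado-action'
// provider string selects the visual-action backend; anything else falls
// back to the default tool loop.
const CLADO_ACTION_PROVIDER = 'clado-action'
type BackendKind = 'tool-loop' | 'clado'

function kindForProvider(provider: string): BackendKind {
  return provider === CLADO_ACTION_PROVIDER ? 'clado' : 'tool-loop'
}
```

Unknown or empty provider strings fall through to `'tool-loop'`, which matches the `options.provider ?? options.configTemplate?.provider ?? ''` default in the factory.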


@@ -0,0 +1,144 @@
import { randomUUID } from 'node:crypto'
import { AiSdkAgent } from '@browseros/server/agent/tool-loop'
import type { ResolvedAgentConfig } from '@browseros/server/agent/types'
import type { Browser } from '@browseros/server/browser'
import { registry } from '@browseros/server/tools/registry'
import type { BrowserContext } from '@browseros/shared/schemas/browser-context'
import type {
DelegationResult,
ExecutorBackend,
ExecutorCallbacks,
} from '../../executor-backend'
import { TOOL_LOOP_EXECUTOR_SYSTEM_PROMPT } from './tool-loop-executor-prompt'
export interface ToolLoopExecutorBackendOptions {
configTemplate: ResolvedAgentConfig
browser: Browser | null
callbacks?: ExecutorCallbacks
}
/** Executes delegated goals through the BrowserOS ToolLoopAgent. */
export class ToolLoopExecutorBackend implements ExecutorBackend {
readonly kind = 'tool-loop'
private stepsUsed = 0
private currentUrl = ''
constructor(private readonly options: ToolLoopExecutorBackendOptions) {}
async execute(
instruction: string,
signal?: AbortSignal,
): Promise<DelegationResult> {
const browser = this.options.browser
if (!browser) {
throw new Error('Browser instance is required for tool-loop executor')
}
const stepsAtStart = this.stepsUsed
const toolsUsed: string[] = []
let status: DelegationResult['status'] = 'done'
let resultText = ''
const conversationId = randomUUID()
const agentConfig: ResolvedAgentConfig = {
...this.options.configTemplate,
conversationId,
userSystemPrompt: TOOL_LOOP_EXECUTOR_SYSTEM_PROMPT,
evalMode: true,
workingDir: `/tmp/browseros-eval-executor-${conversationId}`,
}
const browserContext = await this.browserContext(browser)
let agent: AiSdkAgent | null = null
try {
agent = await AiSdkAgent.create({
resolvedConfig: agentConfig,
browser,
registry,
browserContext,
})
await agent.toolLoopAgent.generate({
prompt: instruction,
abortSignal: signal,
experimental_onToolCallStart: ({ toolCall }) => {
const input = toolCall.input as Record<string, unknown> | undefined
if (input && typeof input.url === 'string' && input.url.length > 0) {
this.currentUrl = input.url
}
this.options.callbacks?.onToolCallStart?.({
toolCallId: toolCall.toolCallId,
toolName: toolCall.toolName,
input: toolCall.input,
})
},
experimental_onToolCallFinish: async () => {
this.stepsUsed++
await this.options.callbacks?.onToolCallFinish?.()
},
onStepFinish: async ({ toolCalls, toolResults, text }) => {
if (toolCalls) {
for (const toolCall of toolCalls) {
if (!toolsUsed.includes(toolCall.toolName)) {
toolsUsed.push(toolCall.toolName)
}
}
}
if (text) resultText = text
await this.options.callbacks?.onStepFinish?.({
toolCalls,
toolResults,
text,
})
},
})
} catch {
status = signal?.aborted ? 'timeout' : 'blocked'
} finally {
if (agent) await agent.dispose().catch(() => {})
}
if (status === 'done' && signal?.aborted) {
status = 'timeout'
}
return {
observation: resultText || 'Execution completed with no actions taken.',
status,
url: this.currentUrl,
actionsPerformed: this.stepsUsed - stepsAtStart,
toolsUsed,
}
}
async close(): Promise<void> {
// No persistent resources; AiSdkAgent is disposed at the end of each execute() call.
}
getTotalSteps(): number {
return this.stepsUsed
}
private async browserContext(
browser: Browser,
): Promise<BrowserContext | undefined> {
const pages = await browser.listPages()
const activePage = pages[0]
if (!activePage) return undefined
return {
activeTab: {
id: activePage.tabId,
pageId: activePage.pageId,
url: activePage.url,
title: activePage.title,
},
}
}
}
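The status resolution at the end of `execute()` collapses to a small pure function. This restates the catch/abort logic above in isolation:

```typescript
// An error during the run maps to 'timeout' when the caller aborted and
// 'blocked' otherwise; a run that finished cleanly but raced an abort is
// still reported as 'timeout'.
type RunStatus = 'done' | 'blocked' | 'timeout'

function resolveStatus(threwError: boolean, aborted: boolean): RunStatus {
  if (threwError) return aborted ? 'timeout' : 'blocked'
  return aborted ? 'timeout' : 'done'
}
```

Treating a clean-but-aborted run as `'timeout'` keeps the orchestrator's bookkeeping consistent: the caller's deadline fired, so the observation should not be trusted as a completed delegation.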


@@ -0,0 +1,21 @@
export const TOOL_LOOP_EXECUTOR_SYSTEM_PROMPT = `You are a browser executor. You receive a single goal-level instruction and execute it using browser tools.
## Your Job
1. Execute browser actions to achieve the given goal
2. Stop as soon as the goal is accomplished -- do NOT perform extra actions
3. Write a final observation describing the result
## Final Response Format
When done, your response MUST include:
- What you accomplished (or what went wrong)
- What the page currently shows: key headings, links, data, or content visible
- The current URL from the address bar
- If you got stuck, what is blocking progress
## Rules
- Only do what was asked. Do not navigate away, open extra tabs, or reorganize the browser.
- If the goal is to navigate somewhere, confirm you arrived by describing what you see.
- If the goal is to click something, confirm the result of the click.
- If you cannot find what was asked for, say so clearly -- do not guess or improvise.
- Prefer browser_navigate over browser_open_tab for going to URLs.
- Do NOT call browser_group_tabs or other organizational tools.`


@@ -0,0 +1,33 @@
import type { ExecutorResult } from '../orchestrator-executor/types'
export type ExecutorBackendKind = 'tool-loop' | 'clado'
export type DelegationResult = ExecutorResult
export interface ToolCallInfo {
toolCallId: string
toolName: string
input: unknown
}
export interface ToolResultInfo {
toolCallId: string
toolName: string
output: unknown
}
export interface ExecutorCallbacks {
onToolCallStart?: (toolCall: ToolCallInfo) => void
onToolCallFinish?: () => Promise<void>
onStepFinish?: (step: {
toolCalls?: ReadonlyArray<ToolCallInfo>
toolResults?: ReadonlyArray<ToolResultInfo>
text?: string
}) => Promise<void>
}
export interface ExecutorBackend {
readonly kind: ExecutorBackendKind
execute(instruction: string, signal?: AbortSignal): Promise<DelegationResult>
close(): Promise<void>
getTotalSteps(): number
}
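A minimal stub makes the `ExecutorBackend` contract concrete. This hypothetical no-op backend mirrors the interface above; the result fields are assumed from the `ExecutorResult` shape used elsewhere in this PR:

```typescript
// Hypothetical no-op backend satisfying the ExecutorBackend surface:
// kind, execute, close, getTotalSteps.
interface StubResult {
  observation: string
  status: 'done' | 'blocked' | 'timeout'
  url: string
  actionsPerformed: number
  toolsUsed: string[]
}

class NoopExecutorBackend {
  readonly kind = 'tool-loop' as const
  private steps = 0

  async execute(instruction: string): Promise<StubResult> {
    this.steps++
    return {
      observation: `No-op execution of: ${instruction}`,
      status: 'done',
      url: '',
      actionsPerformed: 0,
      toolsUsed: [],
    }
  }

  async close(): Promise<void> {}

  getTotalSteps(): number {
    return this.steps
  }
}
```

A stub like this is also what a test harness would inject through the `executor` option on `CreateExecutorBackendOptions`, which short-circuits the factory entirely.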


@@ -1,243 +0,0 @@
/**
* Executor - Wraps AiSdkAgent for page-level browser actions (direct CDP)
*
* The executor:
* - Receives goal-level instructions from orchestrator
* - Executes browser actions until the goal is accomplished
* - Returns observation to orchestrator (not full history)
*/
import { randomUUID } from 'node:crypto'
import { AiSdkAgent } from '@browseros/server/agent/tool-loop'
import type { ResolvedAgentConfig } from '@browseros/server/agent/types'
import type { Browser } from '@browseros/server/browser'
import { registry } from '@browseros/server/tools/registry'
import type { BrowserContext } from '@browseros/shared/schemas/browser-context'
import { CladoActionExecutor } from './clado-action-executor'
import type { ExecutorResult } from './types'
const EXECUTOR_SYSTEM_PROMPT = `You are a browser executor. You receive a single goal-level instruction and execute it using browser tools.
## Your Job
1. Execute browser actions to achieve the given goal
2. Stop as soon as the goal is accomplished — do NOT perform extra actions
3. Write a final observation describing the result
## Final Response Format
When done, your response MUST include:
- What you accomplished (or what went wrong)
- What the page currently shows: key headings, links, data, or content visible
- The current URL from the address bar
- If you got stuck, what is blocking progress
## Rules
- Only do what was asked. Do not navigate away, open extra tabs, or reorganize the browser.
- If the goal is to navigate somewhere, confirm you arrived by describing what you see.
- If the goal is to click something, confirm the result of the click.
- If you cannot find what was asked for, say so clearly — do not guess or improvise.
- Prefer browser_navigate over browser_open_tab for going to URLs.
- Do NOT call browser_group_tabs or other organizational tools.`
export interface ToolCallInfo {
toolCallId: string
toolName: string
input: unknown
}
export interface ToolResultInfo {
toolCallId: string
toolName: string
output: unknown
}
export interface ExecutorCallbacks {
onToolCallStart?: (toolCall: ToolCallInfo) => void
onToolCallFinish?: () => Promise<void>
onStepFinish?: (step: {
toolCalls?: ReadonlyArray<ToolCallInfo>
toolResults?: ReadonlyArray<ToolResultInfo>
text?: string
}) => Promise<void>
}
export class Executor {
private cladoExecutor: CladoActionExecutor | null = null
private stepsUsed = 0
private currentUrl = ''
private configTemplate: ResolvedAgentConfig
private isCladoAction: boolean
private browser: Browser | null
private serverUrl: string
private windowId?: number
private tabId?: number
private initialPageId?: number
private callbacks: ExecutorCallbacks
constructor(
configTemplate: ResolvedAgentConfig,
browser: Browser | null,
serverUrl: string,
options?: {
isCladoAction?: boolean
windowId?: number
tabId?: number
initialPageId?: number
callbacks?: ExecutorCallbacks
},
) {
this.configTemplate = configTemplate
this.isCladoAction = options?.isCladoAction ?? false
this.browser = browser
this.serverUrl = serverUrl
this.windowId = options?.windowId
this.tabId = options?.tabId
this.initialPageId = options?.initialPageId
this.callbacks = options?.callbacks ?? {}
}
async execute(
instruction: string,
signal?: AbortSignal,
): Promise<ExecutorResult> {
if (this.isCladoAction) {
if (!this.cladoExecutor) {
this.cladoExecutor = new CladoActionExecutor(
{
provider: this.configTemplate.provider,
model: this.configTemplate.model,
apiKey: this.configTemplate.apiKey ?? '',
baseUrl: this.configTemplate.baseUrl,
},
this.serverUrl,
this.windowId,
this.tabId,
this.initialPageId,
)
this.cladoExecutor.setCallbacks(this.callbacks)
}
const result = await this.cladoExecutor.execute(instruction, signal)
this.stepsUsed = this.cladoExecutor.getTotalSteps()
this.currentUrl = result.url || this.currentUrl
return result
}
if (!this.browser) {
throw new Error('Browser instance is required for standard executor path')
}
const stepsAtStart = this.stepsUsed
const toolsUsed: string[] = []
let status: 'done' | 'blocked' | 'timeout' = 'done'
let resultText = ''
const conversationId = randomUUID()
const agentConfig: ResolvedAgentConfig = {
...this.configTemplate,
conversationId,
userSystemPrompt: EXECUTOR_SYSTEM_PROMPT,
evalMode: true,
workingDir: `/tmp/browseros-eval-executor-${conversationId}`,
}
// Build browser context so executor agent knows the correct page ID
let browserContext: BrowserContext | undefined
if (this.browser) {
const pages = await this.browser.listPages()
const activePage = pages[0]
if (activePage) {
browserContext = {
activeTab: {
id: activePage.tabId,
pageId: activePage.pageId,
url: activePage.url,
title: activePage.title,
},
}
}
}
let agent: AiSdkAgent | null = null
try {
agent = await AiSdkAgent.create({
resolvedConfig: agentConfig,
browser: this.browser,
registry,
browserContext,
})
await agent.toolLoopAgent.generate({
prompt: instruction,
abortSignal: signal,
experimental_onToolCallStart: ({ toolCall }) => {
const input = toolCall.input as Record<string, unknown> | undefined
if (input && typeof input.url === 'string' && input.url.length > 0) {
this.currentUrl = input.url
}
this.callbacks.onToolCallStart?.({
toolCallId: toolCall.toolCallId,
toolName: toolCall.toolName,
input: toolCall.input,
})
},
experimental_onToolCallFinish: async () => {
this.stepsUsed++
await this.callbacks.onToolCallFinish?.()
},
onStepFinish: async ({ toolCalls, toolResults, text }) => {
if (toolCalls) {
for (const tc of toolCalls) {
if (!toolsUsed.includes(tc.toolName)) {
toolsUsed.push(tc.toolName)
}
}
}
if (text) {
resultText = text
}
await this.callbacks.onStepFinish?.({ toolCalls, toolResults, text })
},
})
} catch {
if (signal?.aborted) {
status = 'timeout'
} else {
status = 'blocked'
}
} finally {
if (agent) await agent.dispose().catch(() => {})
}
if (status === 'done' && signal?.aborted) {
status = 'timeout'
}
const observation =
resultText || 'Execution completed with no actions taken.'
return {
observation,
status,
url: this.currentUrl,
actionsPerformed: this.stepsUsed - stepsAtStart,
toolsUsed,
}
}
async close(): Promise<void> {
await this.cladoExecutor?.close()
}
getTotalSteps(): number {
if (this.isCladoAction) {
return this.cladoExecutor?.getTotalSteps() ?? 0
}
return this.stepsUsed
}
}


@@ -24,15 +24,16 @@ import {
resolveProviderConfig,
} from '../../utils/resolve-provider-config'
import { withEvalTimeout } from '../../utils/with-eval-timeout'
import { isCladoActionProvider } from '../orchestrated/backends/clado/types'
import { createExecutorBackend } from '../orchestrated/backends/create-executor-backend'
import type { ExecutorCallbacks } from '../orchestrated/executor-backend'
import type { AgentContext, AgentEvaluator, AgentResult } from '../types'
import { Executor, type ExecutorCallbacks } from './executor'
import { OrchestratorAgent } from './orchestrator-agent'
import type { ExecutorFactory, ExecutorResult } from './types'
interface ResolvedConfigs {
orchestratorConfig: ResolvedAgentConfig & { maxTurns?: number }
executorConfig: ResolvedAgentConfig
isCladoAction: boolean
}
function toResolvedAgentConfig(
@@ -67,7 +68,10 @@ async function resolveAgentConfig(
if (!executorModel) {
throw new Error('executor.model is required in config')
}
if (config.executor.provider === 'clado-action' && !config.executor.baseUrl) {
if (
isCladoActionProvider(config.executor.provider) &&
!config.executor.baseUrl
) {
throw new Error(
'executor.baseUrl is required in config for clado-action provider',
)
@@ -75,10 +79,8 @@ async function resolveAgentConfig(
const resolvedOrchestrator = await resolveProviderConfig(config.orchestrator)
const isCladoAction = config.executor.provider === 'clado-action'
let executorConfig: ResolvedAgentConfig
if (isCladoAction) {
if (isCladoActionProvider(config.executor.provider)) {
executorConfig = {
conversationId: crypto.randomUUID(),
provider: config.executor.provider as ResolvedAgentConfig['provider'],
@@ -107,7 +109,7 @@ async function resolveAgentConfig(
maxTurns: config.orchestrator.maxTurns,
}
return { orchestratorConfig, executorConfig, isCladoAction }
return { orchestratorConfig, executorConfig }
}
export class OrchestratorExecutorEvaluator implements AgentEvaluator {
@@ -127,7 +129,7 @@ export class OrchestratorExecutorEvaluator implements AgentEvaluator {
}
const agentConfig = config.agent as OrchestratorExecutorConfig
const { orchestratorConfig, executorConfig, isCladoAction } =
const { orchestratorConfig, executorConfig } =
await resolveAgentConfig(agentConfig)
// Connect to Chrome via CDP — same per-worker offset used by app-manager.
@@ -235,12 +237,12 @@ export class OrchestratorExecutorEvaluator implements AgentEvaluator {
await capture.messageLogger.logStreamEvent(delegateInputEvent)
capture.emitEvent(task.query_id, delegateInputEvent)
const executor = new Executor(
executorConfig,
const executor = createExecutorBackend({
configTemplate: executorConfig,
browser,
config.browseros.server_url,
{ isCladoAction, callbacks },
)
serverUrl: config.browseros.server_url,
callbacks,
})
let result: ExecutorResult
try {
result = await executor.execute(instruction, signal)
@@ -329,6 +331,5 @@ export class OrchestratorExecutorEvaluator implements AgentEvaluator {
}
}
export { Executor } from './executor'
export { OrchestratorAgent } from './orchestrator-agent'
export * from './types'


@@ -57,6 +57,20 @@ export class TrajectorySaver {
)
}
async saveAttempt(attempt: Record<string, unknown>): Promise<void> {
await writeFile(
join(this.outputDir, 'attempt.json'),
JSON.stringify(attempt, null, 2),
)
}
async saveGrades(graderResults: Record<string, GraderResult>): Promise<void> {
await writeFile(
join(this.outputDir, 'grades.json'),
JSON.stringify(graderResults, null, 2),
)
}
async loadMetadata(): Promise<TaskMetadata> {
const content = await readFile(
join(this.outputDir, 'metadata.json'),
@@ -70,6 +84,7 @@ export class TrajectorySaver {
): Promise<void> {
const metadata = await this.loadMetadata()
metadata.grader_results = graderResults
await this.saveGrades(graderResults)
await this.saveMetadata(metadata)
}
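The persistence pattern in `saveGrades` is plain `JSON.stringify` with 2-space indentation written next to the run's other artifacts. A self-contained sketch of that pattern (the temp directory and `saveGradesTo` name are illustrative):

```typescript
import { mkdtemp, readFile, writeFile } from 'node:fs/promises'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

// Sketch of the saveGrades pattern above: serialize grader results as
// pretty-printed JSON into the task's output directory.
async function saveGradesTo(
  outputDir: string,
  graderResults: Record<string, unknown>,
): Promise<string> {
  const path = join(outputDir, 'grades.json')
  await writeFile(path, JSON.stringify(graderResults, null, 2))
  return path
}

const dir = await mkdtemp(join(tmpdir(), 'eval-run-'))
const path = await saveGradesTo(dir, { accuracy: { passed: true } })
const loaded = JSON.parse(await readFile(path, 'utf-8'))
// loaded.accuracy.passed is true after the round trip
```

Writing `grades.json` separately, as the diff does alongside the `metadata.json` update, lets the standalone `grade` command re-emit results without rewriting the whole metadata file.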


@@ -0,0 +1,170 @@
import { parseArgs } from 'node:util'
export type PublishTarget = 'r2'
export interface LegacyCliArgs {
command: 'legacy'
configPath?: string
help?: boolean
}
export interface SuiteCliArgs {
command: 'suite'
configPath?: string
suitePath?: string
variantId?: string
provider?: string
model?: string
apiKey?: string
baseUrl?: string
publishTarget?: PublishTarget
}
export interface RunCliArgs
extends Omit<SuiteCliArgs, 'command' | 'publishTarget'> {
command: 'run'
}
export interface GradeCliArgs {
command: 'grade'
runDir: string
}
export interface PublishCliArgs {
command: 'publish'
runDir: string
target: PublishTarget
}
export type EvalCliArgs =
| LegacyCliArgs
| SuiteCliArgs
| RunCliArgs
| GradeCliArgs
| PublishCliArgs
const COMMANDS = new Set(['suite', 'run', 'grade', 'publish'])
function stringValue(value: string | boolean | undefined): string | undefined {
return typeof value === 'string' && value.length > 0 ? value : undefined
}
function publishTarget(value: string | undefined): PublishTarget | undefined {
if (value === undefined) return undefined
if (value === 'r2') return 'r2'
throw new Error(`Unsupported publish target: ${value}`)
}
function requireOne(
command: string,
configPath: string | undefined,
suitePath: string | undefined,
): void {
if (!configPath && !suitePath) {
throw new Error(`${command} requires --config or --suite`)
}
if (configPath && suitePath) {
throw new Error(`${command} accepts either --config or --suite, not both`)
}
}
function parseSuiteLikeArgs(
command: 'suite' | 'run',
argv: string[],
): SuiteCliArgs | RunCliArgs {
const { values } = parseArgs({
args: argv,
options: {
config: { type: 'string' },
suite: { type: 'string' },
variant: { type: 'string' },
provider: { type: 'string' },
model: { type: 'string' },
'api-key': { type: 'string' },
'base-url': { type: 'string' },
publish: { type: 'string' },
},
})
const configPath = stringValue(values.config)
const suitePath = stringValue(values.suite)
requireOne(command, configPath, suitePath)
const parsed: SuiteCliArgs | RunCliArgs =
command === 'suite' ? { command: 'suite' } : { command: 'run' }
if (configPath) parsed.configPath = configPath
if (suitePath) parsed.suitePath = suitePath
const variantId = stringValue(values.variant)
if (variantId) parsed.variantId = variantId
const provider = stringValue(values.provider)
if (provider) parsed.provider = provider
const model = stringValue(values.model)
if (model) parsed.model = model
const apiKey = stringValue(values['api-key'])
if (apiKey) parsed.apiKey = apiKey
const baseUrl = stringValue(values['base-url'])
if (baseUrl) parsed.baseUrl = baseUrl
if (command === 'suite') {
const target = publishTarget(stringValue(values.publish))
if (target) {
const suiteArgs = parsed as SuiteCliArgs
suiteArgs.publishTarget = target
}
}
return parsed
}
function parseLegacyArgs(argv: string[]): LegacyCliArgs {
const { values } = parseArgs({
args: argv,
options: {
config: { type: 'string', short: 'c' },
help: { type: 'boolean', short: 'h' },
},
})
const parsed: LegacyCliArgs = { command: 'legacy' }
const configPath = stringValue(values.config)
if (configPath) parsed.configPath = configPath
if (values.help) parsed.help = true
return parsed
}
/** Parses the eval CLI command without running browser or publishing side effects. */
export function parseEvalCliArgs(argv: string[]): EvalCliArgs {
const [command, ...rest] = argv
if (!COMMANDS.has(command ?? '')) {
return parseLegacyArgs(argv)
}
switch (command) {
case 'suite':
return parseSuiteLikeArgs('suite', rest)
case 'run':
return parseSuiteLikeArgs('run', rest)
case 'grade': {
const { values } = parseArgs({
args: rest,
options: { run: { type: 'string' } },
})
const runDir = stringValue(values.run)
if (!runDir) throw new Error('grade requires --run')
return { command: 'grade', runDir }
}
case 'publish': {
const { values } = parseArgs({
args: rest,
options: { run: { type: 'string' }, target: { type: 'string' } },
})
const runDir = stringValue(values.run)
if (!runDir) throw new Error('publish requires --run')
const target = publishTarget(stringValue(values.target))
if (!target) throw new Error('publish requires --target')
return { command: 'publish', runDir, target }
}
default:
return parseLegacyArgs(argv)
}
}
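The dispatch above treats the first argv token as a subcommand and falls back to legacy `-c/--config` parsing for anything unrecognized. A minimal standalone sketch of that shape (the `KNOWN` set and result types here are stand-ins, not the module's real exports, and only the `grade` and legacy paths are wired up):

```typescript
import { parseArgs } from 'node:util'

// Stand-in command set; the real module defines its own COMMANDS.
const KNOWN = new Set(['suite', 'run', 'grade', 'publish'])

type Parsed =
  | { command: 'grade'; runDir: string }
  | { command: 'legacy'; configPath?: string }

function dispatch(argv: string[]): Parsed {
  const [command, ...rest] = argv
  if (command === 'grade') {
    const { values } = parseArgs({
      args: rest,
      options: { run: { type: 'string' } },
    })
    if (typeof values.run !== 'string') throw new Error('grade requires --run')
    return { command: 'grade', runDir: values.run }
  }
  if (!KNOWN.has(command ?? '')) {
    // Unknown first token: treat the whole argv as legacy -c/--config usage.
    const { values } = parseArgs({
      args: argv,
      options: { config: { type: 'string', short: 'c' } },
    })
    return { command: 'legacy', configPath: values.config as string | undefined }
  }
  throw new Error(`unhandled command: ${command}`)
}
```

Keeping the legacy path as the fallback (rather than an error) is what lets `bun run eval -c config.json` keep working after the subcommands were introduced.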



@@ -0,0 +1,84 @@
import { readdir, readFile, stat } from 'node:fs/promises'
import { join } from 'node:path'
import { TrajectorySaver } from '../../capture/trajectory-saver'
import { runGraders } from '../../grading/grader-runner'
import { type Message, MessageSchema, TaskMetadataSchema } from '../../types'
import type { GradeCliArgs } from '../args'
async function loadMessages(taskDir: string): Promise<Message[]> {
const content = await readFile(
join(taskDir, 'messages.jsonl'),
'utf-8',
).catch(() => '')
return content
.split('\n')
.filter((line) => line.trim().length > 0)
.map((line) => MessageSchema.parse(JSON.parse(line)))
}
async function findTaskDirs(runDir: string): Promise<string[]> {
const entries = await readdir(runDir, { withFileTypes: true })
const taskDirs: string[] = []
for (const entry of entries) {
if (!entry.isDirectory()) continue
const taskDir = join(runDir, entry.name)
const metadata = await stat(join(taskDir, 'metadata.json')).catch(
() => null,
)
if (metadata?.isFile()) taskDirs.push(taskDir)
}
return taskDirs
}
/** Re-runs graders for task artifacts that already contain metadata and messages. */
export async function runGradeCommand(args: GradeCliArgs): Promise<void> {
const runStat = await stat(args.runDir).catch(() => null)
if (!runStat?.isDirectory()) {
throw new Error(`Not a run directory: ${args.runDir}`)
}
const taskDirs = await findTaskDirs(args.runDir)
if (taskDirs.length === 0) {
throw new Error(`No task metadata found under ${args.runDir}`)
}
let graded = 0
for (const taskDir of taskDirs) {
const metadata = TaskMetadataSchema.parse(
JSON.parse(await readFile(join(taskDir, 'metadata.json'), 'utf-8')),
)
const graderNames = Object.keys(metadata.grader_results ?? {})
if (graderNames.length === 0) {
console.warn(`Skipping ${metadata.query_id}: no existing grader names`)
continue
}
const messages = await loadMessages(taskDir)
const graderResults = await runGraders(graderNames, {
task: {
query_id: metadata.query_id,
query: metadata.query,
dataset: metadata.dataset,
},
messages,
screenshotCount: metadata.screenshot_count ?? metadata.total_steps,
finalAnswer: metadata.final_answer,
taskArtifactDir: taskDir,
outputDir: taskDir,
mcpUrl: `${process.env.BROWSEROS_SERVER_URL || 'http://127.0.0.1:9110'}/mcp`,
})
await new TrajectorySaver(
args.runDir,
metadata.query_id,
).updateGraderResults(graderResults)
graded++
}
if (graded === 0) {
throw new Error(
`No tasks with existing grader names found under ${args.runDir}`,
)
}
console.log(`Re-graded ${graded} task(s) in ${args.runDir}`)
}
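`loadMessages` above follows a common JSONL pattern: tolerate a missing file, drop blank lines, parse each remaining line. The core of it, with a plain `JSON.parse` standing in for the Zod `MessageSchema` validation:

```typescript
// Parse JSONL content: one JSON value per non-empty line.
// The real loader validates each line with MessageSchema; this sketch
// parses without validation but keeps the same split/filter/map shape.
function parseJsonl<T>(content: string): T[] {
  return content
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as T)
}
```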


@@ -0,0 +1,25 @@
import { publishPathToR2 } from '../../publishing/r2-publisher'
import type { PublishCliArgs, PublishTarget } from '../args'
export interface PublishRunOptions {
runDir: string
target: PublishTarget
}
/** Publishes run artifacts through the R2 viewer upload path. */
export async function publishRun(options: PublishRunOptions): Promise<void> {
if (options.target !== 'r2') {
throw new Error(`Unsupported publish target: ${options.target}`)
}
const result = await publishPathToR2(options.runDir)
for (const run of result.uploadedRuns) {
console.log(run.viewerUrl)
}
for (const runId of result.skippedRuns) {
console.log(`${runId}: already uploaded, skipping`)
}
}
export async function runPublishCommand(args: PublishCliArgs): Promise<void> {
await publishRun({ runDir: args.runDir, target: args.target })
}


@@ -0,0 +1,21 @@
import type { RunCliArgs } from '../args'
import { runSuiteCommand, type SuiteCommandDeps } from './suite'
/** Executes tasks from a config or suite without publishing artifacts. */
export async function runRunCommand(
args: RunCliArgs,
deps: SuiteCommandDeps = {},
): Promise<void> {
await runSuiteCommand(
{
configPath: args.configPath,
suitePath: args.suitePath,
variantId: args.variantId,
provider: args.provider,
model: args.model,
apiKey: args.apiKey,
baseUrl: args.baseUrl,
},
deps,
)
}


@@ -0,0 +1,187 @@
import type { RunEvalOptions, RunEvalResult } from '../../runner/types'
import { runEval as defaultRunEval } from '../../runs/eval-runner'
import {
type AdaptedEvalConfig,
adaptEvalConfigFile,
} from '../../suites/config-adapter'
import { loadSuite } from '../../suites/load-suite'
import { type EvalVariant, resolveVariant } from '../../suites/resolve-variant'
import type { EvalSuite } from '../../suites/schema'
import { type EvalConfig, EvalConfigSchema } from '../../types'
import type { PublishTarget } from '../args'
type Env = Record<string, string | undefined>
export interface SuiteCommandOptions {
configPath?: string
suitePath?: string
variantId?: string
provider?: string
model?: string
apiKey?: string
baseUrl?: string
publishTarget?: PublishTarget
env?: Env
}
export type ResolvedSuiteCommand =
| (AdaptedEvalConfig & { kind: 'config'; datasetPath?: undefined })
| {
kind: 'suite'
suitePath: string
suite: EvalSuite
variant: EvalVariant
datasetPath: string
evalConfig: EvalConfig
}
export interface SuiteCommandDeps {
runEval?: (options: RunEvalOptions) => Promise<RunEvalResult | undefined>
publishRun?: (options: {
runDir: string
target: PublishTarget
}) => Promise<void>
}
function ensureRunnableSuite(suite: EvalSuite): void {
if (!suite.browseros) {
throw new Error('suite browseros config is required to run suite commands')
}
}
function suiteToEvalConfig(
suite: EvalSuite,
datasetPath: string,
variant: EvalVariant,
env: Env,
): EvalConfig {
ensureRunnableSuite(suite)
const base = {
dataset: datasetPath,
num_workers: suite.workers,
restart_server_per_task: suite.restartBrowserPerTask,
browseros: suite.browseros,
graders: suite.graders,
timeout_ms: suite.timeoutMs,
captcha: suite.captcha,
}
if (suite.agent.type === 'single' || suite.agent.type === 'tool-loop') {
// The legacy runner names the BrowserOS tool-loop agent "single".
return EvalConfigSchema.parse({
...base,
agent: {
type: 'single',
provider: variant.agent.provider,
model: variant.agent.model,
apiKey: variant.agent.apiKey,
baseUrl: variant.agent.baseUrl,
supportsImages: variant.agent.supportsImages,
},
})
}
const executorBackend = suite.agent.executorBackend ?? 'tool-loop'
const executor =
executorBackend === 'clado'
? {
provider: 'clado-action' as const,
model:
env.EVAL_EXECUTOR_MODEL ?? env.CLADO_ACTION_MODEL ?? 'clado-action',
apiKey: env.EVAL_EXECUTOR_API_KEY ?? env.CLADO_ACTION_API_KEY ?? '',
baseUrl:
env.EVAL_EXECUTOR_BASE_URL ??
env.CLADO_ACTION_BASE_URL ??
env.CLADO_ACTION_URL,
}
: {
provider: variant.agent.provider,
model: variant.agent.model,
apiKey: variant.agent.apiKey,
baseUrl: variant.agent.baseUrl,
}
return EvalConfigSchema.parse({
...base,
agent: {
type: 'orchestrator-executor',
orchestrator: {
provider: variant.agent.provider,
model: variant.agent.model,
apiKey: variant.agent.apiKey,
baseUrl: variant.agent.baseUrl,
},
executor,
},
})
}
/** Resolves config-backed or suite-backed CLI input into the run shape used by the runner. */
export async function resolveSuiteCommand(
options: SuiteCommandOptions,
): Promise<ResolvedSuiteCommand> {
const env = options.env ?? process.env
if (options.configPath) {
return {
kind: 'config',
...(await adaptEvalConfigFile(options.configPath, { env })),
}
}
if (!options.suitePath) {
throw new Error('suite requires --config or --suite')
}
const loaded = await loadSuite(options.suitePath)
const variant = resolveVariant({
variantId: options.variantId,
provider: options.provider,
model: options.model,
apiKey: options.apiKey,
baseUrl: options.baseUrl,
env,
})
return {
kind: 'suite',
suitePath: loaded.suitePath,
suite: loaded.suite,
variant,
datasetPath: loaded.datasetPath,
evalConfig: suiteToEvalConfig(
loaded.suite,
loaded.datasetPath,
variant,
env,
),
}
}
/** Runs the full suite loop: resolve input, execute tasks, then optionally publish the run. */
export async function runSuiteCommand(
options: SuiteCommandOptions,
deps: SuiteCommandDeps = {},
): Promise<void> {
const runEval = deps.runEval ?? defaultRunEval
const resolved = await resolveSuiteCommand(options)
const runOptions: RunEvalOptions =
resolved.kind === 'config'
? { configPath: resolved.configPath }
: {
configPath: resolved.suitePath,
dataPath: resolved.datasetPath,
config: resolved.evalConfig,
}
const result = await runEval(runOptions)
if (!options.publishTarget) return
const outputDir = result?.outputDir
if (!outputDir) {
throw new Error('publish requested but runner did not return an outputDir')
}
if (!deps.publishRun) {
    throw new Error('publish requested but no publisher was configured')
}
await deps.publishRun({ runDir: outputDir, target: options.publishTarget })
}
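The executor block above resolves each setting through an ordered `??` chain of environment overrides with a final default. That lookup generalizes to a small helper; the key names below mirror the chain in the code, and the injected env makes the behavior testable:

```typescript
type Env = Record<string, string | undefined>

// Return the first defined value among an ordered list of env keys,
// falling back to a default. Equivalent to `env.A ?? env.B ?? fallback`.
function envFallback(env: Env, keys: string[], fallback?: string): string | undefined {
  for (const key of keys) {
    const value = env[key]
    if (value !== undefined) return value
  }
  return fallback
}
```

Passing `env` explicitly (as `SuiteCommandOptions.env` does) keeps the resolution pure, so tests never need to mutate `process.env`.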


@@ -0,0 +1,70 @@
import { startDashboard } from '../dashboard/server'
import { runEval } from '../runs/eval-runner'
import { type EvalCliArgs, parseEvalCliArgs } from './args'
import { runGradeCommand } from './commands/grade'
import { publishRun, runPublishCommand } from './commands/publish'
import { runRunCommand } from './commands/run'
import { runSuiteCommand } from './commands/suite'
export function usage(): string {
return `
BrowserOS Eval
Usage:
bun run eval suite --config <config.json> [--publish r2]
bun run eval suite --suite <suite.json> --variant <id> [--publish r2]
bun run eval run --config <config.json>
bun run eval run --suite <suite.json> --variant <id>
bun run eval grade --run <results/run-dir>
bun run eval publish --run <results/run-dir> --target r2
bun run eval -c <config.json>
`
}
async function runLegacyCommand(args: EvalCliArgs): Promise<void> {
if (args.command !== 'legacy') return
if (args.help) {
console.log(usage())
return
}
if (args.configPath) {
await runEval({ configPath: args.configPath })
return
}
startDashboard({
tasks: [],
configName: '',
agentType: '',
outputDir: '',
configMode: true,
})
console.log(
'Dashboard running at http://localhost:9900 — configure and run from the UI',
)
await new Promise(() => {})
}
/** Dispatches the eval CLI while preserving the old config/dashboard entry points. */
export async function runCli(
argv: string[] = Bun.argv.slice(2),
): Promise<void> {
const args = parseEvalCliArgs(argv)
switch (args.command) {
case 'legacy':
await runLegacyCommand(args)
break
case 'suite':
await runSuiteCommand(args, { publishRun })
break
case 'run':
await runRunCommand(args)
break
case 'grade':
await runGradeCommand(args)
break
case 'publish':
await runPublishCommand(args)
break
}
}


@@ -5,4 +5,5 @@
export const DEFAULT_TIMEOUT_MS = 30 * 60 * 1000 // 30 minutes
export const SCREENSHOT_TIMEOUT_MS = 65_000 // 65s — ensures we get extension's error (60s)
export const MAX_ACTIONS_PER_DELEGATION = 15
export const CLADO_REQUEST_TIMEOUT_MS = 120_000
// Cold start can take ~5 minutes according to Clado; 6 minutes leaves headroom.
export const CLADO_REQUEST_TIMEOUT_MS = 360_000


@@ -1,5 +1,5 @@
import { mkdir, readdir, readFile, stat } from 'node:fs/promises'
import { join, resolve } from 'node:path'
import { dirname, join, resolve, sep } from 'node:path'
import { Hono } from 'hono'
import { streamSSE } from 'hono/streaming'
import { ParallelExecutor } from '../runner/parallel-executor'
@@ -128,6 +128,35 @@ let dashboardConfigMode = false
const configsDir = join(import.meta.dir, '..', '..', 'configs')
const projectRoot = resolve(import.meta.dir, '..', '..', '..', '..')
async function listConfigFiles(dir: string, prefix = ''): Promise<string[]> {
const entries = await readdir(join(dir, prefix), { withFileTypes: true })
const files: string[] = []
for (const entry of entries) {
const relativePath = prefix ? join(prefix, entry.name) : entry.name
if (entry.isDirectory()) {
files.push(...(await listConfigFiles(dir, relativePath)))
} else if (entry.isFile() && entry.name.endsWith('.json')) {
files.push(relativePath.split(sep).join('/'))
}
}
return files.sort()
}
function resolveConfigPath(name: string): string | null {
if (!name.endsWith('.json')) return null
if (name.split('/').some((part) => !part || part === '.' || part === '..')) {
return null
}
const resolvedPath = resolve(configsDir, name)
const resolvedConfigsDir = resolve(configsDir)
const configRootPrefix = resolvedConfigsDir.endsWith(sep)
? resolvedConfigsDir
: `${resolvedConfigsDir}${sep}`
if (!resolvedPath.startsWith(configRootPrefix)) return null
return resolvedPath
}
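`resolveConfigPath` layers two defenses: reject empty, `.`, and `..` segments up front, then verify the resolved path still sits under the configs root, appending a trailing separator so a sibling like `configs-evil` cannot pass a bare prefix check. The same guard, isolated from the dashboard (function name is illustrative):

```typescript
import { resolve, sep } from 'node:path'

// Resolve `name` under `root`, returning null for anything that would
// escape the root: `..` traversal, empty segments, or prefix tricks.
function resolveUnder(root: string, name: string): string | null {
  if (name.split('/').some((part) => !part || part === '.' || part === '..')) {
    return null
  }
  const resolvedRoot = resolve(root)
  // Trailing separator prevents `/x/configs-evil` matching root `/x/configs`.
  const prefix = resolvedRoot.endsWith(sep) ? resolvedRoot : `${resolvedRoot}${sep}`
  const resolved = resolve(resolvedRoot, name)
  return resolved.startsWith(prefix) ? resolved : null
}
```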
// ============================================================================
// Hono App
// ============================================================================
@@ -339,21 +368,21 @@ app.get('/api/mode', (c) => {
// List saved config files
app.get('/api/configs', async (c) => {
try {
const files = await readdir(configsDir)
return c.json(files.filter((f) => f.endsWith('.json')))
return c.json(await listConfigFiles(configsDir))
} catch {
return c.json([])
}
})
// Read a specific config file
app.get('/api/config/:name', async (c) => {
const name = c.req.param('name')
if (name.includes('/') || name.includes('..')) {
app.get('/api/config/*', async (c) => {
const name = decodeURIComponent(c.req.path.slice('/api/config/'.length))
const configPath = resolveConfigPath(name)
if (!configPath) {
return c.json({ error: 'Invalid config name' }, 400)
}
try {
const content = await readFile(join(configsDir, name), 'utf-8')
const content = await readFile(configPath, 'utf-8')
return c.json(JSON.parse(content))
} catch {
return c.notFound()
@@ -382,8 +411,17 @@ app.post('/api/run', async (c) => {
const config = parseResult.data
// Resolve relative paths from configs/ dir (dataset dropdown values are relative to it)
const baseDir = configsDir
let baseDir = configsDir
if (body.configName) {
const configPath = resolveConfigPath(body.configName)
if (!configPath) {
return c.json({ error: 'Invalid config name' }, 400)
}
baseDir = dirname(configPath)
}
// Resolve relative paths from the loaded config location. Unsaved dashboard
// configs keep using apps/eval/configs as their base for dropdown values.
const datasetPath = resolve(
config.dataset.startsWith('/')
? config.dataset


@@ -685,6 +685,59 @@
});
}
// Test harness note: these ASCII section markers are used by r2-viewer-compat.test.ts.
// -- Artifact path resolution
function taskKey(task) {
return task.queryId || task.id || 'unknown-task';
}
function legacyArtifactPath(task, artifact) {
const id = taskKey(task);
switch (artifact) {
case 'attempt':
return `${id}/attempt.json`;
case 'metadata':
return `${id}/metadata.json`;
case 'messages':
return `${id}/messages.jsonl`;
case 'trace':
return `${id}/trace.jsonl`;
case 'grades':
return `${id}/grades.json`;
case 'screenshots':
return `${id}/screenshots`;
case 'graderArtifacts':
return `${id}/grader-artifacts`;
default:
return `${id}/${artifact}`;
}
}
function artifactPath(task, artifact) {
const manifestPath = task.paths && task.paths[artifact];
if (typeof manifestPath === 'string' && manifestPath.length > 0) {
return manifestPath.replace(/^\/+/, '');
}
return legacyArtifactPath(task, artifact);
}
function artifactUrl(task, artifact) {
return `${basePath}/${artifactPath(task, artifact)}`;
}
function metadataUrl(task) {
return artifactUrl(task, 'metadata');
}
function messagesUrl(task) {
return artifactUrl(task, 'messages');
}
function screenshotUrl(task, n) {
return `${artifactUrl(task, 'screenshots')}/${n}.png`;
}
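The viewer resolves each artifact through the manifest's `paths` map when present and falls back to the legacy `<id>/<artifact>` layout otherwise. A condensed sketch of that lookup, outside the DOM context (the `ViewerTask` shape is an assumption, and the legacy mapping is abbreviated to two cases rather than the full switch):

```typescript
interface ViewerTask {
  queryId?: string
  id?: string
  paths?: Record<string, string>
}

// Manifest-first artifact resolution with a legacy-layout fallback.
// Leading slashes are stripped so the path joins cleanly onto basePath.
function artifactPathFor(task: ViewerTask, artifact: string): string {
  const manifestPath = task.paths?.[artifact]
  if (typeof manifestPath === 'string' && manifestPath.length > 0) {
    return manifestPath.replace(/^\/+/, '')
  }
  const id = task.queryId || task.id || 'unknown-task'
  return `${id}/${artifact === 'messages' ? 'messages.jsonl' : `${artifact}.json`}`
}
```

Because old runs have no `paths` map, the fallback keeps previously published runs viewable without re-uploading them.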
// -- Task selection
// ── Task selection ─────────────────────────────────────────────
function selectTask(task) {
stopAutoplay();
@@ -716,6 +769,7 @@
}
}
// -- Center panel
// ── Center panel: screenshot viewer ────────────────────────────
function renderCenterPanel(task) {
const panel = document.getElementById('center-panel');
@@ -763,10 +817,6 @@
updateControls();
}
function screenshotUrl(task, n) {
return `${basePath}/${task.queryId || task.id}/screenshots/${n}.png`;
}
function goToStep(n) {
if (!selectedTask || n < 1 || n > totalSteps) return;
currentStep = n;
@@ -914,7 +964,7 @@
body.innerHTML = '<div class="placeholder"><div class="ph-text" style="color: #6e7681;">Loading messages...</div></div>';
countEl.textContent = '';
const msgUrl = `${basePath}/${task.queryId || task.id}/messages.jsonl`;
const msgUrl = messagesUrl(task);
fetch(msgUrl)
.then((res) => {
@@ -1075,7 +1125,7 @@
// ── Load task metadata for rich grader details ──────────────────
function loadTaskMetadata(task) {
const metaUrl = `${basePath}/${task.queryId || task.id}/metadata.json`;
const metaUrl = metadataUrl(task);
fetch(metaUrl)
.then((res) => res.ok ? res.json() : null)
.then((meta) => {


@@ -1,5 +1,12 @@
import { spawn } from 'node:child_process'
import { join } from 'node:path'
import {
writeGraderJsonArtifact,
writeGraderTextArtifact,
} from '../../grading/artifacts'
import {
type PythonEvaluatorResult,
runPythonJsonEvaluator,
} from '../../grading/python-evaluator'
import type { GraderResult } from '../../types'
import { callMcpTool } from '../../utils/mcp-client'
import type { Grader, GraderInput } from '../types'
@@ -7,12 +14,23 @@ import type { Grader, GraderInput } from '../types'
const EVAL_SCRIPT = join(
import.meta.dirname,
'..',
'..',
'..',
'scripts',
'python',
'agisdk-evaluate.py',
)
interface AgisdkEvaluatorInput {
task_id: string
env_state: Record<string, unknown>
model_response: string
}
interface AgisdkEvaluatorOutput {
reward: number
pass: boolean
message: string
per_criterion: unknown[]
}
export class AgisdkStateDiffGrader implements Grader {
name = 'agisdk_state_diff'
@@ -36,6 +54,16 @@ export class AgisdkStateDiffGrader implements Grader {
let envState: Record<string, unknown>
try {
envState = await this.fetchFinishState(origin, mcpEndpoint)
await writeGraderJsonArtifact(
input,
this.name,
'finish-state.json',
envState,
)
await writeGraderJsonArtifact(input, this.name, 'context.json', {
origin,
agisdk_task_id: taskId,
})
} catch (error) {
return {
score: 0,
@@ -46,10 +74,30 @@ export class AgisdkStateDiffGrader implements Grader {
}
try {
const result = await this.runPythonEvaluator(
taskId,
envState,
input.finalAnswer || '',
const evaluatorInput: AgisdkEvaluatorInput = {
task_id: taskId,
env_state: envState,
model_response: input.finalAnswer || '',
}
await writeGraderJsonArtifact(
input,
this.name,
'evaluator-input.json',
evaluatorInput,
)
const evaluation = await this.runPythonEvaluator(evaluatorInput)
const result = evaluation.output
await writeGraderJsonArtifact(
input,
this.name,
'evaluator-output.json',
result,
)
await writeGraderTextArtifact(
input,
this.name,
'stderr.txt',
evaluation.stderr,
)
return {
score: result.reward,
@@ -144,59 +192,12 @@ export class AgisdkStateDiffGrader implements Grader {
}
private runPythonEvaluator(
taskId: string,
envState: Record<string, unknown>,
modelResponse: string,
): Promise<{
reward: number
pass: boolean
message: string
per_criterion: unknown[]
}> {
return new Promise((resolve, reject) => {
const proc = spawn('python3', [EVAL_SCRIPT], {
stdio: ['pipe', 'pipe', 'pipe'],
})
const inputData = JSON.stringify({
task_id: taskId,
env_state: envState,
model_response: modelResponse,
})
let stdout = ''
let stderr = ''
proc.stdout.on('data', (data: Buffer) => {
stdout += data.toString()
})
proc.stderr.on('data', (data: Buffer) => {
stderr += data.toString()
})
proc.on('close', (code) => {
if (code !== 0) {
reject(
new Error(`Python evaluator exited with code ${code}: ${stderr}`),
)
return
}
try {
const result = JSON.parse(stdout.trim())
resolve(result)
} catch {
reject(new Error(`Failed to parse evaluator output: ${stdout}`))
}
})
proc.on('error', (err) => {
reject(new Error(`Failed to spawn Python evaluator: ${err.message}`))
})
proc.stdin.write(inputData)
proc.stdin.end()
evalInput: AgisdkEvaluatorInput,
): Promise<PythonEvaluatorResult<AgisdkEvaluatorOutput>> {
return runPythonJsonEvaluator<AgisdkEvaluatorOutput>({
scriptPath: EVAL_SCRIPT,
input: evalInput,
timeoutMs: 300_000,
})
}
}


@@ -1,4 +1,12 @@
import { join, resolve } from 'node:path'
import {
writeGraderJsonArtifact,
writeGraderTextArtifact,
} from '../../grading/artifacts'
import {
type PythonEvaluatorResult,
runPythonJsonEvaluator,
} from '../../grading/python-evaluator'
import type { GraderResult } from '../../types'
import type { Grader, GraderInput } from '../types'
@@ -14,10 +22,7 @@ interface InfinityEvalOutput {
message: string
}
const EVAL_SCRIPT = resolve(
import.meta.dir,
'../../../scripts/infinity-evaluate.py',
)
const EVAL_SCRIPT = resolve(import.meta.dir, '../python/infinity-evaluate.py')
export class InfinityStateGrader implements Grader {
name = 'infinity_state'
@@ -66,7 +71,32 @@ export class InfinityStateGrader implements Grader {
}
try {
const result = await this.runPythonEvaluator(evalInput)
await writeGraderJsonArtifact(input, this.name, 'verifier.json', {
appName: parsed.appName,
taskId: parsed.taskId,
verifierPath,
appServerUrl,
})
await writeGraderJsonArtifact(
input,
this.name,
'evaluator-input.json',
evalInput,
)
const evaluation = await this.runPythonEvaluator(evalInput)
const result = evaluation.output
await writeGraderJsonArtifact(
input,
this.name,
'evaluator-output.json',
result,
)
await writeGraderTextArtifact(
input,
this.name,
'stderr.txt',
evaluation.stderr,
)
return {
score: result.pass ? 1 : 0,
pass: result.pass,
@@ -108,27 +138,11 @@ export class InfinityStateGrader implements Grader {
private async runPythonEvaluator(
evalInput: InfinityEvalInput,
): Promise<InfinityEvalOutput> {
const proc = Bun.spawn(['python3', EVAL_SCRIPT], {
stdin: 'pipe',
stdout: 'pipe',
stderr: 'pipe',
): Promise<PythonEvaluatorResult<InfinityEvalOutput>> {
return runPythonJsonEvaluator<InfinityEvalOutput>({
scriptPath: EVAL_SCRIPT,
input: evalInput,
timeoutMs: 300_000,
})
const inputJson = JSON.stringify(evalInput)
proc.stdin.write(inputJson)
proc.stdin.end()
const stdout = await new Response(proc.stdout).text()
const stderr = await new Response(proc.stderr).text()
const exitCode = await proc.exited
if (exitCode !== 0) {
throw new Error(
`Python evaluator exited with code ${exitCode}: ${stderr || stdout}`,
)
}
return JSON.parse(stdout.trim()) as InfinityEvalOutput
}
}


@@ -1,6 +1,7 @@
import { readFile } from 'node:fs/promises'
import { join } from 'node:path'
import { query } from '@anthropic-ai/claude-agent-sdk'
import { writeGraderJsonArtifact } from '../../grading/artifacts'
import type { GraderResult } from '../../types'
import type { Grader, GraderInput } from '../types'
import {
@@ -63,6 +64,7 @@ export class PerformanceGrader implements Grader {
input.screenshotCount,
terminationReason,
)
await writeGraderJsonArtifact(input, this.name, 'metrics.json', metrics)
const systemPrompt = PERFORMANCE_SYSTEM_PROMPT.replace(
/\{screenshot_count\}/g,
@@ -82,6 +84,14 @@ export class PerformanceGrader implements Grader {
userPrompt,
input.outputDir,
)
if (response) {
await writeGraderJsonArtifact(
input,
this.name,
'agent-output.json',
response,
)
}
if (!response) {
return {
@@ -140,6 +150,7 @@ export class PerformanceGrader implements Grader {
`Perf grader: LLM returned ${returnedAxes.size}/${expectedAxes.size} axes, missing: ${missingAxes.join(', ')}`,
)
}
await writeGraderJsonArtifact(input, this.name, 'axes.json', axisResults)
return {
score: compositeScore / 100,


@@ -1,51 +1,2 @@
import type { GraderResult } from '../types'
import { AgisdkStateDiffGrader } from './benchmark/agisdk-state-diff'
import { InfinityStateGrader } from './benchmark/infinity-state'
import { PerformanceGrader } from './performance/performance-grader'
import type { Grader, GraderInput } from './types'
export const PASS_FAIL_GRADER_ORDER = [
'agisdk_state_diff',
'infinity_state',
'performance_grader',
] as const
export function createGrader(name: string): Grader | null {
switch (name) {
case 'agisdk_state_diff':
return new AgisdkStateDiffGrader()
case 'infinity_state':
return new InfinityStateGrader()
case 'performance_grader':
return new PerformanceGrader()
default:
console.warn(`Unknown grader: ${name}`)
return null
}
}
export async function runGraders(
graderNames: string[],
input: GraderInput,
): Promise<Record<string, GraderResult>> {
const results: Record<string, GraderResult> = {}
for (const name of graderNames) {
const grader = createGrader(name)
if (!grader) continue
try {
console.log(` Running grader: ${name}`)
results[name] = await grader.grade(input)
} catch (error) {
results[name] = {
score: 0,
pass: false,
reasoning: `Error running grader: ${error}`,
}
}
}
return results
}
export { AgisdkStateDiffGrader, InfinityStateGrader, PerformanceGrader }
export * from '../grading/grader-registry'
export { runConfiguredGraders, runGraders } from '../grading/grader-runner'


@@ -1,21 +1 @@
import type { GraderResult, Message } from '../types'
export interface GraderInput {
task: {
query_id: string
query: string
dataset: string
}
messages: Message[]
screenshotCount: number
finalAnswer: string | null
expectedAnswer?: string | null
outputDir: string
mcpUrl?: string
infinityAppUrl?: string
}
export interface Grader {
name: string
grade(input: GraderInput): Promise<GraderResult>
}
export type { Grader, GraderInput } from '../grading/types'


@@ -0,0 +1,34 @@
import { mkdir, writeFile } from 'node:fs/promises'
import { join } from 'node:path'
import type { GraderInput } from './types'
function artifactDir(input: GraderInput, graderName: string): string {
return join(
input.taskArtifactDir || input.outputDir,
'grader-artifacts',
graderName,
)
}
/** Writes a JSON artifact for a grader under the task artifact directory. */
export async function writeGraderJsonArtifact(
input: GraderInput,
graderName: string,
filename: string,
value: unknown,
): Promise<void> {
const dir = artifactDir(input, graderName)
await mkdir(dir, { recursive: true })
await writeFile(join(dir, filename), JSON.stringify(value, null, 2))
}
export async function writeGraderTextArtifact(
input: GraderInput,
graderName: string,
filename: string,
value: string,
): Promise<void> {
const dir = artifactDir(input, graderName)
await mkdir(dir, { recursive: true })
await writeFile(join(dir, filename), value)
}


@@ -0,0 +1,26 @@
import { AgisdkStateDiffGrader } from '../graders/benchmark/agisdk-state-diff'
import { InfinityStateGrader } from '../graders/benchmark/infinity-state'
import { PerformanceGrader } from '../graders/performance/performance-grader'
import type { Grader } from './types'
export const PASS_FAIL_GRADER_ORDER = [
'agisdk_state_diff',
'infinity_state',
'performance_grader',
] as const
export function createGrader(name: string): Grader | null {
switch (name) {
case 'agisdk_state_diff':
return new AgisdkStateDiffGrader()
case 'infinity_state':
return new InfinityStateGrader()
case 'performance_grader':
return new PerformanceGrader()
default:
console.warn(`Unknown grader: ${name}`)
return null
}
}
export { AgisdkStateDiffGrader, InfinityStateGrader, PerformanceGrader }


@@ -0,0 +1,36 @@
import type { GraderResult } from '../types'
import { createGrader as defaultCreateGrader } from './grader-registry'
import type { Grader, GraderInput } from './types'
export interface GraderRunnerDeps {
createGrader?: (name: string) => Grader | null
}
/** Runs configured graders independently so one failure does not hide others. */
export async function runConfiguredGraders(
graderNames: string[],
input: GraderInput,
deps: GraderRunnerDeps = {},
): Promise<Record<string, GraderResult>> {
const create = deps.createGrader ?? defaultCreateGrader
const results: Record<string, GraderResult> = {}
for (const name of graderNames) {
const grader = create(name)
if (!grader) continue
try {
console.log(` Running grader: ${name}`)
results[name] = await grader.grade(input)
} catch (error) {
results[name] = {
score: 0,
pass: false,
reasoning: `Error running grader: ${error instanceof Error ? error.message : String(error)}`,
}
}
}
return results
}
export const runGraders = runConfiguredGraders
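The runner above deliberately converts each grader failure into a zero-score result instead of letting the exception propagate, so one crashing grader cannot hide the others' results. A minimal sketch of that isolation with a stub grader type:

```typescript
interface Result { score: number; pass: boolean; reasoning: string }
type GraderFn = () => Promise<Result>

// Run graders independently: a throw becomes a failed result for that
// grader only, and the loop continues with the rest.
async function runIsolated(graders: Record<string, GraderFn>): Promise<Record<string, Result>> {
  const results: Record<string, Result> = {}
  for (const [name, grade] of Object.entries(graders)) {
    try {
      results[name] = await grade()
    } catch (error) {
      results[name] = {
        score: 0,
        pass: false,
        reasoning: `Error running grader: ${error instanceof Error ? error.message : String(error)}`,
      }
    }
  }
  return results
}
```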


@@ -0,0 +1,65 @@
export interface PythonEvaluatorOptions {
scriptPath: string
input: unknown
timeoutMs: number
}
export interface PythonEvaluatorResult<T> {
output: T
stdout: string
stderr: string
exitCode: number
}
/** Runs a Python evaluator that accepts stdin JSON and emits stdout JSON. */
export async function runPythonJsonEvaluator<T>(
options: PythonEvaluatorOptions,
): Promise<PythonEvaluatorResult<T>> {
const proc = Bun.spawn(['python3', options.scriptPath], {
stdin: 'pipe',
stdout: 'pipe',
stderr: 'pipe',
})
proc.stdin.write(JSON.stringify(options.input))
proc.stdin.end()
let timeoutHandle: ReturnType<typeof setTimeout> | undefined
const timeout = new Promise<never>((_, reject) => {
timeoutHandle = setTimeout(() => {
proc.kill('SIGKILL')
reject(
new Error(`Python evaluator timed out after ${options.timeoutMs}ms`),
)
}, options.timeoutMs)
})
const completed = (async (): Promise<PythonEvaluatorResult<T>> => {
const stdout = await new Response(proc.stdout).text()
const stderr = await new Response(proc.stderr).text()
const exitCode = await proc.exited
if (exitCode !== 0) {
throw new Error(
`Python evaluator exited with code ${exitCode}: ${stderr || stdout}`,
)
}
try {
return {
output: JSON.parse(stdout.trim()) as T,
stdout,
stderr,
exitCode,
}
} catch {
throw new Error(`Failed to parse Python evaluator output: ${stdout}`)
}
})()
try {
return await Promise.race([completed, timeout])
} finally {
clearTimeout(timeoutHandle)
}
}
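The timeout handling above is the standard `Promise.race` shape: a promise that rejects after the deadline (killing the subprocess on the way) raced against the real work, with the timer cleared in `finally` on both paths. The same pattern reduced to its core, without the subprocess (names are illustrative):

```typescript
// Race `work` against a deadline. The timer is always cleared in
// `finally`, so a fast success leaves no pending timeout behind.
async function withTimeout<T>(
  work: Promise<T>,
  timeoutMs: number,
  onTimeout?: () => void,
): Promise<T> {
  let handle: ReturnType<typeof setTimeout> | undefined
  const deadline = new Promise<never>((_, reject) => {
    handle = setTimeout(() => {
      onTimeout?.() // the real evaluator kills the process here
      reject(new Error(`timed out after ${timeoutMs}ms`))
    }, timeoutMs)
  })
  try {
    return await Promise.race([work, deadline])
  } finally {
    clearTimeout(handle)
  }
}
```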


@@ -0,0 +1,22 @@
import type { GraderResult, Message } from '../types'
export interface GraderInput {
task: {
query_id: string
query: string
dataset: string
}
messages: Message[]
screenshotCount: number
finalAnswer: string | null
expectedAnswer?: string | null
taskArtifactDir: string
outputDir: string
mcpUrl?: string
infinityAppUrl?: string
}
export interface Grader {
name: string
grade(input: GraderInput): Promise<GraderResult>
}


@@ -1,72 +1,10 @@
#!/usr/bin/env bun
import { parseArgs } from 'node:util'
import { runEval } from './runner/eval-runner'
import { runCli } from './cli'
const { values } = parseArgs({
args: Bun.argv.slice(2),
options: {
config: { type: 'string', short: 'c' },
help: { type: 'boolean', short: 'h' },
},
})
if (values.help) {
console.log(`
BrowserOS Eval
Usage:
bun run eval # Opens dashboard in config mode
bun run eval --config <config.json> # Runs eval with config file
Available agent types:
- single Single LLM agent driven by the BrowserOS tool loop
- orchestrator-executor High-level planner + visual/text executor
Available graders:
- performance_grader Multi-axis grader using Claude Agent SDK
- agisdk_state_diff AGI SDK / REAL Bench state-diff grader
- infinity_state WebArena-Infinity verifier-script grader
Preset configs in configs/:
- browseros-agent-weekly.json Weekly eval (single agent)
- browseros-oe-agent-weekly.json Weekly eval (orchestrator + LLM executor)
- browseros-oe-clado-weekly.json Weekly eval (orchestrator + Clado executor)
- agisdk-real-smoke.json AGI SDK smoke run
- infinity-hard-50.json WebArena-Infinity hard-50 set
- test-webvoyager.json WebVoyager test
- test-mind2web.json Mind2Web test
Examples:
bun run eval # Dashboard config mode
bun run eval -c configs/browseros-agent-weekly.json
bun run eval -c configs/test-webvoyager.json
`)
process.exit(0)
}
if (values.config) {
try {
await runEval({ configPath: values.config })
} catch (error) {
console.error(error instanceof Error ? error.message : String(error))
process.exit(1)
}
process.exit(0)
} else {
// No config — start dashboard in config mode, wait for user to configure and run
const { startDashboard } = await import('./dashboard/server')
startDashboard({
tasks: [],
configName: '',
agentType: '',
outputDir: '',
configMode: true,
})
console.log(
'Dashboard running at http://localhost:9900 — configure and run from the UI',
)
// Keep process alive until SIGINT
await new Promise(() => {})
try {
await runCli(Bun.argv.slice(2))
} catch (error) {
console.error(error instanceof Error ? error.message : String(error))
process.exit(1)
}

@@ -0,0 +1,28 @@
import type {
ViewerManifest,
ViewerManifestTask,
} from '../viewer/viewer-manifest'
export interface R2UploadConfig {
accountId: string
accessKeyId: string
secretAccessKey: string
bucket: string
cdnBaseUrl: string
}
export type R2ManifestTask = ViewerManifestTask
export type R2RunManifest = ViewerManifest
export interface R2PublishRunResult {
runId: string
uploadedFiles: number
viewerUrl: string
manifest: R2RunManifest
}
export interface R2PublishPathResult {
uploadedRuns: R2PublishRunResult[]
skippedRuns: string[]
}

@@ -0,0 +1,427 @@
import { readdir, readFile, stat } from 'node:fs/promises'
import { basename, dirname, extname, join } from 'node:path'
import {
GetObjectCommand,
PutObjectCommand,
S3Client,
} from '@aws-sdk/client-s3'
import {
buildViewerManifest,
type ViewerManifestTaskInput,
} from '../viewer/viewer-manifest'
import type {
R2PublishPathResult,
R2PublishRunResult,
R2RunManifest,
R2UploadConfig,
} from './r2-manifest'
const DEFAULT_CONCURRENCY = 20
const CONTENT_TYPES: Record<string, string> = {
'.json': 'application/json',
'.jsonl': 'application/x-ndjson',
'.png': 'image/png',
'.html': 'text/html',
}
export interface R2Client {
send(command: unknown): Promise<unknown>
}
export interface R2PublisherOptions {
config: R2UploadConfig
client?: R2Client
viewerPath?: string
concurrency?: number
now?: () => Date
}
interface UploadJob {
key: string
filePath: string
contentType: string
}
interface TaskDirEntry {
taskId: string
taskPath: string
}
export function contentTypeForPath(filePath: string): string {
return CONTENT_TYPES[extname(filePath)] || 'application/octet-stream'
}
export function loadR2ConfigFromEnv(
env: Record<string, string | undefined> = process.env,
): R2UploadConfig {
const accountId = env.EVAL_R2_ACCOUNT_ID
const accessKeyId = env.EVAL_R2_ACCESS_KEY_ID
const secretAccessKey = env.EVAL_R2_SECRET_ACCESS_KEY
if (!accountId || !accessKeyId || !secretAccessKey) {
throw new Error(
'Missing required env vars: EVAL_R2_ACCOUNT_ID, EVAL_R2_ACCESS_KEY_ID, EVAL_R2_SECRET_ACCESS_KEY',
)
}
return {
accountId,
accessKeyId,
secretAccessKey,
bucket: env.EVAL_R2_BUCKET || 'browseros-eval',
cdnBaseUrl: (
env.EVAL_R2_CDN_BASE_URL || 'https://eval.browseros.com'
).replace(/\/+$/, ''),
}
}
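The env loading above does two normalizations worth noting: the bucket falls back to `browseros-eval`, and trailing slashes are stripped from the CDN base URL so later key joins don't produce `//`. A trimmed sketch of just that logic (the helper name is ours, the env var names are from the function above):

```typescript
// Illustrative helper: mirrors the bucket default and the trailing-slash
// normalization in loadR2ConfigFromEnv, without the required-credential checks.
function resolveCdnAndBucket(env: Record<string, string | undefined>) {
  return {
    bucket: env.EVAL_R2_BUCKET || 'browseros-eval',
    cdnBaseUrl: (
      env.EVAL_R2_CDN_BASE_URL || 'https://eval.browseros.com'
    ).replace(/\/+$/, ''),
  }
}

const defaults = resolveCdnAndBucket({})
const custom = resolveCdnAndBucket({
  EVAL_R2_BUCKET: 'my-bucket',
  EVAL_R2_CDN_BASE_URL: 'https://cdn.example.com///',
})
```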
export function createR2Client(config: R2UploadConfig): S3Client {
return new S3Client({
region: 'auto',
endpoint: `https://${config.accountId}.r2.cloudflarestorage.com`,
credentials: {
accessKeyId: config.accessKeyId,
secretAccessKey: config.secretAccessKey,
},
})
}
async function collectFiles(dir: string): Promise<string[]> {
const files: string[] = []
const entries = await readdir(dir, { withFileTypes: true })
for (const entry of entries) {
const full = join(dir, entry.name)
if (entry.isDirectory()) {
files.push(...(await collectFiles(full)))
} else {
files.push(full)
}
}
return files
}
async function runPool<T>(
items: T[],
concurrency: number,
fn: (item: T) => Promise<void>,
): Promise<void> {
let i = 0
const workers = Array.from({ length: concurrency }, async () => {
while (i < items.length) {
const idx = i++
await fn(items[idx])
}
})
await Promise.all(workers)
}
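`runPool` needs no locks because the `i < items.length` check and `i++` run synchronously between awaits on JavaScript's single thread, so no index is ever handed to two workers. A standalone copy with illustrative instrumentation (the counters are ours) shows the concurrency bound:

```typescript
// Same shape as runPool above: N workers share one cursor into the items
// array, so at most `concurrency` fn() calls are in flight at once.
async function runPool<T>(
  items: T[],
  concurrency: number,
  fn: (item: T) => Promise<void>,
): Promise<void> {
  let i = 0
  const workers = Array.from({ length: concurrency }, async () => {
    while (i < items.length) {
      const idx = i++
      await fn(items[idx])
    }
  })
  await Promise.all(workers)
}

// Instrumentation: track peak in-flight calls and which items were seen.
let inFlight = 0
let peak = 0
const seen: number[] = []
await runPool([1, 2, 3, 4, 5], 2, async (n) => {
  inFlight++
  peak = Math.max(peak, inFlight)
  await new Promise((r) => setTimeout(r, 5))
  seen.push(n)
  inFlight--
})
```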
async function hasMetadata(dir: string): Promise<boolean> {
const metaStat = await stat(join(dir, 'metadata.json')).catch(() => null)
return !!metaStat?.isFile()
}
async function findTaskDirs(runDir: string): Promise<TaskDirEntry[]> {
const entries = await readdir(runDir, { withFileTypes: true })
const legacyTasks: TaskDirEntry[] = []
for (const entry of entries) {
if (!entry.isDirectory() || entry.name === 'tasks') continue
const taskPath = join(runDir, entry.name)
if (await hasMetadata(taskPath)) {
legacyTasks.push({
taskId: entry.name,
taskPath,
})
}
}
const tasksRoot = join(runDir, 'tasks')
const canonicalEntries = await readdir(tasksRoot, {
withFileTypes: true,
}).catch(() => [])
const canonicalTasks: TaskDirEntry[] = []
for (const entry of canonicalEntries) {
if (!entry.isDirectory()) continue
const taskPath = join(tasksRoot, entry.name)
if (await hasMetadata(taskPath)) {
canonicalTasks.push({
taskId: entry.name,
taskPath,
})
}
}
return legacyTasks.length > 0 ? legacyTasks : canonicalTasks
}
async function isRunDir(dir: string): Promise<boolean> {
return (await findTaskDirs(dir)).length > 0
}
async function collectRunRootFiles(runDir: string): Promise<UploadJob[]> {
const entries = await readdir(runDir, { withFileTypes: true })
return entries
.filter((entry) => entry.isFile())
.map((entry) => {
const filePath = join(runDir, entry.name)
return {
key: entry.name,
filePath,
contentType: contentTypeForPath(filePath),
}
})
}
function statusFromMetadata(meta: Record<string, unknown>): string {
return meta.termination_reason === 'completed'
? 'completed'
: ((meta.termination_reason as string | undefined) ?? 'unknown')
}
function runIdForDir(runDir: string): string {
const timestamp = basename(runDir)
const configName = basename(dirname(runDir))
return `${configName}-${timestamp}`
}
/** Publishes eval artifacts in the viewer-compatible R2 layout. */
export class R2Publisher {
private readonly client: R2Client
private readonly config: R2UploadConfig
private readonly viewerPath: string
private readonly concurrency: number
private readonly now: () => Date
constructor(options: R2PublisherOptions) {
this.config = options.config
this.client = options.client ?? createR2Client(options.config)
this.viewerPath =
options.viewerPath ??
join(import.meta.dirname, '..', 'dashboard', 'viewer.html')
this.concurrency = options.concurrency ?? DEFAULT_CONCURRENCY
this.now = options.now ?? (() => new Date())
}
async isUploaded(runId: string): Promise<boolean> {
try {
await this.client.send(
new GetObjectCommand({
Bucket: this.config.bucket,
Key: `runs/${runId}/manifest.json`,
}),
)
return true
} catch {
return false
}
}
async publishPath(inputDir: string): Promise<R2PublishPathResult> {
const dirStat = await stat(inputDir).catch(() => null)
if (!dirStat?.isDirectory()) {
throw new Error(`Not a directory: ${inputDir}`)
}
if (await isRunDir(inputDir)) {
const result = await this.publishRun(inputDir, runIdForDir(inputDir))
return { uploadedRuns: [result], skippedRuns: [] }
}
const configName = basename(inputDir)
const entries = await readdir(inputDir, { withFileTypes: true })
const runDirs = entries
.filter((entry) => entry.isDirectory())
.map((entry) => entry.name)
.sort()
if (runDirs.length === 0) {
throw new Error('No run subdirectories found')
}
const uploadedRuns: R2PublishRunResult[] = []
const skippedRuns: string[] = []
for (const dir of runDirs) {
const runId = `${configName}-${dir}`
if (await this.isUploaded(runId)) {
skippedRuns.push(runId)
continue
}
uploadedRuns.push(await this.publishRun(join(inputDir, dir), runId))
}
return { uploadedRuns, skippedRuns }
}
async publishRun(
runDir: string,
runId: string = runIdForDir(runDir),
): Promise<R2PublishRunResult> {
const taskEntries = await findTaskDirs(runDir)
if (taskEntries.length === 0) {
throw new Error(`No task subdirectories in ${runId}`)
}
const manifestTasks: ViewerManifestTaskInput[] = []
const jobs: UploadJob[] = (await collectRunRootFiles(runDir)).map(
(job) => ({
...job,
key: `runs/${runId}/${job.key}`,
}),
)
let agentConfig: Record<string, unknown> | undefined
let dataset: string | undefined
for (const taskDirEntry of taskEntries) {
const { taskId, taskPath } = taskDirEntry
const meta = await this.readMetadata(taskPath)
if (!meta) continue
if (!agentConfig && meta.agent_config) {
agentConfig = meta.agent_config as Record<string, unknown>
}
if (!dataset && meta.dataset) dataset = meta.dataset as string
const files = await collectFiles(taskPath)
let screenshotCount = 0
for (const file of files) {
const relative = file.slice(taskPath.length + 1)
if (relative.startsWith('screenshots/') && extname(file) === '.png') {
screenshotCount++
}
// Keep legacy keys during the manifest v2 rollout so cached viewers and
// old manifests can still resolve task artifacts.
jobs.push({
key: `runs/${runId}/${taskId}/${relative}`,
filePath: file,
contentType: contentTypeForPath(file),
})
jobs.push({
key: `runs/${runId}/tasks/${taskId}/${relative}`,
filePath: file,
contentType: contentTypeForPath(file),
})
}
manifestTasks.push({
queryId: (meta.query_id as string | undefined) || taskId,
artifactId: taskId,
query: (meta.query as string | undefined) || '',
startUrl: (meta.start_url as string | undefined) || '',
status: statusFromMetadata(meta),
durationMs: (meta.total_duration_ms as number | undefined) || 0,
screenshotCount:
(meta.screenshot_count as number | undefined) || screenshotCount,
graderResults:
(meta.grader_results as ViewerManifestTaskInput['graderResults']) ||
{},
})
}
if (manifestTasks.length === 0) {
throw new Error(`No completed tasks in ${runId}`)
}
let uploaded = 0
await runPool(jobs, this.concurrency, async (job) => {
await this.uploadFile(job)
uploaded++
})
const manifest = await this.buildManifest(
runDir,
runId,
agentConfig,
dataset,
manifestTasks,
)
await this.uploadBuffer(
`runs/${runId}/manifest.json`,
Buffer.from(JSON.stringify(manifest, null, 2)),
'application/json',
)
await this.uploadBuffer(
'viewer.html',
await readFile(this.viewerPath),
'text/html',
)
return {
runId,
uploadedFiles: uploaded + 2,
viewerUrl: `${this.config.cdnBaseUrl}/viewer.html?run=${encodeURIComponent(runId)}`,
manifest,
}
}
private async readMetadata(
taskPath: string,
): Promise<Record<string, unknown> | null> {
try {
return JSON.parse(
await readFile(join(taskPath, 'metadata.json'), 'utf-8'),
) as Record<string, unknown>
} catch {
return null
}
}
private async buildManifest(
runDir: string,
runId: string,
agentConfig: Record<string, unknown> | undefined,
dataset: string | undefined,
tasks: ViewerManifestTaskInput[],
): Promise<R2RunManifest> {
let summaryData: Record<string, unknown> | undefined
try {
summaryData = JSON.parse(
await readFile(join(runDir, 'summary.json'), 'utf-8'),
) as Record<string, unknown>
} catch {}
return buildViewerManifest({
runId,
uploadedAt: this.now().toISOString(),
agentConfig,
dataset,
summary: summaryData
? {
passRate: summaryData.passRate,
avgDurationMs: summaryData.avgDurationMs,
}
: undefined,
tasks,
})
}
private async uploadFile(job: UploadJob): Promise<void> {
await this.uploadBuffer(
job.key,
await readFile(job.filePath),
job.contentType,
)
}
private async uploadBuffer(
key: string,
body: Buffer,
contentType: string,
): Promise<void> {
await this.client.send(
new PutObjectCommand({
Bucket: this.config.bucket,
Key: key,
Body: body,
ContentType: contentType,
}),
)
}
}
export async function publishPathToR2(
inputDir: string,
): Promise<R2PublishPathResult> {
const config = loadR2ConfigFromEnv()
return new R2Publisher({ config }).publishPath(inputDir)
}

@@ -0,0 +1,104 @@
export interface ReportManifestTask {
queryId: string
query?: string
status: string
durationMs: number
screenshotCount?: number
paths?: Record<string, string>
graderResults?: Record<string, { pass?: boolean; score?: number }>
}
export interface ReportManifest {
schemaVersion?: number
runId: string
uploadedAt?: string
agentConfig?: { type?: string; model?: string }
dataset?: string
summary?: { passRate?: number; avgDurationMs?: number }
tasks?: ReportManifestTask[]
}
export interface RunSummary {
runId: string
configName: string
date: string
avgScore: number
total: number
completed: number
failed: number
timeout: number
avgDurationMs: number
model: string
dataset: string
agentType: string
}
// Report score uses the primary pass/fail grader so mixed-grader runs keep
// the same precedence as the eval summary.
const PASS_FAIL_GRADER_ORDER = [
'agisdk_state_diff',
'infinity_state',
'performance_grader',
]
export function extractConfigName(runId: string): string {
return runId.replace(/-\d{4}-\d{2}-\d{2}-\d{4}$/, '')
}
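`extractConfigName` leans on the run-id shape `{config-name}-YYYY-MM-DD-HHMM` produced at publish time; a standalone copy makes the inverse mapping visible (the sample ids are illustrative):

```typescript
// Copy of extractConfigName above: strips the trailing -YYYY-MM-DD-HHMM
// timestamp from a run id; ids without one pass through unchanged.
function extractConfigName(runId: string): string {
  return runId.replace(/-\d{4}-\d{2}-\d{2}-\d{4}$/, '')
}

const weekly = extractConfigName('browseros-agent-weekly-2026-04-30-1118')
const bare = extractConfigName('agisdk-real-smoke')
```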
function reportDate(manifest: ReportManifest): string {
if (!manifest.uploadedAt) return 'unknown'
const [date, time] = manifest.uploadedAt.split('T')
return `${date} ${time?.slice(0, 5) || ''}`
}
function primaryScore(task: ReportManifestTask): number | null {
if (!task.graderResults) return null
for (const name of PASS_FAIL_GRADER_ORDER) {
const result = task.graderResults[name]
if (result) return result.score ?? 0
}
return null
}
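A standalone copy of `primaryScore` (input type trimmed to what the function reads) makes the precedence concrete: the first grader in the list with any result wins regardless of later entries, and a result with no score counts as 0:

```typescript
const PASS_FAIL_GRADER_ORDER = [
  'agisdk_state_diff',
  'infinity_state',
  'performance_grader',
]

// Trimmed input shape: only graderResults is consulted.
interface ScoredTask {
  graderResults?: Record<string, { pass?: boolean; score?: number }>
}

// Copy of primaryScore above.
function primaryScore(task: ScoredTask): number | null {
  if (!task.graderResults) return null
  for (const name of PASS_FAIL_GRADER_ORDER) {
    const result = task.graderResults[name]
    if (result) return result.score ?? 0
  }
  return null
}
```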
export function buildRunSummaries(manifests: ReportManifest[]): RunSummary[] {
return manifests
.map((manifest) => {
const tasks = Array.isArray(manifest.tasks) ? manifest.tasks : []
const total = tasks.length
const completed = tasks.filter((t) => t.status === 'completed').length
const failed = tasks.filter((t) => t.status === 'failed').length
const timeout = tasks.filter((t) => t.status === 'timeout').length
let scoredCount = 0
let scoreSum = 0
for (const task of tasks) {
const score = primaryScore(task)
if (score === null) continue
scoredCount++
scoreSum += score
}
const durations = tasks
.filter((t) => t.durationMs > 0)
.map((t) => t.durationMs)
return {
runId: manifest.runId,
configName: extractConfigName(manifest.runId),
date: reportDate(manifest),
avgScore: scoredCount > 0 ? (scoreSum / scoredCount) * 100 : 0,
total,
completed,
failed,
timeout,
avgDurationMs:
durations.length > 0
? durations.reduce((a, b) => a + b, 0) / durations.length
: 0,
model: manifest.agentConfig?.model || 'unknown',
dataset: manifest.dataset || manifest.runId,
agentType: manifest.agentConfig?.type || 'unknown',
}
})
.sort((a, b) => a.date.localeCompare(b.date))
}

@@ -14,8 +14,11 @@
*/
import {
closeSync,
existsSync,
mkdirSync,
mkdtempSync,
openSync,
readFileSync,
rmSync,
writeFileSync,
@@ -33,7 +36,17 @@ export interface EvalPorts {
const MAX_RESTART_ATTEMPTS = 3
const CDP_WAIT_TIMEOUT_MS = 30_000
const SERVER_HEALTH_TIMEOUT_MS = 30_000
// Bumped from 30s → 90s while debugging dev-CI startup. Dev's server module
// graph is ~108 files larger than main's; cold-cache module load on a CI
// runner can take much longer than the original 30s budget allowed.
const SERVER_HEALTH_TIMEOUT_MS = 90_000
// Where per-worker server stderr is written. Captured (rather than ignored)
// so eval-weekly.yml can upload these as workflow artifacts on failure for
// post-mortem debugging. Path is also referenced in the workflow's artifact
// upload step.
const SERVER_LOG_DIR =
process.env.BROWSEROS_SERVER_LOG_DIR || '/tmp/browseros-server-logs'
const MONOREPO_ROOT = join(
dirname(fileURLToPath(import.meta.url)),
@@ -53,6 +66,7 @@ export class BrowserOSAppManager {
private ports: EvalPorts
private chromeProc: Subprocess | null = null
private serverProc: Subprocess | null = null
private serverLogFd: number | null = null
private tempDir: string | null = null
private readonly workerIndex: number
private readonly loadExtensions: boolean
@@ -183,15 +197,36 @@ export class BrowserOSAppManager {
VITE_BROWSEROS_SERVER_PORT: String(server),
}
// Capture both stdout and stderr to a per-worker file so we can
// post-mortem startup hangs. The server uses pino which writes logs to
// stdout by default — capturing stderr alone misses everything. The
// eval-weekly workflow uploads /tmp/browseros-server-logs/ as a workflow
// artifact on failure.
// Open the per-worker log file under SERVER_LOG_DIR. If the directory
// can't be created or the file can't be opened (e.g. unwritable custom
// BROWSEROS_SERVER_LOG_DIR), fall back to /dev/null so spawn still works.
const logPath = join(SERVER_LOG_DIR, `server-W${this.workerIndex}.log`)
let logFd: number
try {
mkdirSync(SERVER_LOG_DIR, { recursive: true })
logFd = openSync(logPath, 'a')
} catch {
logFd = openSync('/dev/null', 'w')
}
this.serverLogFd = logFd
// `start:ci` skips `--watch` (no file-watcher overhead in CI). Falls back
// to the regular `start` script outside CI for the dev-watch experience.
const startScript = process.env.CI ? 'start:ci' : 'start'
this.serverProc = spawn({
cmd: ['bun', 'run', '--filter', '@browseros/server', 'start'],
cmd: ['bun', 'run', '--filter', '@browseros/server', startScript],
cwd: MONOREPO_ROOT,
stdout: 'ignore',
stderr: 'ignore',
stdout: logFd,
stderr: logFd,
env: serverEnv,
})
console.log(
` [W${this.workerIndex}] Server started (PID: ${this.serverProc.pid})`,
` [W${this.workerIndex}] Server started (PID: ${this.serverProc.pid}, logs → ${logPath})`,
)
// --- Wait for Server Health ---
@@ -244,6 +279,18 @@ export class BrowserOSAppManager {
await this.killProcess(this.serverProc)
this.serverProc = null
// Close the parent's copy of the server log fd. Child kept its own dup
// until it exited above, so closing here doesn't truncate any output.
// Without this we'd leak one fd per restart attempt across all workers.
if (this.serverLogFd !== null) {
try {
closeSync(this.serverLogFd)
} catch {
// already closed or invalid — ignore
}
this.serverLogFd = null
}
// Kill Chrome (graceful → force)
await this.killProcess(this.chromeProc)
this.chromeProc = null

@@ -1,362 +1 @@
import { mkdir, writeFile } from 'node:fs/promises'
import { basename, dirname, join, resolve } from 'node:path'
import {
dashboardState,
setActiveExecutor,
startDashboard,
stopDashboard,
} from '../dashboard/server'
import type { ErrorSource, EvalConfig, Task } from '../types'
import {
printValidationResult,
validateConfig,
} from '../utils/config-validator'
import { ParallelExecutor } from './parallel-executor'
import {
getTaskSourceDescription,
loadTasks,
TaskLoadError,
} from './task-loader'
import type {
BatchSummary,
RunEvalOptions,
TaskResult,
TaskResultSummary,
TaskSource,
} from './types'
import { getPrimaryGraderResult, isSuccessfulResult } from './types'
// ============================================================================
// Main Entry Point
// ============================================================================
export async function runEval(options: RunEvalOptions): Promise<void> {
// Step 1: Validate configuration
const config = await loadAndValidateConfig(options.configPath)
// Step 2: Resolve paths relative to config location
const configDir = dirname(resolve(options.configPath))
const resolvedPaths = resolvePaths(options, config, configDir)
// Log configuration
console.log('Eval Configuration:')
console.log(` Config: ${options.configPath}`)
console.log(` Dataset: ${resolvedPaths.dataPath}`)
console.log(` Output: ${resolvedPaths.outputDir}`)
console.log(` Workers: ${config.num_workers}`)
console.log(` Agent: ${config.agent.type}`)
console.log()
// Step 3: Load tasks
const taskSource = resolveTaskSource(options, resolvedPaths.dataPath)
const { tasks } = await loadTasksWithLogging(taskSource)
// Step 4: Setup
await mkdir(resolvedPaths.outputDir, { recursive: true })
// Step 5: Start dashboard
startDashboard({
tasks,
configName: options.configPath,
agentType: config.agent.type,
outputDir: resolvedPaths.outputDir,
})
// Step 6: Execute tasks (parallel or sequential based on num_workers)
const results = await executeTasks(tasks, config, resolvedPaths.outputDir)
// Step 7: Summary
const summary = buildSummary(results)
await saveSummary(summary, resolvedPaths.outputDir)
printSummary(summary)
console.log(`\nResults saved to: ${resolvedPaths.outputDir}`)
stopDashboard()
}
// ============================================================================
// Configuration
// ============================================================================
async function loadAndValidateConfig(configPath: string) {
console.log('Validating configuration...')
const validationResult = await validateConfig(configPath)
printValidationResult(validationResult)
if (!validationResult.valid || !validationResult.config) {
throw new Error(
'Configuration validation failed. Fix the above errors and try again.',
)
}
return validationResult.config
}
interface ResolvedPaths {
dataPath: string
outputDir: string
}
function resolvePaths(
options: RunEvalOptions,
config: EvalConfig,
configDir: string,
): ResolvedPaths {
// Resolve dataset path: use options.dataPath if provided, otherwise resolve from config
const dataPath = options.dataPath
? options.dataPath
: config.dataset.startsWith('/')
? config.dataset
: resolve(configDir, config.dataset)
// Resolve output directory: results/{config-name}/{timestamp}/
// Config name derived from config filename (e.g., "browseros-agent-weekly.json" → "browseros-agent-weekly")
const configName = options.configPath
? basename(resolve(options.configPath), '.json')
: 'eval'
const timestamp = formatTimestamp(new Date())
const resultsBase = config.output_dir
? config.output_dir.startsWith('/')
? config.output_dir
: resolve(configDir, config.output_dir)
: resolve(configDir, '..', 'results')
const outputDir = join(resultsBase, configName, timestamp)
return { dataPath, outputDir }
}
function formatTimestamp(date: Date): string {
const y = date.getFullYear()
const m = String(date.getMonth() + 1).padStart(2, '0')
const d = String(date.getDate()).padStart(2, '0')
const h = String(date.getHours()).padStart(2, '0')
const min = String(date.getMinutes()).padStart(2, '0')
return `${y}-${m}-${d}-${h}${min}`
}
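Because every component is zero-padded, the resulting `{timestamp}` directory names sort lexicographically in run listings. A copy of the function shows the shape it emits:

```typescript
// Copy of formatTimestamp above: local-time YYYY-MM-DD-HHMM, zero-padded.
function formatTimestamp(date: Date): string {
  const y = date.getFullYear()
  const m = String(date.getMonth() + 1).padStart(2, '0')
  const d = String(date.getDate()).padStart(2, '0')
  const h = String(date.getHours()).padStart(2, '0')
  const min = String(date.getMinutes()).padStart(2, '0')
  return `${y}-${m}-${d}-${h}${min}`
}

// Local-time constructor, so the output is deterministic regardless of TZ.
const stamp = formatTimestamp(new Date(2026, 3, 30, 9, 5))
```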
// ============================================================================
// Task Loading
// ============================================================================
function resolveTaskSource(
options: RunEvalOptions,
dataPath: string,
): TaskSource {
// If query is provided, use single task mode
if (options.query) {
return { type: 'single', query: options.query, startUrl: options.startUrl }
}
// Otherwise use file mode with the resolved dataPath
return { type: 'file', path: dataPath }
}
async function loadTasksWithLogging(
source: TaskSource,
): Promise<{ tasks: Awaited<ReturnType<typeof loadTasks>>['tasks'] }> {
console.log(`Loading tasks from ${getTaskSourceDescription(source)}...`)
try {
const result = await loadTasks(source)
console.log(`Loaded ${result.tasks.length} task(s)`)
return { tasks: result.tasks }
} catch (error) {
if (error instanceof TaskLoadError) {
throw new Error(`Failed to load tasks: ${error.message}`)
}
throw new Error(`Failed to load tasks: ${error}`)
}
}
// ============================================================================
// Task Execution
// ============================================================================
async function executeTasks(
tasks: Task[],
config: EvalConfig,
outputDir: string,
): Promise<TaskResult[]> {
console.log(`\n${'='.repeat(60)}`)
console.log('STARTING EVALUATION')
console.log(`${'='.repeat(60)}\n`)
const numWorkers = config.num_workers || 1
console.log(`Running with ${numWorkers} worker(s)`)
if (config.restart_server_per_task) {
console.log(`Server restart per task: enabled`)
}
console.log()
const executor = new ParallelExecutor({
numWorkers,
config,
outputDir,
restartServerPerTask: config.restart_server_per_task,
onEvent: (taskId, event) =>
dashboardState.broadcastStreamEvent(taskId, event),
})
// Register so dashboard stop button works for CLI runs too
setActiveExecutor(executor)
try {
return await executor.execute(tasks, (completed, total, task, result) => {
printTaskProgress(completed, total, task, result)
})
} finally {
setActiveExecutor(null)
}
}
function printTaskProgress(
completed: number,
total: number,
task: Task,
result: TaskResult,
): void {
const status =
result.status === 'completed'
? 'DONE'
: result.status === 'timeout'
? 'TIMEOUT'
: 'FAILED'
const duration =
result.durationMs > 0 ? ` (${(result.durationMs / 1000).toFixed(1)}s)` : ''
console.log(`[${completed}/${total}] ${task.query_id}: ${status}${duration}`)
if (result.status === 'failed') {
console.log(` ERROR: ${result.error.message}`)
} else if (isSuccessfulResult(result)) {
// Log agent errors (e.g., LLM API failures) even if task "completed"
if (result.agentResult.metadata.errors?.length) {
for (const err of result.agentResult.metadata.errors) {
console.log(` ERROR [${err.source}]: ${err.message}`)
}
}
for (const [name, gr] of Object.entries(result.graderResults)) {
const icon = gr.pass ? 'PASS' : 'FAIL'
console.log(` ${name}: ${icon}`)
}
}
}
// ============================================================================
// Summary
// ============================================================================
function buildSummary(results: TaskResult[]): BatchSummary {
// Track errors by source
const errorsBySource: Partial<Record<ErrorSource, number>> = {}
let totalWarnings = 0
const taskSummaries: TaskResultSummary[] = results.map((r) => {
let errorCount = 0
let warningCount = 0
let errorSources: ErrorSource[] | undefined
let failureReason: string | undefined
if (isSuccessfulResult(r)) {
// Count errors and warnings from agent metadata
errorCount = r.agentResult.metadata.errors?.length ?? 0
warningCount = r.agentResult.metadata.warnings?.length ?? 0
totalWarnings += warningCount
// Track error sources
if (r.agentResult.metadata.errors?.length) {
errorSources = r.agentResult.metadata.errors.map((e) => e.source)
for (const err of r.agentResult.metadata.errors) {
errorsBySource[err.source] = (errorsBySource[err.source] ?? 0) + 1
}
}
} else {
// Failed task
errorCount = 1
errorSources = [r.errorSource]
failureReason = r.error.message
errorsBySource[r.errorSource] = (errorsBySource[r.errorSource] ?? 0) + 1
}
return {
queryId: r.task.query_id,
status: r.status,
durationMs: r.durationMs,
graderResults: isSuccessfulResult(r)
? Object.fromEntries(
Object.entries(r.graderResults).map(([name, gr]) => [
name,
{ pass: gr.pass, score: gr.score },
]),
)
: undefined,
errorCount,
warningCount,
errorSources: errorSources?.length ? errorSources : undefined,
failureReason,
}
})
const completed = results.filter((r) => r.status === 'completed').length
const timeout = results.filter((r) => r.status === 'timeout').length
const failed = results.filter((r) => r.status === 'failed').length
// Calculate pass rate using primary grader (fallback order)
let totalGraded = 0
let totalPasses = 0
for (const result of results) {
if (isSuccessfulResult(result)) {
const primary = getPrimaryGraderResult(result.graderResults)
if (primary) {
totalGraded++
if (primary.pass) totalPasses++
}
}
}
const passRate = totalGraded > 0 ? totalPasses / totalGraded : 0
// Calculate average duration for non-failed tasks
const durations = results
.filter((r) => r.status !== 'failed')
.map((r) => r.durationMs)
const avgDurationMs =
durations.length > 0
? durations.reduce((a, b) => a + b, 0) / durations.length
: 0
return {
total: results.length,
completed,
failed,
timeout,
passRate,
avgDurationMs,
errorsBySource,
totalWarnings,
results: taskSummaries,
}
}
async function saveSummary(
summary: BatchSummary,
outputDir: string,
): Promise<void> {
await writeFile(
join(outputDir, 'summary.json'),
JSON.stringify(summary, null, 2),
)
}
function printSummary(summary: BatchSummary): void {
console.log('='.repeat(60))
console.log('EVALUATION COMPLETE')
console.log('='.repeat(60))
console.log(`Total: ${summary.total} tasks`)
console.log(` Completed: ${summary.completed}`)
console.log(` Timeout: ${summary.timeout}`)
console.log(` Failed: ${summary.failed}`)
console.log(` Pass Rate: ${(summary.passRate * 100).toFixed(1)}%`)
console.log(` Avg Duration: ${(summary.avgDurationMs / 1000).toFixed(1)}s`)
}
export { runEval } from '../runs/eval-runner'

@@ -1,266 +1,5 @@
/**
* Parallel Executor
*
* Each worker gets its own isolated BrowserOS stack:
* - BrowserOSAppManager (Chrome + Server on unique ports)
* - TaskExecutor (uses that worker's server URL)
*
* Port allocation: Worker N → CDP=base+N, Server=base+N, Extension=base+N
*/
import type { EvalConfig, Task } from '../types'
import { BrowserOSAppManager, type EvalPorts } from './browseros-app-manager'
import { createTaskExecutor } from './task-executor'
import type { TaskResult } from './types'
// ============================================================================
// Types
// ============================================================================
export interface ParallelExecutorConfig {
numWorkers: number
config: EvalConfig
outputDir: string
restartServerPerTask?: boolean
onEvent?: (taskId: string, event: Record<string, unknown>) => void
}
export type ProgressCallback = (
completed: number,
total: number,
task: Task,
result: TaskResult,
) => void
// ============================================================================
// Task Queue (safe under single-threaded async: next() checks and advances the index with no await in between)
// ============================================================================
class TaskQueue {
private tasks: Task[]
private index: number = 0
private stopped: boolean = false
constructor(tasks: Task[]) {
this.tasks = [...tasks]
}
next(): Task | null {
if (this.stopped || this.index >= this.tasks.length) return null
return this.tasks[this.index++]
}
stop(): void {
this.stopped = true
}
}
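`TaskQueue` is the same single-cursor idea as the upload pool: `next()` reads and advances the index synchronously, and `stop()` drains workers by making every subsequent `next()` return `null`. A generic sketch of the class (genericized over `T` for the example; the real queue holds `Task`):

```typescript
// Generic copy of TaskQueue above. Workers loop on next() until it
// returns null, either because the queue is exhausted or stop() was called.
class TaskQueue<T> {
  private index = 0
  private stopped = false
  constructor(private readonly tasks: T[]) {}

  next(): T | null {
    if (this.stopped || this.index >= this.tasks.length) return null
    return this.tasks[this.index++]
  }

  stop(): void {
    this.stopped = true
  }
}
```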
// ============================================================================
// Parallel Executor
// ============================================================================
export class ParallelExecutor {
private readonly numWorkers: number
private readonly appManagers = new Map<number, BrowserOSAppManager>()
private completedCount: number = 0
private readonly resultLock = new Map<string, TaskResult>()
private queue: TaskQueue | null = null
constructor(private readonly config: ParallelExecutorConfig) {
this.numWorkers = Math.max(1, config.numWorkers)
}
async stop(): Promise<void> {
console.log('\nStopping eval run...')
this.queue?.stop()
const kills = [...this.appManagers.values()].map((m) => m.killApp())
await Promise.allSettled(kills)
}
async execute(
tasks: Task[],
onProgress?: ProgressCallback,
): Promise<TaskResult[]> {
if (tasks.length === 0) return []
const cleanup = this.setupSignalHandlers()
const loadExtensions = this.config.config.browseros.load_extensions ?? false
// Patch NopeCHA API key before launching any workers
const captchaConfig = this.config.config.captcha
if (captchaConfig) {
const apiKey = process.env[captchaConfig.api_key_env]
if (apiKey) {
BrowserOSAppManager.patchNopechaApiKey(apiKey)
}
}
this.queue = new TaskQueue(tasks)
const totalTasks = tasks.length
try {
const queue = this.queue
// Launch N workers in parallel — each gets its own Chrome + Server
const workers = Array.from({ length: this.numWorkers }, (_, i) =>
this.runWorker(i, queue, totalTasks, loadExtensions, onProgress),
)
await Promise.all(workers)
// Return results in original task order
return tasks.map((task) => {
const result = this.resultLock.get(task.query_id)
if (!result) {
return {
status: 'failed' as const,
task,
error: new Error('Task result not found'),
errorSource: 'unknown' as const,
durationMs: 0,
}
}
return result
})
} finally {
cleanup()
}
}
private async runWorker(
workerIndex: number,
queue: TaskQueue,
totalTasks: number,
loadExtensions: boolean,
onProgress?: ProgressCallback,
): Promise<void> {
// Per-worker isolated ports
const basePorts: EvalPorts = {
cdp: this.config.config.browseros.base_cdp_port,
server: this.config.config.browseros.base_server_port,
extension: this.config.config.browseros.base_extension_port,
}
const headless = this.config.config.browseros.headless ?? false
const appManager = new BrowserOSAppManager(
workerIndex,
basePorts,
loadExtensions,
headless,
)
this.appManagers.set(workerIndex, appManager)
// Per-worker executor pointing to this worker's server
const workerConfig: typeof this.config.config = {
...this.config.config,
browseros: {
...this.config.config.browseros,
server_url: appManager.getServerUrl(),
},
}
const executor = createTaskExecutor(
workerConfig,
workerIndex,
this.config.outputDir,
this.config.onEvent,
)
try {
// Always start Chrome+Server once for this worker
console.log(`\n Worker ${workerIndex}: Starting BrowserOS stack...`)
await appManager.restart()
while (true) {
const task = queue.next()
if (!task) break
const taskStartTime = Date.now()
let result: TaskResult
try {
// Restart between tasks if configured
if (this.config.restartServerPerTask) {
console.log(`\n${'─'.repeat(60)}`)
console.log(` Worker ${workerIndex}: Task: ${task.query_id}`)
console.log(`${'─'.repeat(60)}`)
await appManager.restart()
}
this.config.onEvent?.(task.query_id, {
type: 'task-state',
taskId: task.query_id,
status: 'running',
})
result = await executor.execute(task)
console.log(
` Worker ${workerIndex}: ${task.query_id}: ${result.status}`,
)
} catch (error) {
console.error(
` Worker ${workerIndex}: ${task.query_id}: FAILED - ${error instanceof Error ? error.message : String(error)}`,
)
result = {
status: 'failed',
task,
error: error instanceof Error ? error : new Error(String(error)),
errorSource: 'unknown',
durationMs: Date.now() - taskStartTime,
}
}
this.resultLock.set(task.query_id, result)
this.completedCount++
// Emit task completion to dashboard
const stateEvent: Record<string, unknown> = {
type: 'task-state',
taskId: task.query_id,
status: result.status,
durationMs: result.durationMs,
}
if (result.status !== 'failed' && 'graderResults' in result) {
stateEvent.graderResults = Object.fromEntries(
Object.entries(result.graderResults).map(([name, gr]) => [
name,
{
pass: gr.pass,
score: gr.score,
reasoning: gr.reasoning,
details: gr.details,
},
]),
)
stateEvent.screenshotCount =
result.agentResult?.metadata?.total_steps ?? 0
}
this.config.onEvent?.(task.query_id, stateEvent)
onProgress?.(this.completedCount, totalTasks, task, result)
if (this.config.restartServerPerTask) {
await new Promise((resolve) => setTimeout(resolve, 2000))
}
}
} finally {
await appManager.killApp()
}
}
/**
 * Installs SIGINT/SIGTERM handlers that kill all Chrome + Server instances
 * across all workers.
 * Returns a cleanup function that removes the listeners after execute() completes.
 */
private setupSignalHandlers(): () => void {
const onSignal = async () => {
console.log('\nShutting down all workers...')
this.queue?.stop()
const kills = [...this.appManagers.values()].map((m) => m.killApp())
await Promise.allSettled(kills)
process.exit(0)
}
process.on('SIGINT', onSignal)
process.on('SIGTERM', onSignal)
return () => {
process.off('SIGINT', onSignal)
process.off('SIGTERM', onSignal)
}
}
}
export {
type ProgressCallback,
TaskWorkerPool as ParallelExecutor,
type TaskWorkerPoolConfig as ParallelExecutorConfig,
} from '../runs/task-worker-pool'
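A minimal, self-contained sketch of the worker-pool pattern used in `execute()` above: N workers drain a shared synchronous queue concurrently, each worker records results keyed by task id, and the pool maps the results back to the original input order afterwards. The `Task`/`Result` shapes and `runPool` helper here are hypothetical simplifications, not the real `TaskWorkerPool` API.

```typescript
// Hypothetical simplified shapes for illustration only.
type Task = { id: string }
type Result = { id: string; worker: number }

// Shared queue: next() is synchronous, so concurrent workers never
// receive the same task twice.
class TaskQueue {
  private i = 0
  constructor(private readonly tasks: Task[]) {}
  next(): Task | undefined {
    return this.i < this.tasks.length ? this.tasks[this.i++] : undefined
  }
}

async function runPool(tasks: Task[], numWorkers: number): Promise<Result[]> {
  const queue = new TaskQueue(tasks)
  const results = new Map<string, Result>()
  const worker = async (w: number) => {
    for (let t = queue.next(); t; t = queue.next()) {
      // Simulated async work; the real pool runs a browser task here.
      await Promise.resolve()
      results.set(t.id, { id: t.id, worker: w })
    }
  }
  // Launch N workers in parallel and wait for all of them to drain the queue.
  await Promise.all(Array.from({ length: numWorkers }, (_, i) => worker(i)))
  // Return results in original task order, regardless of completion order.
  return tasks.map((t) => results.get(t.id)!)
}
```

The key property, mirrored from the pool above, is that completion order never leaks into the returned array: results are stored by id and re-read in input order.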


@@ -1,316 +1,6 @@
import { join } from 'node:path'
import { createAgent } from '../agents'
import type { AgentContext, AgentResult } from '../agents/types'
import { CaptureContext } from '../capture/context'
import {
hasExistingGraderResults,
TrajectorySaver,
} from '../capture/trajectory-saver'
import { runGraders } from '../graders/registry'
import type { ErrorSource, EvalConfig, GraderResult, Task } from '../types'
import { callMcpTool } from '../utils/mcp-client'
import { InfinityAppManager } from './infinity-app-manager'
import type { TaskResult } from './types'
// ============================================================================
// Errors
// ============================================================================
export class TaskExecutionError extends Error {
public readonly errorSource: ErrorSource
constructor(
message: string,
public readonly task: Task,
public readonly phase:
| 'navigation'
| 'agent_execution'
| 'grading'
| 'cleanup',
public readonly cause?: Error,
) {
super(message)
this.name = 'TaskExecutionError'
this.errorSource = phase as ErrorSource
}
}
// ============================================================================
// Task Executor
// ============================================================================
export interface TaskExecutorDeps {
onEvent?: (taskId: string, event: Record<string, unknown>) => void
}
export class TaskExecutor {
constructor(
private readonly config: EvalConfig,
private readonly workerIndex: number,
private readonly outputDir: string,
private readonly deps: TaskExecutorDeps,
) {}
/**
* Resolve the initial page ID via list_pages MCP call.
* Called once per task on a fresh browser — there's exactly one page.
*/
private async resolveInitialPageId(mcpUrl: string): Promise<number> {
try {
const result = await callMcpTool(mcpUrl, 'list_pages', {})
if (!result.isError) {
const textContent = result.content?.find(
(c: { type: string }) => c.type === 'text',
)
const match = textContent?.text?.match(/^\s*(\d+)\./m)
if (match) return Number.parseInt(match[1], 10)
}
} catch {
// Fall through to default
}
// Fresh browser always has page 1
return 1
}
async execute(task: Task): Promise<TaskResult> {
const startTime = Date.now()
const mcpUrl = `${this.config.browseros.server_url}/mcp`
// Check if task already has grader results (resume capability)
const existing = await hasExistingGraderResults(
this.outputDir,
task.query_id,
)
if (existing.exists && existing.metadata) {
console.log(` Skipping: already has grader results`)
return {
status:
existing.metadata.termination_reason === 'timeout'
? 'timeout'
: 'completed',
task,
agentResult: {
metadata: existing.metadata,
messages: [],
finalAnswer: existing.metadata.final_answer,
},
graderResults: existing.metadata.grader_results,
durationMs: existing.metadata.total_duration_ms,
}
}
// Resolve page ID once — fresh browser has exactly one page
const pageId = await this.resolveInitialPageId(mcpUrl)
// For Infinity tasks, start a fresh app server per task
let infinityManager: InfinityAppManager | null = null
let actualStartUrl = task.start_url
if (task.dataset === 'webarena-infinity') {
const appName = (task.metadata?.additional as Record<string, unknown>)
?.app_name as string
const appBasePort =
((task.metadata?.additional as Record<string, unknown>)
?.app_base_port as number) || 8000
if (appName && process.env.WEBARENA_INFINITY_DIR) {
infinityManager = new InfinityAppManager(this.workerIndex, appBasePort)
try {
actualStartUrl = await infinityManager.startApp(appName)
console.log(
` Infinity app "${appName}" started on port ${infinityManager.getPort()}`,
)
} catch (error) {
throw new TaskExecutionError(
`Failed to start Infinity app: ${error instanceof Error ? error.message : String(error)}`,
task,
'navigation',
error instanceof Error ? error : undefined,
)
}
}
}
try {
// Phase 1: Set viewport + navigate to start URL
try {
await callMcpTool(mcpUrl, 'evaluate_script', {
page: pageId,
expression: 'window.resizeTo(1440, 900)',
})
} catch (vpError) {
console.warn(
` Viewport resize failed: ${vpError instanceof Error ? vpError.message : String(vpError)}`,
)
}
if (actualStartUrl && actualStartUrl !== 'about:blank') {
try {
await callMcpTool(mcpUrl, 'navigate_page', {
url: actualStartUrl,
page: pageId,
})
} catch (error) {
throw new TaskExecutionError(
`Failed to navigate to start URL: ${error instanceof Error ? error.message : String(error)}`,
task,
'navigation',
error instanceof Error ? error : undefined,
)
}
}
// Phase 2: Execute agent
const agentResult = await this.executeAgent(task, pageId)
// Phase 3: Run graders
const graderResults = await this.runGraders(
task,
agentResult,
infinityManager?.getUrl(),
)
const status =
agentResult.metadata.termination_reason === 'timeout'
? 'timeout'
: 'completed'
return {
status,
task,
agentResult,
graderResults,
durationMs: Date.now() - startTime,
}
} catch (error) {
const errorSource: ErrorSource =
error instanceof TaskExecutionError ? error.errorSource : 'unknown'
return {
status: 'failed',
task,
error: error instanceof Error ? error : new Error(String(error)),
errorSource,
durationMs: Date.now() - startTime,
}
} finally {
// Navigate to about:blank to clean up
try {
await callMcpTool(mcpUrl, 'navigate_page', {
url: 'about:blank',
page: pageId,
})
} catch {
// Ignore cleanup errors
}
// Stop Infinity app server if running
if (infinityManager) {
await infinityManager.stop().catch(() => {})
}
}
}
private async executeAgent(task: Task, pageId: number): Promise<AgentResult> {
try {
const { capture, taskOutputDir } = await CaptureContext.create({
serverUrl: this.config.browseros.server_url,
outputDir: this.outputDir,
taskId: task.query_id,
initialPageId: pageId,
onEvent: this.deps.onEvent,
})
const context: AgentContext = {
config: this.config,
task,
workerIndex: this.workerIndex,
initialPageId: pageId,
outputDir: this.outputDir,
taskOutputDir,
capture,
}
const agent = createAgent(context)
return await agent.execute()
} catch (error) {
if (error instanceof TaskExecutionError) {
throw error
}
throw new TaskExecutionError(
`Agent execution failed: ${error instanceof Error ? error.message : String(error)}`,
task,
'agent_execution',
error instanceof Error ? error : undefined,
)
}
}
private async runGraders(
task: Task,
agentResult: AgentResult,
infinityAppUrl?: string,
): Promise<Record<string, GraderResult>> {
const configGraders = this.config.graders ?? []
const taskGraders = task.graders ?? []
const graderNames = configGraders.length > 0 ? configGraders : taskGraders
if (graderNames.length === 0) {
return {}
}
try {
const graderResults = await runGraders(graderNames, {
task: {
query_id: task.query_id,
query: task.query,
dataset: task.dataset,
},
messages: agentResult.messages,
screenshotCount:
agentResult.metadata.screenshot_count ??
agentResult.metadata.total_steps,
finalAnswer: agentResult.finalAnswer,
expectedAnswer: (task.metadata?.additional as Record<string, unknown>)
?.answer as string | undefined,
outputDir: join(this.outputDir, task.query_id),
mcpUrl: `${this.config.browseros.server_url}/mcp`,
infinityAppUrl,
})
try {
const saver = new TrajectorySaver(this.outputDir, task.query_id)
await saver.updateGraderResults(graderResults)
} catch (saveError) {
console.warn(
` Failed to persist grader results: ${saveError instanceof Error ? saveError.message : String(saveError)}`,
)
}
return graderResults
} catch (error) {
console.warn(
` Grading failed: ${error instanceof Error ? error.message : String(error)}`,
)
return {
_error: {
score: 0,
pass: false,
reasoning: `Grading failed: ${error instanceof Error ? error.message : String(error)}`,
},
}
}
}
}
// ============================================================================
// Factory
// ============================================================================
export function createTaskExecutor(
config: EvalConfig,
workerIndex: number,
outputDir: string,
onEvent?: (taskId: string, event: Record<string, unknown>) => void,
): TaskExecutor {
return new TaskExecutor(config, workerIndex, outputDir, { onEvent })
}
export {
createTaskRunPipeline as createTaskExecutor,
TaskExecutionError,
TaskRunPipeline as TaskExecutor,
type TaskRunPipelineDeps as TaskExecutorDeps,
} from '../runs/task-run-pipeline'
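The page-id resolution in `resolveInitialPageId` above hinges on one regex: the MCP `list_pages` tool returns numbered text output, and the first numbered entry's index is extracted, falling back to page 1 on a fresh browser. A small sketch of just that parsing step (the sample output strings are assumptions, not the tool's documented format):

```typescript
// Extracts the first numbered entry (e.g. " 1. about:blank") from
// list_pages-style text output; falls back to 1, matching the
// fresh-browser default in resolveInitialPageId.
function parseFirstPageId(text: string): number {
  const match = text.match(/^\s*(\d+)\./m)
  return match ? Number.parseInt(match[1], 10) : 1
}
```

The `m` flag matters: `^` then matches at every line start, so the first numbered line is found even when the tool prefixes its output with a header line.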


@@ -8,12 +8,18 @@ import type { ErrorSource, EvalConfig, GraderResult, Task } from '../types'
export interface RunEvalOptions {
configPath: string
config?: EvalConfig
dataPath?: string
query?: string
startUrl?: string
outputDir?: string
}
export interface RunEvalResult {
outputDir: string
summary: BatchSummary
}
// ============================================================================
// Task Loading
// ============================================================================
