Compare commits

...

27 Commits

Author SHA1 Message Date
Nikhil Sonti
e78bf58d41 test: cover terminal limactl resolver errors 2026-04-28 17:04:39 -07:00
Nikhil Sonti
a1c83a4b9c fix: avoid eager limactl resolution in server tests 2026-04-28 16:53:12 -07:00
Nikhil
7a92654abc feat: add BrowserOS MCP to ACP agents (#851)
* feat: add BrowserOS MCP to ACP agents

* fix: bypass ACP agent permissions

* fix: address review feedback for PR #851
2026-04-28 16:30:20 -07:00
Nikhil
91d3285aa0 feat: add ACP agent harness (#849)
* feat: add acp agent runtime spike

* feat: add agent harness catalog

* feat: persist harness agents in json

* feat: persist agent transcripts

* feat: route harness service through agent records

* feat: expose generic agent harness routes

* feat: add harness agent frontend api

* feat: create harness agents from agents page

* feat: chat with persisted harness agents

* chore: remove obsolete agent profile spike

* chore: self-review fixes

* fix: combine openclaw and harness agents UI

* refactor: split agents page components

* fix: hide persisted harness turns
2026-04-28 15:29:38 -07:00
Nikhil
7bb6dac949 fix(dogfood): copy extension state into dev profile (#850)
* fix(dogfood): copy extension state into dev profile

* fix(dogfood): address profile import review feedback

* fix(dogfood): clarify refresh profile in-use error
2026-04-28 15:25:38 -07:00
shivammittal274
d9c254053e refactor(eval): drop unused agents/graders, collapse registries (#847)
* refactor(eval): drop unused agents/graders, collapse registries

Sweep of dead code in the eval app: deleted gemini-computer-use and
yutori-navigator agents, fara/webvoyager/mind2web graders, eight
debug/analyze/test scripts, three stale planning docs, and the orphaned
eval-targets/coordinate-click testbed.

With two agents and three graders left, the Map-backed plugin registries
were over-engineered — collapsed both into plain switches. Removed the
now-dead GraderOptions plumbing (no remaining grader takes API keys),
dropped grader_api_key_env/grader_base_url/grader_model from the schema
and configs, and de-duped PASS_FAIL_GRADER_ORDER (was defined in three
places). Replaced the URL-parsing extractCdpPort hack in single-agent
and orchestrator-executor with workerIndex passed cleanly through
AgentContext.

README and --help text rewritten to match reality. Renamed
configs/test_*.json to test-*.json for kebab-case consistency.

Net: ~10,460 LOC removed across 60 files. Typecheck clean, all tests
pass.

* ci(eval): pull BrowserOS from rolling stable CDN URL

The pinned v0.44.0.1 .deb on GitHub releases regressed on Linux —
servers start but never become healthy. Switch to the canonical rolling
URL at cdn.browseros.com/download/BrowserOS.deb so CI tracks the same
stable channel users get from the marketing site.
2026-04-29 02:14:47 +05:30
Nikhil
6b9945f933 feat(dev): use dev dock icon for browser launches (#848) 2026-04-28 13:28:19 -07:00
Dani Akash
6a5a7775a9 fix(openclaw): wire LlmProvider.supportsImages through to OpenClaw model config (#846)
When BrowserOS sets up a custom OpenAI-compat provider on the gateway,
the agent UI's "Supports Image" flag (LlmProviderConfig.supportsImages)
was being dropped on the floor. As a result the persisted model entry
had no `input` field, OpenClaw defaulted it to ['text'], and image_url
content parts were silently stripped before the model saw them.

Fix:
- Extend OpenClawSetupInput / OpenClawAgentMutationInput on the agent
  side (useOpenClaw.ts) and the route body schema + SetupInput +
  createAgent input on the server side with `supportsImages?: boolean`.
- AgentsPage forwards `llmOption?.supportsImages` from the selected
  LlmProviderConfig in both handleSetup and handleCreate.
- provider-map.resolveSupportedOpenClawProvider emits
  `input: ['text', 'image']` on the model entry when the flag is
  truthy; otherwise emits the explicit `['text']` so the value is
  always pinned (avoids relying on OpenClaw's implicit default).
- applyBrowserosConfig adds `tools.media.image.enabled = true` to the
  bootstrap batch so the gateway's image-understanding pipeline is
  always wired up — per-model `input` still gates which models see
  images; this just enables the global path.

ACP image content blocks are still dropped by the OpenClaw bridge —
that's a separate bridge bug, not addressed here. This commit
restores image support for the OpenAI-compat /v1/chat/completions
path that the upcoming ACP chat panel will use as a carve-out for
image-bearing prompts.

Existing custom-provider configs are NOT auto-migrated; users will
re-acquire image support either by re-running setup or by editing
their model entries' `input` field manually. A migration pass for
legacy installs is not in scope for this commit because the
"supportsImages" intent isn't recoverable from the persisted config
alone — the source of truth is the LlmProvider record on the agent
side.
2026-04-29 00:23:45 +05:30
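The pinning behavior described above (always emit an explicit `input` list rather than rely on OpenClaw's implicit default) can be sketched as follows. The real code is TypeScript in provider-map.ts; this Python version and the function name are illustrative only:

```python
from __future__ import annotations

def model_input_modes(supports_images: bool | None) -> list[str]:
    """Always emit an explicit input list so the value is pinned.

    Relying on OpenClaw's implicit ['text'] default is what silently
    stripped image_url content parts before this fix; an absent or
    falsy flag now still writes the explicit ['text'].
    """
    return ["text", "image"] if supports_images else ["text"]
```

Emitting `['text']` explicitly (instead of omitting the field) is the point: the persisted model entry no longer depends on a downstream default.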
shivammittal274
af48a2110c feat(eval): Phase 1 — exclude broken tasks, freshen card dates, add grader leniency (#841)
* fix(eval): exclude broken tasks + freshen expired card dates

Two AGISDK tasks are unsolvable today for non-model reasons:

- topwork-1: evals-topwork.vercel.app throws Minified React error #185
  ("Maximum update depth exceeded") on every form submit. The page renders
  "Application error: a client-side exception has occurred" instead of saving.
  Whole-task failure, every model affected.

- fly-unified-2: hardcodes Exp: 12/25 in both the goal text AND a jmespath
  grader criterion. Today is 2026-04, so the eval-site rejects the card.
  Freshening the goal alone leaves the grader expecting the original value;
  freshening both would require monkey-patching agisdk's TaskConfig at
  runtime — too fragile to maintain.

Adds these to a new EXCLUDED_TASKS set alongside the existing
EXCLUDED_WEBSITES (omnizon).

Also adds freshen_goal_dates(): for AGISDK fly-unified tasks whose goal
contains an `Exp: MM/YY` within 6 months of today (or past), rewrites it
to a far-future date (12/30). This rescues fly-unified-5 (had Exp 12/25,
no card-exp grader criterion) and protects fly-unified-4 (had Exp 06/26,
2 months from expiring) from the next eval run hitting the same trap.

Dataset goes from 47 -> 45 tasks; 2 freshened.

* feat(eval): add lenient-strings grader softening

The agisdk grader compares jmespath-extracted values via strict equality.
For tasks where the model adds harmless decoration to a free-text field
(e.g. topwork-3 expects title "Full-Stack Developer" but model produces
"Full-Stack Developer - Enterprise Microservices Platform"), this fails
even though every other criterion would pass.

Adds a substring fallback in the wrapper: a failed criterion is re-marked
as a softened pass when both actual_value and expected_value are strings
and the (stripped, lower-cased) expected_value is contained in the
actual_value. Numbers/bools/dates/None stay strict.

- Default-on. Set AGISDK_STRICT_STRINGS=1 to recover the strict score.
- Softened criteria are tagged with `softened: true` in per_criterion
  output for transparency in run manifests.
- Aggregate `pass`/`reward` are recomputed after softening.

Expected to rescue 4 tasks in our 45-set: topwork-3, topwork-4 (both pure
title-decoration), gomail-8 (grader contradicts goal), and networkin-6
(grader hardcodes profile id).

* fix(eval): exclude 5 more tasks where pipeline (not agent) fails

Extends EXCLUDED_TASKS to 7 entries based on the K2.5 + Opus 4.6
head-to-head deep-dive on the 2026-04-28 runs. The exclusion rule:
remove a task only if it is unsolvable for any agent — either the task
data is invalid, the eval site is broken, or the grader penalizes
correct work. Tasks that fail because of our agent's tool fidelity
(drag, custom-widget fill, click on React submit, etc.) STAY in — those
are real capability gaps the team should see in the score.

New exclusions:

- fly-unified-9: goal references "Dec 18 2024 at 10:00" but the live
  eval site has only 2025 inventory and no 10:00 slot. Both models
  successfully booked the closest available flight and were penalized
  on a grader expectation that can never be met.

- fly-unified-4: eval site stores wall-clock flight times as bare UTC
  (T08:00:00.000Z) while the grader expects them shifted by 8h
  (T16:00:00.000Z = 8 AM PST). Opus 4.6 completed the entire booking
  correctly. Eval-site TZ-storage bug.

- gomail-8: goal says "Clear all emails from GitHub in the inbox", but
  criterion 3 expects exactly 1 email updated. Both K2.5 and Opus
  correctly cleared all 4 GitHub emails. Grader contradicts goal.

- networkin-6: goal says "Choose a random person you haven't connected
  with"; grader hardcodes profilesDiff.updated."4".connectionGrade.
  Both models randomized correctly and missed id 4. Grader contradicts
  goal.

- networkin-9: eval site's searchHistoryDiff doesn't record queries
  submitted via the autocomplete + Enter path. Opus 4.6 completed the
  task end-to-end (Stanford alum, connection request, message); only
  failed because the search-history criterion was never written
  server-side. Eval-site bug.

Dataset goes from 45 -> 40 tasks. Score impact (same K2.5/Opus runs,
recomputed against the cleaned 40-task denominator):

  K2.5:     21/45 (46.7%) -> 21/40 (52.5%)
  Opus 4.6: 28/45 (62.2%) -> 28/40 (70.0%)
  Δ:        15.6 pp -> 17.5 pp (real model gap, less pipeline noise)
2026-04-28 23:19:31 +05:30
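The lenient-strings softening described above can be sketched roughly as follows. The real wrapper lives in the Python eval app; the function names and criterion dict shape here are illustrative, not the actual implementation:

```python
import os

def soften_criterion(criterion: dict) -> dict:
    """Re-mark a failed criterion as a softened pass when both values are
    strings and the stripped, lower-cased expected value is contained in
    the actual value. Numbers/bools/dates/None stay strict."""
    if os.environ.get("AGISDK_STRICT_STRINGS") == "1":
        return criterion  # opt-out recovers the strict score
    if criterion.get("passed"):
        return criterion
    actual = criterion.get("actual_value")
    expected = criterion.get("expected_value")
    if isinstance(actual, str) and isinstance(expected, str):
        if expected.strip().lower() in actual.strip().lower():
            # Tag softened passes for transparency in run manifests
            return {**criterion, "passed": True, "softened": True}
    return criterion

def soften_result(per_criterion: list[dict]) -> dict:
    """Apply softening per criterion, then recompute aggregate pass/reward."""
    softened = [soften_criterion(c) for c in per_criterion]
    passed = all(c["passed"] for c in softened)
    return {"per_criterion": softened, "pass": passed,
            "reward": 1.0 if passed else 0.0}
```

Under this rule the topwork-3 example above flips: "Full-Stack Developer" is a substring of the decorated title, so the criterion is re-marked as a softened pass, while a numeric mismatch stays a strict failure.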
Nikhil
c5ff8d75bc fix(dogfood): clarify init prompts (#839) 2026-04-28 07:48:42 -07:00
Nikhil
445a6a6c45 fix(dogfood): use alpha dock icon (#837) 2026-04-27 21:47:10 -07:00
Nikhil
72d39b9a0f docs(dogfood): simplify alpha workflow readme (#838) 2026-04-27 21:44:03 -07:00
Nikhil
3b47f330f5 fix(dogfood): separate BrowserOS state root (#836) 2026-04-27 17:38:15 -07:00
Nikhil
15a82ff9cb feat: add dogfood background daemon mode (#833) 2026-04-27 17:15:50 -07:00
Nikhil
427549f081 feat: Add BrowserOS Dock icon variants (#835) 2026-04-27 17:10:36 -07:00
Nikhil
a11f9caa64 fix(dogfood): colorize cli output (#834)
* fix(dogfood): colorize cli output

* fix: address dogfood cli review comments
2026-04-27 16:29:25 -07:00
Nikhil
da1397900b refactor: rename internal BrowserOS CLIs (#832)
* refactor: rename internal BrowserOS CLIs

* fix: update dogfood binary gitignore
2026-04-27 16:18:45 -07:00
Nikhil
368c7dcfe8 fix(alpha): write balpha process logs (#830)
* fix(alpha): write balpha process logs

* fix(alpha): address log review feedback
2026-04-27 15:48:40 -07:00
Nikhil
599f8b6b9c fix: address balpha CLI dogfooding feedback (#831) 2026-04-27 15:43:22 -07:00
Nikhil
27834b1d31 fix: update readme (#829) 2026-04-27 15:27:16 -07:00
Nikhil
aa30eb3aaa feat: add balpha dogfooding CLI (#828)
* feat(alpha): scaffold balpha cli

* fix(alpha): address scaffold review

* feat(alpha): add balpha config

* feat(alpha): parse browseros profiles

* feat(alpha): import browseros profile

* feat(alpha): add browser launch helpers

* feat(alpha): add repo build and env pipeline

* feat(alpha): add process supervision

* feat(alpha): add balpha commands

* docs(alpha): document balpha setup

* fix(alpha): reuse dev setup script

* fix(alpha): address review feedback

* fix(alpha): normalize imported browser profile

* fix(alpha): use generic profile fixture names
2026-04-27 15:03:37 -07:00
shivammittal274
e045e34b73 fix(eval): switch weekly eval configs from Fireworks to OpenRouter (#827)
The 2026-04-23 weekly run had 42% of AGISDK and 46% of Infinity tasks
fail with `AI_RetryError: ... the service is overloaded` from Fireworks
(20 concurrent kimi-k2p5 streams across both runs at 10 workers each).

Switching to OpenRouter (which fronts the same Moonshot K2.5 weights
and falls back across providers) for the three weekly configs:
- browseros-agent-weekly.json
- agisdk-real-smoke.json
- infinity-hard-50.json

Model accounts/fireworks/models/kimi-k2p5 -> moonshotai/kimi-k2.5
(same weights, same 262K context). API key env var, base URL updated.

OPENROUTER_API_KEY is already wired into .github/workflows/eval-weekly.yml
and present in repo secrets — no GH config changes needed.

Orchestrator-executor configs and test_webvoyager left on Fireworks
intentionally; can switch later if needed.
2026-04-27 21:52:26 +05:30
shivammittal274
01d649da9a feat(eval): bring deterministic graders to dev + drop omnizon (#824)
* feat: deterministic eval graders (AGI SDK + WebArena-Infinity) (#664)

* feat: add deterministic eval graders (AGI SDK + WebArena-Infinity)

Two new benchmark integrations with programmatic grading — no LLM judge.

AGI SDK / REAL Bench (52 tasks):
- 11 React/Next.js clones of consumer apps (DoorDash, Amazon, Gmail, etc.)
- Grader navigates browser to /finish, extracts state diff from <pre> tag
- Python verifier checks exact values via jmespath queries

WebArena-Infinity (50 hard tasks):
- 13 LLM-generated SaaS clones (Gmail, GitLab, Linear, Figma, etc.)
- InfinityAppManager starts fresh app server per task per worker
- Python verifier calls /api/state and asserts on JSON state

Infrastructure:
- GraderInput extended with mcpUrl + infinityAppUrl for parallel workers
- Each worker gets isolated ports (no cross-worker state contamination)
- CI workflow: pip install agisdk, clone webarena-infinity repo

* chore: switch eval configs back to kimi-k2p5

* fix: register deterministic graders in pass rate calculation

Add agisdk_state_diff and infinity_state to PASS_FAIL_GRADER_ORDER
in both runner types and weekly report script, so scores show correctly
in the dashboard.

* chore: temp switch to opus 4.6 for eval run

* chore: restore kimi-k2p5 as default eval config

* ci: add timeout and continue-on-error for trend report step

* fix(eval): drop omnizon from AGISDK dataset (DMCA takedown)

evals-omnizon.vercel.app returns HTTP 451 ("This content has been
blocked for legal reasons / DMCA_TAKEDOWN"). All 5 omnizon-* tasks
fail grading with "Failed to fetch /finish endpoint: JSON Parse error".

Adds an EXCLUDED_WEBSITES set to the dataset builder and regenerates
agisdk-real.jsonl (52 → 47 tasks).

* fix(eval): correct Infinity port-assignment bugs

Two related bugs in the Infinity eval runner that cause silent port
collisions / fallbacks under parallel execution:

1. build-infinity-dataset.py emitted "app_port" but task-executor and
   the committed JSONL both read "app_base_port". Re-running the build
   script would silently make every task fall back to the 8000 default,
   ignoring per-app port assignments. Renamed the key to match.

2. task-executor derived workerIndex as `base_server_port - 9110`, but
   parallel-executor doesn't override base_server_port per worker —
   only server_url. Every worker computed workerIndex = 0, causing all
   parallel workers to spawn Infinity app servers on the same port.
   Threading workerIndex explicitly through TaskExecutor instead.

Also drops an unused app_name parameter from load_tasks().
2026-04-27 21:35:43 +05:30
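The port-assignment fix above amounts to threading an explicit worker index instead of deriving it from a port value that parallel-executor never overrides. A minimal sketch, where the stride and function name are illustrative rather than the repo's actual constants:

```python
def infinity_app_port(app_base_port: int, worker_index: int,
                      stride: int = 100) -> int:
    """Isolated port per parallel worker: base + index * stride.

    The buggy version derived worker_index as (base_server_port - 9110),
    but base_server_port was never overridden per worker -- only
    server_url was -- so every worker computed index 0 and spawned its
    Infinity app server on the same port.
    """
    return app_base_port + worker_index * stride

# Three parallel workers now land on three distinct ports for one app:
ports = [infinity_app_port(8000, i) for i in range(3)]
```

The companion schema fix (emitting "app_base_port" from the build script so it matches what task-executor reads) matters for the same reason: a key mismatch made every task silently fall back to the 8000 default, collapsing all apps onto one base.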
Dani Akash
ddbb2cf492 feat(agent): composer attachments + server-side outbound message queue (#826)
* feat(agent): attach images and text files to chat messages

Adds end-to-end support for image and text file attachments in the chat
composer, with the staged files round-tripping through the OpenClaw
gateway as OpenAI-compatible content blocks and persisting in the JSONL
so they show up in the historical view.

Server
- HTTP client: new OpenClawChatContentPart union and a buildUserContent
  helper that emits multimodal content arrays when messageParts is
  supplied, falls back to the legacy string content otherwise.
- Service: chatStream takes an optional messageParts array and forwards
  it; BrowserOSChatHistoryItem gains an attachments field.
- JSONL reader: PiContentBlock learns the OpenAI image_url and Anthropic
  image source/data shapes; user messages now emit user.attachment
  events that the history mapper accumulates onto the next user item.
- Route: validates an inbound attachments[] (kind/mime/size/count),
  inlines text-shaped files as <attachment> blocks in the message body,
  attaches images via image_url parts. Replaces the immediate 409 on
  active monitoring session with a 30s waitForSessionFree(agentId) wait
  (registry now exposes onSessionEnd) so cron/hook contention does not
  reject a user-chat send outright. Returns 503 if the wait times out.

Client
- New lib/attachments.ts: validateAttachment / compressImageIfNeeded
  (canvas downscale to 2048px long edge, JPEG 0.85 re-encode for >1.5
  MB inputs) / stageAttachment / stageAttachments that produces the
  staged-attachment shape the composer renders and the payload the
  server accepts.
- ConversationInput: drag-and-drop, paperclip button, clipboard paste,
  staged attachment chip strip with thumbnails for images and a
  paperclip+name chip for text files. Send button enables on either
  text or attachments. Drop-zone overlay during drag.
- chatWithAgent forwards attachments[]; useAgentConversation.send
  accepts a SendInput shape and renders user attachments on the
  optimistic streaming turn via MessageAttachments / MessageAttachment.
- ClawChatMessage groups historical attachment parts into a single
  MessageAttachments strip, ordered before reasoning/tools/text.
- claw-chat-types adds an attachment ClawChatMessagePart variant; the
  history mapper emits attachment parts first and skips the text part
  when the user only sent media.
- AgentCommandHome forwards the new SendInput shape — home composer
  drops attachments at the boundary in v1 (the conversation page is
  where staging is most useful; carrying bytes through the URL bar
  is not sensible).

Limits: 10 attachments per message, 5 MB per image (post compression),
1 MB per text file, mime types png/jpeg/webp/gif and text/* +
application/json. PDFs and other binaries are deferred to v2.

* feat(agent): outbound message queue for chats while agent is mid-turn

Lets users keep typing and submitting messages while the agent is still
streaming a previous turn. Each press is appended to a single-flight
queue and dispatched as soon as `streaming` flips false; the queued
state renders as a strip above the composer so the user sees what's
pending vs. what's already sending.

- New `useOutboundQueue` hook owns the queue, the worker effect, and
  cancel/retry actions. Single-flight by design — a re-entrancy ref
  guard prevents two simultaneous dispatches when `streaming` flickers.
- Composer (`ConversationInput`) accepts optional `outboundQueue`,
  `onCancelQueued`, `onRetryQueued` props. When the queue is provided
  the send-button gate stops blocking on `streaming`; the spinner stays
  as the visual cue that the agent is still busy. Legacy direct-send
  callers keep the old streaming-blocks-send semantic.
- Renders an OutboundQueueStrip above the staged-attachment strip with
  per-item status (queued / sending / failed), a cancel button on
  queued items, and retry + discard on failed items.
- AgentCommandConversation wires `onSend` to `queue.enqueue` and routes
  the home composer's `?q=` initial-message handoff through the queue
  too, so it inherits the same single-flight serialization.

The server-side `waitForSessionFree` (added with attachments) and this
client-side queue together cover both contention sources: cron / hook
turns and back-to-back user sends. Persistence across reloads is
intentionally out of scope for v1 — losing the queue on extension
reload is documented as a known limitation.

* feat(server): server-side outbound message queue

Replaces the client-only React-state queue from 123ef21d with a
proper server-owned queue. Closing the tab is now safe — the server
holds queued messages and dispatches them through the existing
chatStream path the moment the agent's ClawSession status flips to
idle.

Server
- New OutboundQueueService (apps/server/src/api/services/queue) — per
  agent FIFO, in-memory. Subscribes to ClawSession.onStateChange
  through OpenClawService.onAgentStatusChange, and dispatches via
  OpenClawService.chatStream so attachments / history / monitoring
  all behave identically to the existing /chat route. The worker
  drains the SSE response server-side so the gateway run finalizes
  cleanly even with no client connected.
- Four new routes under /claw/agents/:id/queue:
  POST   /queue            enqueue
  DELETE /queue/:itemId    cancel a queued item
  POST   /queue/:itemId/retry  re-queue a failed item
  GET    /queue/stream     SSE feed of the per-agent queue state.
  Validation reuses validateChatAttachments and
  buildMessagePartsFromAttachments from the existing chat route.
- Singleton wired in apps/server/src/main.ts; shutdown on SIGTERM.
- New OpenClawService.getAgentState getter for the queue worker's
  pre-dispatch sanity check.

Client
- useOutboundQueue rewritten as an SSE-backed projection over server
  state. Public API unchanged so the composer still works.
- enqueue POSTs to /queue and shows an optimistic local entry until
  the server's SSE snapshot reflects it; local-only entries get a
  `local-` id prefix so cancel can short-circuit them without
  hitting the server.
- AgentCommandConversation watches the queue for sending items
  dropping out and refetches history so the new assistant turn shows
  up in the conversation view (the server worker streams the
  dispatched turn into OpenClaw without exposing per-turn SSE to
  the client).

Out of scope (documented in the plan as v2 follow-ups): disk
persistence (server restart loses queue), per-turn live streaming
of queued sends in the conversation view, and switching the
underlying dispatch from /v1/chat/completions to the chat.send RPC
(which would also fix the multimodal attachment routing problem).

* fix(server): outbound queue must reuse existing session, not spawn UUIDs

The queue worker was generating a fresh randomUUID() as the sessionKey
when the queued item didn't carry one — and the client wasn't sending
one. Result: every queued message kicked off a brand-new OpenClaw
session, orphaning the user's active conversation behind the new
"most recent" entry in sessions.json. The history endpoint then
resolved to the orphan and the chat appeared to disappear.

Fix is layered:
- Client (useOutboundQueue): forward the current resolvedSessionKey
  in the POST /queue body so every queued message targets the same
  conversation the user is viewing. AgentCommandConversation passes
  resolvedSessionKey into the hook.
- Server (OutboundQueueService): the worker now resolves to the
  agent's existing user-chat session when no sessionKey is provided
  on the queued item, via OpenClawService.resolveAgentSession. UUID
  fallback is now reserved for the first-ever message on a brand
  new agent — same semantic the existing /chat route has implicitly
  through the catalog of historical sessions.

No JSONL data was lost by the original bug (the prior conversations
are intact on disk); the orphan sessions just shadowed the original
in sessions.json.

* fix(agent,server): address PR review feedback for chat queue

- Tighten image data URL cap to base64-aware ~6.7 MB (was ~7.5 MB
  through `MAX_IMAGE_BYTES * 2`).
- Forward chat history from useOutboundQueue.enqueue so queued sends
  preserve conversation context like direct sends do.
- Match local attachment previews to server snapshots by id (not by
  message text), and prune the preview map as items drain.
- Pass an AbortSignal into chatStream so a queue shutdown cancels the
  initial OpenClaw handshake, not just the SSE drain loop.
- Track previously gitignored apps/agent/lib/attachments.ts (was caught
  by global lib/ ignore) so CI typecheck can resolve @/lib/attachments.
- Update server-api openclaw route tests to the new chatStream signature
  and the waitForSessionFree-based busy-agent path.

* fix(agent): dedupe optimistic queue entries for text-only sends

The localId↔serverId map was only populated when the message had
attachments, so plain-text sends left the optimistic local entry in
place after the server snapshot arrived — the user saw the same
message rendered twice in the queue strip.

* fix(agent): prune optimistic queue entry on POST ack, not just SSE

The server broadcasts the new queue snapshot before its POST response
returns, so the SSE handler often runs first — at that point the
localId↔serverId map has no entry for the new server id yet, so the
SSE-based dedupe path can't drop the optimistic local entry. Pruning
on POST success closes the race deterministically.

* fix(agent): hand off optimistic queue entry without a render gap

Pruning the local entry on POST success only worked when the SSE
snapshot had already overwritten it; if the POST response landed
first, the optimistic row disappeared for a frame before the SSE
snapshot brought back the server-keyed row, producing a visible
flicker. Gate the POST-side prune on the SSE snapshot already
carrying the server id, and rely on the SSE-based dedupe (now
guaranteed to find the localId↔serverId link in the map) to clean
up when SSE arrives later.

* fix(agent,server): client-generated queue id eliminates render flicker

The server used to assign its own UUID when an item was enqueued, so
the optimistic client row carried a `local-` id while the SSE snapshot
carried a server UUID — the client had to wait for the POST response
to learn the mapping before it could dedupe, and during that window
both rows rendered.

Now the browser generates the id, sends it in the POST body, and the
server uses it verbatim (falling back to a fresh UUID only if the id
collides with an existing item). The client collapses to a single
id-keyed list, so the optimistic row and the SSE row reconcile on the
same key from the very first render.
2026-04-27 21:31:03 +05:30
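The final fix above reconciles the optimistic row and the SSE snapshot on a single client-generated id. A minimal sketch of that merge (the real client is TypeScript; these function names and row shapes are illustrative):

```python
import uuid

def enqueue_optimistic(local_rows: dict, text: str) -> str:
    """Client generates the id up front and sends it in the POST body,
    so the later SSE snapshot arrives keyed by the same id."""
    item_id = str(uuid.uuid4())
    local_rows[item_id] = {"id": item_id, "text": text,
                           "status": "queued", "optimistic": True}
    return item_id

def apply_snapshot(local_rows: dict, snapshot: list[dict]) -> dict:
    """Server snapshot wins for any id it carries; optimistic rows the
    server has not yet acknowledged are kept, so nothing disappears
    for a frame and nothing renders twice."""
    merged = {row["id"]: row for row in snapshot}
    for item_id, row in local_rows.items():
        merged.setdefault(item_id, row)
    return merged

rows: dict = {}
qid = enqueue_optimistic(rows, "hello")
merged = apply_snapshot(rows, [{"id": qid, "text": "hello",
                                "status": "sending"}])
```

Because both sides key on the same id from the very first render, there is no localId-to-serverId mapping to race against: the snapshot row simply replaces the optimistic row in place.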
Dani Akash
711934555d feat(agent): enrich chat UI with tool activity, reasoning duration, and cost (#825)
* feat: pass per-turn cost and token data through chat history items

- Add costUsd, tokensIn, tokensOut to BrowserOSChatHistoryItem (server)
- Pass through from JSONL agent.message events in jsonlEventsToHistoryItems()
- Add same fields to client-side BrowserOSChatHistoryItem and ClawChatMessage
- Map cost/token data in mapHistoryItemToClawMessage()

Data flows: JSONL message.usage → server history item → API response →
client ClawChatMessage. Available for rendering in ClawChatMessage
component (message toolbar, cost badges).

* feat: add message toolbar with copy button and per-turn cost display

Add MessageToolbar to historical assistant messages in ClawChatMessage:
- Copy button copies message text to clipboard via MessageAction
- Per-turn token count (22.7K → 238) and cost ($0.003) shown as muted
  tabular-nums text on the right side of the toolbar
- Toolbar appears on hover (opacity transition via group-hover)
- Only shown when the message has text content
- Cost/token display only shown when data is available from JSONL

* fix: toolbar only on assistant messages, always visible, cost only

- Only render toolbar on assistant messages (not user messages)
- Remove hover-only opacity — toolbar is always visible
- Remove token counts (22.7K → 238 is meaningless to users)
- Show only cost as a budget signal ($0.003)

* feat: group all tool activity into single Task collapsible per turn

Replace flat tool rows with a single ai-elements Task collapsible per
assistant turn that lists every tool/MCP call in sequence.

Live streaming (ConversationMessage):
- Aggregate all tool-batch parts into one Task
- Title: "Working… (N actions)" while running, "Agent activity (N actions)" when done
- Default open while turn is in progress
- Wrench icon in trigger

Historical (ClawChatMessage):
- Group all tool-call parts into one Task
- Title includes failed count if any tools errored
- Default collapsed — expandable on click
- Tool name + status icon + error text per row

Both views show one clean collapsible per turn instead of N individual
tool cards. Collapsed reads "5 actions"; expanded shows the timeline.

* feat: include tool calls in chat history responses

Server: jsonlEventsToHistoryItems() now walks ALL events (not just
messages) and pairs agent.tool_use with agent.tool_result by toolCallId.
The resulting tool call list is attached to the next assistant text
message as toolCalls[]. Each entry includes status, input arguments,
output text, error string, and duration computed from event timestamps.

Client:
- BrowserOSChatHistoryItem gets optional toolCalls field
- Tool-call message part type gets durationMs field
- mapHistoryItemToClawMessage() emits tool-call parts BEFORE the text
  part (the order the agent produced them)
- ClawChatMessage Task view now shows tool duration in seconds

Result: historical messages now display the full tool activity
timeline grouped into the single Task collapsible per turn (designed
in step 3), instead of showing only the final text response.

* feat: render activity rows as human verbs sourced from tool registry

Tool calls in the chat activity view now read as sentences:
"Opened tab · news.ycombinator.com" instead of "browseros__new_page".

Server (tool-label-registry.ts):
- Curated verb override map for ~70 BrowserOS first-party tools
- Per-tool subject extractors that pull the meaningful argument from
  input (URL → host, query → quoted, element → ID, etc.)
- Generic fallback humanizes snake_case for any unmapped tool
- Strips MCP namespace prefixes (browseros__, mcp_)

Server (openclaw-service.ts):
- jsonlEventsToHistoryItems calls buildToolLabel for each tool_use,
  attaches label and subject to the BrowserOSChatHistoryToolCall

Client:
- Mirrored label module at lib/tool-labels.ts
- useAgentConversation tool-start handler computes label/subject
  from the SSE tool args
- ClawChatMessage and ConversationMessage render label · subject
  with foreground/muted styling, no font-mono
- ToolEntry, BrowserOSChatHistoryToolCall, and tool-call message
  part types all carry label and optional subject

* fix: drop meaningless tab N subject from page-read tool rows

Page IDs are internal numbers, not URLs. 'Took screenshot · tab 4'
tells the user nothing. Removed subject extractors for take_snapshot,
take_enhanced_snapshot, get_page_content, get_page_links, get_dom,
and take_screenshot. The verb alone is the right signal.

* fix: gate initial loading on historyQuery.isFetched not isLoading

The session and history queries are sequential: the history query is
disabled until session resolves. After session resolves, there's a render
frame where historyQuery.isLoading is still false (the query hasn't
been kicked off yet). isInitialLoading flipped to false during that
window, exposing an empty chat shell with just Task collapsibles and
copy buttons before the messages filled in.

Switching the guard to isFetched closes that window — the loading state
stays true until the first history fetch actually completes.

* fix: render historical messages immediately instead of through Streamdown's idle-callback debounce

Streamdown defaults to mode="streaming" which uses requestIdleCallback (300ms
debounce, 500ms idle timeout) and lazy/Suspense to optimize for token-by-token
live streams. For finalized historical messages this caused tool collapsibles
and copy buttons to paint while text bodies stayed blank for ~300-500ms after
load. Pass mode="static" + parseIncompleteMarkdown=false on the historical
MessageResponse so completed text paints in the same frame as the surrounding
chrome. Live streaming turns still use the default streaming mode.

Also collapse the redundant /agents/:id/session round-trip into the existing
/history endpoint (server already resolves the most recent user-chat session
when sessionKey is omitted) and tighten the initial-loading gate to stay true
across the render frame where the query is enabled but hasn't started fetching.

* feat: surface thinking duration on historical reasoning collapsibles

Server accumulates agent.thinking events per turn from JSONL and attaches a
single reasoning block (joined text + durationMs from first thinking event
to the closing agent.message) on each assistant history item. Reasoning
buffer resets on user.message alongside the tool-call buffer.

Client mirrors the type, emits the reasoning part before tool calls in
mapHistoryItemToClawMessage (chronological: think → act → answer), and
passes duration in seconds to <Reasoning> so the trigger reads "Thought
for N seconds" instead of just "Thinking" on collapsed historical turns.

* fix: read thinking blocks from the correct JSONL field name

OpenClaw stores reasoning blocks as {type:'thinking', thinking:'...'} but
the JSONL parser was reading block.text, so every thinking event was
silently dropped before it ever reached jsonlEventsToHistoryItems. As a
result the reasoning field on history items was always empty even though
the new accumulator was wired up correctly.

Also guard the client mapping: when durationMs is 0 (think + answer
emitted in the same JSONL line, no real elapsed wall-clock) pass
undefined to <Reasoning> so it renders the static "Thinking" trigger
instead of the streaming shimmer / "Thought for 0 seconds".
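The zero-duration guard amounts to a one-liner (hypothetical helper name):

```typescript
// 0ms means think + answer landed in the same JSONL line, so there is
// no real wall-clock duration worth showing.
function reasoningDurationSeconds(durationMs: number): number | undefined {
  return durationMs > 0 ? durationMs / 1000 : undefined
}
```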

* fix: reset reasoning buffer on discarded turns and drop dead session hook

Two cleanups from PR review:

1. jsonlEventsToHistoryItems: when an agent.message is discarded (the
   "[Chat messages since your last reply" wrapper without a current-message
   marker) the tool buffers were already reset but the reasoning buffer
   was not. Accumulated thinking from the discarded turn would bleed onto
   the next assistant message. Reset pendingReasoningTexts and
   pendingReasoningFirstAt alongside the tool buffers.

2. useClawAgentSession, the AgentSessionResponse type, and the unused
   session entry in CLAW_CHAT_QUERY_KEYS became dead code after the
   session round-trip was folded into the history endpoint. Removed.
2026-04-27 18:29:15 +05:30
Nikhil
5125dffbf3 fix: sign limactl with VZ entitlement (#822) 2026-04-26 13:30:09 -07:00
Dani Akash
0035893f33 feat: dashboard API, JSONL reader, and OpenClaw observer for enriched home page (#810)
* feat: draft agent chat ui exploration

* feat: refine agent chat ui draft

* feat: remove outer frame from agent chat workspace

* fix: offset agent chat for app sidebar

* fix: simplify agent conversation shell

* fix: remove redundant chat header actions

* fix: unify agent conversation headers

* fix: tighten agent chat spacing

* fix: bound agent chat composer height

* fix: remove agent chat page inset

* fix: align agent header height with sidepanel

* fix: center agent composer resting state

* fix: anchor multiline composer controls

* fix: remove focus grid from agent home

* fix: remove redundant agent home header

* fix: constrain home agent composer

* fix: match home composer default posture

* feat: add openclaw chat history APIs

* feat: add claw chat history hydration

* fix: stabilize claw chat viewport layout

* fix: use conversation scroll base for claw chat

* refactor: split claw chat controller responsibilities

* fix: keep active agent turns in memory

* fix: normalize openclaw chat sessions

* refactor: use HTTP client for agent history instead of CLI client

Replace the CLI-based getChatHistory() call in getAgentHistoryPage()
with the HTTP client's getSessionHistory() from PR #795. This uses
the direct HTTP transport to OpenClaw's /sessions/<key>/history
endpoint instead of shelling out through the CLI.

- Add filterHttpSessionHistoryMessages() for flat-string content format
- Add normalizeHttpHistoryMessages() for OpenClawSessionHistoryMessage shape
- Update getAgentHistoryPage() to call getSessionHistory() via httpClient
- Remove unused getChatHistory(), filterOpenClawSystemMessages(),
  normalizeChatHistoryMessages(), and getTextContent()
- Update test mocks from cliClient.getChatHistory to httpClient.getSessionHistory
- Update MutableOpenClawService type: chatClient -> httpClient

* fix: fetch all session messages by iterating OpenClaw pagination

OpenClaw's HTTP history endpoint returns a limited page by default.
When called without a limit, only the first ~27 messages were returned,
causing all newer conversation messages to be silently dropped.

Add fetchAllSessionMessages() that iterates through OpenClaw's cursor-
based pagination (200 messages per page) until hasMore is false, then
feeds the complete message list into the existing BrowserOS normalization
and in-memory pagination layer.
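The pagination loop can be sketched generically — the page shape and fetch signature are assumptions; the real client talks to OpenClaw's `/sessions/<key>/history` endpoint:

```typescript
// Page shape and fetch signature are assumed for illustration.
interface HistoryPage<T> {
  messages: T[]
  nextCursor: string | null
  hasMore: boolean
}

type FetchPage<T> = (cursor: string | null, limit: number) => Promise<HistoryPage<T>>

// Walk the cursor-based pagination (200 per page) until hasMore is false,
// returning the complete message list.
async function fetchAllSessionMessages<T>(
  fetchPage: FetchPage<T>,
  pageSize = 200,
): Promise<T[]> {
  const all: T[] = []
  let cursor: string | null = null
  for (;;) {
    const page = await fetchPage(cursor, pageSize)
    all.push(...page.messages)
    if (!page.hasMore) break
    cursor = page.nextCursor
  }
  return all
}
```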

* refactor: migrate chat history from HTTP gateway to direct JSONL file reads

Replace the HTTP-based chat history pipeline (BrowserOS server → OpenClaw
gateway /sessions/:key/history pagination loop) with direct JSONL file reads
from the host filesystem via Lima's virtiofs mount.

- Add OpenClawJsonlReader that reads session JSONL files directly from
  ~/.browseros/vm/openclaw/.openclaw/agents/<id>/sessions/
- Replace fetchAllSessionMessages() HTTP pagination with single file read
- Replace CLI-based listSessions() with sessions.json file reads
- Make listSessions, resolveAgentSession, getAgentHistoryPage synchronous
- Remove unused toBrowserOSSession, filterHttpSessionHistoryMessages,
  normalizeHttpHistoryMessages helpers
- Update route handlers to drop unnecessary async/await
- Update tests to use temp JSONL files instead of mocked HTTP/CLI clients

* fix: restore async route handlers for test compatibility with mocked service

* fix: address review feedback — path traversal guard, lazy reader, exists flag

- Add safePath() to OpenClawJsonlReader that validates resolved paths stay
  within stateRoot, preventing path traversal via crafted agentId values
- Use lazy initialization for jsonlReader (nulled on rebuildRuntimeClients)
  instead of creating a new instance per property access
- Return exists: false from resolveSpecificAgentSession when no session
  matches instead of fabricating a ghost session with sessionId: ''
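A sketch of the traversal guard — the `safePath` name is from the commit, but the body and signature here are assumptions:

```typescript
import path from 'node:path'

// Resolve the candidate under stateRoot and reject anything that
// escapes it, e.g. an agentId containing '..' segments.
function safePath(stateRoot: string, ...segments: string[]): string {
  const root = path.resolve(stateRoot)
  const resolved = path.resolve(root, ...segments)
  if (resolved !== root && !resolved.startsWith(root + path.sep)) {
    throw new Error(`path escapes state root: ${resolved}`)
  }
  return resolved
}
```

The `root + path.sep` comparison matters: a plain `startsWith(root)` would wrongly accept a sibling directory like `/state-evil`.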

* feat: add dashboard API and enrich home page agent cards

Server:
- Add summarizeToolActivity() that converts tool events into natural
  language descriptions ("Browsed 3 pages, took 2 screenshots")
- Add getDashboard() to OpenClawService that aggregates per-agent stats
  from JSONL: latest message, activity summary, cost, session count
- Add GET /claw/dashboard endpoint

Client:
- Add useAgentDashboard() React Query hook (10s refetch, 5s stale)
- Rewrite useAgentCardData from async IndexedDB hook to pure
  buildAgentCardData() function merging agent entries with dashboard data
- Add activity summary and cost to AgentCardExpanded footer
- Add activitySummary and costUsd fields to AgentCardData type
- Remove IndexedDB dependency from the home page
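The activity summary can be sketched as a pure function — the tool names and phrasing below are assumptions, not the server's real mapping:

```typescript
// Hypothetical re-creation of summarizeToolActivity(); only illustrates
// the counting-to-prose idea from the commit above.
function summarizeToolActivity(counts: Record<string, number>): string {
  const phrase = (n: number, singular: string, plural: string) =>
    n === 1 ? `1 ${singular}` : `${n} ${plural}`
  const parts: string[] = []
  if (counts.navigate) parts.push(`Browsed ${phrase(counts.navigate, 'page', 'pages')}`)
  if (counts.take_screenshot) parts.push(`took ${phrase(counts.take_screenshot, 'screenshot', 'screenshots')}`)
  if (parts.length === 0) return 'No tool activity yet'
  return parts.join(', ')
}
```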

* feat: add OpenClawObserver for real-time per-agent status via gateway WS

- Add OpenClawObserver that connects to the OpenClaw gateway WebSocket
  control plane and subscribes to chat broadcast events
- Track per-agent status in real time: working (streaming), idle (turn
  complete), error (run failed), with current tool name
- Auto-connect when gateway control plane becomes available, auto-
  reconnect on disconnect with 5s backoff
- Disconnect observer on stop/shutdown
- Wire live status + currentTool into getDashboard() response
- Update client: AgentOverview includes status + currentTool, card shows
  spinning loader + tool name when agent is working
- Status resolution: per-agent WS status takes precedence over gateway-
  level status for working/error states

* feat: add SSE dashboard stream for real-time agent status on home page

Server:
- Add GET /claw/dashboard/stream SSE endpoint that sends an initial
  snapshot then pushes per-agent status events as they arrive from
  the OpenClaw observer
- Add onAgentStatusChange() to OpenClawService exposing the observer's
  listener for the route layer
- Heartbeat every 15s to keep connections alive

Client:
- useAgentDashboard() now subscribes to EventSource at /claw/dashboard/stream
- SSE snapshot event hydrates the React Query cache immediately
- SSE status events patch individual agent status + currentTool in the
  cache without refetching — agent cards update instantly
- Polling fallback raised to 30s since SSE handles real-time
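The in-place cache patch on each SSE status event reduces to a pure update — shapes assumed from the commit; the real code writes into the React Query cache:

```typescript
// Shapes assumed for illustration.
interface AgentOverview {
  agentId: string
  status: 'working' | 'idle' | 'error' | 'unknown'
  currentTool?: string
}

// Patch one agent's status + currentTool without refetching the dashboard.
function patchAgentStatus(
  agents: AgentOverview[],
  event: { agentId: string; status: AgentOverview['status']; currentTool?: string },
): AgentOverview[] {
  return agents.map((agent) =>
    agent.agentId === event.agentId
      ? { ...agent, status: event.status, currentTool: event.currentTool }
      : agent,
  )
}
```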

* fix: observer WS handshake — wait for challenge before sending connect

The OpenClaw gateway sends a connect.challenge event before accepting
the connect request. The observer was sending the connect request on
ws.open, which raced with the challenge. It now waits for the challenge
event before sending the handshake.

Also add dangerouslyDisableDeviceAuth to the gateway setup config
batch so the observer can connect without device identity on new
installs.

* fix: JSONL reader falls back to most recent file when sessions.json is stale

OpenClaw's sessions.json can record a Pi session ID that doesn't match
the actual JSONL filename on disk. This happens after context compaction
or session restart — the JSONL file gets a new UUID but sessions.json
keeps the old one.

Previously this caused history to silently disappear (the reader tried
to open a non-existent file and returned empty). Now resolveJsonlPath()
checks if the mapped file exists and, when it doesn't, scans the
sessions directory for the most recently modified .jsonl file as a
fallback.
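The fallback selection can be expressed as a pure function over a directory listing — the file-entry shape is assumed (the real reader pairs `fs.readdir` with `stat`):

```typescript
// Directory entries assumed to carry name + mtime from a stat pass.
interface SessionFile {
  name: string
  mtimeMs: number
}

// Prefer the filename sessions.json maps to; when it is missing on disk,
// fall back to the most recently modified .jsonl in the directory.
function resolveJsonlName(mapped: string, files: SessionFile[]): string | null {
  if (files.some((f) => f.name === mapped)) return mapped
  const candidates = files.filter((f) => f.name.endsWith('.jsonl'))
  if (candidates.length === 0) return null
  return candidates.reduce((a, b) => (b.mtimeMs > a.mtimeMs ? b : a)).name
}
```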

* feat: add ClawSession state machine for reliable per-agent status

The OpenClawObserver only knows about status changes it witnesses via
WS events. If an agent was already running when the observer connected,
or after a reconnect, statuses were stuck at "unknown".

ClawSession is an in-memory state machine that solves this:

1. Seeds from JSONL on first control plane call — reads the latest
   events for each agent and infers working/idle. A session is "working"
   if the last event is a user.message with no subsequent agent.message,
   or an agent.tool_use with no matching agent.tool_result.

2. Receives live transitions from the WS observer — the observer now
   delegates all state management to ClawSession instead of maintaining
   its own status map.

3. Applies a 5-minute staleness threshold — if the last JSONL event
   is older than 5 minutes, assume idle (handles agent crashes).

Consumers (SSE stream, dashboard endpoint) read from ClawSession and
get correct state from the first call — no "unknown" period.
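The seed-time inference described in points 1 and 3 can be sketched as follows — event shapes are assumed, and since only the tail event matters here, the "no subsequent agent.message / matching tool_result" conditions reduce to checks on the last event:

```typescript
// Event shapes assumed for illustration.
interface SeedEvent {
  type: 'user.message' | 'agent.message' | 'agent.tool_use' | 'agent.tool_result'
  timestamp: number // ms
}

const STALE_MS = 5 * 60 * 1000

// Working if the tail is an unanswered user.message or an unmatched
// agent.tool_use; idle otherwise, or when the tail is older than the
// staleness threshold (handles crashed agents).
function inferStatus(events: SeedEvent[], now: number): 'working' | 'idle' {
  if (events.length === 0) return 'idle'
  const last = events[events.length - 1]
  if (now - last.timestamp > STALE_MS) return 'idle'
  if (last.type === 'user.message' || last.type === 'agent.tool_use') return 'working'
  return 'idle'
}
```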

* fix: remove staleTime so dashboard refetches on every mount

* fix: reset stale working status on WS disconnect, eliminate redundant JSONL reads

- Observer resets all "working" agents to "unknown" when the WS closes,
  preventing agents from appearing stuck as Working indefinitely after
  a gateway restart. ClawSession re-seeds correct state on reconnect.

- getDashboard() now derives latestAgentMessage and cost from the
  already-loaded events array for the latest session instead of calling
  latestAgentMessage() and getSessionStats() which each re-read the
  same JSONL file. Reduces file reads from 3x to 1x per agent.
2026-04-25 19:03:03 +05:30
256 changed files with 17217 additions and 12213 deletions


@@ -30,8 +30,9 @@ jobs:
- name: Install BrowserOS
run: |
wget -q https://github.com/browseros-ai/BrowserOS/releases/download/v0.44.0.1/BrowserOS_v0.44.0.1_amd64.deb
sudo dpkg -i BrowserOS_v0.44.0.1_amd64.deb
# Rolling stable channel — see https://cdn.browseros.com/download/BrowserOS.deb
wget -q -O BrowserOS.deb https://cdn.browseros.com/download/BrowserOS.deb
sudo dpkg -i BrowserOS.deb
browseros --version || echo "BrowserOS installed at $(which browseros)"
- name: Install Bun
@@ -43,6 +44,12 @@ jobs:
working-directory: packages/browseros-agent
run: bun install --ignore-scripts && bun run build:agent-sdk
- name: Install Python eval dependencies
run: pip install agisdk requests
- name: Clone WebArena-Infinity
run: git clone --depth 1 https://github.com/web-arena-x/webarena-infinity.git /tmp/webarena-infinity
- name: Install xvfb
run: sudo apt-get update && sudo apt-get install -y xvfb
@@ -57,9 +64,11 @@ jobs:
working-directory: packages/browseros-agent/apps/eval
env:
FIREWORKS_API_KEY: ${{ secrets.FIREWORKS_API_KEY }}
OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
NOPECHA_API_KEY: ${{ secrets.NOPECHA_API_KEY }}
BROWSEROS_BINARY: /usr/bin/browseros
WEBARENA_INFINITY_DIR: /tmp/webarena-infinity
EVAL_CONFIG: ${{ github.event.inputs.config || 'configs/browseros-agent-weekly.json' }}
run: |
echo "Running eval with config: $EVAL_CONFIG"
@@ -81,6 +90,8 @@ jobs:
- name: Generate trend report
if: success()
timeout-minutes: 5
continue-on-error: true
working-directory: packages/browseros-agent
env:
EVAL_R2_ACCOUNT_ID: ${{ secrets.EVAL_R2_ACCOUNT_ID }}


@@ -54,6 +54,10 @@ jobs:
command: (cd apps/server && bun run test:integration)
junit_path: test-results/server-integration.xml
needs_browser: true
- suite: server-lib
command: (cd apps/server && bun run test:lib)
junit_path: test-results/server-lib.xml
needs_browser: false
- suite: server-sdk
command: (cd apps/server && bun run test:sdk)
junit_path: test-results/server-sdk.xml


@@ -180,6 +180,7 @@ packages/*/dist
browseros-server
browseros-server.exe
browseros-server-*
tools/dogfood/browseros-dogfood
tools/dev/browseros-dev
log.txt


@@ -1,4 +1,4 @@
import { Bot } from 'lucide-react'
import { Bot, Loader2, Wrench } from 'lucide-react'
import type { FC } from 'react'
import type { AgentCardData } from '@/lib/agent-conversations/types'
import { cn } from '@/lib/utils'
@@ -32,6 +32,11 @@ function getStatusTone(status: AgentCardData['status']): string {
return 'bg-emerald-500'
}
function formatCost(usd: number): string {
if (usd < 0.005) return `$${usd.toFixed(4)}`
return `$${usd.toFixed(2)}`
}
export const AgentCardExpanded: FC<AgentCardProps> = ({
agent,
onClick,
@@ -81,9 +86,26 @@ export const AgentCardExpanded: FC<AgentCardProps> = ({
</p>
</div>
<div className="mt-4 flex items-center justify-between gap-3 text-muted-foreground text-xs">
<span>{formatTimestamp(agent.lastMessageTimestamp)}</span>
<span>Open conversation</span>
<div className="mt-4 space-y-1.5 text-muted-foreground text-xs">
<div className="flex items-center justify-between gap-3">
<span>{formatTimestamp(agent.lastMessageTimestamp)}</span>
{agent.costUsd ? (
<span className="tabular-nums opacity-70">
{formatCost(agent.costUsd)}
</span>
) : null}
</div>
{agent.status === 'working' && agent.currentTool ? (
<div className="flex items-center gap-1.5 text-[var(--accent-orange)]/70">
<Loader2 className="size-3 shrink-0 animate-spin" />
<span className="truncate">{agent.currentTool}</span>
</div>
) : agent.activitySummary ? (
<div className="flex items-center gap-1.5 text-muted-foreground/60">
<Wrench className="size-3 shrink-0" />
<span className="truncate">{agent.activitySummary}</span>
</div>
) : null}
</div>
</button>
)


@@ -1,4 +1,3 @@
import { useQueryClient } from '@tanstack/react-query'
import { ArrowLeft, Bot, Home } from 'lucide-react'
import { type FC, useEffect, useMemo, useRef, useState } from 'react'
import { Navigate, useNavigate, useParams, useSearchParams } from 'react-router'
@@ -13,14 +12,13 @@ import { ClawChat } from './ClawChat'
import { ConversationInput } from './ConversationInput'
import {
buildChatHistoryFromClawMessages,
filterTurnsPersistedInHistory,
flattenHistoryPages,
} from './claw-chat-types'
import { useAgentConversation } from './useAgentConversation'
import {
CLAW_CHAT_QUERY_KEYS,
useClawAgentSession,
useClawChatHistory,
} from './useClawChatHistory'
import { useClawChatHistory } from './useClawChatHistory'
import { useHarnessChatHistory } from './useHarnessChatHistory'
import { useOutboundQueue } from './useOutboundQueue'
function StatusBadge({ status }: { status: string }) {
return (
@@ -136,7 +134,7 @@ function AgentRailList({
<div className="styled-scrollbar min-h-0 flex-1 space-y-2 overflow-y-auto px-3 py-3">
{agents.map((entry) => {
const active = entry.agentId === activeAgentId
const modelName = getModelDisplayName(entry.model) ?? 'OpenClaw agent'
const modelName = getAgentEntryMeta(entry)
return (
<button
@@ -171,6 +169,13 @@ function AgentRailList({
)
}
function getAgentEntryMeta(agent: AgentEntry | undefined): string {
if (agent?.source === 'agent-harness') {
return getModelDisplayName(agent.model) ?? 'ACP agent'
}
return getModelDisplayName(agent?.model) ?? 'OpenClaw agent'
}
function getConversationStatusCopy(status: string | undefined): string {
if (status === 'running') return 'Ready'
if (status === 'starting') return 'Connecting'
@@ -196,55 +201,126 @@ function AgentConversationController({
agentPathPrefix: string
createAgentPath: string
}) {
const queryClient = useQueryClient()
const navigate = useNavigate()
const initialMessageSentRef = useRef<string | null>(null)
const onInitialMessageConsumedRef = useRef(onInitialMessageConsumed)
const [streamSessionKey, setStreamSessionKey] = useState<string | null>(null)
const agent = agents.find((entry) => entry.agentId === agentId)
const agentName = agent?.name || agentId || 'Agent'
const sessionQuery = useClawAgentSession(agentId)
const resolvedSessionKey =
streamSessionKey ?? sessionQuery.data?.sessionKey ?? null
const historyQuery = useClawChatHistory({
const isAgentHarnessAgent = agent?.source === 'agent-harness'
const clawHistoryQuery = useClawChatHistory({
agentId,
sessionKey: resolvedSessionKey,
enabled: Boolean(resolvedSessionKey),
sessionKey: streamSessionKey,
enabled: Boolean(agent) && !isAgentHarnessAgent,
})
const harnessHistoryQuery = useHarnessChatHistory(
agentId,
Boolean(agent) && isAgentHarnessAgent,
)
const historyMessages = useMemo(
() => flattenHistoryPages(historyQuery.data?.pages ?? []),
[historyQuery.data?.pages],
() =>
flattenHistoryPages(
isAgentHarnessAgent
? harnessHistoryQuery.data
? [harnessHistoryQuery.data]
: []
: (clawHistoryQuery.data?.pages ?? []),
),
[
clawHistoryQuery.data?.pages,
harnessHistoryQuery.data,
isAgentHarnessAgent,
],
)
const chatHistory = useMemo(
() => buildChatHistoryFromClawMessages(historyMessages),
[historyMessages],
)
const resolvedSessionKey =
streamSessionKey ??
(isAgentHarnessAgent
? null
: (clawHistoryQuery.data?.pages?.[0]?.sessionKey ?? null))
const { turns, streaming, send } = useAgentConversation(agentId, {
runtime: isAgentHarnessAgent ? 'agent-harness' : 'openclaw',
sessionKey: resolvedSessionKey,
history: chatHistory,
onComplete: () => {
if (isAgentHarnessAgent) {
void harnessHistoryQuery.refetch()
}
},
onSessionKeyChange: (sessionKey) => {
setStreamSessionKey(sessionKey)
void queryClient.invalidateQueries({
queryKey: [CLAW_CHAT_QUERY_KEYS.session],
})
},
})
const sendRef = useRef(send)
sendRef.current = send
const visibleTurns = useMemo(
() =>
isAgentHarnessAgent
? filterTurnsPersistedInHistory(turns, historyMessages)
: turns,
[historyMessages, isAgentHarnessAgent, turns],
)
const outboundQueue = useOutboundQueue({
agentId,
sessionKey: resolvedSessionKey,
enabled: Boolean(agent) && !isAgentHarnessAgent,
})
onInitialMessageConsumedRef.current = onInitialMessageConsumed
const disabled = status?.status !== 'running'
// Refetch history whenever a server-dispatched queue item completes.
// The server worker streams the queued turn into OpenClaw directly, so
// the client never observes the live tokens — we only see the new
// assistant turn once the JSONL is updated. Watching the queue for
// any 'sending' item dropping out is the cleanest "turn finalized"
// signal we have without exposing per-turn SSE.
const previousSendingIdsRef = useRef<Set<string>>(new Set())
useEffect(() => {
if (isAgentHarnessAgent) return
const currentSending = new Set(
outboundQueue.queue
.filter((item) => item.status === 'sending')
.map((item) => item.id),
)
const dropped = [...previousSendingIdsRef.current].filter(
(id) => !currentSending.has(id),
)
previousSendingIdsRef.current = currentSending
if (dropped.length > 0) {
void clawHistoryQuery.refetch()
}
}, [clawHistoryQuery, isAgentHarnessAgent, outboundQueue.queue])
const disabled =
!agent || (!isAgentHarnessAgent && status?.status !== 'running')
// Two-part gate: cover both "still fetching" AND "just got enabled but
// hasn't started fetching yet". When `enabled` flips true (baseUrl
// resolves), there's a render frame where React Query reports
// isLoading=false but hasn't run the queryFn yet — `isFetched` is still
// false. Without this we render EmptyState during that one frame.
const isInitialLoading =
sessionQuery.isLoading ||
(Boolean(resolvedSessionKey) && historyQuery.isLoading)
!isAgentHarnessAgent &&
(clawHistoryQuery.isLoading ||
(!clawHistoryQuery.isFetched && !clawHistoryQuery.isError))
const historyReady =
!resolvedSessionKey || historyQuery.isFetched || historyQuery.isError
(isAgentHarnessAgent &&
(harnessHistoryQuery.isFetched || harnessHistoryQuery.isError)) ||
(!isAgentHarnessAgent &&
(clawHistoryQuery.isFetched || clawHistoryQuery.isError))
const initialMessageKey = initialMessage
? `${agentId}:${initialMessage}`
: null
const error = sessionQuery.error ?? historyQuery.error ?? null
const error = isAgentHarnessAgent
? (harnessHistoryQuery.error ?? null)
: (clawHistoryQuery.error ?? null)
const enqueueRef = useRef(outboundQueue.enqueue)
enqueueRef.current = outboundQueue.enqueue
const sendRef = useRef(send)
sendRef.current = send
useEffect(() => {
const query = initialMessage?.trim()
@@ -257,23 +333,24 @@ function AgentConversationController({
!query ||
initialMessageSentRef.current === initialMessageKey ||
disabled ||
sessionQuery.isLoading ||
!historyReady ||
streaming
!historyReady
) {
return
}
initialMessageSentRef.current = initialMessageKey
onInitialMessageConsumedRef.current()
void sendRef.current(query)
if (isAgentHarnessAgent) {
void sendRef.current({ text: query })
} else {
enqueueRef.current({ text: query })
}
}, [
disabled,
historyReady,
initialMessage,
initialMessageKey,
sessionQuery.isLoading,
streaming,
isAgentHarnessAgent,
])
const handleSelectAgent = (entry: AgentEntry) => {
@@ -285,18 +362,29 @@ function AgentConversationController({
<ClawChat
agentName={agentName}
historyMessages={historyMessages}
turns={turns}
turns={visibleTurns}
streaming={streaming}
isInitialLoading={isInitialLoading}
isInitialLoading={
isAgentHarnessAgent ? harnessHistoryQuery.isLoading : isInitialLoading
}
error={error}
hasNextPage={Boolean(historyQuery.hasNextPage)}
isFetchingNextPage={historyQuery.isFetchingNextPage}
hasNextPage={
isAgentHarnessAgent ? false : Boolean(clawHistoryQuery.hasNextPage)
}
isFetchingNextPage={
isAgentHarnessAgent ? false : clawHistoryQuery.isFetchingNextPage
}
onFetchNextPage={() => {
void historyQuery.fetchNextPage()
if (!isAgentHarnessAgent) {
void clawHistoryQuery.fetchNextPage()
}
}}
onRetry={() => {
void sessionQuery.refetch()
void historyQuery.refetch()
if (isAgentHarnessAgent) {
void harnessHistoryQuery.refetch()
} else {
void clawHistoryQuery.refetch()
}
}}
/>
@@ -307,14 +395,41 @@ function AgentConversationController({
agents={agents}
selectedAgentId={agentId}
onSelectAgent={handleSelectAgent}
onSend={(text) => {
void send(text)
onSend={(input) => {
const attachments = input.attachments.map((a) => a.payload)
const attachmentPreviews = input.attachments.map((a) => ({
id: a.id,
kind: a.kind,
mediaType: a.mediaType,
name: a.name,
dataUrl: a.dataUrl,
}))
if (isAgentHarnessAgent) {
void send({ text: input.text, attachments, attachmentPreviews })
} else {
outboundQueue.enqueue({
text: input.text,
attachments,
attachmentPreviews,
history: chatHistory,
})
}
}}
onCreateAgent={() => navigate(createAgentPath)}
streaming={streaming}
disabled={disabled}
status={status?.status}
status={isAgentHarnessAgent ? 'running' : status?.status}
attachmentsEnabled={!isAgentHarnessAgent}
placeholder={`Message ${agentName}...`}
outboundQueue={
isAgentHarnessAgent ? undefined : outboundQueue.queue
}
onCancelQueued={
isAgentHarnessAgent ? undefined : outboundQueue.cancel
}
onRetryQueued={
isAgentHarnessAgent ? undefined : outboundQueue.retry
}
/>
</div>
</div>
@@ -343,7 +458,7 @@ export const AgentCommandConversation: FC<AgentCommandConversationProps> = ({
const resolvedAgentId = agentId ?? ''
const agent = agents.find((entry) => entry.agentId === resolvedAgentId)
const agentName = agent?.name || resolvedAgentId || 'Agent'
const agentMeta = getModelDisplayName(agent?.model) ?? 'OpenClaw agent'
const agentMeta = getAgentEntryMeta(agent)
const initialMessage = searchParams.get('q')
const isPageVariant = variant === 'page'
const backLabel = isPageVariant ? 'Back to agents' : 'Back to home'
@@ -356,7 +471,10 @@ export const AgentCommandConversation: FC<AgentCommandConversationProps> = ({
navigate(`${agentPathPrefix}/${entry.agentId}`)
}
const statusCopy = getConversationStatusCopy(status?.status)
const statusCopy =
agent?.source === 'agent-harness'
? 'Ready'
: getConversationStatusCopy(status?.status)
return (
<div className="absolute inset-0 overflow-hidden bg-background md:pl-[theme(spacing.14)]">


@@ -1,4 +1,4 @@
import { ArrowRight, Bot, Plus, Settings2 } from 'lucide-react'
import { Plus } from 'lucide-react'
import { type FC, useEffect, useState } from 'react'
import { useNavigate } from 'react-router'
import { Button } from '@/components/ui/button'
@@ -8,37 +8,11 @@ import type { AgentEntry } from '@/entrypoints/app/agents/useOpenClaw'
import { ImportDataHint } from '@/entrypoints/newtab/index/ImportDataHint'
import { SignInHint } from '@/entrypoints/newtab/index/SignInHint'
import { useActiveHint } from '@/entrypoints/newtab/index/useActiveHint'
import type { AgentCardData } from '@/lib/agent-conversations/types'
import { AgentCardDock } from './AgentCardDock'
import { useAgentCommandData } from './agent-command-layout'
import { ConversationInput } from './ConversationInput'
import { useAgentCardData } from './useAgentCardData'
function AgentCommandSetupState({
onOpenAgents,
}: {
onOpenAgents: () => void
}) {
return (
<Card className="border-border/60 bg-card/90 shadow-sm">
<CardContent className="flex flex-col items-center gap-4 p-8 text-center">
<div className="flex size-12 items-center justify-center rounded-2xl bg-muted text-muted-foreground">
<Bot className="size-5" />
</div>
<div className="space-y-2">
<h2 className="font-semibold text-lg">Set up your first agent</h2>
<p className="max-w-md text-muted-foreground text-sm leading-6">
Connect OpenClaw and create an agent before using the new tab as
your workspace.
</p>
</div>
<Button onClick={onOpenAgents} className="gap-2 rounded-xl">
Open Agent Setup
<ArrowRight className="size-4" />
</Button>
</CardContent>
</Card>
)
}
import { buildAgentCardData } from './useAgentCardData'
function EmptyAgentsState({ onOpenAgents }: { onOpenAgents: () => void }) {
return (
@@ -61,33 +35,6 @@ function EmptyAgentsState({ onOpenAgents }: { onOpenAgents: () => void }) {
)
}
function OpenClawUnavailableState({
onOpenAgents,
}: {
onOpenAgents: () => void
}) {
return (
<Card className="border-border/60 bg-card/90 shadow-sm">
<CardContent className="flex flex-col items-center gap-4 p-8 text-center">
<div className="flex size-12 items-center justify-center rounded-2xl bg-muted text-muted-foreground">
<Settings2 className="size-5" />
</div>
<div className="space-y-2">
<h2 className="font-semibold text-lg">OpenClaw is unavailable</h2>
<p className="max-w-md text-muted-foreground text-sm leading-6">
Review your agent setup to restart the gateway or reconnect the
local service.
</p>
</div>
<Button onClick={onOpenAgents} className="gap-2 rounded-xl">
Open Agent Setup
<ArrowRight className="size-4" />
</Button>
</CardContent>
</Card>
)
}
function RecentThreads({
activeAgentId,
agents,
@@ -95,7 +42,7 @@ function RecentThreads({
onSelectAgent,
}: {
activeAgentId?: string | null
agents: ReturnType<typeof useAgentCardData>
agents: AgentCardData[]
onOpenAgents: () => void
onSelectAgent: (agentId: string) => void
}) {
@@ -132,9 +79,9 @@ function RecentThreads({
export const AgentCommandHome: FC = () => {
const navigate = useNavigate()
const activeHint = useActiveHint()
const { status, agents } = useAgentCommandData()
const { agents, status } = useAgentCommandData()
const [selectedAgentId, setSelectedAgentId] = useState<string | null>(null)
const cardData = useAgentCardData(agents, status?.status)
const cardData = buildAgentCardData(agents, status?.status, undefined)
useEffect(() => {
if (agents.length === 0) {
@@ -152,80 +99,76 @@ export const AgentCommandHome: FC = () => {
}
}, [agents, selectedAgentId])
const handleSend = (text: string) => {
const handleSend = (input: { text: string }) => {
if (!selectedAgentId) return
navigate(`/home/agents/${selectedAgentId}?q=${encodeURIComponent(text)}`)
navigate(
`/home/agents/${selectedAgentId}?q=${encodeURIComponent(input.text)}`,
)
}
const handleSelectAgent = (agent: AgentEntry) => {
setSelectedAgentId(agent.agentId)
}
const openClawStatus = status?.status
const isSetup = openClawStatus != null && openClawStatus !== 'uninitialized'
const shouldShowUnavailableState =
openClawStatus != null &&
openClawStatus !== 'running' &&
openClawStatus !== 'uninitialized' &&
cardData.length === 0
const selectedAgent = agents.find(
(agent) => agent.agentId === selectedAgentId,
)
const selectedAgentReady = selectedAgent
? selectedAgent.source === 'agent-harness' || status?.status === 'running'
: false
const selectedAgentStatus =
selectedAgent?.source === 'agent-harness' ? 'running' : status?.status
const selectedCard =
cardData.find((agent) => agent.agentId === selectedAgentId) ?? cardData[0]
return (
<div className="min-h-full px-4 py-6">
<div className="mx-auto flex w-full max-w-5xl flex-col gap-8">
{isSetup ? (
shouldShowUnavailableState ? (
<OpenClawUnavailableState
onOpenAgents={() => navigate('/agents')}
/>
) : cardData.length > 0 ? (
<>
<div className="flex flex-col items-center gap-5 pt-[max(10vh,24px)] text-center">
<div className="space-y-3">
<h1 className="font-semibold text-[clamp(2rem,4vw,3.25rem)] leading-tight tracking-tight">
What should your agent work on next?
</h1>
<p className="mx-auto max-w-2xl text-muted-foreground text-sm leading-6">
Start with a task, continue a thread, or switch to another
agent without leaving the new tab.
</p>
</div>
<div className="w-full max-w-3xl">
<ConversationInput
variant="home"
agents={agents}
selectedAgentId={selectedAgentId}
onSelectAgent={handleSelectAgent}
onSend={handleSend}
onCreateAgent={() => navigate('/agents')}
streaming={false}
disabled={status?.status !== 'running'}
status={status?.status}
placeholder={
status?.status === 'running'
? `Ask ${selectedCard?.name ?? 'your agent'} to handle a task...`
: 'OpenClaw is not running...'
}
/>
</div>
{cardData.length > 0 ? (
<>
<div className="flex flex-col items-center gap-5 pt-[max(10vh,24px)] text-center">
<div className="space-y-3">
<h1 className="font-semibold text-[clamp(2rem,4vw,3.25rem)] leading-tight tracking-tight">
What should your agent work on next?
</h1>
<p className="mx-auto max-w-2xl text-muted-foreground text-sm leading-6">
Start with a task, continue a thread, or switch to another
agent without leaving the new tab.
</p>
</div>
<Separator />
<div className="w-full max-w-3xl">
<ConversationInput
variant="home"
agents={agents}
selectedAgentId={selectedAgentId}
onSelectAgent={handleSelectAgent}
onSend={handleSend}
onCreateAgent={() => navigate('/agents')}
streaming={false}
disabled={!selectedAgentReady}
status={selectedAgentStatus}
attachmentsEnabled={false}
placeholder={
selectedAgentReady
? `Ask ${selectedCard?.name ?? 'your agent'} to handle a task...`
: 'Agent runtime is not running...'
}
/>
</div>
</div>
<RecentThreads
activeAgentId={selectedAgentId}
agents={cardData}
onOpenAgents={() => navigate('/agents')}
onSelectAgent={(agentId) => navigate(`/home/agents/${agentId}`)}
/>
</>
) : (
<EmptyAgentsState onOpenAgents={() => navigate('/agents')} />
)
<Separator />
<RecentThreads
activeAgentId={selectedAgentId}
agents={cardData}
onOpenAgents={() => navigate('/agents')}
onSelectAgent={(agentId) => navigate(`/home/agents/${agentId}`)}
/>
</>
) : (
<AgentCommandSetupState onOpenAgents={() => navigate('/agents')} />
<EmptyAgentsState onOpenAgents={() => navigate('/agents')} />
)}
</div>


@@ -1,36 +1,155 @@
import { CheckCircle2, Loader2, XCircle } from 'lucide-react'
import type { FC } from 'react'
import { CheckCircle2, Copy, Loader2, Wrench, XCircle } from 'lucide-react'
import { type FC, useCallback, useMemo } from 'react'
import {
Message,
MessageAction,
MessageActions,
MessageAttachment,
MessageAttachments,
MessageContent,
MessageResponse,
MessageToolbar,
} from '@/components/ai-elements/message'
import {
Reasoning,
ReasoningContent,
ReasoningTrigger,
} from '@/components/ai-elements/reasoning'
import {
Task,
TaskContent,
TaskItem,
TaskTrigger,
} from '@/components/ai-elements/task'
import { cn } from '@/lib/utils'
import type { ClawChatMessage as ClawChatMessageType } from './claw-chat-types'
import type {
ClawChatMessagePart,
ClawChatMessage as ClawChatMessageType,
} from './claw-chat-types'
function formatCost(usd: number): string {
if (usd < 0.005) return `$${usd.toFixed(4)}`
return `$${usd.toFixed(2)}`
}
type ToolCallPart = Extract<ClawChatMessagePart, { type: 'tool-call' }>
type AttachmentPart = Extract<ClawChatMessagePart, { type: 'attachment' }>
interface RenderEntry {
kind: 'text' | 'reasoning' | 'meta' | 'task' | 'attachments'
partIndex: number
part?: ClawChatMessagePart
tools?: ToolCallPart[]
attachments?: AttachmentPart[]
}
/**
* Build a render plan that groups all tool-call parts into a single Task
* collapsible and all attachment parts into a single attachment strip at
* their respective first-appearance positions. Other parts render in place.
*/
function buildRenderEntries(parts: ClawChatMessagePart[]): RenderEntry[] {
const entries: RenderEntry[] = []
const tools: ToolCallPart[] = []
const attachments: AttachmentPart[] = []
let taskInserted = false
let attachmentsInserted = false
parts.forEach((part, partIndex) => {
if (part.type === 'tool-call') {
tools.push(part)
if (!taskInserted) {
entries.push({ kind: 'task', partIndex, tools })
taskInserted = true
}
} else if (part.type === 'attachment') {
attachments.push(part)
if (!attachmentsInserted) {
entries.push({ kind: 'attachments', partIndex, attachments })
attachmentsInserted = true
}
} else if (part.type === 'text') {
entries.push({ kind: 'text', partIndex, part })
} else if (part.type === 'reasoning') {
entries.push({ kind: 'reasoning', partIndex, part })
} else if (part.type === 'meta') {
entries.push({ kind: 'meta', partIndex, part })
}
})
return entries
}
function ToolStatusIcon({ status }: { status: ToolCallPart['status'] }) {
if (status === 'running' || status === 'pending') {
return (
<Loader2 className="size-3.5 shrink-0 animate-spin text-muted-foreground" />
)
}
if (status === 'completed') {
return <CheckCircle2 className="size-3.5 shrink-0 text-green-500" />
}
return <XCircle className="size-3.5 shrink-0 text-destructive" />
}
interface ClawChatMessageProps {
message: ClawChatMessageType
}
export const ClawChatMessage: FC<ClawChatMessageProps> = ({ message }) => (
<Message
from={message.role}
className="max-w-full group-[.is-user]:max-w-[80%]"
>
<MessageContent className="max-w-full overflow-hidden group-[.is-assistant]:w-full group-[.is-user]:max-w-full">
{message.parts.map((part, index) => {
const key = `${message.id}-part-${index}`
export const ClawChatMessage: FC<ClawChatMessageProps> = ({ message }) => {
const messageText = message.parts
.filter((p) => p.type === 'text')
.map((p) => p.text)
.join('\n')
switch (part.type) {
case 'text':
const handleCopy = useCallback(() => {
if (messageText) navigator.clipboard.writeText(messageText)
}, [messageText])
const entries = useMemo(
() => buildRenderEntries(message.parts),
[message.parts],
)
return (
<Message
from={message.role}
className="max-w-full group-[.is-user]:max-w-[80%]"
>
<MessageContent className="max-w-full overflow-hidden group-[.is-assistant]:w-full group-[.is-user]:max-w-full">
{entries.map((entry) => {
const key = `${message.id}-entry-${entry.partIndex}`
if (entry.kind === 'attachments' && entry.attachments) {
return (
<MessageAttachments key={key}>
{entry.attachments.map((attachment, idx) => (
<MessageAttachment
// biome-ignore lint/suspicious/noArrayIndexKey: attachment order is stable within a finalized message
key={`${attachment.kind}-${idx}`}
data={{
type: 'file',
url: attachment.dataUrl ?? '',
mediaType: attachment.mediaType,
filename: attachment.name,
}}
/>
))}
</MessageAttachments>
)
}
if (entry.kind === 'text' && entry.part?.type === 'text') {
return (
<MessageResponse
key={key}
// Historical messages are finalized — render immediately.
// Streamdown's default "streaming" mode uses an idle-callback
// debounce (300ms / 500ms idle) that paints empty content
// first, which made history flash blank tool collapsibles
// before text on every load.
mode="static"
parseIncompleteMarkdown={false}
className={cn(
'max-w-full overflow-hidden break-words',
'[&_[data-streamdown="code-block"]]:!w-full [&_[data-streamdown="code-block"]]:!max-w-full [&_[data-streamdown="code-block"]]:overflow-x-auto',
@@ -38,53 +157,92 @@ export const ClawChatMessage: FC<ClawChatMessageProps> = ({ message }) => (
'[&_table]:w-max [&_table]:min-w-full',
)}
>
{part.text}
{entry.part.text}
</MessageResponse>
)
}
case 'reasoning':
if (entry.kind === 'reasoning' && entry.part?.type === 'reasoning') {
return (
<Reasoning key={key} className="w-full" defaultOpen={false}>
<Reasoning
key={key}
className="w-full"
defaultOpen={false}
duration={entry.part.duration}
>
<ReasoningTrigger />
<ReasoningContent>{part.text}</ReasoningContent>
<ReasoningContent>{entry.part.text}</ReasoningContent>
</Reasoning>
)
}
case 'tool-call':
return (
<div
key={key}
className="flex items-center gap-2 rounded-md border px-3 py-2 text-sm"
>
{part.status === 'running' || part.status === 'pending' ? (
<Loader2 className="size-3.5 animate-spin text-muted-foreground" />
) : null}
{part.status === 'completed' ? (
<CheckCircle2 className="size-3.5 text-green-500" />
) : null}
{part.status === 'failed' ? (
<XCircle className="size-3.5 text-destructive" />
) : null}
<span className="font-mono text-xs">{part.name}</span>
{part.error ? (
<span className="ml-auto text-destructive text-xs">
{part.error}
</span>
) : null}
</div>
)
case 'meta':
if (entry.kind === 'meta' && entry.part?.type === 'meta') {
return (
<div key={key} className="text-muted-foreground text-xs">
{part.label}: {part.value}
{entry.part.label}: {entry.part.value}
</div>
)
}
default:
return null
}
})}
</MessageContent>
</Message>
)
if (entry.kind === 'task' && entry.tools) {
const tools = entry.tools
const errorCount = tools.filter((t) => t.status === 'failed').length
const taskTitle = `Agent activity (${tools.length} ${tools.length === 1 ? 'action' : 'actions'}${errorCount > 0 ? `, ${errorCount} failed` : ''})`
return (
<Task key={key} defaultOpen={false}>
<TaskTrigger title={taskTitle} TriggerIcon={Wrench} />
<TaskContent>
{tools.map((tool, idx) => (
<TaskItem
// biome-ignore lint/suspicious/noArrayIndexKey: tool order is stable within a finalized historical message
key={`${tool.name}-${tool.status}-${idx}`}
className="flex items-center gap-2"
>
<ToolStatusIcon status={tool.status} />
<span className="text-foreground text-xs">
{tool.label}
</span>
{tool.subject ? (
<span className="ml-1.5 truncate text-muted-foreground/70 text-xs">
· {tool.subject}
</span>
) : null}
{tool.error ? (
<span className="ml-2 truncate text-destructive text-xs">
{tool.error}
</span>
) : null}
{tool.durationMs != null ? (
<span className="ml-auto text-muted-foreground/60 text-xs tabular-nums">
{(tool.durationMs / 1000).toFixed(1)}s
</span>
) : null}
</TaskItem>
))}
</TaskContent>
</Task>
)
}
return null
})}
{message.role === 'assistant' && messageText ? (
<MessageToolbar>
<MessageActions>
<MessageAction tooltip="Copy" onClick={handleCopy}>
<Copy className="size-3.5" />
</MessageAction>
</MessageActions>
{message.costUsd ? (
<span className="text-[11px] text-muted-foreground/50 tabular-nums">
{formatCost(message.costUsd)}
</span>
) : null}
</MessageToolbar>
) : null}
</MessageContent>
</Message>
)
}


@@ -1,14 +1,20 @@
import {
AlertTriangle,
ArrowRight,
Bot,
ChevronDown,
FileText,
Folder,
Layers,
Loader2,
Mic,
Paperclip,
RefreshCw,
Square,
X,
} from 'lucide-react'
import {
type DragEvent,
type FC,
type ReactNode,
useEffect,
@@ -24,6 +30,7 @@ import { Textarea } from '@/components/ui/textarea'
import type { AgentEntry } from '@/entrypoints/app/agents/useOpenClaw'
import { McpServerIcon } from '@/entrypoints/app/connect-mcp/McpServerIcon'
import { useGetUserMCPIntegrations } from '@/entrypoints/app/connect-mcp/useGetUserMCPIntegrations'
import { type StagedAttachment, stageAttachments } from '@/lib/attachments'
import { Feature } from '@/lib/browseros/capabilities'
import { useCapabilities } from '@/lib/browseros/useCapabilities'
import { useMcpServers } from '@/lib/mcp/mcpServerStorage'
@@ -31,18 +38,34 @@ import { cn } from '@/lib/utils'
import { useVoiceInput } from '@/lib/voice/useVoiceInput'
import { useWorkspace } from '@/lib/workspace/use-workspace'
import { AgentSelector } from './AgentSelector'
import type { OutboundMessage } from './useOutboundQueue'
export interface ConversationInputSendInput {
text: string
attachments: StagedAttachment[]
}
interface ConversationInputProps {
agents: AgentEntry[]
selectedAgentId: string | null
onSelectAgent: (agent: AgentEntry) => void
onSend: (text: string) => void
onSend: (input: ConversationInputSendInput) => void
onCreateAgent?: () => void
streaming: boolean
disabled?: boolean
status?: string
placeholder?: string
attachmentsEnabled?: boolean
variant?: 'home' | 'conversation'
// Outbound queue: when present, the composer renders the queue strip
// above the textarea and lets the user keep sending while a previous
// turn is in flight. Optional so non-conversation variants (the home
// page) can opt out — the queue only makes sense in the conversation
// page where each enqueued message will eventually be delivered to the
// active agent.
outboundQueue?: OutboundMessage[]
onCancelQueued?: (id: string) => void
onRetryQueued?: (id: string) => void
}
function InputActionButton({
@@ -131,6 +154,9 @@ function ContextControls({
onToggleTab,
showAgentSelector,
status,
onAttachClick,
attachDisabled,
attachmentsEnabled,
}: {
agents: AgentEntry[]
onCreateAgent?: () => void
@@ -140,6 +166,9 @@ function ContextControls({
onToggleTab: (tab: chrome.tabs.Tab) => void
showAgentSelector: boolean
status?: string
onAttachClick: () => void
attachDisabled: boolean
attachmentsEnabled: boolean
}) {
const { supports } = useCapabilities()
const { selectedFolder } = useWorkspace()
@@ -199,6 +228,20 @@ function ContextControls({
<span>Tabs</span>
</Button>
</TabPickerPopover>
<Button
type="button"
variant="ghost"
onClick={onAttachClick}
disabled={attachDisabled || !attachmentsEnabled}
title="Attach files"
className={cn(
'flex items-center gap-2 rounded-lg px-3 py-1.5 font-medium text-sm transition-all',
'bg-transparent text-muted-foreground hover:bg-accent hover:text-accent-foreground',
)}
>
<Paperclip className="h-4 w-4" />
<span>Attach</span>
</Button>
</div>
{supports(Feature.MANAGED_MCP_SUPPORT) ? (
@@ -266,11 +309,20 @@ export const ConversationInput: FC<ConversationInputProps> = ({
disabled,
status,
placeholder,
attachmentsEnabled = true,
variant = 'conversation',
outboundQueue,
onCancelQueued,
onRetryQueued,
}) => {
const [input, setInput] = useState('')
const [selectedTabs, setSelectedTabs] = useState<chrome.tabs.Tab[]>([])
const [isExpandedDraft, setIsExpandedDraft] = useState(false)
const [attachments, setAttachments] = useState<StagedAttachment[]>([])
const [attachmentError, setAttachmentError] = useState<string | null>(null)
const [isStaging, setIsStaging] = useState(false)
const [isDragOver, setIsDragOver] = useState(false)
const fileInputRef = useRef<HTMLInputElement>(null)
const voice = useVoiceInput()
const textareaRef = useRef<HTMLTextAreaElement>(null)
const selectedAgent = agents.find(
@@ -278,6 +330,32 @@ export const ConversationInput: FC<ConversationInputProps> = ({
)
const isConversation = variant === 'conversation'
const stageFiles = async (files: File[]) => {
if (files.length === 0) return
if (!attachmentsEnabled) {
setAttachmentError('Attachments are not supported for this agent yet.')
return
}
setIsStaging(true)
setAttachmentError(null)
try {
const result = await stageAttachments(files, attachments.length)
if (result.staged.length > 0) {
setAttachments((prev) => [...prev, ...result.staged])
}
if (result.errors.length > 0) {
setAttachmentError(result.errors.map((e) => e.message).join(' \u2022 '))
}
} finally {
setIsStaging(false)
}
}
const removeAttachment = (id: string) => {
setAttachments((prev) => prev.filter((a) => a.id !== id))
setAttachmentError(null)
}
useLayoutEffect(() => {
const element = textareaRef.current
if (!element) return
@@ -299,6 +377,12 @@ export const ConversationInput: FC<ConversationInputProps> = ({
}
}, [voice.transcript, voice.isTranscribing, voice])
useEffect(() => {
if (attachmentsEnabled) return
setAttachments([])
setAttachmentError(null)
}, [attachmentsEnabled])
const toggleTab = (tab: chrome.tabs.Tab) => {
setSelectedTabs((prev) => {
const isSelected = prev.some((selected) => selected.id === tab.id)
@@ -309,11 +393,75 @@ export const ConversationInput: FC<ConversationInputProps> = ({
})
}
const hasContent = input.trim().length > 0 || attachments.length > 0
const queueEnabled = outboundQueue !== undefined
const handleSend = () => {
const text = input.trim()
if (!text || streaming || disabled) return
onSend(text)
// The outbound queue accepts new messages while streaming; legacy
// direct-send callers (e.g., the home composer) keep the original
// streaming-blocks-send semantic.
if (disabled || isStaging) return
if (!queueEnabled && streaming) return
if (!text && attachments.length === 0) return
onSend({ text, attachments })
setInput('')
setAttachments([])
setAttachmentError(null)
}
const handlePaste = (event: React.ClipboardEvent<HTMLTextAreaElement>) => {
const items = event.clipboardData?.items
if (!items) return
const files: File[] = []
for (const item of items) {
if (item.kind === 'file') {
const file = item.getAsFile()
if (file) files.push(file)
}
}
if (files.length > 0) {
event.preventDefault()
void stageFiles(files)
}
}
const handleDrop = (event: DragEvent<HTMLDivElement>) => {
event.preventDefault()
setIsDragOver(false)
const files = Array.from(event.dataTransfer?.files ?? [])
if (files.length > 0) {
void stageFiles(files)
}
}
const handleDragOver = (event: DragEvent<HTMLDivElement>) => {
if (!event.dataTransfer?.types.includes('Files')) return
event.preventDefault()
setIsDragOver(true)
}
const handleDragLeave = (event: DragEvent<HTMLDivElement>) => {
if (event.currentTarget.contains(event.relatedTarget as Node | null)) {
return
}
setIsDragOver(false)
}
const openFilePicker = () => {
if (!attachmentsEnabled) {
setAttachmentError('Attachments are not supported for this agent yet.')
return
}
fileInputRef.current?.click()
}
const handleFileInputChange = (
event: React.ChangeEvent<HTMLInputElement>,
) => {
const files = Array.from(event.target.files ?? [])
event.target.value = ''
if (files.length > 0) void stageFiles(files)
}
const shell = variant === 'home' ? HomeShell : ConversationShell
@@ -321,82 +469,314 @@ export const ConversationInput: FC<ConversationInputProps> = ({
return (
<Shell>
<div
className={cn(
'flex gap-3',
variant === 'home' ? 'px-4 py-3' : 'px-4 py-3',
isExpandedDraft ? 'items-end' : 'items-center',
)}
<section
// Drag/drop on a region isn't a click affordance — wrap the
// composer in a labeled <section> so the a11y rule is satisfied
// without misrepresenting the surface as interactive.
aria-label="Message composer"
className={cn('relative', isDragOver && 'ring-2 ring-primary/60')}
onDragOver={handleDragOver}
onDragLeave={handleDragLeave}
onDrop={handleDrop}
>
<BotInputIcon variant={variant} />
<div className="flex-1">
<Textarea
ref={textareaRef}
value={input}
onChange={(event) => setInput(event.currentTarget.value)}
onKeyDown={(event) => {
if (event.key === 'Enter' && !event.shiftKey) {
event.preventDefault()
handleSend()
<input
ref={fileInputRef}
type="file"
multiple
accept="image/png,image/jpeg,image/webp,image/gif,text/*,application/json"
className="hidden"
onChange={handleFileInputChange}
/>
{attachments.length > 0 || attachmentError ? (
<AttachmentStrip
attachments={attachments}
onRemove={removeAttachment}
error={attachmentError}
/>
) : null}
{queueEnabled && outboundQueue && outboundQueue.length > 0 ? (
<OutboundQueueStrip
messages={outboundQueue}
onCancel={onCancelQueued}
onRetry={onRetryQueued}
/>
) : null}
<div
className={cn(
'flex gap-3',
variant === 'home' ? 'px-4 py-3' : 'px-4 py-3',
isExpandedDraft ? 'items-end' : 'items-center',
)}
>
<BotInputIcon variant={variant} />
<div className="flex-1">
<Textarea
ref={textareaRef}
value={input}
onChange={(event) => setInput(event.currentTarget.value)}
onKeyDown={(event) => {
if (event.key === 'Enter' && !event.shiftKey) {
event.preventDefault()
handleSend()
}
}}
onPaste={handlePaste}
rows={1}
placeholder={
voice.isTranscribing
? 'Transcribing...'
: (placeholder ??
`Message ${selectedAgent?.name ?? 'agent'}...`)
}
disabled={disabled || voice.isTranscribing}
className={cn(
'resize-none border-none bg-transparent px-0 text-[15px] shadow-none focus-visible:ring-0',
'[field-sizing:fixed]',
variant === 'home'
? 'min-h-[40px] py-2 leading-6'
: 'min-h-[40px] py-2 leading-6',
'placeholder:text-muted-foreground/80',
)}
/>
</div>
<VoiceButton
isRecording={voice.isRecording}
isTranscribing={voice.isTranscribing}
onStart={() => {
void voice.startRecording()
}}
rows={1}
placeholder={
voice.isTranscribing
? 'Transcribing...'
: (placeholder ??
`Message ${selectedAgent?.name ?? 'agent'}...`)
onStop={() => {
void voice.stopRecording()
}}
/>
<InputActionButton
disabled={
!hasContent ||
isStaging ||
!!disabled ||
voice.isRecording ||
voice.isTranscribing ||
// Only block on `streaming` for the legacy direct-send path
// (no queue). With the queue active the press always
// succeeds — it just enqueues instead of dispatching.
(!queueEnabled && streaming)
}
disabled={disabled || voice.isTranscribing}
className={cn(
'resize-none border-none bg-transparent px-0 text-[15px] shadow-none focus-visible:ring-0',
'[field-sizing:fixed]',
variant === 'home'
? 'min-h-[40px] py-2 leading-6'
: 'min-h-[40px] py-2 leading-6',
'placeholder:text-muted-foreground/80',
)}
onClick={handleSend}
// Spinner stays the user-facing "agent is busy" hint; with the
// queue active we still spin while a turn is in flight.
streaming={streaming}
/>
</div>
<VoiceButton
isRecording={voice.isRecording}
isTranscribing={voice.isTranscribing}
onStart={() => {
void voice.startRecording()
}}
onStop={() => {
void voice.stopRecording()
}}
{voice.error ? (
<div className="px-5 pb-2 text-destructive text-xs">
{voice.error}
</div>
) : null}
<ContextControls
agents={agents}
onCreateAgent={onCreateAgent}
onSelectAgent={onSelectAgent}
selectedAgentId={selectedAgentId}
selectedTabs={selectedTabs}
onToggleTab={toggleTab}
showAgentSelector={variant === 'home'}
status={status}
onAttachClick={openFilePicker}
attachDisabled={attachments.length >= 10 || isStaging || !!disabled}
attachmentsEnabled={attachmentsEnabled}
/>
<InputActionButton
disabled={
!input.trim() ||
streaming ||
!!disabled ||
voice.isRecording ||
voice.isTranscribing
}
onClick={handleSend}
streaming={streaming}
/>
</div>
{voice.error ? (
<div className="px-5 pb-2 text-destructive text-xs">{voice.error}</div>
) : null}
<ContextControls
agents={agents}
onCreateAgent={onCreateAgent}
onSelectAgent={onSelectAgent}
selectedAgentId={selectedAgentId}
selectedTabs={selectedTabs}
onToggleTab={toggleTab}
showAgentSelector={variant === 'home'}
status={status}
/>
{isDragOver ? (
<div className="pointer-events-none absolute inset-0 flex items-center justify-center rounded-[inherit] bg-background/80 font-medium text-foreground text-sm backdrop-blur-sm">
Drop files to attach
</div>
) : null}
</section>
</Shell>
)
}
function OutboundQueueStrip({
messages,
onCancel,
onRetry,
}: {
messages: OutboundMessage[]
onCancel?: (id: string) => void
onRetry?: (id: string) => void
}) {
return (
<div className="border-border/40 border-b px-4 pt-3 pb-2">
<ul className="flex flex-col gap-1">
{messages.map((message) => (
<OutboundQueueItem
key={message.id}
message={message}
onCancel={onCancel}
onRetry={onRetry}
/>
))}
</ul>
</div>
)
}
function OutboundQueueItem({
message,
onCancel,
onRetry,
}: {
message: OutboundMessage
onCancel?: (id: string) => void
onRetry?: (id: string) => void
}) {
const preview = message.text.trim() || '(attachments only)'
return (
<li className="flex items-center gap-2 rounded-md px-2 py-1 text-xs">
<OutboundQueueStatusIcon status={message.status} />
<span className="min-w-0 flex-1 truncate text-muted-foreground">
{preview}
</span>
{message.attachmentPreviews.length > 0 ? (
<span className="inline-flex items-center gap-1 text-muted-foreground/70">
<Paperclip className="size-3" />
<span className="tabular-nums">
{message.attachmentPreviews.length}
</span>
</span>
) : null}
{message.status === 'queued' && onCancel ? (
<button
type="button"
onClick={() => onCancel(message.id)}
className="ml-1 inline-flex size-5 items-center justify-center rounded-full text-muted-foreground hover:bg-accent hover:text-foreground"
aria-label="Cancel queued message"
title="Cancel"
>
<X className="size-3" />
</button>
) : null}
{message.status === 'failed' ? (
<span className="ml-1 inline-flex items-center gap-2 text-destructive">
<span className="max-w-[160px] truncate" title={message.error}>
{message.error ?? 'Failed'}
</span>
{onRetry ? (
<button
type="button"
onClick={() => onRetry(message.id)}
className="inline-flex size-5 items-center justify-center rounded-full hover:bg-accent hover:text-foreground"
aria-label="Retry failed message"
title="Retry"
>
<RefreshCw className="size-3" />
</button>
) : null}
{onCancel ? (
<button
type="button"
onClick={() => onCancel(message.id)}
className="inline-flex size-5 items-center justify-center rounded-full hover:bg-accent hover:text-foreground"
aria-label="Discard failed message"
title="Discard"
>
<X className="size-3" />
</button>
) : null}
</span>
) : null}
</li>
)
}
function OutboundQueueStatusIcon({
status,
}: {
status: OutboundMessage['status']
}) {
if (status === 'sending') {
return (
<Loader2 className="size-3.5 shrink-0 animate-spin text-muted-foreground" />
)
}
if (status === 'failed') {
return <AlertTriangle className="size-3.5 shrink-0 text-destructive" />
}
return (
<span className="inline-block size-2 shrink-0 rounded-full bg-muted-foreground/40" />
)
}
function AttachmentStrip({
attachments,
onRemove,
error,
}: {
attachments: StagedAttachment[]
onRemove: (id: string) => void
error: string | null
}) {
return (
<div className="border-border/40 border-b px-4 pt-3 pb-2">
{attachments.length > 0 ? (
<div className="flex flex-wrap gap-2">
{attachments.map((attachment) => (
<AttachmentChip
key={attachment.id}
attachment={attachment}
onRemove={() => onRemove(attachment.id)}
/>
))}
</div>
) : null}
{error ? (
<div className="mt-2 text-destructive text-xs">{error}</div>
) : null}
</div>
)
}
function AttachmentChip({
attachment,
onRemove,
}: {
attachment: StagedAttachment
onRemove: () => void
}) {
if (attachment.kind === 'image' && attachment.dataUrl) {
return (
<div className="group relative size-16 overflow-hidden rounded-md border border-border/60">
<img
src={attachment.dataUrl}
alt={attachment.name}
className="size-full object-cover"
/>
<button
type="button"
onClick={onRemove}
className="absolute top-1 right-1 inline-flex size-5 items-center justify-center rounded-full bg-background/80 text-muted-foreground opacity-0 transition-opacity hover:text-foreground group-hover:opacity-100"
aria-label={`Remove ${attachment.name}`}
>
<X className="size-3" />
</button>
</div>
)
}
return (
<div className="group flex max-w-[220px] items-center gap-2 rounded-md border border-border/60 bg-background/60 px-2 py-1.5">
<FileText className="size-4 shrink-0 text-muted-foreground" />
<span className="truncate text-xs">{attachment.name}</span>
<button
type="button"
onClick={onRemove}
className="ml-1 inline-flex size-4 items-center justify-center text-muted-foreground hover:text-foreground"
aria-label={`Remove ${attachment.name}`}
>
<X className="size-3" />
</button>
</div>
)
}
function BotInputIcon({ variant }: { variant: 'home' | 'conversation' }) {
return (
<div


@@ -1,7 +1,9 @@
import { Bot, CheckCircle2, Loader2, XCircle } from 'lucide-react'
import type { FC } from 'react'
import { Bot, CheckCircle2, Loader2, Wrench, XCircle } from 'lucide-react'
import { type FC, useMemo } from 'react'
import {
Message,
MessageAttachment,
MessageAttachments,
MessageContent,
MessageResponse,
} from '@/components/ai-elements/message'
@@ -10,96 +12,191 @@ import {
ReasoningContent,
ReasoningTrigger,
} from '@/components/ai-elements/reasoning'
import type { AgentConversationTurn } from '@/lib/agent-conversations/types'
import {
Task,
TaskContent,
TaskItem,
TaskTrigger,
} from '@/components/ai-elements/task'
import type {
AgentConversationTurn,
ToolEntry,
} from '@/lib/agent-conversations/types'
interface ConversationMessageProps {
turn: AgentConversationTurn
streaming: boolean
}
interface RenderEntry {
kind: 'thinking' | 'text' | 'task'
partIndex: number
text?: string
done?: boolean
tools?: ToolEntry[]
}
/**
* Build the render plan for an assistant turn:
* - thinking and text parts render in place
* - all tool-batch parts collapse into a single Task entry at their first
* appearance position, with tools listed in arrival order
*/
function buildRenderEntries(turn: AgentConversationTurn): RenderEntry[] {
const entries: RenderEntry[] = []
const aggregatedTools: ToolEntry[] = []
let taskInserted = false
turn.parts.forEach((part, partIndex) => {
if (part.kind === 'thinking') {
entries.push({
kind: 'thinking',
partIndex,
text: part.text,
done: part.done,
})
} else if (part.kind === 'text') {
entries.push({ kind: 'text', partIndex, text: part.text })
} else if (part.kind === 'tool-batch') {
aggregatedTools.push(...part.tools)
if (!taskInserted) {
entries.push({
kind: 'task',
partIndex,
tools: aggregatedTools,
})
taskInserted = true
}
}
})
return entries
}
function ToolStatusIcon({ status }: { status: ToolEntry['status'] }) {
if (status === 'running') {
return (
<Loader2 className="size-3.5 shrink-0 animate-spin text-muted-foreground" />
)
}
if (status === 'completed') {
return <CheckCircle2 className="size-3.5 shrink-0 text-green-500" />
}
return <XCircle className="size-3.5 shrink-0 text-destructive" />
}
export const ConversationMessage: FC<ConversationMessageProps> = ({
turn,
streaming,
}) => (
<div className="space-y-3">
<Message from="user">
<MessageContent>
<pre className="whitespace-pre-wrap font-sans text-sm">
{turn.userText}
</pre>
</MessageContent>
</Message>
}) => {
const entries = useMemo(() => buildRenderEntries(turn), [turn])
{turn.parts.length > 0 && (
<Message from="assistant">
return (
<div className="space-y-3">
<Message from="user">
<MessageContent>
{turn.parts.map((part, i) => {
const key = `${turn.id}-part-${i}`
{turn.userAttachments && turn.userAttachments.length > 0 && (
<MessageAttachments>
{turn.userAttachments.map((attachment) => (
<MessageAttachment
key={attachment.id}
data={{
type: 'file',
url: attachment.dataUrl ?? '',
mediaType: attachment.mediaType,
filename: attachment.name,
}}
/>
))}
</MessageAttachments>
)}
{turn.userText && (
<pre className="whitespace-pre-wrap font-sans text-sm">
{turn.userText}
</pre>
)}
</MessageContent>
</Message>
switch (part.kind) {
case 'thinking':
{entries.length > 0 && (
<Message from="assistant">
<MessageContent>
{entries.map((entry) => {
const key = `${turn.id}-entry-${entry.partIndex}`
if (entry.kind === 'thinking') {
return (
<Reasoning
key={key}
className="w-full"
isStreaming={!part.done}
defaultOpen={!part.done}
isStreaming={!entry.done}
defaultOpen={!entry.done}
>
<ReasoningTrigger />
<ReasoningContent>{part.text}</ReasoningContent>
<ReasoningContent>{entry.text ?? ''}</ReasoningContent>
</Reasoning>
)
}
case 'tool-batch':
if (entry.kind === 'text') {
return (
<div key={key} className="w-full space-y-1">
{part.tools.map((tool) => (
<div
<MessageResponse key={key}>
{entry.text ?? ''}
</MessageResponse>
)
}
const tools = entry.tools ?? []
const allDone = tools.every((t) => t.status !== 'running')
const taskTitle = allDone
? `Agent activity (${tools.length} ${tools.length === 1 ? 'action' : 'actions'})`
: `Working… (${tools.length} ${tools.length === 1 ? 'action' : 'actions'})`
return (
<Task key={key} defaultOpen={!turn.done}>
<TaskTrigger title={taskTitle} TriggerIcon={Wrench} />
<TaskContent>
{tools.map((tool) => (
<TaskItem
key={tool.id}
className="flex items-center gap-2 rounded-md border px-3 py-2 text-sm"
className="flex items-center gap-2"
>
{tool.status === 'running' && (
<Loader2 className="size-3.5 animate-spin text-muted-foreground" />
)}
{tool.status === 'completed' && (
<CheckCircle2 className="size-3.5 text-green-500" />
)}
{tool.status === 'error' && (
<XCircle className="size-3.5 text-destructive" />
)}
<span className="font-mono text-xs">{tool.name}</span>
<ToolStatusIcon status={tool.status} />
<span className="text-foreground text-xs">
{tool.label}
</span>
{tool.subject ? (
<span className="ml-1.5 truncate text-muted-foreground/70 text-xs">
· {tool.subject}
</span>
) : null}
{tool.durationMs != null && (
<span className="ml-auto text-muted-foreground text-xs">
<span className="ml-auto text-muted-foreground/60 text-xs tabular-nums">
{(tool.durationMs / 1000).toFixed(1)}s
</span>
)}
</div>
</TaskItem>
))}
</div>
)
</TaskContent>
</Task>
)
})}
</MessageContent>
</Message>
)}
case 'text':
return <MessageResponse key={key}>{part.text}</MessageResponse>
default:
return null
}
})}
</MessageContent>
</Message>
)}
{!turn.done && turn.parts.length === 0 && streaming && (
<div className="flex gap-2">
<div className="flex size-7 shrink-0 items-center justify-center rounded-full bg-[var(--accent-orange)] text-white">
<Bot className="size-3.5" />
{!turn.done && turn.parts.length === 0 && streaming && (
<div className="flex gap-2">
<div className="flex size-7 shrink-0 items-center justify-center rounded-full bg-[var(--accent-orange)] text-white">
<Bot className="size-3.5" />
</div>
<div className="flex items-center gap-1 rounded-xl rounded-tl-none border border-border/50 bg-card px-3 py-2.5 shadow-sm">
<span className="size-1.5 animate-bounce rounded-full bg-[var(--accent-orange)] [animation-delay:-0.3s]" />
<span className="size-1.5 animate-bounce rounded-full bg-[var(--accent-orange)] [animation-delay:-0.15s]" />
<span className="size-1.5 animate-bounce rounded-full bg-[var(--accent-orange)]" />
</div>
</div>
<div className="flex items-center gap-1 rounded-xl rounded-tl-none border border-border/50 bg-card px-3 py-2.5 shadow-sm">
<span className="size-1.5 animate-bounce rounded-full bg-[var(--accent-orange)] [animation-delay:-0.3s]" />
<span className="size-1.5 animate-bounce rounded-full bg-[var(--accent-orange)] [animation-delay:-0.15s]" />
<span className="size-1.5 animate-bounce rounded-full bg-[var(--accent-orange)]" />
</div>
</div>
)}
</div>
)
)}
</div>
)
}


@@ -1,8 +1,11 @@
import type { FC } from 'react'
import { Outlet, useOutletContext } from 'react-router'
import { useHarnessAgents } from '@/entrypoints/app/agents/useAgents'
import type {
AgentEntry,
OpenClawStatus,
} from '@/entrypoints/app/agents/useOpenClaw'
import {
type AgentEntry,
type OpenClawStatus,
useOpenClawAgents,
useOpenClawStatus,
} from '@/entrypoints/app/agents/useOpenClaw'
@@ -16,16 +19,24 @@ interface AgentCommandContextValue {
export const AgentCommandLayout: FC = () => {
const { status, loading: statusLoading } = useOpenClawStatus(5000)
const { agents, loading: agentsLoading } = useOpenClawAgents(
status?.status === 'running' && status.controlPlaneStatus === 'connected',
)
const openClawEnabled =
status?.status === 'running' && status.controlPlaneStatus === 'connected'
const { agents: openClawAgents, loading: openClawAgentsLoading } =
useOpenClawAgents(openClawEnabled)
const { agents: harnessAgents, loading: harnessAgentsLoading } =
useHarnessAgents()
const visibleOpenClawAgents = openClawEnabled ? openClawAgents : []
const agents = [...visibleOpenClawAgents, ...harnessAgents]
return (
<Outlet
context={
{
agents,
agentsLoading,
agentsLoading:
harnessAgentsLoading ||
statusLoading ||
(openClawEnabled && openClawAgentsLoading),
status,
statusLoading,
} satisfies AgentCommandContextValue

View File

@@ -0,0 +1,12 @@
import { describe, expect, it } from 'bun:test'
import { mapAgentHarnessToolStatus } from './agent-stream-events'
describe('mapAgentHarnessToolStatus', () => {
it('normalizes ACP tool statuses for the chat renderer', () => {
expect(mapAgentHarnessToolStatus('running')).toBe('running')
expect(mapAgentHarnessToolStatus('completed')).toBe('completed')
expect(mapAgentHarnessToolStatus('failed')).toBe('error')
expect(mapAgentHarnessToolStatus('incomplete')).toBe('running')
expect(mapAgentHarnessToolStatus(undefined)).toBe('running')
})
})

View File

@@ -0,0 +1,19 @@
import type { ToolEntry } from '@/lib/agent-conversations/types'
export function mapAgentHarnessToolStatus(
status: string | undefined,
): ToolEntry['status'] {
if (!status) return 'running'
const normalized = status.toLowerCase()
if (['error', 'failed', 'failure', 'denied'].includes(normalized)) {
return 'error'
}
if (
['complete', 'completed', 'done', 'success', 'succeeded'].includes(
normalized,
)
) {
return 'completed'
}
return 'running'
}

View File

@@ -1,8 +1,10 @@
import { describe, expect, it } from 'bun:test'
import type { AgentConversationTurn } from '@/lib/agent-conversations/types'
import {
type AgentHistoryPageResponse,
type BrowserOSChatHistoryItem,
buildChatHistoryFromClawMessages,
filterTurnsPersistedInHistory,
flattenHistoryPages,
mapHistoryItemToClawMessage,
} from './claw-chat-types'
@@ -118,4 +120,64 @@ describe('claw-chat-types', () => {
{ role: 'assistant', content: 'Assistant answer' },
])
})
it('hides completed live turns once harness history contains the same turn', () => {
const turn: AgentConversationTurn = {
id: 'live-turn',
userText: 'hello',
parts: [{ kind: 'text', text: 'hi there' }],
done: true,
timestamp: 1_000,
}
const visible = filterTurnsPersistedInHistory(
[turn],
[
{
id: 'history-user',
role: 'user',
sessionKey: 'main',
timestamp: 1_050,
status: 'historical',
parts: [{ type: 'text', text: 'hello' }],
},
{
id: 'history-assistant',
role: 'assistant',
sessionKey: 'main',
timestamp: 1_100,
status: 'historical',
parts: [{ type: 'text', text: 'hi there' }],
},
],
)
expect(visible).toEqual([])
})
it('keeps completed live turns until matching assistant history arrives', () => {
const turn: AgentConversationTurn = {
id: 'live-turn',
userText: 'hello',
parts: [{ kind: 'text', text: 'hi there' }],
done: true,
timestamp: 1_000,
}
const visible = filterTurnsPersistedInHistory(
[turn],
[
{
id: 'history-user',
role: 'user',
sessionKey: 'main',
timestamp: 1_050,
status: 'historical',
parts: [{ type: 'text', text: 'hello' }],
},
],
)
expect(visible).toEqual([turn])
})
})

View File

@@ -1,4 +1,5 @@
import type { OpenClawChatHistoryMessage } from '@/entrypoints/app/agents/useOpenClaw'
import type { AgentConversationTurn } from '@/lib/agent-conversations/types'
export type ClawChatRole = 'user' | 'assistant'
@@ -17,11 +18,31 @@ export interface BrowserOSOpenClawSession {
modelProvider?: string
}
export interface AgentSessionResponse {
agentId: string
exists: boolean
sessionKey: string | null
session: BrowserOSOpenClawSession | null
}
export interface BrowserOSChatHistoryToolCall {
toolCallId?: string
toolName: string
label: string
subject?: string
status: 'completed' | 'failed'
input?: Record<string, unknown>
output?: string
error?: string
durationMs?: number
}
export interface BrowserOSChatHistoryReasoning {
text: string
durationMs?: number
}
export interface BrowserOSChatHistoryAttachment {
kind: 'image' | 'file'
mediaType: string
// Images carry a `data:` URL so we can render directly without any
// additional fetch; files (text/PDF) currently round-trip via inline
// text in the message body and do not populate this field in v1.
dataUrl?: string
name?: string
}
export interface BrowserOSChatHistoryItem {
@@ -32,6 +53,12 @@ export interface BrowserOSChatHistoryItem {
messageSeq: number
sessionKey: string
source: ClawChatSource
costUsd?: number
tokensIn?: number
tokensOut?: number
toolCalls?: BrowserOSChatHistoryToolCall[]
reasoning?: BrowserOSChatHistoryReasoning
attachments?: BrowserOSChatHistoryAttachment[]
}
export interface AgentHistoryPageResponse {
@@ -58,10 +85,20 @@ export type ClawChatMessagePart =
| {
type: 'tool-call'
name: string
label: string
subject?: string
status: 'pending' | 'running' | 'completed' | 'failed'
input?: unknown
output?: unknown
error?: string
durationMs?: number
}
| {
type: 'attachment'
kind: 'image' | 'file'
mediaType: string
dataUrl?: string
name?: string
}
| { type: 'meta'; label: string; value: string }
@@ -74,11 +111,70 @@ export interface ClawChatMessage {
messageSeq?: number
status?: ClawChatMessageStatus
parts: ClawChatMessagePart[]
costUsd?: number
tokensIn?: number
tokensOut?: number
}
export function mapHistoryItemToClawMessage(
item: BrowserOSChatHistoryItem,
): ClawChatMessage {
const parts: ClawChatMessagePart[] = []
// Attachments first — they belong above the text in user messages and
// never appear on assistant messages today (assistant images come back
// through tool results, which render via the Task collapsible).
if (item.attachments && item.attachments.length > 0) {
for (const attachment of item.attachments) {
parts.push({
type: 'attachment',
kind: attachment.kind,
mediaType: attachment.mediaType,
dataUrl: attachment.dataUrl,
name: attachment.name,
})
}
}
// Reasoning, then tool calls, then text — the chronological order the
// agent produced them (think → act → answer).
if (item.reasoning && item.reasoning.text.trim().length > 0) {
// 0ms means thinking and the final answer were emitted in the same JSONL
// line (no tool calls between them) — there's no real elapsed wall-clock,
// so fall through to the "Thinking" trigger instead of "Thought for 0
// seconds" / streaming shimmer. Real multi-line turns floor at 1s.
const durationMs = item.reasoning.durationMs ?? 0
const duration =
durationMs > 0 ? Math.max(1, Math.round(durationMs / 1000)) : undefined
parts.push({
type: 'reasoning',
text: item.reasoning.text,
duration,
})
}
if (item.toolCalls && item.toolCalls.length > 0) {
for (const tc of item.toolCalls) {
parts.push({
type: 'tool-call',
name: tc.toolName,
label: tc.label,
subject: tc.subject,
status: tc.status,
input: tc.input,
output: tc.output,
error: tc.error,
durationMs: tc.durationMs,
})
}
}
// Only emit a text part when there's actual content. User messages with
// only attachments and no caption shouldn't render an empty bubble.
if (item.text.trim().length > 0) {
parts.push({ type: 'text', text: item.text })
}
return {
id: item.id,
role: item.role,
@@ -87,7 +183,10 @@ export function mapHistoryItemToClawMessage(
source: item.source,
messageSeq: item.messageSeq,
status: 'historical',
parts: [{ type: 'text', text: item.text }],
parts,
costUsd: item.costUsd,
tokensIn: item.tokensIn,
tokensOut: item.tokensOut,
}
}
@@ -123,3 +222,66 @@ export function buildChatHistoryFromClawMessages(
Boolean(message),
)
}
const TURN_HISTORY_MATCH_WINDOW_MS = 5_000
export function filterTurnsPersistedInHistory(
turns: AgentConversationTurn[],
historyMessages: ClawChatMessage[],
): AgentConversationTurn[] {
return turns.filter(
(turn) => !isTurnPersistedInHistory(turn, historyMessages),
)
}
function isTurnPersistedInHistory(
turn: AgentConversationTurn,
historyMessages: ClawChatMessage[],
): boolean {
if (!turn.done) return false
const assistantText = getTurnAssistantText(turn)
if (!assistantText) return false
const minTimestamp = turn.timestamp - TURN_HISTORY_MATCH_WINDOW_MS
const userText = turn.userText.trim()
const userPersisted =
!userText ||
historyMessages.some(
(message) =>
message.role === 'user' &&
isHistoryMessageAfter(message, minTimestamp) &&
getClawMessageText(message) === userText,
)
const assistantPersisted = historyMessages.some(
(message) =>
message.role === 'assistant' &&
isHistoryMessageAfter(message, minTimestamp) &&
getClawMessageText(message) === assistantText,
)
return userPersisted && assistantPersisted
}
function isHistoryMessageAfter(
message: ClawChatMessage,
minTimestamp: number,
): boolean {
return message.timestamp == null || message.timestamp >= minTimestamp
}
function getTurnAssistantText(turn: AgentConversationTurn): string {
return turn.parts
.filter((part) => part.kind === 'text')
.map((part) => part.text)
.join('')
.trim()
}
function getClawMessageText(message: ClawChatMessage): string {
return message.parts
.filter((part) => part.type === 'text')
.map((part) => part.text)
.join('')
.trim()
}
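The 5-second match window above can be sketched in isolation. This is a simplified illustration of the assistant-side check only (the real `isTurnPersistedInHistory` also requires the user half of the turn to appear in history); the types and names here are trimmed for the sketch and are not the real ones.

```typescript
// Minimal sketch of the turn-dedup rule: a completed live turn counts as
// persisted only once history contains an assistant message with the same
// trimmed text, timestamped no earlier than the turn start minus the 5s
// window (messages without a timestamp always pass the window check).
interface SimpleMessage {
  role: 'user' | 'assistant'
  text: string
  timestamp?: number
}

const MATCH_WINDOW_MS = 5_000

function isPersisted(
  turn: { assistantText: string; timestamp: number; done: boolean },
  history: SimpleMessage[],
): boolean {
  if (!turn.done) return false
  const minTimestamp = turn.timestamp - MATCH_WINDOW_MS
  return history.some(
    (m) =>
      m.role === 'assistant' &&
      (m.timestamp == null || m.timestamp >= minTimestamp) &&
      m.text.trim() === turn.assistantText.trim(),
  )
}
```

The window guards against matching a stale history message with identical text from an earlier, unrelated turn.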

View File

@@ -1,69 +1,53 @@
import { useEffect, useState } from 'react'
import {
type AgentEntry,
getModelDisplayName,
type OpenClawStatus,
} from '@/entrypoints/app/agents/useOpenClaw'
import { getLatestConversation } from '@/lib/agent-conversations/storage'
import type { AgentCardData } from '@/lib/agent-conversations/types'
import type { AgentOverview } from './useAgentDashboard'
function getAgentStatusTone(
status: OpenClawStatus['status'] | undefined,
function resolveAgentStatus(
gatewayStatus: OpenClawStatus['status'] | undefined,
liveStatus: AgentOverview['status'] | undefined,
): AgentCardData['status'] {
if (status === 'error') return 'error'
if (status === 'starting') return 'working'
// Gateway-level errors take precedence
if (gatewayStatus === 'error') return 'error'
if (gatewayStatus === 'starting') return 'working'
// Per-agent live status from the WS observer
if (liveStatus === 'working') return 'working'
if (liveStatus === 'error') return 'error'
return 'idle'
}
async function getAgentCardData(
agent: AgentEntry,
status: OpenClawStatus['status'] | undefined,
): Promise<AgentCardData> {
const conversation = await getLatestConversation(agent.agentId)
const lastTurn = conversation?.turns[conversation.turns.length - 1]
const lastTextPart = lastTurn?.parts.findLast((part) => part.kind === 'text')
return {
agentId: agent.agentId,
name: agent.name,
model: getModelDisplayName(agent.model),
status: getAgentStatusTone(status),
lastMessage:
lastTextPart?.kind === 'text'
? lastTextPart.text.slice(0, 120)
: undefined,
lastMessageTimestamp: lastTurn?.timestamp,
}
}
export function useAgentCardData(
/**
* Build agent card display data by merging the raw agent entries from
* the gateway with enriched overview data from the dashboard API.
*
* Pure function — no hooks, no IndexedDB, no async.
*/
export function buildAgentCardData(
agents: AgentEntry[],
status: OpenClawStatus['status'] | undefined,
) {
const [cardData, setCardData] = useState<AgentCardData[]>([])
dashboard: AgentOverview[] | undefined,
): AgentCardData[] {
return agents.map((agent) => {
const overview = dashboard?.find((d) => d.agentId === agent.agentId)
useEffect(() => {
let active = true
const loadCardData = async () => {
const nextCardData = await Promise.all(
agents.map((agent) => getAgentCardData(agent, status)),
)
if (active) {
setCardData(nextCardData)
}
return {
agentId: agent.agentId,
name: agent.name,
model: getModelDisplayName(agent.model),
status:
agent.source === 'agent-harness'
? 'idle'
: resolveAgentStatus(status, overview?.status),
lastMessage: overview?.latestMessage?.slice(0, 200) ?? undefined,
lastMessageTimestamp: overview?.latestMessageAt ?? undefined,
activitySummary: overview?.activitySummary ?? undefined,
currentTool: overview?.currentTool ?? undefined,
costUsd: overview?.totalCostUsd ?? undefined,
}
if (agents.length > 0) {
void loadCardData()
} else {
setCardData([])
}
return () => {
active = false
}
}, [agents, status])
return cardData
})
}
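The hook-to-pure-function refactor above makes the card merge trivially testable. A simplified sketch of the same join shape, with abbreviated illustrative types rather than the real `AgentEntry`/`AgentOverview`:

```typescript
// Sketch of the card merge: gateway agent entries joined with optional
// per-agent overview rows from the dashboard API by agentId, falling
// back to 'idle' and an empty last message when no overview exists.
interface Agent { agentId: string; name: string }
interface Overview {
  agentId: string
  status: 'working' | 'idle' | 'error'
  latestMessage: string | null
}

function buildCards(agents: Agent[], dashboard?: Overview[]) {
  return agents.map((agent) => {
    const overview = dashboard?.find((d) => d.agentId === agent.agentId)
    return {
      agentId: agent.agentId,
      name: agent.name,
      status: overview?.status ?? 'idle',
      lastMessage: overview?.latestMessage?.slice(0, 200) ?? undefined,
    }
  })
}
```

Because the merge is pure, it can be unit-tested without mocking IndexedDB or React effects, which the deleted `useAgentCardData` hook required.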

View File

@@ -1,4 +1,8 @@
import { useEffect, useRef, useState } from 'react'
import {
type AgentHarnessStreamEvent,
chatWithHarnessAgent,
} from '@/entrypoints/app/agents/useAgents'
import {
chatWithAgent,
type OpenClawChatHistoryMessage,
@@ -7,12 +11,28 @@ import {
import type {
AgentConversationTurn,
AssistantPart,
ToolEntry,
UserAttachmentPreview,
} from '@/lib/agent-conversations/types'
import type { ServerAttachmentPayload } from '@/lib/attachments'
import { consumeSSEStream } from '@/lib/sse'
import { buildToolLabel } from '@/lib/tool-labels'
import { mapAgentHarnessToolStatus } from './agent-stream-events'
export interface SendInput {
text: string
attachments?: ServerAttachmentPayload[]
// Optional preview metadata used to render the optimistic user turn.
// Built by the composer at staging time; the server only sees the
// payload array.
attachmentPreviews?: UserAttachmentPreview[]
}
interface UseAgentConversationOptions {
runtime?: 'openclaw' | 'agent-harness'
sessionKey?: string | null
history?: OpenClawChatHistoryMessage[]
onComplete?: () => void
onSessionKeyChange?: (sessionKey: string) => void
}
@@ -27,6 +47,7 @@ export function useAgentConversation(
const textAccRef = useRef('')
const thinkAccRef = useRef('')
const streamAbortRef = useRef<AbortController | null>(null)
const onCompleteRef = useRef(options.onComplete)
const onSessionKeyChangeRef = useRef(options.onSessionKeyChange)
useEffect(() => {
@@ -37,6 +58,10 @@ export function useAgentConversation(
historyRef.current = options.history ?? []
}, [options.history])
useEffect(() => {
onCompleteRef.current = options.onComplete
}, [options.onComplete])
useEffect(() => {
onSessionKeyChangeRef.current = options.onSessionKeyChange
}, [options.onSessionKeyChange])
@@ -60,41 +85,24 @@ export function useAgentConversation(
const processStreamEvent = (event: OpenClawStreamEvent) => {
switch (event.type) {
case 'text-delta': {
const delta = (event.data.text as string) ?? ''
textAccRef.current += delta
const text = textAccRef.current
updateCurrentTurnParts((parts) => {
const last = parts[parts.length - 1]
if (last?.kind === 'text') {
return [...parts.slice(0, -1), { ...last, text }]
}
return [...parts, { kind: 'text', text }]
})
appendTextDelta((event.data.text as string) ?? '')
break
}
case 'thinking': {
const delta = (event.data.text as string) ?? ''
thinkAccRef.current += delta
const text = thinkAccRef.current
updateCurrentTurnParts((parts) => {
const idx = parts.findIndex((p) => p.kind === 'thinking' && !p.done)
if (idx >= 0) {
return [
...parts.slice(0, idx),
{ ...parts[idx], text, done: false },
...parts.slice(idx + 1),
]
}
return [...parts, { kind: 'thinking', text, done: false }]
})
appendThinkingDelta((event.data.text as string) ?? '')
break
}
case 'tool-start': {
const rawName = (event.data.toolName as string) ?? 'unknown'
const args = event.data.args as Record<string, unknown> | undefined
const { label, subject } = buildToolLabel(rawName, args)
const tool = {
id: (event.data.toolCallId as string) ?? crypto.randomUUID(),
name: (event.data.toolName as string) ?? 'unknown',
name: rawName,
label,
subject,
status: 'running' as const,
}
updateCurrentTurnParts((parts) => {
@@ -138,16 +146,7 @@ export function useAgentConversation(
}
case 'done': {
updateCurrentTurnParts((parts) =>
parts.map((part) =>
part.kind === 'thinking' ? { ...part, done: true } : part,
),
)
setTurns((prev) => {
const last = prev[prev.length - 1]
if (!last) return prev
return [...prev.slice(0, -1), { ...last, done: true }]
})
markCurrentTurnDone()
break
}
@@ -156,21 +155,143 @@ export function useAgentConversation(
(event.data.message as string) ??
(event.data.error as string) ??
'Unknown error'
updateCurrentTurnParts((parts) => [
...parts,
{ kind: 'text', text: `Error: ${msg}` },
])
appendErrorText(msg)
break
}
}
}
const send = async (text: string) => {
if (!text.trim() || streaming) return
const appendTextDelta = (delta: string) => {
textAccRef.current += delta
const text = textAccRef.current
updateCurrentTurnParts((parts) => {
const last = parts[parts.length - 1]
if (last?.kind === 'text') {
return [...parts.slice(0, -1), { ...last, text }]
}
return [...parts, { kind: 'text', text }]
})
}
const appendThinkingDelta = (delta: string) => {
thinkAccRef.current += delta
const text = thinkAccRef.current
updateCurrentTurnParts((parts) => {
const idx = parts.findIndex((p) => p.kind === 'thinking' && !p.done)
if (idx >= 0) {
return [
...parts.slice(0, idx),
{ ...parts[idx], text, done: false },
...parts.slice(idx + 1),
]
}
return [...parts, { kind: 'thinking', text, done: false }]
})
}
const appendErrorText = (message: string) => {
updateCurrentTurnParts((parts) => [
...parts,
{ kind: 'text', text: `Error: ${message}` },
])
}
const markCurrentTurnDone = () => {
updateCurrentTurnParts((parts) =>
parts.map((part) =>
part.kind === 'thinking' ? { ...part, done: true } : part,
),
)
setTurns((prev) => {
const last = prev[prev.length - 1]
if (!last) return prev
return [...prev.slice(0, -1), { ...last, done: true }]
})
}
const upsertAgentHarnessTool = (event: AgentHarnessStreamEvent) => {
if (event.type !== 'tool_call') return
const rawName = event.title || event.rawType || 'tool call'
const { label, subject } = buildToolLabel(
rawName,
event.text ? { description: event.text } : undefined,
)
const tool: ToolEntry = {
id: event.id ?? crypto.randomUUID(),
name: rawName,
label,
subject,
status: mapAgentHarnessToolStatus(event.status),
}
updateCurrentTurnParts((parts) => {
for (let i = parts.length - 1; i >= 0; i--) {
const part = parts[i]
if (
part.kind === 'tool-batch' &&
part.tools.some((existing) => existing.id === tool.id)
) {
const tools = part.tools.map((existing) =>
existing.id === tool.id ? { ...existing, ...tool } : existing,
)
return [
...parts.slice(0, i),
{ ...part, tools },
...parts.slice(i + 1),
]
}
}
const last = parts[parts.length - 1]
if (last?.kind === 'tool-batch') {
return [
...parts.slice(0, -1),
{ ...last, tools: [...last.tools, tool] },
]
}
return [...parts, { kind: 'tool-batch', tools: [tool] }]
})
}
const processAgentHarnessStreamEvent = (event: AgentHarnessStreamEvent) => {
switch (event.type) {
case 'text_delta':
if (event.stream === 'thought') {
appendThinkingDelta(event.text)
} else {
appendTextDelta(event.text)
}
break
case 'tool_call':
upsertAgentHarnessTool(event)
break
case 'done':
markCurrentTurnDone()
break
case 'error':
appendErrorText(event.message)
break
case 'status':
break
}
}
const send = async (input: string | SendInput) => {
const normalized: SendInput =
typeof input === 'string' ? { text: input } : input
const trimmed = normalized.text.trim()
const attachments = normalized.attachments ?? []
if (streaming) return
if (!trimmed && attachments.length === 0) return
const turn: AgentConversationTurn = {
id: crypto.randomUUID(),
userText: text.trim(),
userText: trimmed,
userAttachments:
normalized.attachmentPreviews &&
normalized.attachmentPreviews.length > 0
? normalized.attachmentPreviews
: undefined,
parts: [],
done: false,
timestamp: Date.now(),
@@ -183,14 +304,20 @@ export function useAgentConversation(
streamAbortRef.current = abortController
try {
const response = await chatWithAgent(
agentId,
text.trim(),
sessionKeyRef.current || undefined,
historyRef.current,
abortController.signal,
)
const responseSessionKey = response.headers.get('X-Session-Key')
const response =
options.runtime === 'agent-harness'
? await chatWithHarnessAgent(agentId, trimmed, abortController.signal)
: await chatWithAgent(
agentId,
trimmed,
sessionKeyRef.current || undefined,
historyRef.current,
abortController.signal,
attachments,
)
const responseSessionKey =
response.headers.get('X-Session-Key') ??
response.headers.get('X-Session-Id')
if (responseSessionKey) {
sessionKeyRef.current = responseSessionKey
onSessionKeyChangeRef.current?.(responseSessionKey)
@@ -203,11 +330,19 @@ export function useAgentConversation(
])
return
}
await consumeSSEStream(
response,
processStreamEvent,
abortController.signal,
)
if (options.runtime === 'agent-harness') {
await consumeSSEStream<AgentHarnessStreamEvent>(
response,
processAgentHarnessStreamEvent,
abortController.signal,
)
} else {
await consumeSSEStream<OpenClawStreamEvent>(
response,
processStreamEvent,
abortController.signal,
)
}
} catch (err) {
if (abortController.signal.aborted) return
const msg = err instanceof Error ? err.message : String(err)
@@ -219,6 +354,7 @@ export function useAgentConversation(
if (streamAbortRef.current === abortController) {
streamAbortRef.current = null
}
onCompleteRef.current?.()
setStreaming(false)
}
}
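The extracted `appendTextDelta` helper above follows a merge-into-last-part rule. The same rule in isolation, as a pure function over plain data with no React state (the part type is simplified for the sketch):

```typescript
// Sketch of the text-delta merge: the running accumulator replaces the
// trailing text part when one exists, otherwise a fresh text part is
// appended. This keeps one growing text bubble per streamed answer
// instead of one part per delta.
type Part = { kind: 'text'; text: string } | { kind: 'thinking'; text: string }

function mergeTextDelta(parts: Part[], accumulatedText: string): Part[] {
  const last = parts[parts.length - 1]
  if (last?.kind === 'text') {
    return [...parts.slice(0, -1), { ...last, text: accumulatedText }]
  }
  return [...parts, { kind: 'text', text: accumulatedText }]
}
```

Passing the full accumulated text rather than the delta makes the update idempotent if a render replays the same state.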

View File

@@ -0,0 +1,95 @@
import { useQuery, useQueryClient } from '@tanstack/react-query'
import { useEffect } from 'react'
import { useAgentServerUrl } from '@/lib/browseros/useBrowserOSProviders'
export interface AgentOverview {
agentId: string
status: 'working' | 'idle' | 'error' | 'unknown'
latestMessage: string | null
latestMessageAt: number | null
activitySummary: string | null
currentTool: string | null
totalCostUsd: number
sessionCount: number
}
export interface DashboardResponse {
agents: AgentOverview[]
summary: {
totalAgents: number
totalCostUsd: number
}
}
interface StatusEvent {
agentId: string
status: AgentOverview['status']
currentTool: string | null
error: string | null
timestamp: number
}
const DASHBOARD_QUERY_KEY = ['claw', 'dashboard']
export function useAgentDashboard(enabled: boolean) {
const { baseUrl, isLoading: urlLoading } = useAgentServerUrl()
const queryClient = useQueryClient()
const ready = enabled && Boolean(baseUrl) && !urlLoading
// Initial data load + periodic refresh as fallback
const query = useQuery<DashboardResponse>({
queryKey: [...DASHBOARD_QUERY_KEY, baseUrl],
queryFn: async () => {
const url = new URL('/claw/dashboard', baseUrl as string)
const response = await fetch(url.toString())
if (!response.ok) throw new Error('Failed to fetch dashboard')
return response.json()
},
enabled: ready,
})
// SSE subscription for real-time status patches
useEffect(() => {
if (!ready || !baseUrl) return
const streamUrl = new URL('/claw/dashboard/stream', baseUrl)
const eventSource = new EventSource(streamUrl.toString())
eventSource.addEventListener('snapshot', (event) => {
try {
const dashboard = JSON.parse(event.data) as DashboardResponse
queryClient.setQueryData([...DASHBOARD_QUERY_KEY, baseUrl], dashboard)
} catch {}
})
eventSource.addEventListener('status', (event) => {
try {
const status = JSON.parse(event.data) as StatusEvent
queryClient.setQueryData<DashboardResponse>(
[...DASHBOARD_QUERY_KEY, baseUrl],
(prev) => {
if (!prev) return prev
return {
...prev,
agents: prev.agents.map((agent) =>
agent.agentId === status.agentId
? {
...agent,
status: status.status,
currentTool: status.currentTool,
}
: agent,
),
}
},
)
} catch {}
})
return () => {
eventSource.close()
}
}, [ready, baseUrl, queryClient])
return query
}
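The SSE `status` handler above patches a single agent into the cached snapshot via `setQueryData`. The patch itself is a pure reducer; a standalone sketch with types trimmed to the fields the patch actually touches:

```typescript
// Sketch of the dashboard status patch: apply one per-agent status event
// to a snapshot, leaving every other agent row untouched, and pass an
// undefined snapshot through unchanged (no cache entry yet).
interface AgentRow { agentId: string; status: string; currentTool: string | null }
interface Snapshot { agents: AgentRow[] }

function applyStatusPatch(
  prev: Snapshot | undefined,
  patch: { agentId: string; status: string; currentTool: string | null },
): Snapshot | undefined {
  if (!prev) return prev
  return {
    ...prev,
    agents: prev.agents.map((agent) =>
      agent.agentId === patch.agentId
        ? { ...agent, status: patch.status, currentTool: patch.currentTool }
        : agent,
    ),
  }
}
```

Returning `prev` unchanged when the cache is empty matters: it keeps the `snapshot` event as the sole authority for creating the initial dashboard state.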

View File

@@ -1,14 +1,8 @@
import { useInfiniteQuery, useQuery } from '@tanstack/react-query'
import { useInfiniteQuery } from '@tanstack/react-query'
import { useAgentServerUrl } from '@/lib/browseros/useBrowserOSProviders'
import type {
AgentHistoryPageResponse,
AgentSessionResponse,
} from './claw-chat-types'
import type { AgentHistoryPageResponse } from './claw-chat-types'
export const CLAW_CHAT_QUERY_KEYS = {
session: 'claw-agent-session',
history: 'claw-agent-history',
} as const
const HISTORY_QUERY_KEY = 'claw-agent-history'
async function fetchClawJson<T>(url: string): Promise<T> {
const response = await fetch(url)
@@ -29,38 +23,17 @@ function buildClawUrl(baseUrl: string, path: string): URL {
return new URL(`/claw${path}`, baseUrl)
}
export function useClawAgentSession(agentId: string) {
const {
baseUrl,
isLoading: urlLoading,
error: urlError,
} = useAgentServerUrl()
const query = useQuery<AgentSessionResponse, Error>({
queryKey: [CLAW_CHAT_QUERY_KEYS.session, baseUrl, agentId],
queryFn: () => {
const url = buildClawUrl(baseUrl as string, `/agents/${agentId}/session`)
return fetchClawJson<AgentSessionResponse>(url.toString())
},
enabled: Boolean(baseUrl) && !urlLoading && Boolean(agentId),
})
return {
...query,
error: query.error ?? urlError,
isLoading: query.isLoading || urlLoading,
}
}
export function useClawChatHistory({
agentId,
sessionKey,
enabled,
enabled = true,
limit = 50,
}: {
agentId: string
// null lets the server resolve the most recent user-chat session for the
// agent — avoids an extra /session round-trip and the race that came with it.
sessionKey: string | null
enabled: boolean
enabled?: boolean
limit?: number
}) {
const {
@@ -70,9 +43,9 @@ export function useClawChatHistory({
} = useAgentServerUrl()
const query = useInfiniteQuery<AgentHistoryPageResponse, Error>({
queryKey: [CLAW_CHAT_QUERY_KEYS.history, baseUrl, agentId, sessionKey],
queryKey: [HISTORY_QUERY_KEY, baseUrl, agentId, sessionKey],
initialPageParam: undefined as string | undefined,
queryFn: ({ pageParam }) => {
queryFn: async ({ pageParam }) => {
const url = buildClawUrl(baseUrl as string, `/agents/${agentId}/history`)
url.searchParams.set('limit', String(limit))
@@ -87,12 +60,7 @@ export function useClawChatHistory({
},
getNextPageParam: (lastPage) =>
lastPage.page.hasMore ? lastPage.page.cursor : undefined,
enabled:
enabled &&
Boolean(baseUrl) &&
!urlLoading &&
Boolean(agentId) &&
Boolean(sessionKey),
enabled: enabled && Boolean(baseUrl) && !urlLoading && Boolean(agentId),
})
return {

View File

@@ -0,0 +1,68 @@
import { useQuery } from '@tanstack/react-query'
import type { HarnessAgentHistoryPage } from '@/entrypoints/app/agents/agent-harness-types'
import { fetchHarnessAgentHistory } from '@/entrypoints/app/agents/useAgents'
import { useAgentServerUrl } from '@/lib/browseros/useBrowserOSProviders'
import type {
AgentHistoryPageResponse,
BrowserOSChatHistoryItem,
} from './claw-chat-types'
const HISTORY_QUERY_KEY = 'harness-agent-history'
export function useHarnessChatHistory(agentId: string, enabled = true) {
const {
baseUrl,
isLoading: urlLoading,
error: urlError,
} = useAgentServerUrl()
const query = useQuery<AgentHistoryPageResponse, Error>({
queryKey: [HISTORY_QUERY_KEY, baseUrl, agentId, 'main'],
queryFn: async () => {
return mapHarnessHistoryPage(await fetchHarnessAgentHistory(agentId))
},
enabled: Boolean(baseUrl) && !urlLoading && enabled && Boolean(agentId),
})
return {
...query,
error: query.error ?? urlError,
isLoading: query.isLoading || urlLoading,
}
}
function mapHarnessHistoryPage(
page: HarnessAgentHistoryPage,
): AgentHistoryPageResponse {
const items: BrowserOSChatHistoryItem[] = page.items.map((item, index) => ({
id: item.id,
role: item.role,
text: item.text,
timestamp: item.createdAt,
messageSeq: index + 1,
sessionKey: 'main',
source: 'user-chat',
}))
const updatedAt =
page.items.length > 0
? Math.max(...page.items.map((item) => item.createdAt))
: Date.now()
return {
agentId: page.agentId,
sessionKey: 'main',
session: {
key: 'main',
updatedAt,
sessionId: 'main',
agentId: page.agentId,
kind: 'agent-harness',
source: 'user-chat',
},
items,
page: {
hasMore: false,
limit: items.length,
},
}
}
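The page adapter above synthesizes the fields the chat renderer expects from a harness history page. The sequencing and `updatedAt` logic in isolation, with an assumed simplified item shape rather than the real `HarnessAgentHistoryPage` types:

```typescript
// Sketch of the harness-history adaptation: every item gains a 1-based
// messageSeq in page order, and updatedAt is the newest item timestamp,
// falling back to "now" for an empty page.
interface HarnessItem { id: string; createdAt: number }

function adaptPage(items: HarnessItem[], now = Date.now()) {
  const mapped = items.map((item, index) => ({
    ...item,
    messageSeq: index + 1,
  }))
  const updatedAt =
    items.length > 0 ? Math.max(...items.map((i) => i.createdAt)) : now
  return { items: mapped, updatedAt }
}
```

Taking `now` as a parameter keeps the fallback branch deterministic in tests, which the inline `Date.now()` in the hook is not.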

View File

@@ -0,0 +1,271 @@
import { useCallback, useEffect, useRef, useState } from 'react'
import type { OpenClawChatHistoryMessage } from '@/entrypoints/app/agents/useOpenClaw'
import type { UserAttachmentPreview } from '@/lib/agent-conversations/types'
import type { ServerAttachmentPayload } from '@/lib/attachments'
import { useAgentServerUrl } from '@/lib/browseros/useBrowserOSProviders'
export type OutboundMessageStatus = 'queued' | 'sending' | 'failed'
export interface OutboundMessage {
id: string
text: string
attachments: ServerAttachmentPayload[]
attachmentPreviews: UserAttachmentPreview[]
status: OutboundMessageStatus
error?: string
createdAt: number
}
export interface OutboundQueueEnqueueInput {
text: string
attachments?: ServerAttachmentPayload[]
attachmentPreviews?: UserAttachmentPreview[]
history?: OpenClawChatHistoryMessage[]
}
export interface OutboundQueueApi {
queue: OutboundMessage[]
enqueue(input: OutboundQueueEnqueueInput): void
cancel(id: string): void
retry(id: string): void
}
interface UseOutboundQueueOptions {
agentId: string | null | undefined
sessionKey?: string | null
enabled?: boolean
}
interface ServerQueuedItem {
id: string
status: 'queued' | 'dispatching' | 'failed'
message: string
attachmentsPreview: Array<{
kind: 'image' | 'file'
mediaType: string
name?: string
}>
error?: string
createdAt: number
}
function makeId(): string {
if (typeof crypto !== 'undefined' && crypto.randomUUID) {
return crypto.randomUUID()
}
return `${Date.now().toString(36)}-${Math.random().toString(36).slice(2, 10)}`
}
/**
* Server-backed outbound message queue. The browser is purely a
* projection of server state — closing the tab is safe because the queue
* keeps draining server-side via the OutboundQueueService.
*
* Single id-keyed list: the client generates the queue id and hands it
* to the server in the POST body, so the optimistic row and the SSE
* snapshot reconcile on the same key from frame zero — there is no
* window in which the message renders twice.
*/
export function useOutboundQueue(
options: UseOutboundQueueOptions,
): OutboundQueueApi {
const { agentId, enabled = true, sessionKey } = options
const { baseUrl } = useAgentServerUrl()
const sessionKeyRef = useRef<string | null | undefined>(sessionKey)
sessionKeyRef.current = sessionKey
const [items, setItems] = useState<OutboundMessage[]>([])
// Track which ids the server has confirmed seeing in any SSE snapshot.
// We use this to know whether a missing-from-snapshot id is "drained
// by the server" (drop it) or "still in flight client-side" (keep
// showing the optimistic row).
const everSeenByServerRef = useRef<Set<string>>(new Set())
// Local-only attachment previews, keyed by queue id. Data URLs never
// leave the browser — the SSE feed only carries metadata, so we hold
// them here so the chip strip keeps rendering after server takeover.
const previewMapRef = useRef<Map<string, UserAttachmentPreview[]>>(new Map())
useEffect(() => {
if (!enabled || !baseUrl || !agentId) {
setItems([])
everSeenByServerRef.current = new Set()
previewMapRef.current = new Map()
return
}
let cancelled = false
const url = `${baseUrl}/claw/agents/${encodeURIComponent(agentId)}/queue/stream`
const source = new EventSource(url)
source.onmessage = (event) => {
if (cancelled) return
try {
const parsed = JSON.parse(event.data) as { items: ServerQueuedItem[] }
const snapshotIds = new Set(parsed.items.map((item) => item.id))
for (const id of snapshotIds) everSeenByServerRef.current.add(id)
setItems((prev) => {
const next: OutboundMessage[] = parsed.items.map((item) => ({
id: item.id,
text: item.message,
attachments: [],
attachmentPreviews: previewMapRef.current.get(item.id) ?? [],
status: serverStatusToClient(item.status),
error: item.error,
createdAt: item.createdAt,
}))
// Carry forward any optimistic / failed entries the server
// doesn't know about yet (POST in flight) or has finished
// dispatching but the client wants to keep visible (failed).
const carried = prev.filter((local) => {
if (snapshotIds.has(local.id)) return false
if (everSeenByServerRef.current.has(local.id)) {
// Server saw it before and it's gone now — drained.
previewMapRef.current.delete(local.id)
return false
}
return local.status !== 'failed' || Boolean(local.error)
})
return [...carried, ...next]
})
} catch {
// Malformed event — ignore; next snapshot will recover.
}
}
source.onerror = () => {
// Auto-reconnects; nothing to do here.
}
return () => {
cancelled = true
source.close()
}
}, [baseUrl, agentId, enabled])
const enqueue = useCallback(
(input: OutboundQueueEnqueueInput) => {
if (!enabled || !baseUrl || !agentId) return
const trimmed = input.text.trim()
const attachments = input.attachments ?? []
if (!trimmed && attachments.length === 0) return
const id = makeId()
const previews = input.attachmentPreviews ?? []
previewMapRef.current.set(id, previews)
setItems((prev) => [
...prev,
{
id,
text: trimmed,
attachments,
attachmentPreviews: previews,
status: 'queued',
createdAt: Date.now(),
},
])
void (async () => {
try {
const response = await fetch(
`${baseUrl}/claw/agents/${encodeURIComponent(agentId)}/queue`,
{
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
id,
message: trimmed,
attachments: attachments.length > 0 ? attachments : undefined,
sessionKey: sessionKeyRef.current ?? undefined,
history: input.history,
}),
},
)
if (!response.ok) {
const text = await response.text().catch(() => '')
previewMapRef.current.delete(id)
setItems((prev) =>
prev.map((item) =>
item.id === id
? {
...item,
status: 'failed',
error:
text || `Failed to enqueue (status ${response.status})`,
}
: item,
),
)
}
} catch (err) {
// Only mark as failed if the SSE snapshot hasn't already
// taken ownership of the entry (i.e. the request actually
// reached the server).
if (everSeenByServerRef.current.has(id)) return
previewMapRef.current.delete(id)
setItems((prev) =>
prev.map((item) =>
item.id === id
? {
...item,
status: 'failed',
error:
err instanceof Error
? err.message
: 'Failed to enqueue message',
}
: item,
),
)
}
})()
},
[baseUrl, agentId, enabled],
)
const cancel = useCallback(
(id: string) => {
// If the server has never seen this id, just drop it locally.
if (!everSeenByServerRef.current.has(id)) {
previewMapRef.current.delete(id)
setItems((prev) => prev.filter((item) => item.id !== id))
return
}
if (!enabled || !baseUrl || !agentId) return
void fetch(
`${baseUrl}/claw/agents/${encodeURIComponent(agentId)}/queue/${encodeURIComponent(id)}`,
{ method: 'DELETE' },
).catch(() => {})
},
[baseUrl, agentId, enabled],
)
const retry = useCallback(
(id: string) => {
if (!everSeenByServerRef.current.has(id)) {
// Optimistic-only entry, never made it to the server. Reset
// status so the user can press Send again.
setItems((prev) =>
prev.map((item) =>
item.id === id
? { ...item, status: 'queued', error: undefined }
: item,
),
)
return
}
if (!enabled || !baseUrl || !agentId) return
void fetch(
`${baseUrl}/claw/agents/${encodeURIComponent(agentId)}/queue/${encodeURIComponent(id)}/retry`,
{ method: 'POST' },
).catch(() => {})
},
[baseUrl, agentId, enabled],
)
return { queue: items, enqueue, cancel, retry }
}
function serverStatusToClient(
status: ServerQueuedItem['status'],
): OutboundMessageStatus {
if (status === 'dispatching') return 'sending'
if (status === 'failed') return 'failed'
return 'queued'
}
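The mapping above collapses server-side queue states into the three client statuses. A standalone sketch of that rule, using narrowed stand-in types (the real `ServerQueuedItem['status']` union may include more states; `'queued'` is assumed to be the catch-all, as in the function above):

```typescript
// Narrowed stand-ins for the types used by the hook; any server state that
// isn't 'dispatching' or 'failed' is rendered as 'queued' on the client.
type ServerStatus = 'queued' | 'dispatching' | 'failed'
type ClientStatus = 'queued' | 'sending' | 'failed'

function serverStatusToClient(status: ServerStatus): ClientStatus {
  if (status === 'dispatching') return 'sending'
  if (status === 'failed') return 'failed'
  return 'queued'
}
```
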

View File

@@ -0,0 +1,117 @@
import { Bot, Cpu, Loader2, MessageSquare, Plus, Trash2 } from 'lucide-react'
import type { FC } from 'react'
import { Badge } from '@/components/ui/badge'
import { Button } from '@/components/ui/button'
import { Card, CardContent, CardHeader, CardTitle } from '@/components/ui/card'
import type { AgentListItem } from './agents-page-types'
interface AgentListProps {
agents: AgentListItem[]
loading: boolean
deletingAgentKey: string | null
onChatAgent: (agent: AgentListItem) => void
onCreateAgent: () => void
onDeleteAgent: (agent: AgentListItem) => void
}
export const AgentList: FC<AgentListProps> = ({
agents,
loading,
deletingAgentKey,
onChatAgent,
onCreateAgent,
onDeleteAgent,
}) => {
if (loading && agents.length === 0) {
return (
<div className="flex h-36 items-center justify-center rounded-lg border border-border/70">
<Loader2 className="size-5 animate-spin text-muted-foreground" />
</div>
)
}
if (agents.length === 0) {
return (
<Card>
<CardContent className="flex h-48 flex-col items-center justify-center gap-4 text-center">
<div className="flex size-10 items-center justify-center rounded-lg bg-muted text-muted-foreground">
<Bot className="size-5" />
</div>
<div className="space-y-1">
<h2 className="font-medium text-base">No agents</h2>
<p className="text-muted-foreground text-sm">
Create an OpenClaw, Claude Code, or Codex agent.
</p>
</div>
<Button variant="outline" onClick={onCreateAgent}>
<Plus className="mr-2 size-4" />
New Agent
</Button>
</CardContent>
</Card>
)
}
return (
<div className="grid gap-3">
{agents.map((agent) => (
<Card key={agent.key} className="rounded-lg border-border/70">
<CardHeader className="flex flex-row items-center justify-between gap-4 py-3">
<div className="flex min-w-0 items-center gap-3">
<div className="flex size-10 shrink-0 items-center justify-center rounded-lg bg-muted text-muted-foreground">
{agent.source === 'openclaw' ? (
<Cpu className="size-5" />
) : (
<Bot className="size-5" />
)}
</div>
<div className="min-w-0">
<CardTitle className="truncate text-base">
{agent.name}
</CardTitle>
<div className="mt-1 flex flex-wrap items-center gap-2 text-muted-foreground text-xs">
<Badge variant="outline" className="rounded-md">
{agent.runtimeLabel}
</Badge>
<span>{agent.modelLabel}</span>
<Badge variant="outline" className="rounded-md">
main
</Badge>
</div>
<p className="mt-1 truncate font-mono text-muted-foreground text-xs">
{agent.detail}
</p>
</div>
</div>
<div className="flex shrink-0 items-center gap-1">
<Button
variant="ghost"
size="sm"
onClick={() => onChatAgent(agent)}
disabled={!agent.canChat}
>
<MessageSquare className="mr-1 size-4" />
Chat
</Button>
{agent.canDelete ? (
<Button
variant="ghost"
size="icon"
title="Delete agent"
onClick={() => onDeleteAgent(agent)}
disabled={deletingAgentKey === agent.key}
>
{deletingAgentKey === agent.key ? (
<Loader2 className="size-4 animate-spin" />
) : (
<Trash2 className="size-4 text-destructive" />
)}
</Button>
) : null}
</div>
</CardHeader>
</Card>
))}
</div>
)
}

View File

@@ -0,0 +1,261 @@
import { AlertCircle, Loader2 } from 'lucide-react'
import type { FC } from 'react'
import { Alert, AlertDescription, AlertTitle } from '@/components/ui/alert'
import { Button } from '@/components/ui/button'
import {
Dialog,
DialogContent,
DialogFooter,
DialogHeader,
DialogTitle,
} from '@/components/ui/dialog'
import { Input } from '@/components/ui/input'
import { Label } from '@/components/ui/label'
import {
Select,
SelectContent,
SelectItem,
SelectTrigger,
SelectValue,
} from '@/components/ui/select'
import type {
HarnessAdapterDescriptor,
HarnessAgentAdapter,
} from './agent-harness-types'
import type { CreateAgentRuntime, ProviderOption } from './agents-page-types'
import { ProviderSelector } from './OpenClawControls'
import {
type OpenClawCliProvider,
type OpenClawCliProviderAuthStatus,
OpenClawCliProviderStatusPanel,
} from './openclaw-cli-providers'
interface NewAgentDialogProps {
adapters: HarnessAdapterDescriptor[]
canManageOpenClaw: boolean
createError: string | null
createRuntime: CreateAgentRuntime
creating: boolean
defaultProviderId: string
harnessAdapterId: HarnessAgentAdapter
harnessModelId: string
harnessReasoningEffort: string
name: string
open: boolean
providers: ProviderOption[]
selectedCliProvider: OpenClawCliProvider | undefined
selectedProviderId: string
cliAuthError: Error | null
cliAuthLoading: boolean
cliAuthStatus: OpenClawCliProviderAuthStatus | undefined
onConnectCliProvider: () => void
onCreate: () => void
onOpenChange: (open: boolean) => void
onRuntimeChange: (runtime: CreateAgentRuntime) => void
onHarnessAdapterChange: (adapter: HarnessAgentAdapter) => void
onHarnessModelChange: (modelId: string) => void
onHarnessReasoningChange: (reasoningEffort: string) => void
onNameChange: (name: string) => void
onProviderChange: (providerId: string) => void
}
export const NewAgentDialog: FC<NewAgentDialogProps> = ({
adapters,
canManageOpenClaw,
createError,
createRuntime,
creating,
defaultProviderId,
harnessAdapterId,
harnessModelId,
harnessReasoningEffort,
name,
open,
providers,
selectedCliProvider,
selectedProviderId,
cliAuthError,
cliAuthLoading,
cliAuthStatus,
onConnectCliProvider,
onCreate,
onOpenChange,
onRuntimeChange,
onHarnessAdapterChange,
onHarnessModelChange,
onHarnessReasoningChange,
onNameChange,
onProviderChange,
}) => {
const selectedHarnessAdapter =
adapters.find((adapter) => adapter.id === harnessAdapterId) ?? adapters[0]
const isHarnessRuntime = createRuntime !== 'openclaw'
const openClawBlocked = createRuntime === 'openclaw' && !canManageOpenClaw
const cliBlocked =
createRuntime === 'openclaw' &&
!!selectedCliProvider &&
!cliAuthStatus?.loggedIn
const canCreate =
Boolean(name.trim()) &&
!creating &&
!openClawBlocked &&
!cliBlocked &&
(createRuntime === 'openclaw'
? providers.length > 0
: Boolean(selectedHarnessAdapter))
return (
<Dialog open={open} onOpenChange={onOpenChange}>
<DialogContent>
<DialogHeader>
<DialogTitle>New Agent</DialogTitle>
</DialogHeader>
<div className="grid gap-4 py-2">
{createError ? (
<Alert variant="destructive">
<AlertCircle className="size-4" />
<AlertTitle>Create failed</AlertTitle>
<AlertDescription>{createError}</AlertDescription>
</Alert>
) : null}
<div className="grid gap-2">
<Label htmlFor="agent-name">Name</Label>
<Input
id="agent-name"
value={name}
onChange={(event) => onNameChange(event.target.value)}
placeholder={
createRuntime === 'openclaw' ? 'research-agent' : 'Review bot'
}
onKeyDown={(event) => {
if (event.key === 'Enter' && canCreate) onCreate()
}}
/>
</div>
<div className="grid gap-2">
<Label htmlFor="agent-runtime">Adapter</Label>
<Select
value={createRuntime}
onValueChange={(value) => {
if (
value === 'openclaw' ||
value === 'claude' ||
value === 'codex'
) {
onRuntimeChange(value)
if (value !== 'openclaw') onHarnessAdapterChange(value)
}
}}
>
<SelectTrigger id="agent-runtime">
<SelectValue />
</SelectTrigger>
<SelectContent>
<SelectItem value="openclaw">OpenClaw</SelectItem>
{adapters.map((adapter) => (
<SelectItem key={adapter.id} value={adapter.id}>
{adapter.name}
</SelectItem>
))}
</SelectContent>
</Select>
</div>
{createRuntime === 'openclaw' ? (
<>
{openClawBlocked ? (
<Alert>
<AlertCircle className="size-4" />
<AlertTitle>OpenClaw is not ready</AlertTitle>
<AlertDescription>
Start or set up the OpenClaw gateway before creating an
OpenClaw agent.
</AlertDescription>
</Alert>
) : null}
<ProviderSelector
providers={providers}
defaultProviderId={defaultProviderId}
selectedId={selectedProviderId}
onSelect={onProviderChange}
hideApiKeyHint={!!selectedCliProvider}
/>
{selectedCliProvider ? (
<OpenClawCliProviderStatusPanel
provider={selectedCliProvider}
status={cliAuthStatus}
loading={cliAuthLoading}
fetchError={cliAuthError}
onConnect={onConnectCliProvider}
/>
) : null}
</>
) : null}
{isHarnessRuntime ? (
<>
<div className="grid gap-2">
<Label htmlFor="harness-model">Model</Label>
<Select
value={harnessModelId}
onValueChange={onHarnessModelChange}
>
<SelectTrigger id="harness-model">
<SelectValue />
</SelectTrigger>
<SelectContent>
{(selectedHarnessAdapter?.models ?? []).map((model) => (
<SelectItem key={model.id} value={model.id}>
{model.label}
</SelectItem>
))}
</SelectContent>
</Select>
</div>
<div className="grid gap-2">
<Label htmlFor="harness-effort">Reasoning</Label>
<Select
value={harnessReasoningEffort}
onValueChange={onHarnessReasoningChange}
>
<SelectTrigger id="harness-effort">
<SelectValue />
</SelectTrigger>
<SelectContent>
{(selectedHarnessAdapter?.reasoningEfforts ?? []).map(
(effort) => (
<SelectItem key={effort.id} value={effort.id}>
{effort.label}
</SelectItem>
),
)}
</SelectContent>
</Select>
</div>
</>
) : null}
</div>
<DialogFooter>
<Button
variant="outline"
onClick={() => onOpenChange(false)}
disabled={creating}
>
Cancel
</Button>
<Button disabled={!canCreate} onClick={onCreate}>
{creating ? <Loader2 className="mr-2 size-4 animate-spin" /> : null}
Create
</Button>
</DialogFooter>
</DialogContent>
</Dialog>
)
}

View File

@@ -0,0 +1,387 @@
import {
AlertCircle,
Cpu,
Loader2,
Plus,
RefreshCw,
ShieldAlert,
Square,
TerminalSquare,
WifiOff,
Wrench,
} from 'lucide-react'
import type { FC } from 'react'
import { Alert, AlertDescription, AlertTitle } from '@/components/ui/alert'
import { Badge } from '@/components/ui/badge'
import { Button } from '@/components/ui/button'
import { Card, CardContent } from '@/components/ui/card'
import { Label } from '@/components/ui/label'
import {
Select,
SelectContent,
SelectItem,
SelectTrigger,
SelectValue,
} from '@/components/ui/select'
import type { ProviderOption } from './agents-page-types'
import {
CONTROL_PLANE_COPY,
FALLBACK_CONTROL_PLANE_COPY,
} from './agents-page-types'
import type { getControlPlaneCopy } from './agents-page-utils'
import type { OpenClawStatus } from './useOpenClaw'
const StatusBadge: FC<{ status: OpenClawStatus['status'] }> = ({ status }) => {
const variants: Record<
OpenClawStatus['status'],
{
variant: 'default' | 'secondary' | 'outline' | 'destructive'
label: string
}
> = {
running: { variant: 'default', label: 'Running' },
starting: { variant: 'secondary', label: 'Starting...' },
stopped: { variant: 'outline', label: 'Stopped' },
error: { variant: 'destructive', label: 'Error' },
uninitialized: { variant: 'outline', label: 'Not Set Up' },
}
const current = variants[status] ?? {
variant: 'outline' as const,
label: 'Unknown',
}
return <Badge variant={current.variant}>{current.label}</Badge>
}
const ControlPlaneBadge: FC<{
status: OpenClawStatus['controlPlaneStatus']
}> = ({ status }) => {
const current = CONTROL_PLANE_COPY[status] ?? FALLBACK_CONTROL_PLANE_COPY
return <Badge variant={current.badgeVariant}>{current.badgeLabel}</Badge>
}
interface ProviderSelectorProps {
providers: ProviderOption[]
defaultProviderId: string
selectedId: string
onSelect: (id: string) => void
hideApiKeyHint?: boolean
}
export const ProviderSelector: FC<ProviderSelectorProps> = ({
providers,
defaultProviderId,
selectedId,
onSelect,
hideApiKeyHint,
}) => {
if (providers.length === 0) {
return (
<div className="space-y-2">
<p className="font-medium text-sm">LLM Provider</p>
<p className="text-muted-foreground text-sm">
No compatible LLM providers configured.{' '}
<a href="#/settings/ai" className="underline">
Add one in AI settings
</a>{' '}
first.
</p>
</div>
)
}
return (
<div className="space-y-2">
<Label htmlFor="provider-select">LLM Provider</Label>
<Select value={selectedId} onValueChange={onSelect}>
<SelectTrigger id="provider-select">
<SelectValue placeholder="Select a provider" />
</SelectTrigger>
<SelectContent>
{providers.map((provider) => (
<SelectItem key={provider.id} value={provider.id}>
{provider.name} - {provider.modelId}
{provider.id === defaultProviderId ? ' (default)' : ''}
</SelectItem>
))}
</SelectContent>
</Select>
{!hideApiKeyHint && (
<p className="text-muted-foreground text-xs">
Uses your existing API key from BrowserOS settings. The key is passed
to the container and never leaves your machine.
</p>
)}
</div>
)
}
interface AgentsPageHeaderProps {
actionInProgress: boolean
controlPlaneBusy: boolean
reconnecting: boolean
status: OpenClawStatus | null
onCreateAgent: () => void
onOpenTerminal: () => void
onReconnect: () => void
onRefresh: () => void
onRestart: () => void
onStop: () => void
}
export const AgentsPageHeader: FC<AgentsPageHeaderProps> = ({
actionInProgress,
controlPlaneBusy,
reconnecting,
status,
onCreateAgent,
onOpenTerminal,
onReconnect,
onRefresh,
onRestart,
onStop,
}) => (
<div className="flex flex-wrap items-center justify-between gap-3">
<div>
<h1 className="font-semibold text-2xl tracking-normal">Agents</h1>
<p className="text-muted-foreground text-sm">
OpenClaw, Claude Code, and Codex agents
</p>
</div>
<div className="flex flex-wrap items-center gap-2">
{status ? (
<>
<StatusBadge status={status.status} />
{status.status !== 'uninitialized' && (
<ControlPlaneBadge status={status.controlPlaneStatus} />
)}
</>
) : null}
{status?.status === 'running' &&
status.controlPlaneStatus !== 'connected' ? (
<Button
variant="outline"
onClick={onReconnect}
disabled={actionInProgress || controlPlaneBusy}
>
{reconnecting ? (
<Loader2 className="mr-2 size-4 animate-spin" />
) : (
<RefreshCw className="mr-2 size-4" />
)}
Retry Connection
</Button>
) : null}
{status?.status === 'running' ? (
<>
<Button
variant="ghost"
size="icon"
onClick={onRestart}
disabled={actionInProgress}
title="Restart gateway"
>
<RefreshCw className="size-4" />
</Button>
<Button
variant="ghost"
size="icon"
onClick={onStop}
disabled={actionInProgress}
title="Stop gateway"
>
<Square className="size-4" />
</Button>
<Button variant="outline" onClick={onOpenTerminal}>
<TerminalSquare className="mr-2 size-4" />
Terminal
</Button>
</>
) : null}
<Button variant="ghost" size="icon" onClick={onRefresh} title="Refresh">
<RefreshCw className="size-4" />
</Button>
<Button onClick={onCreateAgent}>
<Plus className="mr-2 size-4" />
New Agent
</Button>
</div>
</div>
)
export function LifecycleAlert({ message }: { message: string }) {
return (
<Alert>
<Loader2 className="size-4 animate-spin" />
<AlertTitle>{message}</AlertTitle>
</Alert>
)
}
export function InlineErrorAlert({
message,
onDismiss,
}: {
message: string
onDismiss: () => void
}) {
return (
<Alert variant="destructive">
<AlertCircle className="size-4" />
<AlertTitle>Agent action failed</AlertTitle>
<AlertDescription>
<p>{message}</p>
<div className="mt-2">
<Button variant="outline" size="sm" onClick={onDismiss}>
Dismiss
</Button>
</div>
</AlertDescription>
</Alert>
)
}
interface ControlPlaneAlertProps {
actionInProgress: boolean
controlPlaneBusy: boolean
controlPlaneCopy: ReturnType<typeof getControlPlaneCopy>
reconnecting: boolean
recoveryDetail: string | null
status: OpenClawStatus
onReconnect: () => void
onRestart: () => void
}
export const ControlPlaneAlert: FC<ControlPlaneAlertProps> = ({
actionInProgress,
controlPlaneBusy,
controlPlaneCopy,
reconnecting,
recoveryDetail,
status,
onReconnect,
onRestart,
}) => (
<Alert
variant={status.controlPlaneStatus === 'failed' ? 'destructive' : 'default'}
>
{status.controlPlaneStatus === 'failed' ? (
<ShieldAlert className="size-4" />
) : status.controlPlaneStatus === 'recovering' ? (
<Wrench className="size-4" />
) : (
<WifiOff className="size-4" />
)}
<AlertTitle>{controlPlaneCopy.title}</AlertTitle>
<AlertDescription>
<p>{controlPlaneCopy.description}</p>
{recoveryDetail ? <p>{recoveryDetail}</p> : null}
<div className="mt-2 flex flex-wrap gap-2">
<Button
variant="outline"
size="sm"
onClick={onReconnect}
disabled={actionInProgress || controlPlaneBusy}
>
{reconnecting ? (
<Loader2 className="mr-2 size-4 animate-spin" />
) : (
<RefreshCw className="mr-2 size-4" />
)}
Retry Connection
</Button>
<Button
variant="outline"
size="sm"
onClick={onRestart}
disabled={actionInProgress}
>
Restart Gateway
</Button>
</div>
</AlertDescription>
</Alert>
)
interface GatewayStateCardsProps {
actionInProgress: boolean
status: OpenClawStatus | null
onOpenSetup: () => void
onRestart: () => void
onStart: () => void
}
export const GatewayStateCards: FC<GatewayStateCardsProps> = ({
actionInProgress,
status,
onOpenSetup,
onRestart,
onStart,
}) => (
<>
{status?.status === 'uninitialized' ? (
<Card>
<CardContent className="flex flex-col items-center gap-4 py-12">
<Cpu className="size-12 text-muted-foreground" />
<div className="text-center">
<h3 className="font-semibold text-lg">Set Up OpenClaw</h3>
<p className="text-muted-foreground text-sm">
{status.podmanAvailable
? 'Create a local BrowserOS VM to run autonomous agents with full tool access.'
: 'BrowserOS VM runtime is unavailable on this system.'}
</p>
</div>
{status.podmanAvailable ? (
<Button onClick={onOpenSetup}>Set Up Now</Button>
) : null}
</CardContent>
</Card>
) : null}
{status?.status === 'stopped' ? (
<Card>
<CardContent className="flex flex-col items-center gap-4 py-12">
<Cpu className="size-12 text-muted-foreground" />
<div className="text-center">
<h3 className="font-semibold text-lg">Gateway Stopped</h3>
<p className="text-muted-foreground text-sm">
The OpenClaw gateway is not running.
</p>
</div>
<Button onClick={onStart} disabled={actionInProgress}>
Start Gateway
</Button>
</CardContent>
</Card>
) : null}
{status?.status === 'error' ? (
<Card className="border-destructive">
<CardContent className="flex flex-col items-center gap-4 py-12">
<AlertCircle className="size-12 text-destructive" />
<div className="text-center">
<h3 className="font-semibold text-lg">Gateway Error</h3>
<p className="text-muted-foreground text-sm">
{status.error ?? status.lastGatewayError}
</p>
</div>
<div className="flex gap-2">
<Button onClick={onStart} disabled={actionInProgress}>
Start Gateway
</Button>
<Button
variant="outline"
onClick={onRestart}
disabled={actionInProgress}
>
Restart Gateway
</Button>
</div>
</CardContent>
</Card>
) : null}
</>
)

View File

@@ -0,0 +1,76 @@
import { Loader2 } from 'lucide-react'
import type { FC } from 'react'
import { Button } from '@/components/ui/button'
import {
Dialog,
DialogContent,
DialogHeader,
DialogTitle,
} from '@/components/ui/dialog'
import type { ProviderOption } from './agents-page-types'
import { ProviderSelector } from './OpenClawControls'
import type { OpenClawCliProvider } from './openclaw-cli-providers'
interface SetupOpenClawDialogProps {
defaultProviderId: string
open: boolean
providers: ProviderOption[]
selectedProviderId: string
selectedCliProvider: OpenClawCliProvider | undefined
settingUp: boolean
onOpenChange: (open: boolean) => void
onProviderChange: (providerId: string) => void
onSetup: () => void
}
export const SetupOpenClawDialog: FC<SetupOpenClawDialogProps> = ({
defaultProviderId,
open,
providers,
selectedProviderId,
selectedCliProvider,
settingUp,
onOpenChange,
onProviderChange,
onSetup,
}) => (
<Dialog open={open} onOpenChange={onOpenChange}>
<DialogContent>
<DialogHeader>
<DialogTitle>Set Up OpenClaw</DialogTitle>
</DialogHeader>
<div className="space-y-4 py-2">
<ProviderSelector
providers={providers}
defaultProviderId={defaultProviderId}
selectedId={selectedProviderId}
onSelect={onProviderChange}
hideApiKeyHint={!!selectedCliProvider}
/>
{selectedCliProvider ? (
<p className="rounded-md border border-border bg-muted/30 px-3 py-2 text-muted-foreground text-xs">
{selectedCliProvider.description}. Clicking{' '}
<span className="font-medium">Set Up &amp; Start</span> starts the
gateway and opens a terminal to sign in.
</p>
) : null}
<Button
onClick={onSetup}
disabled={settingUp || providers.length === 0}
className="w-full"
>
{settingUp ? (
<>
<Loader2 className="mr-2 size-4 animate-spin" />
Setting up...
</>
) : (
'Set Up & Start'
)}
</Button>
</div>
</DialogContent>
</Dialog>
)

View File

@@ -0,0 +1,4 @@
export function buildAgentApiUrl(baseUrl: string, path: string): string {
const normalizedPath = path === '/' ? '' : path
return `${baseUrl}/agents${normalizedPath}`
}
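A quick restatement of the helper above with a couple of illustrative calls; the base URL and agent id are hypothetical:

```typescript
// Copied from the file above: '/' is treated as the collection root so the
// result never carries a trailing slash.
function buildAgentApiUrl(baseUrl: string, path: string): string {
  const normalizedPath = path === '/' ? '' : path
  return `${baseUrl}/agents${normalizedPath}`
}

console.log(buildAgentApiUrl('http://127.0.0.1:8080', '/')) // http://127.0.0.1:8080/agents
console.log(buildAgentApiUrl('http://127.0.0.1:8080', '/abc/queue')) // http://127.0.0.1:8080/agents/abc/queue
```
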

View File

@@ -0,0 +1,88 @@
import type { AgentEntry } from './useOpenClaw'
export type HarnessAgentAdapter = 'claude' | 'codex'
export type AgentHarnessStreamEvent =
| {
type: 'text_delta'
text: string
stream: 'output' | 'thought'
rawType?: string
}
| {
type: 'tool_call'
text: string
title: string
id?: string
status?: string
rawType?: string
}
| {
type: 'status'
text: string
rawType?: string
}
| {
type: 'done'
text?: string
stopReason?: string
}
| {
type: 'error'
message: string
code?: string
}
export interface HarnessAgent {
id: string
name: string
adapter: HarnessAgentAdapter
modelId?: string
reasoningEffort?: string
permissionMode: 'approve-all'
sessionKey: string
createdAt: number
updatedAt: number
}
export interface HarnessAdapterDescriptor {
id: HarnessAgentAdapter
name: string
defaultModelId: string
defaultReasoningEffort: string
modelControl: 'runtime-supported' | 'best-effort'
models: Array<{ id: string; label: string; recommended?: boolean }>
reasoningEfforts: Array<{ id: string; label: string; recommended?: boolean }>
}
export interface CreateHarnessAgentInput {
name: string
adapter: HarnessAgentAdapter
modelId?: string
reasoningEffort?: string
}
export interface HarnessTranscriptEntry {
id: string
agentId: string
sessionId: 'main'
role: 'user' | 'assistant'
text: string
createdAt: number
}
export interface HarnessAgentHistoryPage {
agentId: string
sessionId: 'main'
items: HarnessTranscriptEntry[]
}
export function mapHarnessAgentToEntry(agent: HarnessAgent): AgentEntry {
return {
agentId: agent.id,
name: agent.name,
workspace: `${agent.adapter}:main`,
model: agent.modelId,
source: 'agent-harness',
}
}
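A self-contained sketch of the mapper above using local stand-in types (the concrete agent values are hypothetical; the real `AgentEntry` may carry additional fields):

```typescript
// Local stand-ins for HarnessAgent / AgentEntry, narrowed to the fields the
// mapper touches.
interface HarnessAgentLite {
  id: string
  name: string
  adapter: 'claude' | 'codex'
  modelId?: string
}
interface AgentEntryLite {
  agentId: string
  name: string
  workspace: string
  model?: string
  source: string
}

function mapHarnessAgentToEntry(agent: HarnessAgentLite): AgentEntryLite {
  return {
    agentId: agent.id,
    name: agent.name,
    // Harness agents always run in the fixed 'main' session.
    workspace: `${agent.adapter}:main`,
    model: agent.modelId,
    source: 'agent-harness',
  }
}

const entry = mapHarnessAgentToEntry({
  id: 'agent-1',
  name: 'Review bot',
  adapter: 'claude',
  modelId: 'claude-sonnet',
})
console.log(entry.workspace) // claude:main
```
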

View File

@@ -0,0 +1,172 @@
import type { NavigateFunction } from 'react-router'
import {
AGENT_CREATED_EVENT,
AGENT_DELETED_EVENT,
} from '@/lib/constants/analyticsEvents'
import { track } from '@/lib/metrics/track'
import type { HarnessAgent, HarnessAgentAdapter } from './agent-harness-types'
import type {
AgentListItem,
CreateAgentRuntime,
ProviderOption,
} from './agents-page-types'
import { findOpenClawCliProviderById } from './openclaw-cli-providers'
import type {
AgentEntry,
OpenClawAgentMutationInput,
OpenClawSetupInput,
} from './useOpenClaw'
export interface AgentPageActionInput {
createProviderId: string
createRuntime: CreateAgentRuntime
harnessModelId: string
harnessReasoningEffort: string
navigate: NavigateFunction
newName: string
selectableOpenClawProviders: ProviderOption[]
setupProviderId: string
createHarnessAgent: (input: {
name: string
adapter: HarnessAgentAdapter
modelId?: string
reasoningEffort?: string
}) => Promise<HarnessAgent>
createOpenClawAgent: (
input: OpenClawAgentMutationInput,
) => Promise<{ agent: AgentEntry }>
deleteHarnessAgent: (agentId: string) => Promise<unknown>
deleteOpenClawAgent: (agentId: string) => Promise<unknown>
setCliAuthModalOpen: (open: boolean) => void
setCreateError: (error: string | null) => void
setCreateOpen: (open: boolean) => void
setDeletingAgentKey: (key: string | null) => void
setNewName: (name: string) => void
setPageError: (error: string | null) => void
setSetupOpen: (open: boolean) => void
setupOpenClaw: (input: OpenClawSetupInput) => Promise<unknown>
}
export function createAgentPageActions(input: AgentPageActionInput) {
const runWithPageErrorHandling = async (fn: () => Promise<unknown>) => {
input.setPageError(null)
try {
await fn()
} catch (err) {
input.setPageError(err instanceof Error ? err.message : String(err))
}
}
const handleSetup = async () => {
const option = input.selectableOpenClawProviders.find(
(item) => item.id === input.setupProviderId,
)
const isCli = !!option && !!findOpenClawCliProviderById(option.type)
const llmOption = !isCli && option ? option : undefined
await runWithPageErrorHandling(async () => {
await input.setupOpenClaw({
providerType: option?.type,
providerName: isCli ? undefined : option?.name,
baseUrl: llmOption?.baseUrl,
apiKey: llmOption?.apiKey,
modelId: option?.modelId,
})
input.setSetupOpen(false)
if (isCli) input.setCliAuthModalOpen(true)
})
}
const handleOpenClawCreate = async () => {
if (!input.newName.trim()) return
const option = input.selectableOpenClawProviders.find(
(item) => item.id === input.createProviderId,
)
const normalizedName = input.newName
.trim()
.toLowerCase()
.replace(/\s+/g, '-')
const isCli = !!option && !!findOpenClawCliProviderById(option.type)
const llmOption = !isCli && option ? option : undefined
input.setCreateError(null)
try {
const result = await input.createOpenClawAgent({
name: normalizedName,
providerType: option?.type,
providerName: isCli ? undefined : option?.name,
baseUrl: llmOption?.baseUrl,
apiKey: llmOption?.apiKey,
modelId: option?.modelId,
})
input.setCreateOpen(false)
input.setNewName('')
track(AGENT_CREATED_EVENT, {
runtime: 'openclaw',
provider_type: option?.type,
})
input.navigate(`/agents/${result.agent.agentId}`)
} catch (err) {
input.setCreateError(err instanceof Error ? err.message : String(err))
}
}
const handleHarnessCreate = async () => {
if (!input.newName.trim()) return
input.setCreateError(null)
try {
const agent = await input.createHarnessAgent({
name: input.newName.trim(),
adapter: input.createRuntime as HarnessAgentAdapter,
modelId: input.harnessModelId || undefined,
reasoningEffort: input.harnessReasoningEffort || undefined,
})
input.setCreateOpen(false)
input.setNewName('')
track(AGENT_CREATED_EVENT, {
runtime: input.createRuntime,
model_id: input.harnessModelId || undefined,
reasoning_effort: input.harnessReasoningEffort || undefined,
})
input.navigate(`/agents/${agent.id}`)
} catch (err) {
input.setCreateError(err instanceof Error ? err.message : String(err))
}
}
const handleCreate = () => {
const createByRuntime: Record<CreateAgentRuntime, () => Promise<void>> = {
openclaw: handleOpenClawCreate,
claude: handleHarnessCreate,
codex: handleHarnessCreate,
}
void createByRuntime[input.createRuntime]()
}
const handleDelete = async (agent: AgentListItem) => {
input.setDeletingAgentKey(agent.key)
await runWithPageErrorHandling(async () => {
const deleteBySource: Record<
AgentListItem['source'],
(agentId: string) => Promise<unknown>
> = {
openclaw: (agentId) => input.deleteOpenClawAgent(agentId),
'agent-harness': (agentId) => input.deleteHarnessAgent(agentId),
}
await deleteBySource[agent.source](agent.agentId)
track(AGENT_DELETED_EVENT, {
runtime: agent.source,
agent_id: agent.agentId,
})
})
input.setDeletingAgentKey(null)
}
return {
handleCreate,
handleDelete,
handleSetup,
runWithPageErrorHandling,
}
}
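The OpenClaw create path above slugifies the user-entered name before sending it to the gateway. A standalone restatement of that normalization (the helper name is invented for illustration):

```typescript
// Trim, lowercase, and collapse whitespace runs into single hyphens — the
// same chain used in handleOpenClawCreate above.
function normalizeAgentName(name: string): string {
  return name.trim().toLowerCase().replace(/\s+/g, '-')
}

console.log(normalizeAgentName('  Research Agent ')) // research-agent
```
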

View File

@@ -0,0 +1,173 @@
import { type Dispatch, type SetStateAction, useEffect, useMemo } from 'react'
import type { LlmProviderConfig } from '@/lib/llm-providers/types'
import type {
HarnessAdapterDescriptor,
HarnessAgentAdapter,
} from './agent-harness-types'
import type { CreateAgentRuntime } from './agents-page-types'
import { toProviderOptions } from './agents-page-utils'
import {
buildOpenClawCliProviderOptions,
findOpenClawCliProviderById,
useOpenClawCliProviderAuthStatus,
} from './openclaw-cli-providers'
export function useDefaultAgentName(
createOpen: boolean,
setNewName: Dispatch<SetStateAction<string>>,
): void {
useEffect(() => {
if (!createOpen) return
setNewName((current) => current || 'agent')
}, [createOpen, setNewName])
}
export function useHarnessAgentDefaults(input: {
adapters: HarnessAdapterDescriptor[]
createOpen: boolean
harnessAdapterId: HarnessAgentAdapter
setHarnessAdapterId: Dispatch<SetStateAction<HarnessAgentAdapter>>
setHarnessModelId: Dispatch<SetStateAction<string>>
setHarnessReasoningEffort: Dispatch<SetStateAction<string>>
}): void {
const {
adapters,
createOpen,
harnessAdapterId,
setHarnessAdapterId,
setHarnessModelId,
setHarnessReasoningEffort,
} = input
useEffect(() => {
if (!createOpen) return
const adapter =
adapters.find((entry) => entry.id === harnessAdapterId) ?? adapters[0]
if (!adapter) return
setHarnessAdapterId(adapter.id)
setHarnessModelId((current) => current || adapter.defaultModelId)
setHarnessReasoningEffort(
(current) => current || adapter.defaultReasoningEffort,
)
}, [
adapters,
createOpen,
harnessAdapterId,
setHarnessAdapterId,
setHarnessModelId,
setHarnessReasoningEffort,
])
}
export function useOpenClawProviderSelection(input: {
providers: LlmProviderConfig[]
defaultProviderId: string
createOpen: boolean
createRuntime: CreateAgentRuntime
createProviderId: string
setCreateProviderId: Dispatch<SetStateAction<string>>
setupOpen: boolean
setupProviderId: string
setSetupProviderId: Dispatch<SetStateAction<string>>
cliAuthModalOpen: boolean
setCliAuthModalOpen: Dispatch<SetStateAction<boolean>>
}) {
const {
providers,
defaultProviderId,
createOpen,
createRuntime,
createProviderId,
setCreateProviderId,
setupOpen,
setupProviderId,
setSetupProviderId,
cliAuthModalOpen,
setCliAuthModalOpen,
} = input
const cliProviderOptions = useMemo(
() => buildOpenClawCliProviderOptions(),
[],
)
const selectableOpenClawProviders = useMemo(
() => toProviderOptions(providers, cliProviderOptions),
[providers, cliProviderOptions],
)
useEffect(() => {
if (selectableOpenClawProviders.length === 0) return
const fallbackId =
selectableOpenClawProviders.find(
(provider) => provider.id === defaultProviderId,
)?.id ?? selectableOpenClawProviders[0].id
if (createOpen && !createProviderId) {
setCreateProviderId(fallbackId)
}
}, [
createOpen,
createProviderId,
defaultProviderId,
selectableOpenClawProviders,
setCreateProviderId,
])
useEffect(() => {
if (selectableOpenClawProviders.length === 0) return
const fallbackId =
selectableOpenClawProviders.find(
(provider) => provider.id === defaultProviderId,
)?.id ?? selectableOpenClawProviders[0].id
if (setupOpen && !setupProviderId) {
setSetupProviderId(fallbackId)
}
}, [
defaultProviderId,
selectableOpenClawProviders,
setSetupProviderId,
setupOpen,
setupProviderId,
])
const selectedCreateOption = selectableOpenClawProviders.find(
(provider) => provider.id === createProviderId,
)
const selectedCliProvider = selectedCreateOption
? findOpenClawCliProviderById(selectedCreateOption.type)
: undefined
const selectedSetupOption = selectableOpenClawProviders.find(
(provider) => provider.id === setupProviderId,
)
const selectedSetupCliProvider = selectedSetupOption
? findOpenClawCliProviderById(selectedSetupOption.type)
: undefined
const activeCliProvider =
(setupOpen && selectedSetupCliProvider) ||
(createOpen && createRuntime === 'openclaw' && selectedCliProvider) ||
undefined
const {
data: cliAuthStatus,
isLoading: cliAuthLoading,
error: cliAuthError,
} = useOpenClawCliProviderAuthStatus(
activeCliProvider?.id ?? '',
!!activeCliProvider,
)
useEffect(() => {
if (cliAuthModalOpen && cliAuthStatus?.loggedIn) {
setCliAuthModalOpen(false)
}
}, [cliAuthModalOpen, cliAuthStatus?.loggedIn, setCliAuthModalOpen])
return {
selectableOpenClawProviders,
selectedCliProvider,
selectedSetupCliProvider,
authTerminalProvider: selectedSetupCliProvider ?? selectedCliProvider,
cliAuthStatus,
cliAuthLoading,
cliAuthError,
}
}
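Both effects above apply the same fallback rule when picking an initial provider: prefer the configured default if it is still selectable, otherwise take the first option. A minimal standalone sketch of that rule (the function name is hypothetical, not part of the source):

```typescript
interface ProviderOptionLike {
  id: string
}

// Mirrors the fallback rule shared by the create/setup effects: the default
// provider id wins if it is still in the selectable list, otherwise fall
// back to the first entry; undefined when nothing is selectable at all.
function pickFallbackProviderId(
  options: ProviderOptionLike[],
  defaultProviderId: string,
): string | undefined {
  if (options.length === 0) return undefined
  return (
    options.find((provider) => provider.id === defaultProviderId)?.id ??
    options[0].id
  )
}

// The default is still present, so it wins over list order.
const picked = pickFallbackProviderId(
  [{ id: 'openai' }, { id: 'anthropic' }],
  'anthropic',
)
// An unknown default degrades to the first selectable option.
const degraded = pickFallbackProviderId([{ id: 'openai' }], 'gone')
```

Keeping this as a pure expression inside each effect (rather than a shared helper) means both effects re-run independently off their own dependency arrays.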

View File

@@ -0,0 +1,119 @@
import type { HarnessAgentAdapter } from './agent-harness-types'
import type { GatewayLifecycleAction, OpenClawStatus } from './useOpenClaw'
export type CreateAgentRuntime = 'openclaw' | HarnessAgentAdapter
export interface ProviderOption {
id: string
type: string
name: string
modelId: string
baseUrl?: string
apiKey?: string
}
export interface AgentListItem {
key: string
agentId: string
name: string
source: 'openclaw' | 'agent-harness'
runtimeLabel: string
modelLabel: string
detail: string
canChat: boolean
canDelete: boolean
}
export interface GatewayUiState {
canManageAgents: boolean
controlPlaneDegraded: boolean
controlPlaneBusy: boolean
}
export const DEFAULT_HARNESS_ADAPTER: HarnessAgentAdapter = 'claude'
export const DEFAULT_CREATE_RUNTIME: CreateAgentRuntime = 'openclaw'
export const LIFECYCLE_BANNER_COPY: Record<GatewayLifecycleAction, string> = {
setup: 'Setting up OpenClaw...',
start: 'Starting gateway...',
stop: 'Stopping gateway...',
restart: 'Restarting gateway...',
reconnect: 'Restoring gateway connection...',
}
export const CONTROL_PLANE_COPY: Record<
OpenClawStatus['controlPlaneStatus'],
{
badgeVariant: 'default' | 'secondary' | 'outline' | 'destructive'
badgeLabel: string
title: string
description: string
}
> = {
connected: {
badgeVariant: 'default',
badgeLabel: 'Control Plane Ready',
title: 'Gateway Connected',
description: 'OpenClaw can create, manage, and chat with agents normally.',
},
connecting: {
badgeVariant: 'secondary',
badgeLabel: 'Connecting',
title: 'Connecting to Gateway',
description:
'BrowserOS is establishing the OpenClaw control channel for agent operations.',
},
reconnecting: {
badgeVariant: 'secondary',
badgeLabel: 'Reconnecting',
title: 'Reconnecting Control Plane',
description:
'The gateway process is up, but BrowserOS is restoring the control channel.',
},
recovering: {
badgeVariant: 'secondary',
badgeLabel: 'Recovering',
title: 'Recovering Gateway Connection',
description:
'BrowserOS detected a control-plane fault and is trying a safe recovery path.',
},
disconnected: {
badgeVariant: 'outline',
badgeLabel: 'Disconnected',
title: 'Gateway Disconnected',
description: 'The gateway process is not available to BrowserOS right now.',
},
failed: {
badgeVariant: 'destructive',
badgeLabel: 'Needs Attention',
title: 'Gateway Recovery Failed',
description:
'BrowserOS could not restore the OpenClaw control channel automatically.',
},
}
export const FALLBACK_CONTROL_PLANE_COPY = {
badgeVariant: 'outline' as const,
badgeLabel: 'Unknown',
title: 'Gateway State Unknown',
description:
'BrowserOS received a gateway status it does not recognize yet. Refreshing or reconnecting should restore a known state.',
}
export const RECOVERY_REASON_COPY: Record<
NonNullable<OpenClawStatus['lastRecoveryReason']>,
string
> = {
transient_disconnect:
'The control channel dropped briefly and BrowserOS is retrying it.',
signature_expired:
'The gateway rejected the signed device handshake because its clock drifted.',
pairing_required:
'The gateway asked BrowserOS to approve its local device identity again.',
token_mismatch:
'BrowserOS had to reload the gateway token before reconnecting.',
container_not_ready:
'The OpenClaw gateway process is not ready yet, so control-plane recovery cannot start.',
unknown:
'BrowserOS hit an unexpected gateway error and could not classify it cleanly.',
}

View File

@@ -0,0 +1,172 @@
import type { LlmProviderConfig } from '@/lib/llm-providers/types'
import type { HarnessAgent, HarnessAgentAdapter } from './agent-harness-types'
import {
type AgentListItem,
CONTROL_PLANE_COPY,
FALLBACK_CONTROL_PLANE_COPY,
type GatewayUiState,
LIFECYCLE_BANNER_COPY,
type ProviderOption,
RECOVERY_REASON_COPY,
} from './agents-page-types'
import { getOpenClawSupportedProviders } from './openclaw-supported-providers'
import {
type AgentEntry,
type GatewayLifecycleAction,
getModelDisplayName,
type OpenClawStatus,
} from './useOpenClaw'
export function getControlPlaneCopy(
status: OpenClawStatus['controlPlaneStatus'],
) {
return CONTROL_PLANE_COPY[status] ?? FALLBACK_CONTROL_PLANE_COPY
}
export function getRecoveryDetail(status: OpenClawStatus): string | null {
if (!status.lastRecoveryReason && !status.lastGatewayError) return null
const detail = status.lastRecoveryReason
? RECOVERY_REASON_COPY[status.lastRecoveryReason]
: null
if (status.lastGatewayError && detail) {
return `${detail} Latest gateway error: ${status.lastGatewayError}`
}
return status.lastGatewayError ?? detail
}
export function formatHarnessAdapter(adapter: HarnessAgentAdapter): string {
return adapter === 'claude' ? 'Claude Code' : 'Codex'
}
export function toProviderOptions(
providers: LlmProviderConfig[],
cliProviders: ProviderOption[],
): ProviderOption[] {
return [...getOpenClawSupportedProviders(providers), ...cliProviders]
}
export function toOpenClawListItem(
agent: AgentEntry,
canManageAgents: boolean,
): AgentListItem {
return {
key: `openclaw:${agent.agentId}`,
agentId: agent.agentId,
name: agent.name,
source: 'openclaw',
runtimeLabel: 'OpenClaw',
modelLabel: getModelDisplayName(agent.model) ?? 'default',
detail: agent.workspace,
canChat: canManageAgents,
canDelete: canManageAgents && agent.agentId !== 'main',
}
}
export function toHarnessListItem(agent: HarnessAgent): AgentListItem {
return {
key: `agent-harness:${agent.id}`,
agentId: agent.id,
name: agent.name,
source: 'agent-harness',
runtimeLabel: formatHarnessAdapter(agent.adapter),
modelLabel: agent.modelId ?? 'default',
detail: `${agent.adapter}:main`,
canChat: true,
canDelete: true,
}
}
export function getGatewayUiState(
status: OpenClawStatus | null,
): GatewayUiState {
if (!status) {
return {
canManageAgents: false,
controlPlaneDegraded: false,
controlPlaneBusy: false,
}
}
const controlPlaneBusy =
status.controlPlaneStatus === 'connecting' ||
status.controlPlaneStatus === 'reconnecting' ||
status.controlPlaneStatus === 'recovering'
return {
canManageAgents:
status.status === 'running' && status.controlPlaneStatus === 'connected',
controlPlaneBusy,
controlPlaneDegraded:
status.status === 'running' && status.controlPlaneStatus !== 'connected',
}
}
export function getLifecycleBanner(
action: GatewayLifecycleAction | null,
): string | null {
return action ? LIFECYCLE_BANNER_COPY[action] : null
}
export function canManageOpenClawAgents(
state: GatewayUiState,
lifecyclePending: boolean,
): boolean {
return state.canManageAgents && !lifecyclePending
}
export function shouldShowControlPlaneDegraded(
state: GatewayUiState,
lifecyclePending: boolean,
): boolean {
return state.controlPlaneDegraded && !lifecyclePending
}
export function getControlPlaneCopyForStatus(status: OpenClawStatus | null) {
return status
? getControlPlaneCopy(status.controlPlaneStatus)
: FALLBACK_CONTROL_PLANE_COPY
}
export function getVisibleOpenClawAgents(
enabled: boolean,
agents: AgentEntry[],
): AgentEntry[] {
return enabled ? agents : []
}
export function getAgentsLoading(input: {
statusLoading: boolean
adaptersLoading: boolean
harnessAgentsLoading: boolean
openClawAgentsEnabled: boolean
openClawAgentsLoading: boolean
}): boolean {
return (
input.statusLoading ||
input.adaptersLoading ||
input.harnessAgentsLoading ||
(input.openClawAgentsEnabled && input.openClawAgentsLoading)
)
}
export function getInlineError(input: {
lifecyclePending: boolean
pageError: string | null
statusError: Error | null
openClawAgentsError: Error | null
adaptersError: Error | null
harnessAgentsError: Error | null
}): string | null {
if (input.lifecyclePending) return null
return (
input.pageError ??
input.statusError?.message ??
input.openClawAgentsError?.message ??
input.adaptersError?.message ??
input.harnessAgentsError?.message ??
null
)
}
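The gateway UI derivation above boils down to a small truth table: agents are manageable only when the process runs and the control plane is connected, while any other control-plane state on a running process counts as degraded. A self-contained restatement (types trimmed to just the fields used; not the source module itself):

```typescript
type ControlPlaneStatus =
  | 'connected'
  | 'connecting'
  | 'reconnecting'
  | 'recovering'
  | 'disconnected'
  | 'failed'

interface StatusLike {
  status: 'running' | 'stopped'
  controlPlaneStatus: ControlPlaneStatus
}

// Standalone restatement of getGatewayUiState: manageable only when
// running AND connected; degraded means running with anything else;
// busy covers the three transitional control-plane states.
function deriveUiState(status: StatusLike | null) {
  if (!status) {
    return {
      canManageAgents: false,
      controlPlaneDegraded: false,
      controlPlaneBusy: false,
    }
  }
  const busy = ['connecting', 'reconnecting', 'recovering'].includes(
    status.controlPlaneStatus,
  )
  return {
    canManageAgents:
      status.status === 'running' && status.controlPlaneStatus === 'connected',
    controlPlaneDegraded:
      status.status === 'running' && status.controlPlaneStatus !== 'connected',
    controlPlaneBusy: busy,
  }
}

const healthy = deriveUiState({ status: 'running', controlPlaneStatus: 'connected' })
const recovering = deriveUiState({ status: 'running', controlPlaneStatus: 'recovering' })
```

Note that `degraded` and `busy` overlap deliberately: a recovering control plane is both, so the UI can show the degraded banner while still signalling in-progress work.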

View File

@@ -0,0 +1,38 @@
import { describe, expect, it } from 'bun:test'
import { buildAgentApiUrl } from './agent-api-url'
import { mapHarnessAgentToEntry } from './agent-harness-types'
describe('mapHarnessAgentToEntry', () => {
it('maps created harness agents into chat-compatible entries', () => {
expect(
mapHarnessAgentToEntry({
id: 'agent-1',
name: 'Review bot',
adapter: 'codex',
modelId: 'gpt-5.5',
reasoningEffort: 'medium',
permissionMode: 'approve-all',
sessionKey: 'agent:agent-1:main',
createdAt: 1000,
updatedAt: 1000,
}),
).toEqual({
agentId: 'agent-1',
name: 'Review bot',
workspace: 'codex:main',
model: 'gpt-5.5',
source: 'agent-harness',
})
})
})
describe('buildAgentApiUrl', () => {
it('does not add a trailing slash for the harness root route', () => {
expect(buildAgentApiUrl('http://127.0.0.1:9105', '/')).toBe(
'http://127.0.0.1:9105/agents',
)
expect(buildAgentApiUrl('http://127.0.0.1:9105', '/adapters')).toBe(
'http://127.0.0.1:9105/agents/adapters',
)
})
})

View File

@@ -0,0 +1,162 @@
import { useMutation, useQuery, useQueryClient } from '@tanstack/react-query'
import { getAgentServerUrl } from '@/lib/browseros/helpers'
import { useAgentServerUrl } from '@/lib/browseros/useBrowserOSProviders'
import { buildAgentApiUrl } from './agent-api-url'
import {
type AgentHarnessStreamEvent,
type CreateHarnessAgentInput,
type HarnessAdapterDescriptor,
type HarnessAgent,
type HarnessAgentHistoryPage,
mapHarnessAgentToEntry,
} from './agent-harness-types'
export type { AgentHarnessStreamEvent }
const AGENT_QUERY_KEYS = {
adapters: 'agent-harness-adapters',
agents: 'agent-harness-agents',
} as const
async function agentsFetch<T>(
baseUrl: string,
path: string,
init?: RequestInit,
): Promise<T> {
const res = await fetch(buildAgentApiUrl(baseUrl, path), init)
if (!res.ok) {
let message = `Request failed with status ${res.status}`
try {
const body = (await res.json()) as { error?: string }
if (body.error) message = body.error
} catch {}
throw new Error(message)
}
return res.json() as Promise<T>
}
export function useAgentAdapters(enabled = true) {
const {
baseUrl,
isLoading: urlLoading,
error: urlError,
} = useAgentServerUrl()
const query = useQuery<HarnessAdapterDescriptor[], Error>({
queryKey: [AGENT_QUERY_KEYS.adapters, baseUrl],
queryFn: async () => {
const data = await agentsFetch<{ adapters: HarnessAdapterDescriptor[] }>(
baseUrl as string,
'/adapters',
)
return data.adapters ?? []
},
enabled: Boolean(baseUrl) && !urlLoading && enabled,
})
return {
adapters: query.data ?? [],
loading: query.isLoading || urlLoading,
error: query.error ?? urlError,
refetch: query.refetch,
}
}
export function useHarnessAgents(enabled = true) {
const {
baseUrl,
isLoading: urlLoading,
error: urlError,
} = useAgentServerUrl()
const query = useQuery<HarnessAgent[], Error>({
queryKey: [AGENT_QUERY_KEYS.agents, baseUrl],
queryFn: async () => {
const data = await agentsFetch<{ agents: HarnessAgent[] }>(
baseUrl as string,
'/',
)
return data.agents ?? []
},
enabled: Boolean(baseUrl) && !urlLoading && enabled,
})
return {
agents: (query.data ?? []).map(mapHarnessAgentToEntry),
harnessAgents: query.data ?? [],
loading: query.isLoading || urlLoading,
error: query.error ?? urlError,
refetch: query.refetch,
}
}
export function useCreateHarnessAgent() {
const { baseUrl, isLoading: urlLoading } = useAgentServerUrl()
const queryClient = useQueryClient()
return useMutation({
mutationFn: async (input: CreateHarnessAgentInput) => {
if (!baseUrl || urlLoading) {
throw new Error('BrowserOS agent server URL is not ready')
}
const data = await agentsFetch<{ agent: HarnessAgent }>(baseUrl, '/', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(input),
})
return data.agent
},
onSuccess: async () => {
await queryClient.invalidateQueries({
queryKey: [AGENT_QUERY_KEYS.agents],
})
},
})
}
export function useDeleteHarnessAgent() {
const { baseUrl, isLoading: urlLoading } = useAgentServerUrl()
const queryClient = useQueryClient()
return useMutation({
mutationFn: async (agentId: string) => {
if (!baseUrl || urlLoading) {
throw new Error('BrowserOS agent server URL is not ready')
}
return agentsFetch<{ success: boolean }>(
baseUrl,
`/${encodeURIComponent(agentId)}`,
{ method: 'DELETE' },
)
},
onSuccess: async () => {
await queryClient.invalidateQueries({
queryKey: [AGENT_QUERY_KEYS.agents],
})
},
})
}
export async function chatWithHarnessAgent(
agentId: string,
message: string,
signal?: AbortSignal,
): Promise<Response> {
const baseUrl = await getAgentServerUrl()
return fetch(`${baseUrl}/agents/${encodeURIComponent(agentId)}/chat`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ message }),
signal,
})
}
export async function fetchHarnessAgentHistory(
agentId: string,
): Promise<HarnessAgentHistoryPage> {
const baseUrl = await getAgentServerUrl()
return agentsFetch<HarnessAgentHistoryPage>(
baseUrl,
`/${encodeURIComponent(agentId)}/sessions/main/history`,
)
}
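`agentsFetch` above prefers the server's structured `{ error }` body over a generic status-code message, swallowing JSON parse failures. That selection can be sketched as a pure function (name hypothetical, introduced only for illustration):

```typescript
// Pure restatement of agentsFetch's error selection: a non-empty `error`
// string in the JSON body wins; anything else (missing field, non-string,
// unparseable body) falls back to a generic status-code message.
function pickErrorMessage(statusCode: number, body: unknown): string {
  const fallback = `Request failed with status ${statusCode}`
  if (body && typeof body === 'object' && 'error' in body) {
    const err = (body as { error?: unknown }).error
    if (typeof err === 'string' && err.length > 0) return err
  }
  return fallback
}

const fromBody = pickErrorMessage(422, { error: 'adapter not found' })
const fromStatus = pickErrorMessage(500, 'not json at all')
```

Factoring the rule out like this also makes it trivial to unit-test without mocking `fetch`, which the inline `try { … } catch {}` shape does not.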

View File

@@ -7,6 +7,7 @@ export interface AgentEntry {
name: string
workspace: string
model?: unknown
source?: 'openclaw' | 'agent-harness'
}
export interface OpenClawStatus {
@@ -41,6 +42,7 @@ export interface OpenClawAgentMutationInput {
baseUrl?: string
apiKey?: string
modelId?: string
supportsImages?: boolean
}
export interface OpenClawSetupInput {
@@ -49,6 +51,10 @@ export interface OpenClawSetupInput {
baseUrl?: string
apiKey?: string
modelId?: string
// Mirrors LlmProviderConfig.supportsImages — pass-through so the gateway
// can declare the model's input modalities correctly when persisting the
// custom-provider config.
supportsImages?: boolean
}
export function getModelDisplayName(model: unknown): string | undefined {
@@ -93,7 +99,10 @@ async function fetchOpenClawStatus(baseUrl: string): Promise<OpenClawStatus> {
async function fetchOpenClawAgents(baseUrl: string): Promise<AgentEntry[]> {
const data = await clawFetch<{ agents: AgentEntry[] }>(baseUrl, '/agents')
return data.agents ?? []
return (data.agents ?? []).map((agent) => ({
...agent,
source: 'openclaw',
}))
}
async function invalidateOpenClawQueries(
@@ -317,12 +326,18 @@ export async function chatWithAgent(
sessionKey?: string,
history: OpenClawChatHistoryMessage[] = [],
signal?: AbortSignal,
attachments?: ReadonlyArray<unknown>,
): Promise<Response> {
const baseUrl = await getAgentServerUrl()
return fetch(`${baseUrl}/claw/agents/${agentId}/chat`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ message, sessionKey, history }),
body: JSON.stringify({
message,
sessionKey,
history,
...(attachments && attachments.length > 0 ? { attachments } : {}),
}),
signal,
})
}

View File

@@ -12,6 +12,8 @@ export interface AssistantThinkingPart {
export interface ToolEntry {
id: string
name: string
label: string
subject?: string
status: 'running' | 'completed' | 'error'
durationMs?: number
}
@@ -26,9 +28,24 @@ export type AssistantPart =
| AssistantThinkingPart
| AssistantToolBatchPart
/**
* Attachments rendered alongside the user's text on the optimistic turn
* — populated when the composer staged any images/files. The dataUrl is
* the same one the server received; we keep it in memory only for the
* lifetime of the live turn (history reload re-fetches via the JSONL).
*/
export interface UserAttachmentPreview {
id: string
kind: 'image' | 'file'
mediaType: string
name: string
dataUrl?: string
}
export interface AgentConversationTurn {
id: string
userText: string
userAttachments?: UserAttachmentPreview[]
parts: AssistantPart[]
done: boolean
timestamp: number
@@ -50,4 +67,7 @@ export interface AgentCardData {
status: 'idle' | 'working' | 'error'
lastMessage?: string
lastMessageTimestamp?: number
activitySummary?: string
currentTool?: string
costUsd?: number
}

View File

@@ -0,0 +1,369 @@
/**
* Composer attachment helpers — validation, image compression, and the
* client-side payload shape sent to /agents/:id/chat.
*
* Image attachments travel as `data:` URLs (base64) so the gateway, which
* runs on 127.0.0.1 over Lima virtiofs, can ingest them as standard
* OpenAI-style content blocks. Non-image text-shaped files are read into
* memory and travel as their extracted text body — the server inlines
* them as a fenced `<attachment>` block on the user message.
*/
export const MAX_ATTACHMENTS_PER_MESSAGE = 10
export const MAX_IMAGE_BYTES = 5 * 1024 * 1024 // 5 MB after compression
export const MAX_FILE_TEXT_BYTES = 1 * 1024 * 1024 // 1 MB extracted text
export const IMAGE_LONG_EDGE_CAP = 2048
export const ALLOWED_IMAGE_MEDIA_TYPES = [
'image/png',
'image/jpeg',
'image/jpg',
'image/webp',
'image/gif',
] as const
export const ALLOWED_FILE_MEDIA_TYPE_PREFIXES = [
'text/',
'application/json',
] as const
export type ServerImageAttachment = {
kind: 'image'
mediaType: string
dataUrl: string
name?: string
}
export type ServerFileAttachment = {
kind: 'file'
mediaType: string
name: string
text: string
}
export type ServerAttachmentPayload =
| ServerImageAttachment
| ServerFileAttachment
/** UI-side representation: what the composer needs to render a chip. */
export interface StagedAttachment {
id: string
kind: 'image' | 'file'
mediaType: string
name: string
// Set for images so the chip thumbnail can render directly. For files
// we don't need a preview yet, but the field exists for v2 PDF previews.
dataUrl?: string
// Pre-computed payload for the server. Built once at staging time so
// re-renders don't re-encode large blobs.
payload: ServerAttachmentPayload
}
export type AttachmentValidationError =
| { code: 'too_many'; message: string }
| { code: 'unsupported_type'; message: string; mediaType: string }
| { code: 'too_large'; message: string }
| { code: 'read_failed'; message: string }
export type StageAttachmentResult =
| { ok: true; attachment: StagedAttachment }
| { ok: false; error: AttachmentValidationError }
function isImageMediaType(mediaType: string): boolean {
return (ALLOWED_IMAGE_MEDIA_TYPES as readonly string[]).includes(mediaType)
}
function isAllowedFileMediaType(mediaType: string): boolean {
return ALLOWED_FILE_MEDIA_TYPE_PREFIXES.some((prefix) =>
mediaType.startsWith(prefix),
)
}
/** Build a unique id without depending on `crypto.randomUUID` outside DOM. */
function makeId(): string {
if (typeof crypto !== 'undefined' && crypto.randomUUID) {
return crypto.randomUUID()
}
return `att-${Date.now().toString(36)}-${Math.random().toString(36).slice(2, 10)}`
}
/**
* Read a `File` and produce the staged-attachment shape — validate type,
* compress if it's a large image, and pre-build the server payload.
*/
export async function stageAttachment(
file: File,
): Promise<StageAttachmentResult> {
const mediaType = file.type || 'application/octet-stream'
if (isImageMediaType(mediaType)) {
try {
const compressed = await compressImageIfNeeded(file)
const dataUrl = await readAsDataUrl(compressed)
// Rough byte ceiling — base64 inflates the raw bytes by ~33%, so the
// ×2 multiplier leaves ample headroom for the `data:...;base64,` prefix.
// Reject early so we never POST something the route will 400.
if (dataUrl.length > MAX_IMAGE_BYTES * 2) {
return {
ok: false,
error: {
code: 'too_large',
message: `Image "${file.name}" is too large (max ${humanBytes(
MAX_IMAGE_BYTES,
)}).`,
},
}
}
return {
ok: true,
attachment: {
id: makeId(),
kind: 'image',
mediaType,
name: file.name || 'image',
dataUrl,
payload: {
kind: 'image',
mediaType,
dataUrl,
name: file.name || undefined,
},
},
}
} catch (err) {
return {
ok: false,
error: {
code: 'read_failed',
message:
err instanceof Error
? err.message
: `Failed to read image "${file.name}".`,
},
}
}
}
if (isAllowedFileMediaType(mediaType)) {
let text: string
try {
text = await file.text()
} catch (err) {
return {
ok: false,
error: {
code: 'read_failed',
message:
err instanceof Error
? err.message
: `Failed to read file "${file.name}".`,
},
}
}
if (text.length > MAX_FILE_TEXT_BYTES) {
return {
ok: false,
error: {
code: 'too_large',
message: `File "${file.name}" is too large (max ${humanBytes(
MAX_FILE_TEXT_BYTES,
)}).`,
},
}
}
return {
ok: true,
attachment: {
id: makeId(),
kind: 'file',
mediaType,
name: file.name || 'attachment',
payload: {
kind: 'file',
mediaType,
name: file.name || 'attachment',
text,
},
},
}
}
return {
ok: false,
error: {
code: 'unsupported_type',
message: `Unsupported attachment type: ${mediaType || 'unknown'}`,
mediaType,
},
}
}
/**
* Stage multiple files at once, enforcing the per-message cap. The result
* partitions successful stages and any errors so the caller can show
* granular toasts.
*/
export async function stageAttachments(
files: File[],
alreadyStaged: number,
): Promise<{
staged: StagedAttachment[]
errors: AttachmentValidationError[]
}> {
const remainingSlots = Math.max(
0,
MAX_ATTACHMENTS_PER_MESSAGE - alreadyStaged,
)
const staged: StagedAttachment[] = []
const errors: AttachmentValidationError[] = []
if (remainingSlots === 0 && files.length > 0) {
errors.push({
code: 'too_many',
message: `At most ${MAX_ATTACHMENTS_PER_MESSAGE} attachments per message.`,
})
return { staged, errors }
}
const overflow = files.length - remainingSlots
if (overflow > 0) {
errors.push({
code: 'too_many',
message: `Only the first ${remainingSlots} of ${files.length} files were attached (max ${MAX_ATTACHMENTS_PER_MESSAGE}).`,
})
}
for (const file of files.slice(0, remainingSlots)) {
const result = await stageAttachment(file)
if (result.ok) {
staged.push(result.attachment)
} else {
errors.push(result.error)
}
}
return { staged, errors }
}
/**
 * Resize oversized images down to a sane long-edge cap, re-encoding the
 * result as JPEG. Images already within the cap and byte budget are
 * passed through untouched.
*/
export async function compressImageIfNeeded(file: File): Promise<Blob> {
// Cheap path: small files don't need any transform.
if (file.size <= 1.5 * 1024 * 1024) return file
const bitmap = await blobToImageBitmap(file)
const { width, height } = bitmap
const longEdge = Math.max(width, height)
if (longEdge <= IMAGE_LONG_EDGE_CAP && file.size <= MAX_IMAGE_BYTES) {
bitmap.close?.()
return file
}
const scale = Math.min(1, IMAGE_LONG_EDGE_CAP / longEdge)
const targetWidth = Math.max(1, Math.round(width * scale))
const targetHeight = Math.max(1, Math.round(height * scale))
const canvas =
typeof OffscreenCanvas !== 'undefined'
? new OffscreenCanvas(targetWidth, targetHeight)
: Object.assign(document.createElement('canvas'), {
width: targetWidth,
height: targetHeight,
})
const ctx = canvas.getContext('2d') as
| CanvasRenderingContext2D
| OffscreenCanvasRenderingContext2D
| null
if (!ctx) {
bitmap.close?.()
return file
}
ctx.drawImage(bitmap, 0, 0, targetWidth, targetHeight)
bitmap.close?.()
const outputType = 'image/jpeg'
if (canvas instanceof HTMLCanvasElement) {
return await new Promise<Blob>((resolve, reject) => {
canvas.toBlob(
(blob) => {
if (blob) resolve(blob)
else reject(new Error('Image compression failed.'))
},
outputType,
0.85,
)
})
}
return await (canvas as OffscreenCanvas).convertToBlob({
type: outputType,
quality: 0.85,
})
}
async function blobToImageBitmap(blob: Blob): Promise<ImageBitmap> {
if (typeof createImageBitmap === 'function') {
return createImageBitmap(blob)
}
// Fallback: load via an Image element and use the canvas decode path.
const url = URL.createObjectURL(blob)
try {
const img = await new Promise<HTMLImageElement>((resolve, reject) => {
const el = new Image()
el.onload = () => resolve(el)
el.onerror = () =>
reject(new Error('Failed to decode image for compression.'))
el.src = url
})
const canvas = document.createElement('canvas')
canvas.width = img.naturalWidth
canvas.height = img.naturalHeight
const ctx = canvas.getContext('2d')
if (!ctx) throw new Error('Canvas 2D context unavailable.')
ctx.drawImage(img, 0, 0)
const blobOut = await new Promise<Blob | null>((resolve) =>
canvas.toBlob(resolve, 'image/png'),
)
if (!blobOut) throw new Error('Canvas toBlob returned null.')
return await createImageBitmap(blobOut)
} finally {
URL.revokeObjectURL(url)
}
}
async function readAsDataUrl(blob: Blob): Promise<string> {
if ('arrayBuffer' in blob && typeof FileReader === 'undefined') {
const buffer = await blob.arrayBuffer()
const base64 = arrayBufferToBase64(buffer)
const type = blob.type || 'application/octet-stream'
return `data:${type};base64,${base64}`
}
return await new Promise<string>((resolve, reject) => {
const reader = new FileReader()
reader.onload = () => resolve(reader.result as string)
reader.onerror = () =>
reject(reader.error ?? new Error('FileReader failed to read blob.'))
reader.readAsDataURL(blob)
})
}
function arrayBufferToBase64(buffer: ArrayBuffer): string {
const bytes = new Uint8Array(buffer)
let binary = ''
const chunkSize = 0x8000
for (let i = 0; i < bytes.byteLength; i += chunkSize) {
binary += String.fromCharCode.apply(
null,
Array.from(bytes.subarray(i, Math.min(i + chunkSize, bytes.byteLength))),
)
}
return btoa(binary)
}
function humanBytes(bytes: number): string {
if (bytes >= 1024 * 1024) return `${(bytes / 1024 / 1024).toFixed(0)} MB`
if (bytes >= 1024) return `${(bytes / 1024).toFixed(0)} KB`
return `${bytes} B`
}
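The `dataUrl.length > MAX_IMAGE_BYTES * 2` ceiling in `stageAttachment` is deliberately generous: base64 encodes every 3 raw bytes as 4 ASCII characters (~33% growth), plus a short `data:` prefix. The arithmetic, as a standalone sketch (helper names are hypothetical):

```typescript
// Length of the base64 text for `byteCount` raw bytes: 4 output chars per
// 3-byte group, with the final partial group padded to a full 4 chars.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3)
}

// A data: URL adds only a small constant prefix on top of the base64 body.
function dataUrlLength(byteCount: number, mediaType: string): number {
  return `data:${mediaType};base64,`.length + base64Length(byteCount)
}

const fiveMb = 5 * 1024 * 1024
const encoded = dataUrlLength(fiveMb, 'image/jpeg')
// ~1.33× the raw size — comfortably under the ×2 ceiling used by the route.
const underCeiling = encoded <= fiveMb * 2
```

So a compressed image at exactly `MAX_IMAGE_BYTES` always passes the string-length check; the ×2 factor exists to reject pathological cases cheaply without decoding the base64 back to bytes.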

View File

@@ -75,6 +75,12 @@ export const MCP_EXTERNAL_ACCESS_DISABLED_EVENT =
/** @public */
export const MCP_SERVER_RESTARTED_EVENT = 'settings.mcp_server.restarted'
/** @public */
export const AGENT_CREATED_EVENT = 'agents.agent.created'
/** @public */
export const AGENT_DELETED_EVENT = 'agents.agent.deleted'
/** @public */
export const NEW_SCHEDULED_TASK_CREATED_EVENT =
'settings.scheduled_task.created'

View File

@@ -0,0 +1,325 @@
/**
* @license
* Copyright 2025 BrowserOS
* SPDX-License-Identifier: AGPL-3.0-or-later
*
* Maps raw tool names + arguments to human-readable activity labels for
* the chat UI activity view. The MCP ToolRegistry is the source of truth
* for tool *existence*; this file is the editorial layer that turns
* snake_case identifiers into agent-speak verbs.
*/
const VERB_OVERRIDES: Record<string, string> = {
// Navigation
navigate_page: 'Navigated to',
new_page: 'Opened tab',
new_hidden_page: 'Opened tab',
show_page: 'Showed tab',
close_page: 'Closed tab',
list_pages: 'Listed open tabs',
get_active_page: 'Got active tab',
move_page: 'Moved tab',
group_tabs: 'Grouped tabs',
// Page reading
take_snapshot: 'Captured page snapshot',
take_enhanced_snapshot: 'Captured detailed snapshot',
get_page_content: 'Read page content',
get_page_links: 'Extracted page links',
get_dom: 'Read page DOM',
search_dom: 'Searched page DOM',
take_screenshot: 'Took screenshot',
// Input
click: 'Clicked',
click_at: 'Clicked at coordinates',
hover: 'Hovered',
hover_at: 'Hovered at coordinates',
type_at: 'Typed at coordinates',
drag_at: 'Dragged',
focus: 'Focused element',
fill: 'Filled field',
clear: 'Cleared field',
check: 'Checked box',
uncheck: 'Unchecked box',
press_key: 'Pressed key',
upload_file: 'Uploaded file',
// Console / scripts
evaluate_script: 'Ran script',
get_console_logs: 'Read console logs',
// History / bookmarks
search_history: 'Searched history',
get_recent_history: 'Read recent history',
delete_history_url: 'Deleted history entry',
delete_history_range: 'Deleted history range',
get_bookmarks: 'Listed bookmarks',
create_bookmark: 'Created bookmark',
remove_bookmark: 'Removed bookmark',
update_bookmark: 'Updated bookmark',
move_bookmark: 'Moved bookmark',
search_bookmarks: 'Searched bookmarks',
// Filesystem (sandboxed)
read_file: 'Read file',
write_file: 'Wrote file',
find_files: 'Searched files',
// Memory
read_soul: 'Read soul memory',
read_core: 'Read core memory',
write_memory: 'Wrote memory',
search_memory: 'Searched memory',
update_soul: 'Updated soul memory',
update_core: 'Updated core memory',
// Web
web_search: 'Searched the web',
web_fetch: 'Fetched URL',
// Klavis / external apps (Strata)
connector_mcp_servers: 'Listed connected apps',
discover_server_categories_or_actions: 'Browsed available actions',
get_category_actions: 'Listed actions',
get_action_details: 'Looked up action',
execute_action: 'Ran external action',
search_documentation: 'Searched docs',
handle_auth_failure: 'Handled auth issue',
// Suggestions
suggest_schedule: 'Suggested schedule',
suggest_app_connection: 'Suggested app connect',
// BrowserOS info
browseros_info: 'Read BrowserOS info',
// Windows
list_windows: 'Listed windows',
focus_window: 'Focused window',
close_window: 'Closed window',
create_window: 'Created window',
}
// ──────────────────────────────────────────────────────────────────────
// Helpers
// ──────────────────────────────────────────────────────────────────────
function asString(value: unknown): string | undefined {
return typeof value === 'string' && value.length > 0 ? value : undefined
}
function stringField(
input: Record<string, unknown>,
...keys: string[]
): string | undefined {
for (const k of keys) {
const v = asString(input[k])
if (v) return v
}
return undefined
}
function truncate(text: string | undefined, max: number): string | undefined {
if (!text) return undefined
return text.length > max ? `${text.slice(0, max - 1)}…` : text
}
function quote(value: string | undefined): string | undefined {
if (!value) return undefined
return `"${truncate(value, 60)}"`
}
function basename(path: string | undefined): string | undefined {
if (!path) return undefined
const parts = path.split(/[/\\]/).filter(Boolean)
return parts[parts.length - 1] ?? path
}
function formatUrl(value: unknown): string | undefined {
const url = asString(value)
if (!url) return undefined
try {
const parsed = new URL(url)
const host = parsed.host
const path = parsed.pathname === '/' ? '' : parsed.pathname
const display = path && path.length > 0 ? `${host}${path}` : host
return truncate(display, 60)
} catch {
return truncate(url, 60)
}
}
function coords(x: unknown, y: unknown): string | undefined {
if (typeof x === 'number' && typeof y === 'number') {
return `${Math.round(x)}, ${Math.round(y)}`
}
return undefined
}
// ──────────────────────────────────────────────────────────────────────
// Subject extractors
// ──────────────────────────────────────────────────────────────────────
type SubjectExtractor = (input: Record<string, unknown>) => string | undefined
const SUBJECT_EXTRACTORS: Record<string, SubjectExtractor> = {
// URL-bearing tools
new_page: (i) => formatUrl(i.url),
new_hidden_page: (i) => formatUrl(i.url),
navigate_page: (i) => {
const action = asString(i.action)
if (action === 'back') return 'back'
if (action === 'forward') return 'forward'
if (action === 'reload') return 'reload'
return formatUrl(i.url)
},
web_fetch: (i) => formatUrl(i.url),
// Search queries
web_search: (i) => quote(stringField(i, 'query', 'q')),
search_history: (i) => quote(stringField(i, 'query', 'text')),
search_bookmarks: (i) => quote(stringField(i, 'query', 'text')),
search_memory: (i) => quote(stringField(i, 'query', 'q')),
search_dom: (i) => quote(stringField(i, 'query', 'selector')),
search_documentation: (i) => quote(stringField(i, 'query', 'q')),
find_files: (i) => quote(stringField(i, 'pattern', 'query')),
// Element interactions
click: (i) => stringField(i, 'element'),
hover: (i) => stringField(i, 'element'),
focus: (i) => stringField(i, 'element'),
clear: (i) => stringField(i, 'element'),
check: (i) => stringField(i, 'element'),
uncheck: (i) => stringField(i, 'element'),
fill: (i) => {
const target = stringField(i, 'element')
const text = stringField(i, 'text')
if (target && text) return `${target}: ${truncate(text, 40)}`
return target ?? truncate(text, 40)
},
press_key: (i) => stringField(i, 'key'),
// Coordinate-based input
click_at: (i) => coords(i.x, i.y),
hover_at: (i) => coords(i.x, i.y),
type_at: (i) => {
const at = coords(i.x, i.y)
const text = stringField(i, 'text')
if (at && text) return `${at}: ${truncate(text, 40)}`
return at ?? truncate(text, 40)
},
drag_at: (i) => {
const from = coords(i.fromX, i.fromY)
const to = coords(i.toX, i.toY)
if (from && to) return `${from} → ${to}`
return from ?? to
},
// Tab management
show_page: (i) => {
const page = i.page
return typeof page === 'number' ? `tab ${page}` : asString(page)
},
close_page: (i) => {
const page = i.page
return typeof page === 'number' ? `tab ${page}` : asString(page)
},
move_page: (i) => {
const page = i.page
return typeof page === 'number' ? `tab ${page}` : asString(page)
},
// Page reads (take_snapshot, take_enhanced_snapshot, get_page_content,
// get_page_links, get_dom, take_screenshot) intentionally omit a
// subject — the only argument is a numeric page ID that's internal
// to the agent and meaningless to the user ("tab 4" tells them nothing).
// The verb alone communicates what happened.
// External actions via Strata
execute_action: (i) => {
const server = stringField(i, 'server_name')
const action = stringField(i, 'action_name')
if (server && action) return `${server} · ${action}`
return action ?? server
},
get_category_actions: (i) => stringField(i, 'category_name', 'server_name'),
get_action_details: (i) => stringField(i, 'action_name'),
discover_server_categories_or_actions: (i) =>
stringField(i, 'server_name', 'category_name'),
connector_mcp_servers: (i) => stringField(i, 'server_name'),
// Filesystem
read_file: (i) => basename(stringField(i, 'path')),
write_file: (i) => basename(stringField(i, 'path')),
// Memory writes — show first chars of content
write_memory: (i) => truncate(stringField(i, 'content', 'text'), 40),
update_soul: (i) => truncate(stringField(i, 'content'), 40),
update_core: (i) => truncate(stringField(i, 'content'), 40),
// Bookmarks
create_bookmark: (i) => stringField(i, 'title') ?? formatUrl(i.url),
remove_bookmark: (i) => stringField(i, 'id', 'title'),
update_bookmark: (i) => stringField(i, 'id', 'title'),
move_bookmark: (i) => stringField(i, 'id', 'title'),
// History
delete_history_url: (i) => formatUrl(i.url),
}
// ──────────────────────────────────────────────────────────────────────
// Public API
// ──────────────────────────────────────────────────────────────────────
export interface ToolLabelResult {
label: string
subject?: string
}
/**
* Strip MCP namespace prefixes (e.g. "browseros__", "mcp_") to find the
* canonical tool name used in the override maps.
*/
function canonicalName(rawName: string): string {
return rawName.replace(/^browseros__/, '').replace(/^mcp_/, '')
}
/**
* Convert a snake_case tool name into Sentence-case English as a fallback
* when no curated override exists.
*/
function humanizeToolName(rawName: string): string {
const stripped = canonicalName(rawName)
const words = stripped.split(/[_-]/).filter((w) => w.length > 0)
if (words.length === 0) return rawName
const first = words[0]!
return [
first.charAt(0).toUpperCase() + first.slice(1),
...words.slice(1),
].join(' ')
}
/**
* Build a human-readable label and subject string for a tool call,
* suitable for rendering in the chat activity view.
*/
export function buildToolLabel(
rawName: string,
input?: Record<string, unknown>,
): ToolLabelResult {
const canonical = canonicalName(rawName)
const label =
VERB_OVERRIDES[canonical] ??
VERB_OVERRIDES[rawName] ??
humanizeToolName(rawName)
const extractor = Object.hasOwn(SUBJECT_EXTRACTORS, canonical)
? SUBJECT_EXTRACTORS[canonical]
: Object.hasOwn(SUBJECT_EXTRACTORS, rawName)
? SUBJECT_EXTRACTORS[rawName]
: undefined
const subject = extractor && input ? extractor(input) : undefined
return { label, subject }
}
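
The lookup precedence in `buildToolLabel` — curated override keyed by canonical name, then by raw name, then the humanized fallback — can be sketched in isolation. The override map and tool names below are illustrative stand-ins, not the real `VERB_OVERRIDES` table:

```typescript
// Stand-in override map; the real VERB_OVERRIDES lives elsewhere in this file's module.
const OVERRIDES: Record<string, string> = { web_search: 'Search the web' }

// Strip MCP namespace prefixes, mirroring canonicalName() above.
function canonical(raw: string): string {
  return raw.replace(/^browseros__/, '').replace(/^mcp_/, '')
}

// Sentence-case fallback, mirroring humanizeToolName() above.
function humanize(raw: string): string {
  const words = canonical(raw).split(/[_-]/).filter((w) => w.length > 0)
  if (words.length === 0) return raw
  const first = words[0]!
  return [first.charAt(0).toUpperCase() + first.slice(1), ...words.slice(1)].join(' ')
}

// Override (canonical, then raw) wins; otherwise humanize.
function label(raw: string): string {
  return OVERRIDES[canonical(raw)] ?? OVERRIDES[raw] ?? humanize(raw)
}

console.log(label('browseros__web_search'))    // curated: "Search the web"
console.log(label('take_enhanced_snapshot'))   // fallback: "Take enhanced snapshot"
```

The canonical-name lookup runs first so that a prefixed MCP name (`browseros__web_search`) and its bare form resolve to the same curated label.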

View File

@@ -8,6 +8,7 @@ const chromiumArgs = [
'--show-component-extension-options',
'--disable-browseros-server',
'--disable-browseros-extensions',
'--browseros-dock-icon=dev',
]
if (env.BROWSEROS_CDP_PORT) {

View File

@@ -1,875 +0,0 @@
# Eval System - Production Grade Design Doc
## Current State Analysis
### What's Working Well
1. **Zod validation** - Already exists in `config-validator.ts`, reuses `LLMConfigSchema` from `@browseros/shared`
2. **Grader registry pattern** - `createGrader()` factory works well, easy to add new graders
3. **AgentEvaluator interface** - Clean interface: `execute() → AgentResult`
4. **Discriminated unions** - Messages, agent types use proper TypeScript patterns
5. **Capture utilities** - `ScreenshotCapture`, `MessageLogger`, `TrajectorySaver` are modular
### Key Problems
**1. No Agent Registry/Factory**
Agent creation is a hardcoded if-else chain in `task-executor.ts`:
```typescript
// Current approach - not scalable
if (this.config.agent.type === 'single') {
const evaluator = new SingleAgentEvaluator(...)
} else if (this.config.agent.type === 'orchestrator-executor') {
const evaluator = new OrchestratorExecutorEvaluator(...)
}
// Adding new agent = modify this file
```
**2. Heavy Server Dependency**
Imports from `@browseros/server`:
- `GeminiAgent` - Core agent (necessary)
- `ToolExecutionHooks` - Hook interface
- `ResolvedAgentConfig` - Agent config type
- `AgentExecutionError` - Error type
- `VercelAIContentGenerator` - Provider adapter
- Gateway client functions
**3. Scattered Types**
- `src/types.ts` - Main types
- `agents/types.ts` - Agent interface
- `agents/orchestrator-executor/types.ts` - Orchestrator types
- `runner/types.ts` - Runner types
- `graders/types.ts` - Grader types
**4. Duplicated Capture Logic**
Both agent evaluators duplicate:
- Initialize ScreenshotCapture
- Initialize MessageLogger
- Set up tool hooks
- Handle timeouts
- Collect errors/warnings
**5. No Unified Utils**
Hook setup, screenshot capture, and message-logging code is copy-pasted for each agent type.
---
## Design Goals
1. **Easy to add new agents** - Register new agent type, implement interface, done
2. **Shared capture infrastructure** - All agents use same screenshot/logging utils
3. **Type-safe with Zod** - Config validation at entry point
4. **Minimal server coupling** - Only import what's necessary
5. **Clear folder structure** - Types where they belong
6. **Production patterns** - Factory, registry, composition
---
## Proposed Architecture
### Folder Structure
```
eval/src/
├── index.ts # Entry point, CLI
├── types/
│ ├── index.ts # Re-exports all types
│ ├── config.ts # EvalConfig, AgentConfig (Zod schemas + types)
│ ├── task.ts # Task, TaskMetadata
│ ├── message.ts # Message discriminated union
│ ├── result.ts # AgentResult, GraderResult
│ └── errors.ts # ErrorSource, TaskError, EvalWarning
├── agents/
│ ├── index.ts # Re-exports + auto-registration
│ ├── registry.ts # Agent registry + factory
│ ├── types.ts # AgentEvaluator interface, AgentContext
│ ├── single/
│ │ └── index.ts # SingleAgentEvaluator
│ └── orchestrator-executor/
│ ├── index.ts # OrchestratorExecutorEvaluator
│ ├── types.ts # Orchestrator-specific types only
│ ├── orchestrator.ts
│ ├── orchestrator-agent.ts
│ ├── orchestrator-tools.ts
│ ├── executor.ts
│ └── executor-store.ts
├── capture/
│ ├── index.ts # Re-exports
│ ├── types.ts # CaptureContext interface
│ ├── context.ts # CaptureContext class (bundles all capture)
│ ├── hooks.ts # createCaptureHooks() utility
│ ├── screenshot.ts # ScreenshotCapture
│ ├── message-logger.ts # MessageLogger
│ ├── trajectory-saver.ts # TrajectorySaver
│ └── window-manager.ts # WindowManager
├── graders/
│ ├── index.ts # Re-exports
│ ├── registry.ts # Grader registry (existing pattern)
│ ├── types.ts # Grader interface
│ ├── benchmark/
│ │ ├── webvoyager.ts
│ │ └── mind2web.ts
│ └── fara/
│ ├── alignment.ts
│ ├── rubric.ts
│ ├── multimodal.ts
│ └── combined.ts
├── runner/
│ ├── index.ts # runEval() main entry
│ ├── types.ts # RunEvalOptions, TaskResult, BatchSummary
│ ├── task-loader.ts
│ ├── task-executor.ts
│ └── parallel-executor.ts
└── utils/
├── env.ts # resolveEnvValue() helper
└── validation.ts # Config validation logic
```
---
## Key Components
### 1. Type System (`types/`)
**`types/config.ts`** - Zod schemas + inferred types:
```typescript
import { LLMConfigSchema, LLMProviderSchema } from '@browseros/shared/schemas/llm'
import { z } from 'zod'
// Single agent config
export const SingleAgentConfigSchema = LLMConfigSchema.extend({
type: z.literal('single'),
})
export type SingleAgentConfig = z.infer<typeof SingleAgentConfigSchema>
// Orchestrator-executor config
export const OrchestratorExecutorConfigSchema = z.object({
type: z.literal('orchestrator-executor'),
orchestrator: LLMConfigSchema.extend({
maxTurns: z.number().int().min(1).optional(),
}),
executor: LLMConfigSchema.extend({
maxStepsPerDelegation: z.number().int().min(1).optional(),
}),
})
export type OrchestratorExecutorConfig = z.infer<typeof OrchestratorExecutorConfigSchema>
// Discriminated union
export const AgentConfigSchema = z.discriminatedUnion('type', [
SingleAgentConfigSchema,
OrchestratorExecutorConfigSchema,
])
export type AgentConfig = z.infer<typeof AgentConfigSchema>
// Full eval config
export const EvalConfigSchema = z.object({
agent: AgentConfigSchema,
dataset: z.string().min(1),
output_dir: z.string().optional(),
num_workers: z.number().int().min(1).max(20).default(1),
browseros: z.object({
server_url: z.string().url(),
}),
grader_model: z.string().optional(),
grader_api_key_env: z.string().optional(),
grader_base_url: z.string().url().optional(),
timeout_ms: z.number().int().min(30000).max(3600000).optional(),
})
export type EvalConfig = z.infer<typeof EvalConfigSchema>
```
**`types/message.ts`** - Message types:
```typescript
import { z } from 'zod'
const BaseMessageSchema = z.object({
timestamp: z.string().datetime(),
})
export const UserMessageSchema = BaseMessageSchema.extend({
type: z.literal('user'),
content: z.string(),
})
export const AssistantMessageSchema = BaseMessageSchema.extend({
type: z.literal('assistant'),
content: z.string(),
})
export const ToolCallMessageSchema = BaseMessageSchema.extend({
type: z.literal('tool_call'),
tool: z.string(),
toolCallId: z.string(),
params: z.record(z.unknown()),
})
export const ToolResultMessageSchema = BaseMessageSchema.extend({
type: z.literal('tool_result'),
toolCallId: z.string(),
result: z.unknown(),
isError: z.boolean(),
screenshot: z.number().optional(),
})
export const ErrorMessageSchema = BaseMessageSchema.extend({
type: z.literal('error'),
content: z.string(),
errorCode: z.string().optional(),
})
// Orchestrator-specific messages
export const DelegationMessageSchema = BaseMessageSchema.extend({
type: z.literal('delegation'),
instruction: z.string(),
executorId: z.string(),
maxSteps: z.number().optional(),
})
export const DelegationResultMessageSchema = BaseMessageSchema.extend({
type: z.literal('delegation_result'),
executorId: z.string(),
summary: z.string(),
status: z.enum(['done', 'blocked', 'max_steps']),
stepsUsed: z.number(),
currentUrl: z.string().optional(),
})
export const MessageSchema = z.discriminatedUnion('type', [
UserMessageSchema,
AssistantMessageSchema,
ToolCallMessageSchema,
ToolResultMessageSchema,
ErrorMessageSchema,
DelegationMessageSchema,
DelegationResultMessageSchema,
])
export type Message = z.infer<typeof MessageSchema>
export type UserMessage = z.infer<typeof UserMessageSchema>
export type AssistantMessage = z.infer<typeof AssistantMessageSchema>
export type ToolCallMessage = z.infer<typeof ToolCallMessageSchema>
export type ToolResultMessage = z.infer<typeof ToolResultMessageSchema>
export type ErrorMessage = z.infer<typeof ErrorMessageSchema>
export type DelegationMessage = z.infer<typeof DelegationMessageSchema>
export type DelegationResultMessage = z.infer<typeof DelegationResultMessageSchema>
// Type guards
export const isToolCallMessage = (m: Message): m is ToolCallMessage => m.type === 'tool_call'
export const isDelegationMessage = (m: Message): m is DelegationMessage => m.type === 'delegation'
// ... etc
```
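
Consumers narrow the discriminated union with these guards. A minimal sketch, with plain interfaces standing in for the zod-inferred types and an invented transcript:

```typescript
// Trimmed stand-in for the Message union above (zod inference omitted).
type Message =
  | { type: 'user'; content: string }
  | { type: 'tool_call'; tool: string; toolCallId: string }
  | { type: 'delegation'; instruction: string; executorId: string }

const isToolCallMessage = (m: Message): m is Extract<Message, { type: 'tool_call' }> =>
  m.type === 'tool_call'

// Invented transcript for illustration.
const transcript: Message[] = [
  { type: 'user', content: 'book a flight' },
  { type: 'tool_call', tool: 'click', toolCallId: 'tc-1' },
  { type: 'tool_call', tool: 'fill', toolCallId: 'tc-2' },
]

// filter() with a type predicate narrows to the tool_call variant,
// so .tool is accessible without casts.
const tools = transcript.filter(isToolCallMessage).map((m) => m.tool)
console.log(tools) // ['click', 'fill']
```

Because `isToolCallMessage` is a type predicate rather than a plain boolean check, the narrowing survives through `Array.prototype.filter`.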
---
### 2. Agent Registry (`agents/registry.ts`)
```typescript
import type { AgentContext, AgentEvaluator } from './types'
type AgentFactory = (context: AgentContext) => AgentEvaluator
const registry = new Map<string, AgentFactory>()
/**
* Register an agent type
*/
export function registerAgent(type: string, factory: AgentFactory): void {
if (registry.has(type)) {
throw new Error(`Agent type "${type}" already registered`)
}
registry.set(type, factory)
}
/**
* Create agent evaluator from context
*/
export function createAgent(context: AgentContext): AgentEvaluator {
const factory = registry.get(context.config.agent.type)
if (!factory) {
const available = Array.from(registry.keys()).join(', ')
throw new Error(
`Unknown agent type: "${context.config.agent.type}". Available: ${available}`
)
}
return factory(context)
}
/**
* Get all registered agent types
*/
export function getRegisteredAgentTypes(): string[] {
return Array.from(registry.keys())
}
```
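
A self-contained sketch of the register/create flow, with stub `AgentContext`/`AgentEvaluator` types so it runs standalone (the real context carries config, task, window IDs, and capture infrastructure):

```typescript
// Stubbed-down versions of the real interfaces, for illustration only.
interface AgentEvaluator { execute(): Promise<string> }
interface AgentContext { type: string }

type AgentFactory = (ctx: AgentContext) => AgentEvaluator
const registry = new Map<string, AgentFactory>()

function registerAgent(type: string, factory: AgentFactory): void {
  if (registry.has(type)) throw new Error(`Agent type "${type}" already registered`)
  registry.set(type, factory)
}

function createAgent(ctx: AgentContext): AgentEvaluator {
  const factory = registry.get(ctx.type)
  if (!factory) {
    const available = [...registry.keys()].join(', ')
    throw new Error(`Unknown agent type: "${ctx.type}". Available: ${available}`)
  }
  return factory(ctx)
}

registerAgent('single', () => ({ execute: async () => 'single ran' }))
createAgent({ type: 'single' }).execute().then(console.log) // prints "single ran"
```

Duplicate registration and unknown types both fail loudly, which keeps misconfigured eval runs from silently falling back to the wrong agent.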
**`agents/index.ts`** - Auto-registration:
```typescript
import { registerAgent } from './registry'
import { SingleAgentEvaluator } from './single'
import { OrchestratorExecutorEvaluator } from './orchestrator-executor'
// Auto-register built-in agents
registerAgent('single', (ctx) => new SingleAgentEvaluator(ctx))
registerAgent('orchestrator-executor', (ctx) => new OrchestratorExecutorEvaluator(ctx))
// Re-exports
export { createAgent, registerAgent, getRegisteredAgentTypes } from './registry'
export type { AgentContext, AgentEvaluator, AgentResult } from './types'
```
---
### 3. Agent Context (`agents/types.ts`)
```typescript
import type { CaptureContext } from '../capture/types'
import type { EvalConfig, Task, TaskMetadata, Message } from '../types'
/**
* All dependencies an agent needs - passed to factory
*/
export interface AgentContext {
// Config
config: EvalConfig
task: Task
// Browser window
windowId: number
tabId: number
// Output
outputDir: string // Root output dir
taskOutputDir: string // Task-specific: outputDir/query_id/
// Capture infrastructure (pre-initialized)
capture: CaptureContext
}
/**
* Result returned by agent execution
*/
export interface AgentResult {
metadata: TaskMetadata
messages: Message[]
finalAnswer: string | null
}
/**
* Interface all agent evaluators must implement
*/
export interface AgentEvaluator {
/**
* Execute the agent on the task
*/
execute(): Promise<AgentResult>
}
```
---
### 4. Capture Context (`capture/context.ts`)
Bundle all capture utilities:
```typescript
import { randomUUID } from 'node:crypto'
import type { ToolExecutionHooks, ToolExecutionResult } from '@browseros/server/agent'
import type { Message, TaskError, EvalWarning, ErrorSource } from '../types'
import { MessageLogger } from './message-logger'
import { ScreenshotCapture } from './screenshot'
import { TrajectorySaver } from './trajectory-saver'
export interface CaptureContextConfig {
serverUrl: string
outputDir: string
taskId: string
tabId: number
windowId: number
}
/**
* Unified capture context - bundles screenshot, message logging, errors/warnings
*/
export class CaptureContext {
screenshot!: ScreenshotCapture // assigned in init()
messageLogger!: MessageLogger // assigned in init()
readonly trajectorySaver: TrajectorySaver
private errors: TaskError[] = []
private warnings: EvalWarning[] = []
private currentToolCallId: string | null = null
private readonly tabId: number
private readonly windowId: number
constructor(private config: CaptureContextConfig) {
this.tabId = config.tabId
this.windowId = config.windowId
this.trajectorySaver = new TrajectorySaver(config.outputDir, config.taskId)
}
/**
* Initialize - must be called before use
*/
async init(): Promise<string> {
const taskOutputDir = await this.trajectorySaver.init()
this.screenshot = new ScreenshotCapture(this.config.serverUrl, taskOutputDir)
await this.screenshot.init()
this.messageLogger = new MessageLogger(taskOutputDir)
return taskOutputDir
}
/**
* Create tool execution hooks for GeminiAgent
*/
createToolHooks(): ToolExecutionHooks {
return {
onBeforeToolCall: async (toolName: string, args: unknown) => {
try {
this.currentToolCallId = randomUUID()
await this.messageLogger.logToolCall(
toolName,
this.currentToolCallId,
args as Record<string, unknown>
)
} catch (err) {
this.addWarning('message_logging', `Failed to log tool call ${toolName}: ${err}`)
}
},
onAfterToolCall: async (toolName: string, result: ToolExecutionResult) => {
let screenshotNum = 0
// Capture screenshot
try {
screenshotNum = await this.screenshot.capture(this.tabId, this.windowId)
} catch (err) {
this.addWarning('screenshot', `Screenshot after ${toolName} failed: ${err}`)
screenshotNum = this.screenshot.getCount()
}
// Log tool errors
if (result.isError) {
this.addWarning('mcp_tool', `Tool ${toolName} error: ${result.errorMessage}`)
}
// Log result
if (this.currentToolCallId) {
try {
await this.messageLogger.logToolResult(
this.currentToolCallId,
result.isError ? { error: result.errorMessage } : result.parts,
result.isError,
screenshotNum
)
} catch (err) {
this.addWarning('message_logging', `Failed to log tool result: ${err}`)
}
}
this.currentToolCallId = null
},
}
}
// Error/warning collection
addError(source: ErrorSource, message: string, details?: Record<string, unknown>): void {
this.errors.push({ source, message, timestamp: new Date().toISOString(), details })
}
addWarning(source: ErrorSource, message: string): void {
this.warnings.push({ source, message, timestamp: new Date().toISOString() })
console.warn(`[${source}] ${message}`)
}
getErrors(): TaskError[] { return [...this.errors] }
getWarnings(): EvalWarning[] { return [...this.warnings] }
getMessages(): Message[] { return this.messageLogger.getMessages() }
getScreenshotCount(): number { return this.screenshot.getCount() }
getLastAssistantMessage(): string | null { return this.messageLogger.getLastAssistantMessage() }
// Delegation logging (for orchestrator-executor)
async logDelegation(instruction: string, executorId: string, maxSteps?: number): Promise<void> {
await this.messageLogger.logDelegation(instruction, executorId, maxSteps)
}
async logDelegationResult(
executorId: string,
summary: string,
status: 'done' | 'blocked' | 'max_steps',
stepsUsed: number,
currentUrl?: string
): Promise<void> {
await this.messageLogger.logDelegationResult(executorId, summary, status, stepsUsed, currentUrl)
}
}
```
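
The hooks returned by `createToolHooks()` are invoked by the agent around each tool call. A hypothetical driver showing that bracketing (in the real code `GeminiAgent` calls the hooks; the tool name and log here are made up for illustration):

```typescript
// Trimmed-down hook shape, mirroring ToolExecutionHooks.
interface Hooks {
  onBeforeToolCall: (toolName: string, args: unknown) => Promise<void>
  onAfterToolCall: (toolName: string, result: { isError: boolean }) => Promise<void>
}

const log: string[] = []
const hooks: Hooks = {
  onBeforeToolCall: async (name) => { log.push(`before:${name}`) },
  onAfterToolCall: async (name, result) => { log.push(`after:${name}:${result.isError}`) },
}

// Stand-in for the agent's tool loop: before-hook, tool execution, after-hook.
async function runTool(name: string, h: Hooks): Promise<void> {
  await h.onBeforeToolCall(name, {})
  const result = { isError: false } // the real agent executes the MCP tool here
  await h.onAfterToolCall(name, result)
}

runTool('click', hooks).then(() => console.log(log))
// → [ 'before:click', 'after:click:false' ]
```

Because the hooks are awaited on both sides of the call, `CaptureContext` can correlate the tool-call ID it mints in `onBeforeToolCall` with the result and screenshot it records in `onAfterToolCall`.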
---
### 5. Single Agent Evaluator (`agents/single/index.ts`)
Clean implementation using context:
```typescript
import { randomUUID } from 'node:crypto'
import { GeminiAgent } from '@browseros/server/agent'
import { AgentExecutionError } from '@browseros/server/agent/errors'
import type { ResolvedAgentConfig } from '@browseros/server/agent/types'
import { MCPServerConfig } from '@google/gemini-cli-core'
import type { AgentContext, AgentEvaluator, AgentResult } from '../types'
import type { SingleAgentConfig, TaskMetadata } from '../../types'
import { resolveEnvValue } from '../../utils/env'
const DEFAULT_TIMEOUT_MS = 15 * 60 * 1000
export class SingleAgentEvaluator implements AgentEvaluator {
constructor(private ctx: AgentContext) {}
async execute(): Promise<AgentResult> {
const startTime = Date.now()
const { config, task, capture } = this.ctx
const agentConfig = config.agent as SingleAgentConfig
const timeoutMs = config.timeout_ms ?? DEFAULT_TIMEOUT_MS
// Log initial user message
await capture.messageLogger.logUser(task.query)
// Set up timeout
const abortController = new AbortController()
const timeoutHandle = setTimeout(() => abortController.abort(), timeoutMs)
// Create agent
const resolvedConfig: ResolvedAgentConfig = {
conversationId: randomUUID(),
provider: agentConfig.provider,
model: agentConfig.model ?? 'gemini-2.0-flash',
apiKey: resolveEnvValue(agentConfig.apiKey),
baseUrl: agentConfig.baseUrl,
sessionExecutionDir: '/tmp/browseros-eval',
evalMode: true,
}
const mcpServers = {
'browseros-mcp': new MCPServerConfig(
undefined, undefined, undefined, undefined, undefined,
`${config.browseros.server_url}/mcp`,
{ Accept: 'application/json, text/event-stream', 'X-BrowserOS-Source': 'eval' },
undefined, undefined, true
),
}
const agent = await GeminiAgent.create(resolvedConfig, mcpServers)
// Set capture hooks
agent.setToolHooks(capture.createToolHooks())
// Create mock stream to capture assistant messages
let lastAssistantMessage = ''
const mockStream = {
write: async (data: string) => {
if (data.includes('"type":"text-delta"')) {
const match = data.match(/"delta":"((?:[^"\\]|\\.)*)"/)
if (match) lastAssistantMessage += JSON.parse(`"${match[1]}"`)
} else if (data.includes('"type":"finish"')) {
if (lastAssistantMessage) {
await capture.messageLogger.logAssistant(lastAssistantMessage)
lastAssistantMessage = ''
}
}
},
}
// Execute
let terminationReason: TaskMetadata['termination_reason'] = 'completed'
try {
await agent.execute(
task.query,
mockStream as Parameters<typeof agent.execute>[1],
abortController.signal,
{ windowId: this.ctx.windowId, activeTab: { id: this.ctx.tabId, url: task.start_url } }
)
} catch (err) {
const error = err instanceof Error ? err : new Error(String(err))
if (abortController.signal.aborted) {
terminationReason = 'timeout'
capture.addError('agent_execution', `Task timed out after ${timeoutMs / 1000}s`)
} else {
terminationReason = 'error'
const msg = err instanceof AgentExecutionError && err.originalError
? `${error.message}: ${err.originalError.message}`
: error.message
capture.addError('agent_execution', msg, { stack: error.stack })
}
await capture.messageLogger.logError(error.message)
} finally {
clearTimeout(timeoutHandle)
}
// Build metadata
const metadata: TaskMetadata = {
query_id: task.query_id,
dataset: task.dataset,
query: task.query,
started_at: new Date(startTime).toISOString(),
completed_at: new Date().toISOString(),
total_duration_ms: Date.now() - startTime,
total_steps: capture.getScreenshotCount(),
termination_reason: terminationReason,
final_answer: capture.getLastAssistantMessage(),
errors: capture.getErrors(),
warnings: capture.getWarnings(),
agent_config: { type: 'single', model: resolvedConfig.model },
grader_results: {},
}
await capture.trajectorySaver.saveMetadata(metadata)
return {
metadata,
messages: capture.getMessages(),
finalAnswer: metadata.final_answer,
}
}
}
```
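
The mock-stream parsing above can be exercised standalone. The chunk payloads below are invented examples shaped like the `text-delta`/`finish` events the code expects:

```typescript
// Accumulate "text-delta" chunks and flush the buffer on "finish",
// mirroring the mockStream.write logic in SingleAgentEvaluator.
let buffer = ''
const assistant: string[] = []

function write(data: string): void {
  if (data.includes('"type":"text-delta"')) {
    const match = data.match(/"delta":"((?:[^"\\]|\\.)*)"/)
    // JSON.parse on the quoted capture unescapes \n, \", etc.
    if (match) buffer += JSON.parse(`"${match[1]}"`)
  } else if (data.includes('"type":"finish"')) {
    if (buffer) {
      assistant.push(buffer)
      buffer = ''
    }
  }
}

write('{"type":"text-delta","delta":"Hello "}')
write('{"type":"text-delta","delta":"world"}')
write('{"type":"finish"}')
console.log(assistant) // ['Hello world']
```

Re-quoting the capture and running it through `JSON.parse` is what handles escaped characters inside the delta without a hand-rolled unescaper.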
---
### 6. Task Executor (`runner/task-executor.ts`)
Uses agent registry:
```typescript
import { createAgent } from '../agents'
import type { AgentContext } from '../agents/types'
import { CaptureContext } from '../capture/context'
import type { EvalConfig, Task } from '../types'
import type { WindowManager } from '../capture/window-manager'
import type { GraderOptions } from '../graders/types'
import type { TaskResult } from './types'
export class TaskExecutor {
constructor(
private config: EvalConfig,
private outputDir: string,
private windowManager: WindowManager,
private graderOptions: GraderOptions | null,
) {}
async execute(task: Task): Promise<TaskResult> {
const startTime = Date.now()
let window: { windowId: number; tabId: number } | null = null
try {
// Create window
window = await this.windowManager.createWindow(task.query_id, task.start_url)
// Initialize capture context
const capture = new CaptureContext({
serverUrl: this.config.browseros.server_url,
outputDir: this.outputDir,
taskId: task.query_id,
tabId: window.tabId,
windowId: window.windowId,
})
const taskOutputDir = await capture.init()
// Build agent context
const context: AgentContext = {
config: this.config,
task,
windowId: window.windowId,
tabId: window.tabId,
outputDir: this.outputDir,
taskOutputDir,
capture,
}
// Create and execute agent (via registry)
const agent = createAgent(context)
const agentResult = await agent.execute()
// Run graders
const graderResults = await this.runGraders(task, agentResult)
return {
status: agentResult.metadata.termination_reason === 'timeout' ? 'timeout' : 'completed',
task,
agentResult,
graderResults,
durationMs: Date.now() - startTime,
}
} catch (error) {
return {
status: 'failed',
task,
error: error instanceof Error ? error : new Error(String(error)),
errorSource: 'unknown',
durationMs: Date.now() - startTime,
}
} finally {
if (window) {
await this.windowManager.closeWindow(task.query_id)
}
}
}
}
```
---
## Server Dependencies
### What We MUST Import from Server
These are necessary - `GeminiAgent` IS the agent:
```typescript
// Core agent
import { GeminiAgent, type ToolExecutionHooks, type ToolExecutionResult } from '@browseros/server/agent'
import { AgentExecutionError } from '@browseros/server/agent/errors'
import type { ResolvedAgentConfig } from '@browseros/server/agent/types'
// Provider adapter (for orchestrator-agent)
import { VercelAIContentGenerator } from '@browseros/server/agent/provider-adapter'
// Gateway client (for browseros provider only)
import { fetchBrowserOSConfig, getLLMConfigFromProvider } from '@browseros/server/lib/clients/gateway'
```
### What Could Move to Shared (Future)
If we want to decouple more:
```typescript
// These types could be in @browseros/shared
export interface ToolExecutionHooks { ... }
export interface ToolExecutionResult { ... }
export interface ResolvedAgentConfig { ... }
```
But for now, importing from server is fine - eval is tightly coupled to server anyway.
---
## Import Guidelines
```typescript
// Shared package - schemas, constants
import { LLMConfigSchema, LLMProviderSchema, LLM_PROVIDERS } from '@browseros/shared/schemas/llm'
import { TIMEOUTS } from '@browseros/shared/constants/timeouts'
import { AGENT_LIMITS } from '@browseros/shared/constants/limits'
import type { BrowserContext } from '@browseros/shared/schemas/browser-context'
// Server - only agent-related imports
import { GeminiAgent, type ToolExecutionHooks } from '@browseros/server/agent'
import type { ResolvedAgentConfig } from '@browseros/server/agent/types'
// Internal eval types - from types/ folder
import type { EvalConfig, Task, Message, AgentResult } from '../types'
import type { AgentContext, AgentEvaluator } from '../agents/types'
```
---
## Adding a New Agent Type
1. Create folder: `agents/my-new-agent/`
2. Implement `AgentEvaluator` interface:
```typescript
// agents/my-new-agent/index.ts
import type { AgentContext, AgentEvaluator, AgentResult } from '../types'
export class MyNewAgentEvaluator implements AgentEvaluator {
constructor(private ctx: AgentContext) {}
async execute(): Promise<AgentResult> {
const { config, task, capture } = this.ctx
// Use capture.createToolHooks() for screenshot/logging
// Use capture.messageLogger for messages
// Use capture.addError/addWarning for errors
// Return AgentResult
}
}
```
3. Register in `agents/index.ts`:
```typescript
import { MyNewAgentEvaluator } from './my-new-agent'
registerAgent('my-new-agent', (ctx) => new MyNewAgentEvaluator(ctx))
```
4. Add config schema in `types/config.ts`:
```typescript
export const MyNewAgentConfigSchema = z.object({
type: z.literal('my-new-agent'),
// ... specific fields
})
export const AgentConfigSchema = z.discriminatedUnion('type', [
SingleAgentConfigSchema,
OrchestratorExecutorConfigSchema,
MyNewAgentConfigSchema, // Add here
])
```
Done - no changes to runner code needed.
---
## Implementation Order
1. **Phase 1: Types** (~1 hour)
- Create `types/` folder with proper structure
- Move/consolidate all types
- Add Zod schemas for messages
2. **Phase 2: Capture Context** (~1 hour)
- Create `CaptureContext` class
- Add delegation message methods
- Create `createToolHooks()` utility
3. **Phase 3: Agent Registry** (~30 min)
- Create `registry.ts`
- Create `AgentContext` interface
- Update exports
4. **Phase 4: Refactor Single Agent** (~1 hour)
- Use `AgentContext`
- Use `CaptureContext`
- Clean up code
5. **Phase 5: Refactor Orchestrator-Executor** (~2 hours)
- Use `AgentContext`
- Integrate `CaptureContext`
- Wire up hooks properly
6. **Phase 6: Update Runner** (~30 min)
- Use `createAgent()` instead of if-else
- Initialize `CaptureContext` in executor
7. **Phase 7: Testing** (~1 hour)
- Run single-agent eval
- Run orchestrator-executor eval
- Verify screenshots/messages captured
---
## Summary
| Before | After |
|--------|-------|
| If-else agent creation | Registry + factory pattern |
| Duplicated capture code | Shared `CaptureContext` |
| Scattered types | Organized `types/` folder |
| Copy-paste hooks | `createToolHooks()` utility |
| Tight coupling | Clear interfaces |
| Hard to add agents | Register + implement |

View File

@@ -1,431 +0,0 @@
# Implementation Phases - Parallel Execution Plan
## Dependency Graph
```
Phase 1: Types (4 parallel subagents)
   │
   ├──────────────────┬──────────────────────────┐
   ▼                  ▼                          │
Phase 2: Capture   Phase 3: Agent Registry       │
(2 parallel)       (1 subagent)                  │
   │                  │                          │
   └────────┬─────────┘                          │
            ▼                                    │
Phase 4: Agent Refactors                         │
(2 parallel - after 2+3)                         │
            │                                    │
            ▼                                    │
Phase 5: Runner Update                           │
(1 subagent - after 4)                           │
            │                                    │
            ▼                                    │
Phase 6: Cleanup & Test ◄────────────────────────┘
(1 subagent)
```
---
## Phase 1: Types (4 Parallel Subagents)
No dependencies - can all run simultaneously.
### Subagent 1A: Config Types
```
Create /apps/eval/src/types/config.ts
Requirements:
1. Import LLMConfigSchema, LLMProviderSchema from @browseros/shared/schemas/llm
2. Import z from zod
Create Zod schemas:
- SingleAgentConfigSchema = LLMConfigSchema.extend({ type: z.literal('single') })
- OrchestratorExecutorConfigSchema with orchestrator + executor nested configs
- AgentConfigSchema = z.discriminatedUnion('type', [...])
- EvalConfigSchema with all fields (agent, dataset, output_dir, num_workers, browseros, grader_*, timeout_ms)
Export both schemas and inferred types (z.infer<>)
Reference: Current implementation in /apps/eval/src/utils/config-validator.ts (lines 1-42)
```
### Subagent 1B: Message Types
```
Create /apps/eval/src/types/message.ts
Requirements:
1. Use Zod for all schemas
2. Create BaseMessageSchema with timestamp field
Create schemas for:
- UserMessageSchema (type: 'user', content)
- AssistantMessageSchema (type: 'assistant', content)
- ToolCallMessageSchema (type: 'tool_call', tool, toolCallId, params)
- ToolResultMessageSchema (type: 'tool_result', toolCallId, result, isError, screenshot?)
- ErrorMessageSchema (type: 'error', content, errorCode?)
- DelegationMessageSchema (type: 'delegation', instruction, executorId, maxSteps?)
- DelegationResultMessageSchema (type: 'delegation_result', executorId, summary, status, stepsUsed, currentUrl?)
Create MessageSchema = z.discriminatedUnion('type', [...all schemas])
Export schemas, types, and type guards (isToolCallMessage, isDelegationMessage, etc.)
Reference: Current types in /apps/eval/src/types.ts (lines 62-127)
```
### Subagent 1C: Task & Result Types
```
Create /apps/eval/src/types/task.ts
Requirements:
1. Use Zod schemas with inferred types
Create:
- TaskMetadataSchema (original_task_id, website?, category?, additional?)
- TaskSchema (query_id, dataset, query, graders[], start_url?, setup_script?, metadata)
Export schemas and types.
---
Create /apps/eval/src/types/result.ts
Create:
- GraderResultSchema (score, pass, reasoning, details?)
- TaskMetadataSchema (query_id, dataset, query, started_at, completed_at, total_duration_ms, total_steps, termination_reason, final_answer, errors, warnings, agent_config, grader_results)
- AgentResultSchema (metadata, messages, finalAnswer)
Export schemas and types.
Reference: Current types in /apps/eval/src/types.ts (lines 6-20, 156-182)
```
### Subagent 1D: Error Types + Index
```
Create /apps/eval/src/types/errors.ts
Create:
- ErrorSourceSchema = z.enum(['window_creation', 'agent_execution', 'mcp_tool', 'screenshot', 'grader', 'message_logging', 'cleanup', 'unknown'])
- TaskErrorSchema (source, message, timestamp, details?)
- EvalWarningSchema (source, message, timestamp)
Export schemas and types.
---
Create /apps/eval/src/types/index.ts
Re-export everything from:
- ./config
- ./message
- ./task
- ./result
- ./errors
This becomes the single import point: import { EvalConfig, Message, Task } from '../types'
Reference: Current types in /apps/eval/src/types.ts (lines 129-154)
```
---
## Phase 2: Capture Infrastructure (2 Parallel Subagents)
**Depends on:** Phase 1 (types)
### Subagent 2A: CaptureContext Class
```
Create /apps/eval/src/capture/types.ts
Define interface:
- CaptureContextConfig { serverUrl, outputDir, taskId, tabId, windowId }
---
Create /apps/eval/src/capture/context.ts
Requirements:
1. Import ToolExecutionHooks, ToolExecutionResult from @browseros/server/agent
2. Import types from ../types
3. Import existing ScreenshotCapture, MessageLogger, TrajectorySaver
Implement CaptureContext class:
- Constructor takes CaptureContextConfig
- async init() - initializes screenshot, messageLogger, trajectorySaver, returns taskOutputDir
- createToolHooks(): ToolExecutionHooks - returns hooks for GeminiAgent
- addError(source, message, details?)
- addWarning(source, message)
- getErrors(), getWarnings(), getMessages(), getScreenshotCount(), getLastAssistantMessage()
- logDelegation(instruction, executorId, maxSteps?)
- logDelegationResult(executorId, summary, status, stepsUsed, currentUrl?)
Reference implementation details in DESIGN_DOC.md section "4. Capture Context"
Update /apps/eval/src/capture/index.ts to export CaptureContext
```
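The core of `createToolHooks()` can be sketched with stubs: the hook names match `ToolExecutionHooks` as used elsewhere in this plan, but the stub message array and screenshot counter below stand in for the real MessageLogger and ScreenshotCapture, so treat this as an assumption-laden illustration of the wiring, not the real class.

```typescript
// Sketch: CaptureContext hands the agent a pair of hooks that log a
// tool_call before execution and a tool_result (plus screenshot) after.
import { randomUUID } from 'node:crypto'

interface ToolExecutionHooks {
  onBeforeToolCall: (toolName: string, args: unknown) => Promise<void>
  onAfterToolCall: (toolName: string, result: { isError: boolean }) => Promise<void>
}

class CaptureContextSketch {
  private messages: object[] = []
  private screenshots = 0
  private currentToolCallId: string | null = null

  createToolHooks(): ToolExecutionHooks {
    return {
      onBeforeToolCall: async (toolName, args) => {
        this.currentToolCallId = randomUUID()
        this.messages.push({ type: 'tool_call', tool: toolName, toolCallId: this.currentToolCallId, params: args })
      },
      onAfterToolCall: async (_toolName, result) => {
        this.screenshots++ // real code calls ScreenshotCapture.capture() here
        this.messages.push({ type: 'tool_result', toolCallId: this.currentToolCallId, isError: result.isError, screenshot: this.screenshots })
        this.currentToolCallId = null
      },
    }
  }

  getMessages() { return this.messages }
  getScreenshotCount() { return this.screenshots }
}

const ctx = new CaptureContextSketch()
const hooks = ctx.createToolHooks()
// The hook bodies contain no awaits, so these calls complete synchronously
void hooks.onBeforeToolCall('click', { selector: '#buy' })
void hooks.onAfterToolCall('click', { isError: false })
console.log(ctx.getMessages().length, ctx.getScreenshotCount()) // 2 1
```

Because every evaluator receives the same hooks, tool_call/tool_result pairing and screenshot numbering stay consistent across agent patterns.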
### Subagent 2B: MessageLogger Extensions
```
Update /apps/eval/src/capture/message-logger.ts
Add two new methods:
1. logDelegation(instruction: string, executorId: string, maxSteps?: number): Promise<void>
- Creates DelegationMessage with type: 'delegation'
- Appends to messages
2. logDelegationResult(executorId: string, summary: string, status: 'done' | 'blocked' | 'max_steps', stepsUsed: number, currentUrl?: string): Promise<void>
- Creates DelegationResultMessage with type: 'delegation_result'
- Appends to messages
Import DelegationMessage, DelegationResultMessage from ../types
Reference: Current MessageLogger in /apps/eval/src/capture/message-logger.ts
```
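The append behavior Subagent 2B relies on can be sketched as an in-memory array plus one JSON line per message (the messages.jsonl convention used throughout this plan). The temp-file path and class name below are illustrative, not the real MessageLogger internals.

```typescript
// Sketch of logDelegation: record the message in memory and append it as JSONL.
import { appendFileSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

interface DelegationMessage {
  type: 'delegation'
  timestamp: string
  instruction: string
  executorId: string
  maxSteps?: number
}

class MessageLoggerSketch {
  private messages: DelegationMessage[] = []

  constructor(private filePath: string) {}

  async logDelegation(instruction: string, executorId: string, maxSteps?: number): Promise<void> {
    const message: DelegationMessage = {
      type: 'delegation',
      timestamp: new Date().toISOString(),
      instruction,
      executorId,
      ...(maxSteps !== undefined && { maxSteps }), // omit the key entirely when undefined
    }
    this.messages.push(message)
    appendFileSync(this.filePath, JSON.stringify(message) + '\n') // JSONL: one object per line
  }

  getMessages(): DelegationMessage[] {
    return this.messages
  }
}

const logger = new MessageLoggerSketch(join(tmpdir(), 'messages.jsonl'))
// logDelegation has no awaits before the push, so this completes synchronously
void logger.logDelegation('find the checkout button', 'exec-1', 20)
console.log(logger.getMessages()[0].type) // delegation
```

The conditional spread keeps optional fields out of the serialized line, so downstream consumers never see `"maxSteps": undefined`.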
---
## Phase 3: Agent Registry (1 Subagent)
**Depends on:** Phase 1 (types)
**Can run parallel with:** Phase 2
### Subagent 3A: Agent Registry + Types
```
Create /apps/eval/src/agents/types.ts
Define:
- AgentContext interface:
{
config: EvalConfig
task: Task
windowId: number
tabId: number
outputDir: string
taskOutputDir: string
capture: CaptureContext
}
- AgentResult interface (re-export from ../types or define here)
- AgentEvaluator interface { execute(): Promise<AgentResult> }
---
Create /apps/eval/src/agents/registry.ts
Implement:
- type AgentFactory = (context: AgentContext) => AgentEvaluator
- const registry = new Map<string, AgentFactory>()
- registerAgent(type: string, factory: AgentFactory): void
- createAgent(context: AgentContext): AgentEvaluator
- getRegisteredAgentTypes(): string[]
Reference: DESIGN_DOC.md section "2. Agent Registry"
---
Update /apps/eval/src/agents/index.ts
- Import registerAgent from ./registry
- Import SingleAgentEvaluator (will be updated later)
- Import OrchestratorExecutorEvaluator (will be updated later)
- Call registerAgent for both
- Re-export createAgent, registerAgent, getRegisteredAgentTypes
- Re-export types
Note: Registration calls will fail initially until agents are refactored.
That's OK - add TODO comments for now.
```
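The registry described above is a small factory lookup. In this sketch, AgentContext is pared down to the one field `createAgent` needs; the real interface also carries task, windowId, tabId, outputDir, and capture.

```typescript
// Sketch of the Map-backed agent registry from Subagent 3A.
interface AgentContext {
  config: { agent: { type: string } }
}

interface AgentEvaluator {
  execute(): Promise<string>
}

type AgentFactory = (context: AgentContext) => AgentEvaluator

const registry = new Map<string, AgentFactory>()

function registerAgent(type: string, factory: AgentFactory): void {
  registry.set(type, factory)
}

function createAgent(context: AgentContext): AgentEvaluator {
  const factory = registry.get(context.config.agent.type)
  if (!factory) {
    // Listing registered types makes config typos easy to diagnose
    throw new Error(
      `Unknown agent type: ${context.config.agent.type}. Registered: ${[...registry.keys()].join(', ')}`,
    )
  }
  return factory(context)
}

function getRegisteredAgentTypes(): string[] {
  return [...registry.keys()]
}

// Evaluators register themselves once at module load (agents/index.ts)
registerAgent('single', (ctx) => ({ execute: async () => `ran ${ctx.config.agent.type}` }))
console.log(getRegisteredAgentTypes())
```

This is what lets the runner drop its if/else over agent types in Phase 5: new patterns only add a `registerAgent` call.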
---
## Phase 4: Agent Refactors (2 Parallel Subagents)
**Depends on:** Phase 2 + Phase 3
### Subagent 4A: Single Agent Refactor
```
Refactor /apps/eval/src/agents/single-agent.ts
Changes:
1. Change constructor to accept AgentContext instead of individual params:
constructor(private ctx: AgentContext) {}
2. Use ctx.capture instead of creating ScreenshotCapture/MessageLogger:
- Remove local ScreenshotCapture initialization
- Remove local MessageLogger initialization
- Remove local hooks setup
- Use ctx.capture.createToolHooks() for GeminiAgent hooks
- Use ctx.capture.messageLogger.logUser/logAssistant
- Use ctx.capture.addError/addWarning
- Use ctx.capture.getMessages(), getScreenshotCount(), etc.
3. Build metadata using capture methods
4. Remove TrajectorySaver init (done in CaptureContext)
5. Keep the core agent execution logic (GeminiAgent.create, agent.execute)
Reference:
- Current implementation: /apps/eval/src/agents/single-agent.ts
- Target implementation: DESIGN_DOC.md section "5. Single Agent Evaluator"
```
### Subagent 4B: Orchestrator-Executor Refactor
```
Refactor /apps/eval/src/agents/orchestrator-executor/index.ts
Changes:
1. Change OrchestratorExecutorEvaluator constructor to accept AgentContext:
constructor(private ctx: AgentContext) {}
2. Initialize capture from context (already done in runner)
3. Add hook integration:
- Create executor hooks that use ctx.capture.createToolHooks()
- Wire hooks through Orchestrator → ExecutorStore → Executor
- Call ctx.capture.logDelegation() when orchestrator delegates
- Call ctx.capture.logDelegationResult() when executor returns
4. Update return to include messages:
return {
metadata,
messages: ctx.capture.getMessages(), // Now populated!
finalAnswer,
}
Also update supporting files if needed:
- orchestrator.ts - add setExecutorHooks() method
- executor.ts - accept external hooks via setObservationHooks()
- executor-store.ts - pass hooks to new executors
Reference:
- Current: /apps/eval/src/agents/orchestrator-executor/index.ts
- Target: DESIGN_DOC.md and previous IMPLEMENTATION_PLAN.md
```
---
## Phase 5: Runner Update (1 Subagent)
**Depends on:** Phase 4
### Subagent 5A: Task Executor Update
```
Update /apps/eval/src/runner/task-executor.ts
Changes:
1. Import createAgent from ../agents instead of individual evaluators
2. Import CaptureContext from ../capture
3. In execute() method:
- Create CaptureContext and call init()
- Build AgentContext with all required fields
- Use createAgent(context) instead of if-else switch
- Remove the if (config.agent.type === 'single') / else if blocks
4. Remove direct imports of SingleAgentEvaluator, OrchestratorExecutorEvaluator
Before:
```typescript
if (this.config.agent.type === 'single') {
const evaluator = new SingleAgentEvaluator(this.config, task, window.windowId, ...)
} else if (this.config.agent.type === 'orchestrator-executor') {
const evaluator = new OrchestratorExecutorEvaluator(this.config, task, ...)
}
```
After:
```typescript
const capture = new CaptureContext({ serverUrl, outputDir, taskId, tabId, windowId })
const taskOutputDir = await capture.init()
const context: AgentContext = {
config: this.config,
task,
windowId: window.windowId,
tabId: window.tabId,
outputDir: this.outputDir,
taskOutputDir,
capture,
}
const agent = createAgent(context)
const agentResult = await agent.execute()
```
Reference:
- Current: /apps/eval/src/runner/task-executor.ts (lines 143-186)
- Target: DESIGN_DOC.md section "6. Task Executor"
```
---
## Phase 6: Cleanup & Test (1 Subagent)
**Depends on:** Phase 5
### Subagent 6A: Cleanup Old Files + Verify
```
Tasks:
1. Delete old /apps/eval/src/types.ts (replaced by types/ folder)
2. Check imports across the codebase:
- import { EvalConfig, Task, Message } from '../types' keeps resolving unchanged, since types/index.ts re-exports everything
3. Update /apps/eval/src/utils/config-validator.ts:
- Import schemas from ../types/config instead of defining locally
- Remove duplicate schema definitions
4. Verify no TypeScript errors:
- Run: cd apps/eval && bun run typecheck
5. Test single-agent eval:
- Run: cd apps/eval && bun run eval -c configs/webvoyager-test.json
- Verify screenshots captured
- Verify messages.jsonl populated
6. Test orchestrator-executor eval:
- Run: cd apps/eval && bun run eval -c configs/orchestrator-executor-test.json
- Verify screenshots captured
- Verify messages.jsonl has delegation messages
- Verify graders pass (no "no_screenshots" error)
Report any issues found.
```
---
## Execution Summary
| Phase | Subagents | Can Parallelize? | Dependencies |
|-------|-----------|------------------|--------------|
| 1 | 4 (1A, 1B, 1C, 1D) | Yes - all parallel | None |
| 2 | 2 (2A, 2B) | Yes - both parallel | Phase 1 |
| 3 | 1 (3A) | Yes - parallel with Phase 2 | Phase 1 |
| 4 | 2 (4A, 4B) | Yes - both parallel | Phase 2 + 3 |
| 5 | 1 (5A) | No | Phase 4 |
| 6 | 1 (6A) | No | Phase 5 |
**Total: 11 subagent tasks**
**Parallel execution timeline:**
```
Time →
─────────────────────────────────────────────────────────────────
Phase 1: [1A] [1B] [1C] [1D] (4 parallel)
─────────────────
Phase 2: [2A] [2B] (2 parallel)
Phase 3: [3A] (parallel with Phase 2)
───────────
Phase 4: [4A] [4B] (2 parallel)
──────────
Phase 5: [5A]
────
Phase 6: [6A]
────
```
**Maximum parallelism: 4 subagents** (Phase 1)

# Eval System - Production Grade Implementation Plan
## Overview
This plan outlines the changes needed to make the eval system production-grade with uniform agent observation across all agent patterns (single-agent, orchestrator-executor, future patterns).
**Goal:** All agent evaluators produce consistent `AgentResult` with screenshots, message traces, and verifiable action sequences.
---
## Phase 1: Type System Extensions
### 1.1 Add New Message Types
**File:** `src/types.ts`
Add delegation-specific message types for orchestrator pattern:
```typescript
// After ErrorMessage definition (~line 99)
export interface DelegationMessage extends BaseMessage {
type: 'delegation'
instruction: string
executorId: string
maxSteps?: number
}
export interface DelegationResultMessage extends BaseMessage {
type: 'delegation_result'
executorId: string
summary: string
status: 'done' | 'blocked' | 'max_steps'
stepsUsed: number
currentUrl?: string
}
// Update Message union (~line 101)
export type Message =
| UserMessage
| AssistantMessage
| ToolCallMessage
| ToolResultMessage
| ErrorMessage
| DelegationMessage // NEW
| DelegationResultMessage // NEW
// Add type guards
export function isDelegationMessage(msg: Message): msg is DelegationMessage {
return msg.type === 'delegation'
}
export function isDelegationResultMessage(msg: Message): msg is DelegationResultMessage {
return msg.type === 'delegation_result'
}
```
### 1.2 Add Orchestrator Hook Types
**File:** `src/agents/orchestrator-executor/types.ts`
```typescript
// Add after existing types
export interface OrchestratorHooks {
onDelegation?: (instruction: string, executorId: string, maxSteps?: number) => Promise<void>
onDelegationResult?: (result: ExecutorResult) => Promise<void>
onTurnStart?: (turn: number) => Promise<void>
onTurnComplete?: (turn: number) => Promise<void>
onComplete?: (answer: string) => Promise<void>
onFailed?: (reason: string) => Promise<void>
}
export interface ExecutorObservationHooks {
onBeforeToolCall?: (toolName: string, args: unknown) => Promise<string> // returns toolCallId
onAfterToolCall?: (toolName: string, toolCallId: string, result: unknown, isError: boolean) => Promise<void>
}
```
---
## Phase 2: Unified Capture Infrastructure
### 2.1 Create EvalCapture Class
**File:** `src/capture/eval-capture.ts` (NEW)
```typescript
/**
* EvalCapture - Unified capture infrastructure for all agent evaluators
*
* Combines screenshot capture, message logging, and provides hooks for
* both single-agent and orchestrator-executor patterns.
*/
import { randomUUID } from 'node:crypto'
import type {
AssistantMessage,
DelegationMessage,
DelegationResultMessage,
ErrorMessage,
Message,
ToolCallMessage,
ToolResultMessage,
UserMessage,
} from '../types'
import { MessageLogger } from './message-logger'
import { ScreenshotCapture } from './screenshot'
export interface EvalCaptureConfig {
serverUrl: string
outputDir: string
tabId: number
windowId: number
}
export class EvalCapture {
private screenshotCapture: ScreenshotCapture
private messageLogger: MessageLogger
private tabId: number
private windowId: number
private currentToolCallId: string | null = null
constructor(config: EvalCaptureConfig) {
this.screenshotCapture = new ScreenshotCapture(config.serverUrl, config.outputDir)
this.messageLogger = new MessageLogger(config.outputDir)
this.tabId = config.tabId
this.windowId = config.windowId
}
async init(): Promise<void> {
await this.screenshotCapture.init()
}
// ============================================================================
// Screenshot Capture
// ============================================================================
async captureScreenshot(): Promise<number> {
return this.screenshotCapture.capture(this.tabId, this.windowId)
}
getScreenshotCount(): number {
return this.screenshotCapture.getCount()
}
// ============================================================================
// Message Logging - Basic Types
// ============================================================================
async logUser(content: string): Promise<void> {
await this.messageLogger.logUser(content)
}
async logAssistant(content: string): Promise<void> {
await this.messageLogger.logAssistant(content)
}
async logError(content: string, errorCode?: string): Promise<void> {
await this.messageLogger.logError(content, errorCode)
}
// ============================================================================
// Tool Call Logging (for single-agent and executor)
// ============================================================================
async logToolCall(tool: string, params: Record<string, unknown>): Promise<string> {
const toolCallId = randomUUID()
this.currentToolCallId = toolCallId
await this.messageLogger.logToolCall(tool, toolCallId, params)
return toolCallId
}
async logToolResult(
toolCallId: string,
result: unknown,
isError: boolean,
screenshot?: number,
): Promise<void> {
await this.messageLogger.logToolResult(toolCallId, result, isError, screenshot)
this.currentToolCallId = null
}
getCurrentToolCallId(): string | null {
return this.currentToolCallId
}
// ============================================================================
// Delegation Logging (for orchestrator-executor)
// ============================================================================
async logDelegation(
instruction: string,
executorId: string,
maxSteps?: number,
): Promise<void> {
// Delegate to the MessageLogger extension added in section 2.2, which both
// keeps the message in memory and appends it to messages.jsonl
await this.messageLogger.logDelegation(instruction, executorId, maxSteps)
}
async logDelegationResult(
executorId: string,
summary: string,
status: 'done' | 'blocked' | 'max_steps',
stepsUsed: number,
currentUrl?: string,
): Promise<void> {
await this.messageLogger.logDelegationResult(executorId, summary, status, stepsUsed, currentUrl)
}
// ============================================================================
// Helpers
// ============================================================================
getMessages(): Message[] {
return this.messageLogger.getMessages()
}
getLastAssistantMessage(): string | null {
return this.messageLogger.getLastAssistantMessage()
}
}
```
### 2.2 Extend MessageLogger for New Types
**File:** `src/capture/message-logger.ts`
Add methods for delegation messages:
```typescript
// Add after logError method
async logDelegation(
instruction: string,
executorId: string,
maxSteps?: number,
): Promise<void> {
const message: DelegationMessage = {
type: 'delegation',
timestamp: new Date().toISOString(),
instruction,
executorId,
...(maxSteps !== undefined && { maxSteps }),
}
await this.append(message)
}
async logDelegationResult(
executorId: string,
summary: string,
status: 'done' | 'blocked' | 'max_steps',
stepsUsed: number,
currentUrl?: string,
): Promise<void> {
const message: DelegationResultMessage = {
type: 'delegation_result',
timestamp: new Date().toISOString(),
executorId,
summary,
status,
stepsUsed,
...(currentUrl && { currentUrl }),
}
await this.append(message)
}
```
---
## Phase 3: Executor Hook Integration
### 3.1 Modify Executor to Accept External Hooks
**File:** `src/agents/orchestrator-executor/executor.ts`
```typescript
// Add import
import type { ExecutorObservationHooks } from './types'
export class Executor {
private agent: GeminiAgent | null = null
private stepsUsed = 0
private currentUrl = ''
private config: ExecutorConfig
private serverUrl: string
private windowId: number
private tabId: number
private observationHooks?: ExecutorObservationHooks // NEW
// ... existing constructor ...
/**
* Set external observation hooks for capture integration
*/
setObservationHooks(hooks: ExecutorObservationHooks): void {
this.observationHooks = hooks
}
async execute(
instruction: string,
maxSteps?: number,
signal?: AbortSignal,
): Promise<Omit<ExecutorResult, 'executorId'>> {
// ... existing setup ...
// Track steps via hooks - MODIFIED to include external observation
let stepsThisRun = 0
let currentToolCallId = ''
const hooks: ToolExecutionHooks = {
onBeforeToolCall: async (toolName: string, args: unknown) => {
// Call external hook if set (for logging); it returns the toolCallId
// so the matching tool result can reference it
if (this.observationHooks?.onBeforeToolCall) {
currentToolCallId = await this.observationHooks.onBeforeToolCall(toolName, args)
}
},
onAfterToolCall: async (toolName: string, result: ToolExecutionResult) => {
stepsThisRun++
this.stepsUsed++
// Call external hook if set (for screenshot capture and logging)
if (this.observationHooks?.onAfterToolCall) {
await this.observationHooks.onAfterToolCall(
toolName,
currentToolCallId,
result.parts,
result.isError,
)
}
},
}
this.agent.setToolHooks(hooks)
// ... rest of execute method ...
}
}
```
### 3.2 Pass Hooks Through ExecutorStore
**File:** `src/agents/orchestrator-executor/executor-store.ts`
```typescript
import type { ExecutorObservationHooks } from './types'
export class ExecutorStore {
private executors = new Map<string, Executor>()
private observationHooks?: ExecutorObservationHooks // NEW
/**
* Set observation hooks that will be applied to all executors
*/
setObservationHooks(hooks: ExecutorObservationHooks): void {
this.observationHooks = hooks
// Apply to existing executors
for (const executor of this.executors.values()) {
executor.setObservationHooks(hooks)
}
}
getOrCreate(
id: string,
config: ExecutorConfig,
serverUrl: string,
windowId: number,
tabId: number,
): Executor {
if (!this.executors.has(id)) {
const executor = new Executor(config, serverUrl, windowId, tabId)
// Apply observation hooks to new executor
if (this.observationHooks) {
executor.setObservationHooks(this.observationHooks)
}
this.executors.set(id, executor)
}
return this.executors.get(id)!
}
// ... rest unchanged ...
}
```
---
## Phase 4: Orchestrator Hook Integration
### 4.1 Add Hooks to OrchestratorAgent
**File:** `src/agents/orchestrator-executor/orchestrator-agent.ts`
```typescript
import type { ExecutorObservationHooks, OrchestratorHooks } from './types'
export class OrchestratorAgent {
private orchestratorHooks?: OrchestratorHooks // NEW
private constructor(
private client: GeminiClient,
private geminiConfig: GeminiConfig,
private state: OrchestratorState,
private executorStore: ExecutorStore,
private maxTurns: number,
) {}
/**
* Set orchestrator-level hooks for delegation tracking
*/
setHooks(hooks: OrchestratorHooks): void {
this.orchestratorHooks = hooks
}
/**
* Set executor observation hooks (passed through to ExecutorStore)
*/
setExecutorObservationHooks(hooks: ExecutorObservationHooks): void {
this.executorStore.setObservationHooks(hooks)
}
/**
* Get hooks for tool context (used by orchestrator-tools.ts)
*/
getOrchestratorHooks(): OrchestratorHooks | undefined {
return this.orchestratorHooks
}
async run(taskQuery: string): Promise<OrchestratorAgentResult> {
let currentParts: Part[] = [{ text: taskQuery }]
let turns = 0
while (
!this.state.isComplete &&
!this.state.isFailed &&
turns < this.maxTurns
) {
turns++
// Fire turn start hook
await this.orchestratorHooks?.onTurnStart?.(turns)
// ... existing turn logic ...
// Fire turn complete hook
await this.orchestratorHooks?.onTurnComplete?.(turns)
}
// Fire completion hooks
if (this.state.isComplete && this.state.finalAnswer) {
await this.orchestratorHooks?.onComplete?.(this.state.finalAnswer)
} else if (this.state.isFailed && this.state.failureReason) {
await this.orchestratorHooks?.onFailed?.(this.state.failureReason)
}
return {
success: this.state.isComplete,
answer: this.state.finalAnswer,
reason: this.state.failureReason,
delegationCount: this.state.delegationCount,
totalExecutorSteps: this.state.totalExecutorSteps,
turns,
}
}
// ... rest unchanged ...
}
```
### 4.2 Fire Hooks in Orchestrator Tools
**File:** `src/agents/orchestrator-executor/orchestrator-tools.ts`
Modify the delegate tool handler to fire hooks:
```typescript
// In createOrchestratorTools function, modify the delegate tool handler
// Inside the delegate tool's handler:
handler: async (args) => {
const { instruction, executorId, maxSteps } = args as DelegateParams
// Fire delegation hook BEFORE execution
const hooks = context.getOrchestratorHooks?.()
const actualExecutorId = executorId ?? randomUUID()
await hooks?.onDelegation?.(instruction, actualExecutorId, maxSteps)
// Get or create executor
const executor = context.executorStore.getOrCreate(
actualExecutorId,
context.executorConfig,
context.serverUrl,
context.windowId,
context.tabId,
)
// Execute
const result = await executor.execute(instruction, maxSteps)
// Update state
context.state.delegationCount++
context.state.totalExecutorSteps += result.stepsUsed
// Fire delegation result hook AFTER execution
await hooks?.onDelegationResult?.({
...result,
executorId: actualExecutorId,
})
// Return result to orchestrator
return {
executorId: actualExecutorId,
...result,
}
}
```
---
## Phase 5: Update OrchestratorExecutorEvaluator
### 5.1 Full Integration
**File:** `src/agents/orchestrator-executor/index.ts`
```typescript
import { ScreenshotCapture } from '../../capture/screenshot'
import { MessageLogger } from '../../capture/message-logger'
import { TrajectorySaver } from '../../capture/trajectory-saver'
import type { ExecutorObservationHooks, OrchestratorHooks } from './types'
export class OrchestratorExecutorEvaluator implements AgentEvaluator {
constructor(
private config: EvalConfig,
private task: Task,
private windowId: number,
private tabId: number,
private outputDir: string,
) {}
async execute(): Promise<AgentResult> {
const startTime = Date.now()
const timeoutMs = this.config.timeout_ms ?? DEFAULT_TIMEOUT_MS
const errors: TaskError[] = []
const warnings: EvalWarning[] = []
const addError = (source: TaskError['source'], message: string, details?: Record<string, unknown>) => {
errors.push({ source, message, timestamp: new Date().toISOString(), details })
}
const addWarning = (source: EvalWarning['source'], message: string) => {
warnings.push({ source, message, timestamp: new Date().toISOString() })
console.warn(`[${source}] ${message}`)
}
// Initialize trajectory saver
const saver = new TrajectorySaver(this.outputDir, this.task.query_id)
const taskOutputDir = await saver.init()
// NEW: Initialize capture infrastructure (same as single-agent)
const screenshotCapture = new ScreenshotCapture(
this.config.browseros.server_url,
taskOutputDir,
)
await screenshotCapture.init()
const messageLogger = new MessageLogger(taskOutputDir)
// Log initial user message
await messageLogger.logUser(this.task.query)
// Validate config type
if (this.config.agent.type !== 'orchestrator-executor') {
throw new Error('OrchestratorExecutorEvaluator requires orchestrator-executor config')
}
const agentConfig = this.config.agent as OrchestratorExecutorConfig
const { orchestrator: orchestratorConfig, executor: executorConfig } =
resolveAgentConfig(agentConfig)
// Create orchestrator
const orchestrator = new Orchestrator(
orchestratorConfig,
executorConfig,
this.config.browseros.server_url,
this.windowId,
this.tabId,
)
// NEW: Set up executor observation hooks (for tool call/result capture)
let currentToolCallId: string | null = null
const executorHooks: ExecutorObservationHooks = {
onBeforeToolCall: async (toolName: string, args: unknown) => {
// Assign before the try so the declared Promise<string> return type
// holds even when logging throws
currentToolCallId = randomUUID()
try {
await messageLogger.logToolCall(toolName, currentToolCallId, args as Record<string, unknown>)
} catch (err) {
addWarning('message_logging', `Failed to log tool call ${toolName}: ${err instanceof Error ? err.message : String(err)}`)
}
return currentToolCallId
},
onAfterToolCall: async (toolName: string, _toolCallId: string, result: unknown, isError: boolean) => {
let screenshotNum = 0
// Capture screenshot after tool execution
try {
screenshotNum = await screenshotCapture.capture(this.tabId, this.windowId)
} catch (err) {
addWarning('screenshot', `Screenshot after ${toolName} failed: ${err instanceof Error ? err.message : String(err)}`)
screenshotNum = screenshotCapture.getCount()
}
// Log tool errors
if (isError) {
addWarning('mcp_tool', `Tool ${toolName} returned error`)
}
if (!currentToolCallId) {
addWarning('message_logging', 'Tool result without matching tool call')
return
}
try {
await messageLogger.logToolResult(currentToolCallId, result, isError, screenshotNum)
} catch (err) {
addWarning('message_logging', `Failed to log tool result: ${err instanceof Error ? err.message : String(err)}`)
}
currentToolCallId = null
},
}
// NEW: Set up orchestrator hooks (for delegation tracking)
const orchestratorHooks: OrchestratorHooks = {
onDelegation: async (instruction: string, executorId: string, maxSteps?: number) => {
try {
await messageLogger.logDelegation(instruction, executorId, maxSteps)
} catch (err) {
addWarning('message_logging', `Failed to log delegation: ${err instanceof Error ? err.message : String(err)}`)
}
},
onDelegationResult: async (result) => {
try {
await messageLogger.logDelegationResult(
result.executorId,
result.summary,
result.status,
result.stepsUsed,
result.currentUrl,
)
} catch (err) {
addWarning('message_logging', `Failed to log delegation result: ${err instanceof Error ? err.message : String(err)}`)
}
},
}
// Apply hooks to orchestrator
orchestrator.setHooks(orchestratorHooks)
orchestrator.setExecutorObservationHooks(executorHooks)
// Set up timeout
const abortController = new AbortController()
const timeoutHandle = setTimeout(() => {
abortController.abort()
}, timeoutMs)
let terminationReason: 'completed' | 'max_steps' | 'error' | 'timeout' = 'completed'
let finalAnswer: string | null = null
let orchestratorResult: Awaited<ReturnType<typeof orchestrator.run>> | null = null
try {
const runPromise = orchestrator.run(this.task.query)
orchestratorResult = await Promise.race([
runPromise,
new Promise<never>((_, reject) => {
abortController.signal.addEventListener('abort', () => {
reject(new Error('Timeout'))
})
}),
])
if (orchestratorResult.success) {
finalAnswer = orchestratorResult.answer
terminationReason = 'completed'
// Log final assistant message
if (finalAnswer) {
await messageLogger.logAssistant(finalAnswer)
}
} else {
terminationReason = 'error'
addError('agent_execution', orchestratorResult.reason ?? 'Unknown failure')
await messageLogger.logError(orchestratorResult.reason ?? 'Unknown failure')
}
} catch (err) {
const error = err instanceof Error ? err : new Error(String(err))
if (error.message === 'Timeout' || abortController.signal.aborted) {
terminationReason = 'timeout'
addError('agent_execution', `Task timed out after ${timeoutMs / 1000}s`)
} else {
terminationReason = 'error'
addError('agent_execution', error.message, { stack: error.stack })
}
await messageLogger.logError(error.message)
} finally {
clearTimeout(timeoutHandle)
orchestrator.getExecutorStore().clear()
}
const endTime = Date.now()
// Create metadata
const metadata: TaskMetadata = {
query_id: this.task.query_id,
dataset: this.task.dataset,
query: this.task.query,
started_at: new Date(startTime).toISOString(),
completed_at: new Date(endTime).toISOString(),
total_duration_ms: endTime - startTime,
total_steps: screenshotCapture.getCount(), // Now accurate
termination_reason: terminationReason,
final_answer: finalAnswer,
errors,
warnings,
agent_config: {
type: 'orchestrator-executor',
model: `${orchestratorConfig.model} / ${executorConfig.model}`,
},
grader_results: {},
}
await saver.saveMetadata(metadata)
return {
metadata,
messages: messageLogger.getMessages(), // NOW POPULATED
finalAnswer,
}
}
}
```
---
## Phase 6: Orchestrator Class Updates
### 6.1 Add Hook Passthrough Methods
**File:** `src/agents/orchestrator-executor/orchestrator.ts`
```typescript
import type { ExecutorObservationHooks, OrchestratorHooks } from './types'
export class Orchestrator {
private agent: OrchestratorAgent | null = null
private executorStore: ExecutorStore
private pendingOrchestratorHooks?: OrchestratorHooks
private pendingExecutorHooks?: ExecutorObservationHooks
constructor(
private orchestratorConfig: OrchestratorConfig,
private executorConfig: ExecutorConfig,
private serverUrl: string,
private windowId: number,
private tabId: number,
) {
this.executorStore = new ExecutorStore()
}
/**
* Set orchestrator-level hooks (must be called before run())
*/
setHooks(hooks: OrchestratorHooks): void {
this.pendingOrchestratorHooks = hooks
if (this.agent) {
this.agent.setHooks(hooks)
}
}
/**
* Set executor observation hooks (must be called before run())
*/
setExecutorObservationHooks(hooks: ExecutorObservationHooks): void {
this.pendingExecutorHooks = hooks
this.executorStore.setObservationHooks(hooks)
if (this.agent) {
this.agent.setExecutorObservationHooks(hooks)
}
}
async run(taskQuery: string): Promise<OrchestratorAgentResult> {
this.agent = await OrchestratorAgent.create(
this.orchestratorConfig,
this.executorConfig,
this.serverUrl,
this.windowId,
this.tabId,
)
// Apply pending hooks
if (this.pendingOrchestratorHooks) {
this.agent.setHooks(this.pendingOrchestratorHooks)
}
if (this.pendingExecutorHooks) {
this.agent.setExecutorObservationHooks(this.pendingExecutorHooks)
}
const result = await this.agent.run(taskQuery)
this.executorStore = this.agent.getExecutorStore()
return result
}
getExecutorStore(): ExecutorStore {
return this.agent?.getExecutorStore() ?? this.executorStore
}
}
```
---
## Implementation Order
1. **Phase 1** - Type extensions (types.ts) - 30 min
2. **Phase 2** - MessageLogger extensions - 30 min
3. **Phase 3** - Executor hook integration - 1 hour
4. **Phase 4** - OrchestratorAgent hooks - 1 hour
5. **Phase 5** - OrchestratorExecutorEvaluator update - 1.5 hours
6. **Phase 6** - Orchestrator passthrough - 30 min
7. **Testing** - End-to-end verification - 1 hour
**Total estimated time:** ~6 hours
---
## Testing Checklist
- [ ] Single-agent eval still works (regression test)
- [ ] Orchestrator-executor produces screenshots in output folder
- [ ] Orchestrator-executor produces messages.jsonl with:
- [ ] user message
- [ ] delegation messages
- [ ] tool_call messages (from executor)
- [ ] tool_result messages with screenshot numbers
- [ ] delegation_result messages
- [ ] assistant message (final answer)
- [ ] Graders pass with orchestrator-executor (no "no_screenshots" error)
- [ ] metadata.json has accurate `total_steps` count
- [ ] Error/warning capture works for both patterns
---
## Future Considerations
1. **New Agent Patterns:** Any new agent type just needs to:
- Accept hooks in constructor or via setter
- Fire hooks at appropriate points
- Use shared capture infrastructure
2. **Grader Updates:** May need to update graders to understand delegation messages
3. **Parallel Executors:** If orchestrator delegates to multiple executors in parallel, need to handle concurrent screenshot capture
4. **Memory/Performance:** Screenshot capture creates MCP connection per capture - consider connection pooling for high-volume evals
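The pooling idea in point 4 can be sketched generically. `FakeConn` below stands in for an MCP client (the real connect/close calls are not shown); only the acquire/release reuse pattern is the point.

```typescript
// Hedged sketch of a tiny connection pool for screenshot capture.
class FakeConn {
  static created = 0
  constructor() {
    FakeConn.created++
  }
}

class ConnectionPool<T> {
  private idle: T[] = []

  constructor(private create: () => T, private maxIdle = 4) {}

  acquire(): T {
    // Reuse an idle connection when available, otherwise open a new one
    return this.idle.pop() ?? this.create()
  }

  release(conn: T): void {
    // Keep up to maxIdle connections around; extras would be closed
    if (this.idle.length < this.maxIdle) this.idle.push(conn)
  }
}

const pool = new ConnectionPool(() => new FakeConn())
for (let i = 0; i < 10; i++) {
  const conn = pool.acquire()
  // ...capture a screenshot over conn...
  pool.release(conn)
}
console.log(FakeConn.created) // 1: ten sequential captures reuse one connection
```

Sequential captures collapse to a single connection; parallel executors (point 3) would raise the pool's working set but still cap total connections.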

[![License: AGPL v3](https://img.shields.io/badge/License-AGPL%20v3-blue.svg)](../../../../LICENSE)
Evaluation framework for BrowserOS browser automation agents. Runs tasks from standard datasets ([WebVoyager](https://arxiv.org/abs/2401.13919), [Mind2Web](https://arxiv.org/abs/2306.06070), AGI SDK / REAL Bench, WebArena-Infinity, WebBench), captures trajectories with screenshots, and grades results automatically.
## Prerequisites
- **BrowserOS binary** at `/Applications/BrowserOS.app` (macOS) or `BROWSEROS_BINARY` pointing at it
- **Bun** runtime
- **API keys** for your LLM provider (and `CLAUDE_CODE_OAUTH_TOKEN` if you use `performance_grader`)
## Quick Start
### 1. Set up environment
```bash
cd apps/eval
```
### 2. Launch the dashboard
```bash
# Edit .env.development with your keys, then:
bun run eval
```
Opens the eval dashboard at `http://localhost:9900` in config mode. From there: load a preset, edit settings, click **Run**.
### CLI mode
```bash
bun run eval -c configs/browseros-agent-weekly.json
```
Runs immediately. Dashboard still available at `http://localhost:9900` for live progress.
## Agent types
| Type | Description |
|------|-------------|
| `single` | Single LLM agent driven by the BrowserOS tool loop (CDP) |
| `orchestrator-executor` | High-level orchestrator + per-step executor (LLM or Clado visual model) |
### Single agent
```json
{
"agent": {
"type": "single",
"provider": "openai-compatible",
"model": "moonshotai/kimi-k2.5",
"apiKey": "OPENROUTER_API_KEY",
"baseUrl": "https://openrouter.ai/api/v1",
"supportsImages": true
}
}
```
### Orchestrator-Executor
The orchestrator works with any LLM provider. The executor can be another LLM, or the **Clado action** visual model that takes screenshots and predicts click/type/scroll coordinates.
```json
{
"agent": {
"type": "orchestrator-executor",
"orchestrator": {
"provider": "openai",
"model": "gpt-4o",
"apiKey": "OPENAI_API_KEY"
"provider": "openai-compatible",
"model": "accounts/fireworks/models/kimi-k2p5",
"apiKey": "FIREWORKS_API_KEY",
"baseUrl": "https://api.fireworks.ai/inference/v1"
},
"executor": {
"provider": "clado-action",
@@ -84,73 +70,31 @@ The orchestrator works with **any LLM provider**. Pick whichever you have access
"apiKey": "",
"baseUrl": "https://clado-ai--clado-browseros-action-actionmodel-generate.modal.run"
}
}
}
```
## Graders
| Name | Description |
|------|-------------|
| `performance_grader` | Multi-axis grader running on Claude Agent SDK (uses its own credentials via `CLAUDE_CODE_OAUTH_TOKEN`) |
| `agisdk_state_diff` | AGI SDK / REAL Bench environment state-diff grader (deterministic) |
| `infinity_state` | WebArena-Infinity verifier-script grader (deterministic) |
Set `graders` in your config to override the per-task `graders` field from the dataset:
```json
"orchestrator": {
"provider": "anthropic",
"model": "claude-sonnet-4-20250514",
"apiKey": "ANTHROPIC_API_KEY"
}
"graders": ["performance_grader"]
```
## Configuration reference
### API keys
The `apiKey` field supports two formats:
- **Env var name**: `"OPENAI_API_KEY"` — resolved from `.env.development` at runtime
- **Direct value**: `"sk-xxxxx"` — used as-is (not recommended)
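Resolution can be sketched like this (the all-caps heuristic is an assumption about how the framework distinguishes the two formats, not its exact rule):

```typescript
// Resolve an apiKey field: SCREAMING_SNAKE_CASE names are treated as env
// var names; anything else is used as a literal key. Heuristic is assumed.
function resolveApiKey(
  value: string,
  env: Record<string, string | undefined>,
): string {
  if (/^[A-Z][A-Z0-9_]*$/.test(value)) {
    const resolved = env[value];
    if (!resolved) throw new Error(`env var ${value} is not set`);
    return resolved;
  }
  return value; // direct value, used as-is
}
```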
### Supported providers
@@ -160,7 +104,7 @@ The `apiKey` field supports two formats:
| Anthropic | `anthropic` | No |
| Google | `google` | No |
| Azure OpenAI | `azure` | Yes |
| AWS Bedrock | `bedrock` | No |
| OpenRouter | `openrouter` | No |
| Fireworks, Together, etc. | `openai-compatible` | Yes |
| Ollama | `ollama` | No |
@@ -179,34 +123,27 @@ The `apiKey` field supports two formats:
}
```
Each worker gets its own Chrome instance. Worker N uses `base_port + N` for CDP and server ports.
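The port layout can be sketched as follows (field names mirror the `browseros` block above; the function itself is illustrative):

```typescript
// Worker N gets base + N in each port family, so workers never collide.
interface PortBases {
  base_cdp_port: number;
  base_server_port: number;
  base_extension_port: number;
}

function workerPorts(bases: PortBases, worker: number) {
  return {
    cdp: bases.base_cdp_port + worker,
    server: bases.base_server_port + worker,
    extension: bases.base_extension_port + worker,
  };
}
```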
### Execution settings
| Field | Description | Default |
|-------|-------------|---------|
| `num_workers` | Parallel workers (each gets its own Chrome) | `1` |
| `timeout_ms` | Per-task timeout in ms | `1800000` (30 min) |
| `restart_server_per_task` | Restart Chrome between tasks (cleaner state, slower) | `false` |
## Datasets
| File | Tasks | Description |
|------|-------|-------------|
| `webvoyager_e2e_test.jsonl` | 10 | WebVoyager test subset (quick smoke test) |
| `webvoyager.jsonl` | 643 | Full WebVoyager benchmark |
| `mind2web_e2e_test.jsonl` | 10 | Mind2Web test subset |
| `mind2web.jsonl` | 300 | Online-Mind2Web |
| `webbench-{0,1,2}of4-50.jsonl` | 50 each | WebBench shards (50-task subsets) |
| `agisdk-real.jsonl` | 40 | AGI SDK / REAL Bench (action-only tasks) |
| `webarena-infinity-hard-50.jsonl` | 50 | WebArena-Infinity hard set |
| `browsecomp-medium-hard-50.jsonl` | 50 | BrowseComp medium-hard |
| `browsecomp-very-hard-50.jsonl` | 50 | BrowseComp very-hard |
Task format (JSONL, one per line):
@@ -215,7 +152,7 @@ Task format (JSONL, one per line):
"query_id": "Amazon--0",
"dataset": "webvoyager",
"query": "Search an Xbox Wireless controller with green color and rated above 4 stars.",
"graders": ["webvoyager_grader", "fara_combined"],
"graders": ["performance_grader"],
"start_url": "https://www.amazon.com/",
"metadata": { "original_task_id": "Amazon--0", "website": "Amazon" }
}
@@ -227,24 +164,25 @@ Results are saved to `output_dir`:
```
results/
browseros-agent-weekly/
2026-04-29-1430/
Amazon--0/
metadata.json # Task result, timing, grader scores
messages.jsonl # Full message log
screenshots/
001.png # Step-by-step screenshots
002.png
summary.json # Aggregate pass rates
```
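The aggregate pass rate in `summary.json` can be recomputed from per-task results; a sketch, where the `passed` field is an assumed stand-in for whatever `metadata.json` actually records:

```typescript
// Aggregate pass rate over task results. The `passed` field name is an
// assumption about metadata.json's shape, not the real schema.
interface TaskResult {
  passed: boolean;
}

function passRate(results: TaskResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.passed).length / results.length;
}
```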
## Troubleshooting
**BrowserOS not found**: Expects `/Applications/BrowserOS.app/Contents/MacOS/BrowserOS`. Set `BROWSEROS_BINARY` to override.
**Port conflicts**: Each worker uses `base_port + workerIndex`. 3 workers on base 9110 → ports 9110, 9111, 9112. Stop other BrowserOS instances first.
**API key not resolving**: If your config has `"apiKey": "OPENAI_API_KEY"`, ensure the env var is set in `.env.development`.
**Tasks timing out**: Increase `timeout_ms`. Default is 30 minutes.
**Headless vs headed**: Set `"headless": false` to watch Chrome in real time.


@@ -1,18 +0,0 @@
{
"agent": {
"type": "single",
"provider": "openrouter",
"model": "openai/gpt-4o",
"apiKey": "OPENROUTER_API_KEY"
},
"dataset": "data/webvoyager_e2e_test.jsonl",
"output_dir": "results",
"num_workers": 5,
"browseros": {
"server_url": "http://127.0.0.1:9110"
},
"grader_api_key_env": "OPENROUTER_API_KEY",
"grader_base_url": "https://openrouter.ai/api/v1",
"grader_model": "openai/gpt-4.1",
"timeout_ms": 300000
}


@@ -0,0 +1,26 @@
{
"agent": {
"type": "single",
"provider": "openai-compatible",
"model": "moonshotai/kimi-k2.5",
"apiKey": "OPENROUTER_API_KEY",
"baseUrl": "https://openrouter.ai/api/v1",
"supportsImages": true
},
"dataset": "../data/agisdk-real.jsonl",
"num_workers": 10,
"restart_server_per_task": true,
"browseros": {
"server_url": "http://127.0.0.1:9110",
"base_cdp_port": 9010,
"base_server_port": 9110,
"base_extension_port": 9310,
"load_extensions": false,
"headless": false
},
"captcha": {
"api_key_env": "NOPECHA_API_KEY"
},
"graders": ["agisdk_state_diff"],
"timeout_ms": 1800000
}


@@ -2,9 +2,9 @@
"agent": {
"type": "single",
"provider": "openai-compatible",
"model": "accounts/fireworks/models/kimi-k2p5",
"apiKey": "FIREWORKS_API_KEY",
"baseUrl": "https://api.fireworks.ai/inference/v1",
"model": "moonshotai/kimi-k2.5",
"apiKey": "OPENROUTER_API_KEY",
"baseUrl": "https://openrouter.ai/api/v1",
"supportsImages": true
},
"dataset": "../data/webbench-2of4-50.jsonl",
@@ -22,8 +22,5 @@
"api_key_env": "NOPECHA_API_KEY"
},
"graders": ["performance_grader"],
"grader_api_key_env": "OPENROUTER_API_KEY",
"grader_base_url": "https://openrouter.ai/api/v1",
"grader_model": "openai/gpt-4.1",
"timeout_ms": 1800000
}


@@ -29,8 +29,5 @@
"api_key_env": "NOPECHA_API_KEY"
},
"graders": ["performance_grader"],
"grader_api_key_env": "OPENROUTER_API_KEY",
"grader_base_url": "https://openrouter.ai/api/v1",
"grader_model": "openai/gpt-4.1",
"timeout_ms": 1800000
}


@@ -29,8 +29,5 @@
"api_key_env": "NOPECHA_API_KEY"
},
"graders": ["performance_grader"],
"grader_api_key_env": "OPENROUTER_API_KEY",
"grader_base_url": "https://openrouter.ai/api/v1",
"grader_model": "openai/gpt-4.1",
"timeout_ms": 1800000
}


@@ -0,0 +1,26 @@
{
"agent": {
"type": "single",
"provider": "openai-compatible",
"model": "moonshotai/kimi-k2.5",
"apiKey": "OPENROUTER_API_KEY",
"baseUrl": "https://openrouter.ai/api/v1",
"supportsImages": true
},
"dataset": "../data/webarena-infinity-hard-50.jsonl",
"num_workers": 10,
"restart_server_per_task": true,
"browseros": {
"server_url": "http://127.0.0.1:9110",
"base_cdp_port": 9010,
"base_server_port": 9110,
"base_extension_port": 9310,
"load_extensions": false,
"headless": false
},
"captcha": {
"api_key_env": "NOPECHA_API_KEY"
},
"graders": ["infinity_state"],
"timeout_ms": 1800000
}


@@ -20,8 +20,5 @@
"api_key_env": "NOPECHA_API_KEY"
},
"graders": ["performance_grader"],
"grader_api_key_env": "OPENROUTER_API_KEY",
"grader_base_url": "https://openrouter.ai/api/v1",
"grader_model": "openai/gpt-4.1",
"timeout_ms": 300000
}


@@ -22,8 +22,5 @@
"api_key_env": "NOPECHA_API_KEY"
},
"graders": ["performance_grader"],
"grader_api_key_env": "OPENROUTER_API_KEY",
"grader_base_url": "https://openrouter.ai/api/v1",
"grader_model": "openai/gpt-4.1",
"timeout_ms": 1200000
}


@@ -1,30 +0,0 @@
{
"agent": {
"type": "gemini-computer-use",
"apiKey": "GOOGLE_AI_API_KEY",
"screenSize": {
"width": 1440,
"height": 900
},
"turnLimit": 100
},
"dataset": "../data/test-set.jsonl",
"num_workers": 1,
"restart_server_per_task": true,
"browseros": {
"server_url": "http://127.0.0.1:9110",
"base_cdp_port": 9010,
"base_server_port": 9110,
"base_extension_port": 9310,
"load_extensions": false,
"headless": false
},
"captcha": {
"api_key_env": "NOPECHA_API_KEY"
},
"graders": ["performance_grader"],
"grader_api_key_env": "OPENROUTER_API_KEY",
"grader_base_url": "https://openrouter.ai/api/v1",
"grader_model": "openai/gpt-4.1",
"timeout_ms": 1200000
}


@@ -1,30 +0,0 @@
{
"agent": {
"type": "yutori-navigator",
"apiKey": "YUTORI_API_KEY",
"screenSize": {
"width": 1280,
"height": 800
},
"turnLimit": 100
},
"dataset": "../data/test-set.jsonl",
"num_workers": 1,
"restart_server_per_task": true,
"browseros": {
"server_url": "http://127.0.0.1:9110",
"base_cdp_port": 9010,
"base_server_port": 9110,
"base_extension_port": 9310,
"load_extensions": false,
"headless": false
},
"captcha": {
"api_key_env": "NOPECHA_API_KEY"
},
"graders": ["performance_grader"],
"grader_api_key_env": "OPENROUTER_API_KEY",
"grader_base_url": "https://openrouter.ai/api/v1",
"grader_model": "openai/gpt-4.1",
"timeout_ms": 1200000
}


@@ -0,0 +1,40 @@
{"query_id": "agisdk-dashdish-10", "dataset": "agisdk-real", "query": "Place an order from \"Souvla\" for a \"Medium Classic Cheeseburger\" and a \"Small Bacon Double Cheeseburger\" with \"Standard Delivery\" as the method with the default charged options.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-dashdish.vercel.app", "metadata": {"original_task_id": "dashdish-10", "website": "DashDish", "category": "agisdk-real", "additional": {"agisdk_task_id": "dashdish-10", "challenge_type": "action", "difficulty": "hard", "similar_to": "Doordash"}}}
{"query_id": "agisdk-fly-unified-5", "dataset": "agisdk-real", "query": "Find me the cheapest fare for a flight from Orlando to Milwaukee on December 5th, 2024 and book it.\nPassenger: John Doe\nDate of Birth: 01/01/1990\nSex: Male\nSeat Selection: No\nPayment: Credit Card (378342143523967), Exp: 12/30, Security Code: 420 Address: 123 Main St, San Francisco, CA, 94105, USA, Phone: 555-123-4567, Email: johndoe@example.com.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-fly-unified.vercel.app", "metadata": {"original_task_id": "fly-unified-5", "website": "Fly Unified", "category": "agisdk-real", "additional": {"agisdk_task_id": "fly-unified-5", "challenge_type": "retrieval-action", "difficulty": "medium", "similar_to": "United Airlines"}}}
{"query_id": "agisdk-udriver-10", "dataset": "agisdk-real", "query": "Order me a ride for 4pm, I'll be at the de Young muesum headed to the Waterbar, fanciest option possible please.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-udriver.vercel.app", "metadata": {"original_task_id": "udriver-10", "website": "UDriver", "category": "agisdk-real", "additional": {"agisdk_task_id": "udriver-10", "challenge_type": "action", "difficulty": "hard", "similar_to": "Uber"}}}
{"query_id": "agisdk-udriver-9", "dataset": "agisdk-real", "query": "Book me a ride from the thai restaurant I last took a ride to for later today at 2pm, I'll be at 333 Apartments on Fremont", "graders": ["agisdk_state_diff"], "start_url": "https://evals-udriver.vercel.app", "metadata": {"original_task_id": "udriver-9", "website": "UDriver", "category": "agisdk-real", "additional": {"agisdk_task_id": "udriver-9", "challenge_type": "retrieval-action", "difficulty": "hard", "similar_to": "Uber"}}}
{"query_id": "agisdk-topwork-4", "dataset": "agisdk-real", "query": "Create a job post for a UI/UX Designer with expertise in Figma, Sketch, and Adobe Creative Suite, including project details, timeline, and required skills (Wireframing, Prototyping, Responsive Design).", "graders": ["agisdk_state_diff"], "start_url": "https://evals-topwork.vercel.app", "metadata": {"original_task_id": "topwork-4", "website": "TopWork", "category": "agisdk-real", "additional": {"agisdk_task_id": "topwork-4", "challenge_type": "action", "difficulty": "medium", "similar_to": "Upwork"}}}
{"query_id": "agisdk-gocalendar-4", "dataset": "agisdk-real", "query": "Change the \"Team Check-In\" event on July 18, 2024, name to \"Project Kickoff\" and update the location to \"Zoom\"", "graders": ["agisdk_state_diff"], "start_url": "https://evals-gocalendar.vercel.app", "metadata": {"original_task_id": "gocalendar-4", "website": "GoCalendar", "category": "agisdk-real", "additional": {"agisdk_task_id": "gocalendar-4", "challenge_type": "action", "difficulty": "medium", "similar_to": "Google Calendar"}}}
{"query_id": "agisdk-staynb-6", "dataset": "agisdk-real", "query": "Find and book the stay with the best value for money (cheapest stay with the best reviews) for 1 day. For fields you don't know the answer for, just fill them in with anything of your choice.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-staynb.vercel.app", "metadata": {"original_task_id": "staynb-6", "website": "StayNB", "category": "agisdk-real", "additional": {"agisdk_task_id": "staynb-6", "challenge_type": "retrieval-action", "difficulty": "medium", "similar_to": "Airbnb"}}}
{"query_id": "agisdk-udriver-11", "dataset": "agisdk-real", "query": "I need to go from Pacific Catch on Chestnut back home to 333 Fremont now. If the fancy version is within ten dollars of the regular one, book that.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-udriver.vercel.app", "metadata": {"original_task_id": "udriver-11", "website": "UDriver", "category": "agisdk-real", "additional": {"agisdk_task_id": "udriver-11", "challenge_type": "action", "difficulty": "hard", "similar_to": "Uber"}}}
{"query_id": "agisdk-networkin-5", "dataset": "agisdk-real", "query": "Send a connection request to John Smith.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-networkin.vercel.app", "metadata": {"original_task_id": "networkin-5", "website": "Networkin", "category": "agisdk-real", "additional": {"agisdk_task_id": "networkin-5", "challenge_type": "action", "difficulty": "easy", "similar_to": "LinkedIn"}}}
{"query_id": "agisdk-zilloft-6", "dataset": "agisdk-real", "query": "Select a property listed in San Francisco as \"Condos\" within a price range under $300,000 and request a tour for tomorrow at 4:00 PM. Use these contact details: Name: Sarah Brown, Email: sarahbrown@example.com, Phone: 555-987-6543.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-zilloft.vercel.app", "metadata": {"original_task_id": "zilloft-6", "website": "Zilloft", "category": "agisdk-real", "additional": {"agisdk_task_id": "zilloft-6", "challenge_type": "action", "difficulty": "medium", "similar_to": "Zillow"}}}
{"query_id": "agisdk-topwork-2", "dataset": "agisdk-real", "query": "Create a job posting for a Backend Developer specializing in Python, Django, and Flask to develop a high-performance web application. Include project details such as required skills (PostgreSQL, Docker, AWS, CI/CD), estimated project timeline, and budget.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-topwork.vercel.app", "metadata": {"original_task_id": "topwork-2", "website": "TopWork", "category": "agisdk-real", "additional": {"agisdk_task_id": "topwork-2", "challenge_type": "action", "difficulty": "medium", "similar_to": "Upwork"}}}
{"query_id": "agisdk-gocalendar-3", "dataset": "agisdk-real", "query": "Delete the event titled \"Breakfast Meeting with Client\" scheduled for July 19, 2024", "graders": ["agisdk_state_diff"], "start_url": "https://evals-gocalendar.vercel.app", "metadata": {"original_task_id": "gocalendar-3", "website": "GoCalendar", "category": "agisdk-real", "additional": {"agisdk_task_id": "gocalendar-3", "challenge_type": "action", "difficulty": "easy", "similar_to": "Google Calendar"}}}
{"query_id": "agisdk-topwork-3", "dataset": "agisdk-real", "query": "Create a job listing for a Full-Stack Developer with expertise in Java, Spring Boot, and Angular, outlining the project scope, estimated duration, and required skills (MySQL, Docker, Kubernetes, and Jenkins). The ideal candidate should have experience in enterprise-level applications and building scalable microservices. After creating the job post, please describe what you included in the job listing.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-topwork.vercel.app", "metadata": {"original_task_id": "topwork-3", "website": "TopWork", "category": "agisdk-real", "additional": {"agisdk_task_id": "topwork-3", "challenge_type": "retrieval", "difficulty": "medium", "similar_to": "Upwork"}}}
{"query_id": "agisdk-dashdish-7", "dataset": "agisdk-real", "query": "Select \"Express Delivery\" for an order from \"DragonEats\" of \"Mushroom Swiss Burger\" and complete the checkout with the pre-loaded Visa card.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-dashdish.vercel.app", "metadata": {"original_task_id": "dashdish-7", "website": "DashDish", "category": "agisdk-real", "additional": {"agisdk_task_id": "dashdish-7", "challenge_type": "action", "difficulty": "hard", "similar_to": "Doordash"}}}
{"query_id": "agisdk-networkin-3", "dataset": "agisdk-real", "query": "Write a post inviting users to a networking event, including details about the event's purpose, date, and target audience.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-networkin.vercel.app", "metadata": {"original_task_id": "networkin-3", "website": "Networkin", "category": "agisdk-real", "additional": {"agisdk_task_id": "networkin-3", "challenge_type": "action", "difficulty": "medium", "similar_to": "LinkedIn"}}}
{"query_id": "agisdk-gomail-7", "dataset": "agisdk-real", "query": "Delete the email with the subject \"New Leadership Articles You Can't Miss\" from the Inbox.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-gomail.vercel.app", "metadata": {"original_task_id": "gomail-7", "website": "GoMail", "category": "agisdk-real", "additional": {"agisdk_task_id": "gomail-7", "challenge_type": "retrieval-action", "difficulty": "hard", "similar_to": "Gmail"}}}
{"query_id": "agisdk-opendining-8", "dataset": "agisdk-real", "query": "Identify and book the restaurant with the lowest rating. For fields you don't know the answer for, just fill them in with anything of your choice.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-opendining.vercel.app", "metadata": {"original_task_id": "opendining-8", "website": "OpenDining", "category": "agisdk-real", "additional": {"agisdk_task_id": "opendining-8", "challenge_type": "retrieval-action", "difficulty": "easy", "similar_to": "OpenTable"}}}
{"query_id": "agisdk-udriver-1", "dataset": "agisdk-real", "query": "Book a ride from Fitness Urbano to Pacific Cafe", "graders": ["agisdk_state_diff"], "start_url": "https://evals-udriver.vercel.app", "metadata": {"original_task_id": "udriver-1", "website": "UDriver", "category": "agisdk-real", "additional": {"agisdk_task_id": "udriver-1", "challenge_type": "action", "difficulty": "easy", "similar_to": "Uber"}}}
{"query_id": "agisdk-staynb-2", "dataset": "agisdk-real", "query": "Click on one of the stays displayed on the homepage and book it for a family of 4 (2 adults and 2 children). For fields you don't know the answer for, just fill them in with anything of your choice.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-staynb.vercel.app", "metadata": {"original_task_id": "staynb-2", "website": "StayNB", "category": "agisdk-real", "additional": {"agisdk_task_id": "staynb-2", "challenge_type": "action", "difficulty": "easy", "similar_to": "Airbnb"}}}
{"query_id": "agisdk-opendining-10", "dataset": "agisdk-real", "query": "Check the menus of all restaurants for vegetarian options and make a reservation at the one with the most vegetarian choices. For fields you don't know the answer for, just fill them in with anything of your choice.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-opendining.vercel.app", "metadata": {"original_task_id": "opendining-10", "website": "OpenDining", "category": "agisdk-real", "additional": {"agisdk_task_id": "opendining-10", "challenge_type": "retrieval-action", "difficulty": "medium", "similar_to": "OpenTable"}}}
{"query_id": "agisdk-opendining-4", "dataset": "agisdk-real", "query": "Use the search bar to search for a restaurant on September 2nd at 4:30 PM for 7 people, using \"Japanese\" as the search term, and book the first result. For fields you don't know the answer for, just fill them in with anything of your choice.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-opendining.vercel.app", "metadata": {"original_task_id": "opendining-4", "website": "OpenDining", "category": "agisdk-real", "additional": {"agisdk_task_id": "opendining-4", "challenge_type": "action", "difficulty": "hard", "similar_to": "OpenTable"}}}
{"query_id": "agisdk-dashdish-4", "dataset": "agisdk-real", "query": "Schedule a delivery order from \"Taco Bell\" adding a \"Classic Cheeseburger\" large size for later and add the note \"Leave at the front door\".", "graders": ["agisdk_state_diff"], "start_url": "https://evals-dashdish.vercel.app", "metadata": {"original_task_id": "dashdish-4", "website": "DashDish", "category": "agisdk-real", "additional": {"agisdk_task_id": "dashdish-4", "challenge_type": "action", "difficulty": "medium", "similar_to": "Doordash"}}}
{"query_id": "agisdk-networkin-1", "dataset": "agisdk-real", "query": "Create a new text post for the feed with a professional update about AI trends in 2025, mentioning three key advancements and their impact on the job market.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-networkin.vercel.app", "metadata": {"original_task_id": "networkin-1", "website": "Networkin", "category": "agisdk-real", "additional": {"agisdk_task_id": "networkin-1", "challenge_type": "action", "difficulty": "medium", "similar_to": "LinkedIn"}}}
{"query_id": "agisdk-dashdish-5", "dataset": "agisdk-real", "query": "Add three \"Loaded Bacon Cheese Fries\" to the shopping cart from \"Man vs. Fries\". Proceed to checkout and select \"Pickup\" as the delivery method.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-dashdish.vercel.app", "metadata": {"original_task_id": "dashdish-5", "website": "DashDish", "category": "agisdk-real", "additional": {"agisdk_task_id": "dashdish-5", "challenge_type": "retrieval-action", "difficulty": "medium", "similar_to": "Doordash"}}}
{"query_id": "agisdk-opendining-5", "dataset": "agisdk-real", "query": "Scroll through the homepage carousel until \"Ocean Breeze\" is visible, select the second available time slot, and complete the reservation. For fields you don't know the answer for, just fill them in with anything of your choice.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-opendining.vercel.app", "metadata": {"original_task_id": "opendining-5", "website": "OpenDining", "category": "agisdk-real", "additional": {"agisdk_task_id": "opendining-5", "challenge_type": "action", "difficulty": "medium", "similar_to": "OpenTable"}}}
{"query_id": "agisdk-gocalendar-1", "dataset": "agisdk-real", "query": "Create a new event titled \"Team Meeting\" on July 19, 2024, from 2 PM to 2:30 PM, and include \"Conference Room A\" as the location", "graders": ["agisdk_state_diff"], "start_url": "https://evals-gocalendar.vercel.app", "metadata": {"original_task_id": "gocalendar-1", "website": "GoCalendar", "category": "agisdk-real", "additional": {"agisdk_task_id": "gocalendar-1", "challenge_type": "action", "difficulty": "medium", "similar_to": "Google Calendar"}}}
{"query_id": "agisdk-gomail-5", "dataset": "agisdk-real", "query": "Schedule an email to jane.doe@example.com with the subject \"Weekly Update\" to be sent next Monday at 9:00 AM.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-gomail.vercel.app", "metadata": {"original_task_id": "gomail-5", "website": "GoMail", "category": "agisdk-real", "additional": {"agisdk_task_id": "gomail-5", "challenge_type": "retrieval-action", "difficulty": "medium", "similar_to": "Gmail"}}}
{"query_id": "agisdk-staynb-4", "dataset": "agisdk-real", "query": "Book a stay for 2 children with 1 adult. For fields you don't know the answer for, just fill them in with anything of your choice.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-staynb.vercel.app", "metadata": {"original_task_id": "staynb-4", "website": "StayNB", "category": "agisdk-real", "additional": {"agisdk_task_id": "staynb-4", "challenge_type": "action", "difficulty": "medium", "similar_to": "Airbnb"}}}
{"query_id": "agisdk-dashdish-2", "dataset": "agisdk-real", "query": "Add a \"Medium Pepperoni Pizza\" from the restaurant \"Papa Johns Pizza\" to the shopping cart and purchase it.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-dashdish.vercel.app", "metadata": {"original_task_id": "dashdish-2", "website": "DashDish", "category": "agisdk-real", "additional": {"agisdk_task_id": "dashdish-2", "challenge_type": "action", "difficulty": "easy", "similar_to": "Doordash"}}}
{"query_id": "agisdk-staynb-8", "dataset": "agisdk-real", "query": "Scroll through the homepage and book the last stay located in Paris.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-staynb.vercel.app", "metadata": {"original_task_id": "staynb-8", "website": "StayNB", "category": "agisdk-real", "additional": {"agisdk_task_id": "staynb-8", "challenge_type": "retrieval-action", "difficulty": "medium", "similar_to": "Airbnb"}}}
{"query_id": "agisdk-gomail-2", "dataset": "agisdk-real", "query": "Mark the first email in the Inbox as \"read\".", "graders": ["agisdk_state_diff"], "start_url": "https://evals-gomail.vercel.app", "metadata": {"original_task_id": "gomail-2", "website": "GoMail", "category": "agisdk-real", "additional": {"agisdk_task_id": "gomail-2", "challenge_type": "action", "difficulty": "easy", "similar_to": "Gmail"}}}
{"query_id": "agisdk-networkin-10", "dataset": "agisdk-real", "query": "Generate a polite follow-up message for a previous unanswered chat, starting with \"Following up on\".", "graders": ["agisdk_state_diff"], "start_url": "https://evals-networkin.vercel.app", "metadata": {"original_task_id": "networkin-10", "website": "Networkin", "category": "agisdk-real", "additional": {"agisdk_task_id": "networkin-10", "challenge_type": "action", "difficulty": "medium", "similar_to": "LinkedIn"}}}
{"query_id": "agisdk-gomail-3", "dataset": "agisdk-real", "query": "Compose a new email to jonathan.smith@example.com with the subject \"Meeting Notes\" and body \"Please find the meeting notes attached.\"", "graders": ["agisdk_state_diff"], "start_url": "https://evals-gomail.vercel.app", "metadata": {"original_task_id": "gomail-3", "website": "GoMail", "category": "agisdk-real", "additional": {"agisdk_task_id": "gomail-3", "challenge_type": "action", "difficulty": "easy", "similar_to": "Gmail"}}}
{"query_id": "agisdk-udriver-6", "dataset": "agisdk-real", "query": "Me and 4 friends need a ride from the Palace Hotel to dinner at Osha Thai leaving now", "graders": ["agisdk_state_diff"], "start_url": "https://evals-udriver.vercel.app", "metadata": {"original_task_id": "udriver-6", "website": "UDriver", "category": "agisdk-real", "additional": {"agisdk_task_id": "udriver-6", "challenge_type": "action", "difficulty": "hard", "similar_to": "Uber"}}}
{"query_id": "agisdk-staynb-9", "dataset": "agisdk-real", "query": "Book a stay with the maximum number of guests supported. For fields you don't know the answer for, just fill them in with anything of your choice.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-staynb.vercel.app", "metadata": {"original_task_id": "staynb-9", "website": "StayNB", "category": "agisdk-real", "additional": {"agisdk_task_id": "staynb-9", "challenge_type": "action", "difficulty": "hard", "similar_to": "Airbnb"}}}
{"query_id": "agisdk-zilloft-3", "dataset": "agisdk-real", "query": "Find a home in San Diego priced under $150,000 with at least 2 bedrooms and request a tour. Use these details: Contact Name: John Doe, Email: johndoe@example.com, Phone: 555-123-4567, Tour Time: 2:00 PM, Tour Date: First available.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-zilloft.vercel.app", "metadata": {"original_task_id": "zilloft-3", "website": "Zilloft", "category": "agisdk-real", "additional": {"agisdk_task_id": "zilloft-3", "challenge_type": "retrieval-action", "difficulty": "easy", "similar_to": "Zillow"}}}
{"query_id": "agisdk-fly-unified-6", "dataset": "agisdk-real", "query": "Reserve me a seat for the flight from Austin to Pittsburgh departing on December 11th, 2024 at 8:00 in Basic Economy.\nPassenger: Alice Brown\nDate of Birth: 05/20/1992\nSex: Female\nSeat Selection: Yes (Aisle seat)\nPayment: Credit Card (378342143523967), Exp: 09/27, security code: 332 Address: 789 Pine St, Los Angeles, CA, 90012, USA, Phone: 555-456-7890, Email: alicebrown@example.com.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-fly-unified.vercel.app", "metadata": {"original_task_id": "fly-unified-6", "website": "Fly Unified", "category": "agisdk-real", "additional": {"agisdk_task_id": "fly-unified-6", "challenge_type": "action", "difficulty": "medium", "similar_to": "United Airlines"}}}
{"query_id": "agisdk-opendining-3", "dataset": "agisdk-real", "query": "Book a table at \"The Royal Dine\" for a party of 4 on July 20, 2024, at 7 PM. For fields you don't know the answer for, just fill them in with anything of your choice.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-opendining.vercel.app", "metadata": {"original_task_id": "opendining-3", "website": "OpenDining", "category": "agisdk-real", "additional": {"agisdk_task_id": "opendining-3", "challenge_type": "action", "difficulty": "easy", "similar_to": "OpenTable"}}}
{"query_id": "agisdk-gocalendar-7", "dataset": "agisdk-real", "query": "Reschedule the \"Morning Coffee with sister\" event from July 18, 2024, at 9 AM to July 19, 2024, at 10AM using drag-and-drop functionality", "graders": ["agisdk_state_diff"], "start_url": "https://evals-gocalendar.vercel.app", "metadata": {"original_task_id": "gocalendar-7", "website": "GoCalendar", "category": "agisdk-real", "additional": {"agisdk_task_id": "gocalendar-7", "challenge_type": "action", "difficulty": "medium", "similar_to": "Google Calendar"}}}
{"query_id": "agisdk-staynb-5", "dataset": "agisdk-real", "query": "Use the search bar to look for a stay. For the \"Where\" section, use the \"Search by region\" popover and select \"Europe\". Set the check-in date to October 13th and the check-out date to October 23rd. For the \"Who\" section, select 1 infant, 2 children, and 2 adults. Press the search button, select the first stay, and book it.", "graders": ["agisdk_state_diff"], "start_url": "https://evals-staynb.vercel.app", "metadata": {"original_task_id": "staynb-5", "website": "StayNB", "category": "agisdk-real", "additional": {"agisdk_task_id": "staynb-5", "challenge_type": "action", "difficulty": "medium", "similar_to": "Airbnb"}}}

@@ -1,5 +0,0 @@
{"query_id": "CoordClick--1", "dataset": "coordinate-click", "query": "Click on circle A located at the top-left corner of the page.", "graders": ["webvoyager_grader"], "start_url": "http://localhost:3100", "metadata": {"original_task_id": "CoordClick--1", "website": "eval-target", "category": "coordinate-prediction", "additional": {"ground_truth": "Circle A is clicked and shows data-clicked=true", "answer_type": "golden"}}}
{"query_id": "CoordClick--2", "dataset": "coordinate-click", "query": "Click on circle B located at the top-right corner of the page.", "graders": ["webvoyager_grader"], "start_url": "http://localhost:3100", "metadata": {"original_task_id": "CoordClick--2", "website": "eval-target", "category": "coordinate-prediction", "additional": {"ground_truth": "Circle B is clicked and shows data-clicked=true", "answer_type": "golden"}}}
{"query_id": "CoordClick--3", "dataset": "coordinate-click", "query": "Click on circle C located at the bottom-left corner of the page.", "graders": ["webvoyager_grader"], "start_url": "http://localhost:3100", "metadata": {"original_task_id": "CoordClick--3", "website": "eval-target", "category": "coordinate-prediction", "additional": {"ground_truth": "Circle C is clicked and shows data-clicked=true", "answer_type": "golden"}}}
{"query_id": "CoordClick--4", "dataset": "coordinate-click", "query": "Click on circle D located at the bottom-right corner of the page.", "graders": ["webvoyager_grader"], "start_url": "http://localhost:3100", "metadata": {"original_task_id": "CoordClick--4", "website": "eval-target", "category": "coordinate-prediction", "additional": {"ground_truth": "Circle D is clicked and shows data-clicked=true", "answer_type": "golden"}}}
{"query_id": "CoordClick--5", "dataset": "coordinate-click", "query": "Click on all four circles A, B, C, and D on the page.", "graders": ["webvoyager_grader"], "start_url": "http://localhost:3100", "metadata": {"original_task_id": "CoordClick--5", "website": "eval-target", "category": "coordinate-prediction", "additional": {"ground_truth": "All four circles are clicked and page shows ALL TARGETS HIT", "answer_type": "golden"}}}

@@ -0,0 +1,50 @@
{"query_id": "infinity-elation-prescriptions-task_h69", "dataset": "webarena-infinity", "query": "Approve all pending refill requests except for any medication that is involved in a major drug-drug interaction with another of the patient's active medications. Deny those with the reason 'Drug interaction \u2014 needs provider review before renewal'.", "graders": ["infinity_state"], "start_url": "http://localhost:8020", "metadata": {"original_task_id": "elation-prescriptions-task_h69", "website": "elation-prescriptions", "category": "webarena-infinity", "additional": {"app_name": "elation-prescriptions", "difficulty": "hard", "verifier_path": "real-tasks/task_h69.py", "app_base_port": 8020}}}
{"query_id": "infinity-elation-clinical-records-task_h52", "dataset": "webarena-infinity", "query": "Add the document tag 'Provider-Reviewed' to every visit note template that was created by the current logged-in provider. Do not modify templates created by other providers.", "graders": ["infinity_state"], "start_url": "http://localhost:8000", "metadata": {"original_task_id": "elation-clinical-records-task_h52", "website": "elation-clinical-records", "category": "webarena-infinity", "additional": {"app_name": "elation-clinical-records", "difficulty": "hard", "verifier_path": "real-tasks/task_h52.py", "app_base_port": 8000}}}
{"query_id": "infinity-gmail-accounts-and-contacts-task_h44", "dataset": "webarena-infinity", "query": "Your sister's husband is one of your contacts. Find him, star his entry, and add the Friends label.", "graders": ["infinity_state"], "start_url": "http://localhost:8070", "metadata": {"original_task_id": "gmail-accounts-and-contacts-task_h44", "website": "gmail-accounts-and-contacts", "category": "webarena-infinity", "additional": {"app_name": "gmail-accounts-and-contacts", "difficulty": "hard", "verifier_path": "real-tasks/task_h44.py", "app_base_port": 8070}}}
{"query_id": "infinity-gmail-task_h2", "dataset": "webarena-infinity", "query": "Update the Datadog alerts filter to also archive matching emails and forward them to priya.sharma@cloudnine.dev instead of nate.patel@devops.tools.", "graders": ["infinity_state"], "start_url": "http://localhost:8060", "metadata": {"original_task_id": "gmail-task_h2", "website": "gmail", "category": "webarena-infinity", "additional": {"app_name": "gmail", "difficulty": "hard", "verifier_path": "real-tasks/task_h2.py", "app_base_port": 8060}}}
{"query_id": "infinity-gitlab-plan-and-track-task_h58", "dataset": "webarena-infinity", "query": "The Performance Initiative epic has two child epics. For the child epic with more open issues, set the weight of every issue in it to 13. For the other child epic, close all its open issues.", "graders": ["infinity_state"], "start_url": "http://localhost:8050", "metadata": {"original_task_id": "gitlab-plan-and-track-task_h58", "website": "gitlab-plan-and-track", "category": "webarena-infinity", "additional": {"app_name": "gitlab-plan-and-track", "difficulty": "hard", "verifier_path": "real-tasks/task_h58.py", "app_base_port": 8050}}}
{"query_id": "infinity-figma-slides-task_h46", "dataset": "webarena-infinity", "query": "There are two slides with tables in the deck. Lock the table that compares competitors, and change the font size to 16 on the table that tracks quarterly feature adoption.", "graders": ["infinity_state"], "start_url": "http://localhost:8030", "metadata": {"original_task_id": "figma-slides-task_h46", "website": "figma-slides", "category": "webarena-infinity", "additional": {"app_name": "figma-slides", "difficulty": "hard", "verifier_path": "real-tasks/task_h46.py", "app_base_port": 8030}}}
{"query_id": "infinity-elation-prescriptions-task_h50", "dataset": "webarena-infinity", "query": "Deny the pending refill for the patient's cholesterol medication because his lipid panel is overdue. Then deny the Lisinopril refill as well \u2014 he needs a follow-up blood pressure check first.", "graders": ["infinity_state"], "start_url": "http://localhost:8020", "metadata": {"original_task_id": "elation-prescriptions-task_h50", "website": "elation-prescriptions", "category": "webarena-infinity", "additional": {"app_name": "elation-prescriptions", "difficulty": "hard", "verifier_path": "real-tasks/task_h50.py", "app_base_port": 8020}}}
{"query_id": "infinity-elation-prescriptions-task_h19", "dataset": "webarena-infinity", "query": "Discontinue the Omeprazole and prescribe Famotidine 20mg tablet twice daily as a replacement for GERD \u2014 qty 60, 3 refills, send to CVS #4521.", "graders": ["infinity_state"], "start_url": "http://localhost:8020", "metadata": {"original_task_id": "elation-prescriptions-task_h19", "website": "elation-prescriptions", "category": "webarena-infinity", "additional": {"app_name": "elation-prescriptions", "difficulty": "hard", "verifier_path": "real-tasks/task_h19.py", "app_base_port": 8020}}}
{"query_id": "infinity-paypal-my-wallet-task_h25", "dataset": "webarena-infinity", "query": "Convert all of my Australian dollars to euros.", "graders": ["infinity_state"], "start_url": "http://localhost:8100", "metadata": {"original_task_id": "paypal-my-wallet-task_h25", "website": "paypal-my-wallet", "category": "webarena-infinity", "additional": {"app_name": "paypal-my-wallet", "difficulty": "hard", "verifier_path": "real-tasks/task_h25.py", "app_base_port": 8100}}}
{"query_id": "infinity-elation-clinical-records-task_h66", "dataset": "webarena-infinity", "query": "Create a new template called 'Anxiety Management' with HPI and Assessment sections, and billing code 99213 with description 'Office visit, established, low complexity'. Then create a visit note for Emily Nakamura using that new template and the Telehealth category, add a Psychological Status block to the note, and sign it.", "graders": ["infinity_state"], "start_url": "http://localhost:8000", "metadata": {"original_task_id": "elation-clinical-records-task_h66", "website": "elation-clinical-records", "category": "webarena-infinity", "additional": {"app_name": "elation-clinical-records", "difficulty": "hard", "verifier_path": "real-tasks/task_h66.py", "app_base_port": 8000}}}
{"query_id": "infinity-elation-clinical-records-task_h62", "dataset": "webarena-infinity", "query": "Look up which template is assigned to the COVID Vaccine appointment type. Remove all its existing document tags and replace them with the single tag 'COVID-Protocol'. Then also assign that same template to the Urgent Same-Day appointment type.", "graders": ["infinity_state"], "start_url": "http://localhost:8000", "metadata": {"original_task_id": "elation-clinical-records-task_h62", "website": "elation-clinical-records", "category": "webarena-infinity", "additional": {"app_name": "elation-clinical-records", "difficulty": "hard", "verifier_path": "real-tasks/task_h62.py", "app_base_port": 8000}}}
{"query_id": "infinity-elation-prescriptions-task_h32", "dataset": "webarena-infinity", "query": "The patient has a medication that's being dispensed as written (brand name only). Discontinue that prescription and replace it with a new one \u2014 same medication, same sig, same pharmacy \u2014 but allow generic substitution this time. Qty 30, 3 refills, 30 days supply.", "graders": ["infinity_state"], "start_url": "http://localhost:8020", "metadata": {"original_task_id": "elation-prescriptions-task_h32", "website": "elation-prescriptions", "category": "webarena-infinity", "additional": {"app_name": "elation-prescriptions", "difficulty": "hard", "verifier_path": "real-tasks/task_h32.py", "app_base_port": 8020}}}
{"query_id": "infinity-gitlab-plan-and-track-task_h48", "dataset": "webarena-infinity", "query": "Add the 'breaking-change' label to every open issue in the API v3 Migration epic and remove any existing workflow-scoped labels from those issues.", "graders": ["infinity_state"], "start_url": "http://localhost:8050", "metadata": {"original_task_id": "gitlab-plan-and-track-task_h48", "website": "gitlab-plan-and-track", "category": "webarena-infinity", "additional": {"app_name": "gitlab-plan-and-track", "difficulty": "hard", "verifier_path": "real-tasks/task_h48.py", "app_base_port": 8050}}}
{"query_id": "infinity-gitlab-plan-and-track-task_h77", "dataset": "webarena-infinity", "query": "Rename the 'UX' label to 'user-experience', change its type to 'group', and then add it to every open issue in the Frontend Modernization epic that doesn't already have it.", "graders": ["infinity_state"], "start_url": "http://localhost:8050", "metadata": {"original_task_id": "gitlab-plan-and-track-task_h77", "website": "gitlab-plan-and-track", "category": "webarena-infinity", "additional": {"app_name": "gitlab-plan-and-track", "difficulty": "hard", "verifier_path": "real-tasks/task_h77.py", "app_base_port": 8050}}}
{"query_id": "infinity-xero-invoicing-task_h15", "dataset": "webarena-infinity", "query": "Create a new invoice for Summit Health Group for an annual software license and 12 months of support with a 10% discount on support.", "graders": ["infinity_state"], "start_url": "http://localhost:8120", "metadata": {"original_task_id": "xero-invoicing-task_h15", "website": "xero-invoicing", "category": "webarena-infinity", "additional": {"app_name": "xero-invoicing", "difficulty": "hard", "verifier_path": "real-tasks/task_h15.py", "app_base_port": 8120}}}
{"query_id": "infinity-elation-clinical-records-task_h55", "dataset": "webarena-infinity", "query": "Resolve every problem across all patients in the system that currently has a status of Controlled.", "graders": ["infinity_state"], "start_url": "http://localhost:8000", "metadata": {"original_task_id": "elation-clinical-records-task_h55", "website": "elation-clinical-records", "category": "webarena-infinity", "additional": {"app_name": "elation-clinical-records", "difficulty": "hard", "verifier_path": "real-tasks/task_h55.py", "app_base_port": 8000}}}
{"query_id": "infinity-gitlab-plan-and-track-task_h8", "dataset": "webarena-infinity", "query": "Create a confidential issue titled 'Emergency security patch' with priority::critical and the 'security' label, assigned to James O'Brien and Oliver Schmidt, with weight 2 in the Security Hardening milestone.", "graders": ["infinity_state"], "start_url": "http://localhost:8050", "metadata": {"original_task_id": "gitlab-plan-and-track-task_h8", "website": "gitlab-plan-and-track", "category": "webarena-infinity", "additional": {"app_name": "gitlab-plan-and-track", "difficulty": "hard", "verifier_path": "real-tasks/task_h8.py", "app_base_port": 8050}}}
{"query_id": "infinity-paypal-my-wallet-task_h20", "dataset": "webarena-infinity", "query": "Make a $200 payment on PayPal Credit and change autopay to pay the full balance.", "graders": ["infinity_state"], "start_url": "http://localhost:8100", "metadata": {"original_task_id": "paypal-my-wallet-task_h20", "website": "paypal-my-wallet", "category": "webarena-infinity", "additional": {"app_name": "paypal-my-wallet", "difficulty": "hard", "verifier_path": "real-tasks/task_h20.py", "app_base_port": 8100}}}
{"query_id": "infinity-gitlab-plan-and-track-task_h52", "dataset": "webarena-infinity", "query": "Create a new board called 'Performance Tracker' with lists for the priority::critical, priority::high, and priority::medium labels. Then add the 'priority::high' label to every open issue in the v4.1 milestone that has the 'performance' label.", "graders": ["infinity_state"], "start_url": "http://localhost:8050", "metadata": {"original_task_id": "gitlab-plan-and-track-task_h52", "website": "gitlab-plan-and-track", "category": "webarena-infinity", "additional": {"app_name": "gitlab-plan-and-track", "difficulty": "hard", "verifier_path": "real-tasks/task_h52.py", "app_base_port": 8050}}}
{"query_id": "infinity-paypal-my-wallet-task_h80", "dataset": "webarena-infinity", "query": "Save all available Food & Drink offers, buy a $25 DoorDash gift card for yourself, and switch currency conversion to use my card issuer.", "graders": ["infinity_state"], "start_url": "http://localhost:8100", "metadata": {"original_task_id": "paypal-my-wallet-task_h80", "website": "paypal-my-wallet", "category": "webarena-infinity", "additional": {"app_name": "paypal-my-wallet", "difficulty": "hard", "verifier_path": "real-tasks/task_h80.py", "app_base_port": 8100}}}
{"query_id": "infinity-gmail-accounts-and-contacts-task_h50", "dataset": "webarena-infinity", "query": "Add the Emergency label to every contact who is currently listed as a delegate (active, pending, or expired). Then remove all delegates whose status is not 'active'.", "graders": ["infinity_state"], "start_url": "http://localhost:8070", "metadata": {"original_task_id": "gmail-accounts-and-contacts-task_h50", "website": "gmail-accounts-and-contacts", "category": "webarena-infinity", "additional": {"app_name": "gmail-accounts-and-contacts", "difficulty": "hard", "verifier_path": "real-tasks/task_h50.py", "app_base_port": 8070}}}
{"query_id": "infinity-elation-clinical-records-task_h14", "dataset": "webarena-infinity", "query": "Add the tag 'Flu-Season' to every patient whose primary provider is Dr. Sarah Chen.", "graders": ["infinity_state"], "start_url": "http://localhost:8000", "metadata": {"original_task_id": "elation-clinical-records-task_h14", "website": "elation-clinical-records", "category": "webarena-infinity", "additional": {"app_name": "elation-clinical-records", "difficulty": "hard", "verifier_path": "real-tasks/task_h14.py", "app_base_port": 8000}}}
{"query_id": "infinity-figma-text-and-typography-task_h7", "dataset": "webarena-infinity", "query": "Remove all list formatting from every layer.", "graders": ["infinity_state"], "start_url": "http://localhost:8040", "metadata": {"original_task_id": "figma-text-and-typography-task_h7", "website": "figma-text-and-typography", "category": "webarena-infinity", "additional": {"app_name": "figma-text-and-typography", "difficulty": "hard", "verifier_path": "real-tasks/task_h7.py", "app_base_port": 8040}}}
{"query_id": "infinity-paypal-my-wallet-task_h26", "dataset": "webarena-infinity", "query": "Send a $50 Amazon gift card to sarah.chen@email.com with 'Thank you!' as the message, and save the Amazon cashback offer.", "graders": ["infinity_state"], "start_url": "http://localhost:8100", "metadata": {"original_task_id": "paypal-my-wallet-task_h26", "website": "paypal-my-wallet", "category": "webarena-infinity", "additional": {"app_name": "paypal-my-wallet", "difficulty": "hard", "verifier_path": "real-tasks/task_h26.py", "app_base_port": 8100}}}
{"query_id": "infinity-handshake-career-exploration-task_h97", "dataset": "webarena-infinity", "query": "Find the single most helpful answer across all Q&A questions and mark it helpful. Then find the most-viewed question and submit your own answer to it.", "graders": ["infinity_state"], "start_url": "http://localhost:8080", "metadata": {"original_task_id": "handshake-career-exploration-task_h97", "website": "handshake-career-exploration", "category": "webarena-infinity", "additional": {"app_name": "handshake-career-exploration", "difficulty": "hard", "verifier_path": "real-tasks/task_h97.py", "app_base_port": 8080}}}
{"query_id": "infinity-figma-slides-task_h79", "dataset": "webarena-infinity", "query": "In the adoption table, find the feature with the highest Target Q4 percentage. In the competitive table, change DesignCraft's entry for that same feature to 'Market Leader'. Then update that feature's Target Q4 to '95%'.", "graders": ["infinity_state"], "start_url": "http://localhost:8030", "metadata": {"original_task_id": "figma-slides-task_h79", "website": "figma-slides", "category": "webarena-infinity", "additional": {"app_name": "figma-slides", "difficulty": "hard", "verifier_path": "real-tasks/task_h79.py", "app_base_port": 8030}}}
{"query_id": "infinity-gitlab-plan-and-track-task_h41", "dataset": "webarena-infinity", "query": "For every open issue in the v4.2 - Security Hardening milestone: if it is already confidential, set its health status to 'at risk'. If it is not confidential, make it confidential and set its health status to 'needs attention'.", "graders": ["infinity_state"], "start_url": "http://localhost:8050", "metadata": {"original_task_id": "gitlab-plan-and-track-task_h41", "website": "gitlab-plan-and-track", "category": "webarena-infinity", "additional": {"app_name": "gitlab-plan-and-track", "difficulty": "hard", "verifier_path": "real-tasks/task_h41.py", "app_base_port": 8050}}}
{"query_id": "infinity-handshake-career-exploration-task_h90", "dataset": "webarena-infinity", "query": "A student in the feed mentioned attending the NSBE conference. That student also answered a Q&A question about diversity programs in tech. Submit your own answer to that same question sharing your experience, then bookmark that student's feed post.", "graders": ["infinity_state"], "start_url": "http://localhost:8080", "metadata": {"original_task_id": "handshake-career-exploration-task_h90", "website": "handshake-career-exploration", "category": "webarena-infinity", "additional": {"app_name": "handshake-career-exploration", "difficulty": "hard", "verifier_path": "real-tasks/task_h90.py", "app_base_port": 8080}}}
{"query_id": "infinity-elation-prescriptions-task_h30", "dataset": "webarena-infinity", "query": "The patient has three temporary medications. Discontinue the corticosteroid taper and the penicillin antibiotic \u2014 the patient completed both courses. Move the remaining temporary medication to permanent Rx.", "graders": ["infinity_state"], "start_url": "http://localhost:8020", "metadata": {"original_task_id": "elation-prescriptions-task_h30", "website": "elation-prescriptions", "category": "webarena-infinity", "additional": {"app_name": "elation-prescriptions", "difficulty": "hard", "verifier_path": "real-tasks/task_h30.py", "app_base_port": 8020}}}
{"query_id": "infinity-linear-account-settings-task_h19", "dataset": "webarena-infinity", "query": "Turn off all desktop application settings: open in desktop app, notification badge, and spell check.", "graders": ["infinity_state"], "start_url": "http://localhost:8090", "metadata": {"original_task_id": "linear-account-settings-task_h19", "website": "linear-account-settings", "category": "webarena-infinity", "additional": {"app_name": "linear-account-settings", "difficulty": "hard", "verifier_path": "real-tasks/task_h19.py", "app_base_port": 8090}}}
{"query_id": "infinity-elation-prescriptions-task_h39", "dataset": "webarena-infinity", "query": "Change the default pharmacy to Express Scripts Mail Pharmacy for mail-order prescriptions. Then document that the patient takes Magnesium Citrate 400mg tablet as an OTC supplement \u2014 once daily at bedtime, 30-day supply.", "graders": ["infinity_state"], "start_url": "http://localhost:8020", "metadata": {"original_task_id": "elation-prescriptions-task_h39", "website": "elation-prescriptions", "category": "webarena-infinity", "additional": {"app_name": "elation-prescriptions", "difficulty": "hard", "verifier_path": "real-tasks/task_h39.py", "app_base_port": 8020}}}
{"query_id": "infinity-handshake-career-exploration-task_h136", "dataset": "webarena-infinity", "query": "Your earliest completed appointment was a specific type. Schedule a follow-up appointment of the same category and type with the same staff member, for March 28, 2026 at 9:00 AM, in person.", "graders": ["infinity_state"], "start_url": "http://localhost:8080", "metadata": {"original_task_id": "handshake-career-exploration-task_h136", "website": "handshake-career-exploration", "category": "webarena-infinity", "additional": {"app_name": "handshake-career-exploration", "difficulty": "hard", "verifier_path": "real-tasks/task_h136.py", "app_base_port": 8080}}}
{"query_id": "infinity-handshake-career-exploration-task_h105", "dataset": "webarena-infinity", "query": "Find the second-most-viewed question in Q&A. It has two answers \u2014 mark the one with fewer helpful votes as helpful.", "graders": ["infinity_state"], "start_url": "http://localhost:8080", "metadata": {"original_task_id": "handshake-career-exploration-task_h105", "website": "handshake-career-exploration", "category": "webarena-infinity", "additional": {"app_name": "handshake-career-exploration", "difficulty": "hard", "verifier_path": "real-tasks/task_h105.py", "app_base_port": 8080}}}
{"query_id": "infinity-gmail-accounts-and-contacts-task_h22", "dataset": "webarena-infinity", "query": "The Engineering Manager at TechCorp is listed as one of your delegates. Remove her delegation and unstar her contact.", "graders": ["infinity_state"], "start_url": "http://localhost:8070", "metadata": {"original_task_id": "gmail-accounts-and-contacts-task_h22", "website": "gmail-accounts-and-contacts", "category": "webarena-infinity", "additional": {"app_name": "gmail-accounts-and-contacts", "difficulty": "hard", "verifier_path": "real-tasks/task_h22.py", "app_base_port": 8070}}}
{"query_id": "infinity-elation-patient-communication-task_h9", "dataset": "webarena-infinity", "query": "Acknowledge all unacknowledged reminders in the system.", "graders": ["infinity_state"], "start_url": "http://localhost:8010", "metadata": {"original_task_id": "elation-patient-communication-task_h9", "website": "elation-patient-communication", "category": "webarena-infinity", "additional": {"app_name": "elation-patient-communication", "difficulty": "hard", "verifier_path": "real-tasks/task_h9.py", "app_base_port": 8010}}}
{"query_id": "infinity-superhuman-general-task_h1", "dataset": "webarena-infinity", "query": "Label the FinancePlus partnership email and the QuantumLab prototype email as 'Clients'.", "graders": ["infinity_state"], "start_url": "http://localhost:8110", "metadata": {"original_task_id": "superhuman-general-task_h1", "website": "superhuman-general", "category": "webarena-infinity", "additional": {"app_name": "superhuman-general", "difficulty": "hard", "verifier_path": "real-tasks/task_h1.py", "app_base_port": 8110}}}
{"query_id": "infinity-xero-invoicing-task_h79", "dataset": "webarena-infinity", "query": "Change the invoice prefix to 'AUS-' and the next number to 100, then create a new invoice for CloudNine Analytics for 8 hours of UI/UX design work.", "graders": ["infinity_state"], "start_url": "http://localhost:8120", "metadata": {"original_task_id": "xero-invoicing-task_h79", "website": "xero-invoicing", "category": "webarena-infinity", "additional": {"app_name": "xero-invoicing", "difficulty": "hard", "verifier_path": "real-tasks/task_h79.py", "app_base_port": 8120}}}
{"query_id": "infinity-figma-slides-task_h16", "dataset": "webarena-infinity", "query": "Enable slide numbers on every slide using the 'with total' format and change the aspect ratio to 4:3.", "graders": ["infinity_state"], "start_url": "http://localhost:8030", "metadata": {"original_task_id": "figma-slides-task_h16", "website": "figma-slides", "category": "webarena-infinity", "additional": {"app_name": "figma-slides", "difficulty": "hard", "verifier_path": "real-tasks/task_h16.py", "app_base_port": 8030}}}
{"query_id": "infinity-linear-account-settings-task_h16", "dataset": "webarena-infinity", "query": "Revoke all API keys that have an expiration date.", "graders": ["infinity_state"], "start_url": "http://localhost:8090", "metadata": {"original_task_id": "linear-account-settings-task_h16", "website": "linear-account-settings", "category": "webarena-infinity", "additional": {"app_name": "linear-account-settings", "difficulty": "hard", "verifier_path": "real-tasks/task_h16.py", "app_base_port": 8090}}}
{"query_id": "infinity-elation-prescriptions-task_h2", "dataset": "webarena-infinity", "query": "Prescribe Buspirone 10mg for the patient's anxiety \u2014 once daily in the morning, qty 30, 5 refills. Send it to the same pharmacy that fills his Sertraline.", "graders": ["infinity_state"], "start_url": "http://localhost:8020", "metadata": {"original_task_id": "elation-prescriptions-task_h2", "website": "elation-prescriptions", "category": "webarena-infinity", "additional": {"app_name": "elation-prescriptions", "difficulty": "hard", "verifier_path": "real-tasks/task_h2.py", "app_base_port": 8020}}}
{"query_id": "infinity-handshake-career-exploration-task_h1", "dataset": "webarena-infinity", "query": "Follow all consulting firms on Handshake.", "graders": ["infinity_state"], "start_url": "http://localhost:8080", "metadata": {"original_task_id": "handshake-career-exploration-task_h1", "website": "handshake-career-exploration", "category": "webarena-infinity", "additional": {"app_name": "handshake-career-exploration", "difficulty": "hard", "verifier_path": "real-tasks/task_h1.py", "app_base_port": 8080}}}
{"query_id": "infinity-handshake-career-exploration-task_h141", "dataset": "webarena-infinity", "query": "Some of your saved jobs are from employers you haven't followed yet. Find and follow each of those employers.", "graders": ["infinity_state"], "start_url": "http://localhost:8080", "metadata": {"original_task_id": "handshake-career-exploration-task_h141", "website": "handshake-career-exploration", "category": "webarena-infinity", "additional": {"app_name": "handshake-career-exploration", "difficulty": "hard", "verifier_path": "real-tasks/task_h141.py", "app_base_port": 8080}}}
{"query_id": "infinity-figma-text-and-typography-task_h74", "dataset": "webarena-infinity", "query": "Set the spelling language to Japanese, the big nudge amount to 50, and the default horizontal alignment to right.", "graders": ["infinity_state"], "start_url": "http://localhost:8040", "metadata": {"original_task_id": "figma-text-and-typography-task_h74", "website": "figma-text-and-typography", "category": "webarena-infinity", "additional": {"app_name": "figma-text-and-typography", "difficulty": "hard", "verifier_path": "real-tasks/task_h74.py", "app_base_port": 8040}}}
{"query_id": "infinity-elation-patient-communication-task_h63", "dataset": "webarena-infinity", "query": "Check the visit summaries to find the patient whose BNP level improved. Reply to their most recent message confirming they can resume light activity, then update their emergency contact's phone number to (650) 555-0001.", "graders": ["infinity_state"], "start_url": "http://localhost:8010", "metadata": {"original_task_id": "elation-patient-communication-task_h63", "website": "elation-patient-communication", "category": "webarena-infinity", "additional": {"app_name": "elation-patient-communication", "difficulty": "hard", "verifier_path": "real-tasks/task_h63.py", "app_base_port": 8010}}}
{"query_id": "infinity-elation-patient-communication-task_h14", "dataset": "webarena-infinity", "query": "Change Dr. Torres's notification timeframe to 'Do not notify me' and remove Dr. Torres from Dr. Chen's General Question routing.", "graders": ["infinity_state"], "start_url": "http://localhost:8010", "metadata": {"original_task_id": "elation-patient-communication-task_h14", "website": "elation-patient-communication", "category": "webarena-infinity", "additional": {"app_name": "elation-patient-communication", "difficulty": "hard", "verifier_path": "real-tasks/task_h14.py", "app_base_port": 8010}}}
{"query_id": "infinity-gitlab-plan-and-track-task_h67", "dataset": "webarena-infinity", "query": "Delete all time entries from the GraphQL gateway issue, add a single new entry of 16 hours with summary 'Complete rewrite estimate', and set its time estimate to 40 hours.", "graders": ["infinity_state"], "start_url": "http://localhost:8050", "metadata": {"original_task_id": "gitlab-plan-and-track-task_h67", "website": "gitlab-plan-and-track", "category": "webarena-infinity", "additional": {"app_name": "gitlab-plan-and-track", "difficulty": "hard", "verifier_path": "real-tasks/task_h67.py", "app_base_port": 8050}}}
{"query_id": "infinity-gmail-accounts-and-contacts-task_h73", "dataset": "webarena-infinity", "query": "Among the individual people in your other contacts (those with a first and last name), find the one who was saved most recently. Move them to your main contacts, set their company to 'Salesforce', job title to 'Account Executive', and add the Work label.", "graders": ["infinity_state"], "start_url": "http://localhost:8070", "metadata": {"original_task_id": "gmail-accounts-and-contacts-task_h73", "website": "gmail-accounts-and-contacts", "category": "webarena-infinity", "additional": {"app_name": "gmail-accounts-and-contacts", "difficulty": "hard", "verifier_path": "real-tasks/task_h73.py", "app_base_port": 8070}}}
{"query_id": "infinity-elation-prescriptions-task_h4", "dataset": "webarena-infinity", "query": "Run a medication reconciliation and mark the Calcium+D3 supplement for discontinuation during the review.", "graders": ["infinity_state"], "start_url": "http://localhost:8020", "metadata": {"original_task_id": "elation-prescriptions-task_h4", "website": "elation-prescriptions", "category": "webarena-infinity", "additional": {"app_name": "elation-prescriptions", "difficulty": "hard", "verifier_path": "real-tasks/task_h4.py", "app_base_port": 8020}}}
{"query_id": "infinity-elation-prescriptions-task_h47", "dataset": "webarena-infinity", "query": "The patient's SSRI is currently dispensed at a different pharmacy than most of his other medications. Prescribe a refill of the same SSRI at the same dose and sig, but send it to CVS #4521 instead \u2014 qty 30, 5 refills, 30 days supply.", "graders": ["infinity_state"], "start_url": "http://localhost:8020", "metadata": {"original_task_id": "elation-prescriptions-task_h47", "website": "elation-prescriptions", "category": "webarena-infinity", "additional": {"app_name": "elation-prescriptions", "difficulty": "hard", "verifier_path": "real-tasks/task_h47.py", "app_base_port": 8020}}}
{"query_id": "infinity-paypal-my-wallet-task_h89", "dataset": "webarena-infinity", "query": "If your USD PayPal balance is above $2,500, convert $500 to Japanese Yen. If it is $2,500 or below, first add $500 from your Chase bank account, then convert $500 to JPY. Either way, set the debit card cash back category to Fuel.", "graders": ["infinity_state"], "start_url": "http://localhost:8100", "metadata": {"original_task_id": "paypal-my-wallet-task_h89", "website": "paypal-my-wallet", "category": "webarena-infinity", "additional": {"app_name": "paypal-my-wallet", "difficulty": "hard", "verifier_path": "real-tasks/task_h89.py", "app_base_port": 8100}}}
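The records above share one JSONL schema per line. A minimal loader sketch (field names taken from the records themselves; the path argument is hypothetical):

```python
import json

REQUIRED_FIELDS = {"query_id", "dataset", "query", "graders", "start_url", "metadata"}

def load_tasks(path: str) -> list[dict]:
    """Parse one task record per line, skipping blank lines."""
    tasks = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            task = json.loads(line)
            # Every record above carries these top-level fields
            assert REQUIRED_FIELDS <= task.keys()
            tasks.append(task)
    return tasks
```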


@@ -1,147 +0,0 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Coordinate Click Test</title>
<style>
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
width: 100vw;
height: 100vh;
overflow: hidden;
background: #1a1a2e;
font-family: system-ui, -apple-system, sans-serif;
}
.circle {
position: fixed;
border-radius: 50%;
background: #e94560;
display: flex;
align-items: center;
justify-content: center;
color: #fff;
font-weight: 700;
cursor: pointer;
user-select: none;
transition: background 0.2s, transform 0.15s;
clip-path: circle(50%);
}
.circle:hover { transform: scale(1.08); }
.circle[data-clicked="true"] {
background: #0f3460;
pointer-events: none;
}
/* A — top-left area, large */
.circle-a {
width: 80px;
height: 80px;
font-size: 24px;
top: 15%;
left: 10%;
}
/* B — right side, upper-middle, medium */
.circle-b {
width: 50px;
height: 50px;
font-size: 18px;
top: 30%;
right: 18%;
}
/* C — center-left, lower area, small */
.circle-c {
width: 30px;
height: 30px;
font-size: 13px;
bottom: 25%;
left: 35%;
}
/* D — bottom-right area, very small */
.circle-d {
width: 16px;
height: 16px;
font-size: 9px;
bottom: 12%;
right: 30%;
}
#status {
position: fixed;
top: 50%;
left: 50%;
transform: translate(-50%, -50%);
color: #eee;
font-size: 16px;
text-align: center;
pointer-events: none;
}
#status .count {
font-size: 48px;
font-weight: 700;
color: #0f3460;
}
.success-flash {
animation: flash 0.4s ease-out;
}
@keyframes flash {
0% { background: #16c79a; transform: scale(1.3); }
100% { background: #0f3460; transform: scale(1); }
}
</style>
</head>
<body>
<div id="circle-a" class="circle circle-a" data-target="A" data-clicked="false">A</div>
<div id="circle-b" class="circle circle-b" data-target="B" data-clicked="false">B</div>
<div id="circle-c" class="circle circle-c" data-target="C" data-clicked="false">C</div>
<div id="circle-d" class="circle circle-d" data-target="D" data-clicked="false">D</div>
<div id="status">
<div class="count" id="clicked-count">0</div>
<div>of 4 targets clicked</div>
</div>
<script>
const circles = document.querySelectorAll('.circle')
const countEl = document.getElementById('clicked-count')
let clicked = 0
circles.forEach(circle => {
circle.addEventListener('click', (e) => {
if (circle.dataset.clicked === 'true') return
const rect = circle.getBoundingClientRect()
const centerX = rect.left + rect.width / 2
const centerY = rect.top + rect.height / 2
const radius = rect.width / 2
const dx = e.clientX - centerX
const dy = e.clientY - centerY
if (dx * dx + dy * dy > radius * radius) return
circle.dataset.clicked = 'true'
circle.classList.add('success-flash')
clicked++
countEl.textContent = clicked
if (clicked === 4) {
document.getElementById('status').innerHTML =
'<div class="count" style="color:#16c79a">ALL TARGETS HIT</div>' +
'<div>4 of 4 targets clicked</div>'
document.body.dataset.allClicked = 'true'
}
})
})
</script>
</body>
</html>


@@ -1,16 +0,0 @@
const server = Bun.serve({
port: 3100,
async fetch(req) {
const url = new URL(req.url)
const path = url.pathname === '/' ? '/index.html' : url.pathname
const file = Bun.file(import.meta.dir + path)
if (await file.exists()) {
return new Response(file)
}
return new Response('Not Found', { status: 404 })
},
})
console.log(`Coordinate click test running at http://localhost:${server.port}`)

View File

@@ -0,0 +1,133 @@
#!/usr/bin/env python3
"""
AGI SDK evaluation helper for BrowserOS eval framework.
Reads JSON from stdin with task_id and env_state, runs the agisdk
evaluator, and outputs the result as JSON to stdout.
Input format:
{"task_id": "dashdish-1", "env_state": {...}, "model_response": ""}
Output format:
{"reward": 0.0, "pass": false, "message": "...", "per_criterion": [...]}
Lenient string matching is enabled by default: a failed criterion where
expected_value is a clean substring of actual_value (both strings) is
re-marked as a softened pass. This handles AGISDK tasks where the model
adds harmless decoration to a title or note (e.g. topwork-3, topwork-4).
Set AGISDK_STRICT_STRINGS=1 to disable and recover the strict score.
"""
import json
import os
import sys
_STRICT = os.environ.get("AGISDK_STRICT_STRINGS", "").lower() in ("1", "true", "yes")
def _soft_string_match(detail: object) -> bool:
"""Return True iff `detail` is `{actual_value, expected_value}` with both
strings and a non-empty `expected_value` that is contained in `actual_value`
(case-insensitive). Otherwise False — the criterion stays failed.
"""
if not isinstance(detail, dict):
return False
actual = detail.get("actual_value")
expected = detail.get("expected_value")
if not isinstance(actual, str) or not isinstance(expected, str):
return False
expected_stripped = expected.strip()
if not expected_stripped:
return False
return expected_stripped.lower() in actual.lower()
def main():
data = json.loads(sys.stdin.read())
task_id = data["task_id"]
env_state = data["env_state"]
model_response = data.get("model_response", "")
try:
from agisdk.REAL.browsergym.webclones.evaluate import WebCloneEvaluator
from agisdk.REAL.browsergym.webclones.task_config import TaskConfig
except ImportError:
print(
json.dumps(
{
"reward": 0,
"pass": False,
"message": "agisdk package not installed. Run: pip install agisdk",
"per_criterion": [],
}
)
)
sys.exit(0)
try:
# Redirect stdout to stderr during evaluation — agisdk's rich logger
# prints directly to stdout, which would corrupt our JSON output
real_stdout = sys.stdout
sys.stdout = sys.stderr
tc = TaskConfig(task_id)
evaluator = WebCloneEvaluator(tc)
reward_val, _done, message, info = evaluator.evaluate(
env_state=env_state, model_response=model_response
)
sys.stdout = real_stdout
reward_val = float(reward_val) if reward_val is not None else 0.0
results = info.get("results", [])
per_criterion = []
softened_count = 0
for r in results:
passed = bool(r[0])
detail = r[1] if len(r) > 1 else ""
entry: dict = {"passed": passed, "detail": str(detail)}
if not _STRICT and not passed and _soft_string_match(detail):
entry["passed"] = True
entry["softened"] = True
softened_count += 1
per_criterion.append(entry)
# Recompute pass/reward after softening: if every criterion now passes,
# the task counts as a soft pass.
all_pass = all(c["passed"] for c in per_criterion) and bool(per_criterion)
if all_pass and reward_val != 1.0:
reward_val = 1.0
out_message = str(message)
if softened_count and all_pass:
out_message = f"Task passed ({softened_count} softened string {'criterion' if softened_count == 1 else 'criteria'})."
print(
json.dumps(
{
"reward": reward_val,
"pass": reward_val == 1.0,
"message": out_message,
"per_criterion": per_criterion,
}
)
)
except Exception as e:
sys.stdout = real_stdout if "real_stdout" in locals() else sys.__stdout__
print(
json.dumps(
{
"reward": 0,
"pass": False,
"message": f"Evaluation error: {str(e)}",
"per_criterion": [],
}
)
)
if __name__ == "__main__":
main()
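The lenient matching rule described in the docstring can be illustrated in isolation. This is a standalone sketch of the `_soft_string_match` contract (not an import of the helper above):

```python
def soft_string_match(detail: object) -> bool:
    """True iff detail is a dict whose actual_value and expected_value are
    both strings and the stripped, lowercased expected_value is a substring
    of the lowercased actual_value."""
    if not isinstance(detail, dict):
        return False
    actual = detail.get("actual_value")
    expected = detail.get("expected_value")
    if not isinstance(actual, str) or not isinstance(expected, str):
        return False
    expected = expected.strip()
    return bool(expected) and expected.lower() in actual.lower()

# A title with harmless decoration is softened to a pass:
assert soft_string_match({"actual_value": "Q3 Report (final)", "expected_value": "q3 report"})
# Non-dict details and empty expectations stay failed:
assert not soft_string_match({"actual_value": "x", "expected_value": "   "})
assert not soft_string_match("not a dict")
```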


@@ -1,93 +0,0 @@
"""
Analyze how many WebBench tasks require authentication across ALL buckets.
Usage: python3 apps/eval/scripts/analyze-webbench-auth.py
"""
import json
import re
from collections import defaultdict
# Login/auth indicators in task text
AUTH_KEYWORDS = [
"log in", "login", "sign in", "signin", "sign up", "signup",
"your account", "your profile", "your wishlist", "your order",
"your cart", "your dashboard", "your settings", "your subscription",
"your inbox", "your message", "your review", "your playlist",
"your favorites", "your saved", "your history", "your list",
"your address", "your payment", "your booking", "your reservation",
"my account", "my profile", "my wishlist", "my order", "my cart",
"my dashboard", "my settings", "my subscription", "my inbox",
"my message", "my review", "my playlist", "my favorites",
"my saved", "my history", "my list", "my address", "my payment",
"my booking", "my reservation", "my bag",
"send a message", "post a comment", "write a review", "submit a review",
"leave a review", "publish", "upload a", "create a playlist",
"add to cart", "add to bag", "add to wishlist", "add to favorites",
"save to", "bookmark", "subscribe", "unsubscribe",
"delete your", "remove your", "delete my", "remove my",
"edit your", "edit my", "update your", "update my",
"change your", "change my", "modify your", "modify my",
]
# Categories that almost always need auth
WRITE_CATEGORIES = {"CREATE", "UPDATE", "DELETE"}
def needs_auth(task_text, category):
task_lower = task_text.lower()
# Check keywords
for kw in AUTH_KEYWORDS:
if kw in task_lower:
return True, f"keyword: '{kw}'"
# WRITE-category tasks that miss every keyword may still need auth, but stay
# conservative: some CREATE tasks (e.g. "create a search filter") don't require
# login, so category is deliberately not used as a signal here.
return False, ""
# Load all datasets
for bucket in [0, 1, 2]:
full_path = f"apps/eval/data/webbench-{bucket}of4.jsonl"
tasks = []
with open(full_path) as f:
for line in f:
tasks.append(json.loads(line))
auth_tasks = []
no_auth_tasks = []
for t in tasks:
needs, reason = needs_auth(t["query"], t["metadata"]["category"])
if needs:
auth_tasks.append((t, reason))
else:
no_auth_tasks.append(t)
print(f"{'=' * 60}")
print(f"BUCKET {bucket}/4: {len(tasks)} total")
print(f" Needs auth: {len(auth_tasks)} ({len(auth_tasks)/len(tasks)*100:.0f}%)")
print(f" No auth: {len(no_auth_tasks)} ({len(no_auth_tasks)/len(tasks)*100:.0f}%)")
# Breakdown of no-auth tasks
cats = defaultdict(int)
diffs = defaultdict(int)
domains = set()
for t in no_auth_tasks:
cats[t["metadata"]["category"]] += 1
diffs[t["metadata"]["additional"]["difficulty"]] += 1
domains.add(t["metadata"]["website"])
cat_str = ", ".join(f"{c}({n})" for c, n in sorted(cats.items(), key=lambda x: -x[1]))
diff_str = ", ".join(f"{d}({n})" for d, n in sorted(diffs.items(), key=lambda x: -x[1]))
print(f" No-auth breakdown:")
print(f" categories: {cat_str}")
print(f" difficulty: {diff_str}")
print(f" websites: {len(domains)}")
# Sample no-auth tasks
print(f"\n Sample no-auth tasks:")
for t in no_auth_tasks[:8]:
print(f" [{t['metadata']['additional']['webbench_id']}] [{t['metadata']['category']}] {t['metadata']['website']}")
print(f" {t['query'][:150]}")
# Sample auth tasks (to verify detection)
print(f"\n Sample auth tasks (verify detection):")
for t, reason in auth_tasks[:5]:
print(f" [{t['metadata']['additional']['webbench_id']}] [{t['metadata']['category']}] {t['metadata']['website']} ({reason})")
print(f" {t['query'][:150]}")
print()
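The detection heuristic above reduces to a case-insensitive substring scan over the task text. A condensed, standalone sketch (using a trimmed keyword list for illustration):

```python
AUTH_KEYWORDS = ["log in", "your cart", "add to wishlist", "write a review"]

def needs_auth(task_text: str) -> tuple[bool, str]:
    """Return (True, reason) on the first matching keyword, else (False, '')."""
    task_lower = task_text.lower()
    for kw in AUTH_KEYWORDS:
        if kw in task_lower:
            return True, f"keyword: '{kw}'"
    return False, ""

assert needs_auth("Add this jacket to your cart") == (True, "keyword: 'your cart'")
assert needs_auth("Find the cheapest flight to Tokyo") == (False, "")
```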


@@ -1,214 +0,0 @@
"""
Analyze WebBench results across ALL 8 agents to stratify tasks by pass count.
Usage: python3 apps/eval/scripts/analyze-webbench.py
"""
import csv
import os
from collections import defaultdict
DATA_DIR = "apps/eval/data/webbench"
AGENTS = [
{"file": "anthropicfinal.csv", "eval_col": "Anthropic_Eval", "name": "Anthropic CUA"},
{"file": "skyvern2.0final.csv", "eval_col": "Skyvern2.0Eval", "name": "Skyvern 2.0"},
{"file": "skyvern2.0browserbasefinal.csv", "eval_col": "Browserbase_SkyvernEval", "name": "Skyvern BB"},
{"file": "openaicuafinal.csv", "eval_col": "CUAEval", "name": "OpenAI CUA"},
{"file": "browserusefinal.csv", "eval_col": "BUEval", "name": "BrowserUse"},
{"file": "convergencehitlfinal.csv", "eval_col": "convergence_hitl_eval", "name": "Convergence"},
{"file": "operatorhitlfinal.csv", "eval_col": "operator_hitl_eval", "name": "Operator"},
{"file": "rtrvrfinal.csv", "eval_col": "Human Label", "name": "RTRVR"},
]
def load_agent(agent):
path = os.path.join(DATA_DIR, agent["file"])
results = {}
with open(path, newline="", encoding="utf-8") as f:
reader = csv.DictReader(f)
for row in reader:
try:
task_id = int(row["ID"])
except (ValueError, KeyError):
continue
eval_val = row.get(agent["eval_col"], "")
results[task_id] = {
"eval": eval_val,
"difficulty": row.get("Difficulty", ""),
"category": row.get("Category", ""),
"task": row.get("Task", ""),
"url": row.get("Starting URL", ""),
}
return results
# Load all agents
print("Loading agents...")
agent_results = {}
for agent in AGENTS:
data = load_agent(agent)
agent_results[agent["name"]] = data
print(f" {agent['name']}: {len(data)} tasks")
# ─── INDIVIDUAL AGENT STATS ──────────────────────────────────────────
print("\n" + "=" * 70)
print("INDIVIDUAL AGENT PASS RATES")
print("=" * 70)
for agent in AGENTS:
name = agent["name"]
data = agent_results[name]
total = len(data)
passed = sum(1 for r in data.values() if r["eval"] and "success" in r["eval"].lower())
easy_total = sum(1 for r in data.values() if r["difficulty"] == "easy")
easy_pass = sum(1 for r in data.values() if r["difficulty"] == "easy" and r["eval"] and "success" in r["eval"].lower())
hard_total = sum(1 for r in data.values() if r["difficulty"] == "hard")
hard_pass = sum(1 for r in data.values() if r["difficulty"] == "hard" and r["eval"] and "success" in r["eval"].lower())
print(f"\n{name}: {passed}/{total} = {passed/total*100:.1f}%")
if easy_total:
print(f" easy: {easy_pass}/{easy_total} = {easy_pass/easy_total*100:.1f}%")
if hard_total:
print(f" hard: {hard_pass}/{hard_total} = {hard_pass/hard_total*100:.1f}%")
# ─── FULL-COVERAGE AGENTS (2452 tasks each) ──────────────────────────
# Anthropic CUA, Skyvern 2.0, Skyvern BB, OpenAI CUA
full_agents = ["Anthropic CUA", "Skyvern 2.0", "Skyvern BB", "OpenAI CUA"]
print("\n" + "=" * 70)
print(f"4 FULL-COVERAGE AGENTS: {', '.join(full_agents)}")
print("(each has ~2452 tasks)")
print("=" * 70)
# Collect IDs present in ALL 4 full agents
all_ids = None
for name in full_agents:
ids = set(agent_results[name].keys())
all_ids = ids if all_ids is None else all_ids & ids
print(f"Tasks in intersection: {len(all_ids)}")
by_pass = defaultdict(list)
for tid in sorted(all_ids):
pass_count = 0
info = {}
agent_evals = {}
for name in full_agents:
r = agent_results[name][tid]
is_success = "success" in r["eval"].lower() if r["eval"] else False
if is_success:
pass_count += 1
agent_evals[name] = "PASS" if is_success else "FAIL"
if not info:
info = r
by_pass[pass_count].append({
"id": tid, "pass_count": pass_count,
"difficulty": info["difficulty"], "category": info["category"],
"task": info["task"], "url": info["url"], "agents": agent_evals,
})
for pc in range(5):
tasks = by_pass[pc]
label = {0: "0/4 (ALL FAIL)", 4: "4/4 (ALL PASS)"}.get(pc, f"{pc}/4")
easy = sum(1 for t in tasks if t["difficulty"] == "easy")
hard = sum(1 for t in tasks if t["difficulty"] == "hard")
cats = defaultdict(int)
for t in tasks:
cats[t["category"]] += 1
urls = len(set(t["url"] for t in tasks))
cat_str = ", ".join(f"{c}({n})" for c, n in sorted(cats.items(), key=lambda x: -x[1]))
print(f"\n{label}: {len(tasks)} tasks")
print(f" easy: {easy}, hard: {hard}")
print(f" categories: {cat_str}")
print(f" unique websites: {urls}")
# ─── NOW ALSO CHECK: how many 0/4 tasks require login? ───────────────
print("\n" + "=" * 70)
print("0/4 TASKS: LOGIN vs NO-LOGIN breakdown")
print("=" * 70)
login_keywords = ["log in", "login", "sign in", "signin", "your account", "your profile",
"your wishlist", "your order", "your cart", "your dashboard", "your settings",
"your subscription", "your inbox", "your message", "your review",
"send a message", "post a comment", "write a review", "submit a",
"publish", "upload"]
zero_pass = by_pass[0]
login_tasks = []
no_login_tasks = []
for t in zero_pass:
task_lower = t["task"].lower()
needs_login = any(kw in task_lower for kw in login_keywords)
if needs_login:
login_tasks.append(t)
else:
no_login_tasks.append(t)
print(f" Likely needs login: {len(login_tasks)}")
print(f" Possibly no login: {len(no_login_tasks)}")
print(f"\n No-login 0/4 tasks by category:")
cats = defaultdict(int)
for t in no_login_tasks:
cats[t["category"]] += 1
cat_str = ", ".join(f"{c}({n})" for c, n in sorted(cats.items(), key=lambda x: -x[1]))
print(f" {cat_str}")
print(f"\n Sample no-login 0/4 tasks:")
for t in no_login_tasks[:10]:
print(f" [{t['id']}] [{t['difficulty']}] [{t['category']}] {t['url']}")
print(f" {t['task'][:180]}")
# ─── ALSO INCLUDE THE HITL AGENTS (smaller overlap) ──────────────────
hitl_agents = ["Convergence", "Operator", "RTRVR"]
print("\n" + "=" * 70)
print(f"HITL AGENTS: {', '.join(hitl_agents)}")
print("=" * 70)
for name in hitl_agents:
data = agent_results[name]
total = len(data)
passed = sum(1 for r in data.values() if r["eval"] and "success" in r["eval"].lower())
print(f" {name}: {passed}/{total} = {passed/total*100:.1f}%")
# See how HITL agents do on the same tasks as the 4 full agents
hitl_ids = None
for name in hitl_agents:
ids = set(agent_results[name].keys())
hitl_ids = ids if hitl_ids is None else hitl_ids & ids
common_hitl = all_ids & hitl_ids if hitl_ids else set()
print(f"\n Tasks in common (all 7 agents): {len(common_hitl)}")
if common_hitl:
by_pass_7 = defaultdict(list)
all_7 = full_agents + hitl_agents
for tid in sorted(common_hitl):
pass_count = 0
info = {}
for name in all_7:
r = agent_results[name].get(tid)
if r:
is_success = "success" in r["eval"].lower() if r["eval"] else False
if is_success:
pass_count += 1
if not info:
info = r
by_pass_7[pass_count].append({"id": tid, **info})
print("\n 7-AGENT PASS COUNT (on common subset):")
for pc in range(8):
if by_pass_7[pc]:
print(f" {pc}/7: {len(by_pass_7[pc])} tasks")
# ─── SUMMARY TABLE ───────────────────────────────────────────────────
print("\n" + "=" * 70)
print("SUMMARY FOR DATASET BUILDING")
print("=" * 70)
print(f"""
Pool sizes (4 full-coverage agents):
0/4 (all fail): {len(by_pass[0]):>4} (login-required: ~{len(login_tasks)}, no-login: ~{len(no_login_tasks)})
1/4: {len(by_pass[1]):>4}
2/4: {len(by_pass[2]):>4}
3/4: {len(by_pass[3]):>4}
4/4 (all pass): {len(by_pass[4]):>4}
─────────────────────
Total: {sum(len(v) for v in by_pass.values()):>4}
""")


@@ -1,233 +0,0 @@
/**
* Analyze WebBench results across 4 agents to stratify tasks by pass count.
* Usage: bun apps/eval/scripts/analyze-webbench.ts
*/
import { parse } from 'csv-parse/sync'
const dataDir = 'apps/eval/data/webbench'
interface AgentConfig {
file: string
evalCol: string
name: string
}
const agents: AgentConfig[] = [
{ file: 'anthropicfinal.csv', evalCol: 'Anthropic_Eval', name: 'Anthropic' },
{ file: 'skyvern2.0final.csv', evalCol: 'Skyvern2.0Eval', name: 'Skyvern' },
{ file: 'openaicuafinal.csv', evalCol: 'CUAEval', name: 'OpenAI CUA' },
{ file: 'browserusefinal.csv', evalCol: 'BUEval', name: 'BrowserUse' },
]
type Row = Record<string, string>
// Parse each agent's results
const agentResults = new Map<
string,
Map<
number,
{
eval: string
difficulty: string
category: string
task: string
url: string
}
>
>()
for (const agent of agents) {
const text = await Bun.file(`${dataDir}/${agent.file}`).text()
const rows: Row[] = parse(text, {
columns: true,
skip_empty_lines: true,
relax_column_count: true,
})
const results = new Map<
number,
{
eval: string
difficulty: string
category: string
task: string
url: string
}
>()
for (const row of rows) {
const id = parseInt(row.ID, 10)
if (Number.isNaN(id)) continue
results.set(id, {
eval: row[agent.evalCol] || '',
difficulty: row.Difficulty || '',
category: row.Category || '',
task: row.Task || '',
url: row['Starting URL'] || '',
})
}
agentResults.set(agent.name, results)
console.log(`${agent.name}: ${results.size} tasks loaded`)
}
// Collect the union of task IDs across all 4 agents (gaps are marked 'N/A' below)
const allIds = new Set<number>()
for (const [, results] of agentResults) {
for (const id of results.keys()) allIds.add(id)
}
// Build pass count per task
interface TaskStats {
id: number
passCount: number
difficulty: string
category: string
task: string
url: string
agents: Record<string, string>
}
const taskStats: TaskStats[] = []
const _fullAgentNames = agents.map((a) => a.name)
for (const id of allIds) {
let passCount = 0
let _presentCount = 0
const agentEvals: Record<string, string> = {}
let difficulty = ''
let category = ''
let task = ''
let url = ''
for (const agent of agents) {
const result = agentResults.get(agent.name)?.get(id)
if (result) {
_presentCount++
const isSuccess = result.eval?.toLowerCase().includes('success')
if (isSuccess) passCount++
agentEvals[agent.name] = isSuccess ? 'PASS' : 'FAIL'
if (!difficulty) difficulty = result.difficulty
if (!category) category = result.category
if (!task) task = result.task
if (!url) url = result.url
} else {
agentEvals[agent.name] = 'N/A'
}
}
taskStats.push({
id,
passCount,
difficulty,
category,
task,
url,
agents: agentEvals,
})
}
// Group by pass count
const byPassCount: Record<number, TaskStats[]> = {
0: [],
1: [],
2: [],
3: [],
4: [],
}
for (const t of taskStats) {
byPassCount[t.passCount].push(t)
}
console.log('\n═══════════════════════════════════════════════════')
console.log('TASKS BY PASS COUNT (how many agents succeeded)')
console.log('═══════════════════════════════════════════════════\n')
for (let pc = 0; pc <= 4; pc++) {
const tasks = byPassCount[pc]
const label =
pc === 0 ? '0/4 (ALL FAIL)' : pc === 4 ? '4/4 (ALL PASS)' : `${pc}/4`
console.log(`${label}: ${tasks.length} tasks`)
// Breakdown by difficulty
const easy = tasks.filter((t) => t.difficulty === 'easy').length
const hard = tasks.filter((t) => t.difficulty === 'hard').length
console.log(` easy: ${easy}, hard: ${hard}`)
// Breakdown by category
const byCat: Record<string, number> = {}
for (const t of tasks) {
byCat[t.category] = (byCat[t.category] || 0) + 1
}
console.log(
` categories: ${Object.entries(byCat)
.sort((a, b) => b[1] - a[1])
.map(([c, n]) => `${c}(${n})`)
.join(', ')}`,
)
console.log()
}
// BrowserUse covers only 658 tasks, so also compute a 3-agent view (Anthropic, Skyvern, OpenAI CUA)
console.log('\n═══════════════════════════════════════════════════')
console.log('3-AGENT VIEW (Anthropic + Skyvern + OpenAI CUA)')
console.log('(BrowserUse only has 658 tasks, so this is more complete)')
console.log('═══════════════════════════════════════════════════\n')
const threeAgents = ['Anthropic', 'Skyvern', 'OpenAI CUA']
const byPassCount3: Record<number, TaskStats[]> = { 0: [], 1: [], 2: [], 3: [] }
for (const t of taskStats) {
let pc3 = 0
let allPresent = true
for (const a of threeAgents) {
if (t.agents[a] === 'N/A') {
allPresent = false
break
}
if (t.agents[a] === 'PASS') pc3++
}
if (!allPresent) continue
if (!byPassCount3[pc3]) byPassCount3[pc3] = []
byPassCount3[pc3].push(t)
}
let total3 = 0
for (let pc = 0; pc <= 3; pc++) {
const tasks = byPassCount3[pc]
total3 += tasks.length
const label =
pc === 0 ? '0/3 (ALL FAIL)' : pc === 3 ? '3/3 (ALL PASS)' : `${pc}/3`
console.log(`${label}: ${tasks.length} tasks`)
const easy = tasks.filter((t) => t.difficulty === 'easy').length
const hard = tasks.filter((t) => t.difficulty === 'hard').length
console.log(` easy: ${easy}, hard: ${hard}`)
const byCat: Record<string, number> = {}
for (const t of tasks) {
byCat[t.category] = (byCat[t.category] || 0) + 1
}
console.log(
` categories: ${Object.entries(byCat)
.sort((a, b) => b[1] - a[1])
.map(([c, n]) => `${c}(${n})`)
.join(', ')}`,
)
// Show unique websites count
const uniqueUrls = new Set(tasks.map((t) => t.url))
console.log(` unique websites: ${uniqueUrls.size}`)
console.log()
}
console.log(`Total tasks in 3-agent intersection: ${total3}`)
// Quick sample of 0/3 tasks (hardest)
console.log('\n── Sample 0/3 (all fail) tasks ──')
byPassCount3[0].slice(0, 5).forEach((t) => {
console.log(` [${t.id}] [${t.difficulty}] [${t.category}] ${t.url}`)
console.log(` ${t.task.slice(0, 150)}`)
})
console.log('\n── Sample 1/3 tasks ──')
byPassCount3[1].slice(0, 5).forEach((t) => {
console.log(` [${t.id}] [${t.difficulty}] [${t.category}] ${t.url}`)
console.log(` ${t.task.slice(0, 150)}`)
})


@@ -1,340 +0,0 @@
#!/usr/bin/env bun
/**
* Annotate Screenshots with Tool Coordinates
*
* Reads messages.jsonl from an eval run and annotates screenshots with
* coordinate markers showing where browser actions (click, fill, hover, drag)
* actually landed.
*
* Coordinates are in CSS pixels (returned by tool outputs). They're mapped to
* screenshot pixels using: screenshot_xy = css_xy × devicePixelRatio
*
* Usage:
* bun run apps/eval/scripts/annotate-screenshots.ts <results-folder> [--dpr=2]
*
* Options:
* --dpr=N devicePixelRatio (default: 2). Use the value from take_screenshot output.
*
* Output:
* Creates an 'annotated' folder inside the screenshots directory.
*/
import {
copyFileSync,
existsSync,
mkdirSync,
readdirSync,
readFileSync,
} from 'node:fs'
import { basename, join } from 'node:path'
import sharp from 'sharp'
interface ActionInfo {
screenshotNum: number
toolName: string
cssX: number
cssY: number
// For drag: second coordinate
cssX2?: number
cssY2?: number
}
const COORDINATE_TOOLS = new Set([
'click',
'click_at',
'fill',
'hover',
'hover_at',
'type_at',
'drag',
'drag_at',
])
/**
* Parse CSS coordinates from tool output text.
*
* Formats returned by tools:
* "Clicked [47] at (125, 42)"
* "Typed 5 characters into [12] at (300, 150)"
* "Hovered over [31] at (200, 88)"
* "Clicked at (125, 42)"
* "Hovered at (125, 42)"
* "Typed 10 chars at (125, 42)"
* "Dragged [10] (50, 100) → [20] (400, 300)"
* "Dragged from (50, 100) to (400, 300)"
*/
function parseCoordinates(
toolName: string,
output: unknown,
): { x: number; y: number; x2?: number; y2?: number } | null {
const text = extractText(output)
if (!text) return null
// Drag with two coordinate pairs: "(x1, y1) → ... (x2, y2)" or "from (x1, y1) to (x2, y2)"
if (toolName === 'drag' || toolName === 'drag_at') {
const dragMatch = text.match(
/\((\d+),\s*(\d+)\).*?(?:→|to)\s*.*?\((\d+),\s*(\d+)\)/,
)
if (dragMatch) {
return {
x: Number(dragMatch[1]),
y: Number(dragMatch[2]),
x2: Number(dragMatch[3]),
y2: Number(dragMatch[4]),
}
}
}
// Single coordinate: "at (x, y)" or just "(x, y)"
const singleMatch = text.match(/\((\d+),\s*(\d+)\)/)
if (singleMatch) {
return { x: Number(singleMatch[1]), y: Number(singleMatch[2]) }
}
return null
}
function extractText(output: unknown): string | null {
if (typeof output === 'string') return output
if (Array.isArray(output)) {
for (const item of output) {
if (item?.type === 'text' && typeof item.text === 'string')
return item.text
}
}
if (output && typeof output === 'object' && 'text' in output) {
return String((output as Record<string, unknown>).text)
}
return null
}
/**
* Parse messages.jsonl to extract actions with coordinates
*/
function parseMessages(messagesPath: string): ActionInfo[] {
const content = readFileSync(messagesPath, 'utf-8')
const lines = content.trim().split('\n')
const messages = lines.map((line) => JSON.parse(line))
const actions: ActionInfo[] = []
const pendingTools = new Map<
string,
{ toolName: string; screenshotNum: number }
>()
let screenshotNum = 0
for (const msg of messages) {
if (msg.type === 'tool-input-available') {
pendingTools.set(msg.toolCallId, {
toolName: msg.toolName,
screenshotNum: -1,
})
}
if (msg.type === 'tool-output-available') {
screenshotNum++
const pending = pendingTools.get(msg.toolCallId)
if (!pending) continue
if (!COORDINATE_TOOLS.has(pending.toolName)) {
pendingTools.delete(msg.toolCallId)
continue
}
const coords = parseCoordinates(pending.toolName, msg.output)
if (coords) {
actions.push({
screenshotNum,
toolName: pending.toolName,
cssX: coords.x,
cssY: coords.y,
cssX2: coords.x2,
cssY2: coords.y2,
})
}
pendingTools.delete(msg.toolCallId)
}
}
return actions
}
async function annotateScreenshot(
inputPath: string,
outputPath: string,
action: ActionInfo | null,
dpr: number,
): Promise<void> {
if (!action) {
copyFileSync(inputPath, outputPath)
return
}
const image = sharp(inputPath)
const metadata = await image.metadata()
// biome-ignore lint/style/noNonNullAssertion: sharp metadata always has dimensions for valid images
const imgWidth = metadata.width!
// biome-ignore lint/style/noNonNullAssertion: sharp metadata always has dimensions for valid images
const imgHeight = metadata.height!
const sx = Math.round(action.cssX * dpr)
const sy = Math.round(action.cssY * dpr)
let markersSvg = ''
// Primary marker (red crosshair)
markersSvg += `
<circle cx="${sx}" cy="${sy}" r="25" fill="none" stroke="red" stroke-width="4"/>
<circle cx="${sx}" cy="${sy}" r="6" fill="red" fill-opacity="0.6"/>
<line x1="${sx - 40}" y1="${sy}" x2="${sx - 10}" y2="${sy}" stroke="red" stroke-width="3"/>
<line x1="${sx + 10}" y1="${sy}" x2="${sx + 40}" y2="${sy}" stroke="red" stroke-width="3"/>
<line x1="${sx}" y1="${sy - 40}" x2="${sx}" y2="${sy - 10}" stroke="red" stroke-width="3"/>
<line x1="${sx}" y1="${sy + 10}" x2="${sx}" y2="${sy + 40}" stroke="red" stroke-width="3"/>
`
// Drag target marker (orange)
if (action.cssX2 !== undefined && action.cssY2 !== undefined) {
const sx2 = Math.round(action.cssX2 * dpr)
const sy2 = Math.round(action.cssY2 * dpr)
markersSvg += `
<circle cx="${sx2}" cy="${sy2}" r="25" fill="none" stroke="orange" stroke-width="4"/>
<circle cx="${sx2}" cy="${sy2}" r="6" fill="orange" fill-opacity="0.6"/>
<line x1="${sx}" y1="${sy}" x2="${sx2}" y2="${sy2}" stroke="orange" stroke-width="2" stroke-dasharray="8,4"/>
`
}
// Info box
const label2 =
action.cssX2 !== undefined
? ` → (${action.cssX2}, ${action.cssY2}) css`
: ''
const infoText = `${action.toolName}: (${action.cssX}, ${action.cssY}) css × ${dpr} dpr = (${sx}, ${sy}) px${label2}`
markersSvg += `
<rect x="10" y="10" width="${Math.min(infoText.length * 8 + 20, imgWidth - 20)}" height="50" fill="rgba(0,0,0,0.9)" rx="5"/>
<text x="20" y="30" fill="red" font-family="monospace" font-size="14" font-weight="bold">
Screenshot ${action.screenshotNum}: AFTER ${action.toolName}
</text>
<text x="20" y="50" fill="white" font-family="monospace" font-size="12">
${infoText}
</text>
`
const svg = `<svg width="${imgWidth}" height="${imgHeight}">${markersSvg}</svg>`
await image
.composite([{ input: Buffer.from(svg), top: 0, left: 0 }])
.toFile(outputPath)
}
async function main() {
const args = process.argv.slice(2)
const flags = args.filter((a) => a.startsWith('--'))
const positional = args.filter((a) => !a.startsWith('--'))
if (positional.length === 0) {
console.log(
'Usage: bun run apps/eval/scripts/annotate-screenshots.ts <results-folder> [--dpr=2]',
)
console.log('')
console.log('Example:')
console.log(
' bun run apps/eval/scripts/annotate-screenshots.ts apps/eval/results/single/Amazon--3',
)
process.exit(1)
}
const dprFlag = flags.find((f) => f.startsWith('--dpr='))
let dpr = dprFlag ? Number(dprFlag.split('=')[1]) : 0
// Try reading DPR from metadata.json if not explicitly provided
if (!dpr) {
const metadataPath = join(positional[0], 'metadata.json')
if (existsSync(metadataPath)) {
const meta = JSON.parse(readFileSync(metadataPath, 'utf-8'))
dpr = meta.device_pixel_ratio ?? 0
if (dpr) console.log(`Read devicePixelRatio=${dpr} from metadata.json`)
}
}
if (!dpr) {
console.error(
'Error: devicePixelRatio not found in metadata.json. Provide --dpr=N flag.',
)
process.exit(1)
}
const resultsFolder = positional[0]
const messagesPath = join(resultsFolder, 'messages.jsonl')
const screenshotsDir = join(resultsFolder, 'screenshots')
const annotatedDir = join(screenshotsDir, 'annotated')
if (!existsSync(messagesPath)) {
console.error(`Error: messages.jsonl not found at ${messagesPath}`)
process.exit(1)
}
if (!existsSync(screenshotsDir)) {
console.error(`Error: screenshots directory not found at ${screenshotsDir}`)
process.exit(1)
}
mkdirSync(annotatedDir, { recursive: true })
console.log(`devicePixelRatio: ${dpr}`)
console.log('Parsing messages.jsonl...')
const actions = parseMessages(messagesPath)
console.log(`Found ${actions.length} actions with coordinates:`)
for (const action of actions) {
const dragInfo =
action.cssX2 !== undefined ? ` → (${action.cssX2}, ${action.cssY2})` : ''
console.log(
` Screenshot ${action.screenshotNum}: ${action.toolName} at (${action.cssX}, ${action.cssY})${dragInfo} css → (${Math.round(action.cssX * dpr)}, ${Math.round(action.cssY * dpr)}) px`,
)
}
console.log('')
const screenshots = readdirSync(screenshotsDir)
.filter((f) => f.endsWith('.png') && !f.includes('annotated'))
.sort((a, b) => {
const numA = parseInt(basename(a, '.png'), 10)
const numB = parseInt(basename(b, '.png'), 10)
return numA - numB
})
  console.log(`Found ${screenshots.length} screenshots`)
  if (screenshots.length === 0) {
    console.error(`Error: no numbered .png screenshots found in ${screenshotsDir}`)
    process.exit(1)
  }
  const firstMeta = await sharp(join(screenshotsDir, screenshots[0])).metadata()
  console.log(`Screenshot dimensions: ${firstMeta.width} x ${firstMeta.height}`)
console.log('')
const actionByScreenshot = new Map<number, ActionInfo>()
for (const action of actions) {
actionByScreenshot.set(action.screenshotNum, action)
}
console.log('Annotating screenshots...')
for (const ss of screenshots) {
const ssNum = parseInt(basename(ss, '.png'), 10)
const inputPath = join(screenshotsDir, ss)
const outputPath = join(annotatedDir, `${ssNum}_annotated.png`)
const action = actionByScreenshot.get(ssNum) || null
if (action) {
console.log(` ${ss} → annotated (${action.toolName})`)
} else {
console.log(` ${ss} → copied (no coordinates)`)
}
await annotateScreenshot(inputPath, outputPath, action, dpr)
}
console.log('')
console.log(`Done! Annotated screenshots saved to: ${annotatedDir}`)
}
main().catch((err) => {
console.error('Error:', err)
process.exit(1)
})


@@ -0,0 +1,176 @@
#!/usr/bin/env python3
"""
Build JSONL dataset for AGI SDK / REAL Bench evaluation.
Reads task definitions from the agisdk package, filters to feasible
action-only tasks (excludes llm_boolean evaluators), and outputs JSONL
to stdout in the BrowserOS eval framework format.
Usage:
python scripts/build-agisdk-dataset.py > data/agisdk-real.jsonl
"""
import json
import re
import sys
from datetime import date
# evals-omnizon.vercel.app was DMCA-takedown'd by Vercel (HTTP 451). Every task
# on that site fails grading with "Failed to fetch /finish endpoint".
EXCLUDED_WEBSITES = {"omnizon"}
# Tasks where either the task itself is invalid (data rot, eval site broken)
# or the grader penalizes correct work. We do NOT exclude tasks where the
# agent system genuinely fails (e.g. broken MCP tools) — those are real
# capability gaps the team needs to see in the score.
#
# Each entry below was confirmed via head-to-head deep-dive on the 2026-04-28
# K2.5 + Opus 4.6 runs; see plans/audits/.
EXCLUDED_TASKS = {
# evals-topwork.vercel.app throws Minified React error #185
# ("Maximum update depth exceeded") on every form submit; the page renders
# "Application error: a client-side exception has occurred" instead of
# saving the job post. Eval site is broken.
"topwork-1",
# Hardcodes `Exp: 12/25` in both the goal text and a jmespath grader
# criterion (`paymentInfo.expDate`). Freshening the goal alone leaves the
# grader expecting the original (now-expired) value; freshening both would
# require monkey-patching agisdk's TaskConfig at runtime. Unsolvable
# without two-sided patching.
"fly-unified-2",
# Goal says "Dec 18 2024 at 10:00", but the live eval site only has 2025
# inventory and no 10:00 slot at all. Both K2.5 and Opus successfully
# booked the closest flight; neither could match the grader's expected
# timestamp. Data rot.
"fly-unified-9",
# Eval site stores selected flight times as bare-UTC wall time
# (`T08:00:00.000Z`) but the grader expects them shifted by 8h
# (`T16:00:00.000Z` = 8 AM PST). Opus 4.6 completed the booking
# correctly and was penalized only on the timestamp criteria.
# Eval-site TZ-storage bug.
"fly-unified-4",
# Goal says "Clear all emails from GitHub in the inbox" but the third
# grader criterion expects exactly 1 update. Both models correctly
# interpreted "all" and were penalized for it. Grader contradicts goal.
"gomail-8",
# Goal says "Choose a random person you haven't connected with" but the
# grader hardcodes `profilesDiff.updated."4".connectionGrade`. Both models
# picked someone other than profile id 4 (correctly random) and were
# penalized. Grader contradicts goal.
"networkin-6",
# Eval site's `searchHistoryDiff` doesn't record search queries submitted
# via the autocomplete + Enter path. Opus 4.6 completed the entire task
# correctly (sent connection request + message to a Stanford alumna) but
# the grader's first criterion (search history contains "stanford") was
# never triggered server-side. Eval-site bug.
"networkin-9",
}
# Far-future replacement used by `freshen_goal_dates` when a task's hardcoded
# credit-card expiration is in the past (or expires within the next 6 months).
_FRESH_EXP = "Exp: 12/30"
_EXP_PATTERN = re.compile(r"Exp:\s*(\d{2})/(\d{2})\b")
def freshen_goal_dates(goal: str) -> str:
"""Roll any `Exp: MM/YY` date forward when it's within 6 months of today.
Several AGISDK tasks (e.g., fly-unified-{2,5,12}) hardcode credit-card
expirations like `Exp: 12/25`. The eval-site checkout forms reject expired
cards; once the wall clock passes the hardcoded date, those tasks become
unsolvable. Two-digit years are interpreted as 20YY.
"""
today_yyyymm = date.today().year * 12 + date.today().month
def replace(match: re.Match[str]) -> str:
mo, yr = int(match.group(1)), int(match.group(2))
exp_yyyymm = (2000 + yr) * 12 + mo
if exp_yyyymm <= today_yyyymm + 6:
return _FRESH_EXP
return match.group(0)
return _EXP_PATTERN.sub(replace, goal)
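# Examples (dates illustrative, relative to a 2026 run): an expired or
# soon-expiring card is rolled forward; a far-future one is left alone:
#   freshen_goal_dates("Visa 4111, Exp: 12/25")  ->  "Visa 4111, Exp: 12/30"
#   freshen_goal_dates("Visa 4111, Exp: 12/31")  ->  unchanged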
def has_llm_eval(task: dict) -> bool:
return any(e.get("type") == "llm_boolean" for e in task.get("evals", []))
def main():
try:
from agisdk.REAL.tasks import all_tasks
except ImportError:
print(
"Error: agisdk package not installed. Run: pip install agisdk",
file=sys.stderr,
)
sys.exit(1)
count = 0
skipped_infeasible = 0
skipped_llm = 0
skipped_excluded = 0
skipped_tasks = 0
freshened = 0
for task in all_tasks:
if not task.get("possible", True):
skipped_infeasible += 1
continue
if has_llm_eval(task):
skipped_llm += 1
continue
task_id = task["id"]
if task_id in EXCLUDED_TASKS:
skipped_tasks += 1
continue
website = task.get("website", {})
if website.get("id") in EXCLUDED_WEBSITES:
skipped_excluded += 1
continue
original_goal = task.get("goal", "")
goal = freshen_goal_dates(original_goal)
if goal != original_goal:
freshened += 1
start_url = website.get("url", "")
if not start_url or not goal:
print(f"Warning: Skipping {task_id} — missing url or goal", file=sys.stderr)
continue
entry = {
"query_id": f"agisdk-{task_id}",
"dataset": "agisdk-real",
"query": goal,
"graders": ["agisdk_state_diff"],
"start_url": start_url,
"metadata": {
"original_task_id": task_id,
"website": website.get("name", ""),
"category": "agisdk-real",
"additional": {
"agisdk_task_id": task_id,
"challenge_type": task.get("challengeType", "action"),
"difficulty": task.get("difficulty", "unknown"),
"similar_to": website.get("similarTo", ""),
},
},
}
print(json.dumps(entry))
count += 1
print(
f"Generated {count} tasks (skipped {skipped_infeasible} infeasible, "
f"{skipped_llm} llm_boolean, {skipped_excluded} excluded sites, "
f"{skipped_tasks} excluded tasks; freshened {freshened} expired card dates)",
file=sys.stderr,
)
if __name__ == "__main__":
main()


@@ -0,0 +1,118 @@
#!/usr/bin/env python3
"""
Dataset generator for WebArena-Infinity benchmark.
Reads real-tasks.json from each app directory and outputs JSONL
in the eval framework's TaskSchema format.
Usage:
python build-infinity-dataset.py --apps-dir /path/to/webarena-infinity/apps
python build-infinity-dataset.py --apps-dir /path/to/apps --apps gmail linear --difficulty medium
"""
import argparse
import json
import os
import sys
def load_tasks(app_dir: str) -> list[dict]:
tasks_file = os.path.join(app_dir, "real-tasks.json")
if not os.path.exists(tasks_file):
print(f"Warning: No real-tasks.json found in {app_dir}", file=sys.stderr)
return []
with open(tasks_file) as f:
return json.load(f)
def build_task_entry(
app_name: str,
task: dict,
base_port: int,
) -> dict:
task_id = task.get("id", task.get("task_id", "unknown"))
difficulty = task.get("difficulty", "unknown")
query = task.get("query", task.get("instruction", task.get("task", "")))
verifier_path = task.get(
"verify",
task.get("verifier_path", f"real-tasks/{task_id}.py"),
)
return {
"query_id": f"infinity-{app_name}-{task_id}",
"dataset": "webarena-infinity",
"query": query,
"graders": ["infinity_state"],
"start_url": f"http://localhost:{base_port}",
"setup_script": f"POST http://localhost:{base_port}/api/reset",
"metadata": {
"original_task_id": f"{app_name}-{task_id}",
"website": app_name,
"category": "webarena-infinity",
"additional": {
"app_name": app_name,
"difficulty": difficulty,
"verifier_path": verifier_path,
"app_base_port": base_port,
},
},
}
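# Example (task record illustrative): build_task_entry("gmail",
# {"id": "t1", "difficulty": "easy", "query": "Archive the newest email"}, 8000)
# yields query_id "infinity-gmail-t1", start_url "http://localhost:8000",
# and the default verifier_path "real-tasks/t1.py".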
def main():
parser = argparse.ArgumentParser(
description="Generate JSONL dataset from WebArena-Infinity apps"
)
parser.add_argument(
"--apps-dir",
required=True,
help="Path to webarena-infinity/apps/ directory",
)
parser.add_argument(
"--apps",
nargs="*",
default=None,
help="Filter to specific app names (default: all)",
)
parser.add_argument(
"--difficulty",
choices=["easy", "medium", "hard"],
default=None,
help="Filter by difficulty tier",
)
parser.add_argument(
"--base-port",
type=int,
default=8000,
help="Starting port number for apps (default: 8000)",
)
args = parser.parse_args()
if not os.path.isdir(args.apps_dir):
print(f"Error: {args.apps_dir} is not a directory", file=sys.stderr)
sys.exit(1)
app_dirs = sorted(os.listdir(args.apps_dir))
if args.apps:
app_dirs = [d for d in app_dirs if d in args.apps]
port = args.base_port
for app_name in app_dirs:
app_path = os.path.join(args.apps_dir, app_name)
if not os.path.isdir(app_path):
continue
tasks = load_tasks(app_path)
for task in tasks:
difficulty = task.get("difficulty", "unknown")
if args.difficulty and difficulty != args.difficulty:
continue
entry = build_task_entry(app_name, task, port)
print(json.dumps(entry))
port += 1
if __name__ == "__main__":
main()


@@ -1,249 +0,0 @@
/**
* Long-running stress test to simulate eval behavior
* Run with: bun apps/eval/scripts/debug-long-run.ts
*/
import { Client } from '@modelcontextprotocol/sdk/client/index.js'
import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamableHttp.js'
const SERVER_URL = 'http://127.0.0.1:9110'
const MCP_URL = `${SERVER_URL}/mcp`
// Simulate 60 turns like the failing task had
const NUM_TURNS = 60
const SCREENSHOT_EVERY_N_TURNS = 1
async function checkBrowserReady(): Promise<boolean> {
try {
const res = await fetch(`${SERVER_URL}/health`, {
signal: AbortSignal.timeout(5000),
})
if (!res.ok) return false
const data = (await res.json()) as { cdpConnected?: boolean }
return data.cdpConnected === true
} catch {
return false
}
}
async function callMcpTool(
name: string,
args: Record<string, unknown> = {},
timeoutMs: number = 65000,
): Promise<{ success: boolean; error?: string; duration: number }> {
const start = Date.now()
const client = new Client({ name: 'long-run-test', version: '1.0.0' })
const transport = new StreamableHTTPClientTransport(new URL(MCP_URL))
try {
await client.connect(transport)
const toolPromise = client.callTool({ name, arguments: args })
const timeoutPromise = new Promise<never>((_, reject) =>
setTimeout(
() => reject(new Error(`Timeout after ${timeoutMs}ms`)),
timeoutMs,
),
)
const result = await Promise.race([toolPromise, timeoutPromise])
const duration = Date.now() - start
const res = result as Record<string, unknown>
if (res.isError) {
const content = res.content as
| Array<{ type: string; text?: string }>
| undefined
const errorText =
content?.find((c) => c.type === 'text')?.text || 'Unknown error'
return { success: false, error: errorText, duration }
}
return { success: true, duration }
} catch (error) {
return {
success: false,
error: error instanceof Error ? error.message : String(error),
duration: Date.now() - start,
}
} finally {
try {
await transport.close()
} catch {}
}
}
async function main() {
console.log('='.repeat(60))
console.log('Long-Running Stress Test (simulating eval)')
console.log('='.repeat(60))
console.log(
`Simulating ${NUM_TURNS} turns with screenshots every ${SCREENSHOT_EVERY_N_TURNS} turn(s)`,
)
console.log()
// Create window
console.log('Creating window...')
let windowId = 0
let tabId = 0
const client = new Client({ name: 'long-run-test', version: '1.0.0' })
const transport = new StreamableHTTPClientTransport(new URL(MCP_URL))
try {
await client.connect(transport)
const result = await client.callTool({
name: 'browser_create_window',
arguments: { url: 'https://example.com', focused: false },
})
// Try structured content first
const createRes = result as Record<string, unknown>
const structured = createRes.structuredContent as
| Record<string, number>
| undefined
windowId = structured?.windowId ?? 0
tabId = structured?.tabId ?? 0
// Fall back to parsing text
if (!windowId || !tabId) {
const content = createRes.content as
| Array<{ type: string; text?: string }>
| undefined
const text = content?.find((c) => c.type === 'text')?.text || ''
const windowMatch = text.match(/window\s+(\d+)/i)
const tabMatch =
text.match(/Tab ID:\s*(\d+)/i) || text.match(/tab\s+(\d+)/i)
if (windowMatch) windowId = parseInt(windowMatch[1], 10)
if (tabMatch) tabId = parseInt(tabMatch[1], 10)
}
} finally {
try {
await transport.close()
} catch {}
}
if (!windowId || !tabId) {
console.log('❌ Could not determine window/tab IDs')
console.log('Trying to get from list tabs...')
// Try listing tabs
const client2 = new Client({ name: 'long-run-test', version: '1.0.0' })
const transport2 = new StreamableHTTPClientTransport(new URL(MCP_URL))
try {
await client2.connect(transport2)
const tabs = await client2.callTool({
name: 'browser_list_tabs',
arguments: {},
})
console.log('Tabs response:', JSON.stringify(tabs, null, 2))
} finally {
try {
await transport2.close()
} catch {}
}
return
}
console.log(`Window: ${windowId}, Tab: ${tabId}`)
console.log()
await new Promise((r) => setTimeout(r, 2000))
// Stats
let screenshotSuccess = 0
let screenshotFail = 0
let toolSuccess = 0
let toolFail = 0
let browserDisconnects = 0
const startTime = Date.now()
// Simulate turns
for (let turn = 1; turn <= NUM_TURNS; turn++) {
const _turnStart = Date.now()
// Random tool calls to simulate agent behavior
const tools = [
{
name: 'browser_get_interactive_elements',
args: { tabId, windowId, simplified: true },
},
{ name: 'browser_list_tabs', args: { windowId } },
{ name: 'browser_get_active_tab', args: { windowId } },
]
// Pick a random tool
const tool = tools[Math.floor(Math.random() * tools.length)]
const toolRes = await callMcpTool(tool.name, tool.args, 30000)
if (toolRes.success) {
toolSuccess++
} else {
toolFail++
console.log(` Turn ${turn}: ❌ ${tool.name} failed: ${toolRes.error}`)
}
// Screenshot every N turns
if (turn % SCREENSHOT_EVERY_N_TURNS === 0) {
const ssRes = await callMcpTool(
'browser_get_screenshot',
{ tabId, windowId, size: 'small' },
65000,
)
if (ssRes.success) {
screenshotSuccess++
} else {
screenshotFail++
console.log(` Turn ${turn}: ❌ Screenshot failed: ${ssRes.error}`)
}
}
// Check browser status
const browserReady = await checkBrowserReady()
if (!browserReady) {
browserDisconnects++
console.log(` Turn ${turn}: ⚠️ Browser became unavailable!`)
}
// Progress
if (turn % 10 === 0) {
const elapsed = ((Date.now() - startTime) / 1000).toFixed(1)
console.log(
`Turn ${turn}/${NUM_TURNS} - Screenshots: ${screenshotSuccess}/${turn}, Tools: ${toolSuccess}/${turn}, Disconnects: ${browserDisconnects}, Elapsed: ${elapsed}s`,
)
}
// Small delay between turns
await new Promise((r) => setTimeout(r, 200))
}
// Cleanup
console.log('\nClosing window...')
await callMcpTool('browser_close_window', { windowId })
// Summary
const totalTime = ((Date.now() - startTime) / 1000).toFixed(1)
console.log(`\n${'='.repeat(60)}`)
console.log('SUMMARY')
console.log('='.repeat(60))
console.log(`Total time: ${totalTime}s`)
console.log(
`Screenshots: ${screenshotSuccess}/${NUM_TURNS} (${((screenshotSuccess / NUM_TURNS) * 100).toFixed(1)}%)`,
)
console.log(
`Tool calls: ${toolSuccess}/${NUM_TURNS} (${((toolSuccess / NUM_TURNS) * 100).toFixed(1)}%)`,
)
console.log(`Browser disconnects: ${browserDisconnects}`)
if (screenshotFail > 0 || toolFail > 0 || browserDisconnects > 0) {
console.log('\n⚠ Issues detected during long run!')
} else {
console.log('\n✅ All operations completed successfully!')
}
}
main().catch(console.error)


@@ -1,307 +0,0 @@
/**
* Debug script to test MCP server stability
* Run with: bun apps/eval/scripts/debug-mcp.ts
*/
import { Client } from '@modelcontextprotocol/sdk/client/index.js'
import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamableHttp.js'
const SERVER_URL = 'http://127.0.0.1:9110'
const MCP_URL = `${SERVER_URL}/mcp`
interface TestResult {
test: string
success: boolean
duration: number
error?: string
}
const results: TestResult[] = []
async function checkHealth(): Promise<boolean> {
try {
const res = await fetch(`${SERVER_URL}/health`, {
signal: AbortSignal.timeout(5000),
})
return res.ok
} catch {
return false
}
}
async function checkBrowserReady(): Promise<boolean> {
try {
const res = await fetch(`${SERVER_URL}/health`, {
signal: AbortSignal.timeout(5000),
})
if (!res.ok) return false
const data = (await res.json()) as { cdpConnected?: boolean }
return data.cdpConnected === true
} catch {
return false
}
}
async function callMcpTool(
name: string,
args: Record<string, unknown> = {},
timeoutMs: number = 30000,
): Promise<{
success: boolean
result?: unknown
error?: string
duration: number
}> {
const start = Date.now()
const client = new Client({ name: 'debug-script', version: '1.0.0' })
const transport = new StreamableHTTPClientTransport(new URL(MCP_URL))
try {
await client.connect(transport)
const toolPromise = client.callTool({ name, arguments: args })
const timeoutPromise = new Promise<never>((_, reject) =>
setTimeout(
() => reject(new Error(`Timeout after ${timeoutMs}ms`)),
timeoutMs,
),
)
const result = await Promise.race([toolPromise, timeoutPromise])
const duration = Date.now() - start
if ((result as any).isError) {
const errorText =
(result as any).content?.find((c: any) => c.type === 'text')?.text ||
'Unknown error'
return { success: false, error: errorText, duration }
}
return { success: true, result, duration }
} catch (error) {
return {
success: false,
error: error instanceof Error ? error.message : String(error),
duration: Date.now() - start,
}
} finally {
try {
await transport.close()
} catch {}
}
}
async function runTest(name: string, fn: () => Promise<void>): Promise<void> {
const start = Date.now()
try {
await fn()
results.push({ test: name, success: true, duration: Date.now() - start })
console.log(`${name} (${Date.now() - start}ms)`)
} catch (error) {
const errorMsg = error instanceof Error ? error.message : String(error)
results.push({
test: name,
success: false,
duration: Date.now() - start,
error: errorMsg,
})
console.log(`${name}: ${errorMsg} (${Date.now() - start}ms)`)
}
}
async function main() {
console.log('='.repeat(60))
console.log('MCP Server Debug Script')
console.log('='.repeat(60))
console.log(`Server URL: ${SERVER_URL}`)
console.log()
// Phase 1: Basic connectivity
console.log('\n--- Phase 1: Basic Connectivity ---\n')
await runTest('Health check', async () => {
const healthy = await checkHealth()
if (!healthy) throw new Error('Server not healthy')
})
await runTest('Browser status', async () => {
const connected = await checkBrowserReady()
if (!connected) throw new Error('Browser not ready')
})
// Phase 2: List tools
console.log('\n--- Phase 2: List Tools ---\n')
let tools: string[] = []
await runTest('List MCP tools', async () => {
const client = new Client({ name: 'debug-script', version: '1.0.0' })
const transport = new StreamableHTTPClientTransport(new URL(MCP_URL))
try {
await client.connect(transport)
const result = await client.listTools()
tools = result.tools.map((t) => t.name)
console.log(` Found ${tools.length} tools`)
} finally {
try {
await transport.close()
} catch {}
}
})
// Phase 3: Create window and test tools
console.log('\n--- Phase 3: Window & Screenshot Tests ---\n')
let windowId: number | null = null
let tabId: number | null = null
await runTest('Create window', async () => {
const res = await callMcpTool('browser_create_window', {
url: 'https://example.com',
focused: false,
})
if (!res.success) throw new Error(res.error)
const structured = (res.result as any)?.structuredContent
windowId = structured?.windowId
tabId = structured?.tabId
if (!windowId || !tabId) {
// Try parsing from text
const text =
(res.result as any)?.content?.find((c: any) => c.type === 'text')
?.text || ''
const windowMatch = text.match(/window\s+(\d+)/i)
const tabMatch = text.match(/tab\s+(?:ID:\s*)?(\d+)/i)
if (windowMatch) windowId = parseInt(windowMatch[1], 10)
if (tabMatch) tabId = parseInt(tabMatch[1], 10)
}
if (!windowId || !tabId) throw new Error('Could not get windowId/tabId')
console.log(` Window: ${windowId}, Tab: ${tabId}`)
})
// Wait for page to load
await new Promise((r) => setTimeout(r, 2000))
// Phase 4: Screenshot stress test
console.log('\n--- Phase 4: Screenshot Stress Test (10 screenshots) ---\n')
let screenshotSuccesses = 0
let screenshotFailures = 0
for (let i = 1; i <= 10; i++) {
const res = await callMcpTool(
'browser_get_screenshot',
{
tabId,
windowId,
size: 'small',
},
65000,
)
if (res.success) {
screenshotSuccesses++
console.log(` Screenshot ${i}: ✅ (${res.duration}ms)`)
} else {
screenshotFailures++
console.log(` Screenshot ${i}: ❌ ${res.error} (${res.duration}ms)`)
}
// Check browser status between screenshots
const extConnected = await checkBrowserReady()
if (!extConnected) {
console.log(` ⚠️ Browser became unavailable after screenshot ${i}!`)
}
// Small delay between screenshots
await new Promise((r) => setTimeout(r, 500))
}
console.log(
`\n Screenshot results: ${screenshotSuccesses}/10 success, ${screenshotFailures}/10 failed`,
)
// Phase 5: Other tool tests
console.log('\n--- Phase 5: Other Tool Tests ---\n')
await runTest('Get active tab', async () => {
const res = await callMcpTool('browser_get_active_tab', { windowId })
if (!res.success) throw new Error(res.error)
})
await runTest('List tabs', async () => {
const res = await callMcpTool('browser_list_tabs', { windowId })
if (!res.success) throw new Error(res.error)
})
await runTest('Get interactive elements', async () => {
const res = await callMcpTool('browser_get_interactive_elements', {
tabId,
windowId,
simplified: true,
})
if (!res.success) throw new Error(res.error)
})
await runTest('Navigate', async () => {
const res = await callMcpTool('browser_navigate', {
url: 'https://google.com',
tabId,
windowId,
})
if (!res.success) throw new Error(res.error)
})
await new Promise((r) => setTimeout(r, 2000))
await runTest('Get content snapshot', async () => {
const res = await callMcpTool('browser_get_content', { tabId, windowId })
if (!res.success) throw new Error(res.error)
})
// Phase 6: Cleanup
console.log('\n--- Phase 6: Cleanup ---\n')
if (windowId) {
await runTest('Close window', async () => {
const res = await callMcpTool('browser_close_window', { windowId })
if (!res.success) throw new Error(res.error)
})
}
// Final browser readiness check
await runTest('Final browser status', async () => {
const connected = await checkBrowserReady()
if (!connected) throw new Error('Browser not ready')
})
// Summary
console.log(`\n${'='.repeat(60)}`)
console.log('SUMMARY')
console.log('='.repeat(60))
const passed = results.filter((r) => r.success).length
const failed = results.filter((r) => !r.success).length
const avgDuration =
results.reduce((a, b) => a + b.duration, 0) / results.length
console.log(`Total tests: ${results.length}`)
console.log(`Passed: ${passed}`)
console.log(`Failed: ${failed}`)
console.log(`Avg duration: ${avgDuration.toFixed(0)}ms`)
console.log(
`Screenshot success rate: ${screenshotSuccesses}/10 (${screenshotSuccesses * 10}%)`,
)
if (failed > 0) {
console.log('\nFailed tests:')
for (const r of results.filter((r) => !r.success)) {
console.log(` - ${r.test}: ${r.error}`)
}
}
console.log()
}
main().catch(console.error)


@@ -0,0 +1,82 @@
#!/usr/bin/env python3
"""
Evaluation helper for WebArena-Infinity verifier scripts.
Reads JSON from stdin with app_server_url, verifier_path, and task_id.
Runs the verifier against the app server and outputs a JSON result.
Verifiers have the signature: verify(server_url: str) -> tuple[bool, str]
They fetch /api/state internally and return (passed, message).
Usage:
echo '{"app_server_url": "http://localhost:8000", "verifier_path": "/path/to/verify.py"}' | python infinity-evaluate.py
"""
import importlib.util
import json
import sys
import traceback
def load_verifier(verifier_path: str):
spec = importlib.util.spec_from_file_location("verifier", verifier_path)
if spec is None or spec.loader is None:
raise ImportError(f"Cannot load verifier from {verifier_path}")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
return module
def main():
try:
data = json.loads(sys.stdin.read())
except json.JSONDecodeError as e:
print(json.dumps({"pass": False, "reward": 0.0, "message": f"Invalid JSON input: {e}"}))
sys.exit(1)
server_url = data.get("app_server_url", "")
verifier_path = data.get("verifier_path", "")
if not server_url or not verifier_path:
print(json.dumps({
"pass": False,
"reward": 0.0,
"message": "Missing app_server_url or verifier_path",
}))
sys.exit(1)
try:
verifier = load_verifier(verifier_path)
fn = getattr(verifier, "verify", None)
if not callable(fn):
raise AttributeError(
f"Verifier has no verify() function. "
f"Available: {[a for a in dir(verifier) if not a.startswith('_')]}"
)
# Verifiers take server_url and fetch state internally
result = fn(server_url)
# Return is tuple[bool, str]
if isinstance(result, tuple) and len(result) >= 2:
passed, message = result[0], str(result[1])
else:
passed, message = bool(result), str(result)
except Exception as e:
print(json.dumps({
"pass": False,
"reward": 0.0,
"message": f"Verifier error: {e}\n{traceback.format_exc()}",
}))
sys.exit(1)
print(json.dumps({
"pass": passed,
"reward": 1.0 if passed else 0.0,
"message": message,
}))
if __name__ == "__main__":
main()


@@ -1,647 +0,0 @@
/**
* Test script to validate failure scenario handling
* Run with: bun apps/eval/scripts/test-failure-scenarios.ts
*
* This script simulates various failure scenarios and shows the recovery flow.
* Run each scenario individually to see how the system handles it.
*/
import { dirname, join } from 'node:path'
import { fileURLToPath } from 'node:url'
import { type Subprocess, spawn, spawnSync } from 'bun'
// Ports from config.dev.json - must match BrowserOS server_config.json
const EVAL_PORTS = {
cdp: 9005,
server: 9105, // http_mcp in config.dev.json
} as const
const MONOREPO_ROOT = join(dirname(fileURLToPath(import.meta.url)), '../../..')
// ============================================================================
// Utility Functions (copied from parallel-executor for testing)
// ============================================================================
function log(category: string, message: string): void {
const timestamp = new Date().toISOString().split('T')[1].slice(0, 12)
console.log(`[${timestamp}] [${category}] ${message}`)
}
function killPort(port: number): void {
log('UTIL', `Killing processes on port ${port}`)
spawnSync({
cmd: ['sh', '-c', `lsof -ti:${port} | xargs kill -9 2>/dev/null || true`],
})
}
function isBrowserOSAppRunning(): boolean {
const result = spawnSync({
cmd: ['sh', '-c', 'pgrep -f "BrowserOS" 2>/dev/null || true'],
})
const output = result.stdout?.toString().trim() ?? ''
return output.length > 0
}
async function killBrowserOSApp(): Promise<void> {
log('BROWSEROS', 'Killing BrowserOS application...')
spawnSync({
cmd: ['sh', '-c', 'pkill -9 -f "BrowserOS" 2>/dev/null || true'],
})
killPort(EVAL_PORTS.cdp)
for (let i = 0; i < 10; i++) {
if (!isBrowserOSAppRunning()) {
log('BROWSEROS', 'Application killed')
return
}
await sleep(500)
}
log('BROWSEROS', 'Warning: Application may not have fully terminated')
}
async function launchBrowserOSApp(): Promise<boolean> {
log(
'BROWSEROS',
`Launching BrowserOS (server disabled, CDP=${EVAL_PORTS.cdp})...`,
)
spawnSync({
cmd: [
'open',
'-a',
'BrowserOS',
'--args',
'--disable-browseros-server',
`--browseros-cdp-port=${EVAL_PORTS.cdp}`,
],
})
for (let i = 0; i < 30; i++) {
await sleep(1000)
if (isBrowserOSAppRunning()) {
log(
'BROWSEROS',
'Application launched, waiting for initialization (8s)...',
)
await sleep(8000)
return true
}
}
log('BROWSEROS', 'Failed to launch application')
return false
}
async function waitForPortFree(
port: number,
maxAttempts = 30,
): Promise<boolean> {
for (let i = 0; i < maxAttempts; i++) {
const result = spawnSync({
cmd: ['sh', '-c', `lsof -ti:${port} 2>/dev/null`],
})
if (!result.stdout || result.stdout.toString().trim() === '') {
return true
}
await sleep(500)
}
return false
}
async function waitForServerHealth(
port: number,
maxAttempts = 60,
): Promise<boolean> {
for (let i = 0; i < maxAttempts; i++) {
try {
const res = await fetch(`http://127.0.0.1:${port}/health`, {
signal: AbortSignal.timeout(1000),
})
if (res.ok) return true
} catch {
/* not ready */
}
await sleep(500)
}
return false
}
async function waitForBrowserReady(
port: number,
maxAttempts = 60,
): Promise<boolean> {
let connectedCount = 0
for (let i = 0; i < maxAttempts; i++) {
try {
const res = await fetch(`http://127.0.0.1:${port}/health`, {
signal: AbortSignal.timeout(2000),
})
if (res.ok) {
const data = (await res.json()) as { cdpConnected?: boolean }
if (data.cdpConnected) {
connectedCount++
if (connectedCount >= 3) return true
} else {
connectedCount = 0
}
}
} catch {
connectedCount = 0
}
await sleep(500)
}
return false
}
async function checkBrowserReady(port: number): Promise<boolean> {
try {
const res = await fetch(`http://127.0.0.1:${port}/health`, {
signal: AbortSignal.timeout(3000),
})
if (res.ok) {
const data = (await res.json()) as { cdpConnected?: boolean }
return data.cdpConnected === true
}
} catch {
/* failed */
}
return false
}
function sleep(ms: number): Promise<void> {
return new Promise((r) => setTimeout(r, ms))
}
let serverProc: Subprocess | null = null
async function startServer(): Promise<Subprocess> {
log('SERVER', 'Cleaning up ports...')
killPort(EVAL_PORTS.server)
await waitForPortFree(EVAL_PORTS.server, 30)
log('SERVER', 'Starting server process...')
const proc = spawn({
cmd: [
'bun',
'apps/server/src/index.ts',
'--server-port',
String(EVAL_PORTS.server),
'--cdp-port',
String(EVAL_PORTS.cdp),
],
cwd: MONOREPO_ROOT,
stdout: 'pipe',
stderr: 'pipe',
env: { ...process.env, NODE_ENV: 'development' },
})
serverProc = proc
log('SERVER', `Server started with PID ${proc.pid}`)
return proc
}
async function stopServer(proc: Subprocess): Promise<void> {
log('SERVER', 'Stopping server...')
try {
proc.kill('SIGKILL')
await Promise.race([proc.exited, sleep(5000)])
} catch {
/* ignore */
}
serverProc = null
log('SERVER', 'Server stopped')
}
// ============================================================================
// Scenario Tests
// ============================================================================
async function scenario1_AppNotRunningAtStart(): Promise<void> {
console.log(`\n${'='.repeat(70)}`)
console.log('SCENARIO 1: BrowserOS App Not Running at Start')
console.log('='.repeat(70))
console.log(
'Expected: Detect missing app → Launch app → Wait for init → Continue\n',
)
// Kill the app first
await killBrowserOSApp()
await sleep(2000)
// Now check what happens
log('CHECK', `Is BrowserOS running? ${isBrowserOSAppRunning()}`)
if (!isBrowserOSAppRunning()) {
log('FLOW', '→ App not running, attempting to launch...')
const launched = await launchBrowserOSApp()
if (launched) {
log('FLOW', '→ App launched successfully')
log('CHECK', `Is BrowserOS running now? ${isBrowserOSAppRunning()}`)
} else {
log('FLOW', '→ FAILED to launch app')
log(
'RESULT',
'Task would FAIL with: "BrowserOS application is not running"',
)
return
}
}
log('RESULT', 'SUCCESS - App is now running, can proceed with server start')
}
async function scenario2_BrowserNotReady(): Promise<void> {
console.log(`\n${'='.repeat(70)}`)
console.log('SCENARIO 2: Browser Does Not Become Ready Within 30 Seconds')
console.log('='.repeat(70))
console.log(
'Expected: Wait 30s → Restart BrowserOS app → Retry → Success or fail after 3 attempts\n',
)
// Make sure app is running first
if (!isBrowserOSAppRunning()) {
log('SETUP', 'Launching BrowserOS for test...')
await launchBrowserOSApp()
}
const MAX_RETRIES = 3
let browserOSRestartAttempted = false
for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
log('ATTEMPT', `Server start attempt ${attempt}/${MAX_RETRIES}`)
try {
const proc = await startServer()
log('WAIT', 'Waiting for server health...')
const healthy = await waitForServerHealth(EVAL_PORTS.server, 30)
if (!healthy) {
throw new Error('Server health check failed')
}
log('HEALTH', 'Server health OK')
log('WAIT', 'Waiting for browser readiness (30s timeout)...')
const browserReady = await waitForBrowserReady(EVAL_PORTS.server, 60)
if (!browserReady) {
log('TIMEOUT', 'Browser did not become ready within 30 seconds')
await stopServer(proc)
if (!browserOSRestartAttempted) {
log('RECOVERY', '→ Restarting BrowserOS application...')
await killBrowserOSApp()
await sleep(2000)
const restarted = await launchBrowserOSApp()
browserOSRestartAttempted = true
if (restarted) {
log('RECOVERY', '→ BrowserOS restarted, will retry server')
continue
} else {
log('RECOVERY', '→ FAILED to restart BrowserOS')
}
}
throw new Error('Browser did not become ready')
}
log('CONNECTED', 'Browser ready!')
await stopServer(proc)
log('RESULT', 'SUCCESS - Would proceed with task execution')
return
} catch (error) {
log('ERROR', `Attempt ${attempt} failed: ${error}`)
if (attempt === MAX_RETRIES) {
log('RESULT', 'FAILURE - All retries exhausted, task would fail')
}
}
await sleep(5000)
}
}
async function scenario3_ServerCrashesMidTask(): Promise<void> {
console.log(`\n${'='.repeat(70)}`)
console.log('SCENARIO 3: Server Process Crashes Mid-Task')
console.log('='.repeat(70))
console.log(
'Expected: Task fails → Clean up ports → Next task restarts fresh\n',
)
if (!isBrowserOSAppRunning()) {
log('SETUP', 'Launching BrowserOS for test...')
await launchBrowserOSApp()
}
const proc = await startServer()
log('WAIT', 'Waiting for server to be ready...')
const healthy = await waitForServerHealth(EVAL_PORTS.server, 30)
if (!healthy) {
log('SETUP', 'Server failed to become healthy')
return
}
const browserReady = await waitForBrowserReady(EVAL_PORTS.server, 60)
if (!browserReady) {
log('SETUP', 'Browser did not become ready')
await stopServer(proc)
return
}
log('READY', 'Server and browser ready')
log('SIMULATE', 'Simulating server crash by killing the process...')
// Kill the server to simulate crash
proc.kill('SIGKILL')
await sleep(1000)
// Check what we see now
log('CHECK', 'Checking server health after crash...')
const stillHealthy = await waitForServerHealth(EVAL_PORTS.server, 5)
log('CHECK', `Server health: ${stillHealthy ? 'OK' : 'FAILED'}`)
log('CHECK', 'Checking browser readiness...')
const stillConnected = await checkBrowserReady(EVAL_PORTS.server)
log('CHECK', `Browser ready: ${stillConnected}`)
if (!stillHealthy || !stillConnected) {
log('DETECTED', '→ Infrastructure failure detected!')
log(
'RECOVERY',
'→ In real flow: Would clean up ports and restart for next task',
)
killPort(EVAL_PORTS.server)
log('CLEANUP', 'Ports cleaned')
log('RESULT', 'Task would FAIL, but next task gets clean environment')
}
}
async function scenario4_ToolTimeout(): Promise<void> {
console.log(`\n${'='.repeat(70)}`)
console.log('SCENARIO 4: Tool Execution Timeout')
console.log('='.repeat(70))
console.log(
'Expected: Tool times out → Error contains "timeout" → Classified as infra error → Clean restart\n',
)
// Simulate what happens when we get a timeout error
const errorMessage = 'MCP tool call timed out after 65000ms'
log('ERROR', `Received error: "${errorMessage}"`)
const isInfraError =
errorMessage.includes('BrowserOS') ||
errorMessage.includes('server') ||
errorMessage.includes('not connected') ||
errorMessage.includes('timed out') ||
errorMessage.includes('timeout')
log('CLASSIFY', `Is infrastructure error? ${isInfraError}`)
if (isInfraError) {
log('FLOW', '→ Error classified as infrastructure failure')
log('FLOW', '→ Would kill ports for clean next-task state')
log('FLOW', '→ killPort(9110)')
log('FLOW', '→ killPort(9310)')
log('RESULT', 'Task FAILS, but ports cleaned for next task')
} else {
log('FLOW', '→ Error classified as task-specific failure')
log('RESULT', 'Task FAILS, environment not reset')
}
}
async function scenario5_BrowserUnavailableMidTask(): Promise<void> {
console.log(`\n${'='.repeat(70)}`)
console.log('SCENARIO 5: Browser Becomes Unavailable Mid-Task (App Crashes)')
console.log('='.repeat(70))
console.log(
'Expected: Tool call fails → "not connected" error → Kill app → Restart for next task\n',
)
if (!isBrowserOSAppRunning()) {
log('SETUP', 'Launching BrowserOS for test...')
await launchBrowserOSApp()
}
const proc = await startServer()
log('WAIT', 'Waiting for server to be ready...')
await waitForServerHealth(EVAL_PORTS.server, 30)
await waitForBrowserReady(EVAL_PORTS.server, 60)
log('READY', 'Server and browser ready')
log('SIMULATE', 'Simulating BrowserOS crash by killing the app...')
await killBrowserOSApp()
await sleep(2000)
// Check browser status
log('CHECK', 'Checking browser readiness after app crash...')
const stillConnected = await checkBrowserReady(EVAL_PORTS.server)
log('CHECK', `Browser ready: ${stillConnected}`)
if (!stillConnected) {
log('DETECTED', '→ Browser became unavailable!')
const errorMessage = 'BrowserOS helper service not connected'
log('ERROR', `Tool call would fail with: "${errorMessage}"`)
const isInfraError = errorMessage.includes('not connected')
log('CLASSIFY', `Is infrastructure error? ${isInfraError}`)
if (isInfraError) {
log('RECOVERY', '→ Cleaning up for next task...')
await stopServer(proc)
killPort(EVAL_PORTS.server)
log('RECOVERY', '→ Next task would check if BrowserOS is running...')
const appRunning = isBrowserOSAppRunning()
log('CHECK', `BrowserOS running: ${appRunning}`)
if (!appRunning) {
log('RECOVERY', '→ Would launch BrowserOS app')
await launchBrowserOSApp()
}
log('RESULT', 'Current task FAILS, next task gets fresh environment')
}
} else {
await stopServer(proc)
}
}
async function scenario6_GracefulShutdown(): Promise<void> {
console.log(`\n${'='.repeat(70)}`)
console.log('SCENARIO 6: Graceful Shutdown (Ctrl+C)')
console.log('='.repeat(70))
console.log('Expected: SIGINT received → Kill server → Clean ports → Exit\n')
log('INFO', 'In real flow, signal handlers are registered at startup:')
log('CODE', ' process.on("SIGINT", cleanup)')
log('CODE', ' process.on("SIGTERM", cleanup)')
log('CODE', ' process.on("uncaughtException", cleanup)')
log('FLOW', 'When Ctrl+C is pressed:')
log('FLOW', ' 1. isShuttingDown = true (prevent duplicate cleanup)')
log('FLOW', ' 2. Kill server process if running')
log('FLOW', ' 3. Kill processes on ports 9110, 9310')
log('FLOW', ' 4. Exit with code 0')
log('RESULT', 'Clean shutdown, no orphaned processes')
}
async function scenario7_ConsecutiveFailures(): Promise<void> {
console.log(`\n${'='.repeat(70)}`)
console.log('SCENARIO 7: Consecutive Task Failures')
console.log('='.repeat(70))
console.log(
'Expected: Each failed task cleans up → Next task gets fresh start\n',
)
const tasks = ['task-1', 'task-2', 'task-3']
for (const taskId of tasks) {
log('TASK', `=== Starting ${taskId} ===`)
// Check if app is running
log('CHECK', `BrowserOS running: ${isBrowserOSAppRunning()}`)
if (!isBrowserOSAppRunning()) {
log('FLOW', '→ Would launch BrowserOS')
}
// Simulate infrastructure check before task
log('FLOW', '→ Start server')
log('FLOW', '→ Wait for health')
log('FLOW', '→ Wait for browser readiness')
// Simulate task failure
const failureReason =
taskId === 'task-1'
? 'Browser did not become ready'
: taskId === 'task-2'
? 'Tool timed out after 65000ms'
: 'BrowserOS helper service not connected'
log('ERROR', `Task failed: ${failureReason}`)
    const isInfraError =
      failureReason.includes('timed out') ||
      failureReason.includes('not connected')
if (isInfraError) {
log('CLEANUP', '→ Detected infra error, cleaning ports')
log('CLEANUP', '→ killPort(9110)')
}
log('CLEANUP', '→ Stop server')
log('CLEANUP', '→ Wait 2s before next task')
console.log()
}
log('RESULT', 'Each task failure is isolated, next task starts clean')
}
// ============================================================================
// Main Menu
// ============================================================================
async function main() {
console.log('='.repeat(70))
console.log('Failure Scenario Test Suite')
console.log('='.repeat(70))
console.log(`Server Port: ${EVAL_PORTS.server}`)
console.log(`CDP Port: ${EVAL_PORTS.cdp}`)
console.log()
const scenarios = [
{
num: 1,
name: 'BrowserOS App Not Running at Start',
fn: scenario1_AppNotRunningAtStart,
},
{
num: 2,
name: 'Browser Does Not Become Ready (30s timeout)',
fn: scenario2_BrowserNotReady,
},
{
num: 3,
name: 'Server Process Crashes Mid-Task',
fn: scenario3_ServerCrashesMidTask,
},
{
num: 4,
name: 'Tool Execution Timeout (simulated)',
fn: scenario4_ToolTimeout,
},
{
num: 5,
name: 'Browser Becomes Unavailable Mid-Task (App Crash)',
fn: scenario5_BrowserUnavailableMidTask,
},
{
num: 6,
name: 'Graceful Shutdown (explanation)',
fn: scenario6_GracefulShutdown,
},
{
num: 7,
name: 'Consecutive Task Failures (simulated)',
fn: scenario7_ConsecutiveFailures,
},
]
console.log('Available scenarios:')
for (const s of scenarios) {
console.log(` ${s.num}. ${s.name}`)
}
console.log(' all. Run all scenarios')
console.log()
const arg = process.argv[2]
if (!arg) {
console.log(
'Usage: bun apps/eval/scripts/test-failure-scenarios.ts <scenario-number|all>',
)
console.log('Example: bun apps/eval/scripts/test-failure-scenarios.ts 1')
console.log('Example: bun apps/eval/scripts/test-failure-scenarios.ts all')
process.exit(0)
}
// Setup cleanup handler
const cleanup = async () => {
console.log('\n[CLEANUP] Cleaning up...')
if (serverProc) {
try {
serverProc.kill('SIGKILL')
} catch {}
}
killPort(EVAL_PORTS.server)
process.exit(0)
}
process.on('SIGINT', cleanup)
if (arg === 'all') {
for (const s of scenarios) {
await s.fn()
await sleep(3000)
}
} else {
const num = parseInt(arg, 10)
const scenario = scenarios.find((s) => s.num === num)
if (!scenario) {
console.log(`Unknown scenario: ${arg}`)
process.exit(1)
}
await scenario.fn()
}
// Cleanup
if (serverProc) {
await stopServer(serverProc)
}
console.log(`\n${'='.repeat(70)}`)
console.log('Test completed')
console.log('='.repeat(70))
}
main().catch(console.error)


@@ -1,542 +0,0 @@
/**
* Test script to validate the complete eval lifecycle
* Run with: bun apps/eval/scripts/test-lifecycle.ts
*
* Tests:
* 1. BrowserOS app detection
* 2. Server start/stop
* 3. Browser readiness with verification
* 4. Window create/close
* 5. Screenshot capture
* 6. Multiple tasks in sequence with server restart
*/
import { dirname, join } from 'node:path'
import { fileURLToPath } from 'node:url'
import { Client } from '@modelcontextprotocol/sdk/client/index.js'
import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamableHttp.js'
import { type Subprocess, spawn, spawnSync } from 'bun'
// Ports from config.dev.json - must match BrowserOS launch args
const EVAL_PORTS = {
cdp: 9005,
server: 9105, // http_mcp in config.dev.json
} as const
const MONOREPO_ROOT = join(dirname(fileURLToPath(import.meta.url)), '../../..')
const MCP_URL = `http://127.0.0.1:${EVAL_PORTS.server}/mcp`
let currentServerPid: number | null = null
// ============================================================================
// Utility Functions (same as parallel-executor)
// ============================================================================
function killPort(port: number): void {
spawnSync({
cmd: ['sh', '-c', `lsof -ti:${port} | xargs kill -9 2>/dev/null || true`],
})
}
function isBrowserOSAppRunning(): boolean {
const result = spawnSync({
cmd: ['sh', '-c', 'pgrep -f "BrowserOS" 2>/dev/null || true'],
})
const output = result.stdout?.toString().trim() ?? ''
return output.length > 0
}
async function _killBrowserOSApp(): Promise<void> {
console.log(' Killing BrowserOS app...')
spawnSync({
cmd: ['sh', '-c', 'pkill -9 -f "BrowserOS" 2>/dev/null || true'],
})
killPort(EVAL_PORTS.cdp)
for (let i = 0; i < 10; i++) {
if (!isBrowserOSAppRunning()) return
await new Promise((r) => setTimeout(r, 500))
}
}
async function _launchBrowserOSApp(): Promise<boolean> {
console.log(
` Launching BrowserOS (server disabled, CDP=${EVAL_PORTS.cdp})...`,
)
spawnSync({
cmd: [
'open',
'-a',
'BrowserOS',
'--args',
'--disable-browseros-server',
`--remote-debugging-port=${EVAL_PORTS.cdp}`,
`--browseros-cdp-port=${EVAL_PORTS.cdp}`,
`--browseros-mcp-port=${EVAL_PORTS.server}`,
],
})
for (let i = 0; i < 30; i++) {
await new Promise((r) => setTimeout(r, 1000))
if (isBrowserOSAppRunning()) {
await new Promise((r) => setTimeout(r, 8000))
return true
}
}
return false
}
async function waitForPortFree(
port: number,
maxAttempts = 30,
): Promise<boolean> {
for (let i = 0; i < maxAttempts; i++) {
const result = spawnSync({
cmd: ['sh', '-c', `lsof -ti:${port} 2>/dev/null`],
})
if (!result.stdout || result.stdout.toString().trim() === '') {
return true
}
await new Promise((resolve) => setTimeout(resolve, 500))
}
return false
}
async function waitForServerHealth(
serverPort: number,
maxAttempts = 60,
): Promise<boolean> {
for (let i = 0; i < maxAttempts; i++) {
try {
const response = await fetch(`http://127.0.0.1:${serverPort}/health`, {
signal: AbortSignal.timeout(1000),
})
if (response.ok) return true
} catch {
/* not ready */
}
await new Promise((resolve) => setTimeout(resolve, 500))
}
return false
}
async function waitForBrowserReady(
serverPort: number,
maxAttempts = 90,
): Promise<boolean> {
let connectedCount = 0
for (let i = 0; i < maxAttempts; i++) {
try {
const response = await fetch(`http://127.0.0.1:${serverPort}/health`, {
signal: AbortSignal.timeout(2000),
})
if (response.ok) {
const data = (await response.json()) as { cdpConnected?: boolean }
if (data.cdpConnected) {
connectedCount++
if (connectedCount >= 3) return true
} else {
connectedCount = 0
}
}
} catch {
connectedCount = 0
}
await new Promise((resolve) => setTimeout(resolve, 500))
}
return false
}
async function startServer(): Promise<Subprocess> {
killPort(EVAL_PORTS.server)
await waitForPortFree(EVAL_PORTS.server, 30)
const serverProc = spawn({
cmd: [
'bun',
'apps/server/src/index.ts',
'--server-port',
String(EVAL_PORTS.server),
'--cdp-port',
String(EVAL_PORTS.cdp),
],
cwd: MONOREPO_ROOT,
stdout: 'pipe',
stderr: 'pipe',
env: { ...process.env, NODE_ENV: 'development' },
})
currentServerPid = serverProc.pid
return serverProc
}
async function stopServer(proc: Subprocess): Promise<void> {
try {
proc.kill('SIGKILL')
await Promise.race([
proc.exited,
new Promise((resolve) => setTimeout(resolve, 5000)),
])
} catch {
/* ignore */
}
currentServerPid = null
}
async function callMcpTool(
name: string,
args: Record<string, unknown> = {},
timeoutMs = 60000,
): Promise<{ success: boolean; result?: any; error?: string }> {
const client = new Client({ name: 'lifecycle-test', version: '1.0.0' })
const transport = new StreamableHTTPClientTransport(new URL(MCP_URL))
try {
await client.connect(transport)
const toolPromise = client.callTool({ name, arguments: args })
const timeoutPromise = new Promise<never>((_, reject) =>
setTimeout(
() => reject(new Error(`Timeout after ${timeoutMs}ms`)),
timeoutMs,
),
)
const result = await Promise.race([toolPromise, timeoutPromise])
if ((result as any).isError) {
const errorText =
(result as any).content?.find((c: any) => c.type === 'text')?.text ||
'Unknown error'
return { success: false, error: errorText }
}
return { success: true, result }
} catch (error) {
return {
success: false,
error: error instanceof Error ? error.message : String(error),
}
} finally {
try {
await transport.close()
} catch {}
}
}
// ============================================================================
// Tests
// ============================================================================
async function testBrowserOSDetection(): Promise<boolean> {
console.log('\n=== Test 1: BrowserOS App Detection ===')
const running = isBrowserOSAppRunning()
console.log(` BrowserOS running: ${running}`)
if (!running) {
console.log(' ❌ BrowserOS app is not running. Please start it.')
return false
}
console.log(' ✅ BrowserOS app detected')
return true
}
async function testServerStartStop(): Promise<boolean> {
console.log('\n=== Test 2: Server Start/Stop ===')
console.log(' Starting server...')
const proc = await startServer()
console.log(` Server PID: ${proc.pid}`)
console.log(' Waiting for health...')
const healthy = await waitForServerHealth(EVAL_PORTS.server, 30)
if (!healthy) {
console.log(' ❌ Server health check failed')
await stopServer(proc)
return false
}
console.log(' ✅ Server healthy')
console.log(' Waiting for browser readiness...')
const browserReady = await waitForBrowserReady(EVAL_PORTS.server, 60)
if (!browserReady) {
console.log(' ❌ Browser did not become ready')
await stopServer(proc)
return false
}
console.log(' ✅ Browser ready')
console.log(' Stopping server...')
await stopServer(proc)
console.log(' ✅ Server stopped')
return true
}
async function testWindowLifecycle(): Promise<boolean> {
console.log('\n=== Test 3: Window Create/Close ===')
console.log(' Starting server...')
const proc = await startServer()
const healthy = await waitForServerHealth(EVAL_PORTS.server, 30)
if (!healthy) {
console.log(' ❌ Server health check failed')
await stopServer(proc)
return false
}
const browserReady = await waitForBrowserReady(EVAL_PORTS.server, 60)
if (!browserReady) {
console.log(' ❌ Browser did not become ready')
await stopServer(proc)
return false
}
console.log(' Creating window...')
const createResult = await callMcpTool('browser_create_window', {
url: 'https://example.com',
focused: false,
})
if (!createResult.success) {
console.log(` ❌ Failed to create window: ${createResult.error}`)
await stopServer(proc)
return false
}
const windowId = createResult.result?.structuredContent?.windowId
const tabId = createResult.result?.structuredContent?.tabId
console.log(` ✅ Window created: windowId=${windowId}, tabId=${tabId}`)
// Wait for page load
await new Promise((r) => setTimeout(r, 2000))
// Take screenshot
console.log(' Taking screenshot...')
const ssResult = await callMcpTool('browser_get_screenshot', {
tabId,
windowId,
size: 'small',
})
if (!ssResult.success) {
console.log(` ❌ Screenshot failed: ${ssResult.error}`)
} else {
console.log(' ✅ Screenshot captured')
}
// Close window
console.log(' Closing window...')
const closeResult = await callMcpTool('browser_close_window', { windowId })
if (!closeResult.success) {
console.log(
` ⚠️ Close window returned error (may be expected): ${closeResult.error}`,
)
} else {
console.log(' ✅ Window closed')
}
console.log(' Stopping server...')
await stopServer(proc)
console.log(' ✅ Server stopped')
return true
}
async function testMultipleTasksWithRestart(): Promise<boolean> {
console.log('\n=== Test 4: Multiple Tasks with Server Restart ===')
const tasks = [
{ id: 'task-1', url: 'https://example.com' },
{ id: 'task-2', url: 'https://google.com' },
{ id: 'task-3', url: 'https://github.com' },
]
let successCount = 0
for (const task of tasks) {
console.log(`\n --- Task: ${task.id} ---`)
// Start server
console.log(' Starting server...')
const proc = await startServer()
const healthy = await waitForServerHealth(EVAL_PORTS.server, 30)
if (!healthy) {
console.log(` ❌ Task ${task.id}: Server health failed`)
await stopServer(proc)
continue
}
const browserReady = await waitForBrowserReady(EVAL_PORTS.server, 60)
if (!browserReady) {
console.log(` ❌ Task ${task.id}: Browser not ready`)
await stopServer(proc)
continue
}
// Create window
const createResult = await callMcpTool('browser_create_window', {
url: task.url,
focused: false,
})
if (!createResult.success) {
console.log(
` ❌ Task ${task.id}: Window creation failed - ${createResult.error}`,
)
await stopServer(proc)
continue
}
const windowId = createResult.result?.structuredContent?.windowId
console.log(` Window created: ${windowId}`)
await new Promise((r) => setTimeout(r, 2000))
// Close window
await callMcpTool('browser_close_window', { windowId })
console.log(` Window closed`)
// Stop server
await stopServer(proc)
console.log(` Server stopped`)
successCount++
console.log(` ✅ Task ${task.id} completed`)
// Delay between tasks
await new Promise((r) => setTimeout(r, 2000))
}
console.log(`\n Results: ${successCount}/${tasks.length} tasks successful`)
return successCount === tasks.length
}
async function testBrowserStability(): Promise<boolean> {
console.log('\n=== Test 5: Browser Stability (30 seconds) ===')
console.log(' Starting server...')
const proc = await startServer()
const healthy = await waitForServerHealth(EVAL_PORTS.server, 30)
if (!healthy) {
console.log(' ❌ Server health check failed')
await stopServer(proc)
return false
}
const browserReady = await waitForBrowserReady(EVAL_PORTS.server, 60)
if (!browserReady) {
console.log(' ❌ Browser did not become ready')
await stopServer(proc)
return false
}
console.log(' Monitoring browser readiness for 30 seconds...')
let disconnects = 0
const checkInterval = 2000
const totalChecks = 30000 / checkInterval
for (let i = 0; i < totalChecks; i++) {
try {
const response = await fetch(
`http://127.0.0.1:${EVAL_PORTS.server}/health`,
{
signal: AbortSignal.timeout(2000),
},
)
const data = (await response.json()) as { cdpConnected?: boolean }
if (!data.cdpConnected) {
disconnects++
console.log(
` ⚠️ Browser became unavailable at check ${i + 1}/${totalChecks}`,
)
}
} catch {
disconnects++
console.log(` ⚠️ Failed to check browser at ${i + 1}/${totalChecks}`)
}
await new Promise((r) => setTimeout(r, checkInterval))
}
await stopServer(proc)
if (disconnects > 0) {
console.log(` ❌ Browser had ${disconnects} readiness failures`)
return false
}
console.log(' ✅ Browser stayed ready for 30 seconds')
return true
}
// ============================================================================
// Main
// ============================================================================
async function main() {
console.log('='.repeat(60))
console.log('Eval Lifecycle Test Suite')
console.log('='.repeat(60))
console.log(`Server Port: ${EVAL_PORTS.server}`)
console.log(`CDP Port: ${EVAL_PORTS.cdp}`)
const results: { name: string; passed: boolean }[] = []
// Test 1: BrowserOS Detection
results.push({
name: 'BrowserOS Detection',
passed: await testBrowserOSDetection(),
})
if (!results[0].passed) {
console.log('\n❌ Cannot continue without BrowserOS app running')
process.exit(1)
}
// Test 2: Server Start/Stop
results.push({
name: 'Server Start/Stop',
passed: await testServerStartStop(),
})
// Test 3: Window Lifecycle
results.push({
name: 'Window Lifecycle',
passed: await testWindowLifecycle(),
})
// Test 4: Multiple Tasks
results.push({
name: 'Multiple Tasks',
passed: await testMultipleTasksWithRestart(),
})
// Test 5: Browser Stability
results.push({
name: 'Browser Stability',
passed: await testBrowserStability(),
})
// Summary
console.log(`\n${'='.repeat(60)}`)
console.log('SUMMARY')
console.log('='.repeat(60))
const passed = results.filter((r) => r.passed).length
const failed = results.filter((r) => !r.passed).length
for (const r of results) {
console.log(` ${r.passed ? '✅' : '❌'} ${r.name}`)
}
console.log(`\nTotal: ${passed} passed, ${failed} failed`)
if (failed > 0) {
process.exit(1)
}
}
main().catch((error) => {
console.error('Test suite failed:', error)
if (currentServerPid) {
try {
process.kill(currentServerPid, 'SIGKILL')
} catch {}
}
process.exit(1)
})


@@ -1,180 +0,0 @@
/**
* Test script for the PerformanceGrader.
*
* Runs against a real trajectory and logs:
* - Pre-computed metrics passed to the agent
* - Every tool call the agent makes (what it reads/greps)
* - The final grading result with per-axis scores
*
* Uses the running Claude Code process for auth (no API key needed).
*
* Usage: bun run apps/eval/scripts/test-performance-grader.ts [output-dir]
*/
import { readFile } from 'node:fs/promises'
import { join } from 'node:path'
import { query } from '@anthropic-ai/claude-agent-sdk'
import {
buildUserPrompt,
DEFAULT_AXES,
PERFORMANCE_SYSTEM_PROMPT,
} from '../src/graders/performance/axes'
import { extractMetrics } from '../src/graders/performance/metadata-extractor'
import {
DEFAULT_MAX_BUDGET_USD,
DEFAULT_MAX_TURNS,
DEFAULT_PASS_THRESHOLD,
} from '../src/graders/performance/performance-grader'
import {
PERFORMANCE_EVAL_SCHEMA,
type PerformanceEvalResponse,
} from '../src/graders/performance/types'
import { MessageSchema } from '../src/types/message'
const DEFAULT_SAMPLE = 'results/webvoyager-restart/Allrecipes--0'
async function main() {
const outputDir = process.argv[2]
? process.argv[2]
: join(process.cwd(), DEFAULT_SAMPLE)
console.log(`\n=== Performance Grader Test ===`)
console.log(`Output dir: ${outputDir}\n`)
// 1. Load messages
const rawLines = (await readFile(join(outputDir, 'messages.jsonl'), 'utf-8'))
.split('\n')
.filter(Boolean)
const messages = rawLines.map((line) => MessageSchema.parse(JSON.parse(line)))
console.log(`Loaded ${messages.length} messages from messages.jsonl`)
// 2. Load metadata
const metadata = JSON.parse(
await readFile(join(outputDir, 'metadata.json'), 'utf-8'),
)
console.log(`Task: ${metadata.query}`)
console.log(`Duration: ${metadata.total_duration_ms}ms`)
console.log(`Screenshots: ${metadata.total_steps}`)
// 3. Extract metrics
const metrics = extractMetrics(
messages,
metadata.total_steps,
metadata.termination_reason || 'unknown',
)
console.log(`\n--- Pre-Computed Metrics (passed to agent) ---`)
console.log(JSON.stringify(metrics, null, 2))
// 4. Build prompt
const systemPrompt = PERFORMANCE_SYSTEM_PROMPT.replace(
/\{screenshot_count\}/g,
String(metrics.screenshotCount),
)
const userPrompt = buildUserPrompt(
metadata.query,
metadata.final_answer,
metrics,
DEFAULT_AXES,
)
console.log(`\nPrompt size: ${userPrompt.length} chars`)
console.log(`System prompt size: ${systemPrompt.length} chars`)
// 5. Run agent — log every tool call to see its trajectory
console.log(`\n=== Agent Trajectory ===\n`)
let turnCount = 0
let toolCallCount = 0
for await (const message of query({
prompt: userPrompt,
options: {
model: 'claude-sonnet-4-20250514',
cwd: outputDir,
systemPrompt,
allowedTools: ['Read', 'Glob', 'Grep'],
permissionMode: 'bypassPermissions',
allowDangerouslySkipPermissions: true,
maxTurns: DEFAULT_MAX_TURNS,
maxBudgetUsd: DEFAULT_MAX_BUDGET_USD,
outputFormat: {
type: 'json_schema',
schema: PERFORMANCE_EVAL_SCHEMA,
},
env: {
...process.env,
CLAUDECODE: '',
},
},
})) {
if (message.type === 'assistant') {
turnCount++
console.log(`--- Turn ${turnCount} ---`)
for (const block of message.message.content) {
if (block.type === 'text' && block.text) {
const preview =
block.text.length > 400
? `${block.text.slice(0, 400)}...`
: block.text
console.log(` [text] ${preview}`)
}
if (block.type === 'tool_use') {
toolCallCount++
const input = block.input as Record<string, unknown>
// Show what the agent is reading/grepping
if (block.name === 'Read') {
console.log(
` [tool #${toolCallCount}] Read → ${input.file_path}${input.limit ? ` (lines ${input.offset || 1}-${Number(input.offset || 1) + Number(input.limit) - 1})` : ''}`,
)
} else if (block.name === 'Grep') {
console.log(
` [tool #${toolCallCount}] Grep → pattern="${input.pattern}" path="${input.path || '.'}"`,
)
} else if (block.name === 'Glob') {
console.log(` [tool #${toolCallCount}] Glob → ${input.pattern}`)
} else {
console.log(
` [tool #${toolCallCount}] ${block.name}(${JSON.stringify(input).slice(0, 150)})`,
)
}
}
}
}
if (message.type === 'result') {
console.log(`\n=== Result ===`)
console.log(`Status: ${message.subtype}`)
console.log(`Turns: ${message.num_turns}`)
console.log(`Tool calls: ${toolCallCount}`)
console.log(`Cost: $${message.total_cost_usd.toFixed(4)}`)
if (message.subtype === 'success') {
console.log(`\n--- Scores ---`)
const axes = (
message.structured_output as PerformanceEvalResponse | undefined
)?.axes
if (Array.isArray(axes)) {
let composite = 0
for (const a of axes) {
const def = DEFAULT_AXES.find((d) => d.name === a.axis)
const weight = def?.weight ?? 0
composite += a.score * weight
console.log(
` ${a.axis}: ${a.score}/100 (weight: ${weight}) — ${a.reasoning}`,
)
}
console.log(`\n Composite: ${composite.toFixed(1)}/100`)
console.log(
` Pass (>= ${DEFAULT_PASS_THRESHOLD}): ${composite >= DEFAULT_PASS_THRESHOLD ? 'YES' : 'NO'}`,
)
}
} else {
console.log(`Error: ${message.result}`)
}
}
}
}
main().catch(console.error)

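The weighted composite computed in the `--- Scores ---` block above reduces to a sum of score × weight per axis. A standalone sketch of that reduction; the axis names and weights here are made up for illustration:

```typescript
// Hypothetical axes with scores (0-100) and weights summing to 1.0.
const axes = [
  { axis: 'speed', score: 80, weight: 0.6 },
  { axis: 'efficiency', score: 50, weight: 0.4 },
]

// Weight-blended composite, same shape as the loop in the script above.
const composite = axes.reduce((sum, a) => sum + a.score * a.weight, 0)

console.log(composite.toFixed(1)) // ≈ 68.0
```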

@@ -1,200 +0,0 @@
/**
* Validation script for Gemini Computer Use integration
* Run: bun apps/eval/scripts/validate-computer-use-tools.ts
*/
import { Client } from '@modelcontextprotocol/sdk/client/index.js'
import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamableHttp.js'
const MCP_URL = process.env.MCP_URL || 'http://127.0.0.1:9105/mcp'
interface McpToolResult {
content: Array<{
type: string
text?: string
data?: string
mimeType?: string
}>
isError?: boolean
}
async function callMcpTool(
serverUrl: string,
name: string,
args: Record<string, unknown> = {},
): Promise<McpToolResult> {
const client = new Client({ name: 'validate-computer-use', version: '1.0.0' })
const transport = new StreamableHTTPClientTransport(new URL(serverUrl), {
requestInit: { headers: { 'X-BrowserOS-Source': 'validation' } },
})
try {
await client.connect(transport)
return (await client.callTool({ name, arguments: args })) as McpToolResult
} finally {
try {
await transport.close()
} catch {}
}
}
async function validateTools() {
console.log('🔍 Validating MCP tools for Gemini Computer Use integration\n')
console.log(`MCP URL: ${MCP_URL}\n`)
// Get active tab first
console.log('1. Getting active tab...')
const tabResult = await callMcpTool(MCP_URL, 'browser_get_active_tab', {})
if (tabResult.isError) {
console.error('❌ Failed to get active tab:', tabResult.content)
process.exit(1)
}
const tabText = tabResult.content.find((c) => c.type === 'text')?.text ?? ''
const tabIdMatch = tabText.match(/ID: (\d+)/)
const tabId = tabIdMatch ? parseInt(tabIdMatch[1], 10) : 1
console.log(` ✅ Active tab ID: ${tabId}\n`)
// Validate each tool needed for Computer Use
const toolTests = [
{
name: 'browser_get_screenshot',
args: { tabId, size: 'medium' },
description: 'Screenshot capture',
validate: (r: McpToolResult) => r.content.some((c) => c.type === 'image'),
},
{
name: 'browser_click_coordinates',
args: { tabId, x: 100, y: 100 },
description: 'Click at coordinates',
validate: (r: McpToolResult) => !r.isError,
},
{
name: 'browser_type_at_coordinates',
args: { tabId, x: 100, y: 100, text: 'test' },
description: 'Type at coordinates',
validate: (r: McpToolResult) => !r.isError,
},
{
name: 'browser_scroll_down',
args: { tabId },
description: 'Scroll down',
validate: (r: McpToolResult) => !r.isError,
},
{
name: 'browser_scroll_up',
args: { tabId },
description: 'Scroll up',
validate: (r: McpToolResult) => !r.isError,
},
{
name: 'browser_send_keys',
args: { tabId, key: 'Enter' },
description: 'Send keyboard key',
validate: (r: McpToolResult) => !r.isError,
},
{
name: 'browser_execute_javascript',
args: { tabId, code: 'window.location.href' },
description: 'Execute JavaScript (for go_back/forward workaround)',
validate: (r: McpToolResult) => !r.isError,
},
]
let passed = 0
let failed = 0
for (const test of toolTests) {
process.stdout.write(`2. Testing ${test.name} (${test.description})... `)
try {
const result = await callMcpTool(MCP_URL, test.name, test.args)
if (test.validate(result)) {
console.log('✅')
passed++
} else {
console.log('❌ Validation failed')
console.log(' Result:', JSON.stringify(result, null, 2))
failed++
}
} catch (err) {
console.log('❌ Error:', err instanceof Error ? err.message : err)
failed++
}
}
console.log(`\n${'='.repeat(50)}`)
console.log(`Results: ${passed} passed, ${failed} failed`)
console.log('='.repeat(50))
if (failed === 0) {
console.log(
'\n✅ All tools validated! Gemini Computer Use integration should work.',
)
console.log('\nGaps to address with workarounds:')
console.log(' - key_combination: Use browser_execute_javascript')
console.log(
' - go_back/go_forward: Use browser_execute_javascript with history.back()/forward()',
)
console.log(
' - type_text_at press_enter: Chain browser_send_keys after typing',
)
} else {
console.log('\n⚠ Some tools failed. Check that the server is running.')
}
}
// Validate Gemini API access
async function validateGeminiApi() {
const apiKey = process.env.GOOGLE_AI_API_KEY || process.env.GEMINI_API_KEY
if (!apiKey) {
console.log('\n⚠ GOOGLE_AI_API_KEY not set - skipping API validation')
return
}
console.log('\n3. Validating Gemini Computer Use API access...')
const MODEL = 'gemini-2.5-computer-use-preview-10-2025'
const url = `https://generativelanguage.googleapis.com/v1beta/models/${MODEL}:generateContent`
// Minimal test - just check if model is accessible
const testPayload = {
contents: [{ role: 'user', parts: [{ text: 'test' }] }],
}
try {
const response = await fetch(url, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'x-goog-api-key': apiKey,
},
body: JSON.stringify(testPayload),
})
if (response.ok) {
console.log(' ✅ Gemini Computer Use API is accessible')
} else {
const error = await response.json()
console.log(
' ❌ API error:',
error.error?.message || response.statusText,
)
}
} catch (err) {
console.log(
' ❌ Network error:',
err instanceof Error ? err.message : err,
)
}
}
async function main() {
try {
await validateTools()
await validateGeminiApi()
} catch (err) {
console.error('Validation failed:', err)
process.exit(1)
}
}
main()

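The tab-ID extraction in `validateTools()` above (`/ID: (\d+)/` with a fallback of 1) can be exercised standalone. The sample text below is hypothetical, shaped like the tool output the script parses:

```typescript
// Hypothetical active-tab text of the shape the script parses.
const tabText = 'Active tab ID: 42 (https://example.com)'

// Same parse-with-fallback pattern as validateTools() above.
const tabIdMatch = tabText.match(/ID: (\d+)/)
const tabId = tabIdMatch ? parseInt(tabIdMatch[1], 10) : 1

console.log(tabId) // 42
```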

@@ -59,10 +59,9 @@ interface RunSummary {
}
const PASS_FAIL_GRADER_ORDER = [
'agisdk_state_diff',
'infinity_state',
'performance_grader',
'webvoyager_grader',
'fara_combined',
'fara_grader',
]
function requireEnv(name: string): string {
@@ -332,9 +331,7 @@ const html = `<!DOCTYPE html>
? 'Orch-Exec'
: r.agentType === 'single'
? 'Tool Loop'
: r.agentType === 'gemini-computer-use'
? 'Gemini CU'
: r.agentType || '—'
: r.agentType || ''
return `<tr data-config="${escHtml(r.runId)}" data-search="${escHtml(`${r.date} ${r.runId} ${r.model} ${r.dataset} ${archLabel}`)}">
<td>${escHtml(r.date)}</td>
<td class="mono">${escHtml(r.runId)}</td>
@@ -383,7 +380,6 @@ const html = `<!DOCTYPE html>
var latest = runs[runs.length - 1];
var archLabel = latest.agentType === 'orchestrator-executor' ? 'Orchestrator-Executor'
: latest.agentType === 'single' ? 'Single Agent (Tool Loop)'
: latest.agentType === 'gemini-computer-use' ? 'Gemini Computer Use'
: latest.agentType || 'Unknown';
var scoreColor = latest.avgScore >= 75 ? '#3fb950' : latest.avgScore >= 40 ? '#f0883e' : '#f85149';
el.innerHTML =


@@ -1,643 +0,0 @@
/**
* Maps Gemini Computer Use actions to MCP tool calls
*
* Coordinate System:
* - Screenshots captured with size='large' (1028px width, aspect ratio preserved)
* - Gemini outputs normalized coordinates (0-999) relative to the screenshot
* - We convert these to actual viewport coordinates by:
* 1. Getting the real viewport dimensions via JavaScript
* 2. Scaling normalized coords to actual viewport pixels
*/
import { Client } from '@modelcontextprotocol/sdk/client/index.js'
import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamableHttp.js'
import type { ActionContext, ComputerUseAction, ScreenSize } from './types'
import { DEFAULTS } from './types'
interface McpToolResult {
content: Array<{
type: string
text?: string
data?: string
mimeType?: string
}>
isError?: boolean
}
const MCP_TIMEOUT_MS = 30000
export class ActionMapper {
private ctx: ActionContext
private cachedViewport: ScreenSize | null = null
constructor(ctx: ActionContext) {
this.ctx = ctx
}
// Store debug info about viewport detection for inclusion in responses
private viewportDebugInfo: string = ''
/**
* Get the actual browser viewport size via JavaScript
* Caches the result to avoid repeated calls
* Also stores debug info for troubleshooting
*/
async getViewportSize(): Promise<ScreenSize> {
if (this.cachedViewport) {
return this.cachedViewport
}
try {
const result = await this.callMcp('browser_execute_javascript', {
tabId: this.ctx.tabId,
windowId: this.ctx.windowId,
code: '[window.innerWidth, window.innerHeight]',
})
const textContent =
result.content.find((c) => c.type === 'text')?.text ?? ''
// Check for error in result
if (result.isError) {
this.viewportDebugInfo = `[VIEWPORT ERROR] JS execution failed: ${textContent}. Using fallback: ${this.ctx.screenSize.width}x${this.ctx.screenSize.height}`
console.warn(this.viewportDebugInfo)
return this.ctx.screenSize
}
// Response format can be multiline:
// "Result: [1440, 900]" or "Result: [\n 1200,\n 712\n]"
const arrayMatch = textContent.match(/\[\s*(\d+)\s*,\s*(\d+)\s*\]/s)
if (arrayMatch) {
const width = parseInt(arrayMatch[1], 10)
const height = parseInt(arrayMatch[2], 10)
if (width > 0 && height > 0) {
this.cachedViewport = { width, height }
this.viewportDebugInfo = `[VIEWPORT OK] Detected: ${width}x${height} (raw response: "${textContent.substring(0, 100)}")`
console.log(this.viewportDebugInfo)
return this.cachedViewport
} else {
this.viewportDebugInfo = `[VIEWPORT PARSE ERROR] Invalid dimensions: ${width}x${height} from "${textContent}". Using fallback: ${this.ctx.screenSize.width}x${this.ctx.screenSize.height}`
console.warn(this.viewportDebugInfo)
}
} else {
this.viewportDebugInfo = `[VIEWPORT PARSE ERROR] Could not parse response: "${textContent}". Using fallback: ${this.ctx.screenSize.width}x${this.ctx.screenSize.height}`
console.warn(this.viewportDebugInfo)
}
} catch (error) {
const errMsg = error instanceof Error ? error.message : String(error)
this.viewportDebugInfo = `[VIEWPORT EXCEPTION] ${errMsg}. Using fallback: ${this.ctx.screenSize.width}x${this.ctx.screenSize.height}`
console.warn(this.viewportDebugInfo)
}
// Fallback to configured screenSize
return this.ctx.screenSize
}
/**
* Get the current viewport debug info
*/
getViewportDebugInfo(): string {
return this.viewportDebugInfo
}
/**
* Clear cached viewport (call when tab/window changes or before new task)
*/
clearViewportCache(): void {
this.cachedViewport = null
}
/**
* Scale normalized coordinate (0-999) to actual viewport pixel value
*/
private async scaleCoordinates(
normalizedX: number,
normalizedY: number,
): Promise<{ x: number; y: number }> {
const viewport = await this.getViewportSize()
return {
x: Math.round((normalizedX / 1000) * viewport.width),
y: Math.round((normalizedY / 1000) * viewport.height),
}
}
/**
* Call an MCP tool
*/
private async callMcp(
name: string,
args: Record<string, unknown> = {},
): Promise<McpToolResult> {
const client = new Client({
name: 'gemini-computer-use',
version: '1.0.0',
})
const transport = new StreamableHTTPClientTransport(
new URL(this.ctx.mcpUrl),
{
requestInit: {
headers: { 'X-BrowserOS-Source': 'gemini-computer-use' },
},
},
)
try {
await client.connect(transport)
const toolCallPromise = client.callTool({ name, arguments: args })
let timeoutId: ReturnType<typeof setTimeout> | null = null
const timeoutPromise = new Promise<never>((_, reject) => {
timeoutId = setTimeout(
() =>
reject(
new Error(`MCP tool call timed out after ${MCP_TIMEOUT_MS}ms`),
),
MCP_TIMEOUT_MS,
)
})
try {
return (await Promise.race([
toolCallPromise,
timeoutPromise,
])) as McpToolResult
} finally {
if (timeoutId) clearTimeout(timeoutId)
}
} finally {
try {
await transport.close()
} catch {
// Ignore close errors
}
}
}
/**
* Execute a Computer Use action by mapping to MCP tools
*/
async execute(
action: ComputerUseAction,
): Promise<{ success: boolean; message: string }> {
const { tabId, windowId } = this.ctx
try {
switch (action.name) {
case 'click_at': {
const viewport = await this.getViewportSize()
const { x, y } = await this.scaleCoordinates(
action.args.x,
action.args.y,
)
await this.callMcp('browser_click_coordinates', {
tabId,
windowId,
x,
y,
})
// Return original coordinates + debug info for troubleshooting
// Debug info shows: model input → viewport coords, viewport size, and any errors
const debugInfo = `[DEBUG: input=(${action.args.x},${action.args.y}) → viewport=(${x},${y}), viewport=${viewport.width}x${viewport.height}] ${this.viewportDebugInfo}`
return {
success: true,
message: `Clicked at (${action.args.x}, ${action.args.y}). ${debugInfo}`,
}
}
case 'type_text_at': {
const viewport = await this.getViewportSize()
const { x, y } = await this.scaleCoordinates(
action.args.x,
action.args.y,
)
const { text, press_enter, clear_before_typing } = action.args
// Clear field first if requested (select all + delete)
if (clear_before_typing) {
await this.callMcp('browser_click_coordinates', {
tabId,
windowId,
x,
y,
})
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: `document.execCommand('selectAll')`,
})
await this.callMcp('browser_send_keys', {
tabId,
windowId,
key: 'Delete',
})
}
// Type the text
await this.callMcp('browser_type_at_coordinates', {
tabId,
windowId,
x,
y,
text,
})
// Press Enter if requested
if (press_enter) {
await this.callMcp('browser_send_keys', {
tabId,
windowId,
key: 'Enter',
})
}
// Return original coordinates + debug info
const debugInfo = `[DEBUG: input=(${action.args.x},${action.args.y}) → viewport=(${x},${y}), viewport=${viewport.width}x${viewport.height}] ${this.viewportDebugInfo}`
return {
success: true,
message: `Typed "${text.substring(0, 50)}${text.length > 50 ? '...' : ''}" at (${action.args.x}, ${action.args.y}). ${debugInfo}`,
}
}
case 'navigate': {
await this.callMcp('browser_navigate', {
tabId,
windowId,
url: action.args.url,
})
return { success: true, message: `Navigated to ${action.args.url}` }
}
case 'scroll_document': {
const { direction } = action.args
if (direction === 'up') {
await this.callMcp('browser_scroll_up', { tabId, windowId })
} else if (direction === 'down') {
await this.callMcp('browser_scroll_down', { tabId, windowId })
} else {
// Left/right scroll via JavaScript
const scrollCode =
direction === 'left'
? 'window.scrollBy(-window.innerWidth, 0)'
: 'window.scrollBy(window.innerWidth, 0)'
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: scrollCode,
})
}
return { success: true, message: `Scrolled ${direction}` }
}
case 'scroll_at': {
const { x, y } = await this.scaleCoordinates(
action.args.x,
action.args.y,
)
const { direction, magnitude = 500 } = action.args
// Click at position first to focus element
await this.callMcp('browser_click_coordinates', {
tabId,
windowId,
x,
y,
})
// Scale magnitude from 0-999 to actual pixels
const viewport = await this.getViewportSize()
const scrollAmount = Math.round((magnitude / 1000) * viewport.height)
// Use JavaScript scrollBy for precise control with magnitude
const scrollCode =
direction === 'up'
? `window.scrollBy(0, -${scrollAmount})`
: direction === 'down'
? `window.scrollBy(0, ${scrollAmount})`
: direction === 'left'
? `window.scrollBy(-${scrollAmount}, 0)`
: `window.scrollBy(${scrollAmount}, 0)`
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: scrollCode,
})
// Return original coordinates to avoid confusing the model
return {
success: true,
message: `Scrolled ${direction} at (${action.args.x}, ${action.args.y})`,
}
}
case 'key_combination': {
const { keys } = action.args
// Map common key combinations to JavaScript or available keys
const keyMap: Record<string, () => Promise<void>> = {
'Control+a': async () => {
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: `document.execCommand('selectAll')`,
})
},
'Control+c': async () => {
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: `document.execCommand('copy')`,
})
},
'Control+v': async () => {
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: `document.execCommand('paste')`,
})
},
'Control+z': async () => {
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: `document.execCommand('undo')`,
})
},
Enter: async () => {
await this.callMcp('browser_send_keys', {
tabId,
windowId,
key: 'Enter',
})
},
Escape: async () => {
await this.callMcp('browser_send_keys', {
tabId,
windowId,
key: 'Escape',
})
},
Tab: async () => {
await this.callMcp('browser_send_keys', {
tabId,
windowId,
key: 'Tab',
})
},
Backspace: async () => {
await this.callMcp('browser_send_keys', {
tabId,
windowId,
key: 'Backspace',
})
},
Delete: async () => {
await this.callMcp('browser_send_keys', {
tabId,
windowId,
key: 'Delete',
})
},
ArrowUp: async () => {
await this.callMcp('browser_send_keys', {
tabId,
windowId,
key: 'ArrowUp',
})
},
ArrowDown: async () => {
await this.callMcp('browser_send_keys', {
tabId,
windowId,
key: 'ArrowDown',
})
},
ArrowLeft: async () => {
await this.callMcp('browser_send_keys', {
tabId,
windowId,
key: 'ArrowLeft',
})
},
ArrowRight: async () => {
await this.callMcp('browser_send_keys', {
tabId,
windowId,
key: 'ArrowRight',
})
},
}
// Normalize key string (case insensitive for modifiers)
const normalizedKeys = keys
.replace(/ctrl/i, 'Control')
.replace(/cmd/i, 'Control')
const handler = keyMap[normalizedKeys] || keyMap[keys]
if (handler) {
await handler()
} else {
const keyName = keys.split('+').pop() || ''
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: `
const event = new KeyboardEvent('keydown', {
key: ${JSON.stringify(keyName)},
ctrlKey: ${keys.toLowerCase().includes('control')},
shiftKey: ${keys.toLowerCase().includes('shift')},
altKey: ${keys.toLowerCase().includes('alt')},
metaKey: ${keys.toLowerCase().includes('meta')},
bubbles: true
});
document.activeElement?.dispatchEvent(event);
`,
})
}
return { success: true, message: `Pressed ${keys}` }
}
case 'hover_at': {
const { x, y } = await this.scaleCoordinates(
action.args.x,
action.args.y,
)
// Simulate hover via JavaScript mouseover event
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: `
const elem = document.elementFromPoint(${x}, ${y});
if (elem) {
const event = new MouseEvent('mouseover', { bubbles: true, clientX: ${x}, clientY: ${y} });
elem.dispatchEvent(event);
}
`,
})
// Return original coordinates to avoid confusing the model
return {
success: true,
message: `Hovered at (${action.args.x}, ${action.args.y})`,
}
}
case 'go_back': {
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: 'history.back()',
})
return { success: true, message: 'Navigated back' }
}
case 'go_forward': {
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: 'history.forward()',
})
return { success: true, message: 'Navigated forward' }
}
case 'wait_5_seconds': {
await new Promise((resolve) => setTimeout(resolve, 5000))
return { success: true, message: 'Waited 5 seconds' }
}
case 'drag_and_drop': {
const start = await this.scaleCoordinates(
action.args.x,
action.args.y,
)
const end = await this.scaleCoordinates(
action.args.destination_x,
action.args.destination_y,
)
// Simulate drag and drop via JavaScript
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: `
const startElem = document.elementFromPoint(${start.x}, ${start.y});
const endElem = document.elementFromPoint(${end.x}, ${end.y});
if (startElem && endElem) {
const dragStart = new DragEvent('dragstart', { bubbles: true, clientX: ${start.x}, clientY: ${start.y} });
const drop = new DragEvent('drop', { bubbles: true, clientX: ${end.x}, clientY: ${end.y} });
const dragEnd = new DragEvent('dragend', { bubbles: true });
startElem.dispatchEvent(dragStart);
endElem.dispatchEvent(drop);
startElem.dispatchEvent(dragEnd);
}
`,
})
// Return original coordinates to avoid confusing the model
return {
success: true,
message: `Dragged from (${action.args.x}, ${action.args.y}) to (${action.args.destination_x}, ${action.args.destination_y})`,
}
}
default: {
const _exhaustive: never = action
return {
success: false,
message: `Unknown action: ${JSON.stringify(action)}`,
}
}
}
} catch (error) {
const message = error instanceof Error ? error.message : String(error)
return { success: false, message: `Action failed: ${message}` }
}
}
/**
* Capture a screenshot via MCP with retry logic
*
* Uses Gemini's recommended screenshot size (1440x900) for optimal model performance.
 * Viewport detection (see getViewportSize) keeps the resulting coordinate mapping accurate.
*/
async captureScreenshot(retries = 2): Promise<string | null> {
const { width, height } = DEFAULTS.screenshotSize
for (let attempt = 0; attempt <= retries; attempt++) {
try {
const result = await this.callMcp('browser_get_screenshot', {
tabId: this.ctx.tabId,
windowId: this.ctx.windowId,
width,
height,
showHighlights: false,
})
if (result.isError) {
const errorText =
result.content?.find((c) => c.type === 'text')?.text ??
'Unknown error'
if (attempt < retries) {
console.warn(
`Screenshot attempt ${attempt + 1} failed: ${errorText}, retrying...`,
)
await new Promise((r) => setTimeout(r, 500))
continue
}
console.warn('Screenshot capture failed:', errorText)
return null
}
const imageContent = result.content.find((c) => c.type === 'image')
if (imageContent?.data) {
return imageContent.data
}
if (attempt < retries) {
console.warn(
`Screenshot attempt ${attempt + 1}: No image data, retrying...`,
)
await new Promise((r) => setTimeout(r, 500))
continue
}
return null
} catch (error) {
if (attempt < retries) {
console.warn(
`Screenshot attempt ${attempt + 1} error:`,
error,
'retrying...',
)
await new Promise((r) => setTimeout(r, 500))
continue
}
console.warn('Screenshot capture error:', error)
return null
}
}
return null
}
/**
* Get current page URL via MCP
*/
async getCurrentUrl(): Promise<string> {
try {
const result = await this.callMcp('browser_execute_javascript', {
tabId: this.ctx.tabId,
windowId: this.ctx.windowId,
code: 'window.location.href',
})
const textContent =
result.content.find((c) => c.type === 'text')?.text ?? ''
// Extract URL from result text
const urlMatch = textContent.match(/Result:\s*"?([^"\n]+)"?/)
return urlMatch?.[1] ?? 'unknown'
} catch {
return 'unknown'
}
}
}

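The normalized-to-viewport mapping in `scaleCoordinates` above is a straight linear scale from Gemini's 0-999 grid to real pixels. A standalone sketch; the viewport size is an assumption for illustration:

```typescript
interface ScreenSize { width: number; height: number }

// Gemini emits coordinates normalized to 0-999; scale to viewport pixels.
function scaleCoordinates(normX: number, normY: number, viewport: ScreenSize) {
  return {
    x: Math.round((normX / 1000) * viewport.width),
    y: Math.round((normY / 1000) * viewport.height),
  }
}

console.log(scaleCoordinates(500, 500, { width: 1440, height: 900 })) // { x: 720, y: 450 }
```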

@@ -1,327 +0,0 @@
/**
* Gemini Computer Use Agent
* Implements the agent loop that calls Gemini API and executes actions
* Uses UIMessageStreamEvent format for logging compatibility
*/
import { randomUUID } from 'node:crypto'
import { ActionMapper } from './action-mapper'
import {
type ComputerUseAction,
DEFAULTS,
type GeminiComputerUseAgentConfig,
type GeminiContent,
type GeminiPart,
type GeminiResponse,
} from './types'
const GEMINI_API_BASE = 'https://generativelanguage.googleapis.com/v1beta'
interface StreamWriter {
write: (data: string) => Promise<void>
}
type ActionHook = (
action: ComputerUseAction,
result: { success: boolean; message: string },
) => Promise<void>
/**
* Emit SSE-formatted UIMessageStreamEvent
*/
function emitEvent(
writer: StreamWriter,
event: Record<string, unknown>,
): Promise<void> {
return writer.write(`data: ${JSON.stringify(event)}\n\n`)
}
export class GeminiComputerUseAgent {
private config: GeminiComputerUseAgentConfig
private actionMapper: ActionMapper
private actionHook?: ActionHook
private contents: GeminiContent[] = []
constructor(config: GeminiComputerUseAgentConfig) {
this.config = config
this.actionMapper = new ActionMapper({
mcpUrl: config.mcpUrl,
tabId: config.tabId,
windowId: config.windowId,
screenSize: config.screenSize,
})
}
/**
* Set a hook to be called after each action execution
*/
setActionHook(hook: ActionHook): void {
this.actionHook = hook
}
/**
* Call the Gemini Computer Use API
*/
private async callGeminiApi(): Promise<GeminiResponse> {
const url = `${GEMINI_API_BASE}/models/${DEFAULTS.model}:generateContent`
const response = await fetch(url, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'x-goog-api-key': this.config.apiKey,
},
body: JSON.stringify({
contents: this.contents,
tools: [
{
computer_use: {
environment: 'ENVIRONMENT_BROWSER',
},
},
],
}),
})
if (!response.ok) {
const errorBody = await response.text()
throw new Error(
`Gemini API error: ${response.status} ${response.statusText} - ${errorBody}`,
)
}
return response.json()
}
/**
* Extract function calls from a Gemini response
*/
private extractFunctionCalls(response: GeminiResponse): ComputerUseAction[] {
const candidate = response.candidates?.[0]
if (!candidate?.content?.parts) {
return []
}
const actions: ComputerUseAction[] = []
for (const part of candidate.content.parts) {
if (part.functionCall) {
const { name, args } = part.functionCall
// Construct action object
actions.push({ name, args: args ?? {} } as ComputerUseAction)
}
}
return actions
}
/**
* Extract text response from Gemini response
*/
private extractTextResponse(response: GeminiResponse): string | null {
const candidate = response.candidates?.[0]
if (!candidate?.content?.parts) {
return null
}
const textParts = candidate.content.parts
.map((p) => p.text)
.filter((text): text is string => text !== undefined)
return textParts.length > 0 ? textParts.join('\n') : null
}
/**
* Build function response parts for the next turn
*/
private buildFunctionResponses(
actions: ComputerUseAction[],
currentUrl: string,
screenshotBase64: string | null,
): GeminiPart[] {
const parts: GeminiPart[] = []
for (const action of actions) {
parts.push({
functionResponse: {
name: action.name,
response: { url: currentUrl },
},
})
}
// Add screenshot as inline data
if (screenshotBase64) {
parts.push({
inlineData: {
mimeType: 'image/png',
data: screenshotBase64,
},
})
}
return parts
}
/**
* Execute the agent loop
*/
async execute(
query: string,
streamWriter: StreamWriter,
signal: AbortSignal,
): Promise<{ finalText: string | null; totalActions: number }> {
let totalActions = 0
let finalText: string | null = null
// Wait for page to stabilize before first screenshot
await new Promise((resolve) => setTimeout(resolve, 2000))
// Capture initial screenshot with retries
let initialScreenshot: string | null = null
for (let attempt = 1; attempt <= 3; attempt++) {
initialScreenshot = await this.actionMapper.captureScreenshot()
if (initialScreenshot) break
console.warn(`Initial screenshot attempt ${attempt} failed, retrying...`)
await new Promise((resolve) => setTimeout(resolve, 1000))
}
if (!initialScreenshot) {
throw new Error('Failed to capture initial screenshot after 3 attempts')
}
// Build initial content
const initialParts: GeminiPart[] = [
{ text: query },
{ inlineData: { mimeType: 'image/png', data: initialScreenshot } },
]
this.contents.push({ role: 'user', parts: initialParts })
const messageId = randomUUID()
await emitEvent(streamWriter, { type: 'start', messageId })
let finished = false
for (let turn = 0; turn < this.config.turnLimit; turn++) {
if (signal.aborted) {
await emitEvent(streamWriter, { type: 'abort' })
break
}
// Start step (turn)
await emitEvent(streamWriter, { type: 'start-step' })
// Call Gemini API
let response: GeminiResponse
try {
response = await this.callGeminiApi()
} catch (error) {
const errorMsg = error instanceof Error ? error.message : String(error)
await emitEvent(streamWriter, {
type: 'error',
errorText: `API error: ${errorMsg}`,
})
throw error
}
// Check for API errors
if (response.error) {
await emitEvent(streamWriter, {
type: 'error',
errorText: response.error.message,
})
throw new Error(`Gemini API error: ${response.error.message}`)
}
// Extract text response
const textResponse = this.extractTextResponse(response)
if (textResponse) {
finalText = textResponse
const textId = randomUUID()
await emitEvent(streamWriter, { type: 'text-start', id: textId })
await emitEvent(streamWriter, {
type: 'text-delta',
id: textId,
delta: textResponse,
})
await emitEvent(streamWriter, { type: 'text-end', id: textId })
}
// Extract function calls
const actions = this.extractFunctionCalls(response)
// If no actions, task is complete
if (actions.length === 0) {
await emitEvent(streamWriter, { type: 'finish-step' })
await emitEvent(streamWriter, {
type: 'finish',
finishReason: 'completed',
})
finished = true
break
}
// Add model response to conversation
const candidate = response.candidates?.[0]
if (candidate?.content) {
this.contents.push(candidate.content)
}
// Execute each action
for (const action of actions) {
if (signal.aborted) break
const toolCallId = randomUUID()
// Tool input events
await emitEvent(streamWriter, {
type: 'tool-input-start',
toolCallId,
toolName: action.name,
})
await emitEvent(streamWriter, {
type: 'tool-input-available',
toolCallId,
toolName: action.name,
input: action.args,
})
const result = await this.actionMapper.execute(action)
totalActions++
// Tool output event
await emitEvent(streamWriter, {
type: 'tool-output-available',
toolCallId,
output: result,
})
// Call action hook (for screenshot capture)
if (this.actionHook) {
await this.actionHook(action, result)
}
}
// Capture new screenshot and URL
const newScreenshot = await this.actionMapper.captureScreenshot()
const currentUrl = await this.actionMapper.getCurrentUrl()
// Build function responses and add to conversation
const functionResponseParts = this.buildFunctionResponses(
actions,
currentUrl,
newScreenshot,
)
this.contents.push({ role: 'user', parts: functionResponseParts })
// Finish step (turn)
await emitEvent(streamWriter, { type: 'finish-step' })
}
if (!finished && !signal.aborted) {
await emitEvent(streamWriter, {
type: 'finish',
finishReason: 'max_turns',
})
}
return { finalText, totalActions }
}
}

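Each turn above feeds Gemini one `functionResponse` part per executed action plus the fresh screenshot as `inlineData`. A minimal sketch of that payload shape, mirroring `buildFunctionResponses()`; all values are placeholders:

```typescript
// Placeholder values; shapes mirror buildFunctionResponses() above.
const actions = [{ name: 'click_at' }, { name: 'scroll_document' }]
const currentUrl = 'https://example.com'
const screenshotBase64 = 'iVBORw0KGgo' // placeholder, not a real PNG

const parts: Array<Record<string, unknown>> = actions.map((a) => ({
  functionResponse: { name: a.name, response: { url: currentUrl } },
}))
parts.push({ inlineData: { mimeType: 'image/png', data: screenshotBase64 } })

console.log(parts.length) // 3: two function responses plus one screenshot
```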

@@ -1,97 +0,0 @@
/**
* Gemini Computer Use Evaluator
* Implements AgentEvaluator interface for the eval framework
*/
import { DEFAULT_TIMEOUT_MS } from '../../constants'
import type { GeminiComputerUseConfig, TaskMetadata } from '../../types'
import { resolveEnvValue } from '../../utils/resolve-env'
import { withEvalTimeout } from '../../utils/with-eval-timeout'
import type { AgentContext, AgentEvaluator, AgentResult } from '../types'
import { GeminiComputerUseAgent } from './agent'
import { DEFAULTS } from './types'
export class GeminiComputerUseEvaluator implements AgentEvaluator {
constructor(private ctx: AgentContext) {}
async execute(): Promise<AgentResult> {
const { config, task, capture, windowId = 0, tabId = 0 } = this.ctx
const agentConfig = config.agent as GeminiComputerUseConfig
const startTime = Date.now()
const timeoutMs = config.timeout_ms ?? DEFAULT_TIMEOUT_MS
await capture.messageLogger.logUser(task.query)
const apiKey = resolveEnvValue(agentConfig.apiKey)
if (!apiKey) {
throw new Error(
`API key not found. Set ${agentConfig.apiKey} environment variable or provide the key directly.`,
)
}
const agent = new GeminiComputerUseAgent({
apiKey,
turnLimit: agentConfig.turnLimit ?? DEFAULTS.turnLimit,
screenSize: agentConfig.screenSize ?? DEFAULTS.screenSize,
tabId,
windowId,
mcpUrl: `${config.browseros.server_url}/mcp`,
})
agent.setActionHook(async (_action, _result) => {
try {
await capture.screenshot.capture(capture.getActivePageId())
} catch (err) {
console.warn('Screenshot capture failed in hook:', err)
}
})
const streamWriter = capture.createStreamWriter()
let finalText: string | null = null
let totalActions = 0
const { terminationReason } = await withEvalTimeout(
timeoutMs,
capture,
async (signal) => {
const result = await agent.execute(task.query, streamWriter, signal)
finalText = result.finalText
totalActions = result.totalActions
return result
},
)
const endTime = Date.now()
const metadata: TaskMetadata = {
query_id: task.query_id,
dataset: task.dataset,
query: task.query,
started_at: new Date(startTime).toISOString(),
completed_at: new Date(endTime).toISOString(),
total_duration_ms: endTime - startTime,
total_steps: totalActions,
termination_reason: terminationReason,
final_answer: finalText ?? capture.getLastAssistantText(),
errors: capture.getErrors(),
warnings: capture.getWarnings(),
agent_config: {
type: 'gemini-computer-use',
model: DEFAULTS.model,
turnLimit: agentConfig.turnLimit ?? DEFAULTS.turnLimit,
screenSize: agentConfig.screenSize ?? DEFAULTS.screenSize,
},
grader_results: {},
}
await capture.trajectorySaver.saveMetadata(metadata)
return {
metadata,
messages: capture.getMessages(),
finalAnswer: finalText ?? capture.getLastAssistantText(),
}
}
}

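The evaluator above runs the agent inside `withEvalTimeout(timeoutMs, capture, async (signal) => …)` and reads a `terminationReason` from the result. That helper's implementation is not shown in this diff; the following is a minimal sketch of the AbortSignal-plus-timer pattern it implies (the real helper presumably also wires into `capture` logging, which is omitted here):

```typescript
// Sketch of a timeout wrapper: run a task with an AbortSignal, abort it
// after timeoutMs, and report why it terminated. The task is expected to
// poll signal.aborted (as agent.execute does each turn) and return
// whatever partial result it has, rather than throw.
async function withTimeout<T>(
  timeoutMs: number,
  fn: (signal: AbortSignal) => Promise<T>,
): Promise<{ result: T; terminationReason: 'completed' | 'timeout' }> {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), timeoutMs)
  try {
    const result = await fn(controller.signal)
    // If the signal fired, the task bailed out early rather than finishing.
    return {
      result,
      terminationReason: controller.signal.aborted ? 'timeout' : 'completed',
    }
  } finally {
    clearTimeout(timer)
  }
}
```

Because the agent cooperatively checks `signal.aborted` and returns, the evaluator can still read `finalText` and `totalActions` after a timeout instead of losing the run.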

@@ -1,156 +0,0 @@
/**
* Types for Gemini Computer Use agent
*/
import { z } from 'zod'
// Gemini Computer Use predefined actions (from API docs)
export const ComputerUseActionSchema = z.discriminatedUnion('name', [
z.object({
name: z.literal('click_at'),
args: z.object({
x: z.number().min(0).max(999),
y: z.number().min(0).max(999),
}),
}),
z.object({
name: z.literal('type_text_at'),
args: z.object({
x: z.number().min(0).max(999),
y: z.number().min(0).max(999),
text: z.string(),
press_enter: z.boolean().optional(),
clear_before_typing: z.boolean().optional(),
}),
}),
z.object({
name: z.literal('navigate'),
args: z.object({
url: z.string(),
}),
}),
z.object({
name: z.literal('scroll_document'),
args: z.object({
direction: z.enum(['up', 'down', 'left', 'right']),
}),
}),
z.object({
name: z.literal('scroll_at'),
args: z.object({
x: z.number().min(0).max(999),
y: z.number().min(0).max(999),
direction: z.enum(['up', 'down', 'left', 'right']),
magnitude: z.number().optional(),
}),
}),
z.object({
name: z.literal('key_combination'),
args: z.object({
keys: z.string(),
}),
}),
z.object({
name: z.literal('hover_at'),
args: z.object({
x: z.number().min(0).max(999),
y: z.number().min(0).max(999),
}),
}),
z.object({
name: z.literal('go_back'),
args: z.object({}).optional(),
}),
z.object({
name: z.literal('go_forward'),
args: z.object({}).optional(),
}),
z.object({
name: z.literal('wait_5_seconds'),
args: z.object({}).optional(),
}),
z.object({
name: z.literal('drag_and_drop'),
args: z.object({
x: z.number().min(0).max(999),
y: z.number().min(0).max(999),
destination_x: z.number().min(0).max(999),
destination_y: z.number().min(0).max(999),
}),
}),
])
export type ComputerUseAction = z.infer<typeof ComputerUseActionSchema>
// Screen size configuration
export interface ScreenSize {
width: number
height: number
}
// Context for action execution
export interface ActionContext {
mcpUrl: string
tabId: number
windowId: number
screenSize: ScreenSize
}
// Gemini API types
export interface GeminiContent {
role: 'user' | 'model'
parts: GeminiPart[]
}
export interface GeminiPart {
text?: string
inlineData?: {
mimeType: string
data: string
}
functionCall?: {
name: string
args?: Record<string, unknown>
}
functionResponse?: {
name: string
response: Record<string, unknown>
}
}
export interface GeminiResponse {
candidates?: Array<{
content: GeminiContent
finishReason?: string
}>
error?: {
message: string
code: number
}
}
// Safety decision from Computer Use
export interface SafetyDecision {
decision: 'allow' | 'require_confirmation' | 'block'
explanation?: string
}
// Agent configuration
export interface GeminiComputerUseAgentConfig {
apiKey: string
turnLimit: number
screenSize: ScreenSize
tabId: number
windowId: number
mcpUrl: string
}
// Defaults
export const DEFAULTS = {
// Gemini's recommended screenshot size for optimal model accuracy
screenshotSize: { width: 1440, height: 900 },
// Fallback viewport size (used when actual viewport can't be determined)
screenSize: { width: 1440, height: 900 },
turnLimit: 30,
model: 'gemini-2.5-computer-use-preview-10-2025',
} as const

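The schemas above constrain every Gemini Computer Use coordinate to a 0–999 grid, and `DEFAULTS.screenSize` is 1440x900. A hedged sketch of how such grid coordinates map to real pixels (the helper name is hypothetical; the 1000 denominator follows from the 0–999 normalized range):

```typescript
// Map a normalized grid coordinate (0-999) to an actual pixel value on
// the target screen dimension. The grid has 1000 steps, so divide by 1000.
function gridToPixel(norm: number, screenDimension: number): number {
  return Math.round((norm / 1000) * screenDimension)
}

// With the DEFAULTS.screenSize of 1440x900 above:
const midX = gridToPixel(500, 1440) // horizontal center -> 720
const bottomY = gridToPixel(999, 900) // bottom edge -> 899
```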

@@ -1,26 +1,14 @@
import { GeminiComputerUseEvaluator } from './gemini-computer-use'
import { OrchestratorExecutorEvaluator } from './orchestrator-executor'
import { registerAgent } from './registry'
import { SingleAgentEvaluator } from './single-agent'
import { YutoriNavigatorEvaluator } from './yutori-navigator'
import type { AgentContext, AgentEvaluator } from './types'
// Register built-in agent types
registerAgent('single', (ctx) => new SingleAgentEvaluator(ctx))
registerAgent(
'orchestrator-executor',
(ctx) => new OrchestratorExecutorEvaluator(ctx),
)
registerAgent(
'gemini-computer-use',
(ctx) => new GeminiComputerUseEvaluator(ctx),
)
registerAgent('yutori-navigator', (ctx) => new YutoriNavigatorEvaluator(ctx))
export function createAgent(context: AgentContext): AgentEvaluator {
switch (context.config.agent.type) {
case 'single':
return new SingleAgentEvaluator(context)
case 'orchestrator-executor':
return new OrchestratorExecutorEvaluator(context)
}
}
// Re-exports
export {
createAgent,
getRegisteredAgentTypes,
isAgentTypeRegistered,
registerAgent,
} from './registry'
export type { AgentContext, AgentEvaluator, AgentResult } from './types'


@@ -14,7 +14,6 @@ import { CdpBackend } from '@browseros/server/browser/backends/cdp'
import { CaptchaWaiter } from '../../capture/captcha-waiter'
import { DEFAULT_TIMEOUT_MS } from '../../constants'
import type {
EvalConfig,
OrchestratorExecutorConfig,
TaskMetadata,
UIMessageStreamEvent,
@@ -30,15 +29,6 @@ import { Executor, type ExecutorCallbacks } from './executor'
import { OrchestratorAgent } from './orchestrator-agent'
import type { ExecutorFactory, ExecutorResult } from './types'
function extractCdpPort(config: EvalConfig): number {
const serverUrl = config.browseros.server_url
const match = serverUrl.match(/:(\d+)$/)
if (!match) return config.browseros.base_cdp_port
const serverPort = Number.parseInt(match[1], 10)
const workerOffset = serverPort - config.browseros.base_server_port
return config.browseros.base_cdp_port + workerOffset
}
interface ResolvedConfigs {
orchestratorConfig: ResolvedAgentConfig & { maxTurns?: number }
executorConfig: ResolvedAgentConfig
@@ -124,7 +114,7 @@ export class OrchestratorExecutorEvaluator implements AgentEvaluator {
constructor(private ctx: AgentContext) {}
async execute(): Promise<AgentResult> {
const { config, task, capture } = this.ctx
const { config, task, capture, workerIndex } = this.ctx
const startTime = Date.now()
const timeoutMs = config.timeout_ms ?? DEFAULT_TIMEOUT_MS
@@ -140,8 +130,8 @@ export class OrchestratorExecutorEvaluator implements AgentEvaluator {
const { orchestratorConfig, executorConfig, isCladoAction } =
await resolveAgentConfig(agentConfig)
// Connect to Chrome via CDP
const cdpPort = extractCdpPort(config)
// Connect to Chrome via CDP — same per-worker offset used by app-manager.
const cdpPort = config.browseros.base_cdp_port + workerIndex
const cdp = new CdpBackend({ port: cdpPort })
await cdp.connect()
const browser = new Browser(cdp)

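This hunk swaps the removed `extractCdpPort` (which recovered the worker offset by parsing the port out of `server_url`) for a direct `base_cdp_port + workerIndex`. The two agree whenever the server URL's port is `base_server_port + workerIndex`, which is what the per-worker layout guarantees. A small sketch of that equivalence (config values are illustrative):

```typescript
// Legacy derivation (as in the removed extractCdpPort): infer the worker
// offset from the port embedded in server_url.
interface PortConfig {
  serverUrl: string
  baseServerPort: number
  baseCdpPort: number
}

function extractCdpPortLegacy(cfg: PortConfig): number {
  const match = cfg.serverUrl.match(/:(\d+)$/)
  if (!match) return cfg.baseCdpPort
  const workerOffset = Number.parseInt(match[1], 10) - cfg.baseServerPort
  return cfg.baseCdpPort + workerOffset
}

// Worker 2: server at base 3000 + 2, CDP at base 9222 + 2.
const cfg = {
  serverUrl: 'http://127.0.0.1:3002',
  baseServerPort: 3000,
  baseCdpPort: 9222,
}
const legacyPort = extractCdpPortLegacy(cfg) // 9224
const directPort = cfg.baseCdpPort + 2 // 9224 — the new workerIndex form
```

Passing `workerIndex` through `AgentContext` makes the offset explicit and drops the string parsing.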

@@ -1,51 +0,0 @@
import type { AgentContext, AgentEvaluator } from './types'
/**
* Factory function signature for creating agents
*/
type AgentFactory = (context: AgentContext) => AgentEvaluator
/**
* Registry of agent factories by type
*/
const registry = new Map<string, AgentFactory>()
/**
* Register an agent type with its factory function
* @throws If type is already registered
*/
export function registerAgent(type: string, factory: AgentFactory): void {
if (registry.has(type)) {
throw new Error(`Agent type "${type}" is already registered`)
}
registry.set(type, factory)
}
/**
* Create an agent evaluator from context
* @throws If agent type is not registered
*/
export function createAgent(context: AgentContext): AgentEvaluator {
const factory = registry.get(context.config.agent.type)
if (!factory) {
const available = Array.from(registry.keys()).join(', ')
throw new Error(
`Unknown agent type: "${context.config.agent.type}". Available types: ${available || 'none'}`,
)
}
return factory(context)
}
/**
* Get list of all registered agent types
*/
export function getRegisteredAgentTypes(): string[] {
return Array.from(registry.keys())
}
/**
* Check if an agent type is registered
*/
export function isAgentTypeRegistered(type: string): boolean {
return registry.has(type)
}


@@ -9,25 +9,16 @@ import { CdpBackend } from '@browseros/server/browser/backends/cdp'
import { registry } from '@browseros/server/tools/registry'
import { CaptchaWaiter } from '../capture/captcha-waiter'
import { DEFAULT_TIMEOUT_MS } from '../constants'
import type { EvalConfig, TaskMetadata } from '../types'
import type { TaskMetadata } from '../types'
import { resolveProviderConfig } from '../utils/resolve-provider-config'
import { withEvalTimeout } from '../utils/with-eval-timeout'
import type { AgentContext, AgentEvaluator, AgentResult } from './types'
function extractCdpPort(config: EvalConfig): number {
const serverUrl = config.browseros.server_url
const match = serverUrl.match(/:(\d+)$/)
if (!match) return config.browseros.base_cdp_port
const serverPort = Number.parseInt(match[1], 10)
const workerOffset = serverPort - config.browseros.base_server_port
return config.browseros.base_cdp_port + workerOffset
}
export class SingleAgentEvaluator implements AgentEvaluator {
constructor(private ctx: AgentContext) {}
async execute(): Promise<AgentResult> {
const { config, task, capture } = this.ctx
const { config, task, capture, workerIndex } = this.ctx
const startTime = Date.now()
const timeoutMs = config.timeout_ms ?? DEFAULT_TIMEOUT_MS
@@ -50,8 +41,8 @@ export class SingleAgentEvaluator implements AgentEvaluator {
supportsImages,
}
// Connect to Chrome via CDP
const cdpPort = extractCdpPort(config)
// Connect to Chrome via CDP — same per-worker offset used by app-manager.
const cdpPort = config.browseros.base_cdp_port + workerIndex
const cdp = new CdpBackend({ port: cdpPort })
await cdp.connect()


@@ -1,26 +1,17 @@
import type { CaptureContext } from '../capture/context'
import type { EvalConfig, Message, Task, TaskMetadata } from '../types'
/**
* All dependencies an agent evaluator needs - passed via factory
*/
export interface AgentContext {
// Configuration
config: EvalConfig
task: Task
workerIndex: number
// Page resolved once at task start (fresh browser has exactly one page)
// Resolved once at task start (fresh browser has exactly one page).
initialPageId: number
// Browser window info for agents that operate on explicit window/tab ids
windowId?: number
tabId?: number
// Output paths
outputDir: string // Root output directory
taskOutputDir: string // Task-specific: outputDir/query_id/
// Capture infrastructure (pre-initialized by runner)
capture: CaptureContext
}


@@ -1,677 +0,0 @@
/**
* Maps Yutori n1 actions to MCP tool calls
*
* Coordinate System:
* - n1 outputs normalized coordinates in 1000x1000 grid
* - Screenshots captured with size='large' (1028px width, aspect ratio preserved)
* - We scale normalized coords to actual viewport pixels
*
* Action Mapping (prioritize MCP tools over execute_javascript):
* - click → browser_click_coordinates ✅
* - type → browser_type_at_coordinates (uses last clicked coords) ✅
* - scroll up/down → browser_scroll_up/down ✅
* - scroll left/right → browser_execute_javascript (no horizontal scroll tool)
* - key_press → browser_send_keys (for supported keys) ✅
* - hover → browser_execute_javascript (no dedicated MCP tool)
* - drag → browser_execute_javascript (no dedicated MCP tool)
* - wait → setTimeout
* - refresh → browser_execute_javascript (no dedicated MCP tool)
* - go_back → browser_execute_javascript (no dedicated MCP tool)
* - goto_url → browser_navigate ✅
* - stop → returns answer (no MCP call)
* - read_texts_and_links → browser_get_page_content ✅
*/
import { Client } from '@modelcontextprotocol/sdk/client/index.js'
import { StreamableHTTPClientTransport } from '@modelcontextprotocol/sdk/client/streamableHttp.js'
import sharp from 'sharp'
import type { ActionContext, N1Action, ScreenSize } from './types'
import { DEFAULTS } from './types'
/**
* Convert PNG base64 to WebP base64 for smaller payload size.
* Yutori n1 recommends WebP format for better compression.
*/
async function convertToWebP(pngBase64: string): Promise<string> {
const pngBuffer = Buffer.from(pngBase64, 'base64')
const webpBuffer = await sharp(pngBuffer)
.webp({ quality: 80 }) // Good balance of quality and size
.toBuffer()
return webpBuffer.toString('base64')
}
interface McpToolResult {
content: Array<{
type: string
text?: string
data?: string
mimeType?: string
}>
isError?: boolean
}
const MCP_TIMEOUT_MS = 30000
// Scroll amount per unit (n1 recommends treating each amount as 10-15% of screen)
const SCROLL_PERCENT_PER_UNIT = 0.12 // 12% of viewport per scroll unit
export class ActionMapper {
private ctx: ActionContext
private cachedViewport: ScreenSize | null = null
// Track last clicked coordinates for type action (n1 type has no coords)
private lastClickCoordinates: { x: number; y: number } | null = null
constructor(ctx: ActionContext) {
this.ctx = ctx
}
// Store debug info about viewport detection for inclusion in responses
private viewportDebugInfo: string = ''
/**
* Get the actual browser viewport size via JavaScript
* This is critical for correct coordinate mapping:
* - Screenshot is scaled to 1028px width (aspect ratio preserved)
* - Clicks must be at actual viewport coordinates
* - We scale: (normalized/1000) * viewport
* Caches the result to avoid repeated calls
* Also stores debug info for troubleshooting
*/
async getViewportSize(): Promise<ScreenSize> {
if (this.cachedViewport) {
return this.cachedViewport
}
try {
const result = await this.callMcp('browser_execute_javascript', {
tabId: this.ctx.tabId,
windowId: this.ctx.windowId,
code: '[window.innerWidth, window.innerHeight]',
})
const textContent =
result.content.find((c) => c.type === 'text')?.text ?? ''
// Check for error in result
if (result.isError) {
this.viewportDebugInfo = `[VIEWPORT ERROR] JS execution failed: ${textContent}. Using fallback: ${this.ctx.screenSize.width}x${this.ctx.screenSize.height}`
console.warn(this.viewportDebugInfo)
return this.ctx.screenSize
}
// Parse array format - can be multiline: [1440, 900] or "Result: [\n 1200,\n 712\n]"
const arrayMatch = textContent.match(/\[\s*(\d+)\s*,\s*(\d+)\s*\]/s)
if (arrayMatch) {
const width = parseInt(arrayMatch[1], 10)
const height = parseInt(arrayMatch[2], 10)
if (width > 0 && height > 0) {
this.cachedViewport = { width, height }
this.viewportDebugInfo = `[VIEWPORT OK] Detected: ${width}x${height} (raw: "${textContent.substring(0, 100)}")`
console.log(this.viewportDebugInfo)
return this.cachedViewport
} else {
this.viewportDebugInfo = `[VIEWPORT PARSE ERROR] Invalid dimensions: ${width}x${height} from "${textContent}". Using fallback: ${this.ctx.screenSize.width}x${this.ctx.screenSize.height}`
console.warn(this.viewportDebugInfo)
}
} else {
this.viewportDebugInfo = `[VIEWPORT PARSE ERROR] Could not parse: "${textContent}". Using fallback: ${this.ctx.screenSize.width}x${this.ctx.screenSize.height}`
console.warn(this.viewportDebugInfo)
}
} catch (error) {
const errMsg = error instanceof Error ? error.message : String(error)
this.viewportDebugInfo = `[VIEWPORT EXCEPTION] ${errMsg}. Using fallback: ${this.ctx.screenSize.width}x${this.ctx.screenSize.height}`
console.warn(this.viewportDebugInfo)
}
// Fallback to config screenSize
return this.ctx.screenSize
}
/**
* Clear cached viewport (call when tab/window changes or before new task)
*/
clearViewportCache(): void {
this.cachedViewport = null
}
/**
* Reset all tracked state (call before starting a new task)
*/
reset(): void {
this.cachedViewport = null
this.lastClickCoordinates = null
}
/**
* Scale normalized coordinate (0-1000) to actual viewport pixel value
*
* How it works:
* - Screenshot is captured at 1028px width with preserved aspect ratio
* - n1 predicts normalized coords (0-1000) for that screenshot
* - Since aspect ratio is preserved, we can scale directly to viewport
* - Formula: actualX = (normalizedX / 1000) * viewport.innerWidth
*/
private async scaleCoordinates(
normalizedX: number,
normalizedY: number,
): Promise<{ x: number; y: number }> {
const viewport = await this.getViewportSize()
return {
x: Math.round((normalizedX / DEFAULTS.normalizedMax) * viewport.width),
y: Math.round((normalizedY / DEFAULTS.normalizedMax) * viewport.height),
}
}
/**
* Call an MCP tool
*/
private async callMcp(
name: string,
args: Record<string, unknown> = {},
): Promise<McpToolResult> {
const client = new Client({
name: 'yutori-navigator',
version: '1.0.0',
})
const transport = new StreamableHTTPClientTransport(
new URL(this.ctx.mcpUrl),
{
requestInit: {
headers: { 'X-BrowserOS-Source': 'yutori-navigator' },
},
},
)
try {
await client.connect(transport)
const toolCallPromise = client.callTool({ name, arguments: args })
let timeoutId: ReturnType<typeof setTimeout> | null = null
const timeoutPromise = new Promise<never>((_, reject) => {
timeoutId = setTimeout(
() =>
reject(
new Error(`MCP tool call timed out after ${MCP_TIMEOUT_MS}ms`),
),
MCP_TIMEOUT_MS,
)
})
try {
return (await Promise.race([
toolCallPromise,
timeoutPromise,
])) as McpToolResult
} finally {
if (timeoutId) clearTimeout(timeoutId)
}
} finally {
try {
await transport.close()
} catch {
// Ignore close errors
}
}
}
/**
* Execute an n1 action by mapping to MCP tools
* Prioritizes native MCP tools over browser_execute_javascript for reliability
* Returns the result message and optionally the stop answer
*/
async execute(
action: N1Action,
): Promise<{ success: boolean; message: string; stopAnswer?: string }> {
const { tabId, windowId } = this.ctx
try {
switch (action.action_type) {
case 'click': {
const [normX, normY] = action.center_coordinates
const viewport = await this.getViewportSize()
const { x, y } = await this.scaleCoordinates(normX, normY)
// Track coordinates for subsequent type action (n1 type has no coords)
this.lastClickCoordinates = { x, y }
await this.callMcp('browser_click_coordinates', {
tabId,
windowId,
x,
y,
})
// Return original coordinates + debug info
const debugInfo = `[DEBUG: input=(${normX},${normY}) → viewport=(${x},${y}), viewport=${viewport.width}x${viewport.height}] ${this.viewportDebugInfo}`
return {
success: true,
message: `Clicked at (${normX}, ${normY}). ${debugInfo}`,
}
}
case 'type': {
const { text, press_enter_after, clear_before_typing } = action
// n1 type action has no coordinates - it expects element to be focused
// Use last clicked coordinates with browser_type_at_coordinates
if (!this.lastClickCoordinates) {
// Fallback: click center of screen if no prior click
const viewport = await this.getViewportSize()
this.lastClickCoordinates = {
x: Math.round(viewport.width / 2),
y: Math.round(viewport.height / 2),
}
}
const { x, y } = this.lastClickCoordinates
// Clear field first if requested using native MCP tools
if (clear_before_typing) {
// Click the field to focus it before clearing
await this.callMcp('browser_click_coordinates', {
tabId,
windowId,
x,
y,
})
// Use Delete key to clear
await this.callMcp('browser_send_keys', {
tabId,
windowId,
key: 'Delete',
})
}
// Use browser_type_at_coordinates - the proper MCP tool for typing
await this.callMcp('browser_type_at_coordinates', {
tabId,
windowId,
x,
y,
text,
})
// Press Enter if requested using native MCP tool
if (press_enter_after) {
await this.callMcp('browser_send_keys', {
tabId,
windowId,
key: 'Enter',
})
}
// n1 type action has no coordinates - don't include viewport coords in response
return {
success: true,
message: `Typed "${text.substring(0, 50)}${text.length > 50 ? '...' : ''}"`,
}
}
case 'scroll': {
const { direction, center_coordinates, amount } = action
const [normX, normY] = center_coordinates
const { x, y } = await this.scaleCoordinates(normX, normY)
// Track coordinates
this.lastClickCoordinates = { x, y }
// Click at position first to focus element (for scrollable containers)
await this.callMcp('browser_click_coordinates', {
tabId,
windowId,
x,
y,
})
// For vertical scroll (up/down): use native MCP scroll tools
// For horizontal scroll (left/right): use JS (no MCP tool available)
if (direction === 'up' || direction === 'down') {
const scrollTool =
direction === 'up' ? 'browser_scroll_up' : 'browser_scroll_down'
// Calculate how many scroll calls based on amount
// n1 amount 1-2 = ~20% viewport, our tool = 100% viewport
// So we scroll once for small amounts, more for larger
const scrollCount = Math.max(1, Math.round(amount / 5))
for (let i = 0; i < scrollCount; i++) {
await this.callMcp(scrollTool, { tabId, windowId })
// Small delay between scrolls for stability
if (i < scrollCount - 1) {
await new Promise((r) => setTimeout(r, 100))
}
}
// Return original normalized coordinates
return {
success: true,
message: `Scrolled ${direction} at (${normX}, ${normY})`,
}
} else {
// Horizontal scroll - no MCP tool, use JS
const viewport = await this.getViewportSize()
const scrollPixels = Math.round(
amount * SCROLL_PERCENT_PER_UNIT * viewport.width,
)
const scrollCode =
direction === 'left'
? `window.scrollBy(-${scrollPixels}, 0)`
: `window.scrollBy(${scrollPixels}, 0)`
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: scrollCode,
})
// Return original normalized coordinates
return {
success: true,
message: `Scrolled ${direction} at (${normX}, ${normY})`,
}
}
}
case 'key_press': {
const { key_comb } = action
// Map keys to browser_send_keys supported keys
// browser_send_keys supports: Enter, Delete, Backspace, Tab, Escape,
// ArrowUp, ArrowDown, ArrowLeft, ArrowRight, Home, End, PageUp, PageDown
const keyMap: Record<string, string> = {
Enter: 'Enter',
Escape: 'Escape',
Tab: 'Tab',
Backspace: 'Backspace',
Delete: 'Delete',
ArrowUp: 'ArrowUp',
ArrowDown: 'ArrowDown',
ArrowLeft: 'ArrowLeft',
ArrowRight: 'ArrowRight',
Home: 'Home',
End: 'End',
PageUp: 'PageUp',
PageDown: 'PageDown',
// Alternative names n1 might use
Return: 'Enter',
Esc: 'Escape',
Up: 'ArrowUp',
Down: 'ArrowDown',
Left: 'ArrowLeft',
Right: 'ArrowRight',
}
const mappedKey = keyMap[key_comb]
if (mappedKey) {
// Use native MCP tool
await this.callMcp('browser_send_keys', {
tabId,
windowId,
key: mappedKey,
})
} else {
// For complex key combinations (Ctrl+A, etc.), use JavaScript
const parts = key_comb.split('+')
const mainKey = parts.pop() || ''
const modifiers = parts.map((p) => p.toLowerCase())
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: `
const event = new KeyboardEvent('keydown', {
key: '${mainKey}',
code: 'Key${mainKey.toUpperCase()}',
ctrlKey: ${modifiers.includes('control') || modifiers.includes('ctrl')},
shiftKey: ${modifiers.includes('shift')},
altKey: ${modifiers.includes('alt')},
metaKey: ${modifiers.includes('meta') || modifiers.includes('cmd')},
bubbles: true
});
document.activeElement?.dispatchEvent(event);
`,
})
}
return { success: true, message: `Pressed ${key_comb}` }
}
case 'hover': {
// No dedicated MCP hover tool - use JS
const [normX, normY] = action.center_coordinates
const { x, y } = await this.scaleCoordinates(normX, normY)
// Track coordinates
this.lastClickCoordinates = { x, y }
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: `
const elem = document.elementFromPoint(${x}, ${y});
if (elem) {
const event = new MouseEvent('mouseover', {
bubbles: true,
clientX: ${x},
clientY: ${y}
});
elem.dispatchEvent(event);
}
`,
})
// Return original normalized coordinates
return { success: true, message: `Hovered at (${normX}, ${normY})` }
}
case 'drag': {
// No dedicated MCP drag tool - use JS
const [startNormX, startNormY] = action.start_coordinates
const [endNormX, endNormY] = action.center_coordinates
const start = await this.scaleCoordinates(startNormX, startNormY)
const end = await this.scaleCoordinates(endNormX, endNormY)
// Track end coordinates
this.lastClickCoordinates = end
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: `
const startElem = document.elementFromPoint(${start.x}, ${start.y});
const endElem = document.elementFromPoint(${end.x}, ${end.y});
if (startElem && endElem) {
const dragStart = new DragEvent('dragstart', {
bubbles: true,
clientX: ${start.x},
clientY: ${start.y}
});
const drop = new DragEvent('drop', {
bubbles: true,
clientX: ${end.x},
clientY: ${end.y}
});
const dragEnd = new DragEvent('dragend', { bubbles: true });
startElem.dispatchEvent(dragStart);
endElem.dispatchEvent(drop);
startElem.dispatchEvent(dragEnd);
}
`,
})
// Return original normalized coordinates
return {
success: true,
message: `Dragged from (${startNormX}, ${startNormY}) to (${endNormX}, ${endNormY})`,
}
}
case 'wait': {
// n1 uses this for page loads
await new Promise((resolve) => setTimeout(resolve, 2000))
return { success: true, message: 'Waited 2 seconds' }
}
case 'refresh': {
// No dedicated MCP refresh tool - use JS
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: 'location.reload()',
})
// Wait for page to start reloading
await new Promise((resolve) => setTimeout(resolve, 1000))
return { success: true, message: 'Refreshed page' }
}
case 'go_back': {
// No dedicated MCP go_back tool - use JS
await this.callMcp('browser_execute_javascript', {
tabId,
windowId,
code: 'history.back()',
})
return { success: true, message: 'Navigated back' }
}
case 'goto_url': {
// Use native MCP navigate tool
await this.callMcp('browser_navigate', {
tabId,
windowId,
url: action.url,
})
return { success: true, message: `Navigated to ${action.url}` }
}
case 'read_texts_and_links': {
// Use native MCP tool
const result = await this.callMcp('browser_get_page_content', {
tabId,
windowId,
type: 'text-with-links',
})
const content =
result.content.find((c) => c.type === 'text')?.text ?? ''
return {
success: true,
message: `Read page content (${content.length} chars)`,
}
}
case 'stop': {
// Stop action - task is complete, return the answer
return {
success: true,
message: 'Task completed',
stopAnswer: action.answer,
}
}
default: {
const _exhaustive: never = action
return {
success: false,
message: `Unknown action: ${JSON.stringify(action)}`,
}
}
}
} catch (error) {
const message = error instanceof Error ? error.message : String(error)
return { success: false, message: `Action failed: ${message}` }
}
}
/**
* Capture a screenshot via MCP with retry logic
*
* Uses Yutori's recommended screenshot size (1280x800) for optimal model performance.
* Now that viewport detection is working correctly, the coordinate mapping will be accurate.
*
* Returns WebP base64 string
*/
async captureScreenshot(retries = 2): Promise<string | null> {
const { width, height } = DEFAULTS.screenshotSize
for (let attempt = 0; attempt <= retries; attempt++) {
try {
const result = await this.callMcp('browser_get_screenshot', {
tabId: this.ctx.tabId,
windowId: this.ctx.windowId,
width,
height,
showHighlights: false,
})
if (result.isError) {
const errorText =
result.content?.find((c) => c.type === 'text')?.text ??
'Unknown error'
if (attempt < retries) {
console.warn(
`Screenshot attempt ${attempt + 1} failed: ${errorText}, retrying...`,
)
await new Promise((r) => setTimeout(r, 500))
continue
}
console.warn('Screenshot capture failed:', errorText)
return null
}
const imageContent = result.content.find((c) => c.type === 'image')
if (imageContent?.data) {
// Convert PNG to WebP for smaller payload (n1 recommends WebP)
try {
const webpBase64 = await convertToWebP(imageContent.data)
return webpBase64
} catch (conversionError) {
console.warn('WebP conversion failed, using PNG:', conversionError)
return imageContent.data
}
}
if (attempt < retries) {
console.warn(
`Screenshot attempt ${attempt + 1}: No image data, retrying...`,
)
await new Promise((r) => setTimeout(r, 500))
continue
}
return null
} catch (error) {
if (attempt < retries) {
console.warn(
`Screenshot attempt ${attempt + 1} error:`,
error,
'retrying...',
)
await new Promise((r) => setTimeout(r, 500))
continue
}
console.warn('Screenshot capture error:', error)
return null
}
}
return null
}
/**
* Get current page URL via MCP
*/
async getCurrentUrl(): Promise<string> {
try {
const result = await this.callMcp('browser_execute_javascript', {
tabId: this.ctx.tabId,
windowId: this.ctx.windowId,
code: 'window.location.href',
})
const textContent =
result.content.find((c) => c.type === 'text')?.text ?? ''
const urlMatch = textContent.match(/Result:\s*"?([^"\n]+)"?/)
return urlMatch?.[1] ?? 'unknown'
} catch {
return 'unknown'
}
}
}

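The deleted ActionMapper's central trick is `scaleCoordinates`: n1 emits coordinates on a 0–1000 grid with the screenshot's aspect ratio preserved, so scaling to the live viewport is a straight proportion. A condensed sketch (assuming `DEFAULTS.normalizedMax` is 1000, as the file's header comment states):

```typescript
// n1's normalized grid spans 0-1000 (DEFAULTS.normalizedMax in the
// deleted file), so a coordinate maps to pixels by simple proportion.
const NORMALIZED_MAX = 1000

function scaleToViewport(norm: number, viewportDimension: number): number {
  return Math.round((norm / NORMALIZED_MAX) * viewportDimension)
}

// For the 1200x712 viewport seen in the parse-comment example:
const x = scaleToViewport(500, 1200) // 600 — horizontal center
const y = scaleToViewport(1000, 712) // 712 — bottom edge
```

This only holds because the screenshot keeps the viewport's aspect ratio; if it were letterboxed or cropped, x and y would need independent correction.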

@@ -1,353 +0,0 @@
/**
* Yutori Navigator n1 Agent
*
* Implements the agent loop that calls Yutori n1 API and executes actions.
* Uses UIMessageStreamEvent format for logging compatibility.
*
* n1 API follows OpenAI Chat Completions interface with special 'observation' role
* for screenshots. Full conversation history must be maintained.
*/
import { randomUUID } from 'node:crypto'
import { ActionMapper } from './action-mapper'
import {
DEFAULTS,
type N1Action,
type N1ChatCompletionResponse,
type N1Message,
N1ResponseSchema,
YUTORI_API_BASE,
type YutoriNavigatorAgentConfig,
} from './types'
interface StreamWriter {
write: (data: string) => Promise<void>
}
type ActionHook = (
action: N1Action,
result: { success: boolean; message: string },
) => Promise<void>
/**
* Emit SSE-formatted UIMessageStreamEvent
*/
function emitEvent(
writer: StreamWriter,
event: Record<string, unknown>,
): Promise<void> {
return writer.write(`data: ${JSON.stringify(event)}\n\n`)
}
export class YutoriNavigatorAgent {
private config: YutoriNavigatorAgentConfig
private actionMapper: ActionMapper
private actionHook?: ActionHook
private messages: N1Message[] = []
constructor(config: YutoriNavigatorAgentConfig) {
this.config = config
this.actionMapper = new ActionMapper({
mcpUrl: config.mcpUrl,
tabId: config.tabId,
windowId: config.windowId,
screenSize: config.screenSize,
})
}
/**
* Set a hook to be called after each action execution
*/
setActionHook(hook: ActionHook): void {
this.actionHook = hook
}
/**
* Build observation message with screenshot and optional URL
*/
private buildObservationMessage(
screenshotBase64: string,
currentUrl?: string,
): N1Message {
const content: N1Message['content'] = []
// Include URL if available (recommended by Yutori for better attribution)
if (currentUrl) {
content.push({
type: 'text',
text: `Current URL: ${currentUrl}`,
})
}
// Add screenshot as base64 data URL (WebP for smaller payload)
content.push({
type: 'image_url',
image_url: {
url: `data:image/webp;base64,${screenshotBase64}`,
},
})
return {
role: 'observation',
content,
}
}
/**
* Call the Yutori n1 API
*/
private async callN1Api(): Promise<N1ChatCompletionResponse> {
const url = `${YUTORI_API_BASE}/chat/completions`
const response = await fetch(url, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
Authorization: `Bearer ${this.config.apiKey}`,
},
body: JSON.stringify({
model: DEFAULTS.model,
messages: this.messages,
temperature: DEFAULTS.temperature,
}),
})
if (!response.ok) {
const errorBody = await response.text()
throw new Error(
`Yutori n1 API error: ${response.status} ${response.statusText} - ${errorBody}`,
)
}
return response.json()
}
/**
* Parse n1 response content to extract thoughts and actions
*/
private parseN1Response(
content: string,
): { thoughts: string; actions: N1Action[] } | null {
try {
const parsed = JSON.parse(content)
const validated = N1ResponseSchema.safeParse(parsed)
if (validated.success) {
return validated.data
}
console.warn('n1 response validation failed:', validated.error.message)
// Try to extract what we can
return {
thoughts: parsed.thoughts ?? '',
actions: Array.isArray(parsed.actions) ? parsed.actions : [],
}
} catch (error) {
console.warn('Failed to parse n1 response:', error)
return null
}
}
/**
* Execute the agent loop
*/
async execute(
query: string,
streamWriter: StreamWriter,
signal: AbortSignal,
): Promise<{ finalText: string | null; totalActions: number }> {
let totalActions = 0
let finalText: string | null = null
// Wait for page to stabilize before first screenshot
await new Promise((resolve) => setTimeout(resolve, 2000))
// Capture initial screenshot with retries
let initialScreenshot: string | null = null
for (let attempt = 1; attempt <= 3; attempt++) {
initialScreenshot = await this.actionMapper.captureScreenshot()
if (initialScreenshot) break
console.warn(`Initial screenshot attempt ${attempt} failed, retrying...`)
await new Promise((resolve) => setTimeout(resolve, 1000))
}
if (!initialScreenshot) {
throw new Error('Failed to capture initial screenshot after 3 attempts')
}
// Get initial URL
const initialUrl = await this.actionMapper.getCurrentUrl()
// Build initial messages
// 1. User message with task
this.messages.push({
role: 'user',
content: [{ type: 'text', text: query }],
})
// 2. Initial observation with screenshot
this.messages.push(
this.buildObservationMessage(initialScreenshot, initialUrl),
)
// Emit start event
const messageId = randomUUID()
await emitEvent(streamWriter, { type: 'start', messageId })
let finished = false
for (let turn = 0; turn < this.config.turnLimit; turn++) {
if (signal.aborted) {
await emitEvent(streamWriter, { type: 'abort' })
break
}
// Start step (turn)
await emitEvent(streamWriter, { type: 'start-step' })
// Call n1 API
let response: N1ChatCompletionResponse
try {
response = await this.callN1Api()
} catch (error) {
const errorMsg = error instanceof Error ? error.message : String(error)
await emitEvent(streamWriter, {
type: 'error',
errorText: `API error: ${errorMsg}`,
})
throw error
}
// Extract response content
const choice = response.choices?.[0]
if (!choice?.message?.content) {
await emitEvent(streamWriter, {
type: 'error',
errorText: 'Empty response from n1 API',
})
throw new Error('Empty response from n1 API')
}
const assistantContent = choice.message.content
// Parse the JSON response
const parsed = this.parseN1Response(assistantContent)
if (!parsed) {
await emitEvent(streamWriter, {
type: 'error',
errorText: 'Failed to parse n1 response',
})
throw new Error('Failed to parse n1 response')
}
const { thoughts, actions } = parsed
// Emit thoughts as text
if (thoughts) {
finalText = thoughts
const textId = randomUUID()
await emitEvent(streamWriter, { type: 'text-start', id: textId })
await emitEvent(streamWriter, {
type: 'text-delta',
id: textId,
delta: thoughts,
})
await emitEvent(streamWriter, { type: 'text-end', id: textId })
}
// Check for stop action or no actions
const stopAction = actions.find((a) => a.action_type === 'stop')
if (stopAction && stopAction.action_type === 'stop') {
finalText = stopAction.answer
await emitEvent(streamWriter, { type: 'finish-step' })
await emitEvent(streamWriter, {
type: 'finish',
finishReason: 'completed',
})
finished = true
break
}
if (actions.length === 0) {
await emitEvent(streamWriter, { type: 'finish-step' })
await emitEvent(streamWriter, {
type: 'finish',
finishReason: 'completed',
})
finished = true
break
}
// Add assistant response to conversation history
this.messages.push({
role: 'assistant',
content: assistantContent,
})
// Execute each action
for (const action of actions) {
if (signal.aborted) break
// Skip stop actions (handled above)
if (action.action_type === 'stop') continue
const toolCallId = randomUUID()
// Tool input events
await emitEvent(streamWriter, {
type: 'tool-input-start',
toolCallId,
toolName: action.action_type,
})
await emitEvent(streamWriter, {
type: 'tool-input-available',
toolCallId,
toolName: action.action_type,
input: action,
})
const result = await this.actionMapper.execute(action)
totalActions++
// Check if this was a stop action that returned an answer
if (result.stopAnswer) {
finalText = result.stopAnswer
}
// Tool output event
await emitEvent(streamWriter, {
type: 'tool-output-available',
toolCallId,
output: result,
})
// Call action hook (for screenshot capture)
if (this.actionHook) {
await this.actionHook(action, result)
}
}
// Capture new screenshot and URL for next turn
const newScreenshot = await this.actionMapper.captureScreenshot()
const currentUrl = await this.actionMapper.getCurrentUrl()
// Add observation for next turn (n1 requires full history)
if (newScreenshot) {
this.messages.push(
this.buildObservationMessage(newScreenshot, currentUrl),
)
}
// Finish step (turn)
await emitEvent(streamWriter, { type: 'finish-step' })
}
if (!finished && !signal.aborted) {
await emitEvent(streamWriter, {
type: 'finish',
finishReason: 'max_turns',
})
}
return { finalText, totalActions }
}
}
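For readers skimming the diff, the conversation protocol the loop maintains is easy to miss: n1 expects the full turn-by-turn history on every call — the user task, then alternating observation and assistant messages. A sketch of the history shape (the task text, URL, and base64 payload below are placeholder values, not taken from the repo):

```typescript
// Illustrative history sent to n1 each turn (placeholder values).
// Observations carry the current URL plus a WebP screenshot as a data URL;
// assistant turns are JSON strings of {thoughts, actions}.
const history = [
  { role: 'user', content: [{ type: 'text', text: 'placeholder task' }] },
  {
    role: 'observation',
    content: [
      { type: 'text', text: 'Current URL: https://example.com' },
      { type: 'image_url', image_url: { url: 'data:image/webp;base64,AAAA' } },
    ],
  },
  { role: 'assistant', content: '{"thoughts":"...","actions":[]}' },
]
```

Because the whole history is resent on every call, payload size grows with each screenshot — one reason the agent encodes screenshots as WebP.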


@@ -1,97 +0,0 @@
/**
* Yutori Navigator Evaluator
* Implements AgentEvaluator interface for the eval framework
*/
import { DEFAULT_TIMEOUT_MS } from '../../constants'
import type { TaskMetadata, YutoriNavigatorConfig } from '../../types'
import { resolveEnvValue } from '../../utils/resolve-env'
import { withEvalTimeout } from '../../utils/with-eval-timeout'
import type { AgentContext, AgentEvaluator, AgentResult } from '../types'
import { YutoriNavigatorAgent } from './agent'
import { DEFAULTS } from './types'
export class YutoriNavigatorEvaluator implements AgentEvaluator {
constructor(private ctx: AgentContext) {}
async execute(): Promise<AgentResult> {
const { config, task, capture, windowId = 0, tabId = 0 } = this.ctx
const agentConfig = config.agent as YutoriNavigatorConfig
const startTime = Date.now()
const timeoutMs = config.timeout_ms ?? DEFAULT_TIMEOUT_MS
await capture.messageLogger.logUser(task.query)
const apiKey = resolveEnvValue(agentConfig.apiKey)
if (!apiKey) {
throw new Error(
`API key not found. Set the ${agentConfig.apiKey} environment variable or provide the key directly.`,
)
}
const agent = new YutoriNavigatorAgent({
apiKey,
turnLimit: agentConfig.turnLimit ?? DEFAULTS.turnLimit,
screenSize: agentConfig.screenSize ?? DEFAULTS.screenSize,
tabId,
windowId,
mcpUrl: `${config.browseros.server_url}/mcp`,
})
agent.setActionHook(async (_action, _result) => {
try {
await capture.screenshot.capture(capture.getActivePageId())
} catch (err) {
console.warn('Screenshot capture failed in hook:', err)
}
})
const streamWriter = capture.createStreamWriter()
let finalText: string | null = null
let totalActions = 0
const { terminationReason } = await withEvalTimeout(
timeoutMs,
capture,
async (signal) => {
const result = await agent.execute(task.query, streamWriter, signal)
finalText = result.finalText
totalActions = result.totalActions
return result
},
)
const endTime = Date.now()
const metadata: TaskMetadata = {
query_id: task.query_id,
dataset: task.dataset,
query: task.query,
started_at: new Date(startTime).toISOString(),
completed_at: new Date(endTime).toISOString(),
total_duration_ms: endTime - startTime,
total_steps: totalActions,
termination_reason: terminationReason,
final_answer: finalText ?? capture.getLastAssistantText(),
errors: capture.getErrors(),
warnings: capture.getWarnings(),
agent_config: {
type: 'yutori-navigator',
model: DEFAULTS.model,
turnLimit: agentConfig.turnLimit ?? DEFAULTS.turnLimit,
screenSize: agentConfig.screenSize ?? DEFAULTS.screenSize,
},
grader_results: {},
}
await capture.trajectorySaver.saveMetadata(metadata)
return {
metadata,
messages: capture.getMessages(),
finalAnswer: finalText ?? capture.getLastAssistantText(),
}
}
}


@@ -1,158 +0,0 @@
/**
* Types for Yutori Navigator n1 agent
*
* n1 is a pixels-to-actions LLM that follows OpenAI Chat Completions interface.
* Coordinates are normalized to 1000x1000 grid.
* Recommended screenshot size: 1280x800 (WXGA 16:10)
*/
import { z } from 'zod'
// n1 action schemas based on API documentation
export const N1ActionSchema = z.discriminatedUnion('action_type', [
z.object({
action_type: z.literal('click'),
center_coordinates: z.tuple([z.number(), z.number()]),
}),
z.object({
action_type: z.literal('scroll'),
direction: z.enum(['up', 'down', 'left', 'right']),
center_coordinates: z.tuple([z.number(), z.number()]),
amount: z.number().int().min(1).max(10),
}),
z.object({
action_type: z.literal('type'),
text: z.string(),
press_enter_after: z.boolean().optional(),
clear_before_typing: z.boolean().optional(),
}),
z.object({
action_type: z.literal('key_press'),
key_comb: z.string(), // Playwright keyboard press format
}),
z.object({
action_type: z.literal('hover'),
center_coordinates: z.tuple([z.number(), z.number()]),
}),
z.object({
action_type: z.literal('drag'),
start_coordinates: z.tuple([z.number(), z.number()]),
center_coordinates: z.tuple([z.number(), z.number()]), // destination
}),
z.object({
action_type: z.literal('wait'),
}),
z.object({
action_type: z.literal('refresh'),
}),
z.object({
action_type: z.literal('go_back'),
}),
z.object({
action_type: z.literal('goto_url'),
url: z.string(),
}),
z.object({
action_type: z.literal('read_texts_and_links'),
}),
z.object({
action_type: z.literal('stop'),
answer: z.string(),
}),
])
export type N1Action = z.infer<typeof N1ActionSchema>
// n1 API response format
export const N1ResponseSchema = z.object({
thoughts: z.string(),
actions: z.array(N1ActionSchema),
})
export type N1Response = z.infer<typeof N1ResponseSchema>
// Screen size configuration
export interface ScreenSize {
width: number
height: number
}
// Context for action execution
export interface ActionContext {
mcpUrl: string
tabId: number
windowId: number
screenSize: ScreenSize
}
// OpenAI-compatible message types for n1 API
export type N1MessageRole = 'user' | 'assistant' | 'observation'
export interface N1TextContent {
type: 'text'
text: string
}
export interface N1ImageContent {
type: 'image_url'
image_url: {
url: string // Can be URL or data:image/webp;base64,...
}
}
export type N1ContentPart = N1TextContent | N1ImageContent
export interface N1Message {
role: N1MessageRole
content: string | N1ContentPart[]
}
export interface N1ChatCompletionRequest {
model: string
messages: N1Message[]
temperature?: number
}
export interface N1ChatCompletionResponse {
id: string
object: string
created: number
model: string
choices: Array<{
index: number
message: {
role: 'assistant'
content: string // JSON string containing N1Response
}
finish_reason: string
}>
usage?: {
prompt_tokens: number
completion_tokens: number
total_tokens: number
}
}
// Agent configuration
export interface YutoriNavigatorAgentConfig {
apiKey: string
turnLimit: number
screenSize: ScreenSize
tabId: number
windowId: number
mcpUrl: string
}
// Defaults based on Yutori documentation
export const DEFAULTS = {
// WXGA 16:10 - Yutori's recommended screenshot size
screenshotSize: { width: 1280, height: 800 },
screenSize: { width: 1280, height: 800 },
turnLimit: 30,
model: 'n1-preview-2025-11',
temperature: 0.3,
// n1 uses 1000x1000 normalized coordinate system
normalizedMax: 1000,
} as const
export const YUTORI_API_BASE = 'https://api.yutori.com/v1'
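The `normalizedMax` constant implies a simple linear mapping from n1's 1000x1000 grid to screen pixels. A minimal sketch of that conversion, assuming the recommended 1280x800 screen size (the `toPixels` helper is illustrative, not part of this codebase):

```typescript
const NORMALIZED_MAX = 1000

// Map n1's normalized [x, y] (0-1000 on each axis) to screen pixels.
function toPixels(
  [nx, ny]: [number, number],
  screen: { width: number; height: number },
): { x: number; y: number } {
  return {
    x: Math.round((nx / NORMALIZED_MAX) * screen.width),
    y: Math.round((ny / NORMALIZED_MAX) * screen.height),
  }
}

// A click at normalized (500, 500) lands at the center of a 1280x800 screen:
// toPixels([500, 500], { width: 1280, height: 800 }) → { x: 640, y: 400 }
```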


@@ -190,8 +190,6 @@
<select id="cfg-agent-type" onchange="onAgentTypeChange(this.value)">
<option value="single">Single Agent</option>
<option value="orchestrator-executor">Orchestrator-Executor</option>
<option value="gemini-computer-use">Gemini Computer Use</option>
<option value="yutori-navigator">Yutori Navigator</option>
</select>
</div>
@@ -280,50 +278,6 @@
</div>
</div>
<!-- Gemini Computer Use fields -->
<div class="agent-fields" id="fields-gemini-computer-use">
<div class="config-field">
<label>API Key <span class="req">*</span></label>
<input type="password" id="cfg-gemini-apikey" placeholder="GOOGLE_AI_API_KEY">
</div>
<div class="config-row">
<div class="config-field">
<label>Screen Width</label>
<input type="number" id="cfg-gemini-width" value="1440" min="800" max="2560">
</div>
<div class="config-field">
<label>Screen Height</label>
<input type="number" id="cfg-gemini-height" value="900" min="600" max="1440">
</div>
</div>
<div class="config-field">
<label>Turn Limit</label>
<input type="number" id="cfg-gemini-turns" value="30" min="1" max="100">
</div>
</div>
<!-- Yutori Navigator fields -->
<div class="agent-fields" id="fields-yutori-navigator">
<div class="config-field">
<label>API Key <span class="req">*</span></label>
<input type="password" id="cfg-yutori-apikey" placeholder="YUTORI_API_KEY">
</div>
<div class="config-row">
<div class="config-field">
<label>Screen Width</label>
<input type="number" id="cfg-yutori-width" value="1280" min="800" max="2560">
</div>
<div class="config-field">
<label>Screen Height</label>
<input type="number" id="cfg-yutori-height" value="800" min="600" max="1440">
</div>
</div>
<div class="config-field">
<label>Turn Limit</label>
<input type="number" id="cfg-yutori-turns" value="30" min="1" max="100">
</div>
</div>
</div>
<!-- Infrastructure (center) -->
@@ -420,23 +374,10 @@
<label>Graders</label>
<div style="display: flex; flex-direction: column; gap: 4px; margin-top: 2px;">
<div class="config-field-inline"><input type="checkbox" id="cfg-grader-performance" value="performance_grader"><label for="cfg-grader-performance">Performance Grader</label></div>
<div class="config-field-inline"><input type="checkbox" id="cfg-grader-webvoyager" value="webvoyager_grader"><label for="cfg-grader-webvoyager">WebVoyager Grader</label></div>
<div class="config-field-inline"><input type="checkbox" id="cfg-grader-fara" value="fara_combined"><label for="cfg-grader-fara">Fara Combined</label></div>
<div class="config-field-inline"><input type="checkbox" id="cfg-grader-mind2web" value="mind2web_judge"><label for="cfg-grader-mind2web">Mind2Web Judge</label></div>
<div class="config-field-inline"><input type="checkbox" id="cfg-grader-agisdk" value="agisdk_state_diff"><label for="cfg-grader-agisdk">AGI SDK State Diff</label></div>
<div class="config-field-inline"><input type="checkbox" id="cfg-grader-infinity" value="infinity_state"><label for="cfg-grader-infinity">Infinity State</label></div>
</div>
</div>
<div class="config-field">
<label>Grader Model</label>
<input type="text" id="cfg-grader-model" placeholder="e.g. openai/gpt-4.1">
</div>
<div class="config-field">
<label>Grader API Key</label>
<input type="password" id="cfg-grader-key-env" placeholder="Key or env var e.g. OPENROUTER_API_KEY">
</div>
<div class="config-field">
<label>Grader Base URL</label>
<input type="text" id="cfg-grader-baseurl" placeholder="https://openrouter.ai/api/v1">
</div>
</div>
<!-- Actions bar (full width) -->
@@ -514,7 +455,7 @@ let passCount = 0;
let failCount = 0;
let loadedConfigName = null;
-const PASS_FAIL_GRADER_ORDER = ['performance_grader', 'webvoyager_grader', 'fara_combined', 'fara_grader'];
+const PASS_FAIL_GRADER_ORDER = ['agisdk_state_diff', 'infinity_state', 'performance_grader'];
function getPrimaryGrader(graderResults) {
for (const name of PASS_FAIL_GRADER_ORDER) {
if (graderResults[name]) return graderResults[name];
@@ -751,20 +692,6 @@ function fillForm(cfg) {
setVal('cfg-exec-model', exec.model);
setVal('cfg-exec-apikey', exec.apiKey);
setVal('cfg-exec-baseurl', exec.baseUrl);
} else if (type === 'gemini-computer-use') {
setVal('cfg-gemini-apikey', agent.apiKey);
if (agent.screenSize) {
setVal('cfg-gemini-width', agent.screenSize.width);
setVal('cfg-gemini-height', agent.screenSize.height);
}
setVal('cfg-gemini-turns', agent.turnLimit);
} else if (type === 'yutori-navigator') {
setVal('cfg-yutori-apikey', agent.apiKey);
if (agent.screenSize) {
setVal('cfg-yutori-width', agent.screenSize.width);
setVal('cfg-yutori-height', agent.screenSize.height);
}
setVal('cfg-yutori-turns', agent.turnLimit);
}
// Infrastructure
@@ -797,17 +724,13 @@ function fillForm(cfg) {
// Grader checkboxes
const graderMap = {
'performance_grader': 'cfg-grader-performance',
'webvoyager_grader': 'cfg-grader-webvoyager',
'fara_combined': 'cfg-grader-fara',
'mind2web_judge': 'cfg-grader-mind2web',
'agisdk_state_diff': 'cfg-grader-agisdk',
'infinity_state': 'cfg-grader-infinity',
};
const configGraders = cfg.graders || [];
for (const [name, id] of Object.entries(graderMap)) {
document.getElementById(id).checked = configGraders.includes(name);
}
setVal('cfg-grader-model', cfg.grader_model);
setVal('cfg-grader-key-env', cfg.grader_api_key_env);
setVal('cfg-grader-baseurl', cfg.grader_base_url);
}
function setVal(id, val) {
@@ -848,26 +771,6 @@ function buildConfigFromForm() {
baseUrl: getVal('cfg-exec-baseurl') || undefined,
},
};
} else if (type === 'gemini-computer-use') {
agent = {
type: 'gemini-computer-use',
apiKey: getVal('cfg-gemini-apikey'),
screenSize: {
width: parseInt(getVal('cfg-gemini-width'), 10) || 1440,
height: parseInt(getVal('cfg-gemini-height'), 10) || 900,
},
turnLimit: parseInt(getVal('cfg-gemini-turns'), 10) || 30,
};
} else if (type === 'yutori-navigator') {
agent = {
type: 'yutori-navigator',
apiKey: getVal('cfg-yutori-apikey'),
screenSize: {
width: parseInt(getVal('cfg-yutori-width'), 10) || 1280,
height: parseInt(getVal('cfg-yutori-height'), 10) || 800,
},
turnLimit: parseInt(getVal('cfg-yutori-turns'), 10) || 30,
};
}
// Dataset: use dropdown value unless custom is selected
@@ -894,16 +797,10 @@ function buildConfigFromForm() {
if (outputDir) config.output_dir = outputDir;
const timeoutMs = parseInt(getVal('cfg-timeout'), 10);
if (timeoutMs) config.timeout_ms = timeoutMs;
-const selectedGraders = ['cfg-grader-performance', 'cfg-grader-webvoyager', 'cfg-grader-fara', 'cfg-grader-mind2web']
+const selectedGraders = ['cfg-grader-performance', 'cfg-grader-agisdk', 'cfg-grader-infinity']
.filter(id => document.getElementById(id).checked)
.map(id => document.getElementById(id).value);
if (selectedGraders.length) config.graders = selectedGraders;
const graderModel = getVal('cfg-grader-model');
if (graderModel) config.grader_model = graderModel;
const graderKeyEnv = getVal('cfg-grader-key-env');
if (graderKeyEnv) config.grader_api_key_env = graderKeyEnv;
const graderBaseUrl = getVal('cfg-grader-baseurl');
if (graderBaseUrl) config.grader_base_url = graderBaseUrl;
return config;
}
@@ -1417,8 +1314,6 @@ function renderGraderPanel() {
let bodyHtml = '';
if (primaryName === 'performance_grader') {
bodyHtml = renderPerformanceGrader(primaryResult);
} else if (primaryName === 'fara_combined' || primaryName === 'fara_grader') {
bodyHtml = renderFaraCombined(primaryResult);
} else {
bodyHtml = renderGenericGrader(primaryResult);
}
@@ -1477,35 +1372,6 @@ function renderPerformanceGrader(result) {
return html;
}
function renderFaraCombined(result) {
const details = result.details || {};
const verifiers = details.verifiers;
const voting = details.votingResult;
if (!verifiers || typeof verifiers !== 'object') {
return renderGenericGrader(result);
}
let html = '';
if (voting) {
html += `<div style="font-size:11px;color:#8b949e;margin-bottom:8px">Majority vote: ${voting.passCount}/${voting.totalVerifiers} passed &rarr; <strong style="color:${voting.decision === 'PASS' ? '#3fb950' : '#f85149'}">${voting.decision}</strong></div>`;
}
html += '<div class="grader-verifiers">';
for (const [name, v] of Object.entries(verifiers)) {
const badge = v.pass ? '<span class="grader-verifier-badge pass">PASS</span>' : '<span class="grader-verifier-badge fail">FAIL</span>';
const score = typeof v.score === 'number' ? `${(v.score * 100).toFixed(0)}%` : '';
const label = name.charAt(0).toUpperCase() + name.slice(1);
html += `
<div class="grader-verifier">
<span class="grader-verifier-name">${label}</span>
${badge}
<span style="font-size:11px;color:#8b949e;margin-left:auto">${score}</span>
</div>
`;
}
html += '</div>';
return html;
}
function renderGenericGrader(result) {
const reasoning = result.reasoning || '';
if (!reasoning) return '';


@@ -4,7 +4,6 @@ import { Hono } from 'hono'
import { streamSSE } from 'hono/streaming'
import { ParallelExecutor } from '../runner/parallel-executor'
import { loadTasks } from '../runner/task-loader'
import { resolveGraderOptions } from '../runner/types'
import { EvalConfigSchema, type Task } from '../types'
// ============================================================================
@@ -431,14 +430,11 @@ app.post('/api/run', async (c) => {
const configLabel = body.configName || 'dashboard'
dashboardState.init(tasks, configLabel, config.agent.type, outputDir)
const graderOptions = resolveGraderOptions(config)
// Run eval in background — don't await
const executor = new ParallelExecutor({
numWorkers: config.num_workers || 1,
config,
outputDir,
graderOptions,
restartServerPerTask: config.restart_server_per_task,
onEvent: (taskId, event) =>
dashboardState.broadcastStreamEvent(taskId, event),


@@ -0,0 +1,202 @@
import { spawn } from 'node:child_process'
import { join } from 'node:path'
import type { GraderResult } from '../../types'
import { callMcpTool } from '../../utils/mcp-client'
import type { Grader, GraderInput } from '../types'
const EVAL_SCRIPT = join(
import.meta.dirname,
'..',
'..',
'..',
'scripts',
'agisdk-evaluate.py',
)
export class AgisdkStateDiffGrader implements Grader {
name = 'agisdk_state_diff'
async grade(input: GraderInput): Promise<GraderResult> {
const taskId = this.extractTaskId(input.task.query_id)
const startUrl = this.extractStartUrl(input)
const mcpEndpoint =
input.mcpUrl ||
`${process.env.BROWSEROS_SERVER_URL || 'http://127.0.0.1:9110'}/mcp`
if (!startUrl) {
return {
score: 0,
pass: false,
reasoning: 'Could not determine clone site URL from task',
}
}
const origin = new URL(startUrl).origin
let envState: Record<string, unknown>
try {
envState = await this.fetchFinishState(origin, mcpEndpoint)
} catch (error) {
return {
score: 0,
pass: false,
reasoning: `Failed to fetch /finish endpoint: ${error instanceof Error ? error.message : String(error)}`,
details: { origin, error: true },
}
}
try {
const result = await this.runPythonEvaluator(
taskId,
envState,
input.finalAnswer || '',
)
return {
score: result.reward,
pass: result.pass,
reasoning:
result.message ||
(result.pass ? 'All criteria passed' : 'Some criteria failed'),
details: {
reward: result.reward,
per_criterion: result.per_criterion,
origin,
agisdk_task_id: taskId,
},
}
} catch (error) {
return {
score: 0,
pass: false,
reasoning: `Python evaluator error: ${error instanceof Error ? error.message : String(error)}`,
details: { error: true },
}
}
}
private extractTaskId(queryId: string): string {
return queryId.replace(/^agisdk-/, '')
}
private extractStartUrl(input: GraderInput): string | null {
// Derive from task_id: "dashdish-10" → "https://evals-dashdish.vercel.app"
// Task IDs are "{site}-{number}" where site may contain hyphens (e.g. "fly-unified-5")
const taskId = this.extractTaskId(input.task.query_id)
const siteId = taskId.replace(/-\d+$/, '')
if (siteId) return `https://evals-${siteId}.vercel.app`
// Fallback: search messages for vercel.app URLs
for (const msg of input.messages) {
const text =
msg.type === 'user'
? msg.content
: msg.type === 'tool-input-available'
? JSON.stringify(msg.input)
: ''
const urlMatch = text.match(/https?:\/\/[^\s"']+\.vercel\.app/)
if (urlMatch) return urlMatch[0]
}
return null
}
private async fetchFinishState(
origin: string,
mcpEndpoint: string,
): Promise<Record<string, unknown>> {
const finishUrl = `${origin}/finish`
// Navigate browser to /finish page (state diff is rendered client-side)
await callMcpTool(mcpEndpoint, 'navigate_page', {
url: finishUrl,
page: 1,
})
// Wait for the page to render, then extract JSON from <pre> element
const result = await callMcpTool(mcpEndpoint, 'evaluate_script', {
page: 1,
expression: `
new Promise((resolve, reject) => {
let attempts = 0;
const check = () => {
const pre = document.querySelector('pre');
if (pre && pre.textContent.trim().startsWith('{')) {
resolve(pre.textContent);
} else if (++attempts > 20) {
reject(new Error('Timed out waiting for <pre> JSON on /finish'));
} else {
setTimeout(check, 500);
}
};
check();
})
`,
})
const textContent = result.content?.find(
(c: { type: string }) => c.type === 'text',
)
if (!textContent?.text) {
throw new Error('No text content returned from /finish page')
}
return JSON.parse(textContent.text) as Record<string, unknown>
}
private runPythonEvaluator(
taskId: string,
envState: Record<string, unknown>,
modelResponse: string,
): Promise<{
reward: number
pass: boolean
message: string
per_criterion: unknown[]
}> {
return new Promise((resolve, reject) => {
const proc = spawn('python3', [EVAL_SCRIPT], {
stdio: ['pipe', 'pipe', 'pipe'],
})
const inputData = JSON.stringify({
task_id: taskId,
env_state: envState,
model_response: modelResponse,
})
let stdout = ''
let stderr = ''
proc.stdout.on('data', (data: Buffer) => {
stdout += data.toString()
})
proc.stderr.on('data', (data: Buffer) => {
stderr += data.toString()
})
proc.on('close', (code) => {
if (code !== 0) {
reject(
new Error(`Python evaluator exited with code ${code}: ${stderr}`),
)
return
}
try {
const result = JSON.parse(stdout.trim())
resolve(result)
} catch {
reject(new Error(`Failed to parse evaluator output: ${stdout}`))
}
})
proc.on('error', (err) => {
reject(new Error(`Failed to spawn Python evaluator: ${err.message}`))
})
proc.stdin.write(inputData)
proc.stdin.end()
})
}
}
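The subprocess boundary above is a plain JSON-over-stdio contract. A sketch of the shapes exchanged with `agisdk-evaluate.py`, as inferred from `runPythonEvaluator` (the interface names are illustrative; the sample values are placeholders):

```typescript
// Input written as one JSON object to the evaluator's stdin.
interface EvaluatorInput {
  task_id: string
  env_state: Record<string, unknown>
  model_response: string
}

// Output read back as one JSON object from stdout.
interface EvaluatorOutput {
  reward: number
  pass: boolean
  message: string
  per_criterion: unknown[]
}

const sampleInput: EvaluatorInput = {
  task_id: 'dashdish-10', // "agisdk-" prefix already stripped by extractTaskId
  env_state: {},
  model_response: '',
}

const sampleOutput: EvaluatorOutput = JSON.parse(
  '{"reward":1,"pass":true,"message":"All criteria passed","per_criterion":[]}',
)
```

Keeping the contract to single JSON documents on stdin and stdout means the grader never has to manage a long-lived Python process or a port.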


@@ -0,0 +1,134 @@
import { join, resolve } from 'node:path'
import type { GraderResult } from '../../types'
import type { Grader, GraderInput } from '../types'
interface InfinityEvalInput {
app_server_url: string
verifier_path: string
task_id: string
}
interface InfinityEvalOutput {
pass: boolean
reward: number
message: string
}
const EVAL_SCRIPT = resolve(
import.meta.dir,
'../../../scripts/infinity-evaluate.py',
)
export class InfinityStateGrader implements Grader {
name = 'infinity_state'
async grade(input: GraderInput): Promise<GraderResult> {
const parsed = this.parseQueryId(input.task.query_id)
if (!parsed) {
return {
score: 0,
pass: false,
reasoning: `Cannot parse query_id "${input.task.query_id}" — expected format: infinity-{app}-{task_id}`,
}
}
const appServerUrl = this.resolveAppServerUrl(input)
if (!appServerUrl) {
return {
score: 0,
pass: false,
reasoning: 'Cannot determine app server URL',
}
}
const infinityDir = process.env.WEBARENA_INFINITY_DIR
if (!infinityDir) {
return {
score: 0,
pass: false,
reasoning:
'WEBARENA_INFINITY_DIR env var not set. Point it to the webarena-infinity repo root.',
}
}
const verifierPath = join(
infinityDir,
'apps',
parsed.appName,
'real-tasks',
`${parsed.taskId}.py`,
)
const evalInput: InfinityEvalInput = {
app_server_url: appServerUrl,
verifier_path: verifierPath,
task_id: input.task.query_id,
}
try {
const result = await this.runPythonEvaluator(evalInput)
return {
score: result.pass ? 1 : 0,
pass: result.pass,
reasoning: result.message,
details: {
reward: result.reward,
app_name: parsed.appName,
app_server_url: appServerUrl,
},
}
} catch (error) {
return {
score: 0,
pass: false,
reasoning: `Evaluator process error: ${error instanceof Error ? error.message : String(error)}`,
}
}
}
private parseQueryId(
queryId: string,
): { appName: string; taskId: string } | null {
// Task IDs start with "task_", app names may contain hyphens
// e.g. "infinity-elation-prescriptions-task_h69"
const match = queryId.match(/^infinity-(.+)-(task_.+)$/)
if (!match) return null
return { appName: match[1], taskId: match[2] }
}
private resolveAppServerUrl(input: GraderInput): string | null {
// Passed directly from task executor (started by InfinityAppManager)
if (input.infinityAppUrl) return input.infinityAppUrl
// Fallback: env var for manual testing
if (process.env.INFINITY_APP_URL) return process.env.INFINITY_APP_URL
return null
}
private async runPythonEvaluator(
evalInput: InfinityEvalInput,
): Promise<InfinityEvalOutput> {
const proc = Bun.spawn(['python3', EVAL_SCRIPT], {
stdin: 'pipe',
stdout: 'pipe',
stderr: 'pipe',
})
const inputJson = JSON.stringify(evalInput)
proc.stdin.write(inputJson)
proc.stdin.end()
const stdout = await new Response(proc.stdout).text()
const stderr = await new Response(proc.stderr).text()
const exitCode = await proc.exited
if (exitCode !== 0) {
throw new Error(
`Python evaluator exited with code ${exitCode}: ${stderr || stdout}`,
)
}
return JSON.parse(stdout.trim()) as InfinityEvalOutput
}
}
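The query_id parsing above relies on regex backtracking to keep hyphenated app names intact. A quick standalone check of that behavior, using the same pattern as `parseQueryId` (the `splitQueryId` helper name is illustrative):

```typescript
// Greedy (.+) first consumes the whole tail, then backtracks until the
// "-(task_...)" suffix matches, so hyphens inside the app name stay in group 1.
const QUERY_ID_RE = /^infinity-(.+)-(task_.+)$/

function splitQueryId(queryId: string) {
  const match = queryId.match(QUERY_ID_RE)
  return match ? { appName: match[1], taskId: match[2] } : null
}

// splitQueryId('infinity-elation-prescriptions-task_h69')
// → { appName: 'elation-prescriptions', taskId: 'task_h69' }
```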


@@ -1,355 +0,0 @@
import { readFile } from 'node:fs/promises'
import { join } from 'node:path'
import OpenAI from 'openai'
import type { ChatCompletionContentPart } from 'openai/resources/chat/completions'
import { type GraderResult, isToolInputAvailable } from '../../types'
import type { Grader, GraderInput } from '../types'
/**
* Mind2Web WebJudge Grader - 3-step automatic evaluation
* Reference: https://github.com/OSU-NLP-Group/Online-Mind2Web/tree/main/src/methods
*
* Steps:
* 1. Key Point Identification - Extract critical requirements from task
* 2. Key Screenshot Identification - Score screenshots for relevance (1-5)
* 3. Outcome Judgment - Final success/failure determination
*/
// ============================================================================
// Prompts (Exact from Online-Mind2Web repository)
// ============================================================================
const STEP1_KEY_POINTS_SYSTEM = `You are an expert tasked with analyzing a given task to identify the key points explicitly stated in the task description.
**Objective**: Carefully analyze the task description and extract the critical elements explicitly mentioned in the task for achieving its goal.
**Instructions**:
1. Read the task description carefully.
2. Identify and extract **key points** directly stated in the task description.
- A **key point** is a critical element, condition, or step explicitly mentioned in the task description.
- Do not infer or add any unstated elements.
- Words such as "best," "highest," "cheapest," "latest," "most recent," "lowest," "closest," "highest-rated," "largest," and "newest" must go through the sort function(e.g., the key point should be "Filter by highest").
**Respond with**:
- **Key Points**: A numbered list of the explicit key points for completing this task, one per line, without explanations or additional details.`
const STEP2_IMAGE_SCORING_SYSTEM = `You are an expert evaluator tasked with determining whether an image contains information about the necessary steps to complete a task.
**Objective**: Analyze the provided image and decide if it shows essential steps or evidence required for completing the task. Use your reasoning to explain your decision before assigning a score.
**Instructions**:
1. Provide a detailed description of the image, including its contents, visible elements, text (if any), and any notable features.
2. Carefully examine the image and evaluate whether it contains necessary steps or evidence crucial to task completion:
- Identify key points that could be relevant to task completion, such as actions, progress indicators, tool usage, applied filters, or step-by-step instructions.
- Does the image show actions, progress indicators, or critical information directly related to completing the task?
- Is this information indispensable for understanding or ensuring task success?
- If the image contains partial but relevant information, consider its usefulness rather than dismissing it outright.
3. Provide your response in the following format:
- **Reasoning**: Explain your thought process and observations. Mention specific elements in the image that indicate necessary steps, evidence, or lack thereof.
- **Score**: Assign a score based on the reasoning, using the following scale:
- **1**: The image does not contain any necessary steps or relevant information.
- **2**: The image contains minimal or ambiguous information, unlikely to be essential.
- **3**: The image includes some relevant steps or hints but lacks clarity or completeness.
- **4**: The image contains important steps or evidence that are highly relevant but not fully comprehensive.
- **5**: The image clearly displays necessary steps or evidence crucial for completing the task.
Respond with:
1. **Reasoning**: [Your explanation]
2. **Score**: [1-5]`
const STEP3_OUTCOME_SYSTEM = `You are an expert in evaluating the performance of a web navigation agent. The agent is designed to help a human user navigate a website to complete a task. Given the user's task, the agent's action history, key points for task completion, some potentially important web pages in the agent's trajectory and their reasons, your goal is to determine whether the agent has completed the task and achieved all requirements.
Your response must strictly follow the following evaluation criteria!
*Important Evaluation Criteria*:
1: The filtered results must be displayed correctly. If filters were not properly applied (i.e., missing selection, missing confirmation, or no visible effect in results), the task is not considered successful.
2: You must carefully check whether these snapshots and action history meet these key points. Ensure that specific filter conditions, such as "best," "highest," "cheapest," "latest," "most recent," "lowest," "closest," "highest-rated," "largest," and "newest" are correctly applied using the filter function (e.g., sort function).
3: Certain key points or requirements must be applied via the filter. Otherwise, merely entering all requirements into a search will be deemed a failure, since that cannot guarantee all results meet the requirements!
4: If the task requires filtering by a specific range of money, years, or the number of beds and bathrooms, the applied filter must exactly match the given requirement. Any deviation results in failure. To ensure the task is successful, the applied filter must precisely match the specified range without being too broad or too narrow.
Examples of Failure Cases:
- If the requirement is less than $50, but the applied filter is less than $25, it is a failure.
- If the requirement is $1500-$2500, but the applied filter is $2000-$2500, it is a failure.
- If the requirement is $25-$200, but the applied filter is $0-$200, it is a failure.
- If the required years are 2004-2012, but the filter applied is 2001-2012, it is a failure.
- If the required years are before 2015, but the applied filter is 2000-2014, it is a failure.
- If the task requires exactly 2 beds, but the filter applied is 2+ beds, it is a failure.
5: Some tasks require a submission action or a display of results to be considered successful.
6: If the retrieved information is invalid or empty (e.g., No match was found), but the agent has correctly performed the required action, it should still be considered successful.
7: If the current page already displays all available items, then applying a filter is not necessary. As long as the agent selects items that meet the requirements (e.g., the cheapest or lowest price), the task is still considered successful.
*IMPORTANT*
Format your response into two lines as shown below:
Thoughts: <your thoughts and reasoning process based on double-checking each key point and the evaluation criteria>
Status: "success" or "failure"`
// ============================================================================
// Mind2Web WebJudge Grader Implementation
// ============================================================================
export class Mind2WebJudgeGrader implements Grader {
name = 'mind2web_judge'
private client: OpenAI
private model: string
private scoreThreshold = 3
private maxImages = 50
constructor(apiKey: string, baseURL?: string, model?: string) {
this.client = new OpenAI({
apiKey,
baseURL: baseURL || undefined,
})
this.model = model || 'gpt-4o'
}
async grade(input: GraderInput): Promise<GraderResult> {
try {
// Step 1: Identify key points from task
const keyPoints = await this.identifyKeyPoints(input.task.query)
// Step 2: Score screenshots and filter relevant ones
const screenshotResults = await this.scoreScreenshots(
input.task.query,
keyPoints,
input.outputDir,
input.screenshotCount,
)
// Step 3: Final outcome judgment
const actionHistory = this.extractActionHistory(input.messages)
const outcome = await this.judgeOutcome(
input.task.query,
keyPoints,
actionHistory,
screenshotResults.relevantImages,
screenshotResults.thoughts,
)
return {
score: outcome.success ? 1 : 0,
pass: outcome.success,
reasoning: outcome.reasoning,
details: {
keyPoints,
screenshotsEvaluated: screenshotResults.totalEvaluated,
screenshotsRelevant: screenshotResults.relevantImages.length,
model: this.model,
},
}
} catch (error) {
return {
score: 0,
pass: false,
reasoning: `Grader error: ${error instanceof Error ? error.message : String(error)}`,
details: { error: true },
}
}
}
/**
* Step 1: Key Point Identification
*/
private async identifyKeyPoints(task: string): Promise<string> {
const response = await this.client.chat.completions.create({
model: this.model,
temperature: 0,
messages: [
{ role: 'system', content: STEP1_KEY_POINTS_SYSTEM },
{ role: 'user', content: `Task: ${task}` },
],
max_tokens: 512,
})
const content = response.choices[0]?.message?.content || ''
// Extract key points section
if (content.includes('**Key Points**:')) {
return content.split('**Key Points**:')[1].trim()
}
if (content.includes('Key Points:')) {
return content.split('Key Points:')[1].trim()
}
return content
}
/**
* Step 2: Key Screenshot Identification
*/
private async scoreScreenshots(
task: string,
keyPoints: string,
outputDir: string,
screenshotCount: number,
): Promise<{
relevantImages: { data: string; score: number }[]
thoughts: string[]
totalEvaluated: number
}> {
const relevantImages: { data: string; score: number }[] = []
const thoughts: string[] = []
let totalEvaluated = 0
// Evaluate each screenshot
for (let i = 1; i <= screenshotCount; i++) {
try {
const filepath = join(outputDir, 'screenshots', `${i}.png`)
const buffer = await readFile(filepath)
const base64 = buffer.toString('base64')
const imageUrl = `data:image/png;base64,${base64}`
totalEvaluated++
// Score this image
const response = await this.client.chat.completions.create({
model: this.model,
temperature: 0,
messages: [
{ role: 'system', content: STEP2_IMAGE_SCORING_SYSTEM },
{
role: 'user',
content: [
{
type: 'text',
text: `**Task**: ${task}\n\n**Key Points for Task Completion**: ${keyPoints}\n\nThe snapshot of the web page is shown in the image.`,
},
{
type: 'image_url',
image_url: { url: imageUrl, detail: 'high' },
},
],
},
],
max_tokens: 512,
})
const content = response.choices[0]?.message?.content || ''
// Extract score
const scoreMatch = content.match(/Score[:\s]*\**\s*([1-5])/i)
const score = scoreMatch ? parseInt(scoreMatch[1], 10) : 1
// Extract reasoning/thought
const thoughtMatch = content.match(
/\*\*Reasoning\*\*:?\s*([\s\S]*?)(?=\n\n|\*\*Score|$)/i,
)
const thought = thoughtMatch
? thoughtMatch[1].trim().replace(/\n/g, ' ')
: content.split('\n')[0]
// Keep if above threshold
if (score >= this.scoreThreshold) {
relevantImages.push({ data: imageUrl, score })
thoughts.push(`Screenshot ${i} (score ${score}): ${thought}`)
}
} catch {
// Skip missing files
}
}
// Limit to max images
if (relevantImages.length > this.maxImages) {
relevantImages.splice(0, relevantImages.length - this.maxImages)
thoughts.splice(0, thoughts.length - this.maxImages)
}
return { relevantImages, thoughts, totalEvaluated }
}
/**
* Step 3: Outcome Judgment
*/
private async judgeOutcome(
task: string,
keyPoints: string,
actionHistory: string[],
relevantImages: { data: string; score: number }[],
thoughts: string[],
): Promise<{ success: boolean; reasoning: string }> {
// Format action history
const actionsFormatted = actionHistory
.map((action, i) => `${i + 1}. ${action}`)
.join('\n')
// Format thoughts
const thoughtsFormatted = thoughts
.map((thought, i) => `${i + 1}. ${thought}`)
.join('\n')
// Build message content
const messageContent: ChatCompletionContentPart[] = []
if (relevantImages.length > 0) {
messageContent.push({
type: 'text',
text: `User Task: ${task}
Key Points: ${keyPoints}
Action History:
${actionsFormatted || 'No actions recorded'}
The potentially important snapshots of the webpage in the agent's trajectory and their reasons:
${thoughtsFormatted || 'No relevant screenshots identified'}`,
})
// Add images
for (const img of relevantImages) {
messageContent.push({
type: 'image_url',
image_url: { url: img.data, detail: 'high' as const },
})
}
} else {
// No images - text only
messageContent.push({
type: 'text',
text: `User Task: ${task}
Key Points: ${keyPoints}
Action History:
${actionsFormatted || 'No actions recorded'}`,
})
}
const response = await this.client.chat.completions.create({
model: this.model,
temperature: 0,
messages: [
{ role: 'system', content: STEP3_OUTCOME_SYSTEM },
{ role: 'user', content: messageContent },
],
max_tokens: 1000,
})
const content = response.choices[0]?.message?.content || ''
const statusMatch = content.match(/Status:\s*"?(success|failure)"?/i)
const isSuccess = statusMatch
? statusMatch[1].toLowerCase() === 'success'
: false
return {
success: isSuccess,
reasoning: content,
}
}
/**
* Extract action history from messages
*/
private extractActionHistory(messages: GraderInput['messages']): string[] {
const actions: string[] = []
for (const msg of messages) {
if (isToolInputAvailable(msg)) {
const params = JSON.stringify(msg.input as Record<string, unknown>)
actions.push(`${msg.toolName}(${params})`)
}
}
return actions
}
}
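The score extraction in `scoreScreenshots` leans on a single regex that tolerates optional markdown bolding around the label and defaults to 1 when no score is found. A standalone sketch of that parsing; the sample responses below are illustrative, not captured model output:

```typescript
// Parse a 1-5 score from a judge response, mirroring the regex used in
// scoreScreenshots. Unparseable responses fall back to the minimum score.
function parseImageScore(content: string): number {
  const m = content.match(/Score[:\s]*\**\s*([1-5])/i)
  return m ? parseInt(m[1], 10) : 1
}
```

Defaulting to 1 (rather than throwing) means a malformed judge reply simply fails the `scoreThreshold` check and the screenshot is dropped.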


@@ -1,138 +0,0 @@
import { readFile } from 'node:fs/promises'
import { join } from 'node:path'
import OpenAI from 'openai'
import type { GraderResult } from '../../types'
import type { Grader, GraderInput } from '../types'
/**
* WebVoyager Grader - Exact implementation based on original WebVoyager auto_eval.py
* Reference: https://github.com/MinorJerry/WebVoyager/blob/main/evaluation/auto_eval.py
*
* Uses GPT-4V to evaluate task completion by analyzing screenshots and final response.
*/
const WEBVOYAGER_SYSTEM_PROMPT = `As an evaluator, you will be presented with three primary components to assist you in your role:
1. Web Task Instruction: This is a clear and specific directive provided in natural language, detailing the online activity to be carried out. These requirements may include conducting searches, verifying information, comparing prices, checking availability, or any other action relevant to the specified web service (such as Amazon, Apple, ArXiv, BBC News, Booking, etc.).
2. Result Screenshots: This is a visual representation of the screen showing the result or intermediate state of performing a web task. It serves as visual proof of the actions taken in response to the instruction.
3. Result Response: This is a textual response obtained after the execution of the web task. It serves as a textual result in response to the instruction.
-- You DO NOT NEED to interact with web pages or perform actions such as booking flights or conducting searches on websites.
-- You SHOULD NOT make assumptions based on information not presented in the screenshot when comparing it to the instructions.
-- Your primary responsibility is to conduct a thorough assessment of the web task instruction against the outcome depicted in the screenshot and in the response, evaluating whether the actions taken align with the given instructions.
-- NOTE that the instruction may involve more than one task, for example, locating the garage and summarizing the review. Failing to complete either task, such as not providing a summary, should be considered unsuccessful.
-- NOTE that the screenshot is authentic, but the response provided by LLM is generated at the end of web browsing, and there may be discrepancies between the text and the screenshots.
-- Note the difference: 1) If the result response contradicts the screenshot, the content of the screenshot prevails; 2) if the result response contains content not mentioned on the screenshot, choose to believe that content.
You should elaborate on how you arrived at your final evaluation and then provide a definitive verdict on whether the task has been successfully accomplished, either as 'SUCCESS' or 'NOT SUCCESS'.`
export class WebVoyagerGrader implements Grader {
name = 'webvoyager_grader'
private client: OpenAI
private maxScreenshots = 15
private model: string
constructor(apiKey: string, baseURL?: string, model?: string) {
this.client = new OpenAI({
apiKey,
baseURL: baseURL || undefined,
})
this.model = model || 'gpt-4o'
}
async grade(input: GraderInput): Promise<GraderResult> {
// Load screenshots (last N screenshots)
const startNum = Math.max(
1,
input.screenshotCount - this.maxScreenshots + 1,
)
const endNum = input.screenshotCount
const images: { type: 'image_url'; image_url: { url: string } }[] = []
const loadedScreenshots: number[] = []
for (let i = startNum; i <= endNum; i++) {
try {
const filepath = join(input.outputDir, 'screenshots', `${i}.png`)
const buffer = await readFile(filepath)
const base64 = buffer.toString('base64')
images.push({
type: 'image_url',
image_url: { url: `data:image/png;base64,${base64}` },
})
loadedScreenshots.push(i)
} catch {
// Skip missing files
}
}
if (images.length === 0) {
return {
score: 0,
pass: false,
reasoning: 'No screenshots available for evaluation',
}
}
// Build user prompt (matching original WebVoyager format)
const userPrompt = `TASK: ${input.task.query}
Result Response: ${input.finalAnswer || '[No response provided]'}
${images.length} screenshots at the end:`
try {
const response = await this.client.chat.completions.create({
model: this.model,
temperature: 0,
seed: 42,
messages: [
{ role: 'system', content: WEBVOYAGER_SYSTEM_PROMPT },
{
role: 'user',
content: [
{ type: 'text', text: userPrompt },
...images,
{ type: 'text', text: 'Your verdict:\n' },
],
},
],
max_tokens: 1000,
})
const content = response.choices[0]?.message?.content || ''
// Parse verdict (matching original logic)
// "NOT SUCCESS" must be checked first as it contains "SUCCESS"
let isSuccess: boolean
if (content.toUpperCase().includes('NOT SUCCESS')) {
isSuccess = false
} else if (content.toUpperCase().includes('SUCCESS')) {
isSuccess = true
} else {
// Ambiguous response - default to failure
isSuccess = false
}
return {
score: isSuccess ? 1 : 0,
pass: isSuccess,
reasoning: content,
details: {
screenshotsEvaluated: images.length,
screenshotRange: `${loadedScreenshots[0]}-${loadedScreenshots[loadedScreenshots.length - 1]}`,
model: this.model,
promptTokens: response.usage?.prompt_tokens,
completionTokens: response.usage?.completion_tokens,
},
}
} catch (error) {
return {
score: 0,
pass: false,
reasoning: `Grader error: ${error instanceof Error ? error.message : String(error)}`,
details: { error: true },
}
}
}
}
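The verdict parsing above hinges on ordering: "NOT SUCCESS" contains "SUCCESS" as a substring, so it must be checked first. That logic in isolation (sample strings are illustrative):

```typescript
// "NOT SUCCESS" must be checked before "SUCCESS" because the former
// contains the latter; ambiguous output defaults to failure, as in the
// grader above.
function parseWebVoyagerVerdict(content: string): boolean {
  const upper = content.toUpperCase()
  if (upper.includes('NOT SUCCESS')) return false
  return upper.includes('SUCCESS')
}
```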


@@ -1,234 +0,0 @@
import OpenAI from 'openai'
import {
countToolCalls,
type GraderResult,
isToolInputAvailable,
} from '../../types'
import type { Grader, GraderInput } from '../types'
/**
* Fara Alignment Verifier
*
* Based on the Fara paper (Microsoft Research, 2024):
* "A text-only verifier designed to judge whether the actions taken and final
* response of a trajectory aligns with the given task. The purpose of this
* verifier is to give a high-level judgement of whether the trajectory likely
* satisfies the intent of the task."
*
* For transactional tasks: verifies whether the trajectory correctly identified
* target URLs matching requested products/services.
*
* For information-seeking tasks: checks whether the response correctly answers
* the input question.
*/
const ALIGNMENT_SYSTEM_PROMPT = `You are an expert evaluator verifying if a web agent's trajectory aligns with the given task intent.
Your role is to provide a high-level judgment of whether the agent's actions and final response satisfy the intent of the task.
**Evaluation Criteria:**
1. **Task Intent Alignment**: Do the actions taken directly address what the task is asking for?
2. **Action Relevance**: Were the actions purposeful and directed toward completing the task?
- Did the agent navigate to relevant pages?
- Did it interact with appropriate elements (buttons, forms, links)?
- Were there unnecessary detours or irrelevant actions?
3. **Response Accuracy** (for information-seeking tasks):
- Does the final response correctly answer the question asked?
- Is the information retrieved from the correct source?
4. **Target Completion** (for transactional tasks):
- Did the agent reach the correct destination (product page, search results, etc.)?
- Were the correct parameters/filters applied?
**Output Format:**
Provide your analysis, then conclude with a clear verdict.
VERDICT: PASS or FAIL
REASONING: <One sentence summary of your decision>`
export class FaraAlignmentGrader implements Grader {
name = 'fara_alignment'
private client: OpenAI
private model: string
private maxRetries = 3
private retryDelayMs = 1000
constructor(apiKey: string, baseUrl?: string, model?: string) {
this.client = new OpenAI({
apiKey,
baseURL: baseUrl || undefined,
})
this.model = model || 'gpt-4o-mini'
}
async grade(input: GraderInput): Promise<GraderResult> {
const actionSequence = this.extractActionSequence(input)
const taskType = this.classifyTaskType(input.task.query)
const userPrompt = `**Task:** ${input.task.query}
**Task Type:** ${taskType}
**Action Sequence:**
${actionSequence || 'No actions taken'}
**Final Response:** ${input.finalAnswer || '[No response provided]'}
Evaluate whether this trajectory aligns with the task intent and provide your verdict.`
try {
const response = await this.callWithRetry(userPrompt)
const content = response.choices[0]?.message?.content || ''
const isPass = this.parseVerdict(content)
return {
score: isPass ? 1 : 0,
pass: isPass,
reasoning: content,
details: {
verifier: 'alignment',
taskType,
actionCount: countToolCalls(input.messages),
model: this.model,
promptTokens: response.usage?.prompt_tokens,
completionTokens: response.usage?.completion_tokens,
},
}
} catch (error) {
return {
score: 0,
pass: false,
reasoning: `Alignment verifier error: ${error instanceof Error ? error.message : String(error)}`,
details: { error: true, verifier: 'alignment' },
}
}
}
private extractActionSequence(input: GraderInput): string {
const actions: string[] = []
let stepNum = 1
for (const msg of input.messages) {
if (isToolInputAvailable(msg)) {
const paramsStr = this.formatParams(
msg.input as Record<string, unknown>,
)
actions.push(`${stepNum}. ${msg.toolName}(${paramsStr})`)
stepNum++
}
}
return actions.join('\n')
}
private formatParams(params: Record<string, unknown>): string {
const entries = Object.entries(params)
if (entries.length === 0) return ''
return entries
.map(([key, value]) => {
const strValue =
typeof value === 'string'
? `"${value.substring(0, 100)}${value.length > 100 ? '...' : ''}"`
: JSON.stringify(value)
return `${key}=${strValue}`
})
.join(', ')
}
private classifyTaskType(query: string): string {
const lowerQuery = query.toLowerCase()
const infoKeywords = [
'find',
'search',
'look up',
'what is',
'how to',
'tell me',
'show me',
'get information',
'check',
'verify',
'confirm',
'list',
'summarize',
'review',
]
const transactionalKeywords = [
'buy',
'purchase',
'add to cart',
'book',
'reserve',
'order',
'subscribe',
'sign up',
'register',
'download',
'submit',
'apply',
]
for (const keyword of transactionalKeywords) {
if (lowerQuery.includes(keyword)) {
return 'transactional'
}
}
for (const keyword of infoKeywords) {
if (lowerQuery.includes(keyword)) {
return 'information-seeking'
}
}
return 'general'
}
private parseVerdict(content: string): boolean {
const upperContent = content.toUpperCase()
if (upperContent.includes('VERDICT: PASS')) {
return true
}
if (upperContent.includes('VERDICT: FAIL')) {
return false
}
if (upperContent.includes('VERDICT:')) {
const verdictMatch = upperContent.match(/VERDICT:\s*(PASS|FAIL)/)
if (verdictMatch) {
return verdictMatch[1] === 'PASS'
}
}
return false
}
private async callWithRetry(
userPrompt: string,
attempt = 1,
): Promise<OpenAI.Chat.Completions.ChatCompletion> {
try {
return await this.client.chat.completions.create({
model: this.model,
temperature: 0,
messages: [
{ role: 'system', content: ALIGNMENT_SYSTEM_PROMPT },
{ role: 'user', content: userPrompt },
],
max_tokens: 1000,
})
} catch (error) {
if (attempt < this.maxRetries) {
const delay = this.retryDelayMs * 2 ** (attempt - 1)
await new Promise((resolve) => setTimeout(resolve, delay))
return this.callWithRetry(userPrompt, attempt + 1)
}
throw error
}
}
}
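`callWithRetry` backs off exponentially: with the defaults (`maxRetries = 3`, `retryDelayMs = 1000`) a failing call waits 1s, then 2s, before the final attempt. The schedule by itself:

```typescript
// Delay before retrying attempt `attempt` (1-based): baseMs * 2^(attempt - 1),
// matching the computation inside callWithRetry.
function retryDelay(baseMs: number, attempt: number): number {
  return baseMs * 2 ** (attempt - 1)
}
```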


@@ -1,284 +0,0 @@
import type { GraderResult } from '../../types'
import type { Grader, GraderInput } from '../types'
import { FaraAlignmentGrader } from './alignment'
import { FaraMultimodalGrader } from './multimodal'
import { FaraRubricGrader } from './rubric'
/**
* Fara Combined Verifier (3-Verifier System)
*
* Based on the Fara paper (Microsoft Research, 2024):
* "Before using any tasks for training, three verifier agents evaluate if a task
* was 'successful': The Alignment Verifier checks if the trajectory of actions
* match the task's intent; the Rubric Verifier defines completion criteria and
* scores the trajectory against them; and the Multimodal Verifier reviews screenshots
* and responses to confirm visual evidence supports successful completion."
*
* Decision Strategy: Majority Voting
* - All three verifiers run independently
* - A trajectory passes if at least 2 of 3 verifiers pass
* - Combined score is the average of individual scores
* - Detailed breakdown of each verifier's decision is provided
*
* This combined approach addresses different failure modes:
* - Alignment: catches trajectories that wander off-task
* - Rubric: catches partial completions via granular scoring
* - Multimodal: catches hallucinations via visual evidence verification
*/
interface VerifierResult {
name: string
pass: boolean
score: number
reasoning: string
details?: Record<string, unknown>
}
export class FaraCombinedGrader implements Grader {
name = 'fara_combined'
private alignmentGrader: FaraAlignmentGrader
private rubricGrader: FaraRubricGrader
private multimodalGrader: FaraMultimodalGrader
private runInParallel: boolean
constructor(
apiKey: string,
baseUrl?: string,
model?: string,
options?: { parallel?: boolean },
) {
this.alignmentGrader = new FaraAlignmentGrader(
apiKey,
baseUrl,
model || 'gpt-4o-mini',
)
this.rubricGrader = new FaraRubricGrader(
apiKey,
baseUrl,
model || 'gpt-4o-mini',
)
this.multimodalGrader = new FaraMultimodalGrader(
apiKey,
baseUrl,
model || 'gpt-4o',
)
this.runInParallel = options?.parallel ?? true
}
async grade(input: GraderInput): Promise<GraderResult> {
try {
const verifierResults: VerifierResult[] = []
if (this.runInParallel) {
// Run all verifiers in parallel for speed
const [alignmentResult, rubricResult, multimodalResult] =
await Promise.all([
this.runVerifier('alignment', () =>
this.alignmentGrader.grade(input),
),
this.runVerifier('rubric', () => this.rubricGrader.grade(input)),
this.runVerifier('multimodal', () =>
this.multimodalGrader.grade(input),
),
])
verifierResults.push(alignmentResult, rubricResult, multimodalResult)
} else {
// Run sequentially (useful for debugging or rate limiting)
verifierResults.push(
await this.runVerifier('alignment', () =>
this.alignmentGrader.grade(input),
),
)
verifierResults.push(
await this.runVerifier('rubric', () =>
this.rubricGrader.grade(input),
),
)
verifierResults.push(
await this.runVerifier('multimodal', () =>
this.multimodalGrader.grade(input),
),
)
}
// Majority voting: pass if at least 2 of 3 verifiers pass
const passCount = verifierResults.filter((r) => r.pass).length
const majorityPass = passCount >= 2
// Combined score: average of individual scores
const averageScore =
verifierResults.reduce((sum, r) => sum + r.score, 0) /
verifierResults.length
// Build combined reasoning
const combinedReasoning = this.formatCombinedReasoning(
verifierResults,
majorityPass,
passCount,
)
return {
score: averageScore,
pass: majorityPass,
reasoning: combinedReasoning,
details: {
verifier: 'combined',
votingResult: {
passCount,
totalVerifiers: 3,
majorityThreshold: 2,
decision: majorityPass ? 'PASS' : 'FAIL',
},
verifiers: {
alignment: {
pass: verifierResults[0].pass,
score: verifierResults[0].score,
details: verifierResults[0].details,
},
rubric: {
pass: verifierResults[1].pass,
score: verifierResults[1].score,
details: verifierResults[1].details,
},
multimodal: {
pass: verifierResults[2].pass,
score: verifierResults[2].score,
details: verifierResults[2].details,
},
},
},
}
} catch (error) {
return {
score: 0,
pass: false,
reasoning: `Combined verifier error: ${error instanceof Error ? error.message : String(error)}`,
details: { error: true, verifier: 'combined' },
}
}
}
private async runVerifier(
name: string,
graderFn: () => Promise<GraderResult>,
): Promise<VerifierResult> {
try {
const result = await graderFn()
return {
name,
pass: result.pass,
score: result.score,
reasoning: result.reasoning,
details: result.details,
}
} catch (error) {
return {
name,
pass: false,
score: 0,
reasoning: `${name} verifier error: ${error instanceof Error ? error.message : String(error)}`,
details: { error: true },
}
}
}
private formatCombinedReasoning(
results: VerifierResult[],
majorityPass: boolean,
passCount: number,
): string {
const lines: string[] = []
lines.push('# Fara 3-Verifier Combined Evaluation\n')
lines.push(
`**Final Decision:** ${majorityPass ? 'PASS' : 'FAIL'} (${passCount}/3 verifiers passed)`,
)
lines.push(`**Majority Threshold:** 2/3 verifiers must pass\n`)
lines.push('---\n')
// Alignment Verifier Summary
const alignment = results[0]
lines.push(`## 1. Alignment Verifier: ${alignment.pass ? 'PASS' : 'FAIL'}`)
lines.push(`Score: ${alignment.score}`)
lines.push(`${this.truncateReasoning(alignment.reasoning, 500)}\n`)
// Rubric Verifier Summary
const rubric = results[1]
lines.push(`## 2. Rubric Verifier: ${rubric.pass ? 'PASS' : 'FAIL'}`)
lines.push(`Score: ${(rubric.score * 100).toFixed(1)}%`)
if (rubric.details && 'percentage' in rubric.details) {
lines.push(
`Rubric Score: ${rubric.details.percentage}% (threshold: ${rubric.details.threshold}%)`,
)
}
lines.push(`${this.truncateReasoning(rubric.reasoning, 500)}\n`)
// Multimodal Verifier Summary
const multimodal = results[2]
lines.push(
`## 3. Multimodal Verifier: ${multimodal.pass ? 'PASS' : 'FAIL'}`,
)
lines.push(`Score: ${multimodal.score}`)
if (multimodal.details) {
if ('responseConsistent' in multimodal.details) {
lines.push(
`Response Consistent: ${multimodal.details.responseConsistent ? 'Yes' : 'No'}`,
)
}
if ('taskSatisfied' in multimodal.details) {
lines.push(
`Task Satisfied: ${multimodal.details.taskSatisfied ? 'Yes' : 'No'}`,
)
}
if ('relevantScreenshots' in multimodal.details) {
lines.push(
`Screenshots Analyzed: ${multimodal.details.relevantScreenshots}/${multimodal.details.totalScreenshots}`,
)
}
}
lines.push(`${this.truncateReasoning(multimodal.reasoning, 500)}\n`)
lines.push('---\n')
lines.push('**Voting Summary:**')
lines.push(`- Alignment: ${alignment.pass ? 'YES' : 'NO'}`)
lines.push(`- Rubric: ${rubric.pass ? 'YES' : 'NO'}`)
lines.push(`- Multimodal: ${multimodal.pass ? 'YES' : 'NO'}`)
lines.push(
`- **Result: ${majorityPass ? 'MAJORITY PASS' : 'MAJORITY FAIL'}**`,
)
return lines.join('\n')
}
private truncateReasoning(reasoning: string, maxLength: number): string {
if (reasoning.length <= maxLength) {
return reasoning
}
return `${reasoning.substring(0, maxLength)}...`
}
}
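The decision rule in `FaraCombinedGrader.grade` reduces to a 2-of-3 majority vote over the verifier passes, with the combined score being a plain average. A minimal sketch of that aggregation:

```typescript
// 2-of-3 majority vote with an averaged score, mirroring the aggregation
// in FaraCombinedGrader.grade above.
function majorityVote(
  results: { pass: boolean; score: number }[],
): { pass: boolean; score: number } {
  const passCount = results.filter((r) => r.pass).length
  const score = results.reduce((sum, r) => sum + r.score, 0) / results.length
  return { pass: passCount >= 2, score }
}
```

Averaging the scores (rather than taking min or max) keeps a single dissenting verifier visible in the combined score even when the majority passes.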
/**
* Factory function to create Fara graders
*/
export function createFaraGrader(
type: 'alignment' | 'rubric' | 'multimodal' | 'combined',
apiKey: string,
baseUrl?: string,
model?: string,
): Grader {
switch (type) {
case 'alignment':
return new FaraAlignmentGrader(apiKey, baseUrl, model)
case 'rubric':
return new FaraRubricGrader(apiKey, baseUrl, model)
case 'multimodal':
return new FaraMultimodalGrader(apiKey, baseUrl, model)
case 'combined':
return new FaraCombinedGrader(apiKey, baseUrl, model)
default:
throw new Error(`Unknown Fara grader type: ${type}`)
}
}


@@ -1,449 +0,0 @@
import { readFile } from 'node:fs/promises'
import { join } from 'node:path'
import OpenAI from 'openai'
import type { ChatCompletionContentPart } from 'openai/resources/chat/completions'
import type { GraderResult } from '../../types'
import type { Grader, GraderInput } from '../types'
/**
* Fara Multimodal Verifier
*
* Based on the Fara paper (Microsoft Research, 2024):
* "This verifier inspects the screenshots and final response of the trajectory
* to check whether the task was successfully completed. The verifier first selects
* the most relevant screenshots from the trajectory based on the task ranked by
* how informative they are."
*
* Two-phase evaluation:
* 1. Select most relevant screenshots based on task relevance
* 2. Judge:
* a) Whether the final response is fully consistent with screenshot evidence
* b) Whether the content in screenshots appears to satisfy the task
*
* "The Multimodal Verifier is especially important for combating hallucinations."
*/
const SCREENSHOT_SELECTION_PROMPT = `You are an expert evaluator selecting the most relevant screenshots from a web agent's trajectory.
**Instructions:**
1. You will see multiple screenshots from an agent's web navigation
2. Score each screenshot from 1-5 based on relevance to the task:
- 1: Not relevant at all
- 2: Minimal relevance
- 3: Somewhat relevant
- 4: Highly relevant
- 5: Critical/essential for verifying task completion
**Output Format:**
Return a JSON object:
{
"scores": [
{"index": <1-based index>, "score": <1-5>, "reason": "Brief reason"}
]
}`
const MULTIMODAL_VERIFICATION_PROMPT = `You are an expert evaluator verifying web agent task completion using visual evidence.
**Your role is to verify two critical aspects:**
1. **Response-Screenshot Consistency**: Is the agent's final response fully consistent with what is shown in the screenshots?
- Does the response accurately describe information visible in screenshots?
- Are there any claims in the response not supported by visual evidence?
- Look for hallucinations - information the agent claims but cannot be verified
2. **Task Completion Evidence**: Do the screenshots show evidence that the task was successfully completed?
- Can you see the target page, information, or action result?
- Is there visual confirmation of the requested action/information?
- For search tasks: are correct search results visible?
- For navigation tasks: did the agent reach the target page?
- For information tasks: is the answer visible on screen?
**Important:** The Multimodal Verifier is especially important for combating hallucinations. Be skeptical of claims not supported by visual evidence.
**Output Format:**
Provide your analysis, then conclude with:
RESPONSE_CONSISTENT: YES or NO
TASK_SATISFIED: YES or NO
VERDICT: PASS or FAIL
REASONING: <One sentence summary>`
interface ScreenshotScore {
index: number
score: number
reason: string
}
export class FaraMultimodalGrader implements Grader {
name = 'fara_multimodal'
private client: OpenAI
private model: string
private relevanceThreshold = 3
private maxSelectedScreenshots = 10
private maxEvaluationScreenshots = 30
private maxRetries = 3
private retryDelayMs = 1000
constructor(apiKey: string, baseUrl?: string, model?: string) {
this.client = new OpenAI({
apiKey,
baseURL: baseUrl || undefined,
})
this.model = model || 'gpt-4o'
}
async grade(input: GraderInput): Promise<GraderResult> {
try {
// Load available screenshots
const allScreenshots = await this.loadScreenshots(
input.outputDir,
input.screenshotCount,
)
if (allScreenshots.length === 0) {
return {
score: 0,
pass: false,
reasoning: 'No screenshots available for multimodal verification',
details: { verifier: 'multimodal', error: 'no_screenshots' },
}
}
// Step 1: Select most relevant screenshots
const selectedScreenshots = await this.selectRelevantScreenshots(
input.task.query,
allScreenshots,
)
if (selectedScreenshots.length === 0) {
return {
score: 0,
pass: false,
reasoning:
'No relevant screenshots found for verification. All screenshots scored below relevance threshold.',
details: {
verifier: 'multimodal',
totalScreenshots: allScreenshots.length,
relevantScreenshots: 0,
threshold: this.relevanceThreshold,
},
}
}
// Step 2: Verify task completion with selected screenshots
const verification = await this.verifyWithScreenshots(
input.task.query,
input.finalAnswer,
selectedScreenshots,
)
const isPass =
verification.responseConsistent && verification.taskSatisfied
return {
score: isPass ? 1 : 0,
pass: isPass,
reasoning: verification.fullReasoning,
details: {
verifier: 'multimodal',
totalScreenshots: allScreenshots.length,
relevantScreenshots: selectedScreenshots.length,
selectedIndices: selectedScreenshots.map((s) => s.index),
responseConsistent: verification.responseConsistent,
taskSatisfied: verification.taskSatisfied,
model: this.model,
},
}
} catch (error) {
return {
score: 0,
pass: false,
reasoning: `Multimodal verifier error: ${error instanceof Error ? error.message : String(error)}`,
details: { error: true, verifier: 'multimodal' },
}
}
}
private async loadScreenshots(
outputDir: string,
screenshotCount: number,
): Promise<{ index: number; data: string }[]> {
const screenshots: { index: number; data: string }[] = []
// Sample screenshots if too many
const indices: number[] = []
if (screenshotCount <= this.maxEvaluationScreenshots) {
for (let i = 1; i <= screenshotCount; i++) {
indices.push(i)
}
} else {
// Sample evenly across the trajectory, always include first, last, and recent
const step = Math.floor(screenshotCount / this.maxEvaluationScreenshots)
for (let i = 1; i <= screenshotCount; i += step) {
indices.push(i)
}
// Always include the last few screenshots (most likely to show completion)
for (let i = screenshotCount - 4; i <= screenshotCount; i++) {
if (i > 0 && !indices.includes(i)) {
indices.push(i)
}
}
indices.sort((a, b) => a - b)
}
for (const i of indices) {
try {
const filepath = join(outputDir, 'screenshots', `${i}.png`)
const buffer = await readFile(filepath)
const base64 = buffer.toString('base64')
screenshots.push({
index: i,
data: `data:image/png;base64,${base64}`,
})
} catch {
// Skip missing files
}
}
return screenshots
}
private async selectRelevantScreenshots(
task: string,
screenshots: { index: number; data: string }[],
): Promise<{ index: number; data: string; score: number }[]> {
if (screenshots.length <= this.maxSelectedScreenshots) {
return screenshots.map((s) => ({ ...s, score: 5 }))
}
// Use batched evaluation to score screenshots
const batchSize = 5
const allScores: ScreenshotScore[] = []
for (let i = 0; i < screenshots.length; i += batchSize) {
const batch = screenshots.slice(i, i + batchSize)
const scores = await this.scoreScreenshotBatch(task, batch, i)
allScores.push(...scores)
}
// Filter by threshold and sort by score
const relevant = allScores
.filter((s) => s.score >= this.relevanceThreshold)
.sort((a, b) => b.score - a.score)
.slice(0, this.maxSelectedScreenshots)
// If not enough relevant screenshots, include the highest scored ones anyway
if (relevant.length < 3 && allScores.length > 0) {
const topScores = allScores
.sort((a, b) => b.score - a.score)
.slice(0, Math.min(5, allScores.length))
for (const score of topScores) {
if (!relevant.find((r) => r.index === score.index)) {
relevant.push(score)
}
}
}
return relevant.map((score) => ({
index: score.index,
data: screenshots.find((s) => s.index === score.index)?.data ?? '',
score: score.score,
}))
}
private async scoreScreenshotBatch(
task: string,
batch: { index: number; data: string }[],
_startOffset: number,
): Promise<ScreenshotScore[]> {
const content: ChatCompletionContentPart[] = [
{
type: 'text',
text: `Task: ${task}\n\nScore the following ${batch.length} screenshots for relevance to this task. Screenshots are numbered ${batch[0].index} to ${batch[batch.length - 1].index}.`,
},
]
for (const screenshot of batch) {
content.push({
type: 'text',
text: `\n--- Screenshot ${screenshot.index} ---`,
})
content.push({
type: 'image_url',
image_url: { url: screenshot.data, detail: 'low' },
})
}
try {
const response = await this.callWithRetry(
[
{ role: 'system', content: SCREENSHOT_SELECTION_PROMPT },
{ role: 'user', content },
],
true,
)
const responseContent = response.choices[0]?.message?.content || ''
return this.parseScreenshotScores(responseContent, batch)
} catch {
// On error, give all screenshots average score
return batch.map((s) => ({
index: s.index,
score: 3,
reason: 'Could not evaluate',
}))
}
}
private parseScreenshotScores(
content: string,
batch: { index: number; data: string }[],
): ScreenshotScore[] {
try {
const jsonMatch = content.match(/\{[\s\S]*\}/)
if (jsonMatch) {
const parsed = JSON.parse(jsonMatch[0])
if (parsed.scores && Array.isArray(parsed.scores)) {
return parsed.scores.map((s: Partial<ScreenshotScore>) => ({
index: s.index ?? batch[0].index,
score: Math.min(5, Math.max(1, s.score ?? 3)),
reason: s.reason ?? 'No reason provided',
}))
}
}
} catch {
// Fall through
}
// Default scores
return batch.map((s) => ({
index: s.index,
score: 3,
reason: 'Could not parse score',
}))
}
private async verifyWithScreenshots(
task: string,
finalAnswer: string | null,
screenshots: { index: number; data: string; score: number }[],
): Promise<{
responseConsistent: boolean
taskSatisfied: boolean
fullReasoning: string
}> {
const content: ChatCompletionContentPart[] = [
{
type: 'text',
text: `**Task:** ${task}\n\n**Agent's Final Response:** ${finalAnswer || '[No response provided]'}\n\n**Selected Screenshots (${screenshots.length} most relevant):**`,
},
]
for (const screenshot of screenshots) {
content.push({
type: 'text',
text: `\n--- Screenshot ${screenshot.index} (relevance score: ${screenshot.score}/5) ---`,
})
content.push({
type: 'image_url',
image_url: { url: screenshot.data, detail: 'high' },
})
}
content.push({
type: 'text',
text: '\nVerify the task completion based on the screenshots and final response.',
})
const response = await this.callWithRetry([
{ role: 'system', content: MULTIMODAL_VERIFICATION_PROMPT },
{ role: 'user', content },
])
const responseContent = response.choices[0]?.message?.content || ''
return this.parseVerification(responseContent)
}
private parseVerification(content: string): {
responseConsistent: boolean
taskSatisfied: boolean
fullReasoning: string
} {
const upperContent = content.toUpperCase()
// Parse RESPONSE_CONSISTENT
let responseConsistent = false
if (upperContent.includes('RESPONSE_CONSISTENT: YES')) {
responseConsistent = true
} else if (upperContent.includes('RESPONSE_CONSISTENT: NO')) {
responseConsistent = false
} else {
// Fallback: check if there's any indication
responseConsistent =
!upperContent.includes('HALLUCINATION') &&
!upperContent.includes('INCONSISTENT') &&
!upperContent.includes('NOT SUPPORTED')
}
// Parse TASK_SATISFIED
let taskSatisfied = false
if (upperContent.includes('TASK_SATISFIED: YES')) {
taskSatisfied = true
} else if (upperContent.includes('TASK_SATISFIED: NO')) {
taskSatisfied = false
} else {
// Fallback: check verdict
if (upperContent.includes('VERDICT: PASS')) {
taskSatisfied = true
}
}
// Override with final verdict if present
if (upperContent.includes('VERDICT: FAIL')) {
// If the model gave only a verdict and skipped the individual markers,
// treat an explicit FAIL as both criteria failing
if (
!upperContent.includes('RESPONSE_CONSISTENT:') &&
!upperContent.includes('TASK_SATISFIED:')
) {
responseConsistent = false
taskSatisfied = false
}
}
return {
responseConsistent,
taskSatisfied,
fullReasoning: content,
}
}
private async callWithRetry(
messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[],
useJsonFormat = false,
attempt = 1,
): Promise<OpenAI.Chat.Completions.ChatCompletion> {
try {
const options: OpenAI.Chat.Completions.ChatCompletionCreateParamsNonStreaming =
{
model: this.model,
temperature: 0,
messages,
max_tokens: 2000,
}
if (useJsonFormat) {
options.response_format = { type: 'json_object' }
}
return await this.client.chat.completions.create(options)
} catch (error) {
if (attempt < this.maxRetries) {
const delay = this.retryDelayMs * 2 ** (attempt - 1)
await new Promise((resolve) => setTimeout(resolve, delay))
return this.callWithRetry(messages, useJsonFormat, attempt + 1)
}
throw error
}
}
}
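The `callWithRetry` helper above retries failed OpenAI calls with exponential backoff: the delay doubles each attempt (`retryDelayMs * 2 ** (attempt - 1)`, so 1s, 2s for three total attempts), and the last error is rethrown once retries are exhausted. A minimal standalone sketch of the same pattern — `fn` is a stand-in for the OpenAI request and is not part of the original grader:

```typescript
// Sketch of the backoff used by callWithRetry: the delay doubles per
// attempt, and the error propagates once maxRetries is exhausted.
// `fn` is a hypothetical stand-in for the OpenAI API call.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  retryDelayMs = 1000,
  attempt = 1,
): Promise<T> {
  try {
    return await fn()
  } catch (error) {
    if (attempt < maxRetries) {
      const delay = retryDelayMs * 2 ** (attempt - 1)
      await new Promise((resolve) => setTimeout(resolve, delay))
      return withRetry(fn, maxRetries, retryDelayMs, attempt + 1)
    }
    throw error
  }
}
```

Note that the grader recurses rather than loops, so each retry carries the incremented `attempt` through the call chain.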

@@ -1,447 +0,0 @@
import OpenAI from 'openai'
import { type GraderResult, isToolInputAvailable } from '../../types'
import type { Grader, GraderInput } from '../types'
/**
* Fara Rubric Verifier
*
* Based on the Fara paper (Microsoft Research, 2024):
* "The Rubric Verifier generates a rubric for each task and judges the
* corresponding trajectory against the rubric, crediting points for partial
* completion of various sub-goals. Each rubric is expressed as a list of
* criteria that a trajectory would likely need to meet in order to be successful."
*
* Two-step process:
* 1. Generate task-specific rubric with criteria and point values
* 2. Score trajectory against rubric, calculating proportion of points satisfied
*
* Uses threshold of 0.8 - trajectories scoring above this are marked successful.
*/
const RUBRIC_GENERATION_PROMPT = `You are an expert evaluator creating a rubric for assessing web agent task completion.
Given a task, generate a detailed rubric with specific, measurable criteria that a web agent would need to satisfy to successfully complete the task.
**Instructions:**
1. Break down the task into discrete, verifiable sub-goals
2. Assign point values based on importance (total should sum to 100)
3. Make criteria specific and observable from the action sequence
4. Include both process criteria (correct navigation, interactions) and outcome criteria (final result)
**Output Format:**
Return a JSON object with the following structure:
{
"criteria": [
{
"id": 1,
"description": "Description of criterion",
"points": <number>,
"required": <boolean>
}
],
"total_points": 100
}
**Guidelines:**
- Mark criteria as "required": true if failure means the task cannot be successful
- Include 4-8 criteria for most tasks
- Ensure criteria are observable from action sequence and final response
- Consider edge cases and partial completions`
const RUBRIC_SCORING_PROMPT = `You are an expert evaluator scoring a web agent's trajectory against a rubric.
**Instructions:**
1. Carefully review each criterion in the rubric
2. Determine if the agent's actions and response satisfy each criterion
3. Award full points, partial points (if applicable), or zero points for each criterion
4. Provide clear justification for each score
**Output Format:**
Return a JSON object with the following structure:
{
"scores": [
{
"criterion_id": <number>,
"points_earned": <number>,
"max_points": <number>,
"satisfied": <boolean>,
"justification": "Brief explanation"
}
],
"total_earned": <number>,
"total_possible": <number>,
"percentage": <number>,
"required_criteria_met": <boolean>,
"summary": "Overall assessment summary"
}`
interface RubricCriterion {
id: number
description: string
points: number
required: boolean
}
interface RubricScore {
criterion_id: number
points_earned: number
max_points: number
satisfied: boolean
justification: string
}
interface Rubric {
criteria: RubricCriterion[]
total_points: number
}
interface ScoringResult {
scores: RubricScore[]
total_earned: number
total_possible: number
percentage: number
required_criteria_met: boolean
summary: string
}
export class FaraRubricGrader implements Grader {
name = 'fara_rubric'
private client: OpenAI
private model: string
private passThreshold = 0.8
private maxRetries = 3
private retryDelayMs = 1000
constructor(apiKey: string, baseUrl?: string, model?: string) {
this.client = new OpenAI({
apiKey,
baseURL: baseUrl || undefined,
})
this.model = model || 'gpt-4o-mini'
}
async grade(input: GraderInput): Promise<GraderResult> {
try {
// Step 1: Generate rubric for the task
const rubric = await this.generateRubric(input.task.query)
// Step 2: Score trajectory against rubric
const actionSequence = this.extractActionSequence(input)
const scoringResult = await this.scoreAgainstRubric(
input.task.query,
rubric,
actionSequence,
input.finalAnswer,
)
const score = scoringResult.percentage / 100
const isPass =
score >= this.passThreshold && scoringResult.required_criteria_met
return {
score,
pass: isPass,
reasoning: this.formatReasoning(rubric, scoringResult),
details: {
verifier: 'rubric',
rubric: rubric.criteria,
scores: scoringResult.scores,
totalEarned: scoringResult.total_earned,
totalPossible: scoringResult.total_possible,
percentage: scoringResult.percentage,
threshold: this.passThreshold * 100,
requiredCriteriaMet: scoringResult.required_criteria_met,
model: this.model,
},
}
} catch (error) {
return {
score: 0,
pass: false,
reasoning: `Rubric verifier error: ${error instanceof Error ? error.message : String(error)}`,
details: { error: true, verifier: 'rubric' },
}
}
}
private async generateRubric(task: string): Promise<Rubric> {
const response = await this.callWithRetry([
{ role: 'system', content: RUBRIC_GENERATION_PROMPT },
{
role: 'user',
content: `Generate a rubric for evaluating this web task:\n\n${task}`,
},
])
const content = response.choices[0]?.message?.content || ''
return this.parseRubric(content)
}
private async scoreAgainstRubric(
task: string,
rubric: Rubric,
actionSequence: string,
finalAnswer: string | null,
): Promise<ScoringResult> {
const rubricJson = JSON.stringify(rubric, null, 2)
const userPrompt = `**Task:** ${task}
**Rubric:**
${rubricJson}
**Agent Action Sequence:**
${actionSequence || 'No actions taken'}
**Final Response:** ${finalAnswer || '[No response provided]'}
Score this trajectory against each criterion in the rubric.`
const response = await this.callWithRetry([
{ role: 'system', content: RUBRIC_SCORING_PROMPT },
{ role: 'user', content: userPrompt },
])
const content = response.choices[0]?.message?.content || ''
return this.parseScoringResult(content, rubric)
}
private parseRubric(content: string): Rubric {
try {
const jsonMatch = content.match(/\{[\s\S]*\}/)
if (jsonMatch) {
const parsed = JSON.parse(jsonMatch[0])
if (
parsed.criteria &&
Array.isArray(parsed.criteria) &&
parsed.criteria.length > 0
) {
return {
criteria: parsed.criteria.map(
(c: Partial<RubricCriterion>, idx: number) => ({
id: c.id ?? idx + 1,
description: c.description ?? `Criterion ${idx + 1}`,
points: c.points ?? 25,
required: c.required ?? false,
}),
),
total_points:
parsed.total_points ||
parsed.criteria.reduce(
(sum: number, c: Partial<RubricCriterion>) =>
sum + (c.points ?? 25),
0,
),
}
}
}
} catch {
// Fall through to default rubric
}
return this.getDefaultRubric()
}
private getDefaultRubric(): Rubric {
return {
criteria: [
{
id: 1,
description: 'Agent navigated to relevant pages for the task',
points: 25,
required: true,
},
{
id: 2,
description: 'Agent performed correct interactions (clicks, inputs)',
points: 25,
required: false,
},
{
id: 3,
description: 'Agent reached the target state or information',
points: 30,
required: true,
},
{
id: 4,
description: 'Final response accurately addresses the task',
points: 20,
required: false,
},
],
total_points: 100,
}
}
private parseScoringResult(content: string, rubric: Rubric): ScoringResult {
try {
const jsonMatch = content.match(/\{[\s\S]*\}/)
if (jsonMatch) {
const parsed = JSON.parse(jsonMatch[0])
if (parsed.scores && Array.isArray(parsed.scores)) {
const totalEarned =
parsed.total_earned ??
parsed.scores.reduce(
(sum: number, s: Partial<RubricScore>) =>
sum + (s.points_earned ?? 0),
0,
)
const totalPossible =
parsed.total_possible ??
rubric.total_points ??
parsed.scores.reduce(
(sum: number, s: Partial<RubricScore>) =>
sum + (s.max_points ?? 0),
0,
)
const requiredCriteriaMet =
parsed.required_criteria_met ??
this.checkRequiredCriteria(parsed.scores, rubric)
return {
scores: parsed.scores.map(
(s: Partial<RubricScore>, idx: number) => ({
criterion_id: s.criterion_id ?? idx + 1,
points_earned: s.points_earned ?? 0,
max_points: s.max_points ?? 25,
satisfied: s.satisfied ?? false,
justification: s.justification ?? 'No justification provided',
}),
),
total_earned: totalEarned,
total_possible: totalPossible,
percentage:
parsed.percentage ??
(totalPossible > 0
? Math.round((totalEarned / totalPossible) * 100)
: 0),
required_criteria_met: requiredCriteriaMet,
summary: parsed.summary ?? 'Scoring completed',
}
}
}
} catch {
// Fall through to default scoring
}
return this.getDefaultScoringResult(rubric)
}
private checkRequiredCriteria(
scores: Partial<RubricScore>[],
rubric: Rubric,
): boolean {
const requiredIds = rubric.criteria
.filter((c) => c.required)
.map((c) => c.id)
for (const reqId of requiredIds) {
const score = scores.find((s) => s.criterion_id === reqId)
if (!score || !score.satisfied) {
return false
}
}
return true
}
private getDefaultScoringResult(rubric: Rubric): ScoringResult {
return {
scores: rubric.criteria.map((c) => ({
criterion_id: c.id,
points_earned: 0,
max_points: c.points,
satisfied: false,
justification: 'Unable to evaluate',
})),
total_earned: 0,
total_possible: rubric.total_points,
percentage: 0,
required_criteria_met: false,
summary: 'Unable to parse scoring result',
}
}
private formatReasoning(rubric: Rubric, result: ScoringResult): string {
const lines: string[] = []
lines.push('**Rubric Evaluation**\n')
lines.push(
`Score: ${result.total_earned}/${result.total_possible} (${result.percentage}%)`,
)
lines.push(`Threshold: ${this.passThreshold * 100}%`)
lines.push(
`Required Criteria Met: ${result.required_criteria_met ? 'Yes' : 'No'}\n`,
)
lines.push('**Criteria Scores:**')
for (const score of result.scores) {
const criterion = rubric.criteria.find((c) => c.id === score.criterion_id)
const status = score.satisfied ? 'PASS' : 'FAIL'
const required = criterion?.required ? ' [REQUIRED]' : ''
lines.push(
`- ${criterion?.description ?? `Criterion ${score.criterion_id}`}${required}: ${score.points_earned}/${score.max_points} (${status})`,
)
lines.push(` Justification: ${score.justification}`)
}
lines.push(`\n**Summary:** ${result.summary}`)
return lines.join('\n')
}
private extractActionSequence(input: GraderInput): string {
const actions: string[] = []
let stepNum = 1
for (const msg of input.messages) {
if (isToolInputAvailable(msg)) {
const paramsStr = this.formatParams(
msg.input as Record<string, unknown>,
)
actions.push(`${stepNum}. ${msg.toolName}(${paramsStr})`)
stepNum++
}
}
return actions.join('\n')
}
private formatParams(params: Record<string, unknown>): string {
const entries = Object.entries(params)
if (entries.length === 0) return ''
return entries
.map(([key, value]) => {
const strValue =
typeof value === 'string'
? `"${value.substring(0, 100)}${value.length > 100 ? '...' : ''}"`
: JSON.stringify(value)
return `${key}=${strValue}`
})
.join(', ')
}
private async callWithRetry(
messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[],
attempt = 1,
): Promise<OpenAI.Chat.Completions.ChatCompletion> {
try {
return await this.client.chat.completions.create({
model: this.model,
temperature: 0,
messages,
max_tokens: 2000,
response_format: { type: 'json_object' },
})
} catch (error) {
if (attempt < this.maxRetries) {
const delay = this.retryDelayMs * 2 ** (attempt - 1)
await new Promise((resolve) => setTimeout(resolve, delay))
return this.callWithRetry(messages, attempt + 1)
}
throw error
}
}
}
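The rubric grader's pass decision combines two gates: the earned-points percentage must reach the 0.8 threshold, and every criterion marked `required` must be satisfied — so a trajectory can score 100% on optional criteria and still fail. A self-contained sketch of that arithmetic (types trimmed to the fields the decision reads; names here are illustrative, not from the original file):

```typescript
// Sketch of the rubric pass decision: score is the proportion of points
// earned, and a trajectory passes only if it clears the threshold AND
// every required criterion was satisfied.
interface CriterionScore {
  points_earned: number
  max_points: number
  satisfied: boolean
  required: boolean
}

function rubricVerdict(
  scores: CriterionScore[],
  passThreshold = 0.8,
): { score: number; pass: boolean } {
  const earned = scores.reduce((sum, s) => sum + s.points_earned, 0)
  const possible = scores.reduce((sum, s) => sum + s.max_points, 0)
  const score = possible > 0 ? earned / possible : 0
  const requiredMet = scores
    .filter((s) => s.required)
    .every((s) => s.satisfied)
  return { score, pass: score >= passThreshold && requiredMet }
}
```

The required-criteria gate is why `checkRequiredCriteria` exists as a fallback: if the model omits `required_criteria_met`, the grader recomputes it from the individual scores.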

@@ -1,78 +1,23 @@
import type { GraderResult } from '../types'
import { Mind2WebJudgeGrader } from './benchmark/mind2web'
import { WebVoyagerGrader } from './benchmark/webvoyager'
import { FaraAlignmentGrader } from './fara/alignment'
import { FaraCombinedGrader } from './fara/combined'
import { FaraMultimodalGrader } from './fara/multimodal'
import { FaraRubricGrader } from './fara/rubric'
import { AgisdkStateDiffGrader } from './benchmark/agisdk-state-diff'
import { InfinityStateGrader } from './benchmark/infinity-state'
import { PerformanceGrader } from './performance/performance-grader'
import type { Grader, GraderInput } from './types'
interface GraderOptions {
apiKey: string
baseUrl?: string
model?: string
}
export const PASS_FAIL_GRADER_ORDER = [
'agisdk_state_diff',
'infinity_state',
'performance_grader',
] as const
export function createGrader(
name: string,
options: GraderOptions | null,
): Grader | null {
export function createGrader(name: string): Grader | null {
switch (name) {
// Benchmark graders
case 'webvoyager_grader':
if (!options?.apiKey) return null
return new WebVoyagerGrader(
options.apiKey,
options.baseUrl,
options.model,
)
case 'mind2web_judge':
case 'mind2web_grader':
if (!options?.apiKey) return null
return new Mind2WebJudgeGrader(
options.apiKey,
options.baseUrl,
options.model,
)
// Fara individual verifiers
case 'fara_alignment':
if (!options?.apiKey) return null
return new FaraAlignmentGrader(
options.apiKey,
options.baseUrl,
options.model || 'gpt-4o-mini',
)
case 'fara_rubric':
if (!options?.apiKey) return null
return new FaraRubricGrader(
options.apiKey,
options.baseUrl,
options.model || 'gpt-4o-mini',
)
case 'fara_multimodal':
if (!options?.apiKey) return null
return new FaraMultimodalGrader(
options.apiKey,
options.baseUrl,
options.model || 'gpt-4o',
)
// Fara combined 3-verifier system (majority voting)
case 'fara_grader':
case 'fara_combined':
if (!options?.apiKey) return null
return new FaraCombinedGrader(
options.apiKey,
options.baseUrl,
options.model,
)
// Multi-axis performance grader (Claude Agent SDK — uses its own Claude default model)
case 'agisdk_state_diff':
return new AgisdkStateDiffGrader()
case 'infinity_state':
return new InfinityStateGrader()
case 'performance_grader':
return new PerformanceGrader()
default:
console.warn(`Unknown grader: ${name}`)
return null
@@ -82,22 +27,20 @@ export function createGrader(
export async function runGraders(
graderNames: string[],
input: GraderInput,
options: GraderOptions | null,
): Promise<Record<string, GraderResult>> {
const results: Record<string, GraderResult> = {}
for (const name of graderNames) {
const grader = createGrader(name, options)
if (grader) {
try {
console.log(` Running grader: ${name}`)
results[name] = await grader.grade(input)
} catch (error) {
results[name] = {
score: 0,
pass: false,
reasoning: `Error running grader: ${error}`,
}
const grader = createGrader(name)
if (!grader) continue
try {
console.log(` Running grader: ${name}`)
results[name] = await grader.grade(input)
} catch (error) {
results[name] = {
score: 0,
pass: false,
reasoning: `Error running grader: ${error}`,
}
}
}
@@ -105,13 +48,4 @@ export async function runGraders(
return results
}
// Export grader classes for direct use
export {
FaraAlignmentGrader,
FaraCombinedGrader,
FaraMultimodalGrader,
FaraRubricGrader,
Mind2WebJudgeGrader,
PerformanceGrader,
WebVoyagerGrader,
}
export { AgisdkStateDiffGrader, InfinityStateGrader, PerformanceGrader }
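The diff above collapses the Map-backed plugin registry into a plain switch: with only three graders remaining and none taking API keys, a switch that returns `null` for unknown names replaces the registration plumbing and the `GraderOptions` parameter. A reduced sketch of the resulting shape (grader names taken from the diff; the class bodies are placeholders, not the real graders):

```typescript
// Reduced sketch of the collapsed registry: a plain switch replaces the
// Map-based plugin lookup. Class bodies are placeholders only.
interface Grader {
  name: string
}

class AgisdkStateDiffGrader implements Grader {
  name = 'agisdk_state_diff'
}
class InfinityStateGrader implements Grader {
  name = 'infinity_state'
}
class PerformanceGrader implements Grader {
  name = 'performance_grader'
}

function createGrader(name: string): Grader | null {
  switch (name) {
    case 'agisdk_state_diff':
      return new AgisdkStateDiffGrader()
    case 'infinity_state':
      return new InfinityStateGrader()
    case 'performance_grader':
      return new PerformanceGrader()
    default:
      return null
  }
}
```

Returning `null` (rather than throwing) lets `runGraders` skip unknown names with a warning and keep running the rest of the list.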

@@ -11,6 +11,8 @@ export interface GraderInput {
finalAnswer: string | null
expectedAnswer?: string | null
outputDir: string
mcpUrl?: string
infinityAppUrl?: string
}
export interface Grader {

@@ -13,31 +13,34 @@ const { values } = parseArgs({
if (values.help) {
console.log(`
Web Agent Eval System
BrowserOS Eval
Usage:
bun run eval # Opens dashboard in config mode
bun run eval --config <config.json> # Runs eval with config file
Config file should include:
- agent: Agent configuration (single or orchestrator-executor)
- dataset: Path to dataset JSONL file
- output_dir: Output directory for results (optional, default: ./results)
- num_workers: Number of parallel workers
- browseros.server_url: BrowserOS server URL
- grader_model, grader_api_key_env, grader_base_url: Grader settings (optional)
- timeout_ms: Task timeout in ms (optional)
Available agent types:
- single Single LLM agent driven by the BrowserOS tool loop
- orchestrator-executor High-level planner + visual/text executor
Preset configs available in configs/:
- configs/webvoyager-full.json Full WebVoyager evaluation
- configs/mind2web-full.json Full Mind2Web evaluation
- configs/webvoyager-test.json WebVoyager test subset (10 tasks)
- configs/mind2web-test.json Mind2Web test subset (10 tasks)
Available graders:
- performance_grader Multi-axis grader using Claude Agent SDK
- agisdk_state_diff AGI SDK / REAL Bench state-diff grader
- infinity_state WebArena-Infinity verifier-script grader
Preset configs in configs/:
- browseros-agent-weekly.json Weekly eval (single agent)
- browseros-oe-agent-weekly.json Weekly eval (orchestrator + LLM executor)
- browseros-oe-clado-weekly.json Weekly eval (orchestrator + Clado executor)
- agisdk-real-smoke.json AGI SDK smoke run
- infinity-hard-50.json WebArena-Infinity hard-50 set
- test-webvoyager.json WebVoyager test
- test-mind2web.json Mind2Web test
Examples:
bun run eval # Dashboard config mode
bun run eval -c configs/webvoyager-test.json # WebVoyager test
bun run eval -c configs/mind2web-full.json # Full Mind2Web eval
bun run eval # Dashboard config mode
bun run eval -c configs/browseros-agent-weekly.json
bun run eval -c configs/test-webvoyager.json
`)
process.exit(0)
}

@@ -23,9 +23,14 @@ import {
import { dirname, join } from 'node:path'
import { fileURLToPath } from 'node:url'
import { type Subprocess, spawn, spawnSync } from 'bun'
import type { EvalPorts } from '../utils/dev-config'
import { sleep } from '../utils/sleep'
export interface EvalPorts {
cdp: number
server: number
extension: number
}
const MAX_RESTART_ATTEMPTS = 3
const CDP_WAIT_TIMEOUT_MS = 30_000
const SERVER_HEALTH_TIMEOUT_MS = 30_000
