Commit Graph

2464 Commits

Author SHA1 Message Date
Nikhil Sonti
919a5e898b chore: merge origin dev into patch cli fixes 2026-05-04 18:37:00 -07:00
Nikhil Sonti
70bd9533e6 fix: address review feedback for PR #941 2026-05-04 18:32:32 -07:00
Nikhil Sonti
860acd43c1 test: cover patch CLI checkout ergonomics 2026-05-04 18:18:45 -07:00
Nikhil Sonti
49e20e73b1 fix: add browseros-patch llm quick reference 2026-05-04 18:17:47 -07:00
Nikhil Sonti
f08435a05e fix: add browseros-patch help examples 2026-05-04 18:16:23 -07:00
Nikhil Sonti
ccc39590c0 fix: clarify browseros-patch checkout terminology 2026-05-04 18:14:55 -07:00
Nikhil Sonti
b603aeb953 fix: make checkout detection errors actionable 2026-05-04 18:13:13 -07:00
Nikhil
eed158eca0 fix(patch): handle canonical workspace paths (#940) 2026-05-04 18:09:51 -07:00
Nikhil
d61d6fc8a9 feat: add ACPX agent runtime adapters (#924)
* feat: add acpx claude runtime paths

* feat: add acpx adapter preparation

* refactor: use acpx adapter preparation

* refactor: move openclaw image turns to adapter

* fix: keep openclaw independent of host cwd

* fix: address acpx review feedback

* fix: preserve claude host auth in acpx
2026-05-04 11:04:24 -07:00
shivammittal274
d383b5e344 feat(eval): add claude-generated run report artifact (#892)
* feat(eval): add claude-generated run report artifact

* fix(eval): install claude code cli for CI evals

* fix(eval): bypass claude code tool permissions

* Eval metrics configs (#932)

* feat(eval): add agisdk comparison metrics configs

* fix(eval): keep cdp crashes from aborting run
2026-05-04 21:09:06 +05:30
Dani Akash
ce4bb44083 feat(agent): /home composer parity with image attachments (#930)
* feat(agent): /home composer parity with image attachments

The /home composer used the same ConversationInput component as the
chat screen but passed attachmentsEnabled={false}, and the home →
chat handoff was a URL search param `?q=<text>` that physically
can't carry binary attachments. Pasting a screenshot at /home did
nothing.

Add a small in-memory registry (pending-initial-message.ts) as the
rich-data side channel for the same navigation: the home composer
writes { agentId, text, attachments } there before navigating; the
chat screen consumes it on mount and replays through the existing
harness send() path that already supports attachments. URL `?q=`
stays for shareable text-only prompts; the registry wins when both
are present. Module-scope, 10s TTL, destructive consume.

Net: home is now flagged attachmentsEnabled={true}; users can paste,
drag, or pick image files at /home and they survive the navigation
into the chat screen with previews intact.

* docs(agent): clarify why initial-message ref reset is safe post-registry-fire
2026-05-04 18:02:31 +05:30
Nikhil
0d56815cba fix: store server database under BrowserOS dir (#923)
* fix: store server database under browseros dir

* fix: address PR review feedback for 923
2026-05-02 16:03:41 -07:00
Nikhil
c07d3d95d4 feat: add sqlite drizzle persistence (#919)
* feat: add drizzle agent schema

* feat: run sqlite drizzle migrations

* refactor: remove old sql identity dependency

* feat: store harness agents in sqlite

* build: package db migrations

* refactor: remove sqlite oauth token store

* feat: restore oauth token storage

* fix: handle empty install id

* chore: ignore server runtime state

* fix: address review feedback for PR 919
2026-05-02 15:19:57 -07:00
Nikhil
32530ec418 fix: default extract base to BASE_COMMIT (#922)
* fix: default extract base to BASE_COMMIT

* fix: address review feedback for PR #922
2026-05-02 15:12:17 -07:00
Nikhil
e7105ae50b fix: improve browseros-patch workspace feedback (#921)
* fix: make patch list registry-only

* feat: add patch command progress logs

* fix: address review feedback for PR #921
2026-05-02 15:09:31 -07:00
Nikhil
1d42a973ea refactor: extract acpx runtime templates (#918) 2026-05-02 14:03:15 -07:00
Nikhil
921a797c5b feat: add ACPX agent soul and memory support (#917)
* feat: add acpx agent runtime context helpers

* feat: add acpx runtime state store

* feat: prepare acpx agent runtime context

* feat: inject acpx agent command environment

* feat: forward acpx agent chat cwd

* fix: normalize acpx session record fallback

* feat: improve acpx agent soul and memory prompts

* fix: address PR review comments for memory-soul-acp

* fix: satisfy acpx runtime deepscan checks
2026-05-02 13:45:40 -07:00
Nikhil
d94597bbf9 fix(agent): add CLI model catalog entries (#915)
* fix(agent): add CLI model catalog entries

* fix: address PR review comments for acpx-models
2026-05-02 13:06:41 -07:00
github-actions[bot]
ecc6bac070 chore: sync internal-docs submodule (#911)
Co-authored-by: browseros-bot <bot@browseros.ai>
2026-05-01 20:16:26 +00:00
Dani Akash
84e2739663 feat(agent): rich rail + header on /agents/:agentId chat (#908)
* feat(agent): rich rail + header on /agents/:agentId chat

Replace the chat screen's legacy AgentEntry rail and binary READY
header with the same rich data the /agents page already exposes:
adapter glyph, liveness dot, pin star, status badge, adapter · model ·
reasoning chip line, last-used time, lifetime tokens, queue count,
and the Adapter Unavailable warning. Source of truth flips from the
merged AgentEntry list to useHarnessAgents() directly.

Sort order matches /agents (pinned → recency) — not /home
(active-first → recency) — because chat is index-shaped and shuffling
rows every 5s as turns transition would be jarring while reading.

Lift the inline pin-then-recency comparator out of /agents
AgentList.tsx into a shared agents-list-order.ts so both surfaces
stay on identical sort semantics.

* fix(agent): chat header height + composer sticking to bottom

Header was clipping descenders because the strip was vertical-content
sized at min-h-14 with tight py-2.5; bump padding and lean on natural
content height. Drop the AgentTile glyph (the rail row already shows
adapter identity) and the cwd path (too long, pushed the meta line
off-screen). Header is now name + pin star + status pill, then
adapter · model · reasoning, then last-used · tokens · queued.

Composer was floating mid-screen on short chats because the chat
grid had no grid-template-rows — the implicit auto row collapsed to
content height, so the right-column flex wrapper never received the
full container height. Add grid-rows-[minmax(0,1fr)] so the single
row claims 100% and ClawChat's flex-1 expands to push the composer
flush to the bottom.

* fix(agent): composer flush to bottom on short chats

Match the sidepanel chat's nested-flex pattern. The right-column
wrapper got h-full so it expands to the grid row; the conversation
controller's root added flex-1 so ClawChat's existing flex-1 has
something to actually fill against. Without these, the grid cell
stretched but the inner flex columns shrank to content height,
leaving the composer floating mid-screen.

* fix(agent): align rail header with chat header in shared top band

Pull the rail's "Agents" + back-button into the same horizontal strip
as the agent identity header. The two halves now sit on a single row
that spans both columns, so they can't drift in height as the chat
header gains/loses meta lines (last-used, tokens, queued).

The rail below the band keeps its scrollable list only; the chat
column below holds the conversation + composer. Border-bottom moves
from ConversationHeader to the band wrapper so we don't get a
double-rule on the boundary.

* fix(agent): reserve header height to prevent layout shift on data load

The chat header grew from a single line to three lines once the
useHarnessAgents() poll resolved (adapter chips + meta line populate
asynchronously), shoving the rail and conversation body downward.
Lock min-h-[84px] on both the band's left "Agents" cell and the
ConversationHeader root, and always render the meta line slot
(non-breaking space when empty) so the typographic frame is stable
regardless of data state.

* refactor(agent): pull status pill + meta to right side of chat header

Two-column header layout instead of three stacked rows: name + pin
star + adapter chips on the left, status pill stacked on top of the
last-used / tokens / queued meta line on the right. Drops min-h
from 84px → 60px so the band reclaims ~24px of vertical space and
the chat body starts higher on screen. Band's left "Agents" cell
matches the new height.
2026-05-01 20:19:16 +05:30
Dani Akash
974e7e9b86 fix(agents): hide BrowserOS ACP envelope from chat history payloads (TKT-774) (#907)
* fix(agents): hide BrowserOS ACP envelope from chat history payloads (TKT-774)

The user-message text persisted on the wire carried two nested
envelopes — the outer `<role>You are BrowserOS…</role>` +
`<user_request>…</user_request>` block from buildBrowserosAcpPrompt
and the inner `## Browser Context` + `<selected_text>` +
`<USER_QUERY>` block from formatUserMessage. PR #856 had unwrapped
only the outer envelope on history reads, so the user bubble in
the agent rail still rendered the inner envelope, and the LLM
chat-service path leaked the wrapper all the way back to the
sidepanel client through AI SDK's stream sync.

Two surgical fixes, both server-only:

1) ACP path (acpx-runtime.ts) — replace unwrapBrowserosAcpPrompt
   with a comprehensive unwrapBrowserosAcpUserMessage that strips
   both layers and decodes the &lt;/&gt;/&amp; escapes the server
   applied via escapePromptTagText. Each step is independently
   defensive (anchors that don't match are skipped) so the helper
   is idempotent and tolerates partial / older / future-shape
   envelopes. Applied in userContentToText (history mapper) and
   inherited by extractLastUserMessage (listing's lastUserMessage).

2) LLM chat path (chat-service.ts) — split the persisted user
   message from the prompt-time copy. session.agent.appendUserMessage
   now stores the raw user text; a transient promptUiMessages array
   is built with the wrapped (formatUserMessage + context-change
   prefix) form and passed to createAgentUIStreamResponse for the
   model. onFinish restores the raw form before persisting, so the
   user-visible message and any future history reads see only the
   user's typed text.

Tests:

- acpx-runtime.test.ts: new dedicated unwrapBrowserosAcpUserMessage
  suite covering fully-wrapped messages, only-outer / only-inner
  inputs, selected_text blocks with attribute strings, idempotency,
  literal user-typed angle-bracket round-trip, and an integration
  test that round-trips the real formatUserMessage output through
  the unwrap to pin the writer/reader contract.
- chat-service.test.ts: existing 'rebuilds a managed-app session'
  test updated for the new behaviour — asserts the persisted user
  message is the raw text and the prompt copy passed to the agent
  carries the Klavis context-change notice.

* fix(agents): decode entity escapes before stripping inner envelope (TKT-774)

The unwrap was running its inner-envelope strips against the
literal-tag form (<USER_QUERY>, <selected_text>) but the persisted
payload has those tags entity-escaped (&lt;USER_QUERY&gt;,
&lt;selected_text&gt;) — buildBrowserosAcpPrompt runs
escapePromptTagText over the entire formatUserMessage payload
before adding the outer <role>+<user_request> envelope, so the
inner anchors never matched against the on-disk text and the user
was still seeing <USER_QUERY> in /agents/:id/sessions/main/history
responses.

Reorder unwrapBrowserosAcpUserMessage to: outer-strip → decode
entities → inner-strips. Test fixtures updated to reflect the
actual on-wire form (escaped inner tags); the round-trip test
duplicates the escape rule inline so the contract between
buildBrowserosAcpPrompt and the unwrap is pinned end-to-end.
2026-05-01 19:42:48 +05:30
github-actions[bot]
19e07c086f chore: sync internal-docs submodule (#903)
Co-authored-by: browseros-bot <bot@browseros.ai>
2026-05-01 08:36:41 +00:00
Nikhil
ab354d7dd7 fix(ci): restore PAT on actions/checkout for submodule fetch (#898)
Without a token on actions/checkout, the action falls back to
GITHUB_TOKEN, which has no access to the private internal-docs
repo. Submodule clone fails with "repository not found".

PAT is back on checkout. PR ops still use GITHUB_TOKEN via the
GH_TOKEN env var on the run step. The bot-branch git push uses
the credential helper set up by checkout (the PAT, which has
Contents: Read and write).
2026-04-30 16:23:58 -07:00
Nikhil
0e779fa344 fix(ci): switch internal-docs sync to PR + auto-merge (#897)
Direct push to dev fails the dev ruleset's "Require pull request"
rule. Open a tiny PR from a bot branch and enable auto-merge
(squash, 0 approvals required) instead. No bypass actor needed —
the rule stays strict for everyone, including the bot.

PR ops use GITHUB_TOKEN with explicit pull-requests: write
permission. The cross-repo PAT is only used to rewrite the SSH
submodule URL so internal-docs can be cloned over HTTPS.
2026-04-30 16:17:15 -07:00
Nikhil
dfbce48994 feat: remove CLI auto init discovery (#896)
* feat: remove CLI auto init discovery

* fix: address review feedback for PR #896
2026-04-30 16:03:47 -07:00
Nikhil
7c942e91ce chore: add internal-docs submodule (#895)
Mounts browseros-ai/internal-docs at .internal-docs/, tracking main.

This activates the /document-internal and /ask-internal skills (which
early-exit if the submodule is missing) and lets the sync-internal-docs
workflow start bumping the pointer on its 4-hourly schedule.

Team members: after this lands, run once from a fresh dev pull:
    git submodule update --init .internal-docs
2026-04-30 15:13:41 -07:00
Nikhil
1ff92c44b3 feat(internal-docs): scaffold private docs submodule, skills, sync action (#894)
* feat(internal-docs): scaffold private docs submodule, skills, sync action

Adds the OSS-side scaffolding for the internal-docs system:

- /document-internal skill — drafts a 1-page feature/architecture/design
  doc from the current branch's diff, asks four sharp questions, enforces
  voice rules (no em dashes, banned filler words, 60-line cap on feature
  notes), then opens a PR to browseros-ai/internal-docs via a tmp clone.
- /ask-internal skill — answers team-internal questions by greping
  internal-docs and the codebase, synthesizing with file:line citations,
  optionally executing surfaced commands with per-command confirmation,
  and drafting a new doc + PR if grep returns nothing useful.
- .github/workflows/sync-internal-docs.yml — every 4 hours, bumps the
  submodule pointer on dev directly (no PR; relies on dev branch
  protection blocking force-push). Skips silently until the submodule
  is configured. Uses url.insteadOf to rewrite the SSH submodule URL
  to HTTPS-with-token for the bot, while keeping SSH the local default.
- .claude/skills/document-internal/seeds/ — root README and three
  templates (feature-note, architecture-note, design-spec) ready to
  copy into the new internal-docs repo on rollout.

Design spec: .llm/superpowers/specs/2026-04-30-internal-docs-submodule-design.md

Manual prereqs (NOT in this PR — handled out-of-band):
1. Create private repo browseros-ai/internal-docs with branch protection on main.
2. Seed it with the contents of .claude/skills/document-internal/seeds/.
3. Create a bot account, mark as bypass actor on dev branch protection.
4. Add INTERNAL_DOCS_SYNC_TOKEN secret with repo + read access to internal-docs.
5. Once internal-docs exists, on a follow-up branch:
     git submodule add -b main git@github.com:browseros-ai/internal-docs.git .internal-docs
6. Send the team the one-time init snippet for their existing checkouts:
     git submodule update --init .internal-docs

* fix(internal-docs): address Greptile review feedback

- Workflow: rebase onto dev before push to handle non-fast-forward race;
  bump fetch-depth 1->50 so rebase has merge-base history.
- Workflow: move INTERNAL_DOCS_SYNC_TOKEN into step env: per Actions
  credential-injection pattern, instead of inlining in the script body.
- Skill (BASE bug): suppress git rev-parse stdout so SHA does not get
  captured into BASE alongside the literal 'dev'. Was breaking every
  downstream git log/diff call.
- Skill (tmp clone): trap 'rm -rf "$TMP" EXIT after mktemp so cleanup
  always runs, even if any subsequent step fails.
2026-04-30 15:04:08 -07:00
shivammittal274
c81906ecbf feat(eval): add claude code eval agent (#885) 2026-05-01 02:25:08 +05:30
Nikhil
ffc0f09c86 feat(dev): add target-aware reset cleanup (#893)
* feat(dev): add target-aware reset cleanup

* fix(dev): address cleanup reset review comments
2026-04-30 13:34:52 -07:00
Nikhil
7fb53c9921 feat(dev): bootstrap setup from dev watch (#891)
* feat(dev): bootstrap setup from dev watch

* fix: address review feedback for PR #891
2026-04-30 13:00:46 -07:00
Nikhil
d38b01a8c7 feat(dev): add guided cleanup and reset commands (#890)
* feat(dev): add guided cleanup and reset commands

* fix: address cleanup reset review feedback
2026-04-30 12:27:15 -07:00
Nikhil
ff36c8412b fix(dev): use run lock for watch cleanup (#889)
* fix(dev): use run lock for watch cleanup

* fix(dev): address watch lock review comments
2026-04-30 11:46:17 -07:00
Nikhil
fd5aba249b fix: stabilize OpenClaw gateway startup (#888)
* feat(server): add shared process lock helper

* feat(container): add container name reconciliation helpers

* feat(openclaw): serialize lifecycle across processes

* fix(openclaw): reconcile fixed gateway container startup

* test(openclaw): cover lifecycle race recovery

* fix(server): satisfy process lock error override

* fix(openclaw): address review feedback

* test(openclaw): align serialization mock with image check
2026-04-30 11:31:40 -07:00
Nikhil
492f3fcdf2 feat(openclaw): prewarm ghcr image in vm (#887)
* feat(openclaw): add gateway image inspection

* feat(openclaw): pull gateway image from registry

* refactor(vm): decouple readiness from image cache

* refactor(openclaw): remove vm cache from runtime factory

* feat(openclaw): detect current gateway image

* feat(openclaw): prewarm vm runtime and reuse current gateway

* feat(openclaw): prewarm runtime on server startup

* refactor(vm): remove browseros image cache runtime

* refactor(build-tools): remove openclaw tarball pipeline

* chore: self-review fixes

* fix(openclaw): suppress prewarm pull progress logs

* fix(openclaw): address review feedback

* fix(openclaw): resolve review findings

* fix(dev): stop stale watch supervisors
2026-04-30 11:18:11 -07:00
Nikhil
cb0c0dd0c1 chore: simplify root test scripts (#886)
* chore: simplify root test scripts

* fix: avoid chained root test scripts

* fix: update test workflow commands

* fix: move app test commands into packages
2026-04-30 10:58:08 -07:00
Dani Akash
8712f89f18 feat(agents): durable per-agent chat message queue + composer Stop (#880)
* feat(agents): durable per-agent chat message queue + composer Stop button

* fix(agents): tighten queue UI — smaller Stop, drop empty indicator, live drain attach

User feedback round 1 on the message-queue UX:

1) The Stop button matched the send/voice mics at h-10 w-10 with a
   solid destructive fill, which read as alarming. Shrunk to h-8 w-8,
   ghost variant with a soft destructive/10 background, smaller
   filled square glyph. Reads as a calm 'stop' affordance instead of
   a panic button.

2) The QueueItem's leading <QueueItemIndicator> dot was decorative
   only — no state, no interaction. Dropped it from QueuePanel along
   with the import; queue items now render as a clean preview line
   with the trailing X remove action.

3) When the server drained the queue and started the next turn, the
   chat panel didn't pick up the live stream until the user
   navigated away and back. The hook's resume effect previously
   only fired on agent change, not on listing-observed activeTurnId
   change. Surface activeTurnId from useHarnessAgents into
   useAgentConversation; effect now re-runs when the id changes,
   calls /chat/active, and attaches to the new turn — so a queued
   message starts streaming the moment the server drain pops it.

* fix(agents): don't reset streaming state from the resume effect's no-op paths

The Stop button was disappearing while the agent was actively
streaming, even though events were still flowing into the chat. Root
cause: the resume effect's `finally` block reset `streaming`,
`turnIdRef`, and `lastSeqRef` unconditionally — including on the
early-return paths (no active turn, or another mechanism already
owns the stream).

Sequence that triggered it:
  1) User sends a message → send() sets streamAbortRef + streaming=true
     and starts consuming the SSE.
  2) User enqueues another message → enqueue mutation invalidates the
     listing query.
  3) Listing refetches with the live activeTurnId → the resume
     effect re-fires (deps include activeTurnIdDep).
  4) attemptResume hits `if (streamAbortRef.current) return` because
     send() owns it.
  5) The finally clause fires anyway and calls setStreaming(false),
     clobbering the live state set by send(). The SSE consumer keeps
     running (refs are intact) so text keeps streaming, but the React
     flag is wrong, so the Stop button gates off.

Fix: track whether *this* run actually started a stream
(`weStartedStream`). The finally only resets state when it does.
Early-return / no-active-turn paths now leave streaming/turnIdRef/
lastSeqRef alone for whoever does own them.

Also widens the Stop button's visibility (`canStop` prop on
ConversationInput) so it stays steady across the brief gap between
turns when a queue drain is mid-flight; the parent computes
`streaming || activeTurnId !== null || queue.length > 0`. The
visibility widening is independent of the streaming-state fix above
— both are now in place.

* revert: drop canStop widening — Stop only shows while streaming

Reverts the canStop prop on ConversationInput and the OR-with-queue
visibility from AgentCommandConversation. Stop is gated solely on
`streaming` again. Between turns (queue draining) the button stays
hidden — only the actively-streaming turn is interruptible from the
composer, which matches what the user actually expects.

* fix(agents): persist the kicking-off prompt on active turns so the resume placeholder isn't empty

When a queued message drained and started a new turn, the chat
panel's resume effect staged a placeholder turn with userText: ''
because the hook had no way to know what message kicked off the
turn — only the agent-side stream was visible, and the user bubble
above it was blank until the user navigated away and back (at which
point the session record's history loaded normally).

Fix: ActiveTurnRegistry.register now accepts an optional `prompt`
that's stashed on the turn and surfaced via describe() / the
ActiveTurnInfo response. AgentHarnessService.startTurn passes the
incoming message into register. /chat/active returns it. The chat
hook's resume effect uses active.prompt as the placeholder
turn's userText, so the user bubble shows the queued message text
the moment streaming begins. Falls back to '' for older clients
that haven't been refetched yet.

* fix(agents): always release streamAbortRef on resume cleanup, even when cancelled

Greptile P1 follow-up. The previous `weStartedStream` guard correctly
stopped the resume effect's no-op early-returns from clobbering an
in-flight `send()` stream — but it also stopped a *cancelled*
mid-stream resume from clearing its own `streamAbortRef`. When the
cleanup fires (e.g. the 5s listing poll captures a new queue-drain
turn id while the SSE for the prior turn is still finishing), the
next effect run hits the `if (streamAbortRef.current) return` guard
against the now-aborted controller and never reattaches, leaving
`streaming === true` with no live stream until the user navigates
away.

Split the finally block: always release `streamAbortRef` when we
owned the controller (so the next run can take over), but only
reset the streaming flag / turn id / lastSeq on a clean exit (the
new run will set those itself, so resetting on cancel would just
flicker).
2026-04-30 18:26:56 +05:30
Dani Akash
ba60bf466f feat(agents): rich command-center rows + home grid + dead-code sweep (#879)
* feat(agents): rich-info command center rows + pin/PATCH/adapter-health backbone

Splits AgentRowCard from a 271-line monolith into a shallow tree of
single-responsibility sub-components under `agent-row/`:

  AgentTile, AdapterHealthDot, PinToggle, AgentTitleRow,
  AgentSparkline, AgentSummaryChips, AgentLastMessage, CwdChip,
  AgentTokenSummary, AgentMetaRow, AgentErrorPanel, AgentActions

Adds the data each row consumes:

- pinned: boolean field on AgentDefinition + FileAgentStore.update
  + new PATCH /agents/:id route. useUpdateHarnessAgent mutation
  optimistically updates the listing cache so the star flips
  instantly; rolls back on error.
- Listing payload extended with lastUserMessage, cwd, tokens
  (cumulative + last7d shape — last7d zero-filled until the
  activity ledger lands), turnsByDay/failedByDay (zero-filled),
  lastError/lastErrorAt, activeTurnId. AcpxRuntime grows a
  getRowSnapshot() that reads cwd + cumulative tokens + last user
  message from the session record in one pass.
- Adapter health: in-memory AdapterHealthChecker probes
  `claude --version` / `codex --version` with a 2s timeout and
  caches results for 5 min. /adapters response carries
  { healthy, reason?, checkedAt }. Tile-corner dot exposes the
  state via HoverCard; openclaw inherits health from the gateway
  snapshot already on the page.

Sub-components are pure: card itself owns no state. Sort order
becomes pinned-first, then recency. HoverCard is the workhorse for
keeping rows compact while exposing depth (full message, token
breakdown, daily turn list, error stack, adapter reason).

* refactor(agents): tighten command-center row design + cut redundant affordances

User feedback round 1:
1) Two green dots on the tile (health + liveness) was confusing. Health
   moves out of the tile entirely and surfaces as an inline 'Unavailable'
   chip in the model line — silent when the adapter is healthy, with a
   warning amber chip + HoverCard reason when not. The tile now shows
   one signal: liveness.
2) The last-user-message HoverCard wasn't telegraphing intent. Drop the
   HoverCard. The line is informational, italic, with a leading quote
   glyph so the row reads like a conversation snippet. To see the full
   message the user opens the chat (which is the action they want next
   anyway).
3) Resume + Chat were duplicate CTAs. Single primary action per row:
   Resume (filled, accent-orange, with a pulsing dot) replaces Chat
   when there's an active turn. Both navigate to /agents/:id but the
   row tells the user which action they're taking.
4) Tokens weren't visible because the row gated on last7d.requestCount,
   which is zero until the activity ledger ships. Switch to lifetime
   tokens (which we have today). Drop the '7d stats:' framing — talking
   about a window we can't compute would be misleading. The HoverCard
   surfaces input/output split + a footnote that per-window stats land
   in a follow-up.
5) CWD was rendering the server's own running directory, which is
   meaningless to users. Hide it from the row entirely. The cwd field
   still rides in the listing payload for future surfaces (chat panel,
   debug view) — only the row stops rendering it.

Aesthetic refinements while we're here:
- Whole card carries state, not just the tile: working rows get an
  accent-orange tinted border with a soft glow, error rows tint
  destructive, idle rows lift on hover.
- Pin star fades in on hover (group-hover) when unpinned and stays
  solid amber when pinned — keeps the rail calm by default.
- Tabular-nums on token figures so columns visually align across rows.
- Drop CwdChip and AdapterHealthDot files: no callers left.

* fix(agents): align row title flush-left whether pinned or not

Pin star moved from leading the title to trailing the badges, and
hidden from layout entirely (`hidden group-hover:inline-flex`) when
unpinned. The previous `opacity-0` rule kept the star reserving its
`size-6` slot, which left every unpinned title indented relative to
the model / preview / meta lines underneath it. Title now flushes
left in both states; pinned star stays solid amber so the signal
isn't hidden, and unpinned reveals an outline star on row hover for
the toggle affordance.

* fix(agents): keep pin-toggle slot reserved so row height is constant

Switching the unpinned star from `hidden group-hover:inline-flex`
to `opacity-0 group-hover:opacity-100`. The hidden/show variant was
collapsing the title row's height when the star wasn't rendered,
which made every card below visibly shift on hover. Always rendering
the button (with opacity-only visibility) keeps the row's vertical
metrics constant; the title still flushes left because the slot is
trailing, not leading.

Card hover effect (-translate-y + shadow-md) restored — the layout
shift wasn't coming from the card hover; it was the pin slot
appearing and disappearing.

* fix(agents): quieten row hover — border-tint only, no lift, no shadow

Drop the `-translate-y-px` and `hover:shadow-md` from the row card
plus the working-state inner ring. The translate + shadow grow
combination was visibly noisy as the cursor moved through the rail —
each row 'lifted' as you passed over it. Hover now just tints the
border in accent-orange/30; working and error states keep their
distinct border colours but no inner ring. Card height and shadow
stay constant in every state, so the rail reads as a calm vertical
list of cards.

* feat(home): rich Recent Agents grid + dead-code sweep

The /home Recent Agents grid was a placeholder shell. Every 'rich'
field on the card (lastMessage, lastMessageTimestamp, activitySummary,
currentTool, costUsd) was wired to undefined because AgentCommandHome
called `buildAgentCardData(agents, status?.status, undefined)` — the
dashboard arg has been hard-coded undefined since the harness
migration. Repointing the grid at `useHarnessAgents` + `useAgentAdapters`
gives every card the same enriched data the rail uses.

What the new card shows per agent:
  • Adapter glyph tile + liveness dot (working pulses; asleep is
    hollow; error is red)
  • Name + Working pill (when active)
  • Adapter · model · reasoning summary line, with an inline
    Unavailable chip + HoverCard reason when the adapter binary
    isn't on $PATH
  • Italic last-user-message preview (line-clamp-2, leading quote
    glyph) — same visual language as the rail
  • Footer: 'X ago' + state chip (Asleep / Attention) OR a Resume
    button (orange, with pulsing dot) when activeTurnId is non-null

Sort on the home grid is active-turn → recency. Pinning is NOT a
sort key here (and there's no pin indicator on the card) — pinning
belongs to the rail at /agents; the home page is action-oriented
and trusts active-turn + recency to surface the right agent.

Dead code removed:
  • useAgentDashboard.ts (96 lines, no callers; subscribed to the
    dead /claw/dashboard/stream from the OpenClaw-only era)
  • useAgentCardData.ts (the dashboard-merge shim; passed undefined
    every call so all enriched fields landed as undefined)
  • AgentCard.tsx (AgentCardExpanded replaced by HomeAgentCard;
    AgentCardCompact had no callers — the dock's compact mode was
    never used)
  • AgentCardData interface dropped from lib/agent-conversations/
    types.ts; the new card consumes HarnessAgent directly

Visual language stays continuous between rail and grid: same
<AgentTile>, same <LivenessDot>, same italic-quote message
preview, same orange Resume button with a pulsing dot.
2026-04-30 16:36:22 +05:30
Nikhil
26afb826c6 feat(eval): add viewer manifest contract (#878)
* refactor(eval): canonicalize viewer manifest contract

* refactor(eval): publish canonical viewer manifests

* feat(eval): make r2 viewer use manifest artifact paths

* fix(eval): keep weekly report compatible with viewer manifests

* docs(eval): document r2 viewer manifest contract

* chore: self-review fixes

* fix: address review feedback for PR #878
2026-04-29 20:50:35 -07:00
Nikhil
b2340c8afa refactor(eval): split orchestrated executor backends (#876)
* refactor(eval): split orchestrated executor backends

* fix(eval): address executor backend review comments
2026-04-29 18:02:32 -07:00
Felarof
790a270f47 Update README.md (#877) 2026-04-29 17:35:15 -07:00
Nikhil
84a79ba0a1 feat: refactor eval pipeline workflow (#875)
* feat(eval): add suite variant config bridge

* feat(eval): add stable run artifacts

* refactor(eval): add shared grader contract

* feat(eval): persist grader artifacts

* refactor(eval): rename runner layers

* refactor(eval): add executor backend boundary

* refactor(eval): split clado backend

* feat(eval): add workflow compatible cli

* feat(eval): add r2 publisher module

* ci(eval): migrate weekly workflow to eval cli

* docs(eval): document suite pipeline

* chore(eval): verify pipeline refactor

* fix: address review feedback for PR #875

* docs(eval): add env example

* docs(eval): explain suites and variants

* chore(eval): organize config layouts

* chore(eval): colocate grader python evaluators
2026-04-29 17:21:02 -07:00
Nikhil
6e3306f5e5 fix: make R2 uploads retryable (#874)
* fix: make R2 uploads retryable

* fix: address review feedback for PR #874
2026-04-29 16:43:33 -07:00
Nikhil
c244462b29 fix: use Node 24 GitHub actions (#872) 2026-04-29 15:31:23 -07:00
Nikhil
ebf97f74f6 fix: bound VM agent cache smoke test (#870)
* fix: bound VM agent cache smoke test

* fix: address review feedback for PR #870
2026-04-29 13:43:37 -07:00
Nikhil
561f2baf97 fix(eval): split AGISDK smoke and full configs (#871)
* fix(eval): split agisdk smoke and full configs

* fix(eval): default agisdk smoke to openrouter
2026-04-29 13:38:55 -07:00
shivammittal274
df0f45dd29 Feat: eval debug dev ci (#869)
* chore(eval): instrument server startup to root-cause dev CI health-check timeouts

Three diagnostics + one config swap to investigate why the eval-weekly
workflow has been failing on dev since 2026-04-25 with "Server health
check timed out" (every worker, every retry).

Background:
- Last successful weekly eval on dev: 2026-04-18 (sha f5a2b73)
- Since then, ~30 server commits landed including Lima/VM runtime,
  OpenClaw service, ACL system, ACP SDK — 108 server files changed,
  ~13K LOC added.
- Server process spawns cleanly in CI (PID logged) but never binds
  /health within the 30s eval-side timeout. Static analysis finds no
  obvious blocker; we need runtime evidence.

Changes:

1. apps/server/package.json — add `start:ci` script (no `--watch`).
   The default `start` uses `bun --watch` which forks a child process
   that watches every file in the import graph. Dev's graph is ~108
   files larger than main's; on a cold CI runner the watcher setup is a
   plausible source of multi-second startup overhead.

2. apps/eval/src/runner/browseros-app-manager.ts:
   - Use `start:ci` when `process.env.CI` is set (true on
     GitHub-hosted runners by default), else `start`.
   - Capture per-worker server stderr to /tmp/browseros-server-logs/
     instead of ignoring it. Without this we have no visibility into
     why the server is hung pre-/health.
   - Bump SERVER_HEALTH_TIMEOUT_MS 30s -> 90s. Dev's larger module
     graph may simply need more cold-start time on CI.

3. .github/workflows/eval-weekly.yml — upload the server logs dir as a
   workflow artifact (always, not just on success) so we can post-mortem
   any startup failure on the next run.

4. configs/agisdk-real-smoke.json — swap K2.5 from OpenRouter ->
   Fireworks (bypasses the OpenRouter per-key spend cap that has been
   eating recent runs) and drop num_workers 10 -> 4 (well below the
   Fireworks per-account TPM threshold that overwhelmed the original
   2026-04-23 run).

Plan: trigger the eval-weekly workflow on this branch with the agisdk
config and observe (a) whether it gets past server startup, and
(b) if it doesn't, what the captured server stderr says.

* fix(eval): capture stdout too — pino logger writes to stdout, not stderr

Previous diagnostic patch only redirected stderr; the captured per-worker
log files came back as 0 bytes because the server uses pino which writes
all log output to stdout (fd 1), not stderr (fd 2). Capture both into
the same file.

* fix(server): catch sync throw from OpenClaw constructor on Linux

The container runtime constructor in OpenClawService throws synchronously
on non-darwin platforms, e.g. GitHub Actions Linux runners. The existing
.catch() on tryAutoStart() only handles async throws inside auto-start —
the sync throw from configureOpenClawService(...) itself propagates up
through Application.start() and crashes the process via index.ts:48
(process.exit(EXIT_CODES.GENERAL_ERROR)).

This is what's been killing dev's eval-weekly CI: the server crashes in
milliseconds, the eval client polls /health, gets nothing, times out.

Fix: wrap the configureOpenClawService call in try/catch matching the
existing .catch() intent (best-effort, don't crash). Server continues
without OpenClaw on platforms where it can't initialize.

Verified by reading captured server stdout from run 25123195126:
  Failed to start server: error: browseros-vm currently supports macOS only
    at buildContainerRuntime (container-runtime-factory.ts:54:11)
    at new OpenClawService (openclaw-service.ts:652:15)
    at configureOpenClawService (openclaw-service.ts:1527:19)
    at start (main.ts:127:5)

* fix(server): defer OpenClaw chat client port lookup to request time

apps/server/src/api/server.ts:149 was calling getOpenClawService().getPort()
synchronously when constructing the OpenClawGatewayChatClient inside the
createHttpServer object literal. On non-darwin platforms this throws via
the OpenClawService constructor → buildContainerRuntime, escaping the
try/catch added in 5cf7b765 (which only protected the configureOpenClawService
call further down in main.ts).

Every other getOpenClawService() reference in server.ts is already wrapped
in an arrow function. This was the lone holdout. Make it lazy too: change
the chat client constructor to take getHostPort: () => number instead of
hostPort: number, evaluate it inside streamTurn at request time. Behavior
on darwin is unchanged.

This unblocks dev's eval-weekly CI on Linux runners where OpenClaw isn't
available — the chat endpoint isn't exercised by the eval, so a deferred
throw is acceptable.

* fix(server): allow Linux to skip OpenClaw via BROWSEROS_SKIP_OPENCLAW=1

Earlier surgical fixes (try/catch in main.ts, lazy chat client port) didn't
unblock dev's Linux CI — same throw kept reproducing. Whether this is bun
caching stale stack frames or a missed eager call site, the safer move is
to fix it at the root: make buildContainerRuntime never throw on Linux
when the runner has explicitly opted out.

Adds BROWSEROS_SKIP_OPENCLAW env check alongside the existing NODE_ENV=test
escape hatch in container-runtime-factory.ts. When set, returns the existing
UnsupportedPlatformTestRuntime stub — server boots normally, /health binds,
any actual OpenClaw API call still fails loudly at request time.

eval-weekly.yml sets the flag for the Linux runner. Darwin behavior and
non-CI Linux behavior unchanged (without the flag they still throw).

* feat(eval): align Clado action executor with new endpoint contract

David Shan shared the updated Clado BrowserOS Action Model spec.
Changes to match it:

- Bump endpoint URL + model id to the 000159-merged checkpoint
  (clado-ai--clado-browseros-action-000159-merged-actionmod-f4a6ef)
  in browseros-oe-clado-weekly.json and the README example.
- CLADO_REQUEST_TIMEOUT_MS 120s → 360s. Cold start can take ~5 min;
  the 2-min ceiling was failing every cold-start request.
- Treat HTTP 200 with action=null / parse_error as an INVALID step
  instead of aborting the executor loop. The model can self-correct
  on the next call. Cap consecutive parse failures at 3 to avoid
  infinite loops.
- Capture final_answer from end actions. Surface it in the observation
  back to the orchestrator so its task answer can use the model's
  declared result.
- Add macOS Cmd-* key mappings (M-a, M-c, M-v, M-x → Meta+A/C/V/X).
- Switch screenshot format from webp → png to match the documented
  "PNG or JPEG" contract.

* chore(eval): refresh test-clado-api script for new Clado contract

Updated the local smoke-test to match the new Clado endpoint and
response contract:

- New action + health URLs (000159-merged checkpoint).
- Drop the grounding-model branch (orchestrator-executor doesn't
  use it; the README David shared only documents the action model).
- Health-check waits up to 6 minutes for cold start with a 30s
  warning so the operator knows it's spinning up.
- Print every documented response field (action, x/y, text, key,
  direction, amount, drag start/end, time, final_answer, thinking,
  parse_error, inference_time_seconds).
- Three-step run that exercises a click, a typing continuation
  with formatted history, and an end+final_answer probe.

* chore(eval): point clado weekly config at agisdk-real

Switches the orchestrator-executor + Clado weekly config to run on the
AGI SDK / REAL Bench task set with the deterministic agisdk_state_diff
grader. Matches the orchestrator-executor smoke target (Fireworks K2.5
orchestrator + Clado action executor) we want to track week-over-week.

* chore(eval): run clado weekly headless

Default to headless so the weekly job (and local repros) don't pop ten
visible Chrome windows. Set headless=false locally if you need to watch
a worker.

* fix(eval): address Greptile P1+P2 on server log fd handling

P1: openSync was outside the mkdirSync try/catch, so a swallowed mkdir
failure (e.g. unwritable custom BROWSEROS_SERVER_LOG_DIR) would leave the
log directory missing and crash the server spawn with ENOENT. Move openSync
into the same try block; fall back to /dev/null so spawn always succeeds.

P2: the log fd was opened on every server start but never closed. Each
restart attempt leaked one fd across all workers — over a long eval run
that could exhaust the process fd limit. Track the fd on the manager and
closeSync it in killApp() right after the server process exits (the child's
dup keeps the file open until it exits, so we don't truncate output).
2026-04-30 01:33:49 +05:30
Nikhil
edfc5c751c fix: align OpenClaw gateway image with VM cache (#868)
* fix: load OpenClaw gateway image from VM cache

* fix: use container port for OpenClaw ACP bridge

* fix: address review feedback for PR #868
2026-04-29 12:11:00 -07:00
Nikhil
471256f31c fix: stop passing native permission flags to ACP adapters (#867) 2026-04-29 11:07:51 -07:00
Nikhil
4c90ca696b fix(agents): connect OpenClaw ACP inside gateway container (#866) 2026-04-29 11:07:29 -07:00
Nikhil
f2ac87d7c3 feat: show created agents in sidepanel (#865)
* feat(agent): list created agents in sidepanel target catalog

* feat(agent): show created agents in sidepanel selector

* feat(server): add sidepanel chat route for created agents

* feat(agent): route sidepanel agent sends by agent id

* chore(agent): retire virtual sidepanel acp targets

* fix: address review feedback for PR #865
2026-04-29 10:15:58 -07:00