BrowserOS

mirror of https://github.com/browseros-ai/BrowserOS.git synced 2026-05-18 11:06:19 +00:00

Author	SHA1	Message	Date
Nikhil Sonti	919a5e898b	chore: merge origin dev into patch cli fixes	2026-05-04 18:37:00 -07:00
Nikhil Sonti	70bd9533e6	fix: address review feedback for PR #941	2026-05-04 18:32:32 -07:00
Nikhil Sonti	860acd43c1	test: cover patch CLI checkout ergonomics	2026-05-04 18:18:45 -07:00
Nikhil Sonti	49e20e73b1	fix: add browseros-patch llm quick reference	2026-05-04 18:17:47 -07:00
Nikhil Sonti	f08435a05e	fix: add browseros-patch help examples	2026-05-04 18:16:23 -07:00
Nikhil Sonti	ccc39590c0	fix: clarify browseros-patch checkout terminology	2026-05-04 18:14:55 -07:00
Nikhil Sonti	b603aeb953	fix: make checkout detection errors actionable	2026-05-04 18:13:13 -07:00
Nikhil	eed158eca0	fix(patch): handle canonical workspace paths (#940)	2026-05-04 18:09:51 -07:00
Nikhil	d61d6fc8a9	feat: add ACPX agent runtime adapters (#924) * feat: add acpx claude runtime paths * feat: add acpx adapter preparation * refactor: use acpx adapter preparation * refactor: move openclaw image turns to adapter * fix: keep openclaw independent of host cwd * fix: address acpx review feedback * fix: preserve claude host auth in acpx	2026-05-04 11:04:24 -07:00
shivammittal274	d383b5e344	feat(eval): add claude-generated run report artifact (#892) * feat(eval): add claude-generated run report artifact * fix(eval): install claude code cli for CI evals * fix(eval): bypass claude code tool permissions * Eval metrics configs (#932) * feat(eval): add agisdk comparison metrics configs * fix(eval): keep cdp crashes from aborting run	2026-05-04 21:09:06 +05:30
Dani Akash	ce4bb44083	feat(agent): /home composer parity with image attachments (#930 ) * feat(agent): /home composer parity with image attachments The /home composer used the same ConversationInput component as the chat screen but passed attachmentsEnabled={false}, and the home → chat handoff was a URL search param `?q=<text>` that physically can't carry binary attachments. Pasting a screenshot at /home did nothing. Add a small in-memory registry (pending-initial-message.ts) as the rich-data side channel for the same navigation: the home composer writes { agentId, text, attachments } there before navigating; the chat screen consumes it on mount and replays through the existing harness send() path that already supports attachments. URL `?q=` stays for shareable text-only prompts; the registry wins when both are present. Module-scope, 10s TTL, destructive consume. Net: home is now flagged attachmentsEnabled={true}; users can paste, drag, or pick image files at /home and they survive the navigation into the chat screen with previews intact. * docs(agent): clarify why initial-message ref reset is safe post-registry-fire	2026-05-04 18:02:31 +05:30
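The side channel described in commit ce4bb44083 (module scope, 10s TTL, destructive consume) could look roughly like the sketch below. This is an illustration only: the function names, the `Attachment` shape, and the single-slot layout are assumptions, not the contents of the real pending-initial-message.ts.

```typescript
// Hypothetical sketch of a module-scope registry with a TTL and destructive consume.
// All names and shapes here are assumptions made for illustration.
const TTL_MS = 10_000; // 10s TTL, as stated in the commit message

interface Attachment {
  name: string;
  bytes: Uint8Array;
}

interface PendingInitialMessage {
  agentId: string;
  text: string;
  attachments: Attachment[];
}

// Module-scope slot: holds at most one pending message.
let pending: { message: PendingInitialMessage; expiresAt: number } | null = null;

// Called by the home composer right before navigating to the chat screen.
function setPendingInitialMessage(message: PendingInitialMessage, now: number = Date.now()): void {
  pending = { message, expiresAt: now + TTL_MS };
}

// Called by the chat screen on mount. Returns the message at most once.
function consumePendingInitialMessage(agentId: string, now: number = Date.now()): PendingInitialMessage | null {
  const entry = pending;
  if (entry === null) return null;
  if (now >= entry.expiresAt) {
    pending = null; // expired entries are dropped, never replayed
    return null;
  }
  if (entry.message.agentId !== agentId) return null; // belongs to another agent
  pending = null; // destructive consume: a second read finds nothing
  return entry.message;
}
```

The `now` parameter is only there so the TTL can be exercised without real timers.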
Nikhil	0d56815cba	fix: store server database under BrowserOS dir (#923) * fix: store server database under browseros dir * fix: address PR review feedback for 923	2026-05-02 16:03:41 -07:00
Nikhil	c07d3d95d4	feat: add sqlite drizzle persistence (#919) * feat: add drizzle agent schema * feat: run sqlite drizzle migrations * refactor: remove old sql identity dependency * feat: store harness agents in sqlite * build: package db migrations * refactor: remove sqlite oauth token store * feat: restore oauth token storage * fix: handle empty install id * chore: ignore server runtime state * fix: address review feedback for PR 919	2026-05-02 15:19:57 -07:00
Nikhil	32530ec418	fix: default extract base to BASE_COMMIT (#922) * fix: default extract base to BASE_COMMIT * fix: address review feedback for PR #922	2026-05-02 15:12:17 -07:00
Nikhil	e7105ae50b	fix: improve browseros-patch workspace feedback (#921) * fix: make patch list registry-only * feat: add patch command progress logs * fix: address review feedback for PR #921	2026-05-02 15:09:31 -07:00
Nikhil	1d42a973ea	refactor: extract acpx runtime templates (#918)	2026-05-02 14:03:15 -07:00
Nikhil	921a797c5b	feat: add ACPX agent soul and memory support (#917) * feat: add acpx agent runtime context helpers * feat: add acpx runtime state store * feat: prepare acpx agent runtime context * feat: inject acpx agent command environment * feat: forward acpx agent chat cwd * fix: normalize acpx session record fallback * feat: improve acpx agent soul and memory prompts * fix: address PR review comments for memory-soul-acp * fix: satisfy acpx runtime deepscan checks	2026-05-02 13:45:40 -07:00
Nikhil	d94597bbf9	fix(agent): add CLI model catalog entries (#915) * fix(agent): add CLI model catalog entries * fix: address PR review comments for acpx-models	2026-05-02 13:06:41 -07:00
github-actions[bot]	ecc6bac070	chore: sync internal-docs submodule (#911) Co-authored-by: browseros-bot <bot@browseros.ai>	2026-05-01 20:16:26 +00:00
Dani Akash	84e2739663	feat(agent): rich rail + header on /agents/:agentId chat (#908 ) * feat(agent): rich rail + header on /agents/:agentId chat Replace the chat screen's legacy AgentEntry rail and binary READY header with the same rich data the /agents page already exposes: adapter glyph, liveness dot, pin star, status badge, adapter · model · reasoning chip line, last-used time, lifetime tokens, queue count, and the Adapter Unavailable warning. Source of truth flips from the merged AgentEntry list to useHarnessAgents() directly. Sort order matches /agents (pinned → recency) — not /home (active-first → recency) — because chat is index-shaped and shuffling rows every 5s as turns transition would be jarring while reading. Lift the inline pin-then-recency comparator out of /agents AgentList.tsx into a shared agents-list-order.ts so both surfaces stay on identical sort semantics. * fix(agent): chat header height + composer sticking to bottom Header was clipping descenders because the strip was vertical-content sized at min-h-14 with tight py-2.5; bump padding and lean on natural content height. Drop the AgentTile glyph (the rail row already shows adapter identity) and the cwd path (too long, pushed the meta line off-screen). Header is now name + pin star + status pill, then adapter · model · reasoning, then last-used · tokens · queued. Composer was floating mid-screen on short chats because the chat grid had no grid-template-rows — the implicit auto row collapsed to content height, so the right-column flex wrapper never received the full container height. Add grid-rows-[minmax(0,1fr)] so the single row claims 100% and ClawChat's flex-1 expands to push the composer flush to the bottom. * fix(agent): composer flush to bottom on short chats Match the sidepanel chat's nested-flex pattern. The right-column wrapper got h-full so it expands to the grid row; the conversation controller's root added flex-1 so ClawChat's existing flex-1 has something to actually fill against. 
Without these, the grid cell stretched but the inner flex columns shrank to content height, leaving the composer floating mid-screen. * fix(agent): align rail header with chat header in shared top band Pull the rail's "Agents" + back-button into the same horizontal strip as the agent identity header. The two halves now sit on a single row that spans both columns, so they can't drift in height as the chat header gains/loses meta lines (last-used, tokens, queued). The rail below the band keeps its scrollable list only; the chat column below holds the conversation + composer. Border-bottom moves from ConversationHeader to the band wrapper so we don't get a double-rule on the boundary. * fix(agent): reserve header height to prevent layout shift on data load The chat header grew from a single line to three lines once the useHarnessAgents() poll resolved (adapter chips + meta line populate asynchronously), shoving the rail and conversation body downward. Lock min-h-[84px] on both the band's left "Agents" cell and the ConversationHeader root, and always render the meta line slot (non-breaking space when empty) so the typographic frame is stable regardless of data state. * refactor(agent): pull status pill + meta to right side of chat header Two-column header layout instead of three stacked rows: name + pin star + adapter chips on the left, status pill stacked on top of the last-used / tokens / queued meta line on the right. Drops min-h from 84px → 60px so the band reclaims ~24px of vertical space and the chat body starts higher on screen. Band's left "Agents" cell matches the new height.	2026-05-01 20:19:16 +05:30
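The shared "pinned → recency" ordering that commit 84e2739663 lifts into agents-list-order.ts can be sketched as a plain comparator. The `AgentRow` shape and the `lastUsedAt` field name are assumptions made for this example; only the ordering rule comes from the commit message.

```typescript
// Hypothetical row shape: only the two sort keys matter for this sketch.
interface AgentRow {
  id: string;
  pinned: boolean;
  lastUsedAt: number | null; // epoch millis; null when the agent was never used
}

// Pinned rows first, then most recently used first.
function compareAgentsForList(a: AgentRow, b: AgentRow): number {
  if (a.pinned !== b.pinned) return a.pinned ? -1 : 1;
  return (b.lastUsedAt ?? 0) - (a.lastUsedAt ?? 0);
}

// Sort a copy so callers can keep the original listing order.
function sortAgentsForList(rows: AgentRow[]): AgentRow[] {
  return [...rows].sort(compareAgentsForList);
}
```

Keeping the comparator in one module is what lets two screens stay on identical sort semantics, which is the stated goal of the refactor.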
Dani Akash	974e7e9b86	fix(agents): hide BrowserOS ACP envelope from chat history payloads (TKT-774) (#907) * fix(agents): hide BrowserOS ACP envelope from chat history payloads (TKT-774) The user-message text persisted on the wire carried two nested envelopes — the outer `<role>You are BrowserOS…</role>` + `<user_request>…</user_request>` block from buildBrowserosAcpPrompt and the inner `## Browser Context` + `<selected_text>` + `<USER_QUERY>` block from formatUserMessage. PR #856 had unwrapped only the outer envelope on history reads, so the user bubble in the agent rail still rendered the inner envelope, and the LLM chat-service path leaked the wrapper all the way back to the sidepanel client through AI SDK's stream sync. Two surgical fixes, both server-only: 1) ACP path (acpx-runtime.ts) — replace unwrapBrowserosAcpPrompt with a comprehensive unwrapBrowserosAcpUserMessage that strips both layers and decodes the &lt;/&gt;/&amp; escapes the server applied via escapePromptTagText. Each step is independently defensive (anchors that don't match are skipped) so the helper is idempotent and tolerates partial / older / future-shape envelopes. Applied in userContentToText (history mapper) and inherited by extractLastUserMessage (listing's lastUserMessage). 2) LLM chat path (chat-service.ts) — split the persisted user message from the prompt-time copy. session.agent.appendUserMessage now stores the raw user text; a transient promptUiMessages array is built with the wrapped (formatUserMessage + context-change prefix) form and passed to createAgentUIStreamResponse for the model. onFinish restores the raw form before persisting, so the user-visible message and any future history reads see only the user's typed text. 
Tests: - acpx-runtime.test.ts: new dedicated unwrapBrowserosAcpUserMessage suite covering fully-wrapped messages, only-outer / only-inner inputs, selected_text blocks with attribute strings, idempotency, literal user-typed angle-bracket round-trip, and an integration test that round-trips the real formatUserMessage output through the unwrap to pin the writer/reader contract. - chat-service.test.ts: existing 'rebuilds a managed-app session' test updated for the new behaviour — asserts the persisted user message is the raw text and the prompt copy passed to the agent carries the Klavis context-change notice. * fix(agents): decode entity escapes before stripping inner envelope (TKT-774) The unwrap was running its inner-envelope strips against the literal-tag form (<USER_QUERY>, <selected_text>) but the persisted payload has those tags entity-escaped (&lt;USER_QUERY&gt;, &lt;selected_text&gt;) — buildBrowserosAcpPrompt runs escapePromptTagText over the entire formatUserMessage payload before adding the outer <role>+<user_request> envelope, so the inner anchors never matched against the on-disk text and the user was still seeing &lt;USER_QUERY&gt; in /agents/:id/sessions/main/history responses. Reorder unwrapBrowserosAcpUserMessage to: outer-strip → decode entities → inner-strips. Test fixtures updated to reflect the actual on-wire form (escaped inner tags); the round-trip test duplicates the escape rule inline so the contract between buildBrowserosAcpPrompt and the unwrap is pinned end-to-end.	2026-05-01 19:42:48 +05:30
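The ordering fix in commit 974e7e9b86 (outer-strip → decode entities → inner-strips) can be illustrated with a small sketch. The regexes, the `escapeTags` stand-in, and the exact envelope layout below are assumptions for the example; the real unwrapBrowserosAcpUserMessage and escapePromptTagText may differ in every detail except the order of the three steps.

```typescript
// Assumed stand-in for the server-side escape rule: & first, then < and >.
function escapeTags(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Inverse of escapeTags. "&amp;" is decoded last so "&amp;lt;" becomes "&lt;", not "<".
function decodeEntities(s: string): string {
  return s.replace(/&lt;/g, "<").replace(/&gt;/g, ">").replace(/&amp;/g, "&");
}

// Hypothetical unwrap: each step is skipped when its anchor does not match.
function unwrapUserMessage(raw: string): string {
  let text = raw;
  // 1) Outer strip: keep only the body of <user_request>…</user_request>.
  const outer = /<user_request>([\s\S]*?)<\/user_request>/.exec(text);
  if (outer !== null) {
    // 2) Decode entities, only for text that actually came out of the escaped envelope.
    text = decodeEntities(outer[1]);
  }
  // 3) Inner strip: runs after decoding, so the literal tags can match.
  const inner = /<USER_QUERY>([\s\S]*)<\/USER_QUERY>/.exec(text);
  if (inner !== null) text = inner[1];
  return text.trim();
}

// Usage: wrap the way the commit describes (escaped inner payload, literal outer tags).
const innerPayload = "## Browser Context\nTab: example\n<USER_QUERY>\nif a < b && c > d\n</USER_QUERY>";
const wrappedMessage = `<role>You are BrowserOS</role>\n<user_request>\n${escapeTags(innerPayload)}\n</user_request>`;
const unwrappedMessage = unwrapUserMessage(wrappedMessage);
```

Running the inner strip before decoding would search for `<USER_QUERY>` in text that only contains `&lt;USER_QUERY&gt;`, which is the failure the second commit in this PR describes.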
github-actions[bot]	19e07c086f	chore: sync internal-docs submodule (#903) Co-authored-by: browseros-bot <bot@browseros.ai>	2026-05-01 08:36:41 +00:00
Nikhil	ab354d7dd7	fix(ci): restore PAT on actions/checkout for submodule fetch (#898) Without a token on actions/checkout, the action falls back to GITHUB_TOKEN, which has no access to the private internal-docs repo. Submodule clone fails with "repository not found". PAT is back on checkout. PR ops still use GITHUB_TOKEN via the GH_TOKEN env var on the run step. The bot-branch git push uses the credential helper set up by checkout (the PAT, which has Contents: Read and write).	2026-04-30 16:23:58 -07:00
Nikhil	0e779fa344	fix(ci): switch internal-docs sync to PR + auto-merge (#897) Direct push to dev fails the dev ruleset's "Require pull request" rule. Open a tiny PR from a bot branch and enable auto-merge (squash, 0 approvals required) instead. No bypass actor needed — the rule stays strict for everyone, including the bot. PR ops use GITHUB_TOKEN with explicit pull-requests: write permission. The cross-repo PAT is only used to rewrite the SSH submodule URL so internal-docs can be cloned over HTTPS.	2026-04-30 16:17:15 -07:00
Nikhil	dfbce48994	feat: remove CLI auto init discovery (#896) * feat: remove CLI auto init discovery * fix: address review feedback for PR #896	2026-04-30 16:03:47 -07:00
Nikhil	7c942e91ce	chore: add internal-docs submodule (#895) Mounts browseros-ai/internal-docs at .internal-docs/, tracking main. This activates the /document-internal and /ask-internal skills (which early-exit if the submodule is missing) and lets the sync-internal-docs workflow start bumping the pointer on its 4-hourly schedule. Team members: after this lands, run once from a fresh dev pull: git submodule update --init .internal-docs	2026-04-30 15:13:41 -07:00
Nikhil	1ff92c44b3	feat(internal-docs): scaffold private docs submodule, skills, sync action (#894 ) * feat(internal-docs): scaffold private docs submodule, skills, sync action Adds the OSS-side scaffolding for the internal-docs system: - /document-internal skill — drafts a 1-page feature/architecture/design doc from the current branch's diff, asks four sharp questions, enforces voice rules (no em dashes, banned filler words, 60-line cap on feature notes), then opens a PR to browseros-ai/internal-docs via a tmp clone. - /ask-internal skill — answers team-internal questions by greping internal-docs and the codebase, synthesizing with file:line citations, optionally executing surfaced commands with per-command confirmation, and drafting a new doc + PR if grep returns nothing useful. - .github/workflows/sync-internal-docs.yml — every 4 hours, bumps the submodule pointer on dev directly (no PR; relies on dev branch protection blocking force-push). Skips silently until the submodule is configured. Uses url.insteadOf to rewrite the SSH submodule URL to HTTPS-with-token for the bot, while keeping SSH the local default. - .claude/skills/document-internal/seeds/ — root README and three templates (feature-note, architecture-note, design-spec) ready to copy into the new internal-docs repo on rollout. Design spec: .llm/superpowers/specs/2026-04-30-internal-docs-submodule-design.md Manual prereqs (NOT in this PR — handled out-of-band): 1. Create private repo browseros-ai/internal-docs with branch protection on main. 2. Seed it with the contents of .claude/skills/document-internal/seeds/. 3. Create a bot account, mark as bypass actor on dev branch protection. 4. Add INTERNAL_DOCS_SYNC_TOKEN secret with repo + read access to internal-docs. 5. Once internal-docs exists, on a follow-up branch: git submodule add -b main git@github.com:browseros-ai/internal-docs.git .internal-docs 6. 
Send the team the one-time init snippet for their existing checkouts: git submodule update --init .internal-docs * fix(internal-docs): address Greptile review feedback - Workflow: rebase onto dev before push to handle non-fast-forward race; bump fetch-depth 1->50 so rebase has merge-base history. - Workflow: move INTERNAL_DOCS_SYNC_TOKEN into step env: per Actions credential-injection pattern, instead of inlining in the script body. - Skill (BASE bug): suppress git rev-parse stdout so SHA does not get captured into BASE alongside the literal 'dev'. Was breaking every downstream git log/diff call. - Skill (tmp clone): trap 'rm -rf "$TMP"' EXIT after mktemp so cleanup always runs, even if any subsequent step fails.	2026-04-30 15:04:08 -07:00
shivammittal274	c81906ecbf	feat(eval): add claude code eval agent (#885)	2026-05-01 02:25:08 +05:30
Nikhil	ffc0f09c86	feat(dev): add target-aware reset cleanup (#893) * feat(dev): add target-aware reset cleanup * fix(dev): address cleanup reset review comments	2026-04-30 13:34:52 -07:00
Nikhil	7fb53c9921	feat(dev): bootstrap setup from dev watch (#891) * feat(dev): bootstrap setup from dev watch * fix: address review feedback for PR #891	2026-04-30 13:00:46 -07:00
Nikhil	d38b01a8c7	feat(dev): add guided cleanup and reset commands (#890) * feat(dev): add guided cleanup and reset commands * fix: address cleanup reset review feedback	2026-04-30 12:27:15 -07:00
Nikhil	ff36c8412b	fix(dev): use run lock for watch cleanup (#889) * fix(dev): use run lock for watch cleanup * fix(dev): address watch lock review comments	2026-04-30 11:46:17 -07:00
Nikhil	fd5aba249b	fix: stabilize OpenClaw gateway startup (#888) * feat(server): add shared process lock helper * feat(container): add container name reconciliation helpers * feat(openclaw): serialize lifecycle across processes * fix(openclaw): reconcile fixed gateway container startup * test(openclaw): cover lifecycle race recovery * fix(server): satisfy process lock error override * fix(openclaw): address review feedback * test(openclaw): align serialization mock with image check	2026-04-30 11:31:40 -07:00
Nikhil	492f3fcdf2	feat(openclaw): prewarm ghcr image in vm (#887) * feat(openclaw): add gateway image inspection * feat(openclaw): pull gateway image from registry * refactor(vm): decouple readiness from image cache * refactor(openclaw): remove vm cache from runtime factory * feat(openclaw): detect current gateway image * feat(openclaw): prewarm vm runtime and reuse current gateway * feat(openclaw): prewarm runtime on server startup * refactor(vm): remove browseros image cache runtime * refactor(build-tools): remove openclaw tarball pipeline * chore: self-review fixes * fix(openclaw): suppress prewarm pull progress logs * fix(openclaw): address review feedback * fix(openclaw): resolve review findings * fix(dev): stop stale watch supervisors	2026-04-30 11:18:11 -07:00
Nikhil	cb0c0dd0c1	chore: simplify root test scripts (#886) * chore: simplify root test scripts * fix: avoid chained root test scripts * fix: update test workflow commands * fix: move app test commands into packages	2026-04-30 10:58:08 -07:00
Dani Akash	8712f89f18	feat(agents): durable per-agent chat message queue + composer Stop (#880 ) * feat(agents): durable per-agent chat message queue + composer Stop button * fix(agents): tighten queue UI — smaller Stop, drop empty indicator, live drain attach User feedback round 1 on the message-queue UX: 1) The Stop button matched the send/voice mics at h-10 w-10 with a solid destructive fill, which read as alarming. Shrunk to h-8 w-8, ghost variant with a soft destructive/10 background, smaller filled square glyph. Reads as a calm 'stop' affordance instead of a panic button. 2) The QueueItem's leading <QueueItemIndicator> dot was decorative only — no state, no interaction. Dropped it from QueuePanel along with the import; queue items now render as a clean preview line with the trailing X remove action. 3) When the server drained the queue and started the next turn, the chat panel didn't pick up the live stream until the user navigated away and back. The hook's resume effect previously only fired on agent change, not on listing-observed activeTurnId change. Surface activeTurnId from useHarnessAgents into useAgentConversation; effect now re-runs when the id changes, calls /chat/active, and attaches to the new turn — so a queued message starts streaming the moment the server drain pops it. * fix(agents): don't reset streaming state from the resume effect's no-op paths The Stop button was disappearing while the agent was actively streaming, even though events were still flowing into the chat. Root cause: the resume effect's `finally` block reset `streaming`, `turnIdRef`, and `lastSeqRef` unconditionally — including on the early-return paths (no active turn, or another mechanism already owns the stream). Sequence that triggered it: 1) User sends a message → send() sets streamAbortRef + streaming=true and starts consuming the SSE. 2) User enqueues another message → enqueue mutation invalidates the listing query. 
3) Listing refetches with the live activeTurnId → the resume effect re-fires (deps include activeTurnIdDep). 4) attemptResume hits `if (streamAbortRef.current) return` because send() owns it. 5) The finally clause fires anyway and calls setStreaming(false), clobbering the live state set by send(). The SSE consumer keeps running (refs are intact) so text keeps streaming, but the React flag is wrong, so the Stop button gates off. Fix: track whether this run actually started a stream (`weStartedStream`). The finally only resets state when it does. Early-return / no-active-turn paths now leave streaming/turnIdRef/ lastSeqRef alone for whoever does own them. Also widens the Stop button's visibility (`canStop` prop on ConversationInput) so it stays steady across the brief gap between turns when a queue drain is mid-flight; the parent computes `streaming || activeTurnId !== null || queue.length > 0`. The visibility widening is independent of the streaming-state fix above — both are now in place. * revert: drop canStop widening — Stop only shows while streaming Reverts the canStop prop on ConversationInput and the OR-with-queue visibility from AgentCommandConversation. Stop is gated solely on `streaming` again. Between turns (queue draining) the button stays hidden — only the actively-streaming turn is interruptible from the composer, which matches what the user actually expects. * fix(agents): persist the kicking-off prompt on active turns so the resume placeholder isn't empty When a queued message drained and started a new turn, the chat panel's resume effect staged a placeholder turn with userText: '' because the hook had no way to know what message kicked off the turn — only the agent-side stream was visible, and the user bubble above it was blank until the user navigated away and back (at which point the session record's history loaded normally). 
Fix: ActiveTurnRegistry.register now accepts an optional `prompt` that's stashed on the turn and surfaced via describe() / the ActiveTurnInfo response. AgentHarnessService.startTurn passes the incoming message into register. /chat/active returns it. The chat hook's resume effect uses active.prompt as the placeholder turn's userText, so the user bubble shows the queued message text the moment streaming begins. Falls back to '' for older clients that haven't been refetched yet. * fix(agents): always release streamAbortRef on resume cleanup, even when cancelled Greptile P1 follow-up. The previous `weStartedStream` guard correctly stopped the resume effect's no-op early-returns from clobbering an in-flight `send()` stream — but it also stopped a cancelled mid-stream resume from clearing its own `streamAbortRef`. When the cleanup fires (e.g. the 5s listing poll captures a new queue-drain turn id while the SSE for the prior turn is still finishing), the next effect run hits the `if (streamAbortRef.current) return` guard against the now-aborted controller and never reattaches, leaving `streaming === true` with no live stream until the user navigates away. Split the finally block: always release `streamAbortRef` when we owned the controller (so the next run can take over), but only reset the streaming flag / turn id / lastSeq on a clean exit (the new run will set those itself, so resetting on cancel would just flicker).	2026-04-30 18:26:56 +05:30
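The cleanup rules that commit 8712f89f18 ends up with (no-op paths leave state alone, an owned controller is always released, the streaming flag is reset only on a clean exit) can be modelled without React. The sketch below is a synchronous simplification under stated assumptions: `StreamHandle` stands in for the AbortController held in `streamAbortRef`, `ConversationState` stands in for the hook's refs and state, and the real effect is asynchronous.

```typescript
// Hypothetical stand-ins for the hook's refs; the names are assumptions.
interface StreamHandle {
  aborted: boolean; // set to true by the effect cleanup when the run is cancelled
}

interface ConversationState {
  streaming: boolean;
  turnId: string | null;
  streamHandle: StreamHandle | null; // non-null while some caller owns the live stream
}

function runResume(
  state: ConversationState,
  activeTurnId: string | null,
  consume: (handle: StreamHandle) => void,
): void {
  // No-op paths: nothing to attach to, or another caller (e.g. send()) owns the stream.
  if (activeTurnId === null) return;
  if (state.streamHandle !== null) return;

  // From here on, this run owns the stream.
  const handle: StreamHandle = { aborted: false };
  state.streamHandle = handle;
  state.streaming = true;
  state.turnId = activeTurnId;
  try {
    consume(handle);
  } finally {
    // Always release the handle we own so the next run can take over.
    if (state.streamHandle === handle) state.streamHandle = null;
    // Reset the visible state only on a clean exit; after a cancel the next run sets it.
    if (!handle.aborted) {
      state.streaming = false;
      state.turnId = null;
    }
  }
}
```

Because the early returns happen before the `try`, the `finally` block can never touch state that this run did not set, which plays the role of the `weStartedStream` guard from the earlier fix.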
Dani Akash	ba60bf466f	feat(agents): rich command-center rows + home grid + dead-code sweep (#879 ) * feat(agents): rich-info command center rows + pin/PATCH/adapter-health backbone Splits AgentRowCard from a 271-line monolith into a shallow tree of single-responsibility sub-components under `agent-row/`: AgentTile, AdapterHealthDot, PinToggle, AgentTitleRow, AgentSparkline, AgentSummaryChips, AgentLastMessage, CwdChip, AgentTokenSummary, AgentMetaRow, AgentErrorPanel, AgentActions Adds the data each row consumes: - pinned: boolean field on AgentDefinition + FileAgentStore.update + new PATCH /agents/:id route. useUpdateHarnessAgent mutation optimistically updates the listing cache so the star flips instantly; rolls back on error. - Listing payload extended with lastUserMessage, cwd, tokens (cumulative + last7d shape — last7d zero-filled until the activity ledger lands), turnsByDay/failedByDay (zero-filled), lastError/lastErrorAt, activeTurnId. AcpxRuntime grows a getRowSnapshot() that reads cwd + cumulative tokens + last user message from the session record in one pass. - Adapter health: in-memory AdapterHealthChecker probes `claude --version` / `codex --version` with a 2s timeout and caches results for 5 min. /adapters response carries { healthy, reason?, checkedAt }. Tile-corner dot exposes the state via HoverCard; openclaw inherits health from the gateway snapshot already on the page. Sub-components are pure: card itself owns no state. Sort order becomes pinned-first, then recency. HoverCard is the workhorse for keeping rows compact while exposing depth (full message, token breakdown, daily turn list, error stack, adapter reason). * refactor(agents): tighten command-center row design + cut redundant affordances User feedback round 1: 1) Two green dots on the tile (health + liveness) was confusing. 
Health moves out of the tile entirely and surfaces as an inline 'Unavailable' chip in the model line — silent when the adapter is healthy, with a warning amber chip + HoverCard reason when not. The tile now shows one signal: liveness. 2) The last-user-message HoverCard wasn't telegraphing intent. Drop the HoverCard. The line is informational, italic, with a leading quote glyph so the row reads like a conversation snippet. To see the full message the user opens the chat (which is the action they want next anyway). 3) Resume + Chat were duplicate CTAs. Single primary action per row: Resume (filled, accent-orange, with a pulsing dot) replaces Chat when there's an active turn. Both navigate to /agents/:id but the row tells the user which action they're taking. 4) Tokens weren't visible because the row gated on last7d.requestCount, which is zero until the activity ledger ships. Switch to lifetime tokens (which we have today). Drop the '7d stats:' framing — talking about a window we can't compute would be misleading. The HoverCard surfaces input/output split + a footnote that per-window stats land in a follow-up. 5) CWD was rendering the server's own running directory, which is meaningless to users. Hide it from the row entirely. The cwd field still rides in the listing payload for future surfaces (chat panel, debug view) — only the row stops rendering it. Aesthetic refinements while we're here: - Whole card carries state, not just the tile: working rows get an accent-orange tinted border with a soft glow, error rows tint destructive, idle rows lift on hover. - Pin star fades in on hover (group-hover) when unpinned and stays solid amber when pinned — keeps the rail calm by default. - Tabular-nums on token figures so columns visually align across rows. - Drop CwdChip and AdapterHealthDot files: no callers left. 
* fix(agents): align row title flush-left whether pinned or not Pin star moved from leading the title to trailing the badges, and hidden from layout entirely (`hidden group-hover:inline-flex`) when unpinned. The previous `opacity-0` rule kept the star reserving its `size-6` slot, which left every unpinned title indented relative to the model / preview / meta lines underneath it. Title now flushes left in both states; pinned star stays solid amber so the signal isn't hidden, and unpinned reveals an outline star on row hover for the toggle affordance. * fix(agents): keep pin-toggle slot reserved so row height is constant Switching the unpinned star from `hidden group-hover:inline-flex` to `opacity-0 group-hover:opacity-100`. The hidden/show variant was collapsing the title row's height when the star wasn't rendered, which made every card below visibly shift on hover. Always rendering the button (with opacity-only visibility) keeps the row's vertical metrics constant; the title still flushes left because the slot is trailing, not leading. Card hover effect (-translate-y + shadow-md) restored — the layout shift wasn't coming from the card hover; it was the pin slot appearing and disappearing. * fix(agents): quieten row hover — border-tint only, no lift, no shadow Drop the `-translate-y-px` and `hover:shadow-md` from the row card plus the working-state inner ring. The translate + shadow grow combination was visibly noisy as the cursor moved through the rail — each row 'lifted' as you passed over it. Hover now just tints the border in accent-orange/30; working and error states keep their distinct border colours but no inner ring. Card height and shadow stay constant in every state, so the rail reads as a calm vertical list of cards. * feat(home): rich Recent Agents grid + dead-code sweep The /home Recent Agents grid was a placeholder shell. 
Every 'rich' field on the card (lastMessage, lastMessageTimestamp, activitySummary, currentTool, costUsd) was wired to undefined because AgentCommandHome called `buildAgentCardData(agents, status?.status, undefined)` — the dashboard arg has been hard-coded undefined since the harness migration. Repointing the grid at `useHarnessAgents` + `useAgentAdapters` gives every card the same enriched data the rail uses. What the new card shows per agent: • Adapter glyph tile + liveness dot (working pulses; asleep is hollow; error is red) • Name + Working pill (when active) • Adapter · model · reasoning summary line, with an inline Unavailable chip + HoverCard reason when the adapter binary isn't on $PATH • Italic last-user-message preview (line-clamp-2, leading quote glyph) — same visual language as the rail • Footer: 'X ago' + state chip (Asleep / Attention) OR a Resume button (orange, with pulsing dot) when activeTurnId is non-null Sort on the home grid is active-turn → recency. Pinning is NOT a sort key here (and there's no pin indicator on the card) — pinning belongs to the rail at /agents; the home page is action-oriented and trusts active-turn + recency to surface the right agent. Dead code removed: • useAgentDashboard.ts (96 lines, no callers; subscribed to the dead /claw/dashboard/stream from the OpenClaw-only era) • useAgentCardData.ts (the dashboard-merge shim; passed undefined every call so all enriched fields landed as undefined) • AgentCard.tsx (AgentCardExpanded replaced by HomeAgentCard; AgentCardCompact had no callers — the dock's compact mode was never used) • AgentCardData interface dropped from lib/agent-conversations/ types.ts; the new card consumes HarnessAgent directly Visual language stays continuous between rail and grid: same <AgentTile>, same <LivenessDot>, same italic-quote message preview, same orange Resume button with a pulsing dot.	2026-04-30 16:36:22 +05:30
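The adapter health caching described in commit ba60bf466f (results cached for 5 minutes, payload shaped `{ healthy, reason?, checkedAt }`) could be sketched as a small TTL cache. The probe is injected and synchronous here so the caching logic stays testable; `createHealthChecker` and the `Probe` type are assumptions, and the real AdapterHealthChecker runs `claude --version` / `codex --version` with a 2s timeout, which this sketch does not attempt.

```typescript
const CACHE_TTL_MS = 5 * 60 * 1000; // results are reused for 5 minutes

interface AdapterHealth {
  healthy: boolean;
  reason?: string;
  checkedAt: number;
}

// Hypothetical probe signature; a real probe would spawn the adapter binary.
type Probe = (adapter: string) => { healthy: boolean; reason?: string };

function createHealthChecker(probe: Probe, now: () => number = () => Date.now()) {
  const cache = new Map<string, AdapterHealth>();
  return function check(adapter: string): AdapterHealth {
    const t = now();
    const cached = cache.get(adapter);
    // Serve the cached result while it is younger than the TTL.
    if (cached !== undefined && t - cached.checkedAt < CACHE_TTL_MS) return cached;
    const result: AdapterHealth = { ...probe(adapter), checkedAt: t };
    cache.set(adapter, result);
    return result;
  };
}
```

Injecting `now` keeps the expiry check deterministic in tests; production code would rely on the default clock.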
Nikhil	26afb826c6	feat(eval): add viewer manifest contract (#878) * refactor(eval): canonicalize viewer manifest contract * refactor(eval): publish canonical viewer manifests * feat(eval): make r2 viewer use manifest artifact paths * fix(eval): keep weekly report compatible with viewer manifests * docs(eval): document r2 viewer manifest contract * chore: self-review fixes * fix: address review feedback for PR #878	2026-04-29 20:50:35 -07:00
Nikhil	b2340c8afa	refactor(eval): split orchestrated executor backends (#876) * refactor(eval): split orchestrated executor backends * fix(eval): address executor backend review comments	2026-04-29 18:02:32 -07:00
Felarof	790a270f47	Update README.md (#877)	2026-04-29 17:35:15 -07:00
Nikhil	84a79ba0a1	feat: refactor eval pipeline workflow (#875) * feat(eval): add suite variant config bridge * feat(eval): add stable run artifacts * refactor(eval): add shared grader contract * feat(eval): persist grader artifacts * refactor(eval): rename runner layers * refactor(eval): add executor backend boundary * refactor(eval): split clado backend * feat(eval): add workflow compatible cli * feat(eval): add r2 publisher module * ci(eval): migrate weekly workflow to eval cli * docs(eval): document suite pipeline * chore(eval): verify pipeline refactor * fix: address review feedback for PR #875 * docs(eval): add env example * docs(eval): explain suites and variants * chore(eval): organize config layouts * chore(eval): colocate grader python evaluators	2026-04-29 17:21:02 -07:00
Nikhil	6e3306f5e5	fix: make R2 uploads retryable (#874) * fix: make R2 uploads retryable * fix: address review feedback for PR #874	2026-04-29 16:43:33 -07:00
Nikhil	c244462b29	fix: use Node 24 GitHub actions (#872)	2026-04-29 15:31:23 -07:00
Nikhil	ebf97f74f6	fix: bound VM agent cache smoke test (#870) * fix: bound VM agent cache smoke test * fix: address review feedback for PR #870	2026-04-29 13:43:37 -07:00
Nikhil	561f2baf97	fix(eval): split AGISDK smoke and full configs (#871) * fix(eval): split agisdk smoke and full configs * fix(eval): default agisdk smoke to openrouter	2026-04-29 13:38:55 -07:00
shivammittal274	df0f45dd29	Feat: eval debug dev ci (#869 ) * chore(eval): instrument server startup to root-cause dev CI health-check timeouts Three diagnostics + one config swap to investigate why the eval-weekly workflow has been failing on dev since 2026-04-25 with "Server health check timed out" (every worker, every retry). Background: - Last successful weekly eval on dev: 2026-04-18 (sha `f5a2b73`) - Since then, ~30 server commits landed including Lima/VM runtime, OpenClaw service, ACL system, ACP SDK — 108 server files changed, ~13K LOC added. - Server process spawns cleanly in CI (PID logged) but never binds /health within the 30s eval-side timeout. Static analysis finds no obvious blocker; we need runtime evidence. Changes: 1. apps/server/package.json — add `start:ci` script (no `--watch`). The default `start` uses `bun --watch` which forks a child process that watches every file in the import graph. Dev's graph is ~108 files larger than main's; on a cold CI runner the watcher setup is a plausible source of multi-second startup overhead. 2. apps/eval/src/runner/browseros-app-manager.ts: - Use `start:ci` when `process.env.CI` is set (true on GitHub-hosted runners by default), else `start`. - Capture per-worker server stderr to /tmp/browseros-server-logs/ instead of ignoring it. Without this we have no visibility into why the server is hung pre-/health. - Bump SERVER_HEALTH_TIMEOUT_MS 30s -> 90s. Dev's larger module graph may simply need more cold-start time on CI. 3. .github/workflows/eval-weekly.yml — upload the server logs dir as a workflow artifact (always, not just on success) so we can post-mortem any startup failure on the next run. 4. configs/agisdk-real-smoke.json — swap K2.5 from OpenRouter -> Fireworks (bypasses the OpenRouter per-key spend cap that has been eating recent runs) and drop num_workers 10 -> 4 (well below the Fireworks per-account TPM threshold that overwhelmed the original 2026-04-23 run). 
Plan: trigger the eval-weekly workflow on this branch with the agisdk config and observe (a) whether it gets past server startup, and (b) if it doesn't, what the captured server stderr says. * fix(eval): capture stdout too — pino logger writes to stdout, not stderr Previous diagnostic patch only redirected stderr; the captured per-worker log files came back as 0 bytes because the server uses pino which writes all log output to stdout (fd 1), not stderr (fd 2). Capture both into the same file. * fix(server): catch sync throw from OpenClaw constructor on Linux The container runtime constructor in OpenClawService throws synchronously on non-darwin platforms, e.g. GitHub Actions Linux runners. The existing .catch() on tryAutoStart() only handles async throws inside auto-start — the sync throw from configureOpenClawService(...) itself propagates up through Application.start() and crashes the process via index.ts:48 (process.exit(EXIT_CODES.GENERAL_ERROR)). This is what's been killing dev's eval-weekly CI: the server crashes in milliseconds, the eval client polls /health, gets nothing, times out. Fix: wrap the configureOpenClawService call in try/catch matching the existing .catch() intent (best-effort, don't crash). Server continues without OpenClaw on platforms where it can't initialize. Verified by reading captured server stdout from run 25123195126: Failed to start server: error: browseros-vm currently supports macOS only at buildContainerRuntime (container-runtime-factory.ts:54:11) at new OpenClawService (openclaw-service.ts:652:15) at configureOpenClawService (openclaw-service.ts:1527:19) at start (main.ts:127:5) * fix(server): defer OpenClaw chat client port lookup to request time apps/server/src/api/server.ts:149 was calling getOpenClawService().getPort() synchronously when constructing the OpenClawGatewayChatClient inside the createHttpServer object literal. 
On non-darwin platforms this throws via the OpenClawService constructor → buildContainerRuntime, escaping the try/catch added in `5cf7b765` (which only protected the configureOpenClawService call further down in main.ts). Every other getOpenClawService() reference in server.ts is already wrapped in an arrow function. This was the lone holdout. Make it lazy too: change the chat client constructor to take getHostPort: () => number instead of hostPort: number, evaluate it inside streamTurn at request time. Behavior on darwin is unchanged. This unblocks dev's eval-weekly CI on Linux runners where OpenClaw isn't available — the chat endpoint isn't exercised by the eval, so a deferred throw is acceptable. * fix(server): allow Linux to skip OpenClaw via BROWSEROS_SKIP_OPENCLAW=1 Earlier surgical fixes (try/catch in main.ts, lazy chat client port) didn't unblock dev's Linux CI — same throw kept reproducing. Whether this is bun caching stale stack frames or a missed eager call site, the safer move is to fix it at the root: make buildContainerRuntime never throw on Linux when the runner has explicitly opted out. Adds BROWSEROS_SKIP_OPENCLAW env check alongside the existing NODE_ENV=test escape hatch in container-runtime-factory.ts. When set, returns the existing UnsupportedPlatformTestRuntime stub — server boots normally, /health binds, any actual OpenClaw API call still fails loudly at request time. eval-weekly.yml sets the flag for the Linux runner. Darwin behavior and non-CI Linux behavior unchanged (without the flag they still throw). * feat(eval): align Clado action executor with new endpoint contract David Shan shared the updated Clado BrowserOS Action Model spec. Changes to match it: - Bump endpoint URL + model id to the 000159-merged checkpoint (clado-ai--clado-browseros-action-000159-merged-actionmod-f4a6ef) in browseros-oe-clado-weekly.json and the README example. - CLADO_REQUEST_TIMEOUT_MS 120s → 360s. 
Cold start can take ~5 min; the 2-min ceiling was failing every cold-start request. - Treat HTTP 200 with action=null / parse_error as an INVALID step instead of aborting the executor loop. The model can self-correct on the next call. Cap consecutive parse failures at 3 to avoid infinite loops. - Capture final_answer from end actions. Surface it in the observation back to the orchestrator so its task answer can use the model's declared result. - Add macOS Cmd-* key mappings (M-a, M-c, M-v, M-x → Meta+A/C/V/X). - Switch screenshot format from webp → png to match the documented "PNG or JPEG" contract. * chore(eval): refresh test-clado-api script for new Clado contract Updated the local smoke-test to match the new Clado endpoint and response contract: - New action + health URLs (000159-merged checkpoint). - Drop the grounding-model branch (orchestrator-executor doesn't use it; the README David shared only documents the action model). - Health-check waits up to 6 minutes for cold start with a 30s warning so the operator knows it's spinning up. - Print every documented response field (action, x/y, text, key, direction, amount, drag start/end, time, final_answer, thinking, parse_error, inference_time_seconds). - Three-step run that exercises a click, a typing continuation with formatted history, and an end+final_answer probe. * chore(eval): point clado weekly config at agisdk-real Switches the orchestrator-executor + Clado weekly config to run on the AGI SDK / REAL Bench task set with the deterministic agisdk_state_diff grader. Matches the orchestrator-executor smoke target (Fireworks K2.5 orchestrator + Clado action executor) we want to track week-over-week. * chore(eval): run clado weekly headless Default to headless so the weekly job (and local repros) don't pop ten visible Chrome windows. Set headless=false locally if you need to watch a worker. 
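The Clado change above (record an unparseable response as an INVALID step, and stop only after three such failures in a row) can be sketched as follows. This is an illustration under stated assumptions and not the executor's actual code: `ActionResponse`, `StepResult`, and `runExecutorLoop` are hypothetical names, the loop is synchronous to keep the example small, and only the `action` and `parse_error` field names and the cap of 3 come from the commit message.

```typescript
// Hypothetical response shape; `action` and `parse_error` are the field names
// mentioned in the commit message.
interface ActionResponse {
  action: string | null;
  parse_error?: string;
}

type StepResult =
  | { kind: "action"; action: string }
  | { kind: "invalid"; reason: string };

const MAX_CONSECUTIVE_PARSE_FAILURES = 3; // cap stated in the commit message

// Pulls up to `maxSteps` responses. A bad response is recorded as an invalid
// step and the loop continues; it aborts only once the cap is reached.
function runExecutorLoop(
  nextResponse: () => ActionResponse,
  maxSteps: number,
): { steps: StepResult[]; aborted: boolean } {
  const steps: StepResult[] = [];
  let consecutiveFailures = 0;
  for (let i = 0; i < maxSteps; i++) {
    const res = nextResponse();
    if (res.action === null || res.parse_error !== undefined) {
      steps.push({ kind: "invalid", reason: res.parse_error ?? "action was null" });
      consecutiveFailures += 1;
      if (consecutiveFailures >= MAX_CONSECUTIVE_PARSE_FAILURES) {
        return { steps, aborted: true };
      }
      continue;
    }
    consecutiveFailures = 0; // any valid action resets the streak
    steps.push({ kind: "action", action: res.action });
  }
  return { steps, aborted: false };
}
```

Resetting the counter on every valid action is what makes the cap apply to consecutive failures only, so a model that self-corrects on the next call is not penalized for earlier mistakes.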
* fix(eval): address Greptile P1+P2 on server log fd handling P1: openSync was outside the mkdirSync try/catch, so a swallowed mkdir failure (e.g. unwritable custom BROWSEROS_SERVER_LOG_DIR) would leave the log directory missing and crash the server spawn with ENOENT. Move openSync into the same try block; fall back to /dev/null so spawn always succeeds. P2: the log fd was opened on every server start but never closed. Each restart attempt leaked one fd across all workers — over a long eval run that could exhaust the process fd limit. Track the fd on the manager and closeSync it in killApp() right after the server process exits (the child's dup keeps the file open until it exits, so we don't truncate output).	2026-04-30 01:33:49 +05:30
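The deferred port lookup described in this commit (pass `getHostPort: () => number` instead of `hostPort: number`, and evaluate it inside `streamTurn`) follows a general pattern that can be sketched as below. This is not the real `OpenClawGatewayChatClient`: the class name, the returned string, and the URL in it are placeholders assumed for illustration. Only `getHostPort` and `streamTurn` are names taken from the commit message.

```typescript
// Hypothetical stand-in for the chat client described in the commit message.
class LazyPortChatClient {
  private readonly getHostPort: () => number;

  // Storing the getter does not call it, so construction cannot throw
  // even on a platform where the port lookup is unsupported.
  constructor(getHostPort: () => number) {
    this.getHostPort = getHostPort;
  }

  // The port is resolved here, at request time. The returned string is a
  // placeholder for whatever request the real client would issue.
  streamTurn(message: string): string {
    const port = this.getHostPort();
    return `POST http://127.0.0.1:${port}/chat ${JSON.stringify(message)}`;
  }
}

// Simulates a platform where the lookup throws.
function unsupportedPort(): number {
  throw new Error("unsupported platform");
}

// Safe to build at startup; a failure would surface only if streamTurn is called.
const clientOnUnsupportedPlatform = new LazyPortChatClient(unsupportedPort);
```

The trade-off is the one the commit message names: the error moves from startup to the first request, which is acceptable when that endpoint is not exercised on the affected platform.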
Nikhil	edfc5c751c	fix: align OpenClaw gateway image with VM cache (#868) * fix: load OpenClaw gateway image from VM cache * fix: use container port for OpenClaw ACP bridge * fix: address review feedback for PR #868	2026-04-29 12:11:00 -07:00
Nikhil	471256f31c	fix: stop passing native permission flags to ACP adapters (#867)	2026-04-29 11:07:51 -07:00
Nikhil	4c90ca696b	fix(agents): connect OpenClaw ACP inside gateway container (#866)	2026-04-29 11:07:29 -07:00
Nikhil	f2ac87d7c3	feat: show created agents in sidepanel (#865) * feat(agent): list created agents in sidepanel target catalog * feat(agent): show created agents in sidepanel selector * feat(server): add sidepanel chat route for created agents * feat(agent): route sidepanel agent sends by agent id * chore(agent): retire virtual sidepanel acp targets * fix: address review feedback for PR #865	2026-04-29 10:15:58 -07:00

2464 Commits