pocketpaw

mirror of https://github.com/pocketpaw/pocketpaw.git synced 2026-05-13 21:21:53 +00:00
Files
Rohit Kushwaha adaa700a0d feat(pocket-specialist): single-shot pocket creation + deepagents 0.5.8 + ripple validator (#1085 )
* feat(ripple): scaffold $source resolver walker (no sources yet)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(ripple): include workspace/pocket context in resolver warnings

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(ripple): cover marker dispatch, unknown-source, error paths

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(ripple): cover marker inside list and multi-marker resolution

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ripple): workspace.pockets source

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(ripple): guard workspace.pockets against falsy ctx; drop __all__

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ripple): workspace.members source (v1: ids only)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(pockets): resolve \$source markers on read in service.get

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pockets): never raise from resolver; fall back to raw spec on failure

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ripple): teach pocket-creation agent the \$source mechanism

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(ripple): remove scaffolding comment; document share-link non-resolve

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(ripple): note state-sources in assembly comment; document share-link non-resolve

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Revert "chore(ripple): remove scaffolding comment; document share-link non-resolve"

This reverts commit e105687e92.

* feat(ripple): teach interaction agents about $source; eager-register sources

Three follow-ups from the resolver review:
- _assemble_interaction now includes _STATE_SOURCES_BLOCK so edits to
  existing pockets can use $source markers (not just new builds).
- mount_cloud eagerly imports ripple_sources so @register decorators
  fire at startup rather than on first pocket get().
- Document agent_view's intentional non-resolution: agents must see raw
  markers to preserve them on edit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pockets): resolve \$source markers on create/update returns and broadcasts

The user-visible bug: a pocket with \`{\"\$source\": \"workspace.pockets\"}\`
in state.all_pockets rendered an empty table after creation. Root cause
was that service.create, service.update, and the WebSocket event payload
all bypassed the resolver — the desktop client renders from those, never
hitting service.get.

Centralise resolution in a private \`_resolved_wire_dict(doc, viewer_user_id)\`
helper used by service.get (existing), service.create return, service.update
return, and \_pocket_event_payload.

For multi-recipient broadcasts, the helper resolves against doc.owner.
This can over-share owner's private pocket metadata to other recipients;
v2 will move to per-recipient resolution or frontend refetch on event.
Documented in the helper docstring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pockets): resolve \$source markers in agent SSE push too

The previous fix covered service.create/update return and the WebSocket
broadcast, but missed a third channel: the agent's MCP create/update
tools push to the active SSE stream via _push_replace and
push_sse_event(\"pocket_created\"). Both used the raw _agent_view_dict
output (Beanie model_dump) — the desktop client renders from those
events first, before any GET hits service.get.

Add _resolved_view_for_frontend that resolves rippleSpec using the
streaming user/workspace ContextVars (per-stream SSE = right viewer).
Wire it into _push_replace and the pocket_created SSE push. The agent's
return value still carries raw markers so it preserves them on edit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pockets): resolve \$source on widget/team/agent mutation returns

All wire-dict-returning service functions now pipe through
_resolved_wire_dict so the renderer never receives raw markers via:
- POST /pockets/{id}/widgets / PATCH / DELETE / reorder
- POST /pockets/{id}/team / DELETE
- POST /pockets/{id}/agents / DELETE

Previously these returned raw pocket_to_wire_dict, so any frontend
that updated its local pocket store from those response payloads
clobbered the resolved state from service.get with raw markers — most
likely cause of the \"renders once, empty on revisit\" symptom after
a widget or membership change between visits.

access_via_share_link stays raw (no auth context, documented).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pockets): resolve \$source markers in list_pockets too

The desktop client renders pockets directly from the list_pockets
response — it doesn't fetch each pocket via GET /api/v1/pockets/{id}.
So even though service.get had been resolving since the very first
commit of this feature, the frontend never saw resolved data: it was
reading from list_pockets, which returned raw markers.

Apply the same _resolved_wire_dict treatment per pocket. v1: this is
N resolutions for N pockets in the list response. The two current
sources (workspace.pockets, workspace.members) are cheap Mongo reads,
so this is acceptable. If a future source is heavy, add a per-request
memo to ResolveCtx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ripple): enrich workspace.members with name/email/avatar/role

The v1 id-only shape crashed the people-picker widget — its renderer
calls .split() on a member's name to derive initials, and an undefined
name throws \"Cannot read properties of undefined (reading 'split')\".

Join the workspace's member ids with the User collection on the way
out: each entry now carries {id, name, email, avatar, role}. Members
with no matching User row are dropped (rare but possible during async
deletion). Falls back to the email local-part when full_name is empty.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ripple): teach prompts the new composite layouts + add no-invented-widgets rule

WIDGET CATALOG, USE-THE-WIDGET RULE, FULL-PANE RULE, and COMPOSITION
COOKBOOK now cover the new ripple layouts: comparison-layout,
entity-detail, form-layout, wizard-layout, checklist-layout,
report-layout, invoice-layout, order-status, map, location-picker.
Dashboard variants (exec/ops/analytics/pipeline/project) are
intentionally NOT yet documented — they ship in a follow-up.

Also fixes a typo (`entity-details` → `entity-detail`) so the prompt's
catalog string matches the registry.

New NO_INVENTED_WIDGETS_RULE — the registry is closed; the renderer
prints a red `Unknown widget type: ...` for anything not in the
catalog. The rule spells out the common invention modes (pluralizing,
abbreviating, compounding like `metric-card`/`kpi-tile`) and the
rebuild antipatterns whose right answer is a typed widget. Spliced
into RIPPLE_DESIGN_RULES between WIDGET_CATALOG and
WIDGET_SPEC_TOOL_RULE so the agent learns the catalog, then the
closure rule, then the tool-call requirement.

Example accuracy: the inlined `table` examples in CANONICAL_SHAPES
and the Todos creation example switch from `data:` (runtime alias) to
`rows:` (manifest's documented prop) so prompts and manifest agree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ripple): teach prompts derive methods + normalizer drops dead actions

Two paired changes for the same root cause (specs with controls that
look interactive but aren't wired):

1. _design.py — document the new resolver methods (where/whereIn/
   sortBy/limit/reverse + bracket indexing) with a concrete filter+sort
   example, and add an "Interactive elements must have handlers" rule
   that names the dead-button pattern explicitly.

2. ripple_normalizer.py — strip entity-detail action items lacking
   ``actions``/``on_click`` handlers; lift ``on_click`` -> ``actions``
   when the agent uses the wrong field name. Stripping over raising
   so a content-side regression doesn't lock the agent in a retry
   loop; warning logged for telemetry.

Three new normalizer tests covering drop / lift / pass-through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(chat): reorder system prompt so static prefix caches

Anthropic prompt caching keys off prefix stability. build_context_block
previously emitted dynamic <scope>/<participants> tags FIRST, so the
~12k-token ripple/pocket block at the end never hit cache. Reorder so
static prompts go first, dynamic tags last. KB-context append in the
router lands after dynamic tags, where it belongs (also per-turn).

Adds prefix-stability test that fails before the reorder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(chat): clarify static_parts caveat + named prefix-floor constant

Code review follow-up to the build_context_block reorder:
- Comment clarifies that the pocket-interaction branch's static_parts
  prefix is per-pocket-instance, not globally cacheable.
- Replace bare 1000 magic number in the prefix-stability test with a
  named local constant + explanatory comment.
- Remove redundant in-function import that duplicated module-level imports.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ripple): extract widget_help() catalog accessor

The full RIPPLE_DESIGN_RULES text rides in every chat-inline system
prompt today. We're moving the per-widget catalog behind an on-demand
MCP tool (get_inline_widget_help, landing in a follow-up commit).

This commit creates the lookup function that the tool will call:
widget_help(types=[...]) returns the slice of RIPPLE_DESIGN_RULES
matching the requested widget types, or the full text when called
with no args. The 'Toolkit' / expression-language section is always
included — the agent rarely uses widgets without bindings.

Two unit tests cover known-type lookup and full-catalog fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(ripple): split widget_help only on top-level headings

Code review caught that _split_sections fragmented the
INTERACTIVE_STATE_RULE section (which uses '##' subheadings for
Toolkit, action vocabulary, etc.) into ~10 disconnected pieces. Split
only on '# ' top-level headings; '##' stays in the section body.

always_keep now matches the section body for 'toolkit' or
'expression language' so the agent always gets the handler/binding
vocabulary regardless of which widget types it asked for.

Strengthens the chart-help test to assert the canonical chart schema
is included and that the result is strictly smaller than the full
catalog — catches a regression where the splitter accidentally
returned everything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(ripple): slim INLINE_RIPPLE_SYSTEM_PROMPT, defer catalog to MCP tool

The full RIPPLE_DESIGN_RULES (~600 lines, ~9k tokens of widget
catalog and chart shapes) used to ride in every chat-inline system
prompt. Most replies use 1-3 widgets, so 90%+ of those tokens were
paid for nothing.

Replace the catalog concatenation with _INLINE_CORE_CATALOG: a slim
block naming the six core widgets the agent uses constantly
(text, heading, stat, button, table, flex) plus a pointer to the
get_inline_widget_help MCP tool for everything else (chart, sparkline,
kanban, gauge, ...). The tool was wired up to the same RIPPLE_DESIGN_RULES
text via ee.ripple._inline_core.widget_help in the previous commit.

Add an explicit 'interactive elements need handlers' rule to the
RULES block — this was previously load-bearing on the design rules
text that's no longer included.

Removes the now-unused RIPPLE_DESIGN_RULES import.

The companion test test_build_context_block_includes_ripple_hint
will be rewritten in the next commit to match the new slim shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(ripple): tighten inline prompt — single rule + temporal gate

Code review follow-ups for the slim-inline-prompt commit:
- Remove the duplicate '---' boundary at the preamble/catalog seam.
  The preamble already closes with a horizontal rule; the catalog
  was opening with another, producing a double rule in the rendered
  prompt.
- Reword self-check item 5 from 'OR get_inline_widget_help was called
  for the type' to 'Used a core widget, or called get_inline_widget_help
  BEFORE emitting the type'. The 'BEFORE' converts a retrospective
  question into a temporal gate, closing the path where a model can
  rationalize satisfying it from memory.

No behavior change. Prompt size delta: ~+16 chars.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(chat): rewrite ripple-hint test for slim inline prompt

The slim INLINE_RIPPLE_SYSTEM_PROMPT no longer ships the per-widget
catalog — those moved behind the get_inline_widget_help MCP tool.
The old test asserted on content that's no longer in the prompt
and had been failing since before this work began.

Rewrite to match the new shape:
- Six core widgets named (text, heading, stat, button, table, flex).
- chat.send loop still there.
- get_inline_widget_help mentioned (so the agent knows the escape
  hatch exists).
- The full RIPPLE_DESIGN_RULES text is NOT a substring of the prompt
  — proves the catalog deferral.
- The prompt is strictly shorter than the catalog — guards against
  accidental re-inclusion.

The test module is now fully green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(chat): hoist design-rules import + add sentinel to slim-prompt test

Code review follow-ups:
- Hoist 'from ee.ripple._design import RIPPLE_DESIGN_RULES' to
  module top-level. Three test functions had deferred in-function
  imports of the same symbol; consolidating.
- Strengthen test_build_context_block_includes_ripple_hint with a
  catalog-only sentinel phrase. The 'not in' check on the full
  RIPPLE_DESIGN_RULES string is the strict guard; the sentinel
  pinpoints WHICH catalog content leaked when the strict guard
  fails. The size check stays as a coarse secondary guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(agents): add get_inline_widget_help MCP tool

The slim chat-inline system prompt points the agent at this tool for
non-core widget docs. Until now the tool didn't exist; the agent
would have hallucinated calls.

Implementation mirrors get_widget_spec — module-level handler that
reads from ee.ripple._inline_core.widget_help, plus an @tool
registration inside build_pocket_context_server. Two handler tests
cover the typical case (asking for 'chart') and the no-args fallback
(full catalog).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(agents): strengthen get_inline_widget_help handler tests

Code review follow-ups:
- no-types test: replace 'first_heading in text' substring check with
  '== RIPPLE_DESIGN_RULES' direct equality. The substring check would
  pass vacuously if first_heading were empty.
- chart-types test: add assertion that bar/line/pie appear in the body.
  The previous 'chart' substring check would pass even if the filter
  fell through to the full catalog or returned a one-word response —
  this version verifies chart-specific schema content was returned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ripple): add POCKET_DELEGATION_RULE to main chat prompt

Phase 3 step 1 of pocket-specialist subagent rollout. Adds a slim
delegation rule the main chat agent sees in plain-chat mode (no
pocket_create intent, no active pocket_id) telling it to invoke
delegate_to_pocket_specialist for any request that mutates pocket
state, and to keep using read-only cloud_list_pockets /
cloud_get_pocket for conversational queries about pockets.

The full POCKET_CREATION_PROMPT_MCP / POCKET_INTERACTION_PROMPT_MCP
text stays unchanged — those will be wired onto the specialist
subagent in the next commit.

Test asserts the delegation rule is present in plain-chat scope and
that the full pocket creation prompt is NOT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(agents): register pocket_specialist subagent + filter main allowlist

Phase 3 step 2 of pocket-specialist rollout. Defines a subagent that
owns the full POCKET_CREATION_PROMPT_MCP + POCKET_INTERACTION_PROMPT_MCP
text and the cloud_* pocket mutation tools. Pocket edits flow through
this subagent via the delegate_to_pocket_specialist tool added in the
next commit.

Filters create_pocket, update_pocket, add_widget, update_widget,
remove_widget out of the main chat agent's allowed_tools — read-only
get_pocket / list_pockets / get_widget_spec / get_inline_widget_help
stay, since the delegation rule explicitly carves out read tools for
conversational queries about pockets.

Wires into ClaudeAgentOptions.agents (claude-agent-sdk 0.1.72) using
AgentDefinition with the actual MCP tool prefix
(mcp__pocketpaw_pocket__*, the SDK MCP server name).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ripple): POCKET_DELEGATION_RULE uses MCP tool names, not CLI

The cloud_* names exist only on the CLI bridge (codex_cli, opencode).
Subagents are MCP-only (only claude_agent_sdk supports them), so the
delegation rule is read in MCP mode where tool names are bare:
list_pockets, get_pocket, create_pocket, update_pocket, add_widget.

The _TOOLS_MCP block in the same file already uses bare names — this
aligns POCKET_DELEGATION_RULE with that canonical pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(agents): teach delegation via built-in Agent tool

Phase 3 step 3 of pocket-specialist rollout. Original plan called for
a custom delegate_to_pocket_specialist MCP tool, but claude-agent-sdk
0.1.72 auto-exposes registered subagents (set via
ClaudeAgentOptions.agents) through the built-in Agent tool. Calling
Agent(subagent_type='pocket_specialist', description=..., prompt=...)
invokes the subagent and returns its reply as a tool result the
model can read and continue with.

POCKET_DELEGATION_RULE updated to teach this canonical pattern. The
custom MCP tool was NOT added — it would have been an unnecessary
indirection that doesn't actually invoke the subagent.

Verifies (or adds, if missing) 'Agent' in the main chat agent's
allowed_tools so the Agent tool is callable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(agents): tool-policy map for Agent + resolve POCKET_ID_TOKEN

Code review follow-ups for Phase 3.4:
- Tool policy: added an explicit 'Agent' -> 'shell' entry in
  _TOOL_POLICY_MAP. is_tool_allowed() returns True for unknown keys
  only on the 'full' profile (empty _allowed_set); restrictive profiles
  ('minimal', 'coding') return False for any key absent from the resolved
  allow set. Without the entry, 'Agent' fell through .get(t, t) to the
  literal string 'Agent', which no profile allowlist contains — silently
  blocking the pocket_specialist subagent for every non-full profile.
  Mapped to 'shell' (conservative, matches Bash gating level).
- Specialist prompt: replace literal __POCKET_ID__ in the interaction
  prompt with a placeholder pointing at the Agent-tool invocation
  prompt. The specialist's system prompt is set at SDK init time, so
  per-call substitution must come from the parent's prompt arg.
- Test: dedupe duplicate OR clause in delegation-rule assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(agents): integration tests for pocket_specialist contract

Pocket-specialist subagent integration of system prompt + tool surface.
Static contract tests, no live agent run:

- Delegation rule names the registered subagent exactly and uses the
  Agent-tool kwarg shape.
- Read-only pocket tools (list_pockets, get_pocket) remain available
  to the main agent, per the carve-out for conversational queries.
- Specialist's system prompt embeds the full pocket creation prompt
  AND substitutes the POCKET_ID_TOKEN placeholder so it doesn't leak
  the literal __POCKET_ID__ marker into the runtime prompt.
- Main agent's _POCKET_MUTATION_TOOL_IDS frozenset matches the
  canonical 5-tool set that's filtered off its allowlist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(agents): assert real cross-file contracts, not prose

Code review follow-ups for the pocket_specialist integration tests:

- Extract _POCKET_SPECIALIST_NAME = 'pocket_specialist' as a module
  constant in claude_sdk.py and use it as the registration dict key.
  test_delegation_rule_lists_correct_subagent_name now imports that
  constant and asserts the delegation rule references the same name —
  catching drift if the registration is renamed but the prose isn't.
- Replace the 'rule mentions list_pockets/get_pocket' prose check with
  a real allowlist-enforcement check: the read-only tool IDs must NOT
  appear in _POCKET_MUTATION_TOOL_IDS. Renamed to
  test_main_agent_keeps_read_only_pocket_tools for accuracy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chat): pocket_create + pocket_id branches must delegate, not inline

Critical regression caught in final review: _POCKET_MUTATION_TOOL_IDS
unconditionally filters create_pocket/update_pocket/add_widget off
the main allowlist, but build_context_block's pocket_create and
pocket_id branches still shipped the full POCKET_*_PROMPT_MCP text
that instructs the agent to call those tools. Sessions in those
modes would receive instructions for tools they could not call.

Collapse all three branches: ship INLINE_RIPPLE_SYSTEM_PROMPT +
POCKET_DELEGATION_RULE everywhere. The heavy POCKET_CREATION_PROMPT_MCP
and POCKET_INTERACTION_PROMPT_MCP live ONLY on the pocket_specialist
subagent — that's the architecture Phase 3 promised. The dynamic
<current-pocket> tag still appears for pocket_id mode so the main
agent knows which pocket to pass when invoking the specialist.

Cleanup:
- Removed get_pocket_prompts / POCKET_ID_TOKEN imports from
  agent_service.py (now dead).
- Re-exported POCKET_DELEGATION_RULE from ee/ripple/__init__.py.
- Refreshed stale comment in claude_sdk.py describing the surviving
  main-agent tool surface (read-only + catalog only).

Tests:
- Two regression guards (pocket_create branch, pocket_id branch) that
  the heavy prompt is NOT inlined and the delegation rule IS.
- Policy-map test ensuring 'Agent' has an explicit entry, preventing
  silent stripping under restricted tool profiles.
- Updated two stale tests in test_pocket_agent_context.py that were
  asserting old branch behavior now replaced by delegation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(chat): gate Phase 3 delegation to subagent-capable backends

Phase 3's slim-prompt + delegate-to-specialist architecture is
currently Claude-only — pocket_specialist is registered via
ClaudeAgentOptions.agents and POCKET_DELEGATION_RULE references the
built-in Agent tool. On other backends (codex_cli, openai_agents,
google_adk, deep_agents, copilot_sdk, opencode) the slim prompt
+ delegation rule would leave the agent without context to act.

Gate the new path on _MCP_POCKET_BACKENDS membership. Subagent-capable
backends ship the slim main-agent prompt; everything else falls back
to the pre-Phase-3 selection (heavy POCKET_CREATION_PROMPT_MCP /
POCKET_INTERACTION_PROMPT_MCP inline).

Universal Option-A — an MCP-based specialist that orchestrates a
fresh LLM call from any backend — is the planned follow-up. Tracking
issue / next plan to be filed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(deps): bump deepagents to 0.4.12, langchain-mcp-adapters to 0.2.2

Lift the deep_agents extra off the >=0.1.0 floor so we pick up the
0.4.x feature surface (response_format / structured output, skills,
subagents, middleware, ProviderProfile, cache, interrupt_on,
permissions). The existing src/pocketpaw/agents/deep_agents.py
implementation stays compatible — this is a pure floor bump that
unblocks follow-up optimization work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(deps): correct deepagents pin to >=0.5.8,<0.6.0

Initial pin of >=0.4.12,<0.5.0 was based on stale PyPI metadata. The
actual current stable line is 0.5.x (latest 0.5.8, released 2026); the
local development env already had 0.5.1 installed, which the previous
upper bound would have forbidden.

The 0.5.x signature surface is what we'll actually be coding against:
  - cache=langgraph.cache.base.BaseCache (not langchain BaseCache)
  - response_format=ToolStrategy|ProviderStrategy|AutoStrategy
  - middleware=Sequence[AgentMiddleware]
  - subagents=list[SubAgent|CompiledSubAgent]
  - skills=list[str], memory=list[str]
  - interrupt_on, checkpointer, store, backend

Top-level exports verified in 0.5.1: CompiledSubAgent,
FilesystemMiddleware, MemoryMiddleware, SubAgent, SubAgentMiddleware,
create_deep_agent (no SkillsMiddleware, SummarizationMiddleware, or
ProviderProfile at the package root in 0.5.x — those moved or were
removed since the 0.4.x docs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(deep_agents): fix Responses-API regression + skills/memory plumbing

Three changes against the deep_agents backend, gated by the deepagents
0.5.8 floor introduced in #1083:

1. **Add `langchain-openai>=1.2.0,<2.0.0` to the `deep-agents` extra.**
   `deepagents` 0.5.x only pulls `langchain-anthropic` and
   `langchain-google-genai`, so any user picking
   `deep_agents_provider` in {openai, openai_compatible, openrouter,
   litellm} hit `ImportError: Initializing ChatOpenAI requires the
   langchain-openai package` at runtime. This was broken before the
   bump too — now fixed.

2. **Force chat-completions for non-OpenAI OpenAI-compat endpoints.**
   In deepagents 0.5.x, `init_chat_model("openai:...")` defaults to
   the OpenAI **Responses API**. DeepSeek, OpenRouter, LiteLLM proxy,
   vLLM and friends speak chat-completions but not Responses, so
   every call would 404. `_build_model()` now flags these branches
   (`openai_compatible`, `openrouter`, `litellm`) and forwards
   `use_responses_api=False` to `init_chat_model`. Plain `openai`
   without a custom base_url is unaffected and keeps Responses-API
   features.

3. **Wire `Settings.deep_agents_skills` and `deep_agents_memory`** —
   two `list[str]` fields that forward to deepagents'
   `SkillsMiddleware` (progressive AGENTS.md-style file loading) and
   `MemoryMiddleware` (cross-thread recall). Both fields are
   forwarded only when populated, so the default config doesn't wire
   middleware with nothing to load. The compiled-graph cache key now
   includes both lists so changing them invalidates cleanly.

Tests: 9 new cases in `test_deep_agents_backend.py` covering the
Responses-API kwarg per provider branch, skills/memory forwarding,
empty-list omission, and cache-key invalidation. Full backend test
suite (37 tests) green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(pocket-specialist): design for cloud pocket specialist agent

Spec for the pocket_specialist tool that any agent backend can call to
create pockets end-to-end (list -> decide extend-vs-create -> draft ->
validate -> persist) with status events streamed back to the user.

Key design choices captured:

- Always part of the system (no enable/disable toggle); tool
  availability gated by which backend the operator picks for the
  specialist runtime.
- Default backend deep_agents for in-process LLM (no subprocess
  cold-start), configurable via POCKETPAW_POCKET_SPECIALIST_BACKEND.
- Always ships output — never refuses, never returns noop, persists
  best-effort even after max validation iterations. Mirrors the
  ripple_validator's "never block writes" philosophy.
- Dual surface: MCP tool for MCP-capable backends, shell command for
  codex_cli/opencode/gemini_cli. Both call the same runtime.
- Pocket prompts stay canonical in ee/ripple/_pockets.py per the
  reference_pocket_prompts memory; legacy STEP 1..N inline-creation
  blocks are deleted from both prompt variants in favor of an
  unconditional STEP 0 delegation block.
- Persist-once invariant enforced by runtime safety net: if the LLM
  returns without calling persist_pocket, the runtime force-persists
  the last draft.

Stacks on PR #1083 (deepagents >= 0.5.8) and PR #1084 (deep_agents.py
Responses-API fix + skills/memory plumbing). Implementation plan to
follow once the user reviews this spec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(pocket-specialist): implementation plan

13-task TDD plan covering: settings, status events, internal tool
wrappers (list/validate/persist), AgentBackend.attach_specialist_tools
protocol method + DeepAgentsBackend impl, AgentRouter.create_isolated_
backend classmethod, run_specialist runtime with persist-once safety
net, MCP server, CLI shell command, calling-agent prompt rewrite,
public exports, and PR open.

Self-review pass clean: every spec section maps to a task, no
placeholders, type/signature consistency verified across tasks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(pocket-specialist): add settings fields

* feat(pocket-specialist): add model resolver

* feat(pocket-specialist): add status events

* fix(pocket-specialist): map mismatched backend model field names

* feat(pocket-specialist): add list/validate/persist tool wrappers

Three LangChain StructuredTool factories that close over workspace_id
and user_id, so multi-tenancy stays enforced even if the LLM
hallucinates argument names. Validation re-uses ee.ripple.manifest
(no separate ripple_validator module exists; the plan's reference was
aspirational).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pocket-specialist): debug-level emit log + assert no-raise

* feat(agents): attach_specialist_tools on AgentBackend protocol + deep_agents impl

* fix(pocket-specialist): tighten persist guard, drop dead validator branch, tighten test

* feat(agents): AgentRouter.create_isolated_backend for fresh non-cached instances

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(agents): BaseAgentBackend mixin with NotImplementedError default + cache key test + docstring

* feat(pocket-specialist): runtime happy path with backend orchestration

* feat(pocket-specialist): persist-once safety net

Replaces the Task 7 NotImplementedError stub with a real fallback that
force-persists a minimal pocket when the LLM finishes without calling
persist_pocket. Surfaces a warning in the output so callers know the
pocket is a stub and ask the user to refine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pocket-specialist): side-channel capture for persist/validate tool results

Real agent backends (deep_agents, claude_sdk, codex_cli, copilot_sdk,
google_adk) only emit metadata={"name": tool_name} on tool_result events;
they never put the tool's return dict in metadata["result"]. The runtime's
old capture path therefore always saw None for captured_pocket and fell
through to the safety-net fallback on every successful run, never
returning the pocket the LLM actually built.

Fix by giving make_persist_pocket_tool and make_validate_spec_tool an
optional capture dict argument. The factory's _run closure mutates the
dict when the tool runs (capture["pocket"] / capture["last_validation"]).
The runtime constructs the dicts, passes them into the factories, and
reads them after backend.run finishes. This bypasses the LangGraph/MCP
boundary entirely - no backend changes, no string parsing of truncated
tool_result content, no contract changes elsewhere.

Tests updated to patch the factories with stubs that simulate the
capture-write side effect, since the mocked backend never invokes the
returned StructuredTool. The safety-net test still exercises the
no-persist path unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(pocket-specialist): MCP tool registration + claude_sdk wiring

Adds in-process SDK MCP server `pocketpaw_pocket_specialist` exposing a
single `create` tool that hands a brief off to `run_specialist`. Wired
into `_get_mcp_servers` and the main agent's allowlist alongside the
existing `pocketpaw_pocket` server. Updates the test strip helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(pocket-specialist): CLI shell command for non-MCP backends

Adds cloud_pocket_specialist_create to the pocketpaw.tools.cli dispatch
registry so codex_cli, opencode, gemini_cli, and copilot_sdk backends
(which can't host an in-process SDK MCP server) can invoke the pocket
specialist via a Bash tool call.

Handler signature matches the existing cloud_* dict-arg pattern (vs the
argv-style sketched in the plan), so it slots straight into the
_run_cloud_handler dispatcher without needing a parallel codepath.
Workspace/user identity is read via the same current_workspace_id /
current_user_id accessors used by Task 9's MCP tool, with an args-dict
and POCKETPAW_* env-var fallback for callers outside the cloud chat
ContextVar scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pocket-specialist): correct minimal-spec text prop, manifest validation guard

The persist-once safety net was building a minimal Ripple spec with
`{"type": "text", "props": {"value": "..."}}` -- but the `text` widget's
manifest declares `text` as the content prop, not `value`. The pocket
would render blank.

Fix:
- Rename the prop to `text`.
- Extract the spec to a module-level `_MINIMAL_SPEC_FOR_FALLBACK`
  constant so a regression test can validate it against the live
  ripple manifest. The test loads ripple/static/manifest.json directly
  and runs `validate_against_manifest` -- if a future renderer rename
  drifts the prop names, we fail the test before shipping a blank
  pocket.
- Add a failure-path test for `agent_create` returning an error string;
  confirms we propagate as RuntimeError (the MCP/CLI handler boundary
  converts it to a user-facing is_error response).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(pocket-specialist): replace inline pocket creation with delegation block

Per Task 11 of the pocket-specialist plan: the calling-agent creation
prompts (POCKET_CREATION_PROMPT_MCP / POCKET_CREATION_PROMPT_CLI) now
carry only a STEP 0 "DELEGATE TO SPECIALIST" block — scope/canvas
context plus a single instruction to call pocket_specialist__create
(MCP) or cloud_pocket_specialist_create (CLI). The legacy STEP 1..N
inline workflow (list_pockets / create_pocket / update_pocket calls,
list-before-create gate, interactive-by-default rule, examples,
research protocol, design rules) is gone from the calling-agent prompts.

The heavy creation lift moves to a new POCKET_SPECIALIST_PROMPT
constant — scope/canvas + specialist-tools (list_pockets / validate_spec
/ persist_pocket) + workflow + interactive-by-default + state-sources +
examples + research protocol + design rules. This is what
ee.agent.pocket_specialist.runtime threads as the specialist's system
prompt, replacing the previous reuse of POCKET_CREATION_PROMPT_MCP.

claude_sdk.py's native pocket_specialist subagent also flips to the
new specialist prompt so it doesn't get told "delegate to yourself"
when given a creation brief.

Tests updated:
- test_canonical_prompts_carry_required_features: now asserts the
  STEP 0 delegation block on the calling-agent prompts and the heavy
  workflow on POCKET_SPECIALIST_PROMPT.
- test_pocket_prompt_state_sources: $source vocabulary now lives on
  POCKET_SPECIALIST_PROMPT (creator) + interaction prompts (editor),
  not the calling-agent creation prompts.
- test_specialist_system_prompt_includes_full_pocket_prompts: assert
  POCKET_SPECIALIST_PROMPT (the specialist's actual prompt) is fully
  embedded, not the legacy creation prompt.
- test_non_subagent_backend_uses_inline_pocket_prompts: codex_cli &
  friends now see the CLI delegation block instead of the legacy
  list-before-create / heavy creation prompt.
- New TestSpecialistDelegationBlock class adds 4 regression tests
  guarding the new contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pocket-specialist): tighten @tool schema + use cached get_settings()

- mcp_tool: rewrite @tool JSON schema to full object form. Marks `brief`
  required at schema level, enumerates `hints` properties with
  `additionalProperties: false` so caller typos are rejected instead of
  silently dropped.
- mcp_tool: replace `Settings()` with cached `get_settings()` in the
  default-construction call site of `_create_handler`.
- cli_tool: replace `Settings()` with cached `get_settings()` in
  `_cloud_pocket_specialist_create`.
- runtime: no instantiation of `Settings()` — accepts settings via
  parameter injection so test paths remain unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(pocket-specialist): public package exports + extra error-path test

- ee/agent/pocket_specialist/__init__.py re-exports the public API
  (PocketSpecialistCreateInput, PocketSpecialistCreateOutput,
  PocketSpecialistHints, run_specialist).
- Adds a regression test for the broad-except path in the MCP handler:
  when run_specialist raises, the handler must return is_error: True
  with the exception text, never propagate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pocket-specialist): drop legacy cloud_create_pocket/cloud_update_pocket from CLI dispatcher

These were the calling-agent equivalents of the specialist tool;
claude_agent_sdk already filters them out via _POCKET_MUTATION_TOOL_IDS.
Drop them from _CLOUD_HANDLERS so codex_cli / opencode / gemini_cli /
copilot_sdk also can't bypass the specialist.

Keep cloud_add_widget / cloud_update_widget / cloud_remove_widget
(used by POCKET_INTERACTION_PROMPT_* for live editing) and the
read-only cloud_list_pockets / cloud_get_pocket plus the specialist
tool itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(pocket-specialist): INFO-level phase-transition logs for observability

Add INFO log lines so operators tailing logs can see the specialist
running even when no realtime bus subscriber is attached (headless
runs, dev shells). Two changes:

1. emit_specialist_event now logs every phase transition before
   touching the bus, with a compact key=value summary (long string
   values trimmed to 80 chars).

2. run_specialist emits a single-line operator-grep summary at the
   end of the run: pocket_id, action, backend, duration, warnings.
   Logged outside the per-event helper so it shows once per run
   regardless of bus state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pocket-specialist): remove legacy claude_agent_sdk pocket_specialist subagent

The OLD native subagent (registered via ClaudeAgentOptions.agents with
mcp__pocketpaw_pocket__create_pocket / update_pocket / etc. in its
tools list) was the path POCKET_DELEGATION_RULE pointed at. With the
new pocket_specialist__create MCP tool now the canonical entry, the
old subagent was redundant - and worse, the calling agent kept being
told to delegate to it via Agent(subagent_type="pocket_specialist"),
which then bypassed the new MCP tool entirely and called the legacy
mutation tools directly.

Changes:
- POCKET_DELEGATION_RULE rewritten to point at pocket_specialist__create.
- _POCKET_SPECIALIST_NAME, _pocket_specialist_system_prompt,
  _build_pocket_specialist_agent_def removed from claude_sdk.py.
- ClaudeAgentOptions.agents registration block removed.
- Tests covering the old subagent path removed or rewritten.
- Comments updated to reference the new MCP tool.

The _POCKET_MUTATION_TOOL_IDS allowlist filter stays in place - it's
now the sole enforcement, and with no subagent target, the legacy
mutation tools are unreachable from any code path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pocket-specialist): skip MCP server loading in specialist runs

The specialist's isolated DeepAgentsBackend was calling
_build_mcp_tools() during run(), which connects to all of pocketpaw's
configured stdio MCP servers via MultiServerMCPClient.get_tools(). On
hosts with slow/dead MCP servers this hung the specialist for minutes
without ever reaching the LLM.

Specialist runs only need the three tools attached via
attach_specialist_tools (list_pockets, validate_spec, persist_pocket);
the user MCP server set is irrelevant. Pre-populating _mcp_tools = []
inside attach_specialist_tools short-circuits the MCP loader.

Also adds INFO-level dispatch logs to runtime.py so future hangs land
on a known diagnostic line:
  [pocket-specialist] dispatching to backend.run (...)
  [pocket-specialist] backend stream started (first event: ...)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(deep_agents): switch litellm provider to native ChatLiteLLM integration

The earlier ChatOpenAI-masquerade for the litellm provider routed
requests correctly but dropped provider-specific protocol handling on
the floor — DeepSeek's reasoning_content (thinking mode) wasn't
threaded back across turns, breaking multi-turn tool-calling agents
like the pocket specialist.

Native ChatLiteLLM uses the LiteLLM SDK directly, which has built-in
handling for DeepSeek reasoning_content, Anthropic thinking blocks,
model-name routing, and provider-specific quirks.

Changes:
- pyproject.toml: add langchain-litellm to deep-agents extra (+ all/dev
  mirrors). Pinned to 0.6.4 (excluding 0.6.5+) because 0.6.5 transitively
  requires litellm>=1.83.14 which pins openai==2.24.0 and conflicts with
  langchain-openai>=1.2.0 (needs openai>=2.26.0).
- deep_agents.py:_build_model litellm branch: use api_base= (not
  base_url=), keep provider="litellm" (no openai masquerade), drop
  use_responses_api (ChatLiteLLM doesn't take it).
- test_deep_agents_backend.py: replace test_litellm_forces_chat_completions
  with a test asserting the new ChatLiteLLM-shape kwargs (model_id starts
  with litellm:, api_base set, api_key set, no use_responses_api/base_url).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(pocket-specialist): single-shot creation + DeepSeek-via-LiteLLM

The specialist previously ran 5+ LLM turns (list_pockets → draft →
validate → revise×N → persist), each turn a slow DeepSeek thinking
call routed through the LiteLLM proxy. Total runtime was 4–8 minutes
per brief, exceeding the bundled Claude CLI's MCP tool timeout and
triggering "Main loop exited without ResultMessage". The calling
agent already does the listing, extend-vs-create decision, and
research before invoking the specialist — the specialist just needs
to emit a complete rippleSpec and persist.

Specialist refactor:
- Drop list_pockets and validate_spec tools; only persist_pocket
  remains. The brief and hints (target_pocket_id for extend) carry
  everything needed.
- Inline manifest validation (apply_aliases=True) into persist_pocket;
  warnings captured and surfaced in the run output. No more
  validate-revise loop.
- Specialist prompt rewritten: ONE LLM turn, ONE tool call.

DeepSeek-via-LiteLLM enablement:
- Patch langchain_litellm._convert_message_to_dict so DeepSeek's
  reasoning_content (wrapped as Anthropic-style "thinking" content
  blocks on AIMessages by the response parser) is hoisted back to a
  top-level reasoning_content field on outgoing requests. DeepSeek
  thinking-mode rejects both the unknown "thinking" block and a
  missing reasoning_content; the round-trip patch satisfies both.

Config precedence:
- Settings.load() was passing config.json values as kwargs to
  Settings(**data), which Pydantic treats as the highest-precedence
  source — POCKETPAW_* env vars never won over a stale config.json.
  Drop any field from data when its POCKETPAW_<FIELD> env var is set
  so BaseSettings reads it from env itself.

Reduces specialist runtime to a single DeepSeek call (~30s–1min),
well under the bundled CLI's MCP timeout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ripple): write-time spec validator + grammar docs hardening

Adds a write-time validator (`ee/cloud/ripple_validator.py`) that
inspects every `{...}` template in an AI-generated rippleSpec and
flags expressions the renderer's resolver can't parse — arrow funcs,
.map/.filter/.reduce, eval-style constructs, unknown fluent methods,
etc. Wired into `pockets/service.py` (create / update /
create_from_ripple_spec / agent_create / agent_update) as
`validate_ripple_spec_logged` and into `pockets/agent_context.py` so
warnings round-trip back to the LLM in the tool result, letting the
agent self-correct on the next turn instead of producing a silently
broken pocket.

The validator is read-only — it never blocks writes (the renderer's
defensive widgets keep the user functional even when expressions
return undefined). Grammar mirrors `ripple/src/lib/core/expression-
resolver.ts`; the two files are the contract.

Also hardens `ee/ripple/_design.py` with an explicit "NEVER use" list
(arrow fns, .map/.filter, template literals, spread, for/while) and a
worked example for placeholder-list patterns that previously tripped
the LLM into inline object-literal-in-ternary.

Includes `scripts/audit_ripple_specs.py` — a read-only Mongo audit
script that runs the validator against every persisted pocket and
emits a human/JSON report for tracking grammar drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(pocket-specialist): granular mutations, retry gate, reasoning round-trip, atomic auto-open

Major buckets:

- **Granular UI/state mutations**: 5 node ops (add/replace/set_prop/move/remove) +
  4 state ops (set/append/remove/patch) replace whole-spec rewrites. New
  spec_ops.py + state_ops.py pure helpers. MCP tools + agent_context
  wrappers emit granular SSE events with only the changed subtree.

- **Edit specialist** (pocket_specialist__edit) mirrors create. Accepts
  optional pocket + target_node_ids handoff from parent so the
  specialist skips its own get_pocket / disambiguation when the parent
  already did the work.

- **langchain_react backend**: thin alternative to deep_agents that uses
  langgraph.prebuilt.create_react_agent directly, skipping the
  middleware stack (filesystem/subagents/summarization) that pocket
  flow doesn't use.

- **DeepSeek thinking round-trip**: direct DeepSeek API path via
  openai_compatible provider. _patch_openai_message_serializer
  monkey-patches langchain_openai to capture and echo reasoning_content
  per https://api-docs.deepseek.com/guides/thinking_mode -- without it,
  every tool-using turn 400s. Applied in both DeepAgentsBackend AND
  LangchainReactBackend (subclass override).

- **Manifest validation retry gate**: persist_pocket validates prop names
  against the live widget manifest BEFORE saving. On invented props
  (chart.series/xAxis/categoryKey etc.) returns
  {ok:false, redraft_required:true, warnings, message} without
  persisting. Model fixes and retries up to MAX_VALIDATION_RETRIES (6).
  After cap, persists anyway. Manifest warnings also surface to the
  agent via tool result.

- **No placeholder pockets**: dropped _force_persist_fallback. When
  the specialist can't ship a real pocket, run_specialist returns
  {ok:false, action:"failed", pocket:null, error} instead of
  auto-shipping a blank shell captioned "auto-created from a brief".

- **Atomic auto-open + session bind**: persist_pocket pushes
  pocket_created SSE + calls attach_pocket_to_session_doc directly
  after _agent_create succeeds, sharing the parent stream's
  contextvars. No longer depends on the main agent's tool_result event
  being parsed by _maybe_handle_specialist_response.

- **Prompt restructure**: behavioral rules (INLINE_RIPPLE_SYSTEM_PROMPT +
  POCKET_DELEGATION_RULE) hoisted out of the "Your Knowledge Base"
  wrapper into a new instructions channel on pool.run. New
  build_behavior_instructions + build_dynamic_context helpers split
  the static rules from per-turn reference data so the model reads
  rules as rules, not as background reference. Strengthened the
  delegation rule with a hard "talk before you call the tool" preface.

- **Real-time side-channel streaming**: agent_router races the next
  agent event against side_channel_queue.get() so push_sse_event calls
  from inside in-process tools (the specialist's status pushes during
  its multi-second run) flush to the SSE consumer in real time
  instead of all at once after the tool returns.

- **Sub-stage tool_start events**: specialist pushes synthetic tool_start
  events (pocket_specialist:build, pocket_specialist:save) so the
  desktop client's TOOL_LABELS lookup updates the loader label as work
  progresses, instead of leaving "Designing pocket..." frozen.

- **Chart prompt hardening**: explicit ban list of Recharts-style props
  (series/xAxis/dataKey/categoryKey/legend/axes/margin/stack) +
  concrete per-type {label, value} examples for bar/line/donut/pie
  + multi-series via series field on each data point.

- **Plan handoff fields**: PocketSpecialistHints expanded with
  purpose / layout / focal_widget / data_shape / key_interactions.
  Parent agent decides shape; specialist follows.

Tests: new test_spec_ops, test_state_ops, test_pocket_granular_ops,
test_pocket_state_ops, test_pocket_prompt_cache, test_edit,
test_edit_handoff, test_plan_handoff, test_widget_diversity,
test_persist_session_bind, test_langchain_react_backend,
test_deep_agents_disable_thinking, test_deep_agents_streaming_events,
test_deep_agents_openai_reasoning_content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(pocket-mcp): drop dead mutation tools from pocketpaw_pocket server

The granular node/state ops and legacy pocket/widget mutators on the
pocketpaw_pocket MCP server were unreachable in production: filtered
off the main agent via _POCKET_MUTATION_TOOL_IDS, and bypassed by the
specialist (which uses LangChain StructuredTool wrappers on an
isolated deep_agents backend, see ee/agent/pocket_specialist/tools.py).

Remove the 14 dead tool registrations + handlers, drop the now-redundant
_POCKET_MUTATION_TOOL_IDS frozenset and its filter line, and collapse
the two allowlist tests into one that asserts the read-only surface
directly. pocketpaw_pocket is now read-only by construction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(pocket-prompts): refresh single-source assertions to current shape

The pocket prompts evolved since this test was last touched:
- agent_service.py now legitimately *imports* _MCP_POCKET_BACKENDS from
  ee.ripple._pockets; the bare-substring check tripped on the import.
  Tightened to match definitions only (`<name> = ` at line start).
- The MCP creation prompt was rewritten with a "TWO-PHASE DELEGATION"
  header; the CLI prompt kept the legacy STEP 0 marker. Assert each
  variant against its actual shape.
- The main-agent interaction prompts got slimmed — <interactive-by-default>
  and <pocket-workflow> moved into the edit specialist prompts. Check
  the new <pocket-interaction> tag on the main prompts, and the heavy
  blocks on POCKET_EDIT_SPECIALIST_PROMPT_*.
- list_pockets / validate_spec are no longer wired as runtime tools on
  the creation specialist (list runs in the parent agent before
  delegation; validation is inline in persist_pocket). Assert only
  persist_pocket.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(pocket-specialist): drop orphan event bus, forward inner ops to chat SSE

SpecialistEvent / emit_specialist_event wrote to event_bus, which had
zero subscribers for specialist:* names — pure log noise. The runtime
already pushed real progress to the chat stream via _push_chat_status
for create (pocket_specialist:build / :save); edit pushed nothing.

Removed:
- ee/agent/pocket_specialist/events.py and its test (~95 lines)
- 6 emit_specialist_event calls in runtime.py
- 5 enum members (LISTING, DECIDED, DRAFTING, VALIDATING, REVISING)
  that were declared but never emitted

Added:
- _push_chat_status("pocket_specialist:edit") at run_edit_specialist
  start so the desktop client shows an "Editing pocket..." indicator
- Inner-op forwarding in run_edit_specialist: each granular tool_use
  the specialist's LLM emits (set_state, set_node_prop, add_node,
  move_node, etc.) is pushed as a tool_start on the outer chat stream,
  so the user sees per-op progress instead of opaque silence

The frontend TOOL_LABELS update (in paw-enterprise) lands separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(pockets): enforce workspace + edit-access in _agent_load_doc

The 9 granular agent mutation ops + the pocket_specialist tools all
routed through _agent_load_doc, which loaded a pocket by ObjectId
with no tenancy check. An agent with a valid session in workspace A
could call set_state / set_node_prop / etc. on a pocket in
workspace B if it knew or guessed the ObjectId. The REST update path
does the right thing via _check_domain_edit_access; the agent path
skipped it. (PR #1085 review, blocker #1.)

_agent_load_doc now reads workspace + user from the per-stream
ContextVars set by agent_router._run_agent_stream, rejects when
they're absent, and applies the same owner/shared_with/workspace-
visible gate as the REST path. Cross-tenant mismatches return the
same "pocket <id> not found" message as a genuinely missing pocket
so an agent in workspace A can't enumerate pocket ObjectIds in
workspace B.

Test plumbing:
- Refactored _patches() in test_pocket_granular_ops.py and
  test_pocket_state_ops.py to return a contextlib.ExitStack and
  patch the identity ContextVars to match the FakeDoc's tenancy.
  Every with ctx[0], ctx[1], ... collapses to with ctx.
- Added agent_identity fixture to test_pocket_agent_context.py and
  attached it to the 10 tests that hit real mongomock-motor through
  the agent path.
- New cross-workspace / non-owner / shared_with / no-stream test
  cases at the bottom of test_pocket_granular_ops.py lock the gate
  down structurally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(loop): guard backend-rebuild stop() against non-coroutine returns

asyncio.create_task(old.stop()) crashed with TypeError when old.stop()
returned a non-awaitable — happens whenever a test mocks the router
with a plain MagicMock (3 tests in test_concurrency.py, 2 in
test_stream_event.py — all flagged in PR #1085 review).

Wrap with inspect.iscoroutine() before scheduling: real backends keep
their async-cleanup behavior; mock backends no-op cleanly. Also
defensive against a future backend whose stop() is genuinely sync.

The three concurrency tests had a second, pre-existing bug: they set
loop._router = MagicMock() without stamping _active_backend_name on
it, so _select_router's "backend changed" branch tripped on every
call and swapped the carefully-mocked router for a real AgentRouter.
The test's slow_run coroutine never ran, and the test fell over on
the missing event order. Patched the test fixtures to:
  - set settings.agent_backend = "claude_agent_sdk" (concrete string)
  - patch pocketpaw.agents.loop.Settings.load to return the same mock
  - stamp router._active_backend_name = "claude_agent_sdk"
  - mock router.stop as AsyncMock for cleanliness

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(tools-cli): patch canonical _ensure_cloud_runtime_initialized name

The autouse _stub_db_init fixture patched the alias
_ensure_cloud_db_initialized, but _run_cloud_handler calls the
canonical name _ensure_cloud_runtime_initialized directly — so the
boot logic still ran and the two new tests
(test_run_cloud_handler_serializes_to_json_line,
test_run_cloud_handler_catches_exceptions) returned the
"POCKETPAW_MONGO_URI not set" error instead of exercising the
handler. (PR #1085 review, blocker #3.)

Patch both names so existing tests via the alias keep working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: review high-priority items 5-7 — silent failure, hasattr guard, cloud import gate

## #5 run_edit_specialist silent failure

ok=True was returned for every edit run regardless of whether ops
applied or the inner backend errored mid-stream. The caller had no
way to tell "no work needed" from "the specialist crashed."

- success flag starts False, flips True only after backend.run loop
  completes without exception
- new error: str | None field on PocketSpecialistEditOutput captures
  the exception type + message when the run fails
- backend.stop() still runs in the finally — partial state cleanup
  matches the create flow

Two new tests in test_edit.py lock the contract:
- test_ok_true_when_stream_completes
- test_ok_false_when_backend_raises_mid_stream

## #6 hasattr guard on langchain_openai monkey-patch

_patch_openai_message_serializer reaches into three private
langchain_openai symbols (_convert_dict_to_message,
_convert_delta_to_message_chunk, _convert_message_to_dict). The
existing try/except ImportError caught a missing module but not a
missing attribute — a future langchain-openai release that renames
or moves any of those would AttributeError on the first DeepSeek
call in production with no early warning.

Each of the three assignments is now hasattr-guarded, and the patch
function logs a loud ERROR naming the missing symbol(s) so a
langchain upgrade surfaces in CI logs instead of crashing in prod.
Partial patches still apply — surviving functionality keeps working
on the symbols that didn't move.

New test: test_patch_logs_loudly_when_a_target_symbol_is_missing.

## #7 Cloud test files import gate

tests/cloud/* import ee.cloud.*, which pulls beanie + mongomock-motor
on import. CI runs with uv sync --dev --all-extras so it always has
them; local runs without the cloud extras hit ModuleNotFoundError
that's easy to miss in a verbose pytest log.

pytest.importorskip("beanie") + ("mongomock_motor") at the top of
tests/cloud/conftest.py turns the silent-vanish failure mode into
explicit, named pytest SKIP entries pointing operators at
uv sync --dev --all-extras.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(pocket-specialist): isolate test_settings from local .env

Every Settings() call now passes _env_file=None so pydantic-settings
skips reading backend/.env, and an autouse fixture strips the
relevant POCKETPAW_* env vars that might be exported in the shell.
Before this, contributors with a populated .env (e.g. local DeepSeek
configs setting POCKETPAW_POCKET_SPECIALIST_BACKEND=langchain_react
or POCKETPAW_POCKET_SPECIALIST_MAX_VALIDATION_RETRIES=6) saw 4
spurious failures on this file while CI stayed green — confusing
when triaging a PR locally.

The contract these tests measure is "what does the code default to,"
not "what does the operator's machine default to."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(credentials): isolate TestPlaintextMigration from POCKETPAW_LLM_PROVIDER

The env fixture for TestPlaintextMigration didn't strip
POCKETPAW_LLM_PROVIDER, so CI (which exports
POCKETPAW_LLM_PROVIDER=ollama) made
test_loaded_settings_have_migrated_values assert
loaded.llm_provider == "anthropic" against the env-override "ollama"
instead of the migrated config.json value. The test passed locally
without the env var set, fails on CI with it.

Strip POCKETPAW_LLM_PROVIDER via monkeypatch in the shared env
fixture so all six migration tests measure config-file values, not
operator/CI shell exports. (PR #1085 review follow-up.)

* test(tool-bridge): update tool-count contract for specialist function-tool split

This PR's _SPECIALIST_FUNCTION_TOOL_BACKENDS = {deep_agents, google_adk,
openai_agents} injects PocketSpecialistTool as a native function tool
for the function-tool bridge group only. Shell-CLI backends
(opencode, codex_cli, copilot_sdk) dispatch the same capability via
cloud_pocket_specialist_create in _CLOUD_HANDLERS instead, so the
specialist doesn't show in their tool count.

test_tool_count_is_consistent_across_backends used to assert one count
across all non-SDK backends; that contract no longer holds. Updated to
split the backends by integration mode and assert:
  - intra-group consistency in each (any divergence is an accidental
    backend-specific exclusion)
  - function-tool group is exactly cli group + 1 (the specialist tool)

Future accidental drift in either direction still trips the assert.
(PR #1085 review follow-up.)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: prakashUXtech <prakashd88@gmail.com>