Compare commits

1 commit

Author SHA1 Message Date
shivammittal274
e51e2fad90 feat(eval): wire BrowserOS MCP into performance grader
Performance grader now connects to the live BrowserOS the agent just
used (still on the task page during Phase 3 grading) and can verify
state-change claims via read-only mcp__browseros__* tools. System
prompt teaches per-axis usage and caps live calls at 2-3 per task.

Adds mind2web-e2e-perf suite (10 online-mind2web tasks, Bedrock
Opus 4.6) for smoke-testing the new path.
2026-05-05 22:43:41 +05:30
4 changed files with 98 additions and 7 deletions

View File

@@ -0,0 +1,28 @@
+{
+  "id": "mind2web-e2e-perf",
+  "agent": {
+    "type": "single",
+    "provider": "bedrock",
+    "model": "global.anthropic.claude-opus-4-6-v1",
+    "region": "AWS_REGION",
+    "accessKeyId": "AWS_ACCESS_KEY_ID",
+    "secretAccessKey": "AWS_SECRET_ACCESS_KEY",
+    "supportsImages": true
+  },
+  "dataset": "../../data/mind2web_e2e_test.jsonl",
+  "num_workers": 2,
+  "restart_server_per_task": true,
+  "browseros": {
+    "server_url": "http://127.0.0.1:9110",
+    "base_cdp_port": 9010,
+    "base_server_port": 9110,
+    "base_extension_port": 9310,
+    "load_extensions": false,
+    "headless": false
+  },
+  "captcha": {
+    "api_key_env": "NOPECHA_API_KEY"
+  },
+  "graders": ["performance_grader"],
+  "timeout_ms": 600000
+}
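The three `base_*_port` values are spaced 100 apart, presumably so each of the `num_workers` instances can claim its own port triple. A minimal sketch of that assignment — the offset-by-worker-index scheme is an assumption, not something this diff confirms:

```typescript
// Hypothetical per-worker port derivation from the base ports in the config
// above (base_cdp_port=9010, base_server_port=9110, base_extension_port=9310).
// The offset-by-index scheme is an assumption for illustration only.
interface PortAssignment {
  cdpPort: number
  serverPort: number
  extensionPort: number
}

function portsForWorker(workerIndex: number): PortAssignment {
  const BASE_CDP = 9010 // base_cdp_port
  const BASE_SERVER = 9110 // base_server_port
  const BASE_EXT = 9310 // base_extension_port
  return {
    cdpPort: BASE_CDP + workerIndex,
    serverPort: BASE_SERVER + workerIndex,
    extensionPort: BASE_EXT + workerIndex,
  }
}
```

With 100 ports between bases, up to 100 workers could coexist before the CDP range would collide with the server range.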

View File

@@ -41,11 +41,34 @@ export const DEFAULT_AXES: AxisDefinition[] = [
export const PERFORMANCE_SYSTEM_PROMPT = `You are a performance evaluator for a browser automation agent. You will score how well the agent executed a web task across multiple axes.
-## Data Files
+## Data Sources
-You have two data sources in your working directory:
+You have three sources of evidence: the local artifacts (messages.jsonl, screenshots) AND, when available, the **live BrowserOS browser** the agent just used (still on the task page — the run finishes by navigating to about:blank only after grading).
-### 1. messages.jsonl
+### Live browser access (mcp__browseros__*)
+The BrowserOS instance the agent just used is **still running and still on the task page** (the eval pipeline only navigates to about:blank after grading completes). You can inspect that live state via MCP — this is ground truth that no artifact can match.
+Available tools (READ-ONLY — never click, type, or navigate):
+- \`mcp__browseros__get_active_page\` — current URL + title. Cheap; call first to confirm the page hasn't changed.
+- \`mcp__browseros__list_pages\` — all open tabs (catches multi-tab tasks).
+- \`mcp__browseros__get_page_content\` — page as clean markdown. Best for reading prose, prices, lists.
+- \`mcp__browseros__get_page_links\` — all links on the page (verify the agent actually navigated where it claimed).
+- \`mcp__browseros__take_snapshot\` — interactive-element snapshot (verify form fields, buttons in their final state).
+- \`mcp__browseros__get_dom\` / \`mcp__browseros__search_dom\` — DOM inspection for specific selectors/strings.
+- \`mcp__browseros__take_screenshot\` — fresh screenshot of current state. More reliable than the last numbered screenshot if the agent's final action didn't trigger a capture.
+- \`mcp__browseros__get_console_logs\` — runtime errors the agent may have missed.
+**When to use the live browser (per axis):**
+- **task_completion** — the highest-value use. If the agent claims "submitted the form" or "added X to cart", call \`get_active_page\` (correct URL?) and \`get_page_content\` or \`take_snapshot\` (success state visible? cart shows the item?). If the answer cites specific data, \`search_dom\` for that value confirms it's actually present on the final page.
+- **error_recovery** — \`get_console_logs\` reveals runtime errors the agent didn't surface. A "completed" run with red console errors is suspicious.
+- **efficiency** — usually unnecessary; messages.jsonl already shows the call sequence.
+- **reasoning_quality / speed / autonomy** — usually unnecessary; derive from the message stream.
+**Budget:** prefer artifacts first. Reach for MCP only when artifacts are inconclusive (blurry screenshot, claim not in DOM logs, ambiguous final state, or you need to confirm a state-changing claim). Cap yourself at ~2-3 MCP calls per task. Never use MCP to drive the browser — these are verification reads only.
+### Local artifacts
+#### messages.jsonl
The raw event stream — one JSON object per line with a "type" field.
**Event types you care about:**
@@ -56,7 +79,7 @@ The raw event stream — one JSON object per line with a "type" field.
**Event types to handle carefully:**
- "tool-output-available" — Tool output. The "output" field contains FULL PAGE DOM CONTENT — hundreds of interactive elements, entire page text, etc. These lines are 5-50KB each. NEVER read them in bulk. However, you CAN and SHOULD use Grep to search within these lines for specific keywords when screenshots alone can't verify a claim. For example, if the task asks "find the price of X" and the screenshot is unclear, grep messages.jsonl for the product name or price value to confirm the agent actually saw it in the DOM.
-### 2. screenshots/ directory
+#### screenshots/ directory
Numbered PNG screenshots (1.png, 2.png, ...) captured after each tool execution.
## Browser Tool Reference
@@ -102,6 +125,13 @@ When the agent's final answer contains specific data (prices, names, dates, coun
- Task asks "extract the email address" → grep for the email pattern
This is the most reliable way to verify whether the agent actually found the data it claims, since screenshots may be blurry, truncated, or missing the relevant section.
+**Step 5: Cross-check against the live browser (when artifacts are inconclusive)**
+If the answer relies on a side-effect ("submitted", "added to cart", "logged in", "filled the form") OR if Step 4 grep can't find the claimed value, fall through to mcp__browseros__ tools. Typical pattern:
+1. \`mcp__browseros__get_active_page\` — does the URL match the expected post-action page?
+2. \`mcp__browseros__get_page_content\` or \`mcp__browseros__search_dom\` — is the success indicator (confirmation message, cart item, updated value) actually present?
+3. If suspicious, \`mcp__browseros__get_console_logs\` to spot silent failures.
+Stop after 2-3 calls — this is verification, not exploration.
## How to View Screenshots
You have {screenshot_count} screenshots. View 3-5 strategically:
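Stripped of the MCP transport, the Step 5 fall-through reduces to two string checks. A minimal sketch, with the `page` object standing in for the combined output of `get_active_page` and `get_page_content` (this wiring is hypothetical, not part of the diff):

```typescript
// Sketch of the Step 5 verification decision. `page` stands in for what the
// grader would read back via mcp__browseros__get_active_page (url) and
// mcp__browseros__get_page_content (content); the MCP calls themselves are
// out of scope here.
type VerifyOutcome = 'verified' | 'url-mismatch' | 'indicator-missing'

function verifyClaim(
  expectedUrlPart: string, // e.g. the post-action path the agent should be on
  successIndicator: string, // e.g. "Order confirmed" or the claimed cart item
  page: { url: string; content: string },
): VerifyOutcome {
  // Check 1: did the agent end up on the expected post-action page?
  if (!page.url.includes(expectedUrlPart)) return 'url-mismatch'
  // Check 2: is the success indicator actually present in the final page?
  if (!page.content.includes(successIndicator)) return 'indicator-missing'
  return 'verified'
}
```

Either non-`verified` outcome would be grounds to discount the agent's claim on the task_completion axis, or to spend one more call on `get_console_logs`.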

View File

@@ -83,6 +83,7 @@ export class PerformanceGrader implements Grader {
systemPrompt,
userPrompt,
input.outputDir,
+input.mcpUrl,
)
if (response) {
await writeGraderJsonArtifact(
@@ -185,11 +186,39 @@ export class PerformanceGrader implements Grader {
systemPrompt: string,
userPrompt: string,
outputDir: string,
+mcpUrl?: string,
): Promise<AgentResult | null> {
const taskId = outputDir.split('/').pop() ?? outputDir
-console.log(`Perf grader ${taskId}: Starting (model=${this.model})`)
+console.log(
+  `Perf grader ${taskId}: Starting (model=${this.model}, mcp=${mcpUrl ? 'on' : 'off'})`,
+)
const startMs = Date.now()
+const allowedTools = ['Read', 'Glob', 'Grep']
+const mcpServers: Record<
+  string,
+  { type: 'http'; url: string; headers?: Record<string, string> }
+> = {}
+if (mcpUrl) {
+  mcpServers.browseros = {
+    type: 'http',
+    url: mcpUrl,
+    headers: { 'X-BrowserOS-Source': 'sdk-internal' },
+  }
+  // Read-only inspection tools — let the grader verify claims against live browser state.
+  allowedTools.push(
+    'mcp__browseros__get_active_page',
+    'mcp__browseros__list_pages',
+    'mcp__browseros__get_page_content',
+    'mcp__browseros__get_page_links',
+    'mcp__browseros__take_screenshot',
+    'mcp__browseros__take_snapshot',
+    'mcp__browseros__get_dom',
+    'mcp__browseros__search_dom',
+    'mcp__browseros__get_console_logs',
+  )
+}
const agentPromise = (async (): Promise<AgentResult | null> => {
let result: AgentResult | null = null
let messageCount = 0
@@ -200,7 +229,8 @@ export class PerformanceGrader implements Grader {
model: this.model,
cwd: outputDir,
systemPrompt,
-allowedTools: ['Read', 'Glob', 'Grep'],
+allowedTools,
+mcpServers,
permissionMode: 'bypassPermissions',
allowDangerouslySkipPermissions: true,
maxTurns: this.maxTurns,
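Extracted as a standalone helper, the config assembly in the hunk above looks like this (a sketch that mirrors the diff; the surrounding `query()` wiring and SDK types are omitted):

```typescript
// Mirrors the grader's mcpUrl branch: with no URL, the grader is restricted
// to local-artifact tools; with a URL, it additionally gets the nine
// read-only mcp__browseros__* inspection tools over an HTTP MCP server.
type McpHttpServer = { type: 'http'; url: string; headers?: Record<string, string> }

function buildGraderToolConfig(mcpUrl?: string): {
  allowedTools: string[]
  mcpServers: Record<string, McpHttpServer>
} {
  const allowedTools = ['Read', 'Glob', 'Grep']
  const mcpServers: Record<string, McpHttpServer> = {}
  if (mcpUrl) {
    mcpServers.browseros = {
      type: 'http',
      url: mcpUrl,
      headers: { 'X-BrowserOS-Source': 'sdk-internal' },
    }
    allowedTools.push(
      'mcp__browseros__get_active_page',
      'mcp__browseros__list_pages',
      'mcp__browseros__get_page_content',
      'mcp__browseros__get_page_links',
      'mcp__browseros__take_screenshot',
      'mcp__browseros__take_snapshot',
      'mcp__browseros__get_dom',
      'mcp__browseros__search_dom',
      'mcp__browseros__get_console_logs',
    )
  }
  return { allowedTools, mcpServers }
}
```

Keeping the branch additive means the no-MCP path is byte-for-byte the old behavior, which is why the suite can run safely against servers that don't expose an MCP endpoint.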

View File

@@ -163,7 +163,10 @@ export class TaskRunPipeline {
// Phase 2: Execute agent
const agentResult = await this.executeAgent(task, pageId)
-// Phase 3: Run graders
+// Phase 3: Run graders.
+// The browser is intentionally still on the task page here — graders
+// (e.g. PerformanceGrader) may inspect live browser state via MCP for
+// claim verification. Do not move the about:blank cleanup above this.
const graderResults = await this.runGraders(
task,
agentResult,
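The ordering constraint that comment guards can be sketched as a synchronous skeleton (phase names taken from the comments; everything else is hypothetical, the real pipeline is async):

```typescript
// Skeleton of the pipeline's phase ordering: graders run while the browser
// still shows the task page; navigation to about:blank happens strictly after
// grading. The function names are illustrative, not the real pipeline API.
function runTaskPhases(phases: {
  executeAgent: () => string
  runGraders: (agentResult: string) => string[]
  navigateToBlank: () => void
}): string[] {
  const agentResult = phases.executeAgent() // Phase 2: Execute agent
  const graderResults = phases.runGraders(agentResult) // Phase 3: browser untouched
  phases.navigateToBlank() // cleanup only after grading completes
  return graderResults
}
```

Swapping the last two calls is exactly the regression the comment warns about: the grader's MCP reads would then see about:blank instead of the task's final state.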