Braintrust basic evals (#87)

* implement validator eval

* implement online eval foundation

* further implementing online evals

* enhance evaluation data logging

* implement LLM scoring, remove redundant EventEnricher

* cleanup

* fix build errs from merging, extend LLM scorer context

* settled evaluation framework

* update evals documentation

* fix evals screenshots

* fix typos

* Evals config moved to env variables and tested

* test

* Update manifest to 49.1

* Removed duplicate + button

* Just use previous way of registering tools as that is not required for evals

* Add claude commands for research, plan and implement

* evals2 research and plan

implementation plan

new implementation plan

* Evals2 implementation

test test

* Removed old eval hooks

Remove old evals hooks

* evals 2 added to env

* Eval2 enhancement plan

backup

* Make Braintrust project configurable

Make Braintrust project configurable

* Enhanced scorer -- using Gemini 2.5 pro for evaluation

backup v0.1

enhancement v0.2

v0.2

backup v0.3

backup v0.4

* Deleted old evals directory

* Clean up old evals code

* Bunch of fixes and improvements

backup

fixes 0.1

more fixes

fixes

more elaborate prompts

braintrust logger fix

* Renamed files

backup
This commit is contained in:
Felarof
2025-09-05 18:04:07 -07:00
committed by GitHub
parent 5def144110
commit 1abbee638a
40 changed files with 5874 additions and 1947 deletions

View File

@@ -1,2 +1,20 @@
POSTHOG_API_KEY=
KLAVIS_API_KEY=
LITELLM_API_KEY=""
POSTHOG_API_KEY=""
KLAVIS_API_KEY=""
# Braintrust Telemetry Configuration
ENABLE_TELEMETRY=false
BRAINTRUST_API_KEY=""
BRAINTRUST_PROJECT_UUID=""
BRAINTRUST_PROJECT_NAME="browseros-agent-online"
# OpenAI Configuration for Scoring
OPENAI_API_KEY_FOR_SCORING=""
OPENAI_MODEL_FOR_SCORING="gpt-4o"
# Simplified Evals2 System
ENABLE_EVALS2=false
# Gemini API keys for evals2 scoring
GOOGLE_GENAI_API_KEY=""
GEMINI_API_KEY=""

1
.gitignore vendored
View File

@@ -1,7 +1,6 @@
# Dependencies
node_modules/
bak/
screenshots/
**/__test_output__/
# docs

View File

@@ -0,0 +1,400 @@
# Evals2 Implementation Documentation
## Overview
Evals2 is a simplified evaluation framework for the Nxtscape browser automation system. It represents a complete rewrite of the original evaluation system, achieving a 75% reduction in code complexity (500 lines vs 2000+) while maintaining full functionality.
## Architecture
### Core Components
The evals2 system consists of four main components:
```
┌──────────────────────────────────────────────────────────┐
│                         NxtScape                         │
│           (Session Lifecycle & Scoring Trigger)          │
└────────────────────────────┬─────────────────────────────┘
┌────────────────────────────▼─────────────────────────────┐
│                       BrowserAgent                       │
│            (Tool Wrapping & Metrics Collection)          │
└────────────────────────────┬─────────────────────────────┘
         ┌───────────────────┼────────────────────┐
         ▼                   ▼                    ▼
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│SimpleToolWrapper│ │ SimplifiedScorer │ │ SimpleBraintrust│
│                 │ │                  │ │  EventManager   │
│    Duration     │ │   4-Dimension    │ │                 │
│    Tracking     │ │  Scoring Engine  │ │  Session Mgmt   │
└─────────────────┘ └──────────────────┘ └─────────────────┘
                    ┌────────▼────────┐
                    │ SimpleBraintrust│
                    │     Logger      │
                    │                 │
                    │ Score Reporting │
                    └─────────────────┘
```
### Component Details
#### 1. SimpleToolWrapper (`src/evals2/SimpleToolWrapper.ts`)
- **Purpose**: Lightweight tool duration tracking
- **Implementation**: Uses Map-based storage in ExecutionContext.toolMetrics
- **Performance**: ~1ms overhead per tool call
- **Key Methods**:
- `wrapTool()`: Wraps a tool with start/end timing logic
- Stores metrics as `{toolName, startTime, endTime}` in Map
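A minimal sketch of this kind of Map-based duration tracking (the function and variable names here are illustrative, not the actual `SimpleToolWrapper` code):
```typescript
// Illustrative sketch of Map-based tool duration tracking.
interface ToolMetric {
  toolName: string;
  startTime: number;
  endTime: number;
}

type ToolFn = (input: unknown) => Promise<unknown>;

function wrapToolFn(
  toolName: string,
  fn: ToolFn,
  toolMetrics: Map<string, ToolMetric>
): ToolFn {
  return async (input: unknown) => {
    const startTime = Date.now();
    try {
      return await fn(input);
    } finally {
      // Record the duration even if the tool throws; the only overhead is one Map.set per call
      toolMetrics.set(`${toolName}:${startTime}`, {
        toolName,
        startTime,
        endTime: Date.now()
      });
    }
  };
}
```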
#### 2. SimplifiedScorer (`src/evals2/SimplifiedScorer.ts`)
- **Purpose**: Multi-dimensional scoring of agent performance
- **Scoring Dimensions**:
- Goal Completion (40%): Task achievement assessment
- Plan Correctness (30%): Execution efficiency evaluation
- Error-Free Execution (15%): Error handling quality
- Context Efficiency (15%): Token usage optimization
- **Features**:
- LLM-based scoring with GPT-4o-mini (when available)
- Heuristic fallback for offline/no-API scenarios
- Returns structured scores with explanations
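The heuristic path can only use what is already in memory: the message history and the `toolMetrics` Map. A purely illustrative example of such heuristics (not the actual `SimplifiedScorer` logic):
```typescript
// Illustrative heuristics: derive rough 0-10 scores from locally available data.
interface ToolMetric { toolName: string; startTime: number; endTime: number }

function heuristicScores(messages: string[], toolMetrics: Map<string, ToolMetric>) {
  // Error-free execution: penalize messages that look like errors
  const errorCount = messages.filter(m => /error|failed/i.test(m)).length;
  const errorFreeExecution = Math.max(0, 10 - errorCount * 2);

  // Context efficiency: fewer/shorter messages score higher
  const totalChars = messages.reduce((sum, m) => sum + m.length, 0);
  const contextEfficiency = Math.max(0, 10 - totalChars / 5000);

  // Plan correctness proxy: fewer tool calls for the same outcome scores higher
  const planCorrectness = Math.max(0, 10 - Math.max(0, toolMetrics.size - 5));

  return { errorFreeExecution, contextEfficiency, planCorrectness };
}
```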
#### 3. SimpleBraintrustEventManager (`src/evals2/SimpleBraintrustEventManager.ts`)
- **Purpose**: Session lifecycle management
- **Key Features**:
- Parent span creation for conversation sessions
- Lazy loading of Braintrust SDK
- Graceful handling of missing API keys
- Session ID tracking
#### 4. SimpleBraintrustLogger (`src/evals2/SimpleBraintrustLogger.ts`)
- **Purpose**: Score reporting to Braintrust platform
- **Implementation**:
- Uploads scores as child spans
- Includes metadata (model, prompts, metrics)
- Handles connection failures gracefully
## Execution Flow
### 1. Session Initialization
```typescript
// In NxtScape.run()
if (process.env.ENABLE_EVALS2 === 'true') {
  await SimpleBraintrustEventManager.startConversationSession({
    sessionId: executionContext.sessionId,
    userId: 'user',
    initialMessage: userMessage
  });
}
```
### 2. Tool Wrapping
```typescript
// In BrowserAgent.bindToolsToLLM()
if (process.env.ENABLE_EVALS2 === 'true') {
  const wrappedTools = tools.map(tool =>
    SimpleToolWrapper.wrapTool(tool, this.executionContext)
  );
}
```
### 3. Metrics Collection
During execution, tool durations are automatically collected:
```typescript
// Stored in ExecutionContext.toolMetrics Map
Map<string, {
  toolName: string;
  startTime: number;
  endTime: number;
}>
```
### 4. Scoring After Task
```typescript
// In NxtScape.run() after agent.execute()
const scores = await SimplifiedScorer.scoreMessages({
  messages: executionContext.messageManager.messages,
  toolMetrics: executionContext.toolMetrics,
  userMessage: userMessage,
  finalResponse: result
});
```
### 5. Score Reporting
```typescript
await SimpleBraintrustLogger.logScores({
  scores,
  metadata: {
    model: llmSettings.model,
    provider: llmSettings.provider,
    sessionId: executionContext.sessionId
  },
  parentSpan: SimpleBraintrustEventManager.getParentSpan()
});
```
## Scoring Methodology
### Four-Dimension Scoring System
1. **Goal Completion (40% weight)**
- Evaluates if the agent achieved the user's requested task
- Scored 0-10 based on completion level
- Considers partial completions and alternative solutions
2. **Plan Correctness (30% weight)**
- Assesses the efficiency of the execution plan
- Evaluates tool selection and sequencing
- Penalizes unnecessary steps or redundant actions
3. **Error-Free Execution (15% weight)**
- Tracks error handling and recovery
- Scores based on error frequency and severity
- Rewards graceful degradation
4. **Context Efficiency (15% weight)**
- Measures token usage optimization
- Evaluates message conciseness
- Rewards efficient context management
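With these weights, the overall score is simply a weighted average of the four dimensions. A small sketch of that arithmetic (field names follow the `ScoreResult` interface in the API reference; the helper itself is illustrative):
```typescript
// Combine the four 0-10 dimension scores into an overall score using the
// documented weights: 40% / 30% / 15% / 15%.
interface DimensionScores {
  goalCompletion: number;      // 0-10
  planCorrectness: number;     // 0-10
  errorFreeExecution: number;  // 0-10
  contextEfficiency: number;   // 0-10
}

function computeOverallScore(s: DimensionScores): number {
  return (
    s.goalCompletion * 0.4 +
    s.planCorrectness * 0.3 +
    s.errorFreeExecution * 0.15 +
    s.contextEfficiency * 0.15
  );
}

// Example: 9, 7, 10, 8 → 3.6 + 2.1 + 1.5 + 1.2 = 8.4
```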
### Scoring Implementation
```typescript
// LLM-based scoring (preferred)
if (process.env.OPENAI_MODEL_FOR_SCORING) {
  const llmScore = await this.scoreWithLLM(messages, userMessage);
  return llmScore;
}

// Heuristic fallback
return this.scoreWithHeuristics(messages, toolMetrics);
```
## Configuration
### Environment Variables
```bash
# Enable evals2 system
ENABLE_EVALS2=true
# Braintrust API key for reporting
BRAINTRUST_API_KEY=your-braintrust-api-key
# Optional: OpenAI model for scoring
OPENAI_MODEL_FOR_SCORING=gpt-4o-mini
# Optional: OpenAI API key (if different from main)
OPENAI_API_KEY=your-openai-api-key
```
### Integration Points
The system requires minimal integration with only two hooks:
1. **NxtScape** (`src/lib/core/NxtScape.ts`):
- Session start/end lifecycle
- Scoring trigger after task completion
2. **BrowserAgent** (`src/lib/agent/BrowserAgent.ts`):
- Tool wrapping for metrics collection
## Key Improvements from V1
### Code Simplification
- **75% reduction** in codebase size (500 lines vs 2000+)
- Removed complex span tree management
- Simplified to Map-based tracking
### Performance
- **~1ms overhead** per tool call (vs 10-20ms in v1)
- Map lookups instead of span traversal
- Lazy loading of dependencies
### Reliability
- **Graceful degradation** when APIs unavailable
- Works offline with heuristic scoring
- No blocking operations
### Maintainability
- Clear separation of concerns
- Testable components
- Minimal coupling with main codebase
## Usage Examples
### Basic Usage
```typescript
// Automatic - just set environment variable
process.env.ENABLE_EVALS2 = 'true';
// The system will automatically:
// 1. Track all tool executions
// 2. Score after each task
// 3. Report to Braintrust (if configured)
```
### Programmatic Access
```typescript
// Access scores directly
const scores = await SimplifiedScorer.scoreMessages({
  messages: messageHistory,
  toolMetrics: toolMetricsMap,
  userMessage: "Book a flight to Paris",
  finalResponse: agentResponse
});
console.log(`Goal Completion: ${scores.goalCompletion}/10`);
console.log(`Overall Score: ${scores.overallScore}/10`);
```
### Custom Tool Wrapping
```typescript
// Wrap a custom tool
const wrappedTool = SimpleToolWrapper.wrapTool(
  myCustomTool,
  executionContext
);
// Metrics automatically collected in executionContext.toolMetrics
```
## Testing
### Unit Tests
```bash
# Run evals2 specific tests
npm test -- src/evals2/
# Test individual components
npm test -- SimplifiedScorer.test.ts
```
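A hypothetical vitest sketch for the heuristic path (no API keys required). The import path and class export are assumptions based on the file layout above; the call shape follows the API reference:
```typescript
// SimplifiedScorer.test.ts (sketch) -- exercises scoring without any LLM call.
import { describe, it, expect } from 'vitest';
import { SimplifiedScorer } from '@/evals2/SimplifiedScorer';

describe('SimplifiedScorer', () => {
  it('returns scores in the 0-10 range', async () => {
    const scores = await SimplifiedScorer.scoreMessages({
      messages: [],
      toolMetrics: new Map(),
      userMessage: 'Open example.com',
      finalResponse: 'Opened example.com'
    });
    expect(scores.overallScore).toBeGreaterThanOrEqual(0);
    expect(scores.overallScore).toBeLessThanOrEqual(10);
  });
});
```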
### Integration Testing
```bash
# Enable evals2 and run full integration
ENABLE_EVALS2=true npm test -- integration/
```
## Monitoring & Debugging
### Debug Output
```typescript
// Enable debug logging
process.env.DEBUG_EVALS2 = 'true';
// Logs will show:
// - Tool wrapping events
// - Scoring calculations
// - Braintrust upload status
```
### Metrics Access
```typescript
// Access raw metrics during execution
const metrics = executionContext.toolMetrics;
metrics.forEach((metric, id) => {
  console.log(`Tool: ${metric.toolName}`);
  console.log(`Duration: ${metric.endTime - metric.startTime}ms`);
});
```
## Future Improvements
### Planned Enhancements
1. **Real-time scoring** - Score during execution, not just after
2. **Custom scoring dimensions** - Allow user-defined scoring criteria
3. **Batch uploading** - Aggregate scores before uploading
4. **Local storage** - Cache scores locally for offline analysis
### Open Questions
1. Should scoring be synchronous or async with the main flow?
2. How to handle multi-turn conversations vs single tasks?
3. Should we support custom scoring providers beyond OpenAI?
4. How to visualize scores in the UI?
## Troubleshooting
### Common Issues
**Evals2 not running:**
- Check `ENABLE_EVALS2=true` is set
- Verify environment variables are loaded
**Scores not uploading:**
- Verify `BRAINTRUST_API_KEY` is valid
- Check network connectivity
- Look for error logs in console
**LLM scoring failing:**
- Verify `OPENAI_MODEL_FOR_SCORING` is set
- Check OpenAI API key and quota
- System falls back to heuristics automatically
**High overhead:**
- Check for duplicate tool wrapping
- Verify Maps are being cleared after sessions
- Monitor memory usage
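If metrics accumulate across sessions, clearing the Map once scores have been computed and reported is the simplest mitigation (a sketch, assuming the session is fully done with its metrics):
```typescript
// Release per-session metrics after scoring so long-lived contexts don't grow unbounded
executionContext.toolMetrics.clear();
```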
## API Reference
### SimplifiedScorer
```typescript
interface ScoreResult {
  goalCompletion: number;      // 0-10
  planCorrectness: number;     // 0-10
  errorFreeExecution: number;  // 0-10
  contextEfficiency: number;   // 0-10
  overallScore: number;        // Weighted average
  explanation?: string;        // LLM reasoning
}

class SimplifiedScorer {
  static async scoreMessages(params: {
    messages: Message[];
    toolMetrics: Map<string, ToolMetric>;
    userMessage: string;
    finalResponse: any;
  }): Promise<ScoreResult>;
}
```
### SimpleToolWrapper
```typescript
class SimpleToolWrapper {
  static wrapTool(
    tool: DynamicStructuredTool,
    executionContext: ExecutionContext
  ): DynamicStructuredTool;
}
```
### SimpleBraintrustEventManager
```typescript
class SimpleBraintrustEventManager {
  static async startConversationSession(params: {
    sessionId: string;
    userId: string;
    initialMessage: string;
  }): Promise<void>;

  static async endConversationSession(): Promise<void>;
  static getParentSpan(): any;
}
```
### SimpleBraintrustLogger
```typescript
class SimpleBraintrustLogger {
  static async logScores(params: {
    scores: ScoreResult;
    metadata: any;
    parentSpan?: any;
  }): Promise<void>;
}
```
## Conclusion
Evals2 represents a significant improvement in evaluation system design, prioritizing simplicity, performance, and reliability. The system's modular architecture and minimal integration requirements make it easy to maintain and extend while providing comprehensive evaluation capabilities for the Nxtscape browser automation system.

View File

@@ -1,7 +1,7 @@
{
"manifest_version": 3,
"name": "Agent",
"version": "49.0.0.26",
"version": "49.1.0.26",
"description": "Agent",
"key": "MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAs1zULZz5eE0U8SEjr/R++dlx6WKFj7GbpnBiE1n17gaylMWDlw6uuBJNjcRrSGwOt53Z3PKf2T3g5DtNES8q6rQc11P/y8J8GKhKuqGrtRJyk5iXzcKJk4CHz6leFSMt8CsZY0r0b7wCZ5QuhomTHGQpNWNS0c13xfVqWt4dncfIRj7fMzfTkicq7Mqqx+JcdprLkiVfETvdkMwwEWmSNwQ6nCDzLtTbyyMiGUEBSJs+WlP1fO7LIX0sHesFVxfPhCZ2K4F1biwenbRL+YYD60ogpVppop2ee/W3D211IN1zYxgnhycFv3m8TrzG+MD/IZgcu13u0bHRn3V7IGW1iwIDAQAB",
"permissions": [
@@ -51,4 +51,4 @@
"48": "assets/icon48.png",
"128": "assets/icon128.png"
}
}
}

156
package-lock.json generated
View File

@@ -32,7 +32,7 @@
"markdown-to-jsx": "^7.7.12",
"match-sorter": "^6.3.4",
"ollama": "^0.5.16",
"openai": "^4.98.0",
"openai": "^5.15.0",
"posthog-js": "^1.252.0",
"react": "^18.2.0",
"react-dom": "^18.2.0",
@@ -1733,25 +1733,6 @@
"@langchain/core": ">=0.3.58 <0.4.0"
}
},
"node_modules/@langchain/community/node_modules/@langchain/openai/node_modules/openai": {
"version": "5.10.1",
"license": "Apache-2.0",
"bin": {
"openai": "bin/cli"
},
"peerDependencies": {
"ws": "^8.18.0",
"zod": "^3.23.8"
},
"peerDependenciesMeta": {
"ws": {
"optional": true
},
"zod": {
"optional": true
}
}
},
"node_modules/@langchain/community/node_modules/uuid": {
"version": "10.0.0",
"resolved": "https://registry.npmjs.org/uuid/-/uuid-10.0.0.tgz",
@@ -1977,25 +1958,6 @@
"@langchain/core": ">=0.3.58 <0.4.0"
}
},
"node_modules/@langchain/openai/node_modules/openai": {
"version": "5.10.1",
"license": "Apache-2.0",
"bin": {
"openai": "bin/cli"
},
"peerDependencies": {
"ws": "^8.18.0",
"zod": "^3.23.8"
},
"peerDependenciesMeta": {
"ws": {
"optional": true
},
"zod": {
"optional": true
}
}
},
"node_modules/@langchain/textsplitters": {
"version": "0.1.0",
"resolved": "https://registry.npmjs.org/@langchain/textsplitters/-/textsplitters-0.1.0.tgz",
@@ -5033,6 +4995,51 @@
"zod-to-json-schema": "^3.22.5"
}
},
"node_modules/autoevals/node_modules/@types/node": {
"version": "18.19.123",
"resolved": "https://registry.npmjs.org/@types/node/-/node-18.19.123.tgz",
"integrity": "sha512-K7DIaHnh0mzVxreCR9qwgNxp3MH9dltPNIEddW9MYUlcKAzm+3grKNSTe2vCJHI1FaLpvpL5JGJrz1UZDKYvDg==",
"license": "MIT",
"dependencies": {
"undici-types": "~5.26.4"
}
},
"node_modules/autoevals/node_modules/openai": {
"version": "4.104.0",
"resolved": "https://registry.npmjs.org/openai/-/openai-4.104.0.tgz",
"integrity": "sha512-p99EFNsA/yX6UhVO93f5kJsDRLAg+CTA2RBqdHK4RtK8u5IJw32Hyb2dTGKbnnFmnuoBv5r7Z2CURI9sGZpSuA==",
"license": "Apache-2.0",
"dependencies": {
"@types/node": "^18.11.18",
"@types/node-fetch": "^2.6.4",
"abort-controller": "^3.0.0",
"agentkeepalive": "^4.2.1",
"form-data-encoder": "1.7.2",
"formdata-node": "^4.3.2",
"node-fetch": "^2.6.7"
},
"bin": {
"openai": "bin/cli"
},
"peerDependencies": {
"ws": "^8.18.0",
"zod": "^3.23.8"
},
"peerDependenciesMeta": {
"ws": {
"optional": true
},
"zod": {
"optional": true
}
}
},
"node_modules/autoevals/node_modules/undici-types": {
"version": "5.26.5",
"resolved": "https://registry.npmjs.org/undici-types/-/undici-types-5.26.5.tgz",
"integrity": "sha512-JlCMO+ehdEIKqlFxk6IfVoAUVmgz7cU7zD/h9XZ0qzeosSHmUJVOzSQvvYSYWXkFXC+IfLKSIffhv0sVZup6pA==",
"license": "MIT"
},
"node_modules/autoprefixer": {
"version": "10.4.21",
"resolved": "https://registry.npmjs.org/autoprefixer/-/autoprefixer-10.4.21.tgz",
@@ -6710,6 +6717,10 @@
"integrity": "sha512-uWjbaKIK3T1OSVptzX7Nl6PvQ3qAGtKEtVRjRuazjfL3Bx5eI409VZSqgND+4UNnmzLVdPj9FqFJNPqBZFve4w==",
"deprecated": "Rimraf versions prior to v4 are no longer supported",
"dev": true,
<<<<<<< HEAD
"license": "ISC",
=======
>>>>>>> main
"dependencies": {
"glob": "^7.1.3"
},
@@ -9830,25 +9841,6 @@
"@langchain/core": ">=0.3.58 <0.4.0"
}
},
"node_modules/langchain/node_modules/openai": {
"version": "5.10.1",
"license": "Apache-2.0",
"bin": {
"openai": "bin/cli"
},
"peerDependencies": {
"ws": "^8.18.0",
"zod": "^3.23.8"
},
"peerDependenciesMeta": {
"ws": {
"optional": true
},
"zod": {
"optional": true
}
}
},
"node_modules/langchain/node_modules/uuid": {
"version": "10.0.0",
"resolved": "https://registry.npmjs.org/uuid/-/uuid-10.0.0.tgz",
@@ -11337,19 +11329,10 @@
}
},
"node_modules/openai": {
"version": "4.104.0",
"resolved": "https://registry.npmjs.org/openai/-/openai-4.104.0.tgz",
"integrity": "sha512-p99EFNsA/yX6UhVO93f5kJsDRLAg+CTA2RBqdHK4RtK8u5IJw32Hyb2dTGKbnnFmnuoBv5r7Z2CURI9sGZpSuA==",
"version": "5.15.0",
"resolved": "https://registry.npmjs.org/openai/-/openai-5.15.0.tgz",
"integrity": "sha512-kcUdws8K/A8m02I+IqFBwO51gS+87GP89yWEufGbzEi8anBz4FB/bti2QxaJdGwwY4mwJGzx85XO7TuL/Tpu1w==",
"license": "Apache-2.0",
"dependencies": {
"@types/node": "^18.11.18",
"@types/node-fetch": "^2.6.4",
"abort-controller": "^3.0.0",
"agentkeepalive": "^4.2.1",
"form-data-encoder": "1.7.2",
"formdata-node": "^4.3.2",
"node-fetch": "^2.6.7"
},
"bin": {
"openai": "bin/cli"
},
@@ -11366,19 +11349,6 @@
}
}
},
"node_modules/openai/node_modules/@types/node": {
"version": "18.19.120",
"license": "MIT",
"dependencies": {
"undici-types": "~5.26.4"
}
},
"node_modules/openai/node_modules/undici-types": {
"version": "5.26.5",
"resolved": "https://registry.npmjs.org/undici-types/-/undici-types-5.26.5.tgz",
"integrity": "sha512-JlCMO+ehdEIKqlFxk6IfVoAUVmgz7cU7zD/h9XZ0qzeosSHmUJVOzSQvvYSYWXkFXC+IfLKSIffhv0sVZup6pA==",
"license": "MIT"
},
"node_modules/openapi-types": {
"version": "12.1.3",
"resolved": "https://registry.npmjs.org/openapi-types/-/openapi-types-12.1.3.tgz",
@@ -13260,6 +13230,10 @@
"resolved": "https://registry.npmjs.org/glob/-/glob-11.0.3.tgz",
"integrity": "sha512-2Nim7dha1KVkaiF4q6Dj+ngPPMdfvLJEOpZk/jKiUAkqKebpGAWQXAq9z1xu9HKu5lWfqw/FASuccEjyznjPaA==",
"dev": true,
<<<<<<< HEAD
"license": "ISC",
=======
>>>>>>> main
"dependencies": {
"foreground-child": "^3.3.1",
"jackspeak": "^4.1.1",
@@ -13283,6 +13257,10 @@
"resolved": "https://registry.npmjs.org/jackspeak/-/jackspeak-4.1.1.tgz",
"integrity": "sha512-zptv57P3GpL+O0I7VdMJNBZCu+BPHVQUk55Ft8/QCJjTVxrnJHuVuX/0Bl2A6/+2oyR/ZMEuFKwmzqqZ/U5nPQ==",
"dev": true,
<<<<<<< HEAD
"license": "BlueOak-1.0.0",
=======
>>>>>>> main
"dependencies": {
"@isaacs/cliui": "^8.0.2"
},
@@ -13298,6 +13276,10 @@
"resolved": "https://registry.npmjs.org/lru-cache/-/lru-cache-11.1.0.tgz",
"integrity": "sha512-QIXZUBJUx+2zHUdQujWejBkcD9+cs94tLn0+YL8UrCh+D5sCXZ4c7LaEH48pNwRY3MLDgqUFyhlCyjJPf1WP0A==",
"dev": true,
<<<<<<< HEAD
"license": "ISC",
=======
>>>>>>> main
"engines": {
"node": "20 || >=22"
}
@@ -13307,6 +13289,10 @@
"resolved": "https://registry.npmjs.org/minimatch/-/minimatch-10.0.3.tgz",
"integrity": "sha512-IPZ167aShDZZUMdRk66cyQAW3qr0WzbHkPdMYa8bzZhlHhO3jALbKdxcaak7W9FfT2rZNpQuUu4Od7ILEpXSaw==",
"dev": true,
<<<<<<< HEAD
"license": "ISC",
=======
>>>>>>> main
"dependencies": {
"@isaacs/brace-expansion": "^5.0.0"
},
@@ -13322,6 +13308,10 @@
"resolved": "https://registry.npmjs.org/path-scurry/-/path-scurry-2.0.0.tgz",
"integrity": "sha512-ypGJsmGtdXUOeM5u93TyeIEfEhM6s+ljAhrk5vAvSx8uyY/02OvrZnA0YNGUrPXfpJMgI1ODd3nwz8Npx4O4cg==",
"dev": true,
<<<<<<< HEAD
"license": "BlueOak-1.0.0",
=======
>>>>>>> main
"dependencies": {
"lru-cache": "^11.0.0",
"minipass": "^7.1.2"

View File

@@ -17,10 +17,7 @@
"test:run": "vitest run",
"test:watch": "vitest --watch",
"test:coverage": "vitest run --coverage",
"test:ui": "vitest --ui",
"eval:planner": "tsx src/evals/planner-llm.eval.ts",
"eval:validator": "tsx src/evals/validator-llm.eval.ts",
"extract:prompts": "tsx src/evals/push-prompts.ts"
"test:ui": "vitest --ui"
},
"author": "",
"license": "MIT",
@@ -39,7 +36,7 @@
"@types/uuid": "^10.0.0",
"autoevals": "^0.0.130",
"axios": "^1.9.0",
"braintrust": "^0.2.4",
"braintrust": "^0.3.6",
"class-variance-authority": "^0.7.1",
"clsx": "^2.1.1",
"dotenv": "^16.3.1",
@@ -48,7 +45,7 @@
"markdown-to-jsx": "^7.7.12",
"match-sorter": "^6.3.4",
"ollama": "^0.5.16",
"openai": "^4.98.0",
"openai": "^5.15.0",
"posthog-js": "^1.252.0",
"react": "^18.2.0",
"react-dom": "^18.2.0",

View File

@@ -1,4 +1,4 @@
import { MessageType, LogMessage, ExecuteQueryMessage, AgentStreamUpdateMessage, CancelTaskMessage, ResetConversationMessage, GetTabsMessage } from '@/lib/types/messaging'
import { MessageType, LogMessage, ExecuteQueryMessage, CancelTaskMessage, ResetConversationMessage, GetTabsMessage } from '@/lib/types/messaging'
import { LLMSettingsReader } from '@/lib/llm/settings/LLMSettingsReader'
import { langChainProvider } from '@/lib/llm/LangChainProvider'
import { BrowserOSProvidersConfigSchema, BROWSEROS_PREFERENCE_KEYS } from '@/lib/llm/settings/browserOSTypes'
@@ -49,15 +49,15 @@ function debugLog(message: string, level: 'info' | 'error' | 'warning' = 'info')
Logging.log('Background', message, level)
}
// Active tabs map (tabId -> information)
const activeTabs = new Map<number, { url: string }>()
// Active tabs map (tabId -> information) - currently unused but preserved for future use
// const activeTabs = new Map<number, { url: string }>()
// Navigation history tracking (tabId -> array of navigation entries)
const tabHistory = new Map<number, Array<{
url: string
title: string
timestamp: number
}>>()
// Navigation history tracking (tabId -> array of navigation entries) - currently unused but preserved for future use
// const tabHistory = new Map<number, Array<{
// url: string
// title: string
// timestamp: number
// }>>()
// Connected ports (name -> port)
const connectedPorts = new Map<string, chrome.runtime.Port>();
@@ -130,7 +130,7 @@ function initialize(): void {
const raw = typeof change.newValue === 'string' ? JSON.parse(change.newValue) : change.newValue
const config = BrowserOSProvidersConfigSchema.parse(raw)
lastProvidersConfigJson = JSON.stringify(config)
try { langChainProvider.clearCache() } catch (_) {}
try { langChainProvider.clearCache() } catch (_) { /* Ignore error */ }
broadcastProvidersConfig(config)
} catch (_e) {
// Ignore parse/validation errors
@@ -396,6 +396,8 @@ function handlePortMessage(message: PortMessage, port: chrome.runtime.Port): voi
case MessageType.REFINE_PLAN:
handleRefinePlanPort(payload as { currentPlan: { goal?: string; steps: string[] }; feedback: string; maxSteps?: number }, port, id)
break
default:
// Unknown port message type
@@ -419,27 +421,14 @@ function handlePortMessage(message: PortMessage, port: chrome.runtime.Port): voi
/**
* Handles log messages
* @param payload - Log message payload
* @param _payload - Log message payload
*/
function handleLogMessage(payload: LogMessage['payload']): void {
const { source, message, level = 'info' } = payload;
// Forward log message from other components
function handleLogMessage(_payload: LogMessage['payload']): void {
// const { source, message, level = 'info' } = _payload;
// Forward log message from other components - currently no-op
}
/**
* Helper function to determine status from action string
*/
function getStatusFromAction(action: string): 'thinking' | 'executing' | 'completed' | 'error' {
if (action.includes('Error') || action.includes('Failed')) {
return 'error'
} else if (action.includes('Thinking') || action.includes('Processing')) {
return 'thinking'
} else if (action.includes('Executing')) {
return 'executing'
} else {
return 'executing'
}
}
// Helper function removed - was only used by old experiment functionality
/**
@@ -528,14 +517,14 @@ function handleHeartbeatMessage(payload: { timestamp: number }, port: chrome.run
/**
* Handles conversation reset requests via port messaging
* @param payload - Reset conversation payload
* @param port - Port to send response through
* @param id - Optional message ID for correlation
* @param _payload - Reset conversation payload
* @param _port - Port to send response through
* @param _id - Optional message ID for correlation
*/
function handleResetConversationPort(
payload: ResetConversationMessage['payload'],
port: chrome.runtime.Port,
id?: string
_payload: ResetConversationMessage['payload'],
_port: chrome.runtime.Port,
_id?: string
): void {
try {
nxtScape.reset()
@@ -656,7 +645,7 @@ function handleSaveLlmProvidersPort(
undefined,
(success?: boolean) => {
if (success) {
try { langChainProvider.clearCache() } catch (_) {}
try { langChainProvider.clearCache() } catch (_) { /* Ignore error */ }
lastProvidersConfigJson = JSON.stringify(config)
broadcastProvidersConfig(config)
}
@@ -672,7 +661,7 @@ function handleSaveLlmProvidersPort(
try {
const key = BROWSEROS_PREFERENCE_KEYS.PROVIDERS
chrome.storage?.local?.set({ [key]: JSON.stringify(config) }, () => {
try { langChainProvider.clearCache() } catch (_) {}
try { langChainProvider.clearCache() } catch (_) { /* Ignore error */ }
lastProvidersConfigJson = JSON.stringify(config)
broadcastProvidersConfig(config)
port.postMessage({
@@ -714,6 +703,8 @@ function handleCancelTaskPort(
try {
nxtScape.cancel()
Logging.logMetric('task_cancelled')
} catch (error) {
const errorMessage = error instanceof Error ? error.message : String(error)
debugLog(`Error handling task cancellation: ${errorMessage}`, 'error')

View File

@@ -46,4 +46,37 @@ export function isMockLLMSettings(): boolean {
return config.MOCK_LLM_SETTINGS
}
export function isPocMode(): boolean {
return false;
}
/**
* Evaluation configuration for development/debugging
*
* To enable telemetry:
* 1. Set ENABLE_TELEMETRY = true in your .env file
* 2. Add your Braintrust API key to BRAINTRUST_API_KEY in your .env file
* 3. Add your OpenAI API key to OPENAI_API_KEY_FOR_SCORING in your .env file (for LLM-as-judge scoring)
* 4. Optionally change OPENAI_MODEL_FOR_SCORING in your .env file (defaults to gpt-4o)
* 5. Rebuild
*
* 6. To experiment, you will need BRAINTRUST_PROJECT_UUID from your Braintrust dashboard in your .env file
* 7. Set BRAINTRUST_PROJECT_NAME in your .env file (defaults to 'browseros-agent-online')
*
* For the simplified evals2 system:
* 1. Set ENABLE_EVALS2 = true in your .env file
* 2. Set BRAINTRUST_API_KEY in your .env file
* 3. Set BRAINTRUST_PROJECT_NAME in your .env file (defaults to 'browseros-agent-online')
* 4. Rebuild
*/
export const ENABLE_TELEMETRY = process.env.ENABLE_TELEMETRY === 'true';
export const ENABLE_EVALS2 = process.env.ENABLE_EVALS2 === 'true';
export const BRAINTRUST_API_KEY = process.env.BRAINTRUST_API_KEY || '';
export const BRAINTRUST_PROJECT_UUID = process.env.BRAINTRUST_PROJECT_UUID || '';
export const BRAINTRUST_PROJECT_NAME = process.env.BRAINTRUST_PROJECT_NAME || 'browseros-agent-online';
// Gemini API keys for evals2 scoring
export const GOOGLE_GENAI_API_KEY = process.env.GOOGLE_GENAI_API_KEY || '';
export const GEMINI_API_KEY = process.env.GEMINI_API_KEY || '';
export default config

View File

@@ -1,116 +0,0 @@
# Tool Evaluation System
Current State:
LLM-based evaluation system for PlannerTool and ValidatorTool with LLM scoring.
## Structure
```
src/evals/
├── planner-llm.eval.ts # LLM-based planner evaluation
├── validator-llm.eval.ts # LLM-based validator evaluation
├── push-prompts.ts # Extract tool prompts for Braintrust
├── tools/
│ ├── planner/test-cases.json # Planner test cases
│ └── validator/test-cases.json # Validator test cases
└── utils/test-context.ts # Test utilities
```
## Commands
```bash
npm run eval:planner # Run LLM-based planner evaluation locally
npm run eval:validator # Run LLM-based validator evaluation locally
npm run extract:prompts # Extract tool prompts to JSON for Braintrust
# Braintrust SDK (optional)
npx braintrust eval src/evals/planner-llm.eval.ts
npx braintrust eval src/evals/validator-llm.eval.ts
```
## Prerequisites
Set your OpenAI API key:
```bash
$env:OPENAI_API_KEY="sk-your-openai-key"
```
## What happens when you run eval:planner
1. Loads test cases from `tools/planner/test-cases.json`
2. For each test case:
- Uses your PlannerTool prompts to generate a plan via LLM
- Scores the plan quality with LLM-as-judge (0.0-1.0)
- Provides reasoning for the score
3. Shows summary: passed/total tests and average score
Expected output:
```
Running PlannerTool LLM Evaluation
Test 1/3: planner-001
Task: Order toothpaste on Amazon
Generating plan...
Generated 5 steps
Scoring with LLM...
Score: 0.90
Reasoning: The plan covers all required actions and presents them in a logical sequence...
Test 2/3: planner-002
Task: Compare MacBook Air M2 prices on Amazon and Best Buy
Generating plan...
Generated 5 steps
Scoring with LLM...
Score: 0.75
Reasoning: The plan covers most required actions but misses the explicit step...
Test 3/3: planner-003
Task: Open example.com and extract the page title
Generating plan...
Generated 1 steps
Scoring with LLM...
Score: 0.65
Reasoning: The plan is incomplete as it only includes the action to extract...
=== RESULTS ===
Passed: 2/3
Average Score: 0.767
```
## Benefits of Braintrust Prompt Management
1. **Version Control**: Track prompt changes across experiments
2. **A/B Testing**: Compare different prompt versions systematically
3. **Performance Analytics**: See which prompts work best
4. **Team Collaboration**: Share and review prompts
5. **Experiment Linking**: Connect prompts to evaluation results
6. **Easy Rollback**: Revert to previous working versions
## Current Status
**PlannerTool evaluation is working!**
- Average score: 0.767 (2/3 tests passing)
- Successfully generates plans with your actual prompts
- LLM-as-judge scoring with detailed reasoning
## Identified Issues
- Test 3 (0.65): Plan missing navigation step for "Open example.com"
- Test 2 (0.75): Missing explicit price extraction step
- Overall: Room for prompt improvement to increase completeness
## Next Steps
**Option A: Improve PlannerTool First**
1. Analyze and improve PlannerTool prompts
2. Re-run evaluation to confirm improvements
3. Document baseline vs improved performance
**Option B: Move to Next Tool**
1. Set up ValidatorTool evaluation following same pattern
2. Add other tool evaluations (ClassificationTool, etc.)
3. Move to end-to-end agent evaluation
**Option C: Document & Continue**
1. Push current prompts to Braintrust for version control
2. Document current baseline (0.767)
3. Move to ValidatorTool while noting areas for improvement

View File

@@ -1,258 +0,0 @@
import { readFileSync } from 'fs'
import path from 'path'
import { z } from 'zod'
import { generatePlannerSystemPrompt, generatePlannerTaskPrompt } from '@/lib/tools/planning/PlannerTool.prompt'
import { ChatOpenAI } from '@langchain/openai'
// Define the schema for each test case using Zod
// This ensures that your test data is well-structured and validated
const PlannerTestCaseSchema = z.object({
id: z.string(), // Unique identifier for the test case
task: z.string(), // The user task to be planned
category: z.enum(['ecommerce', 'research', 'navigation', 'interaction', 'auth']), // Task domain
complexity: z.enum(['simple', 'medium', 'complex']), // Task difficulty
expected: z.object({
requiredActions: z.array(z.string()), // Actions the plan must include
maxSteps: z.number().optional(), // Optional upper bound on steps
minSteps: z.number().optional() // Optional lower bound on steps
})
})
// Load and validate planner test cases from a JSON file
function loadPlannerTestCases() {
const datasetPath = path.resolve('src/evals/tools/planner/test-cases.json') // Path to test cases
const rawJson = JSON.parse(readFileSync(datasetPath, 'utf8')) // Read and parse JSON
return z.array(PlannerTestCaseSchema).parse(rawJson) // Validate against schema
}
// Generate a plan using the same prompts as your PlannerTool
// This bypasses Chrome APIs and directly uses OpenAI via LangChain
async function generatePlan(task: string): Promise<any> {
if (!process.env.OPENAI_API_KEY) {
// Fail early if no API key is set
return {
error: 'No API key found. Set OPENAI_API_KEY',
steps: []
}
}
try {
// Initialize the LLM with your API key and desired model
const llm = new ChatOpenAI({
apiKey: process.env.OPENAI_API_KEY,
modelName: 'gpt-4o-mini',
temperature: 0.3 // Lower temperature for more deterministic output
})
// Generate system and user prompts using your PlannerTool logic
const systemPrompt = generatePlannerSystemPrompt()
const taskPrompt = generatePlannerTaskPrompt(
task,
5, // Max steps
`User: ${task}`,
'Current page: example.com'
)
// Construct the message array for the LLM
const messages = [
{ role: 'system' as const, content: systemPrompt },
{ role: 'user' as const, content: taskPrompt }
]
// Send the prompt to the LLM and get the response
const response = await llm.invoke(messages)
const content = response.content as string
// Parse the JSON response from the LLM
const parsed = JSON.parse(content)
return { steps: parsed.steps || [] }
} catch (error) {
// Catch and return any errors during LLM invocation or parsing
return {
error: error instanceof Error ? error.message : String(error),
steps: []
}
}
}
// Score the generated plan using another LLM call
// This evaluates the plan against expected actions and structure
async function scorePlanWithLLM(task: string, plan: any, expected: any): Promise<{ score: number, reasoning: string }> {
if (!process.env.OPENAI_API_KEY) {
// Fail early if no API key is set
return { score: 0, reasoning: 'No API key for scoring' }
}
try {
// Initialize a second LLM instance for scoring
const llm = new ChatOpenAI({
apiKey: process.env.OPENAI_API_KEY,
modelName: 'gpt-4o-mini',
temperature: 0.1 // Lower temperature for more consistent scoring
})
// Construct a scoring prompt with clear evaluation criteria
const scoringPrompt = `Evaluate this plan for the given task.
TASK: ${task}
GENERATED PLAN:
${JSON.stringify(plan.steps, null, 2)}
EXPECTED REQUIREMENTS:
- Required actions: ${expected.requiredActions.join(', ')}
- Max steps: ${expected.maxSteps || 'not specified'}
- Min steps: ${expected.minSteps || 'not specified'}
Evaluate on these criteria:
1. Completeness: Does the plan cover all required actions?
2. Logical order: Are steps in a sensible sequence?
3. Clarity: Are steps specific and actionable?
4. Efficiency: Is the plan concise without being too brief?
Respond with JSON:
{
"score": 0.85,
"reasoning": "Brief explanation of the score"
}`
// Send the scoring prompt to the LLM
const response = await llm.invoke([{ role: 'user', content: scoringPrompt }])
const result = JSON.parse(response.content as string)
// Clamp the score between 0 and 1
return {
score: Math.max(0, Math.min(1, result.score)),
reasoning: result.reasoning
}
} catch (error) {
// Catch and return any errors during scoring
return {
score: 0,
reasoning: `LLM scoring failed: ${error instanceof Error ? error.message : String(error)}`
}
}
}
// Run the evaluation locally for development purposes
async function runLLMEvaluation() {
console.log('Running PlannerTool LLM Evaluation')
// Check for API key
if (!process.env.OPENAI_API_KEY) {
console.log('Error: No API key found')
console.log('Set OPENAI_API_KEY environment variable')
return
}
// Load and slice test cases (limit to first 3 for quick testing)
const testCases = loadPlannerTestCases().slice(0, 3)
const results = []
// Loop through each test case
for (let i = 0; i < testCases.length; i++) {
const testCase = testCases[i]
console.log(`\nTest ${i + 1}/${testCases.length}: ${testCase.id}`)
console.log(`Task: ${testCase.task}`)
try {
// Generate a plan using the LLM
console.log(' Generating plan...')
const plan = await generatePlan(testCase.task)
if (plan.error) {
// Handle plan generation errors
console.log(` Plan Error: ${plan.error}`)
results.push({ id: testCase.id, score: 0, error: plan.error })
continue
}
console.log(` Generated ${plan.steps.length} steps`)
// Score the plan using the LLM
console.log(' Scoring with LLM...')
const scoring = await scorePlanWithLLM(testCase.task, plan, testCase.expected)
console.log(` Score: ${scoring.score.toFixed(2)}`)
console.log(` Reasoning: ${scoring.reasoning}`)
// Save the result
results.push({
id: testCase.id,
score: scoring.score,
reasoning: scoring.reasoning,
stepCount: plan.steps.length
})
} catch (error) {
// Catch any unexpected errors
const errorMsg = error instanceof Error ? error.message : String(error)
console.log(` Error: ${errorMsg}`)
results.push({ id: testCase.id, score: 0, error: errorMsg })
}
}
// Compute summary statistics
const avgScore = results.reduce((sum, r) => sum + r.score, 0) / results.length
const passed = results.filter(r => r.score > 0.7).length
console.log(`\n=== RESULTS ===`)
console.log(`Passed: ${passed}/${results.length}`)
console.log(`Average Score: ${avgScore.toFixed(3)}`)
return results
}
// Export a Braintrust-compatible evaluation function
// This allows you to run the eval via CLI or dashboard
export default async function Eval() {
return {
data: loadPlannerTestCases().slice(0, 3), // Load test cases
task: async (input: z.infer<typeof PlannerTestCaseSchema>) => {
// Generate a plan for each input
const plan = await generatePlan(input.task)
if (plan.error) {
return { error: plan.error, steps: [] }
}
return { steps: plan.steps }
},
scores: [
// Custom scoring function using LLM
async (input: z.infer<typeof PlannerTestCaseSchema>, output: any) => {
if (output.error) {
return { name: 'llm_quality', score: 0, metadata: { error: output.error } }
}
const scoring = await scorePlanWithLLM(input.task, output, input.expected)
return {
name: 'llm_quality',
score: scoring.score,
metadata: {
reasoning: scoring.reasoning,
stepCount: output.steps.length
}
}
}
]
}
}
// If this file is run directly (e.g. `ts-node planner-llm.eval.ts`), execute the local evaluation
if (require.main === module) {
runLLMEvaluation()
.then(() => {
// Log success message and exit cleanly
console.log('\nLLM evaluation completed')
process.exit(0)
})
.catch((error) => {
// Log failure message and exit with error code
console.error('LLM evaluation failed:', error)
process.exit(1)
})
}

View File

@@ -1,153 +0,0 @@
/**
* Utility to push all agent prompts from src/ to Braintrust
*
* Benefits of pushing prompts to Braintrust:
* 1. Version Control: Track prompt changes across experiments
* 2. A/B Testing: Compare different prompt versions systematically
* 3. Collaboration: Share prompts with team members
* 4. Rollback: Easily revert to previous working versions
* 5. Analytics: See which prompts perform best across different tasks
* 6. Experiment Tracking: Link prompts to specific evaluation runs
*/
import { readFileSync, writeFileSync } from 'fs'
import path from 'path'
// Import planner tool prompt functions
import { generatePlannerSystemPrompt, generatePlannerTaskPrompt } from '@/lib/tools/planning/PlannerTool.prompt'
// Define planner prompts to extract
const PROMPTS_TO_EXTRACT = [
{
name: 'planner-system',
description: 'PlannerTool system prompt for task breakdown',
category: 'planning',
extract: () => generatePlannerSystemPrompt()
},
{
name: 'planner-task',
description: 'PlannerTool task prompt template',
category: 'planning',
extract: () => generatePlannerTaskPrompt(
'TASK_PLACEHOLDER',
3,
'CONVERSATION_HISTORY_PLACEHOLDER',
'BROWSER_STATE_PLACEHOLDER'
)
}
]
/**
* Extract all prompts to a JSON file for Braintrust upload
*/
function extractPromptsToFile() {
const prompts = PROMPTS_TO_EXTRACT.map(config => {
try {
const content = config.extract()
return {
name: config.name,
description: config.description,
category: config.category,
content: content,
length: content.length,
extractedAt: new Date().toISOString()
}
} catch (error) {
return {
name: config.name,
description: config.description,
category: config.category,
content: null,
error: error instanceof Error ? error.message : String(error),
extractedAt: new Date().toISOString()
}
}
})
const output = {
metadata: {
extractedAt: new Date().toISOString(),
totalPrompts: prompts.length,
successfulExtractions: prompts.filter(p => p.content).length
},
prompts
}
const outputPath = path.resolve('src/evals/extracted-prompts.json')
writeFileSync(outputPath, JSON.stringify(output, null, 2))
console.log(`Extracted ${output.metadata.successfulExtractions}/${output.metadata.totalPrompts} prompts to: ${outputPath}`)
// Print summary
prompts.forEach(prompt => {
if (prompt.content) {
console.log(`${prompt.name} (${prompt.length} chars)`)
} else {
console.log(`${prompt.name} - ${prompt.error}`)
}
})
return output
}
/**
* Create Braintrust SDK upload script (when ready to use Braintrust)
*/
function generateBraintrustUploadScript() {
const script = `
// Braintrust prompt upload script
// Run with: npx tsx src/evals/upload-to-braintrust.ts
import { initLogger } from 'braintrust'
async function uploadPrompts() {
const logger = initLogger({
projectName: 'nxtscape-agent',
experiment: 'prompt-versions'
})
// Load extracted prompts
const promptsData = require('./extracted-prompts.json')
for (const prompt of promptsData.prompts) {
if (prompt.content) {
await logger.logPrompt({
name: prompt.name,
description: prompt.description,
prompt: prompt.content,
metadata: {
category: prompt.category,
length: prompt.length,
extractedAt: prompt.extractedAt
}
})
console.log(\`Uploaded: \${prompt.name}\`)
}
}
console.log('All prompts uploaded to Braintrust!')
}
uploadPrompts().catch(console.error)
`
const scriptPath = path.resolve('src/evals/upload-to-braintrust.ts')
writeFileSync(scriptPath, script.trim())
console.log(`\nCreated Braintrust upload script: ${scriptPath}`)
console.log('When ready to use Braintrust, run: npx tsx src/evals/upload-to-braintrust.ts')
}
// Run if called directly
if (require.main === module) {
console.log('Extracting prompts from src/...')
extractPromptsToFile()
generateBraintrustUploadScript()
console.log('\n=== BENEFITS OF BRAINTRUST PROMPT MANAGEMENT ===')
console.log('1. Version Control: Track how prompts evolve over time')
console.log('2. A/B Testing: Test multiple prompt versions side-by-side')
console.log('3. Performance Analytics: See which prompts work best')
console.log('4. Team Collaboration: Share and review prompts')
console.log('5. Experiment Linking: Connect prompts to evaluation results')
console.log('6. Easy Rollback: Revert to previous working versions')
}

View File

@@ -1,119 +0,0 @@
[
{
"id": "planner-001",
"task": "Order toothpaste on Amazon",
"category": "ecommerce",
"complexity": "complex",
"expected": {
"requiredActions": ["Navigate to Amazon", "Search for toothpaste", "Select a toothpaste", "Add to cart", "Proceed to checkout"],
"maxSteps": 7,
"minSteps": 4,
"sequenceOrder": [["Navigate", "Search"], ["Search", "Add to cart"], ["Add to cart", "checkout"]]
}
},
{
"id": "planner-002",
"task": "Compare MacBook Air M2 prices on Amazon and Best Buy",
"category": "research",
"complexity": "complex",
"expected": {
"requiredActions": ["Navigate to Amazon", "Search MacBook Air M2", "Extract price", "Navigate to Best Buy", "Search MacBook Air M2", "Extract price", "Compare prices"],
"maxSteps": 10,
"minSteps": 6
}
},
{
"id": "planner-003",
"task": "Open example.com and extract the page title",
"category": "navigation",
"complexity": "simple",
"expected": {
"requiredActions": ["Navigate to example.com", "Extract page title"],
"maxSteps": 3,
"minSteps": 2
}
},
{
"id": "planner-004",
"task": "Log into the dashboard and verify access is denied without credentials",
"category": "auth",
"complexity": "medium",
"expected": {
"requiredActions": ["Navigate to login", "Recognize login required"],
"forbiddenActions": ["Submit credentials"],
"maxSteps": 5,
"minSteps": 2,
"sequenceOrder": [["Navigate", "Recognize login"]]
}
},
{
"id": "planner-005",
"task": "Search for 'Nxtscape docs' and open the first result",
"category": "navigation",
"complexity": "medium",
"expected": {
"requiredActions": ["Navigate", "Search", "Click first result", "Refresh"],
"maxSteps": 6,
"minSteps": 3,
"sequenceOrder": [["Search", "Click first result"]]
}
},
{
"id": "planner-006",
"task": "On docs homepage, extract all nav links",
"category": "research",
"complexity": "medium",
"expected": {
"requiredActions": ["Navigate", "Extract links"],
"maxSteps": 5,
"minSteps": 2
}
},
{
"id": "planner-007",
"task": "Fill a contact form and submit",
"category": "interaction",
"complexity": "complex",
"expected": {
"requiredActions": ["Navigate", "Fill", "Submit"],
"maxSteps": 8,
"minSteps": 3,
"sequenceOrder": [["Fill", "Submit"]]
}
},
{
"id": "planner-008",
"task": "Open two product pages in separate tabs",
"category": "navigation",
"complexity": "medium",
"expected": {
"requiredActions": ["Navigate", "Open new tab", "Navigate"],
"maxSteps": 6,
"minSteps": 3
}
},
{
"id": "planner-009",
"task": "Log into GitHub and open your repositories page",
"category": "auth",
"complexity": "complex",
"expected": {
"requiredActions": ["Navigate to GitHub", "Click login", "Enter credentials", "Submit login", "Navigate to Repositories"],
"maxSteps": 7,
"minSteps": 4,
"sequenceOrder": [["Click login", "Enter credentials"], ["Submit login", "Navigate to Repositories"]]
}
},
{
"id": "planner-010",
"task": "Search for 'weather in Tokyo' and extract the temperature",
"category": "research",
"complexity": "medium",
"expected": {
"requiredActions": ["Navigate to search engine", "Search 'weather in Tokyo'", "Extract temperature"],
"maxSteps": 5,
"minSteps": 3,
"sequenceOrder": [["Search", "Extract temperature"]]
}
}
]

View File

@@ -1,142 +0,0 @@
[
{
"id": "validator-001",
"task": "Order toothpaste on Amazon",
"category": "ecommerce",
"complexity": "complex",
"currentState": {
"url": "https://amazon.com/confirmation/order-12345",
"title": "Order Confirmation - Amazon.com",
"elements": ["Order #112-7685432-1234567", "Thank you for your order", "Delivery estimate: Tomorrow"],
"messageHistory": "Successfully added Colgate toothpaste to cart and placed order"
},
"expected": {
"isComplete": true,
"reasoning": "Order confirmation page with order number shows task completion",
"confidence": "high"
}
},
{
"id": "validator-002",
"task": "Order toothpaste on Amazon",
"category": "ecommerce",
"complexity": "complex",
"currentState": {
"url": "https://amazon.com/cart",
"title": "Shopping Cart - Amazon.com",
"elements": ["Colgate Total Toothpaste", "Qty: 1", "Proceed to checkout"],
"messageHistory": "Added toothpaste to cart but haven't completed checkout"
},
"expected": {
"isComplete": false,
"reasoning": "Task requires placing order, but item is only in cart",
"confidence": "high",
"suggestions": ["Click 'Proceed to checkout' to complete the order", "Complete payment process to place order"]
}
},
{
"id": "validator-003",
"task": "Find the current weather in San Francisco",
"category": "research",
"complexity": "simple",
"currentState": {
"url": "https://google.com/search?q=weather+san+francisco",
"title": "weather san francisco - Google Search",
"elements": ["72°F", "Partly cloudy", "San Francisco, CA", "Weather forecast"],
"messageHistory": "Searched for San Francisco weather and found current temperature"
},
"expected": {
"isComplete": true,
"reasoning": "Weather information successfully found and displayed",
"confidence": "high"
}
},
{
"id": "validator-004",
"task": "Submit contact form with name John Doe",
"category": "interaction",
"complexity": "medium",
"currentState": {
"url": "https://example.com/contact",
"title": "Contact Us - Example.com",
"elements": ["Name: John Doe", "Email: (empty)", "Message: (empty)", "Submit"],
"messageHistory": "Filled in name field but form not yet submitted"
},
"expected": {
"isComplete": false,
"reasoning": "Form filled partially but not submitted, no confirmation shown",
"confidence": "high",
"suggestions": ["Fill in the required email field", "Fill in message field", "Click Submit button to complete form submission"]
}
},
{
"id": "validator-005",
"task": "Submit contact form with name John Doe",
"category": "interaction",
"complexity": "medium",
"currentState": {
"url": "https://example.com/contact/success",
"title": "Thank You - Example.com",
"elements": ["Thank you for your message!", "We'll get back to you soon", "Reference ID: CNT-12345"],
"messageHistory": "Filled form and submitted successfully"
},
"expected": {
"isComplete": true,
"reasoning": "Form submitted with success confirmation page and reference ID",
"confidence": "high"
}
},
{
"id": "validator-006",
"task": "Login to account with email user@test.com",
"category": "auth",
"complexity": "medium",
"currentState": {
"url": "https://app.example.com/login",
"title": "Login - Example App",
"elements": ["Email: user@test.com", "Password: (filled)", "Login button", "Remember me"],
"messageHistory": "Filled login form but haven't clicked login button yet"
},
"expected": {
"isComplete": false,
"reasoning": "Credentials entered but login not attempted, still on login page",
"confidence": "high",
"suggestions": ["Click the 'Login' button to complete authentication"]
}
},
{
"id": "validator-007",
"task": "Login to account with email user@test.com",
"category": "auth",
"complexity": "medium",
"currentState": {
"url": "https://app.example.com/dashboard",
"title": "Dashboard - Example App",
"elements": ["Welcome back, John!", "Dashboard", "Account menu", "Logout"],
"messageHistory": "Successfully logged in and redirected to dashboard"
},
"expected": {
"isComplete": true,
"reasoning": "Successfully authenticated and on dashboard page with welcome message",
"confidence": "high"
}
},
{
"id": "validator-008",
"task": "Compare iPhone 15 prices on Amazon and Best Buy",
"category": "research",
"complexity": "complex",
"currentState": {
"url": "https://amazon.com/search?q=iphone+15",
"title": "iphone 15 - Amazon.com",
"elements": ["iPhone 15 128GB", "$799.00", "Add to cart", "Prime delivery"],
"messageHistory": "Found iPhone 15 price on Amazon ($799) but haven't checked Best Buy yet"
},
"expected": {
"isComplete": false,
"reasoning": "Only checked one retailer, need to compare prices from both Amazon and Best Buy",
"confidence": "high",
"suggestions": ["Navigate to Best Buy to find iPhone 15 price", "Compare prices from both retailers", "Report which retailer has the better price"]
}
}
]

View File

@@ -1,24 +0,0 @@
import { ExecutionContext } from '@/lib/runtime/ExecutionContext'
import { BrowserContext } from '@/lib/browser/BrowserContext'
import { MessageManager } from '@/lib/runtime/MessageManager'
export function makeStubExecutionContext(options: {
browserState: string
messageHistory: string
useVision: boolean
}): ExecutionContext {
// Create minimal stubs for testing
const stubBrowserContext = new BrowserContext()
const stubMessageManager = new MessageManager()
// Add the message history if provided
if (options.messageHistory) {
stubMessageManager.addHuman(options.messageHistory)
}
return new ExecutionContext({
browserContext: stubBrowserContext,
messageManager: stubMessageManager,
abortSignal: new AbortController().signal
})
}

View File

@@ -1,299 +0,0 @@
/**
* ValidatorTool evaluation with LLM scoring
* Tests validation accuracy for task completion detection
*/
import { readFileSync } from 'fs'
import path from 'path'
import { z } from 'zod'
import { generateValidatorSystemPrompt, generateValidatorTaskPrompt } from '@/lib/tools/validation/ValidatorTool.prompt'
import { ChatOpenAI } from '@langchain/openai'
// Test case schema
const ValidatorTestCaseSchema = z.object({
id: z.string(),
task: z.string(),
category: z.enum(['ecommerce', 'research', 'interaction', 'auth']),
complexity: z.enum(['simple', 'medium', 'complex']),
currentState: z.object({
url: z.string(),
title: z.string(),
elements: z.array(z.string()),
messageHistory: z.string()
}),
expected: z.object({
isComplete: z.boolean(),
reasoning: z.string(),
confidence: z.enum(['high', 'medium', 'low']),
suggestions: z.array(z.string()).optional()
})
})
function loadValidatorTestCases() {
const datasetPath = path.resolve('src/evals/tools/validator/test-cases.json')
const rawJson = JSON.parse(readFileSync(datasetPath, 'utf8'))
return z.array(ValidatorTestCaseSchema).parse(rawJson)
}
// Validation result schema (same as ValidatorTool)
const ValidationResultSchema = z.object({
isComplete: z.boolean(), // Whether the task is complete
reasoning: z.string(), // Explanation of validation result
confidence: z.enum(['high', 'medium', 'low']), // Confidence in validation
suggestions: z.array(z.string()) // Suggestions for the planner if task incomplete
})
/**
* Call LLM to perform validation using ValidatorTool prompts
*/
async function performValidation(task: string, currentState: any): Promise<any> {
if (!process.env.OPENAI_API_KEY) {
return {
error: 'No API key found. Set OPENAI_API_KEY',
validation: null
}
}
try {
// Use OpenAI with structured output (same as ValidatorTool)
const llm = new ChatOpenAI({
apiKey: process.env.OPENAI_API_KEY,
modelName: 'gpt-4o-mini',
temperature: 0.1
})
// Generate the same prompts ValidatorTool would use
const systemPrompt = generateValidatorSystemPrompt()
// Create browser state string from test data
const browserStateString = `URL: ${currentState.url}
Title: ${currentState.title}
Elements: ${currentState.elements.join(', ')}`
const taskPrompt = generateValidatorTaskPrompt(
task,
browserStateString,
currentState.messageHistory,
'' // No screenshot in test
)
// Use structured output like the real ValidatorTool
const structuredLLM = llm.withStructuredOutput(ValidationResultSchema)
const validation = await structuredLLM.invoke([
{ role: 'system', content: systemPrompt },
{ role: 'user', content: taskPrompt }
])
return { validation }
} catch (error) {
return {
error: error instanceof Error ? error.message : String(error),
validation: null
}
}
}
/**
* LLM-based scorer for validation accuracy
*/
async function scoreValidationWithLLM(
task: string,
currentState: any,
actualValidation: any,
expectedValidation: any
): Promise<{ score: number, reasoning: string }> {
if (!process.env.OPENAI_API_KEY) {
return { score: 0, reasoning: 'No API key for scoring' }
}
try {
const llm = new ChatOpenAI({
apiKey: process.env.OPENAI_API_KEY,
modelName: 'gpt-4o-mini',
temperature: 0.1
})
const scoringPrompt = `Evaluate this validation result for accuracy.
TASK: ${task}
CURRENT STATE:
- URL: ${currentState.url}
- Title: ${currentState.title}
- Elements: ${currentState.elements.join(', ')}
- History: ${currentState.messageHistory}
ACTUAL VALIDATION:
${JSON.stringify(actualValidation, null, 2)}
EXPECTED VALIDATION:
${JSON.stringify(expectedValidation, null, 2)}
Evaluate on these criteria:
1. **Completion Accuracy**: Did it correctly identify if the task is complete/incomplete? (40%)
2. **Reasoning Quality**: Is the reasoning logical and well-supported by evidence? (30%)
3. **Confidence Appropriateness**: Is the confidence level justified by the evidence? (20%)
4. **Suggestion Quality**: Are suggestions specific and actionable (if task incomplete)? (10%)
Scoring guide:
- 1.0: Perfect validation with accurate completion status and excellent reasoning
- 0.8-0.9: Correct completion status with good reasoning, minor issues
- 0.6-0.7: Correct completion status but weak reasoning, or minor accuracy issues
- 0.4-0.5: Incorrect completion status but reasonable reasoning given the evidence
- 0.2-0.3: Major errors in both completion status and reasoning
- 0.0-0.1: Completely incorrect validation
Respond with JSON:
{
"score": 0.85,
"reasoning": "Brief explanation of the score focusing on accuracy and reasoning quality"
}`
const response = await llm.invoke([{ role: 'user', content: scoringPrompt }])
let content = response.content as string
// Remove markdown code blocks if present
content = content.replace(/```json\s*|\s*```/g, '').trim()
const result = JSON.parse(content)
return {
score: Math.max(0, Math.min(1, result.score)),
reasoning: result.reasoning
}
} catch (error) {
return {
score: 0,
reasoning: `LLM scoring failed: ${error instanceof Error ? error.message : String(error)}`
}
}
}
async function runValidatorLLMEvaluation() {
console.log('Running ValidatorTool LLM Evaluation')
// Check API key first
if (!process.env.OPENAI_API_KEY) {
console.log('Error: No API key found')
console.log('Set OPENAI_API_KEY environment variable')
return
}
const testCases = loadValidatorTestCases().slice(0, 5) // Test first 5 cases
const results = []
for (let i = 0; i < testCases.length; i++) {
const testCase = testCases[i]
console.log(`\nTest ${i + 1}/${testCases.length}: ${testCase.id}`)
console.log(`Task: ${testCase.task}`)
console.log(`State: ${testCase.currentState.url}`)
try {
// Perform validation
console.log(' Performing validation...')
const validation = await performValidation(testCase.task, testCase.currentState)
if (validation.error) {
console.log(` Validation Error: ${validation.error}`)
results.push({ id: testCase.id, score: 0, error: validation.error })
continue
}
console.log(` Result: ${validation.validation.isComplete ? 'Complete' : 'Incomplete'}`)
console.log(` Confidence: ${validation.validation.confidence}`)
// Score with LLM
console.log(' Scoring accuracy...')
const scoring = await scoreValidationWithLLM(
testCase.task,
testCase.currentState,
validation.validation,
testCase.expected
)
console.log(` Score: ${scoring.score.toFixed(2)}`)
console.log(` Reasoning: ${scoring.reasoning}`)
results.push({
id: testCase.id,
score: scoring.score,
reasoning: scoring.reasoning,
actualResult: validation.validation.isComplete,
expectedResult: testCase.expected.isComplete
})
} catch (error) {
const errorMsg = error instanceof Error ? error.message : String(error)
console.log(` Error: ${errorMsg}`)
results.push({ id: testCase.id, score: 0, error: errorMsg })
}
}
const avgScore = results.reduce((sum, r) => sum + r.score, 0) / results.length
const passed = results.filter(r => r.score > 0.7).length
const accurateValidations = results.filter(r => r.actualResult !== undefined && r.actualResult === r.expectedResult).length  // ignore errored cases that recorded neither field
console.log(`\n=== RESULTS ===`)
console.log(`Passed: ${passed}/${results.length}`)
console.log(`Validation Accuracy: ${accurateValidations}/${results.length}`)
console.log(`Average Score: ${avgScore.toFixed(3)}`)
return results
}
// Braintrust-compatible evaluation function
export default async function Eval() {
return {
data: loadValidatorTestCases().slice(0, 5), // Test first 5 cases
task: async (input: z.infer<typeof ValidatorTestCaseSchema>) => {
// Perform validation using our ValidatorTool prompts
const validation = await performValidation(input.task, input.currentState)
if (validation.error) {
return { error: validation.error, result: null }
}
return { result: validation.validation }
},
scores: [
async (input: z.infer<typeof ValidatorTestCaseSchema>, output: any) => {
if (output.error) {
return { name: 'validation_accuracy', score: 0, metadata: { error: output.error } }
}
const scoring = await scoreValidationWithLLM(
input.task,
input.currentState,
output.result,
input.expected
)
return {
name: 'validation_accuracy',
score: scoring.score,
metadata: {
reasoning: scoring.reasoning,
actualResult: output.result.isComplete,
expectedResult: input.expected.isComplete,
accurateValidation: output.result.isComplete === input.expected.isComplete
}
}
}
]
}
}
// Local runner for development
if (require.main === module) {
runValidatorLLMEvaluation()
.then(() => {
console.log('\nValidator LLM evaluation completed')
process.exit(0)
})
.catch((error) => {
console.error('Validator LLM evaluation failed:', error)
process.exit(1)
})
}


@@ -0,0 +1,217 @@
import { ENABLE_EVALS2, BRAINTRUST_API_KEY, BRAINTRUST_PROJECT_NAME } from '@/config';
import { z } from 'zod';
import { initLogger } from 'braintrust';
// Session metadata schema
export const SessionMetadataSchema = z.object({
sessionId: z.string(),
task: z.string(),
timestamp: z.number(),
agentVersion: z.string().optional()
});
export type SessionMetadata = z.infer<typeof SessionMetadataSchema>;
/**
* Simplified Braintrust event manager that maintains session and parent span tracking
* Much simpler than the original BraintrustEventCollector but keeps the useful parts
*/
export class SimpleBraintrustEventManager {
private static instance: SimpleBraintrustEventManager | null = null;
private logger: any = null;
private initialized: boolean = false;
private enabled: boolean = false;
private parentSpanId: string | null = null;
private sessionId: string | null = null;
private sessionStartTime: number = 0;
private sessionScores: number[] = []; // Track task scores for session average
// Singleton pattern
static getInstance(): SimpleBraintrustEventManager {
if (!SimpleBraintrustEventManager.instance) {
SimpleBraintrustEventManager.instance = new SimpleBraintrustEventManager();
}
return SimpleBraintrustEventManager.instance;
}
private constructor() {}
/**
* Check if evals2 is enabled
*/
isEnabled(): boolean {
if (!this.initialized) {
this.initialized = true;
this.enabled = ENABLE_EVALS2 && !!BRAINTRUST_API_KEY;
if (this.enabled) {
console.log('%c✓ Evals2 enabled', 'color: #00ff00; font-size: 10px');
}
}
return this.enabled;
}
/**
* Initialize Braintrust logger
*/
private ensureLogger(): boolean {
if (this.logger) return true;
if (!BRAINTRUST_API_KEY) {
return false;
}
try {
// Initialize Braintrust logger
this.logger = initLogger({
apiKey: BRAINTRUST_API_KEY,
projectName: BRAINTRUST_PROJECT_NAME
});
return true;
} catch (error) {
console.warn('Failed to initialize Braintrust logger:', error);
return false;
}
}
/**
* Start a new session (parent span for conversation)
*/
async startSession(metadata: SessionMetadata): Promise<{ parent?: string }> {
if (!this.isEnabled()) {
return {};
}
const hasLogger = this.ensureLogger();
if (!hasLogger) {
return {};
}
try {
this.sessionId = metadata.sessionId;
this.sessionStartTime = Date.now();
this.sessionScores = [];
// Create parent span for the conversation
const parent = await this.logger.traced(async (span: any) => {
span.log({
input: metadata.task,
metadata: {
sessionId: metadata.sessionId,
timestamp: metadata.timestamp,
agentVersion: metadata.agentVersion,
type: 'session_start',
conversation: true
}
});
return await span.export(); // Returns parent span ID
}, { name: 'agent_session' });
this.parentSpanId = parent || null;
if (this.parentSpanId) {
console.log('%c✓ Evals2 session initialized', 'color: #00ff00; font-size: 10px');
console.log(`%c Session ID: ${this.sessionId}`, 'color: #888; font-size: 10px');
}
return { parent: this.parentSpanId || undefined };
} catch (error) {
console.debug('Failed to start session:', error);
return {};
}
}
/**
* Add a task score to the session
*/
addTaskScore(score: number): void {
if (this.isEnabled() && this.sessionId) {
this.sessionScores.push(score);
}
}
/**
* End the current session with aggregated scores
*/
async endSession(reason: string = 'unknown'): Promise<void> {
if (!this.isEnabled() || !this.sessionId || !this.parentSpanId || !this.logger) {
return;
}
try {
const duration = Date.now() - this.sessionStartTime;
// Calculate average score for session
const avgScore = this.sessionScores.length > 0
? this.sessionScores.reduce((sum, score) => sum + score, 0) / this.sessionScores.length
: 1.0;
console.log(`%c📈 Session average score: ${avgScore.toFixed(2)} from ${this.sessionScores.length} tasks`,
'color: #4caf50; font-weight: bold; font-size: 11px');
// Log session end
await this.logger.traced(async (span: any) => {
span.log({
metadata: {
type: 'session_end',
sessionId: this.sessionId,
reason,
duration_ms: duration,
task_count: this.sessionScores.length
},
scores: {
session_average: avgScore
}
});
}, {
name: 'session_end',
parent: this.parentSpanId
});
console.log(`%c← Evals2 session ended (${reason})`, 'color: #888; font-size: 10px');
// Clear session state
this.sessionId = null;
this.parentSpanId = null;
this.sessionScores = [];
} catch (error) {
console.debug('Failed to end session:', error);
}
}
/**
* Get the current parent span ID for child spans
*/
getParentSpanId(): string | null {
return this.parentSpanId;
}
/**
* Get the current session ID
*/
getSessionId(): string | null {
return this.sessionId;
}
/**
* Reset the event manager (for testing)
*/
reset(): void {
this.sessionId = null;
this.parentSpanId = null;
this.sessionScores = [];
this.sessionStartTime = 0;
this.logger = null;
this.initialized = false;
this.enabled = false;
}
/**
* Flush any pending logs
*/
async flush(): Promise<void> {
if (this.logger && this.logger.flush) {
await this.logger.flush();
}
}
}


@@ -0,0 +1,148 @@
import { BRAINTRUST_API_KEY, BRAINTRUST_PROJECT_NAME } from '@/config';
import { ScoreResult } from './EvalScorer.types';
import { TIME_EFFICIENCY_BUCKETS } from './Evals.config';
import { initLogger } from 'braintrust';
/**
* Get human-readable time efficiency bucket
*/
function getTimeEfficiencyBucket(durationMs: number): string {
if (durationMs <= TIME_EFFICIENCY_BUCKETS.perfect) return '⚡ <30s (Perfect)';
if (durationMs <= TIME_EFFICIENCY_BUCKETS.exceptional) return '🚀 <1min (Exceptional)';
if (durationMs <= TIME_EFFICIENCY_BUCKETS.excellent) return '✨ <2min (Excellent)';
if (durationMs <= TIME_EFFICIENCY_BUCKETS.veryGood) return '👍 <3min (Very Good)';
if (durationMs <= TIME_EFFICIENCY_BUCKETS.good) return '✅ <4min (Good)';
if (durationMs <= TIME_EFFICIENCY_BUCKETS.average) return '📊 <5min (Average)';
if (durationMs <= TIME_EFFICIENCY_BUCKETS.belowAverage) return '⚠️ <6min (Below Average)';
if (durationMs <= TIME_EFFICIENCY_BUCKETS.poor) return '🐢 <8min (Poor)';
if (durationMs <= TIME_EFFICIENCY_BUCKETS.veryPoor) return '❌ <10min (Very Poor)';
return '💀 >10min (Terrible)';
}
/**
* Simple Braintrust logger that only uploads scores
* No complex spans, no session management, just scores
*/
export class SimpleBraintrustLogger {
private logger: any = null;
private initialized: boolean = false;
initialize(): boolean {
if (this.initialized) return true;
this.initialized = true;
if (!BRAINTRUST_API_KEY) {
console.log('%c⚠ No Braintrust API key, scores won\'t be uploaded', 'color: #ff9900; font-size: 10px');
return false;
}
try {
// Initialize Braintrust logger
this.logger = initLogger({
apiKey: BRAINTRUST_API_KEY,
projectName: BRAINTRUST_PROJECT_NAME
});
console.log('%c✓ Braintrust logger initialized', 'color: #00ff00; font-size: 10px');
return true;
} catch (error) {
console.warn('Failed to initialize Braintrust:', error);
return false;
}
}
async logTaskScore(
query: string,
score: ScoreResult,
duration_ms: number,
metadata?: any,
parentSpanId?: string,
contextMetrics?: {
messageCount: number;
totalCharacters: number;
estimatedTokens: number;
}
): Promise<void> {
if (!this.logger) {
const success = this.initialize();
if (!success) return;
}
try {
// Log as a simple traced event with scores
await this.logger.traced(async (span: any) => {
span.log({
input: query,
output: `Task completed with score: ${score.weightedTotal.toFixed(2)}`,
scores: {
// Normalize scores from 1-10 to 0-1 for Braintrust
goal_completion: (score.goalCompletion - 1) / 9, // Convert 1-10 to 0-1
plan_correctness: (score.planCorrectness - 1) / 9, // Convert 1-10 to 0-1
error_free_execution: (score.errorFreeExecution - 1) / 9, // Convert 1-10 to 0-1
context_efficiency: (score.contextEfficiency - 1) / 9, // Convert 1-10 to 0-1
weighted_total: (score.weightedTotal - 1) / 9 // Convert 1-10 to 0-1
},
metadata: {
type: 'evals2_task',
duration_ms,
total_duration_seconds: (score.details.totalDurationMs || duration_ms) / 1000,
// Raw scores (1-10 scale) for comparison
raw_scores: {
goal_completion: score.goalCompletion,
plan_correctness: score.planCorrectness,
error_free_execution: score.errorFreeExecution,
context_efficiency: score.contextEfficiency,
weighted_total: score.weightedTotal
},
// Tool execution details
tool_execution: {
total_calls: score.details.toolCalls,
failed_calls: score.details.failedCalls,
success_rate: score.details.toolCalls > 0
? ((score.details.toolCalls - score.details.failedCalls) / score.details.toolCalls * 100).toFixed(1) + '%'
: '0%',
retries: score.details.retries,
total_tool_duration_ms: score.details.totalDurationMs || 0,
},
// Context usage metrics
context_usage: contextMetrics || {
messageCount: 0,
totalCharacters: 0,
estimatedTokens: 0
},
// Scoring metadata
scoring_info: {
reasoning: score.details.reasoning || 'No reasoning provided',
scoring_method: score.details.reasoning?.includes('Heuristic') ? 'heuristic' : 'llm',
time_efficiency_bucket: getTimeEfficiencyBucket(score.details.totalDurationMs || duration_ms)
},
// Original metadata passed from NxtScape
...metadata
}
});
}, {
name: 'evals2_task_score',
parent: parentSpanId // Use parent span if provided
});
console.log('%c📊 Scores uploaded to Braintrust', 'color: #4caf50; font-size: 10px');
} catch (error) {
// Silent failure - don't break execution
console.debug('Failed to log to Braintrust:', error);
}
}
async flush(): Promise<void> {
if (this.logger && this.logger.flush) {
await this.logger.flush();
}
}
}
// Export singleton instance
export const braintrustLogger = new SimpleBraintrustLogger();


@@ -0,0 +1,427 @@
import { BaseMessage, AIMessage, HumanMessage, SystemMessage } from '@langchain/core/messages';
import { ToolExecution } from './EvalScorer.types';
import { TokenCounter } from '@/lib/utils/TokenCounter';
/**
* Individual scoring prompts for Gemini 2.5 Pro - each dimension scored separately
* NTN: Focused prompts with only required context for each dimension
*/
/**
* Helper to wrap any content in XML tags with proper formatting
*/
function wrapInXML(tagName: string, content: string): string {
return `<${tagName}>
${content}
</${tagName}>`;
}
/**
* Format message history with XML structure and descriptive title
*/
function formatMessageHistoryXML(messages: BaseMessage[]): string {
if (!messages || messages.length === 0) {
return wrapInXML('MessageHistory', 'No messages recorded');
}
const formattedMessages = messages.map(msg => {
const role = msg instanceof HumanMessage ? 'Human' :
msg instanceof AIMessage ? 'Assistant' :
msg instanceof SystemMessage ? 'System' : 'Unknown';
const content = typeof msg.content === 'string' ?
msg.content : JSON.stringify(msg.content);
// Truncate very long messages
const truncatedContent = content.length > 500 ?
content.substring(0, 500) + '...' : content;
return `${role}: ${truncatedContent}`;
}).join('\n');
return wrapInXML('MessageHistory',
`## Message History from actual run
${formattedMessages}`);
}
/**
* Format failed tools list with XML structure
*/
function formatFailedToolsXML(failedCalls: ToolExecution[]): string {
if (!failedCalls || failedCalls.length === 0) {
return wrapInXML('FailedTools', 'No failed tool executions');
}
const toolList = failedCalls.map(t => t.toolName).join(', ');
return wrapInXML('FailedTools',
`## Failed Tools from actual run
${toolList}`);
}
/**
* Format error details with XML structure
*/
function formatErrorDetailsXML(failedCalls: ToolExecution[]): string {
if (!failedCalls || failedCalls.length === 0) {
return wrapInXML('ErrorDetails', 'No errors occurred');
}
const errors = failedCalls.slice(0, 5).map((call, idx) => {
const errorMsg = call.error || 'Unknown error';
const duration = call.duration !== undefined ? `${call.duration}ms` : 'N/A';
return `${idx + 1}. ${call.toolName} (${duration}): ${errorMsg}`;
}).join('\n');
return wrapInXML('ErrorDetails',
`## Error Details from actual run (first 5)
${errors}`);
}
/**
* Score goal completion - did the agent achieve what was asked?
*/
export function getGoalCompletionPrompt(
query: string,
messages: BaseMessage[],
toolCalls: ToolExecution[]
): string {
// Extract key signals of completion
const hasDoneTool = messages.some(msg =>
msg instanceof AIMessage &&
msg.tool_calls?.some(tc => tc.name === 'done_tool')
);
// Get last few messages to understand final state
const lastMessages = messages.slice(-5).map((msg, idx) =>
`[${idx}] ${msg._getType()}: ${typeof msg.content === 'string' ? msg.content.slice(0, 200) : '...'}`
).join('\n');
// Extract any results or extracted data
const resultTools = toolCalls.filter(t =>
t.toolName === 'result_tool' ||
t.toolName === 'extract_tool' ||
t.toolName === 'done_tool'
);
// Build prompt with proper structure
let prompt = `Evaluate if an AI agent completed the user's goal.
`;
// Add user request in XML
prompt += wrapInXML('UserRequest',
`## User Request from actual run
"${query}"`);
prompt += '\n\n';
// Add execution summary in XML
prompt += wrapInXML('ExecutionSummary',
`## Execution Summary from actual run
- Total tools executed: ${toolCalls.length}
- Done tool called: ${hasDoneTool ? 'Yes' : 'No'}
- Result/Extract tools used: ${resultTools.length}`);
prompt += '\n\n';
// Add final messages in XML
prompt += wrapInXML('FinalMessages',
`## Final Messages from actual run (last 5)
${lastMessages}`);
prompt += '\n\n';
// Add key tool results in XML
prompt += wrapInXML('KeyToolResults',
`## Key Tool Results from actual run
${resultTools.map(t => `${t.toolName}: success=${t.success}`).join('\n') || 'No result tools used'}`);
prompt += '\n\n';
// Add scoring instructions
prompt += `## SCORING INSTRUCTIONS
Rate goal completion on a 1-10 scale:
10: Perfect - Task fully completed, results delivered clearly
9: Excellent - Task completed with all requirements met
8: Very Good - Task completed with minor gaps
7: Good - Main goal achieved, some details missing
6: Satisfactory - Core task done but incomplete
5: Partial - About half completed
4: Limited - Less than half done
3: Minimal - Very little progress
2: Failed - Almost no progress
1: Complete Failure - Nothing accomplished
Consider:
- Was the specific request fulfilled?
- If user asked for information, was it provided?
- If user asked for an action, was it performed?
- If done_tool was called, task was likely completed
Return ONLY a number between 1-10:`;
// ALWAYS append message history at the END
if (messages) {
prompt += '\n\n' + formatMessageHistoryXML(messages);
}
return prompt;
}
/**
* Score plan efficiency - was the execution efficient and well-planned?
*/
export function getPlanEfficiencyPrompt(
query: string,
toolCalls: ToolExecution[],
totalDurationMs: number,
messages?: BaseMessage[]
): string {
// Analyze tool sequence for patterns
const toolSequence = toolCalls.map(t => t.toolName).join(' → ');
const uniqueTools = new Set(toolCalls.map(t => t.toolName)).size;
const retries = countConsecutiveDuplicates(toolCalls);
// Check for planning tools
const hasPlanning = toolCalls.some(t =>
t.toolName === 'classification_tool' ||
t.toolName === 'planner_tool'
);
// Time efficiency
const durationSeconds = totalDurationMs / 1000;
const avgTimePerTool = totalDurationMs / Math.max(1, toolCalls.length);
// Build prompt with proper structure
let prompt = `Evaluate the efficiency of an AI agent's execution plan.
`;
// Add task in XML
prompt += wrapInXML('Task',
`## Task from actual run
"${query}"`);
prompt += '\n\n';
// Add execution metrics in XML
prompt += wrapInXML('ExecutionMetrics',
`## Execution Metrics from actual run
- Duration: ${durationSeconds.toFixed(1)} seconds
- Tool calls: ${toolCalls.length}
- Unique tools: ${uniqueTools}
- Consecutive retries: ${retries}
- Used planning: ${hasPlanning ? 'Yes' : 'No'}`);
prompt += '\n\n';
// Add tool sequence in XML
prompt += wrapInXML('ToolSequence',
`## Tool Sequence from actual run
${toolSequence || 'No tools executed'}`);
prompt += '\n\n';
// Add scoring instructions
prompt += `## SCORING INSTRUCTIONS
Rate execution efficiency on a 1-10 scale:
10: Lightning fast (<30s), optimal tool sequence
9: Very fast (<1min), efficient path
8: Fast (<2min), good decisions
7: Quick (<3min), mostly efficient
6: Reasonable (<4min), acceptable path
5: Average (<5min), some inefficiency
4: Slow (<6min), redundant steps
3: Very slow (<8min), poor planning
2: Extremely slow (<10min), many issues
1: Terrible (>10min), excessive redundancy
Consider:
- Execution time vs task complexity
- Tool sequence logic
- Unnecessary repetitions
- Whether planning was needed/used appropriately
Return ONLY a number between 1-10:`;
// ALWAYS append message history at the END
if (messages) {
prompt += '\n\n' + formatMessageHistoryXML(messages);
}
return prompt;
}
/**
* Score error handling - how well were errors managed?
*/
export function getErrorHandlingPrompt(
toolCalls: ToolExecution[],
messages?: BaseMessage[]
): string {
const totalCalls = toolCalls.length;
const failedCalls = toolCalls.filter(t => !t.success);
const failureRate = totalCalls > 0 ? (failedCalls.length / totalCalls) * 100 : 0;
const recoveryAttempts = analyzeRecoveryPatterns(toolCalls);
// Build prompt without message history
let prompt = `Evaluate how well an AI agent handled errors during execution.
`;
// Add structured statistics
prompt += wrapInXML('ErrorStatistics',
`## Error Statistics from actual run
- Total tool calls: ${totalCalls}
- Failed calls: ${failedCalls.length}
- Failure rate: ${failureRate.toFixed(1)}%
- Recovery attempts: ${recoveryAttempts}`);
prompt += '\n\n';
// Add failed tools list
prompt += formatFailedToolsXML(failedCalls);
prompt += '\n\n';
// Add error details
prompt += formatErrorDetailsXML(failedCalls);
prompt += '\n\n';
// Add scoring instructions
prompt += `## SCORING INSTRUCTIONS
Rate error handling on a 1-10 scale:
10: Flawless - No errors occurred
9: Excellent - Minor issues handled perfectly
8: Very Good - Errors recovered gracefully
7: Good - Most errors handled well
6: Adequate - Some recovery from errors
5: Mixed - Half of errors handled
4: Poor - Many unhandled errors
3: Very Poor - Most errors not addressed
2: Critical - Errors caused major issues
1: Complete Failure - Errors prevented any progress
Consider:
- If no errors occurred, score 10
- If errors occurred, was recovery attempted?
- Did errors block task completion?
- Were errors handled gracefully?
Return ONLY a number between 1-10:`;
// ALWAYS append message history at the END
if (messages) {
prompt += '\n\n' + formatMessageHistoryXML(messages);
}
return prompt;
}
/**
* Score context efficiency - how efficiently were tokens/context used?
*/
export function getContextEfficiencyPrompt(
messages: BaseMessage[],
toolCalls: ToolExecution[]
): string {
// Calculate context usage with proper TokenCounter
const messageCount = messages.length;
const totalChars = messages.reduce((sum, msg) => {
const content = typeof msg.content === 'string' ? msg.content : JSON.stringify(msg.content);
return sum + content.length;
}, 0);
const estimatedTokens = TokenCounter.countMessages(messages); // Use accurate token counting
// Analyze redundancy
const toolNames = toolCalls.map(t => t.toolName);
const duplicateTools = toolNames.length - new Set(toolNames).size;
const redundancyRate = toolNames.length > 0 ? (duplicateTools / toolNames.length) * 100 : 0;
// Build prompt with proper formatting
let prompt = `Evaluate how efficiently an AI agent used context and tokens.
`;
// Add context usage stats in XML
prompt += wrapInXML('ContextUsage',
`## Context Usage from actual run
- Messages: ${messageCount}
- Total characters: ${totalChars.toLocaleString()}
- Estimated tokens: ${estimatedTokens.toLocaleString()} (accurate with message overhead)
- Tools called: ${toolCalls.length}
- Duplicate tool calls: ${duplicateTools}
- Redundancy rate: ${redundancyRate.toFixed(1)}%`);
prompt += '\n\n';
// Add efficiency indicators in XML
prompt += wrapInXML('EfficiencyIndicators',
`## Efficiency Indicators from actual run
- Tokens per tool: ${toolCalls.length > 0 ? Math.round(estimatedTokens / toolCalls.length) : 'N/A'}
- Average message length: ${Math.round(totalChars / Math.max(1, messageCount))} chars
- Unique vs total tools: ${new Set(toolNames).size}/${toolNames.length}
- Token estimation method: TokenCounter with overhead`);
prompt += '\n\n';
// Add scoring instructions
prompt += `## SCORING INSTRUCTIONS
Rate context efficiency on a 1-10 scale:
10: Extremely concise (<32K tokens)
9: Very efficient (<64K tokens)
8: Efficient (<100K tokens)
7: Good usage (<128K tokens)
6: Acceptable (<200K tokens)
5: Average (<300K tokens)
4: Somewhat wasteful (<500K tokens)
3: Inefficient (<750K tokens)
2: Very wasteful (<1000K tokens)
1: Extremely wasteful (>1000K tokens)
Consider:
- Token usage vs task complexity
- Redundant operations
- Message verbosity
- Efficient tool usage
Return ONLY a number between 1-10:`;
// ALWAYS append message history at the END
if (messages) {
prompt += '\n\n' + formatMessageHistoryXML(messages);
}
return prompt;
}
/**
* Helper function to count consecutive duplicate tool calls
*/
function countConsecutiveDuplicates(toolCalls: ToolExecution[]): number {
let count = 0;
for (let i = 1; i < toolCalls.length; i++) {
if (toolCalls[i].toolName === toolCalls[i-1].toolName) {
count++;
}
}
return count;
}
/**
* Helper function to analyze recovery patterns after failures
*/
function analyzeRecoveryPatterns(toolCalls: ToolExecution[]): number {
let recoveries = 0;
for (let i = 0; i < toolCalls.length - 1; i++) {
// If a tool failed and the next tool succeeded, count as recovery
if (!toolCalls[i].success && toolCalls[i + 1].success) {
recoveries++;
}
}
return recoveries;
}


@@ -0,0 +1,107 @@
import { describe, it, expect, vi } from 'vitest';
import { SimplifiedScorer } from './EvalScorer';
import { HumanMessage, AIMessage, ToolMessage } from '@langchain/core/messages';
describe('SimplifiedScorer with Gemini', () => {
it('tests that the scorer can be created', () => {
const scorer = new SimplifiedScorer();
expect(scorer).toBeDefined();
});
it('tests that scores are in 1-10 range', async () => {
const scorer = new SimplifiedScorer();
// Use heuristic scoring for testing without API key
scorer['llm'] = null;
const score = await scorer.scoreFromMessages([], 'test query');
expect(score.goalCompletion).toBeGreaterThanOrEqual(1);
expect(score.goalCompletion).toBeLessThanOrEqual(10);
expect(score.planCorrectness).toBeGreaterThanOrEqual(1);
expect(score.planCorrectness).toBeLessThanOrEqual(10);
expect(score.errorFreeExecution).toBeGreaterThanOrEqual(1);
expect(score.errorFreeExecution).toBeLessThanOrEqual(10);
expect(score.contextEfficiency).toBeGreaterThanOrEqual(1);
expect(score.contextEfficiency).toBeLessThanOrEqual(10);
expect(score.weightedTotal).toBeGreaterThanOrEqual(1);
expect(score.weightedTotal).toBeLessThanOrEqual(10);
});
it('tests that tool calls are extracted correctly', async () => {
const messages = [
new HumanMessage('test'),
new AIMessage({
content: '',
tool_calls: [{
id: 'call_1',
name: 'test_tool',
args: { input: 'test' }
}]
}),
new ToolMessage({
content: JSON.stringify({ ok: true, output: 'result' }),
tool_call_id: 'call_1'
})
];
const scorer = new SimplifiedScorer();
// Use heuristic scoring for testing without API key
scorer['llm'] = null;
const score = await scorer.scoreFromMessages(messages, 'test');
expect(score.details.toolCalls).toBe(1);
expect(score.details.failedCalls).toBe(0);
});
it('tests that time efficiency scoring works', async () => {
const scorer = new SimplifiedScorer();
// Use heuristic scoring for testing without API key
scorer['llm'] = null;
const toolMetrics = new Map([
['call_1', { toolName: 'test', duration: 30000, success: true, timestamp: Date.now() }],
['call_2', { toolName: 'test2', duration: 15000, success: true, timestamp: Date.now() }]
]);
const messages = [
new AIMessage({
content: '',
tool_calls: [{
id: 'call_1',
name: 'test',
args: {}
}, {
id: 'call_2',
name: 'test2',
args: {}
}]
})
];
const score = await scorer.scoreFromMessages(messages, 'test', toolMetrics);
expect(score.details.totalDurationMs).toBe(45000); // 45 seconds total
// Should get high efficiency score (8-9) for < 1 minute
});
it('tests that heuristic fallback works', async () => {
// Test without LLM available
const scorer = new SimplifiedScorer();
// Mock getLLM to return null
scorer['llm'] = null;
const messages = [
new HumanMessage('test'),
new AIMessage({
content: '',
tool_calls: [{
id: 'call_1',
name: 'done_tool',
args: {}
}]
})
];
const score = await scorer.scoreFromMessages(messages, 'test query');
expect(score.details.reasoning).toContain('Heuristic');
expect(score.goalCompletion).toBeGreaterThanOrEqual(1);
expect(score.goalCompletion).toBeLessThanOrEqual(10);
});
});

src/evals2/EvalScorer.ts

@@ -0,0 +1,337 @@
import { BaseMessage, AIMessage, ToolMessage } from '@langchain/core/messages';
import { BaseChatModel } from '@langchain/core/language_models/chat_models';
import { ChatGoogleGenerativeAI } from '@langchain/google-genai';
import { getLLM } from '@/lib/llm/LangChainProvider';
import { SCORE_WEIGHTS, GEMINI_SCORING_CONFIG, TIME_EFFICIENCY_BUCKETS } from './Evals.config';
import { ScoreResult, ToolExecution } from './EvalScorer.types';
import { GOOGLE_GENAI_API_KEY, GEMINI_API_KEY } from '@/config';
import {
getGoalCompletionPrompt,
getPlanEfficiencyPrompt,
getErrorHandlingPrompt,
getContextEfficiencyPrompt
} from './EvalScorer.prompt';
export class SimplifiedScorer {
private llm: BaseChatModel | null | undefined = undefined;
constructor() {
// Gemini 2.5 Pro is hardcoded, no model parameter needed
}
private async getLLM(): Promise<BaseChatModel | null> {
// If llm is explicitly set to null (for testing), return null
if (this.llm === null) {
return null;
}
if (this.llm === undefined) {
// Always require Gemini 2.5 Pro - no fallbacks
const apiKey = GOOGLE_GENAI_API_KEY || GEMINI_API_KEY;
if (!apiKey) {
throw new Error('Gemini API key is required for evals2 scoring. Set GOOGLE_GENAI_API_KEY or GEMINI_API_KEY environment variable.');
}
try {
// Directly instantiate Gemini 2.5 Pro
this.llm = new ChatGoogleGenerativeAI({
model: GEMINI_SCORING_CONFIG.modelId,
temperature: GEMINI_SCORING_CONFIG.temperature,
maxOutputTokens: GEMINI_SCORING_CONFIG.maxTokens,
apiKey: apiKey,
convertSystemMessageToHumanContent: true
});
} catch (error) {
console.error('Failed to initialize Gemini 2.5 Pro for scoring:', error);
throw error; // Re-throw to fail fast
}
}
return this.llm;
}
/**
* Score task completion from message history
*/
async scoreFromMessages(
messages: BaseMessage[],
query: string,
toolMetrics?: Map<string, any>,
actualDurationMs?: number // Actual task execution duration
): Promise<ScoreResult> {
// Extract tool calls with metrics
const toolCalls = this.extractToolCalls(messages, toolMetrics);
const toolExecutionMs = this.getTotalDuration(toolCalls);
// Use actual duration if provided, otherwise fall back to tool execution sum
const totalDurationMs = actualDurationMs || toolExecutionMs;
try {
// Get LLM for scoring - this will throw if no API key
const llm = await this.getLLM();
if (!llm) {
// Only use heuristic if explicitly set to null for testing
return this.getHeuristicScores(messages, toolCalls, totalDurationMs, toolExecutionMs, query);
}
// Score each dimension separately with focused prompts
const [goalScore, planScore, errorScore, contextScore] = await Promise.all([
this.scoreGoalCompletion(llm, query, messages, toolCalls),
this.scorePlanEfficiency(llm, query, toolCalls, totalDurationMs, messages),
this.scoreErrorHandling(llm, toolCalls, messages),
this.scoreContextEfficiency(llm, messages, toolCalls)
]);
// Calculate weighted total (1-10 scale)
const weightedTotal =
goalScore * SCORE_WEIGHTS.goalCompletion +
planScore * SCORE_WEIGHTS.planCorrectness +
errorScore * SCORE_WEIGHTS.errorFreeExecution +
contextScore * SCORE_WEIGHTS.contextEfficiency;
return {
goalCompletion: goalScore,
planCorrectness: planScore,
errorFreeExecution: errorScore,
contextEfficiency: contextScore,
weightedTotal: Math.round(weightedTotal),
details: {
toolCalls: toolCalls.length,
failedCalls: toolCalls.filter(t => !t.success).length,
retries: this.countRetries(toolCalls),
totalDurationMs,
toolExecutionMs, // Keep tool execution time separate
reasoning: `Scored with individual LLM calls: ${toolCalls.length} tools, actual: ${totalDurationMs}ms, tools: ${toolExecutionMs}ms`
}
};
} catch (error) {
// If getLLM throws (no API key), let it bubble up
// Don't fall back to heuristics for configuration errors
if (error instanceof Error && error.message.includes('API key is required')) {
throw error;
}
// For other scoring errors, we can still use heuristics
console.error('LLM scoring failed:', error);
return this.getHeuristicScores(messages, toolCalls, totalDurationMs, toolExecutionMs, query);
}
}
/**
* Extract tool calls from message history
* @param messages - Message history from MessageManager
* @param toolMetrics - Optional metrics Map from ExecutionContext
*/
private extractToolCalls(messages: BaseMessage[], toolMetrics?: Map<string, any>): ToolExecution[] {
const toolCalls: ToolExecution[] = [];
// Simple iteration using instanceof
for (let i = 0; i < messages.length; i++) {
const msg = messages[i];
// Check if it's an AIMessage with tool calls
if (msg instanceof AIMessage && msg.tool_calls && msg.tool_calls.length > 0) {
for (const toolCall of msg.tool_calls) {
// Find the next ToolMessage with matching ID
const toolMsg = messages.slice(i + 1).find(
m => m instanceof ToolMessage && m.tool_call_id === (toolCall.id || '')
) as ToolMessage | undefined;
// Get metrics from ExecutionContext if available
const metrics = toolMetrics?.get(toolCall.id || '');
let success = true;
let error: string | undefined;
if (toolMsg) {
// Parse tool result to check success
try {
const result = JSON.parse(toolMsg.content as string);
success = result.ok !== false;
error = result.error;
} catch {
// Not JSON, assume success
}
}
toolCalls.push({
toolName: toolCall.name,
duration: metrics?.duration || 100, // Use tracked duration or default
success: metrics?.success ?? success,
timestamp: metrics?.timestamp || Date.now(),
args: toolCall.args,
error: metrics?.error || error
});
}
}
}
return toolCalls;
}
private countRetries(toolCalls: ToolExecution[]): number {
let retries = 0;
for (let i = 1; i < toolCalls.length; i++) {
// Same tool called consecutively = likely retry
if (toolCalls[i].toolName === toolCalls[i-1].toolName) {
retries++;
}
}
return retries;
}
/**
* Calculate total duration from tool metrics
*/
private getTotalDuration(toolCalls: ToolExecution[]): number {
return toolCalls.reduce((sum, tool) => sum + (tool.duration || 0), 0);
}
/**
* Score goal completion using focused prompt
*/
private async scoreGoalCompletion(
llm: BaseChatModel,
query: string,
messages: BaseMessage[],
toolCalls: ToolExecution[]
): Promise<number> {
const prompt = getGoalCompletionPrompt(query, messages, toolCalls);
return this.invokeLLMForScore(llm, prompt, 'goal completion');
}
/**
* Score plan efficiency using focused prompt
*/
private async scorePlanEfficiency(
llm: BaseChatModel,
query: string,
toolCalls: ToolExecution[],
totalDurationMs: number,
messages?: BaseMessage[]
): Promise<number> {
const prompt = getPlanEfficiencyPrompt(query, toolCalls, totalDurationMs, messages);
return this.invokeLLMForScore(llm, prompt, 'plan efficiency');
}
/**
* Score error handling using focused prompt
*/
private async scoreErrorHandling(
llm: BaseChatModel,
toolCalls: ToolExecution[],
messages?: BaseMessage[]
): Promise<number> {
const prompt = getErrorHandlingPrompt(toolCalls, messages);
return this.invokeLLMForScore(llm, prompt, 'error handling');
}
/**
* Score context efficiency using focused prompt
*/
private async scoreContextEfficiency(
llm: BaseChatModel,
messages: BaseMessage[],
toolCalls: ToolExecution[]
): Promise<number> {
const prompt = getContextEfficiencyPrompt(messages, toolCalls);
return this.invokeLLMForScore(llm, prompt, 'context efficiency');
}
/**
* Invoke LLM and parse score response
*/
private async invokeLLMForScore(
llm: BaseChatModel,
prompt: string,
dimension: string
): Promise<number> {
try {
const response = await llm.invoke(prompt);
let content = typeof response.content === 'string' ? response.content : '5';
// Clean up any formatting
content = content.trim().replace(/[^0-9.]/g, '');
const score = parseFloat(content);
const validScore = Math.min(10, Math.max(1, isNaN(score) ? 5 : score));
console.log(`Scored ${dimension}: ${validScore}`);
return validScore;
} catch (error) {
console.error(`Failed to score ${dimension}:`, error);
return 5; // Default middle score on error
}
}
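/**
* Score efficiency based on execution time
* NTN: Direct 10-point scale, no conversion needed
*/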
private scoreTimeEfficiency(durationMs: number): number {
if (durationMs <= TIME_EFFICIENCY_BUCKETS.perfect) return 10;
if (durationMs <= TIME_EFFICIENCY_BUCKETS.exceptional) return 9;
if (durationMs <= TIME_EFFICIENCY_BUCKETS.excellent) return 8;
if (durationMs <= TIME_EFFICIENCY_BUCKETS.veryGood) return 7;
if (durationMs <= TIME_EFFICIENCY_BUCKETS.good) return 6;
if (durationMs <= TIME_EFFICIENCY_BUCKETS.average) return 5;
if (durationMs <= TIME_EFFICIENCY_BUCKETS.belowAverage) return 4;
if (durationMs <= TIME_EFFICIENCY_BUCKETS.poor) return 3;
if (durationMs <= TIME_EFFICIENCY_BUCKETS.veryPoor) return 2;
return 1;
}
/**
* Heuristic scoring fallback when LLM is unavailable
* NTN: Returns 1-10 scores based on simple heuristics
*/
private getHeuristicScores(
messages: BaseMessage[],
toolCalls: ToolExecution[],
totalDurationMs: number,
toolExecutionMs: number,
query: string
): ScoreResult {
// Goal completion heuristic
const hasDone = messages.some(msg =>
msg instanceof AIMessage &&
msg.tool_calls?.some(tc => tc.name === 'done_tool')
);
const goalScore = hasDone ? 7 : 3;
// Plan efficiency based on time
const planScore = this.scoreTimeEfficiency(totalDurationMs);
// Error handling based on failure rate
const failureRate = toolCalls.filter(t => !t.success).length / Math.max(1, toolCalls.length);
const errorScore = Math.round(10 * (1 - failureRate));
// Context efficiency based on message count
const messageCount = messages.length;
let contextScore = 5;
if (messageCount < 10) contextScore = 9;
else if (messageCount < 20) contextScore = 7;
else if (messageCount < 30) contextScore = 5;
else if (messageCount < 50) contextScore = 3;
else contextScore = 2;
const weightedTotal =
goalScore * SCORE_WEIGHTS.goalCompletion +
planScore * SCORE_WEIGHTS.planCorrectness +
errorScore * SCORE_WEIGHTS.errorFreeExecution +
contextScore * SCORE_WEIGHTS.contextEfficiency;
return {
goalCompletion: goalScore,
planCorrectness: planScore,
errorFreeExecution: errorScore,
contextEfficiency: contextScore,
weightedTotal: Math.round(weightedTotal),
details: {
toolCalls: toolCalls.length,
failedCalls: toolCalls.filter(t => !t.success).length,
retries: this.countRetries(toolCalls),
totalDurationMs,
toolExecutionMs, // Keep tool execution time separate
reasoning: 'Heuristic scoring (LLM unavailable)'
}
};
}
}


@@ -0,0 +1,36 @@
import { z } from "zod";
// Tool execution metadata schema
export const ToolExecutionSchema = z.object({
toolName: z.string(), // Name of the tool
duration: z.number(), // Duration in milliseconds
success: z.boolean(), // Whether tool succeeded (ok: true/false)
timestamp: z.number(), // When tool was executed
args: z.any().optional(), // Tool arguments
error: z.string().optional() // Error message if failed
});
export type ToolExecution = z.infer<typeof ToolExecutionSchema>;
// Scoring result schema
export const ScoreResultSchema = z.object({
goalCompletion: z.number().min(1).max(10), // How well goal was achieved (1-10 scale)
planCorrectness: z.number().min(1).max(10), // Quality and efficiency of the plan (1-10 scale)
errorFreeExecution: z.number().min(1).max(10), // Error-free execution score (1-10 scale)
contextEfficiency: z.number().min(1).max(10), // Efficient context usage (1-10 scale)
weightedTotal: z.number().min(1).max(10), // Weighted average (1-10 scale)
details: z.object({ // Scoring details
toolCalls: z.number(), // Total number of tool calls
failedCalls: z.number(), // Number of failed calls
retries: z.number(), // Number of retried calls
totalDurationMs: z.number().optional(), // Total execution duration in ms
toolExecutionMs: z.number().optional(), // Sum of tool execution durations in ms
reasoning: z.string().optional() // LLM reasoning
})
});
export type ScoreResult = z.infer<typeof ScoreResultSchema>;
// Duration storage options
export const DurationStorageSchema = z.enum(["result", "context", "collector"]);
export type DurationStorage = z.infer<typeof DurationStorageSchema>;


@@ -0,0 +1,69 @@
import { DynamicStructuredTool } from '@langchain/core/tools';
import type { ExecutionContext } from '@/lib/runtime/ExecutionContext';
/**
* Wrap a tool to track execution duration in ExecutionContext
* Stores metrics in context.toolMetrics Map
*/
export function wrapToolForMetrics(
tool: DynamicStructuredTool,
context: ExecutionContext,
toolCallId: string
): DynamicStructuredTool {
return new DynamicStructuredTool({
name: tool.name,
description: tool.description,
schema: tool.schema,
func: async (input: any) => {
const start = Date.now();
try {
const result = await tool.func(input);
const duration = Date.now() - start;
// Parse result to check success
let success = true;
try {
const parsed = JSON.parse(result);
success = parsed.ok !== false;
} catch {
// If not JSON, assume success
}
// Store metrics in ExecutionContext
if (!context.toolMetrics) {
context.toolMetrics = new Map();
}
context.toolMetrics.set(toolCallId, {
toolName: tool.name,
duration,
success,
timestamp: start
});
console.log(`⚡ Tool: ${tool.name} (${duration}ms)`);
return result;
} catch (error: any) {
const duration = Date.now() - start;
// Store error metrics
if (!context.toolMetrics) {
context.toolMetrics = new Map();
}
context.toolMetrics.set(toolCallId, {
toolName: tool.name,
duration,
success: false,
timestamp: start,
error: error.message
});
console.error(`❌ Tool: ${tool.name} failed (${duration}ms)`);
throw error;
}
}
});
}
export { wrapToolForMetrics as wrapToolForDuration }; // Alias for compatibility


@@ -0,0 +1,40 @@
// Scoring weights
export const SCORE_WEIGHTS = {
goalCompletion: 0.40, // 40% - Most important
planCorrectness: 0.30, // 30% - Plan quality
errorFreeExecution: 0.15, // 15% - Error handling (renamed per NTN feedback)
contextEfficiency: 0.15 // 15% - Efficiency
} as const;
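// For reference, EvalScorer.ts combines these as:
//   weightedTotal = goal*0.40 + plan*0.30 + errorFree*0.15 + context*0.15  (each dimension on a 1-10 scale, then rounded)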
// Default scoring model - removed, using Gemini 2.5 Pro exclusively
// Gemini 2.5 Pro configuration (hardcoded for evals2)
export const GEMINI_SCORING_CONFIG = {
provider: 'google_gemini',
modelId: 'gemini-2.5-pro',
temperature: 0,
maxTokens: 8192, // Output tokens for scoring
contextWindow: 2000000 // 2M token context
} as const;
// Time buckets for plan efficiency scoring (in milliseconds)
// NTN: Using 10-point scale for finer granularity
export const TIME_EFFICIENCY_BUCKETS = {
perfect: 30000, // < 30s = 10
exceptional: 60000, // < 1 min = 9
excellent: 120000, // < 2 min = 8
veryGood: 180000, // < 3 min = 7
good: 240000, // < 4 min = 6
average: 300000, // < 5 min = 5
belowAverage: 360000, // < 6 min = 4
poor: 480000, // < 8 min = 3
veryPoor: 600000, // < 10 min = 2
terrible: Infinity // > 10 min = 1
} as const;
// Environment variable names (for reference)
export const ENV_VARS = {
ENABLE: "ENABLE_EVALS2",
BRAINTRUST_KEY: "BRAINTRUST_API_KEY",
GEMINI_KEY: "GOOGLE_GENAI_API_KEY" // Or GEMINI_API_KEY
} as const;


@@ -0,0 +1,131 @@
import { describe, it, expect, beforeAll, afterAll } from 'vitest';
import { SimpleBraintrustEventManager } from './BraintrustEventManager';
import { SimplifiedScorer } from './EvalScorer';
import { wrapToolForMetrics } from './EvalToolWrapper';
import { DynamicStructuredTool } from '@langchain/core/tools';
import { z } from 'zod';
import { HumanMessage, AIMessage, ToolMessage } from '@langchain/core/messages';
describe('Evals2 Integration', () => {
let eventManager: SimpleBraintrustEventManager;
beforeAll(() => {
// Set env var for testing
process.env.ENABLE_EVALS2 = 'true';
eventManager = SimpleBraintrustEventManager.getInstance();
});
afterAll(() => {
// Clean up
eventManager.reset();
delete process.env.ENABLE_EVALS2;
});
it('tests that the event manager can be initialized', () => {
expect(eventManager).toBeDefined();
// Will be false without API key, which is expected in test
expect(eventManager.isEnabled()).toBeDefined();
});
it('tests that tool wrapping tracks duration', async () => {
// Create a mock execution context
const mockContext = {
toolMetrics: undefined as any,
// Add other required properties as needed
} as any;
// Create a simple tool
const testTool = new DynamicStructuredTool({
name: 'test_tool',
description: 'A test tool',
schema: z.object({
input: z.string()
}),
func: async (input: any) => {
// Simulate work
await new Promise(resolve => setTimeout(resolve, 50));
return JSON.stringify({ ok: true, output: 'test result' });
}
});
// Wrap the tool
const wrappedTool = wrapToolForMetrics(testTool, mockContext, 'test_call_123');
// Execute the wrapped tool
const result = await wrappedTool.func({ input: 'test' });
// Verify metrics were tracked
expect(mockContext.toolMetrics).toBeDefined();
expect(mockContext.toolMetrics.size).toBe(1);
const metrics = mockContext.toolMetrics.get('test_call_123');
expect(metrics).toBeDefined();
expect(metrics.toolName).toBe('test_tool');
expect(metrics.duration).toBeGreaterThan(40); // ~50ms of simulated work, with slack for timer precision
expect(metrics.success).toBe(true);
});
it('tests that scorer can process messages with tool metrics', async () => {
// Create mock tool metrics
const toolMetrics = new Map();
toolMetrics.set('call_1', {
toolName: 'navigation_tool',
duration: 123,
success: true,
timestamp: Date.now()
});
// Create test messages
const messages = [
new HumanMessage('Navigate to example.com'),
new AIMessage({
content: '',
tool_calls: [{
id: 'call_1',
name: 'navigation_tool',
args: { url: 'https://example.com' }
}]
}),
new ToolMessage({
content: JSON.stringify({ ok: true, output: 'Navigated successfully' }),
tool_call_id: 'call_1'
}),
new AIMessage({
content: '',
tool_calls: [{
id: 'call_2',
name: 'done_tool',
args: {}
}]
}),
new ToolMessage({
content: JSON.stringify({ ok: true }),
tool_call_id: 'call_2'
})
];
const scorer = new SimplifiedScorer();
// Test 1: Without API key, it should throw
if (!process.env.GOOGLE_GENAI_API_KEY && !process.env.GEMINI_API_KEY) {
await expect(scorer.scoreFromMessages(messages, 'Navigate to example.com', toolMetrics))
.rejects.toThrow('Gemini API key is required');
} else {
// Test 2: With API key, it should work
const score = await scorer.scoreFromMessages(messages, 'Navigate to example.com', toolMetrics);
expect(score).toBeDefined();
expect(score.weightedTotal).toBeGreaterThanOrEqual(1);
expect(score.weightedTotal).toBeLessThanOrEqual(10);
expect(score.details.toolCalls).toBe(2); // navigation_tool and done_tool
expect(score.details.failedCalls).toBe(0);
}
// Test 3: With llm set to null, should use heuristics
scorer['llm'] = null;
const heuristicScore = await scorer.scoreFromMessages(messages, 'Navigate to example.com', toolMetrics);
expect(heuristicScore).toBeDefined();
expect(heuristicScore.weightedTotal).toBeGreaterThanOrEqual(1);
expect(heuristicScore.weightedTotal).toBeLessThanOrEqual(10);
expect(heuristicScore.details.reasoning).toContain('Heuristic');
});
});

src/evals2/README.md

@@ -0,0 +1,100 @@
# Evals2 - Simplified Evaluation System
## Overview
Evals2 is a lightweight evaluation system that tracks agent execution metrics and scores task completion quality. It's a simplified replacement for the original evaluation system with ~75% less code complexity.
## Key Features
- **Lightweight Tool Tracking**: Simple Map-based duration tracking (no complex spans)
- **4-Category Scoring**: Goal completion (40%), Plan correctness (30%), Error-free execution (15%), Context efficiency (15%)
- **Session Management**: Maintains parent-child span relationships for Braintrust hierarchy
- **Minimal Integration**: Only 2 hooks in existing code (BrowserAgent + NxtScape)
## Usage
### Enabling Evals2
Set the environment variables:
```bash
export ENABLE_EVALS2=true
export BRAINTRUST_API_KEY=your-key           # Required for uploading scores
export GOOGLE_GENAI_API_KEY=your-gemini-key  # Required for LLM scoring (GEMINI_API_KEY also works)
```
### How It Works
1. **Session Start**: When a conversation begins, SimpleBraintrustEventManager creates a parent span
2. **Tool Execution**: Each tool call is wrapped with `wrapToolForMetrics` to track duration
3. **Task Scoring**: After task completion, SimplifiedScorer analyzes messages and tool metrics
4. **Score Upload**: Scores are sent to Braintrust via SimpleBraintrustLogger (the sketch below ties these steps together)
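The sketch below is illustrative only; the real wiring lives in BrowserAgent and NxtScape. It assumes an existing `ExecutionContext`, a `DynamicStructuredTool`, and the conversation's message history, and uses placeholder values for the session ID, tool-call ID, and tool arguments.
```typescript
import {
  SimpleBraintrustEventManager,
  SimplifiedScorer,
  braintrustLogger,
  wrapToolForMetrics
} from '@/evals2';
import type { ExecutionContext } from '@/lib/runtime/ExecutionContext';
import type { DynamicStructuredTool } from '@langchain/core/tools';
import type { BaseMessage } from '@langchain/core/messages';

// Hypothetical helper showing the full lifecycle for a single task
async function runScoredTask(
  task: string,
  context: ExecutionContext,
  tool: DynamicStructuredTool,
  messages: BaseMessage[]
): Promise<void> {
  const manager = SimpleBraintrustEventManager.getInstance();

  // 1. Session start: parent span for the whole conversation
  const { parent } = await manager.startSession({
    sessionId: `session_${Date.now()}`,  // placeholder ID
    task,
    timestamp: Date.now()
  });

  // 2. Tool execution: wrapping stores duration/success in context.toolMetrics
  const wrapped = wrapToolForMetrics(tool, context, 'call_1');  // placeholder tool-call ID
  const taskStart = Date.now();
  await wrapped.func({});  // placeholder args

  // 3. Task scoring: analyze messages + tool metrics (Gemini, heuristic fallback)
  const scorer = new SimplifiedScorer();
  const score = await scorer.scoreFromMessages(
    messages, task, context.toolMetrics, Date.now() - taskStart
  );

  // 4. Score upload: four dimensions + weighted total, attached to the parent span
  await braintrustLogger.logTaskScore(task, score, Date.now() - taskStart, undefined, parent);
  manager.addTaskScore(score.weightedTotal);
  await manager.endSession('task_complete');
}
```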
### Architecture
```
NxtScape
├── SimpleBraintrustEventManager (session management)
│ └── Creates parent span for conversation
├── BrowserAgent
│ └── wrapToolForMetrics() (duration tracking)
│ └── Stores metrics in ExecutionContext.toolMetrics Map
└── SimplifiedScorer (post-execution scoring)
├── Extracts tool calls from messages
├── Uses tool metrics for accurate durations
└── Calculates 4 dimension scores
SimpleBraintrustLogger
└── Uploads scores to Braintrust dashboard
```
## Components
### EvalToolWrapper.ts (`wrapToolForMetrics`)
- Wraps tools with lightweight duration tracking
- Stores metrics in ExecutionContext.toolMetrics Map
- Adds ~1ms overhead per tool call
### EvalScorer.ts (`SimplifiedScorer`)
- Scores tasks based on message history
- 4 scoring dimensions with configurable weights
- Uses Gemini 2.5 Pro to score each dimension, with a heuristic fallback if LLM scoring fails (see the sketch below)
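A minimal standalone sketch, assuming `messages` holds the MessageManager history for a finished task and a Gemini API key is configured; the fields match `ScoreResult` in `EvalScorer.types.ts`:
```typescript
import { SimplifiedScorer } from '@/evals2';
import type { BaseMessage } from '@langchain/core/messages';

async function scoreTask(messages: BaseMessage[], query: string) {
  const scorer = new SimplifiedScorer();  // throws on use if GOOGLE_GENAI_API_KEY / GEMINI_API_KEY is missing
  // Without a toolMetrics Map, each tool call falls back to a default 100ms duration
  const score = await scorer.scoreFromMessages(messages, query);

  console.log(score.goalCompletion, score.planCorrectness);        // 1-10 each
  console.log(score.errorFreeExecution, score.contextEfficiency);  // 1-10 each
  console.log(score.weightedTotal);                                // weighted average, 1-10
  console.log(score.details.toolCalls, score.details.failedCalls, score.details.reasoning);
  return score;
}
```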
### BraintrustEventManager.ts (`SimpleBraintrustEventManager`)
- Singleton session manager
- Maintains parent span for conversation hierarchy
- Tracks task scores for session averaging
### BraintrustLogger.ts (`SimpleBraintrustLogger`)
- Simple Braintrust integration for score upload
- No complex span management
- Initializes the Braintrust logger lazily, on the first score upload
## Differences from Original System
| Aspect | Old Evals | Evals2 |
|--------|-----------|--------|
| Code Size | ~2000 lines | ~500 lines |
| Scoring Dimensions | 6 complex | 4 simple |
| Tool Tracking | Braintrust wrapTraced | Map-based duration |
| Session Management | Complex telemetry | Simple parent span |
| Dependencies | Multiple | Minimal |
## Testing
```bash
# Run unit tests
npm run test:run -- src/evals2/SimplifiedScorer.test.ts
# Run integration tests
npm run test:run -- src/evals2/integration.test.ts
```
## Monitoring
Scores appear in Braintrust dashboard at:
https://braintrust.dev/app/Felafax/p/browseros-agent-online/logs
Look for events with:
- Type: `evals2_task_score`
- Session events: `agent_session`

src/evals2/index.ts

@@ -0,0 +1,11 @@
// Main exports from evals2 simplified evaluation system
export { SimplifiedScorer } from './EvalScorer';
export { SimpleBraintrustLogger, braintrustLogger } from './BraintrustLogger';
export { SimpleBraintrustEventManager } from './BraintrustEventManager';
export { wrapToolForMetrics, wrapToolForDuration } from './EvalToolWrapper';
// Type exports
export * from './EvalScorer.types';
// Config exports
export * from './Evals.config';


@@ -45,6 +45,7 @@ import { ExecutionContext } from '@/lib/runtime/ExecutionContext';
import { MessageManager } from '@/lib/runtime/MessageManager';
import { ToolManager } from '@/lib/tools/ToolManager';
import { ExecutionMetadata } from '@/lib/types/messaging';
import { DynamicStructuredTool } from '@langchain/core/tools';
import { createPlannerTool } from '@/lib/tools/planning/PlannerTool';
import { createTodoManagerTool } from '@/lib/tools/planning/TodoManagerTool';
import { createRequirePlanningTool } from '@/lib/tools/planning/RequirePlanningTool';
@@ -71,6 +72,9 @@ import { AIMessage, AIMessageChunk } from '@langchain/core/messages';
import { PLANNING_CONFIG } from '@/lib/tools/planning/PlannerTool.config';
import { AbortError } from '@/lib/utils/Abortable';
import { GlowAnimationService } from '@/lib/services/GlowAnimationService';
// Import evals2 lightweight tool wrapper
import { wrapToolForMetrics } from '@/evals2/EvalToolWrapper';
import { ENABLE_EVALS2 } from '@/config';
import { NarratorService } from '@/lib/services/NarratorService';
import { PubSub } from '@/lib/pubsub'; // For static helper methods
import { HumanInputResponse } from '@/lib/pubsub/types';
@@ -128,6 +132,7 @@ export class BrowserAgent {
private readonly executionContext: ExecutionContext;
private readonly toolManager: ToolManager;
private readonly glowService: GlowAnimationService;
private toolsRegistered = false; // Track if tools have been registered
private narrator?: NarratorService; // Narrator service for human-friendly messages
constructor(executionContext: ExecutionContext) {
@@ -203,6 +208,11 @@ export class BrowserAgent {
// 3. STANDARD FLOW: CLASSIFY task type
const classification = await this._classifyTask(task);
// Log classification result to console for visibility
if (ENABLE_EVALS2) {
console.log(`%c→ Classification: ${classification.is_simple_task ? 'simple' : 'complex'}`, 'color: #888; font-size: 10px');
}
// Clear message history if this is not a follow-up task
if (!classification.is_followup_task) {
this.messageManager.clear();
@@ -228,6 +238,8 @@ export class BrowserAgent {
// 5. FINALISE: Generate final result
await this._generateTaskResult(task);
// Task completion is logged by NxtScape, not here
} catch (error) {
this._handleExecutionError(error, task);
} finally {
@@ -312,6 +324,7 @@ export class BrowserAgent {
const args = { task };
try {
// Tool start notification not needed in new pub-sub system
// Tool start notification not needed in new pub-sub system
const result = await classificationTool.func(args);
const parsedResult = jsonParseToolOutput(result);
@@ -319,6 +332,7 @@ export class BrowserAgent {
if (parsedResult.ok) {
const classification = parsedResult.output;
// Tool end notification not needed in new pub-sub system
// Tool end notification not needed in new pub-sub system
return {
is_simple_task: classification.is_simple_task,
is_followup_task: classification.is_followup_task
@@ -326,6 +340,7 @@ export class BrowserAgent {
}
} catch (error) {
// Tool end notification not needed in new pub-sub system
// Tool end notification not needed in new pub-sub system
}
// Default to complex task on any failure
@@ -612,7 +627,14 @@ export class BrowserAgent {
await this._maybeStartGlowAnimation(toolName);
const toolResult = await tool.func(args);
// Add evals2 lightweight wrapping if enabled
let toolFunc = tool.func;
if (ENABLE_EVALS2) {
const wrappedTool = wrapToolForMetrics(tool, this.executionContext, toolCallId);
toolFunc = wrappedTool.func;
}
const toolResult = await toolFunc(args);
const parsedResult = jsonParseToolOutput(toolResult);
@@ -661,7 +683,6 @@ export class BrowserAgent {
max_steps: BrowserAgent.MAX_STEPS_FOR_COMPLEX_TASKS
};
// Tool start for planner - not needed
const result = await plannerTool.func(args);
const parsedResult = jsonParseToolOutput(result);


@@ -1,20 +1,25 @@
import { z } from "zod";
import { PubSub } from "@/lib/pubsub";
import { Logging } from "@/lib/utils/Logging";
import { BrowserContext } from "@/lib/browser/BrowserContext";
import { ExecutionContext } from "@/lib/runtime/ExecutionContext";
import { MessageManager } from "@/lib/runtime/MessageManager";
import { profileStart, profileEnd, profileAsync } from "@/lib/utils/profiler";
import { BrowserAgent } from "@/lib/agent/BrowserAgent";
import { ChatAgent } from "@/lib/agent/ChatAgent";
import { langChainProvider } from "@/lib/llm/LangChainProvider";
import { PubSub } from "@/lib/pubsub/PubSub";
// Import evals2 components
import { SimpleBraintrustEventManager, SimplifiedScorer } from "@/evals2";
import { TokenCounter } from "@/lib/utils/TokenCounter";
import { ExecutionMetadata } from "@/lib/types/messaging";
import { ENABLE_EVALS2 } from "@/config";
/**
* Configuration schema for NxtScape agent
*/
export const NxtScapeConfigSchema = z.object({
debug: z.boolean().default(false).optional(), // Debug mode flag
experimentId: z.string().optional(), // Optional experiment ID for logging to experiments
});
/**
@@ -28,13 +33,26 @@ export type NxtScapeConfig = z.infer<typeof NxtScapeConfigSchema>;
*/
export const RunOptionsSchema = z.object({
query: z.string(), // Natural language user query
mode: z.enum(['chat', 'browse']), // Execution mode: 'chat' for Q&A, 'browse' for automation
mode: z.enum(['chat', 'browse']).optional(), // Execution mode
tabIds: z.array(z.number()).optional(), // Optional array of tab IDs for context (e.g., which tabs to summarize) - NOT for agent operation
metadata: z.any().optional(), // Execution metadata for controlling execution mode
});
export type RunOptions = z.infer<typeof RunOptionsSchema>;
/**
* Result schema for NxtScape execution
*/
export const NxtScapeResultSchema = z.object({
success: z.boolean(), // Whether the operation succeeded
error: z.string().optional(), // Error message if failed
});
/**
* Result type for NxtScape execution
*/
export type NxtScapeResult = z.infer<typeof NxtScapeResultSchema>;
/**
* Main orchestration class for the NxtScape framework.
* Manages execution context and delegates task execution to BrowserAgent.
@@ -45,7 +63,16 @@ export class NxtScape {
private executionContext!: ExecutionContext; // Will be initialized in initialize()
private messageManager!: MessageManager; // Will be initialized in initialize()
private browserAgent: BrowserAgent | null = null; // The browser agent for task execution
private chatAgent: ChatAgent | null = null; // The chat agent for Q&A mode
private currentQuery: string | null = null; // Track current query for better cancellation messages
// Evals2 simplified evaluation components
private evals2Manager: SimpleBraintrustEventManager | null = null;
private evals2Enabled: boolean = false;
private telemetrySessionId: string | null = null; // For evals2 session tracking
private telemetryParentSpan: string | null = null; // For evals2 parent span
private taskStartTime: number = 0; // Track individual task timing
private taskCount: number = 0; // Track number of tasks in conversation
/**
* Creates a new NxtScape orchestration agent
@@ -97,7 +124,10 @@ export class NxtScape {
// Initialize the browser agent with execution context
this.browserAgent = new BrowserAgent(this.executionContext);
this.chatAgent = new ChatAgent(this.executionContext);
// Note: Telemetry session initialization is deferred until first task execution
// This prevents creating empty sessions when extension is just opened/closed
Logging.log(
"NxtScape",
"NxtScape initialization completed successfully",
@@ -113,6 +143,7 @@ export class NxtScape {
// Clean up partial initialization
this.browserContext = null as any;
this.browserAgent = null;
throw new Error(`NxtScape initialization failed: ${errorMessage}`);
}
@@ -124,7 +155,15 @@ export class NxtScape {
* @returns True if initialized, false otherwise
*/
public isInitialized(): boolean {
return this.browserContext !== null && !!this.browserAgent && !!this.chatAgent;
return this.browserContext !== null && this.browserAgent !== null;
}
/**
* Set chat mode (for backward compatibility)
* @param enabled - Whether chat mode is enabled
*/
public setChatMode(enabled: boolean): void {
this.executionContext.setChatMode(enabled);
}
/**
@@ -141,57 +180,52 @@ export class NxtScape {
}> {
// Ensure initialization
if (!this.isInitialized()) {
await this.initialize();
}
// Refresh token limit in case provider settings changed
const modelCapabilities = await langChainProvider.getModelCapabilities();
if (modelCapabilities.maxTokens !== this.messageManager.getMaxTokens()) {
Logging.log("NxtScape",
`Updating MessageManager token limit from ${this.messageManager.getMaxTokens()} to ${modelCapabilities.maxTokens}`);
this.messageManager.setMaxTokens(modelCapabilities.maxTokens);
await this.initialize();
}
const parsedOptions = RunOptionsSchema.parse(options);
const { query, tabIds, mode, metadata } = parsedOptions;
const { query, tabIds, mode = 'browse', metadata } = parsedOptions;
const startTime = Date.now();
Logging.log(
"NxtScape",
`Processing user query in ${mode} mode: ${query}${
`Processing user query with unified classification: ${query}${
tabIds ? ` (${tabIds.length} tabs)` : ""
}`,
);
// Validate browser context
if (!this.browserContext) {
throw new Error("NxtScape.initialize() must be awaited before run()");
}
// Clean up any running task (after initialization ensures executionContext exists)
if (this.isRunning()) {
Logging.log("NxtScape", "Another task is already running. Cleaning up...");
Logging.log(
"NxtScape",
"Another task is already running. Cleaning up...",
);
this._internalCancel();
}
// Reset abort controller if needed (executionContext guaranteed to exist after init)
if (this.executionContext && this.executionContext.abortController.signal.aborted) {
// Reset abort controller if it's aborted (from pause or previous execution)
if (this.executionContext.abortController.signal.aborted) {
this.executionContext.resetAbortController();
}
// Get current page and lock execution
// Always get the current page from browser context - this is the tab the agent will operate on
profileStart("NxtScape.getCurrentPage");
const currentPage = await this.browserContext.getCurrentPage();
const currentTabId = currentPage.tabId;
profileEnd("NxtScape.getCurrentPage");
// Lock browser context to current tab
// Lock browser context to the current tab to prevent tab switches during execution
this.browserContext.lockExecutionToTab(currentTabId);
// Start execution context
// Mark execution as started
this.executionContext.startExecution(currentTabId);
// Set selected tab IDs for context
// Set selected tab IDs for context (e.g., for summarizing multiple tabs)
// These are NOT the tabs the agent operates on, just context for tools like ExtractTool
this.executionContext.setSelectedTabIds(tabIds || [currentTabId]);
// Publish running status
@@ -204,35 +238,51 @@ export class NxtScape {
* Executes the appropriate agent based on mode
* @private
*/
private async _executeAgent(query: string, mode: 'chat' | 'browse', metadata?: any): Promise<void> {
private async _executeAgent(query: string, mode: 'chat' | 'browse', metadata?: any, tabIds?: number[]): Promise<void> {
// Chat mode is not currently implemented, always use browse mode
if (mode === 'chat') {
if (!this.chatAgent) {
throw new Error('Chat agent not initialized');
}
await this.chatAgent.execute(query);
} else {
if (!this.browserAgent) {
throw new Error('Browser agent not initialized');
}
await this.browserAgent.execute(query, metadata as ExecutionMetadata | undefined);
throw new Error('Chat mode is not currently implemented');
}
this.currentQuery = query;
// Initialize telemetry session on first task if not already initialized
// This ensures we only create sessions when there's actual work
if (!this.telemetrySessionId) {
await this._initializeTelemetrySession();
}
// Track task start for evals2
if (this.evals2Enabled) {
this.taskCount++;
this.taskStartTime = Date.now();
console.log(`%c→ Task ${this.taskCount}: "${query.substring(0, 40)}..."`, 'color: #00ff00; font-size: 10px');
}
// Pass evals2 parent span to execution context for tool wrapping
this.executionContext.parentSpanId = this.telemetryParentSpan;
Logging.log("NxtScape", "Agent execution completed");
}
try {
// Check that browser agent is initialized
if (!this.browserAgent) {
throw new Error("BrowserAgent not initialized");
}
/**
* Handles execution errors and publishes appropriate status
* @private
*/
private _handleExecutionError(error: unknown): void {
const errorMessage = error instanceof Error ? error.message : String(error);
const wasCancelled = error instanceof Error && error.name === "AbortError";
// Execute the browser agent with the task
await this.browserAgent.execute(query, metadata as ExecutionMetadata | undefined);
// BrowserAgent handles all logging and result management internally
Logging.log("NxtScape", "Agent execution completed");
} catch (error) {
const errorMessage = error instanceof Error ? error.message : String(error);
const wasCancelled = error instanceof Error && error.name === "AbortError";
if (wasCancelled) {
Logging.log("NxtScape", `Execution cancelled: ${errorMessage}`);
PubSub.getInstance().publishExecutionStatus('cancelled', errorMessage);
} else {
Logging.log("NxtScape", `Execution error: ${errorMessage}`, "error");
if (wasCancelled) {
Logging.log("NxtScape", `Execution cancelled: ${errorMessage}`);
} else {
Logging.log("NxtScape", `Execution error: ${errorMessage}`, "error");
}
// Publish error status
PubSub.getInstance().publishExecutionStatus('error', errorMessage);
@@ -243,6 +293,68 @@ export class NxtScape {
'error'
);
PubSub.getInstance().publishMessage(errorMsg);
} finally {
// Add evals2 scoring if enabled - runs even if task was paused or errored
if (this.evals2Enabled && this.evals2Manager) {
const taskEndTime = Date.now();
const duration = this.taskStartTime ? taskEndTime - this.taskStartTime : 0;
try {
// Score the task
const scorer = new SimplifiedScorer();
const messages = this.messageManager.getMessages();
const score = await scorer.scoreFromMessages(
messages,
query,
this.executionContext.toolMetrics, // Pass tool metrics for duration data
duration // Pass actual task execution duration
);
// Calculate context metrics using TokenCounter for accuracy
const messageCount = messages.length;
const totalCharacters = messages.reduce((sum, msg) => {
const content = typeof msg.content === 'string' ? msg.content : JSON.stringify(msg.content);
return sum + content.length;
}, 0);
const estimatedTokens = TokenCounter.countMessages(messages); // Use proper token counting
// Log to console with more details
console.log('Evals2 Score:', {
goal: score.goalCompletion.toFixed(2),
plan: score.planCorrectness.toFixed(2),
errors: score.errorFreeExecution.toFixed(2),
context: score.contextEfficiency.toFixed(2),
total: score.weightedTotal.toFixed(2),
messages: messageCount,
tokens: estimatedTokens
});
// Upload to Braintrust with parent span and context metrics
const { braintrustLogger } = await import('@/evals2/BraintrustLogger');
await braintrustLogger.logTaskScore(
query,
score,
duration,
{
selectedTabIds: tabIds || [],
mode: mode || 'browse'
},
this.telemetryParentSpan || undefined,
{
messageCount,
totalCharacters,
estimatedTokens
}
);
// Add score to session manager for averaging
this.evals2Manager.addTaskScore(score.weightedTotal);
} catch (error) {
console.warn('Evals2 scoring failed:', error);
// Don't break execution if scoring fails
}
}
}
}
@@ -289,14 +401,26 @@ export class NxtScape {
executionContext = await this._prepareExecution(options);
// Phase 2: Execute agent
await this._executeAgent(executionContext.query, executionContext.mode, executionContext.metadata);
await this._executeAgent(executionContext.query, executionContext.mode, executionContext.metadata, executionContext.tabIds);
// Success: Publish done status
PubSub.getInstance().publishExecutionStatus('done');
} catch (error) {
// Phase 3: Handle errors
this._handleExecutionError(error);
const errorMessage = error instanceof Error ? error.message : String(error);
const wasCancelled = error instanceof Error && error.name === "AbortError";
if (wasCancelled) {
Logging.log("NxtScape", `Execution cancelled: ${errorMessage}`);
} else {
Logging.log("NxtScape", `Execution error: ${errorMessage}`, "error");
}
// Publish error status
PubSub.getInstance().publishExecutionStatus('error', errorMessage);
// Error scoring handled by evals2 if enabled
} finally {
// Phase 4: Always cleanup
if (executionContext) {
@@ -308,19 +432,27 @@ export class NxtScape {
public isRunning(): boolean {
return this.executionContext && this.executionContext.isExecuting();
return this.executionContext.isExecuting();
}
/**
* Cancel the currently running task
* @returns Object with cancellation info including the query that was cancelled
*/
public cancel(): void {
if (this.executionContext) {
Logging.log("NxtScape", "User cancelling current task execution");
this.executionContext.cancelExecution( true);
public async cancel(): Promise<{ wasCancelled: boolean; query?: string }> {
if (this.executionContext && !this.executionContext.abortController.signal.aborted) {
const cancelledQuery = this.currentQuery;
Logging.log(
"NxtScape",
`User cancelling current task execution: "${cancelledQuery}"`,
);
// Pause scoring handled by evals2 if enabled
this.executionContext.cancelExecution(
/*isUserInitiatedCancellation=*/ true,
);
// Publish cancelled status with message
PubSub.getInstance().publishExecutionStatus('cancelled', 'Task cancelled by user');
// Emit a friendly pause message so UI shows clear state
PubSub.getInstance().publishMessage(
PubSub.createMessageWithId(
@@ -329,7 +461,11 @@ export class NxtScape {
'assistant'
)
);
return { wasCancelled: true, query: cancelledQuery || undefined };
}
return { wasCancelled: false };
}
/**
@@ -339,32 +475,16 @@ export class NxtScape {
* @private
*/
private _internalCancel(): void {
if (this.executionContext) {
Logging.log("NxtScape", "Internal cleanup: cancelling previous execution");
if (this.executionContext && !this.executionContext.abortController.signal.aborted) {
Logging.log(
"NxtScape",
"Internal cleanup: cancelling previous execution",
);
// false = not user-initiated, this is internal cleanup
this.executionContext.cancelExecution(false);
}
}
/**
* Enable or disable chat mode (Q&A mode)
* @param enabled - Whether to enable chat mode
*/
public setChatMode(enabled: boolean): void {
if (this.executionContext) {
this.executionContext.setChatMode(enabled);
Logging.log("NxtScape", `Chat mode ${enabled ? 'enabled' : 'disabled'}`);
}
}
/**
* Check if chat mode is enabled
* @returns Whether chat mode is enabled
*/
public isChatMode(): boolean {
return this.executionContext ? this.executionContext.isChatMode() : false;
}
/**
* Get the current execution status
* @returns Object with execution status information
@@ -372,10 +492,12 @@ export class NxtScape {
public getExecutionStatus(): {
isRunning: boolean;
lockedTabId: number | null;
query: string | null;
} {
return {
isRunning: this.isRunning(),
lockedTabId: this.executionContext.getLockedTabId(),
query: this.currentQuery,
};
}
@@ -383,43 +505,91 @@ export class NxtScape {
* Clear conversation history (useful for reset functionality)
*/
public reset(): void {
// 1. Stop current task if running
// stop the current task if it is running
if (this.isRunning()) {
// Use internal cancel to avoid publishing status
this._internalCancel();
this.cancel();
}
// 2. Clean up existing agents (call cleanup to unsubscribe)
if (this.browserAgent) {
this.browserAgent.cleanup();
this.browserAgent = null;
}
if (this.chatAgent) {
this.chatAgent.cleanup();
this.chatAgent = null;
}
// 3. Clear PubSub buffer only (NOT subscribers - UI needs to stay subscribed!)
PubSub.getInstance().clearBuffer();
// 4. Clear message history
// Clear current query to ensure clean state
this.currentQuery = null;
// End current telemetry session if one exists
if (this.telemetrySessionId) {
this._endTelemetrySession('user_reset');
}
this.taskCount = 0; // Reset task counter for new conversation
// Note: New session will be created on next task execution
// Clear MessageManager history
this.messageManager.clear();
// 5. Reset execution context and abort controller
// reset the execution context
this.executionContext.reset();
// Ensure abort controller is reset for next run
if (this.executionContext.abortController.signal.aborted) {
this.executionContext.resetAbortController();
}
// 6. Recreate agents with fresh state (they will subscribe themselves)
this.browserAgent = new BrowserAgent(this.executionContext);
this.chatAgent = new ChatAgent(this.executionContext);
// forces NxtScape to be initialized again
// this would pick up the new MessageManager context length, etc
this.browserAgent = null;
Logging.log(
"NxtScape",
"Conversation history and state cleared completely",
);
}
/**
* Initialize evals2 session for conversation tracking
* This creates a parent session that spans multiple tasks
*/
private async _initializeTelemetrySession(): Promise<void> {
// Check if evals2 is enabled
this.evals2Enabled = ENABLE_EVALS2;
if (!this.evals2Enabled) {
return;
}
// Use simplified evals2 system
try {
this.evals2Manager = SimpleBraintrustEventManager.getInstance();
if (!this.evals2Manager.isEnabled()) {
this.evals2Manager = null;
this.evals2Enabled = false;
return;
}
const sessionId = crypto.randomUUID();
const { parent } = await this.evals2Manager.startSession({
sessionId,
task: this.currentQuery || 'No query provided',
timestamp: Date.now(),
agentVersion: typeof chrome !== 'undefined' ? chrome.runtime.getManifest().version : 'unknown'
});
this.telemetrySessionId = sessionId;
this.telemetryParentSpan = parent || null;
// Also update execution context for tool wrapping
if (this.executionContext) {
this.executionContext.parentSpanId = this.telemetryParentSpan;
}
} catch (error) {
// Silent failure
this.evals2Enabled = false;
}
}
/**
* End the current evals2 session
* @param reason - Why the session is ending (reset, close, timeout, etc.)
*/
private async _endTelemetrySession(reason: string = 'unknown'): Promise<void> {
// Handle evals2 session end
if (this.evals2Enabled && this.evals2Manager) {
await this.evals2Manager.endSession(reason);
this.telemetrySessionId = null;
this.telemetryParentSpan = null;
}
}
}

View File

@@ -1,5 +1,5 @@
import { z } from 'zod'
import BrowserContext from '../browser/BrowserContext'
import BrowserContext from '@/lib/browser/BrowserContext'
import { MessageManager } from '@/lib/runtime/MessageManager'
import { getLLM as getLLMFromProvider } from '@/lib/llm/LangChainProvider'
import { BaseChatModel } from '@langchain/core/language_models/chat_models'
@@ -16,7 +16,7 @@ export const ExecutionContextOptionsSchema = z.object({
messageManager: z.instanceof(MessageManager), // Message manager for communication
debugMode: z.boolean().default(false), // Whether to enable debug logging
todoStore: z.instanceof(TodoStore).optional() // TODO store for complex task management
})
}).passthrough() // Allow extra properties to be passed (like abortController from tests)
export type ExecutionContextOptions = z.infer<typeof ExecutionContextOptionsSchema>
@@ -30,16 +30,27 @@ export class ExecutionContext {
debugMode: boolean // Whether debug logging is enabled
selectedTabIds: number[] | null = null // Selected tab IDs
todoStore: TodoStore // TODO store for complex task management
parentSpanId: string | null = null // Parent span ID for evals2 tracing
private userInitiatedCancel: boolean = false // Track if cancellation was user-initiated
private _isExecuting: boolean = false // Track actual execution state
private _lockedTabId: number | null = null // Tab that execution is locked to
private _currentTask: string | null = null // Current user task being executed
private _chatMode: boolean = false // Whether ChatAgent mode is enabled
private _taskNumber: number = 0 // Track number of user tasks in this session
private _humanInputRequestId: string | undefined // Current human input request ID
private _humanInputResponse: HumanInputResponse | undefined // Human input response
// Tool metrics Map for evals2 lightweight tracking
toolMetrics: Map<string, {
toolName: string
duration: number
success: boolean
timestamp: number
error?: string
}> | undefined
constructor(options: ExecutionContextOptions) {
// Validate options at runtime
// Validate options at runtime with proper type checking
const validatedOptions = ExecutionContextOptionsSchema.parse(options)
// Create our own AbortController - single source of truth
@@ -146,6 +157,9 @@ export class ExecutionContext {
this.userInitiatedCancel = false;
this._currentTask = null;
this.todoStore.reset();
// Clear tool metrics for evals2
this.toolMetrics?.clear();
this.toolMetrics = undefined;
}
/**
@@ -163,6 +177,7 @@ export class ExecutionContext {
*/
public setCurrentTask(task: string): void {
this._currentTask = task;
this._taskNumber++; // Increment task counter when new user task starts
}
/**
@@ -173,6 +188,14 @@ export class ExecutionContext {
return this._currentTask;
}
/**
* Get the current task number (how many user tasks in this session)
* @returns The current task number (1-based)
*/
public getCurrentTaskNumber(): number {
return this._taskNumber;
}
/**
* Get KlavisAPIManager singleton for MCP operations
* @returns The KlavisAPIManager instance

View File

@@ -62,4 +62,4 @@ Keep todos single-level without nesting.`,
}
}
})
}
}

View File

@@ -422,6 +422,7 @@ export const PlanGenerationUpdateMessageSchema = MessageSchema.extend({
export type PlanGenerationUpdateMessage = z.infer<typeof PlanGenerationUpdateMessageSchema>
/**
* Union of all message types
*/

View File

@@ -0,0 +1,259 @@
import React, { useState, useEffect } from 'react'
import { Button } from '@/sidepanel/components/ui/button'
import { Beaker } from 'lucide-react'
import { MessageType } from '@/lib/types/messaging'
import { isDevelopmentMode, ENABLE_TELEMETRY } from '@/config'
interface ExperimentModalProps {
trackClick: (action: string) => void
sendMessage: (type: MessageType, payload: any) => void
addMessageListener: <T>(type: MessageType, handler: (payload: T) => void) => void
removeMessageListener: <T>(type: MessageType, handler: (payload: T) => void) => void
isProcessing: boolean
}
export function ExperimentModal({
trackClick,
sendMessage,
addMessageListener,
removeMessageListener,
isProcessing
}: ExperimentModalProps) {
const [experimentStatus, setExperimentStatus] = useState<string>('')
const [isRunningExperiment, setIsRunningExperiment] = useState(false)
const [showExperimentModal, setShowExperimentModal] = useState(false)
const [experimentConfig, setExperimentConfig] = useState({
logsTag: ''
})
const [availableTags, setAvailableTags] = useState<Array<{tag: string, count: number}>>([])
const [isLoadingTags, setIsLoadingTags] = useState(false)
const [tagsError, setTagsError] = useState<string | null>(null)
const fetchAvailableTags = () => {
setIsLoadingTags(true)
setTagsError(null)
sendMessage(MessageType.FETCH_AVAILABLE_TAGS, {})
}
const handleRunExperiment = () => {
trackClick('run_experiment')
setShowExperimentModal(true)
// Fetch tags when modal opens (if not already loaded)
if (availableTags.length === 0) {
fetchAvailableTags()
}
}
const handleStartExperiment = () => {
setShowExperimentModal(false)
setIsRunningExperiment(true)
setExperimentStatus('Starting experiment...')
// Send message to background with configured values
sendMessage(MessageType.RUN_EXPERIMENT, {
logsTag: experimentConfig.logsTag
})
}
// Handle escape key to close experiment modal
useEffect(() => {
const handleEscape = (e: KeyboardEvent) => {
if (e.key === 'Escape' && showExperimentModal) {
setShowExperimentModal(false)
}
}
if (showExperimentModal) {
document.addEventListener('keydown', handleEscape)
return () => document.removeEventListener('keydown', handleEscape)
}
}, [showExperimentModal])
// Listen for available tags response
useEffect(() => {
const handler = (payload: any) => {
setIsLoadingTags(false)
if (payload.status === 'success') {
// console.log('Received tags:', payload.tags)
setAvailableTags(payload.tags || [])
setTagsError(null)
} else {
setTagsError(payload.error || 'Failed to fetch tags')
}
}
addMessageListener(MessageType.AVAILABLE_TAGS_RESPONSE, handler)
return () => removeMessageListener(MessageType.AVAILABLE_TAGS_RESPONSE, handler)
}, [addMessageListener, removeMessageListener])
// Listen for experiment updates
useEffect(() => {
const handler = (payload: any) => {
const { status, message: statusMessage, progress, results, error } = payload
if (status === 'error') {
setExperimentStatus(`Error: ${error}`)
setIsRunningExperiment(false)
setTimeout(() => setExperimentStatus(''), 15000) // Show error for 15 seconds
} else if (status === 'completed' && isRunningExperiment) {
setExperimentStatus('Experiment completed!')
setIsRunningExperiment(false)
// Log results to console for debugging (only if experiment was running)
// console.log('Experiment Results:', results)
// If we have a compare URL, open it in a new tab
if (results?.compareUrl) {
console.log('Compare experiments at:', results.compareUrl)
}
// Show summary
if (results?.results) {
const successful = results.results.filter((r: any) => r.success).length
const total = results.results.length
setExperimentStatus(`Completed: ${successful}/${total} successful`)
}
setTimeout(() => setExperimentStatus(''), 15000) // Show success for 15 seconds
} else if (status === 'running' && progress) {
setExperimentStatus(`${progress.current}/${progress.total} - ${statusMessage}`)
} else {
setExperimentStatus(statusMessage || status)
}
}
addMessageListener(MessageType.EXPERIMENT_UPDATE, handler)
return () => removeMessageListener(MessageType.EXPERIMENT_UPDATE, handler)
}, [addMessageListener, removeMessageListener, isRunningExperiment])
return (
<>
{/* Experiment button - Dev mode + telemetry enabled only */}
{isDevelopmentMode() && ENABLE_TELEMETRY && (
<Button
onClick={handleRunExperiment}
variant="ghost"
size="sm"
className="h-9 w-9 p-0 rounded-xl hover:bg-brand/10 hover:text-brand transition-all duration-300"
aria-label="Run experiment"
disabled={isRunningExperiment}
>
<Beaker className="w-4 h-4" />
</Button>
)}
{/* Experiment Status Message */}
{experimentStatus && (
<div
className={`fixed top-12 left-0 right-0 z-40 px-4 py-2 text-sm whitespace-pre-wrap ${
experimentStatus.includes('Error')
? 'bg-red-100 text-red-700 dark:bg-red-900 dark:text-red-200'
: 'bg-blue-100 text-blue-700 dark:bg-blue-900 dark:text-blue-200'
}`}
>
{experimentStatus}
</div>
)}
{/* Experiment Configuration Modal */}
{showExperimentModal && (
<div
className="fixed inset-0 bg-black/50 backdrop-blur-sm z-50 flex items-center justify-center p-4"
onClick={(e) => {
// Close modal when clicking on backdrop
if (e.target === e.currentTarget) {
setShowExperimentModal(false)
}
}}
>
<div className="bg-background rounded-xl shadow-xl max-w-md w-full p-6 space-y-4 animate-in zoom-in-95 duration-200">
<div className="flex items-center justify-between">
<h2 className="text-lg font-semibold">Configure Experiment</h2>
<button
onClick={() => setShowExperimentModal(false)}
className="text-muted-foreground hover:text-foreground transition-colors"
>
</button>
</div>
<div className="space-y-4">
<div>
<div className="flex items-center justify-between mb-2">
<label className="block text-sm font-medium">
Logs Tag (source data)
</label>
<button
onClick={fetchAvailableTags}
className="text-xs text-muted-foreground hover:text-foreground transition-colors"
disabled={isLoadingTags}
type="button"
>
{isLoadingTags ? '⟳ Loading...' : '⟳ Refresh'}
</button>
</div>
{isLoadingTags ? (
<div className="w-full px-3 py-2 rounded-lg border border-border bg-background text-muted-foreground">
Loading tags...
</div>
) : tagsError ? (
<div className="text-red-500 text-sm">{tagsError}</div>
) : (
<select
value={experimentConfig.logsTag}
onChange={(e) => {
const newLogsTag = e.target.value
setExperimentConfig({
logsTag: newLogsTag
})
}}
className="w-full px-3 py-2 rounded-lg border border-border bg-background text-foreground"
>
<option value="">Select a tag...</option>
{availableTags.map((item) => {
const { tag, count } = item || {}
if (!tag) return null
return (
<option key={tag} value={tag}>
{tag} ({count} {count === 1 ? 'prompt' : 'prompts'})
</option>
)
})}
</select>
)}
{experimentConfig.logsTag && (
<>
<p className="text-xs text-muted-foreground mt-1">
Fetches prompts tagged with: {experimentConfig.logsTag}
</p>
</>
)}
</div>
</div>
<div className="flex gap-3 justify-end">
<Button
onClick={() => setShowExperimentModal(false)}
variant="ghost"
size="sm"
>
Cancel
</Button>
<Button
onClick={handleStartExperiment}
size="sm"
className="bg-brand hover:bg-brand/90"
disabled={!experimentConfig.logsTag}
>
Start Experiment
</Button>
</div>
</div>
</div>
)}
</>
)
}

View File

@@ -5,7 +5,8 @@ import { MessageType } from '@/lib/types/messaging'
import { useAnalytics } from '../hooks/useAnalytics'
import { SettingsModal } from './SettingsModal'
import { HelpSection } from './HelpSection'
import { Settings, Pause, RotateCcw, ChevronDown, Plus, Trash2, Star } from 'lucide-react'
// import { ExperimentModal } from './ExperimentModal' // Removed - old evals system deprecated
import { HelpCircle, Settings, Pause, RotateCcw, ChevronDown, Plus, Trash2, Star } from 'lucide-react'
import { useSettingsStore } from '@/sidepanel/stores/settingsStore'
import { useEffect } from 'react'
import { z } from 'zod'
@@ -319,6 +320,15 @@ export const Header = memo(function Header({ onReset, showReset, isProcessing }:
<Settings className="w-4 h-4" />
</Button>
{/* Experiment Modal - renders its own button */}
{/* <ExperimentModal
trackClick={trackClick}
sendMessage={sendMessage}
addMessageListener={addMessageListener}
removeMessageListener={removeMessageListener}
isProcessing={isProcessing}
/> */} {/* Commented out - old evals system deprecated */}
{isProcessing && (
<Button
onClick={handleCancel}

View File

@@ -0,0 +1,638 @@
# Evals2 Gemini 2.5 Pro Enhancement Implementation Plan
## Overview
Enhance the evals2 scoring system to use Gemini 2.5 Pro via LangChain with its full 2M token context window, implement better scoring prompts, leverage ExecutionContext.toolMetrics for time-based efficiency scoring, and use a 10-point scoring scale for higher granularity.
## Current State Analysis
The current implementation uses OpenAI GPT-4o-mini for scoring with simple prompts and 0-1 score ranges. The system already collects toolMetrics in ExecutionContext but doesn't utilize the duration data. Scoring is done through SimplifiedScorer with four dimensions weighted at specific percentages.
### Key Discoveries:
- SimplifiedScorer.getLLM() at line 15-24 creates LLM instances dynamically
- Scoring methods currently return 0-1 values (to be replaced with direct 1-10 scoring)
- ExecutionContext.toolMetrics (line 44-50) already tracks duration, success, timestamp
- LangChainProvider supports Google Gemini via ChatGoogleGenerativeAI (line 404)
- Current prompts are minimal and inline (lines 130-140, 168-177)
## Desired End State
After implementation:
- All scoring uses Gemini 2.5 Pro exclusively (hardcoded, no configuration)
- Full untruncated message history passed to LLM (2M context window)
- Rich, detailed prompts for each scoring dimension
- Time-based plan efficiency using actual execution duration
- All scores on 1-10 scale with clear criteria
### Verification:
- Scoring always uses Gemini 2.5 Pro regardless of config
- No truncation of message history (remove slice(-5) limitations)
- Detailed prompts produce more accurate scores
- Plan efficiency correlates with actual execution time
- All scores returned in 1-10 range
## What We're NOT Doing
- Not refactoring the entire scoring architecture
- Not changing the four scoring dimensions or their weights
- Not modifying the Braintrust logging infrastructure
- Not changing how toolMetrics are collected
- Not creating a complex prompt management system
- Not adding score conversion functions (LLM returns 1-10 directly)
- Not adding configuration options for model selection
## Implementation Approach
Minimal refactor approach focusing on three key changes:
1. Force Gemini 2.5 Pro in getLLM() method
2. Create detailed prompts in SimplifiedScorer.prompt.ts file
3. Add time-based scoring helpers (no conversion needed - LLM returns 1-10)
## Phase 1: Setup Gemini Provider and Update Types
### Overview
Configure SimplifiedScorer to always use Gemini 2.5 Pro and update score types to support 1-10 scale.
### Changes Required:
#### 1. Update Score Types
**File**: `src/evals2/types.ts`
**Changes**: Modify score ranges from 0-1 to 1-10
```typescript
// Scoring result schema
export const ScoreResultSchema = z.object({
goalCompletion: z.number().min(1).max(10), // How well goal was achieved (1-10 scale)
planCorrectness: z.number().min(1).max(10), // Quality and efficiency of the plan (1-10 scale)
errorFreeExecution: z.number().min(1).max(10), // Error-free execution score (1-10 scale)
contextEfficiency: z.number().min(1).max(10), // Efficient context usage (1-10 scale)
weightedTotal: z.number().min(1).max(10), // Weighted average (1-10 scale)
details: z.object({ // Scoring details
toolCalls: z.number(), // Total number of tool calls
failedCalls: z.number(), // Number of failed calls
retries: z.number(), // Number of retried calls
totalDurationMs: z.number().optional(), // Total execution duration in ms
reasoning: z.string().optional() // LLM reasoning
})
});
```
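Because these values come back from an LLM, it is worth validating at the boundary. A minimal sketch (the `candidateScore` variable is illustrative, not existing code):
```typescript
// Sketch: validate an LLM-produced score object against the schema.
// safeParse reports a ZodError instead of throwing, so callers can fall back to heuristics.
const parsed = ScoreResultSchema.safeParse(candidateScore);
if (!parsed.success) {
  console.warn('Score failed 1-10 validation, falling back to heuristics:', parsed.error.issues);
}
```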
#### 2. Update Configuration Constants
**File**: `src/evals2/config.ts`
**Changes**: Add Gemini-specific constants
```typescript
// Gemini 2.5 Pro configuration (hardcoded for evals2)
export const GEMINI_SCORING_CONFIG = {
provider: 'google_gemini',
modelId: 'gemini-2.5-pro',
temperature: 0,
maxTokens: 8192, // Output tokens for scoring
contextWindow: 2000000 // 2M token context
} as const;
// Time buckets for plan efficiency scoring (in milliseconds)
// NTN: Using 10-point scale for finer granularity
export const TIME_EFFICIENCY_BUCKETS = {
perfect: 30000, // < 30s = 10
exceptional: 60000, // < 1 min = 9
excellent: 120000, // < 2 min = 8
veryGood: 180000, // < 3 min = 7
good: 240000, // < 4 min = 6
average: 300000, // < 5 min = 5
belowAverage: 360000, // < 6 min = 4
poor: 480000, // < 8 min = 3
veryPoor: 600000, // < 10 min = 2
terrible: Infinity // > 10 min = 1
} as const;
```
### Success Criteria:
#### Automated Verification:
- [ ] Type checking passes: `npm run typecheck`
- [ ] Existing tests still pass: `npm run test:run -- src/evals2`
- [ ] No linting errors: `npm run lint`
#### Manual Verification:
- [ ] Score types properly validated as 1-10 range
- [ ] Configuration constants accessible
---
## Phase 2: Implement Gemini LLM Integration
### Overview
Modify SimplifiedScorer to always use Gemini 2.5 Pro regardless of configuration.
### Changes Required:
#### 1. Force Gemini Provider in SimplifiedScorer
**File**: `src/evals2/SimplifiedScorer.ts`
**Changes**: Update getLLM() method and add helper methods
```typescript
import { ChatGoogleGenerativeAI } from '@langchain/google-genai';
import { GEMINI_SCORING_CONFIG, TIME_EFFICIENCY_BUCKETS } from './config';
private async getLLM(): Promise<BaseChatModel | null> {
if (!this.llm) {
try {
// Always use Gemini 2.5 Pro for scoring
const apiKey = process.env.GOOGLE_GENAI_API_KEY || process.env.GEMINI_API_KEY;
if (!apiKey) {
console.warn('No Gemini API key found, falling back to default LLM');
this.llm = await getLLM({ temperature: 0, maxTokens: 100 });
} else {
this.llm = new ChatGoogleGenerativeAI({
model: GEMINI_SCORING_CONFIG.modelId,
temperature: GEMINI_SCORING_CONFIG.temperature,
maxOutputTokens: GEMINI_SCORING_CONFIG.maxTokens,
apiKey: apiKey,
convertSystemMessageToHumanContent: true
});
}
} catch (error) {
console.error('Failed to initialize Gemini for scoring:', error);
return null;
}
}
return this.llm;
}
/**
* Calculate total duration from tool metrics
*/
private getTotalDuration(toolCalls: ToolExecution[]): number {
return toolCalls.reduce((sum, tool) => sum + (tool.duration || 0), 0);
}
/**
* Score efficiency based on execution time
* NTN: Direct 10-point scale, no conversion needed
*/
private scoreTimeEfficiency(durationMs: number): number {
if (durationMs <= TIME_EFFICIENCY_BUCKETS.perfect) return 10;
if (durationMs <= TIME_EFFICIENCY_BUCKETS.exceptional) return 9;
if (durationMs <= TIME_EFFICIENCY_BUCKETS.excellent) return 8;
if (durationMs <= TIME_EFFICIENCY_BUCKETS.veryGood) return 7;
if (durationMs <= TIME_EFFICIENCY_BUCKETS.good) return 6;
if (durationMs <= TIME_EFFICIENCY_BUCKETS.average) return 5;
if (durationMs <= TIME_EFFICIENCY_BUCKETS.belowAverage) return 4;
if (durationMs <= TIME_EFFICIENCY_BUCKETS.poor) return 3;
if (durationMs <= TIME_EFFICIENCY_BUCKETS.veryPoor) return 2;
return 1;
}
```
### Success Criteria:
#### Automated Verification:
- [ ] SimplifiedScorer compiles without errors: `npm run build:dev`
- [ ] Unit tests pass: `npm run test:run -- src/evals2/SimplifiedScorer.test.ts`
#### Manual Verification:
- [ ] Gemini provider is used when API key is available
- [ ] Fallback to default LLM works when no Gemini key
- [ ] Helper methods correctly calculate durations and time-based scores
---
## Phase 3: Create Detailed Scoring Prompts
### Overview
Create a new prompts file with rich, detailed prompts for each scoring dimension that leverage the full context and return 10-point scores directly.
### Changes Required:
#### 1. Create Scoring Prompts Module
**File**: `src/evals2/SimplifiedScorer.prompt.ts`
**Changes**: New file with detailed prompts for 10-point scoring
```typescript
import { BaseMessage } from '@langchain/core/messages';
import { ToolExecution } from './types';
/**
* Scoring prompts for Gemini 2.5 Pro - returns 1-10 scores directly
* NTN: Leverages full 2M token context, no truncation needed
*/
export function getComprehensiveScoringPrompt(
messages: BaseMessage[],
query: string,
toolCalls: ToolExecution[],
totalDurationMs: number
): string {
// Build complete execution context
const messageHistory = messages.map((msg, idx) =>
`[${idx}] ${msg._getType()}: ${msg.content}`
).join('\n');
const toolSequence = toolCalls.map((tool, idx) =>
`[${idx}] ${tool.toolName} (${tool.duration}ms, ${tool.success ? '✓' : '✗'})`
).join('\n');
const failedTools = toolCalls.filter(t => !t.success);
const retryCount = countConsecutiveDuplicates(toolCalls);
return `You are an expert evaluator assessing an AI agent's task execution.
## TASK
User Request: "${query}"
## EXECUTION METRICS
- Total Duration: ${totalDurationMs}ms (${(totalDurationMs/1000).toFixed(1)}s)
- Tool Calls: ${toolCalls.length}
- Failed Calls: ${failedTools.length}
- Retries Detected: ${retryCount}
## TOOL EXECUTION SEQUENCE
${toolSequence}
## COMPLETE MESSAGE HISTORY
${messageHistory}
## SCORING INSTRUCTIONS
Analyze the execution and provide scores for each dimension on a 1-10 scale.
### 1. GOAL COMPLETION (Weight: 40%)
Did the agent achieve what the user requested?
- 10: Perfect completion, exceeded expectations
- 9: Fully completed with excellent quality
- 8: Fully completed with good quality
- 7: Mostly completed with minor gaps
- 6: Partially completed, main goal achieved
- 5: Half completed, significant gaps
- 4: Less than half completed
- 3: Minimal progress, mostly incomplete
- 2: Failed with very little progress
- 1: Complete failure, no progress
### 2. PLAN EFFICIENCY (Weight: 30%)
How efficient was the execution plan and timing?
Time Guidelines:
- 10: < 30 seconds - Lightning fast
- 9: < 1 minute - Extremely fast
- 8: < 2 minutes - Very efficient
- 7: < 3 minutes - Efficient
- 6: < 4 minutes - Good
- 5: < 5 minutes - Average
- 4: < 6 minutes - Below average
- 3: < 8 minutes - Slow
- 2: < 10 minutes - Very slow
- 1: > 10 minutes - Extremely slow
Also consider: tool sequence logic, unnecessary steps, optimal path taken.
### 3. ERROR HANDLING (Weight: 15%)
How well were errors and failures managed?
- 10: No errors, flawless execution
- 9: Minor issues handled perfectly
- 8: Good error recovery
- 7: Adequate error handling
- 6: Some errors, mostly recovered
- 5: Multiple errors, partial recovery
- 4: Poor error handling
- 3: Many unhandled errors
- 2: Critical errors not addressed
- 1: Complete failure due to errors
### 4. CONTEXT EFFICIENCY (Weight: 15%)
How efficiently was context/tokens used?
- 10: Extremely concise, minimal tokens
- 9: Very efficient use of context
- 8: Good efficiency
- 7: Reasonable efficiency
- 6: Acceptable usage
- 5: Average efficiency
- 4: Somewhat wasteful
- 3: Inefficient
- 2: Very inefficient
- 1: Extremely wasteful
## OUTPUT FORMAT
Return ONLY a JSON object with integer scores:
{
"goalCompletion": <1-10>,
"planEfficiency": <1-10>,
"errorHandling": <1-10>,
"contextEfficiency": <1-10>,
"reasoning": "<Brief explanation of scores>"
}`;
}
function countConsecutiveDuplicates(toolCalls: ToolExecution[]): number {
let count = 0;
for (let i = 1; i < toolCalls.length; i++) {
if (toolCalls[i].toolName === toolCalls[i-1].toolName) {
count++;
}
}
return count;
}
```
### Success Criteria:
#### Automated Verification:
- [ ] New file compiles: `npm run build:dev`
- [ ] No TypeScript errors: `npm run typecheck`
#### Manual Verification:
- [ ] Prompts are comprehensive and leverage full context
- [ ] Clear 1-10 scoring rubrics defined
- [ ] Prompts utilize tool metrics data
- [ ] Single comprehensive prompt for all dimensions
---
## Phase 4: Update Scoring Methods
### Overview
Modify scoring to use a single comprehensive prompt that returns all dimensions in 1-10 scale, leveraging tool metrics.
### Changes Required:
#### 1. Update scoreFromMessages Method
**File**: `src/evals2/SimplifiedScorer.ts`
**Changes**: Update main scoring orchestration
```typescript
async scoreFromMessages(
messages: BaseMessage[],
query: string,
toolMetrics?: Map<string, any>
): Promise<ScoreResult> {
// Extract tool calls with metrics
const toolCalls = this.extractToolCalls(messages, toolMetrics);
const totalDurationMs = this.getTotalDuration(toolCalls);
// Get LLM for scoring
const llm = await this.getLLM();
if (!llm) {
// Fallback heuristic scoring
return this.getHeuristicScores(messages, toolCalls, totalDurationMs, query);
}
// NTN: Single comprehensive prompt for all dimensions
const prompt = getComprehensiveScoringPrompt(
messages,
query,
toolCalls,
totalDurationMs
);
try {
const response = await llm.invoke(prompt);
const content = typeof response.content === 'string' ? response.content : '{}';
const scores = JSON.parse(content);
// Validate and clamp scores to 1-10 range
const goalScore = Math.min(10, Math.max(1, scores.goalCompletion || 5));
const planScore = Math.min(10, Math.max(1, scores.planEfficiency || 5));
const errorScore = Math.min(10, Math.max(1, scores.errorHandling || 5));
const contextScore = Math.min(10, Math.max(1, scores.contextEfficiency || 5));
// Calculate weighted total (1-10 scale)
const weightedTotal =
goalScore * SCORE_WEIGHTS.goalCompletion +
planScore * SCORE_WEIGHTS.planCorrectness +
errorScore * SCORE_WEIGHTS.errorFreeExecution +
contextScore * SCORE_WEIGHTS.contextEfficiency;
return {
goalCompletion: goalScore,
planCorrectness: planScore,
errorFreeExecution: errorScore,
contextEfficiency: contextScore,
weightedTotal: Math.round(weightedTotal),
details: {
toolCalls: toolCalls.length,
failedCalls: toolCalls.filter(t => !t.success).length,
retries: this.countRetries(toolCalls),
totalDurationMs,
reasoning: scores.reasoning || `Scored ${toolCalls.length} tool calls in ${totalDurationMs}ms`
}
};
} catch (error) {
console.error('LLM scoring failed:', error);
return this.getHeuristicScores(messages, toolCalls, totalDurationMs, query);
}
}
```
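One practical caveat with `JSON.parse(content)` above: Gemini may wrap the JSON in markdown fences or add surrounding prose. A small extraction helper makes the parse step more tolerant; this is a sketch, and `extractScoreJson` is a hypothetical name, not part of the current code:
```typescript
// Sketch: pull the first {...} block out of an LLM response before JSON.parse.
// Assumes the scoring response contains exactly one JSON object, possibly
// surrounded by markdown fences or explanatory text.
function extractScoreJson(content: string): Record<string, unknown> {
  const match = content.match(/\{[\s\S]*\}/);
  if (!match) {
    throw new Error('No JSON object found in scoring response');
  }
  return JSON.parse(match[0]);
}
```
If adopted, the `JSON.parse(content)` call in `scoreFromMessages` would become `extractScoreJson(content)`, so heuristic fallback only triggers on genuinely malformed output.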
#### 2. Add Heuristic Fallback Method
**File**: `src/evals2/SimplifiedScorer.ts`
**Changes**: Add fallback scoring when LLM is unavailable
```typescript
/**
* Heuristic scoring fallback when LLM is unavailable
* NTN: Returns 1-10 scores based on simple heuristics
*/
private getHeuristicScores(
messages: BaseMessage[],
toolCalls: ToolExecution[],
totalDurationMs: number,
query: string
): ScoreResult {
// Goal completion heuristic
const hasDone = messages.some(msg =>
msg instanceof AIMessage &&
msg.tool_calls?.some(tc => tc.name === 'done_tool')
);
const goalScore = hasDone ? 7 : 3;
// Plan efficiency based on time
const planScore = this.scoreTimeEfficiency(totalDurationMs);
// Error handling based on failure rate
const failureRate = toolCalls.filter(t => !t.success).length / Math.max(1, toolCalls.length);
const errorScore = Math.max(1, Math.round(10 * (1 - failureRate)));  // Clamp to 1 so the 1-10 schema still holds when every call fails
// Context efficiency based on message count
const messageCount = messages.length;
let contextScore = 5;
if (messageCount < 10) contextScore = 9;
else if (messageCount < 20) contextScore = 7;
else if (messageCount < 30) contextScore = 5;
else if (messageCount < 50) contextScore = 3;
else contextScore = 2;
const weightedTotal =
goalScore * SCORE_WEIGHTS.goalCompletion +
planScore * SCORE_WEIGHTS.planCorrectness +
errorScore * SCORE_WEIGHTS.errorFreeExecution +
contextScore * SCORE_WEIGHTS.contextEfficiency;
return {
goalCompletion: goalScore,
planCorrectness: planScore,
errorFreeExecution: errorScore,
contextEfficiency: contextScore,
weightedTotal: Math.round(weightedTotal),
details: {
toolCalls: toolCalls.length,
failedCalls: toolCalls.filter(t => !t.success).length,
retries: this.countRetries(toolCalls),
totalDurationMs,
reasoning: 'Heuristic scoring (LLM unavailable)'
}
};
}
```
### Success Criteria:
#### Automated Verification:
- [ ] All tests pass: `npm run test:run -- src/evals2`
- [ ] Integration test works: `npm run test:run -- src/evals2/integration.test.ts`
- [ ] Build succeeds: `npm run build:dev`
#### Manual Verification:
- [ ] Scores are returned in 1-10 range
- [ ] Time-based efficiency properly calculated with 10 buckets
- [ ] Full message history used (no truncation)
- [ ] Tool metrics properly integrated
- [ ] Single LLM call returns all dimensions
---
## Phase 5: Testing and Validation
### Overview
Add tests to verify the new scoring system works correctly with Gemini and the 1-10 scale.
### Changes Required:
#### 1. Update Unit Tests
**File**: `src/evals2/SimplifiedScorer.test.ts`
**Changes**: Add tests for new functionality
```typescript
describe('SimplifiedScorer with Gemini', () => {
it('tests that scores are in 1-10 range', async () => {
const scorer = new SimplifiedScorer();
const messages = [/* test messages */];
const score = await scorer.scoreFromMessages(messages, 'test query');
expect(score.goalCompletion).toBeGreaterThanOrEqual(1);
expect(score.goalCompletion).toBeLessThanOrEqual(10);
expect(score.weightedTotal).toBeGreaterThanOrEqual(1);
expect(score.weightedTotal).toBeLessThanOrEqual(10);
});
it('tests that time efficiency scoring works', async () => {
const scorer = new SimplifiedScorer();
const toolMetrics = new Map([
['call_1', { toolName: 'test', duration: 30000, success: true, timestamp: Date.now() }],
['call_2', { toolName: 'test2', duration: 15000, success: true, timestamp: Date.now() }]
]);
const score = await scorer.scoreFromMessages([], 'test', toolMetrics);
expect(score.details.totalDurationMs).toBe(45000); // 45 seconds total
// Should get high efficiency score (8-9) for < 1 minute
});
it('tests that heuristic fallback works', async () => {
// Test without LLM available
const scorer = new SimplifiedScorer();
// Mock getLLM to return null
scorer['llm'] = null;
const messages = [/* test messages with done_tool */];
const score = await scorer.scoreFromMessages(messages, 'test query');
expect(score.details.reasoning).toContain('Heuristic');
expect(score.goalCompletion).toBeGreaterThanOrEqual(1);
expect(score.goalCompletion).toBeLessThanOrEqual(10);
});
});
```
### Success Criteria:
#### Automated Verification:
- [ ] All new tests pass: `npm run test:run -- src/evals2/SimplifiedScorer.test.ts`
- [ ] No regression in existing tests: `npm run test:run -- src/evals2`
- [ ] Type checking passes: `npm run typecheck`
- [ ] Linting passes: `npm run lint`
#### Manual Verification:
- [ ] Scoring system uses Gemini when API key available
- [ ] Scores consistently in 1-10 range
- [ ] Time-based efficiency correlates with actual duration (10 buckets)
- [ ] Full context utilized without truncation
- [ ] Heuristic fallback works when LLM unavailable
---
## Testing Strategy
### Unit Tests:
- Test 1-10 score range validation
- Test time efficiency buckets (10 levels)
- Test tool metrics extraction and duration calculation
- Test Gemini provider initialization
- Test heuristic fallback scoring
### Integration Tests:
- Run actual scoring with Gemini API
- Verify full context handling (large message arrays)
- Test fallback behavior without API key
- Validate scoring consistency
### Manual Testing Steps:
1. Set GOOGLE_GENAI_API_KEY or GEMINI_API_KEY environment variable
2. Run evals2 integration test with real agent execution
3. Verify scores are in 1-10 range
4. Check that execution time maps to correct efficiency bucket (1-10)
5. Confirm Gemini model is being used (check logs)
6. Test heuristic fallback by running without API key
## Performance Considerations
- Gemini 2.5 Pro can handle 2M tokens but responses are limited to 8192 tokens
- No truncation needed for input context
- Scoring latency may increase slightly with Gemini vs GPT-4o-mini
- Cache LLM instance to avoid re-initialization
## Migration Notes
- Environment variable required: GOOGLE_GENAI_API_KEY or GEMINI_API_KEY
- Existing scores in Braintrust will shift from 0-1 to 1-10 scale
- Consider running parallel scoring for a validation period; a sketch for mapping new scores back to the legacy range follows below
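For that side-by-side comparison, new 1-10 scores can be mapped back onto the legacy 0-1 range. A minimal sketch (the helper name is illustrative, not existing code):
```typescript
// Sketch: normalize a 1-10 evals2 score to the legacy 0-1 Braintrust range.
function normalizeToLegacyScale(score: number): number {
  const clamped = Math.min(10, Math.max(1, score));
  return (clamped - 1) / 9;  // 1 -> 0.0, 10 -> 1.0
}
```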
## References
- Original requirements: User request in this conversation
- LangChain Gemini docs: @langchain/google-genai package
- Similar implementation: `src/evals/scoring/LLMJudge.prompts.ts:10-166`
- Tool metrics source: `src/lib/runtime/ExecutionContext.ts:44-50`
## Summary of Key Changes (Per NTN Feedback)
### 10-Point Scale Implementation
- **Direct 10-point scoring**: LLM returns 1-10 scores directly, no conversion needed
- **Removed conversion function**: No `convertToFivePointScale()` function
- **10 time efficiency buckets**: Finer granularity from 30s to 10+ minutes
- **Heuristic fallback**: Returns 1-10 scores when LLM unavailable
### Prompt Architecture
- **Single comprehensive prompt**: One LLM call for all dimensions
- **File location**: `SimplifiedScorer.prompt.ts` (not ScoringPrompts.ts)
- **Full context utilization**: No truncation, leverages Gemini's 2M token window
- **Structured JSON output**: LLM returns all scores in one JSON response
### Scoring Dimensions (1-10 scale)
1. **Goal Completion** (40%): 10=Perfect, 5=Half done, 1=Complete failure
2. **Plan Efficiency** (30%): Time-based with 10 buckets + sequence logic
3. **Error Handling** (15%): 10=Flawless, 5=Partial recovery, 1=Critical failures
4. **Context Efficiency** (15%): 10=Minimal tokens, 5=Average, 1=Extremely wasteful
### Implementation Strategy
- **Minimal refactor**: Keep existing structure, update scoring logic only
- **Hardcoded Gemini**: Always use Gemini 2.5 Pro, no configuration changes
- **Comprehensive testing**: Unit tests for 10-point scale, time buckets, and fallback

View File

@@ -0,0 +1,963 @@
# Evals2 Simplified Implementation Plan
## Overview
Implement a simplified evaluation system (evals2) that combines lightweight tool duration tracking with message-based analysis. The system will use minimal hooks in the existing code and extract all scoring data from the MessageManager history.
// NTN -- you can use ExecutionContext as well. Need not be just MessageManager
## Current State Analysis
The current evaluation system in `src/evals/` has:
- Complex telemetry with Braintrust integration (BraintrustEventCollector)
- Dynamic tool wrapping with createTrackedTool
- Multi-dimensional LLM scoring with 6 categories
- Tight coupling to NxtScape and BrowserAgent
- Session and task tracking with parent-child spans
### Key Discoveries:
- Tool wrapping happens at execution time in `BrowserAgent._processToolCalls()` (line 632-635)
- Telemetry initialization in `NxtScape._initializeTelemetrySession()` (line 532-576)
- Task finalization with scoring in `NxtScape._finalizeTask()` (line 619-817)
- LLMJudge accesses ExecutionContext directly (line 111-200)
## Desired End State
A clean, simple evaluation system in `src/evals2/` that:
- Tracks tool duration with minimal overhead (just Date.now() calls)
- Scores executions based on MessageManager history
- Uses 4 scoring categories with specific weights
- Logs scores to Braintrust for visualization and tracking
- Has only 2 integration points in existing code
- Can be easily enabled/disabled via environment variable
### Key Requirements:
- 4 scoring categories: goal (40%), plan (30%), errors (15%), context (15%)
- Duration tracking via lightweight wrapper (no spans, no telemetry)
- All scoring data extracted post-execution from messages and from any maps we create in ExecutionContext
- Simple Braintrust logger for uploading scores (no complex spans)
- Use existing LangChainProvider for LLM scoring
## What We're NOT Doing
// NTN -- let's keep parent-child relationships still
- NOT creating complex telemetry spans or parent-child relationships
- NOT using Braintrust's wrapTraced or complex span tracing
- NOT tracking individual tool metrics beyond duration
- NOT modifying execution flow or adding callbacks
- NOT creating session management or experiment infrastructure
- NOT creating a new OpenAI client (use LangChainProvider)
- NOT implementing the full 6-dimensional scoring system
## Implementation Approach
Hybrid approach combining:
1. **Lightweight tool wrapping** - ONLY for precise duration tracking
2. **Message analysis** - Everything else inferred from MessageManager
3. **Post-execution scoring** - Score after task completion using message history
4. **Minimal integration** - Just 2 hooks in existing code
## Phase 1: Remove Old Eval Hooks
### Overview
Clean up existing telemetry and evaluation hooks from the main codebase.
### Changes Required:
#### 1. NxtScape.ts
**File**: `src/lib/core/NxtScape.ts`
**Changes**: Remove telemetry imports and usage
```typescript
// Remove these imports (lines 11-16)
- import { BraintrustEventCollector } from "@/evals/BraintrustEventCollector";
- import { LLMJudge } from "@/evals/scoring/LLMJudge";
- import { BRAINTRUST_API_KEY } from "@/config";
// NTN -- need not remove all of this -- we can just delete the parts that aren't useful.
// NTN -- e.g. if we want the session id and parent span we can keep those
// NTN -- we can write a much simpler BraintrustEventManager
// Remove telemetry properties (lines 71-78)
- private telemetrySessionId: string | null = null;
- private telemetryParentSpan: string | null = null;
- private telemetry: BraintrustEventCollector | null = null;
- private conversationStartTime: number = 0;
- private taskCount: number = 0;
- private taskStartTime: number = 0;
- private sessionWeightedTotals: number[] = [];
- private experimentId: string | null = null;
// Remove telemetry initialization (lines 256-297)
// Remove telemetry session methods (lines 528-817)
// Remove _initializeTelemetrySession()
// Remove _endTelemetrySession()
// Remove _finalizeTask()
```
#### 2. BrowserAgent.ts
**File**: `src/lib/agent/BrowserAgent.ts`
**Changes**: Remove tool wrapping
```typescript
// Remove import (line 76)
- import { createTrackedTool } from '@/evals/tool-wrapper';
- if (this.executionContext.telemetry?.isEnabled() && this.executionContext.parentSpanId) {
- const wrappedTool = createTrackedTool(tool, this.executionContext);
- toolFunc = wrappedTool.func;
- }
```
#### 3. ExecutionContext.ts
**File**: `src/lib/runtime/ExecutionContext.ts`
**Changes**: Remove telemetry references
```typescript
// Remove telemetry properties
- telemetry: BraintrustEventCollector | null
- parentSpanId: string | null
```
### Success Criteria:
#### Automated Verification:
- [ ] Code compiles: `npm run build`
- [ ] Type checking passes: `npm run typecheck`
- [ ] No remaining imports from `@/evals/`: `grep -r "@/evals" src/lib/`
#### Manual Verification:
- [ ] Extension loads without errors
- [ ] Tasks execute normally without telemetry
---
## Phase 2: Create Evals2 Structure
### Overview
Create the new simplified evaluation system directory structure.
### Changes Required:
#### 1. Create Directory Structure
**Files to create**:
```
src/evals2/
├── SimpleToolWrapper.ts # Lightweight duration tracking
├── SimplifiedScorer.ts # 4-category scoring from messages
├── SimpleBraintrustLogger.ts # Minimal Braintrust integration
├── types.ts # Simple types/schemas with Zod
├── index.ts # Clean exports
└── config.ts # Configuration constants
```
#### 2. types.ts
**File**: `src/evals2/types.ts`
**Changes**: Define core types with Zod
```typescript
import { z } from "zod";
// Tool execution metadata schema
export const ToolExecutionSchema = z.object({
toolName: z.string(), // Name of the tool
duration: z.number(), // Duration in milliseconds
success: z.boolean(), // Whether tool succeeded (ok: true/false)
timestamp: z.number(), // When tool was executed
args: z.any().optional(), // Tool arguments
error: z.string().optional() // Error message if failed
});
export type ToolExecution = z.infer<typeof ToolExecutionSchema>;
// Scoring result schema
export const ScoreResultSchema = z.object({
goalCompletion: z.number().min(0).max(1), // How well goal was achieved
planCorrectness: z.number().min(0).max(1), // Quality of the plan
// NTN -- let's call this errorFreeExecution
successRatio: z.number().min(0).max(1), // Error-free execution ratio
contextEfficiency: z.number().min(0).max(1), // Efficient context usage
weightedTotal: z.number().min(0).max(1), // Weighted average
details: z.object({ // Scoring details
toolCalls: z.number(), // Total number of tool calls
failedCalls: z.number(), // Number of failed calls
retries: z.number(), // Number of retried calls
reasoning: z.string().optional() // LLM reasoning
})
});
export type ScoreResult = z.infer<typeof ScoreResultSchema>;
// Duration storage options
export const DurationStorageSchema = z.enum(["result", "context", "collector"]);
export type DurationStorage = z.infer<typeof DurationStorageSchema>;
```
#### 3. config.ts
**File**: `src/evals2/config.ts`
**Changes**: Configuration constants
```typescript
// Scoring weights
export const SCORE_WEIGHTS = {
goalCompletion: 0.40, // 40% - Most important
planCorrectness: 0.30, // 30% - Plan quality
successRatio: 0.15, // 15% - Error handling
contextEfficiency: 0.15 // 15% - Efficiency
} as const;
// Default scoring model
export const DEFAULT_SCORING_MODEL = "gpt-4o-mini";
// Environment variable names
export const ENV_VARS = {
ENABLE: "ENABLE_EVALS2",
BRAINTRUST_KEY: "BRAINTRUST_API_KEY",
SCORING_MODEL: "OPENAI_MODEL_FOR_SCORING"
} as const;
```
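As a quick sanity check on these weights: a run scoring 0.9 for goal completion, 0.8 for plan correctness, 1.0 for error-free execution, and 0.7 for context efficiency would total 0.9 × 0.40 + 0.8 × 0.30 + 1.0 × 0.15 + 0.7 × 0.15 = 0.36 + 0.24 + 0.15 + 0.105 ≈ 0.86.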
### Success Criteria:
#### Automated Verification:
- [ ] New directory exists: `test -d src/evals2`
- [ ] All files created: `ls src/evals2/*.ts | wc -l` returns 5
- [ ] Types compile: `npm run typecheck`
#### Manual Verification:
- [ ] Directory structure matches specification
- [ ] Type definitions are complete
---
## Phase 3: Implement Core Components
### Overview
Implement the lightweight tool wrapper and simplified scorer.
### Changes Required:
#### 1. SimpleToolWrapper.ts
**File**: `src/evals2/SimpleToolWrapper.ts`
**Changes**: Minimal duration tracking wrapper
```typescript
import { DynamicStructuredTool } from '@langchain/core/tools';
import type { ExecutionContext } from '@/lib/runtime/ExecutionContext';
/**
* Wrap a tool to track execution duration in ExecutionContext
* Stores metrics in context.toolMetrics Map
*/
export function wrapToolForMetrics(
tool: DynamicStructuredTool,
context: ExecutionContext,
toolCallId: string
): DynamicStructuredTool {
return new DynamicStructuredTool({
name: tool.name,
description: tool.description,
schema: tool.schema,
func: async (input: any) => {
const start = Date.now();
try {
const result = await tool.func(input);
const duration = Date.now() - start;
// Parse result to check success
let success = true;
try {
const parsed = JSON.parse(result);
success = parsed.ok !== false;
} catch {
// If not JSON, assume success
}
// Store metrics in ExecutionContext
if (!context.toolMetrics) {
context.toolMetrics = new Map();
}
context.toolMetrics.set(toolCallId, {
toolName: tool.name,
duration,
success,
timestamp: start
});
console.log(`⚡ Tool: ${tool.name} (${duration}ms)`);
return result;
} catch (error: any) {
const duration = Date.now() - start;
// Store error metrics
if (!context.toolMetrics) {
context.toolMetrics = new Map();
}
context.toolMetrics.set(toolCallId, {
toolName: tool.name,
duration,
success: false,
timestamp: start,
error: error.message
});
console.error(`❌ Tool: ${tool.name} failed (${duration}ms)`);
throw error;
}
}
});
}
export { wrapToolForMetrics as wrapToolForDuration }; // Alias for compatibility
```
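For reference, a minimal usage sketch of the wrapper (the `navigateTool` instance and the literal `toolCallId` are illustrative placeholders; in practice BrowserAgent supplies both):
```typescript
import { wrapToolForMetrics } from '@/evals2/SimpleToolWrapper';

// Inside an async context: wrap an existing tool, run it, then read its metrics.
// `navigateTool` and `executionContext` are assumed to exist in the caller's scope.
const toolCallId = 'call_demo_1';
const tracked = wrapToolForMetrics(navigateTool, executionContext, toolCallId);
const result = await tracked.func({ url: 'https://example.com' });

const metrics = executionContext.toolMetrics?.get(toolCallId);
console.log(metrics?.duration, metrics?.success);  // e.g. 312 true
```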
#### 2. SimplifiedScorer.ts
**File**: `src/evals2/SimplifiedScorer.ts`
**Changes**: Score from message history
```typescript
import { BaseMessage, AIMessage, ToolMessage } from '@langchain/core/messages';
import { BaseChatModel } from '@langchain/core/language_models/chat_models';
import { getLLM } from '@/lib/llm/LangChainProvider';
import { SCORE_WEIGHTS, DEFAULT_SCORING_MODEL } from './config';
import { ScoreResult, ToolExecution } from './types';
export class SimplifiedScorer {
private model: string;
private llm: BaseChatModel | null = null;
constructor(model?: string) {
this.model = model || process.env.OPENAI_MODEL_FOR_SCORING || DEFAULT_SCORING_MODEL;
}
private async getLLM(): Promise<BaseChatModel | null> {
if (!this.llm) {
try {
this.llm = await getLLM({ temperature: 0, maxTokens: 100 });
} catch {
return null;
}
}
return this.llm;
}
/**
* Score task completion from message history
*/
  async scoreFromMessages(
    messages: BaseMessage[],
    query: string,
    toolMetrics?: Map<string, any>  // Optional metrics Map from ExecutionContext
  ): Promise<ScoreResult> {
    // Extract tool calls from messages, merging in tracked durations when available
    const toolCalls = this.extractToolCalls(messages, toolMetrics);
    // Calculate individual scores
    const goalScore = await this.scoreGoalCompletion(messages, query);
    const planScore = await this.scorePlanCorrectness(toolCalls, query);
    const errorScore = this.scoreErrorFreeExecution(toolCalls);
    const contextScore = this.scoreContextEfficiency(messages);
// Calculate weighted total
const weightedTotal =
goalScore * SCORE_WEIGHTS.goalCompletion +
planScore * SCORE_WEIGHTS.planCorrectness +
errorScore * SCORE_WEIGHTS.successRatio +
contextScore * SCORE_WEIGHTS.contextEfficiency;
return {
goalCompletion: goalScore,
planCorrectness: planScore,
successRatio: errorScore,
contextEfficiency: contextScore,
weightedTotal,
details: {
toolCalls: toolCalls.length,
failedCalls: toolCalls.filter(t => !t.success).length,
retries: this.countRetries(toolCalls),
reasoning: `Scored ${toolCalls.length} tool calls for query: ${query}`
}
};
}
/**
* Extract tool calls from message history
* @param messages - Message history from MessageManager
* @param toolMetrics - Optional metrics Map from ExecutionContext
*/
private extractToolCalls(messages: BaseMessage[], toolMetrics?: Map<string, any>): ToolExecution[] {
const toolCalls: ToolExecution[] = [];
// Simple iteration using instanceof
for (let i = 0; i < messages.length; i++) {
const msg = messages[i];
// Check if it's an AIMessage with tool calls
if (msg instanceof AIMessage && msg.tool_calls && msg.tool_calls.length > 0) {
for (const toolCall of msg.tool_calls) {
// Find the next ToolMessage with matching ID
const toolMsg = messages.slice(i + 1).find(
m => m instanceof ToolMessage && m.tool_call_id === toolCall.id
) as ToolMessage | undefined;
// Get metrics from ExecutionContext if available
const metrics = toolMetrics?.get(toolCall.id);
let success = true;
let error: string | undefined;
if (toolMsg) {
// Parse tool result to check success
try {
const result = JSON.parse(toolMsg.content as string);
success = result.ok !== false;
error = result.error;
} catch {
// Not JSON, assume success
}
}
toolCalls.push({
toolName: toolCall.name,
duration: metrics?.duration || 100, // Use tracked duration or default
success: metrics?.success ?? success,
timestamp: metrics?.timestamp || Date.now(),
args: toolCall.args,
error: metrics?.error || error
});
}
}
}
return toolCalls;
}
private async scoreGoalCompletion(messages: BaseMessage[], query: string): Promise<number> {
const llm = await this.getLLM();
if (!llm) {
// Simple heuristic: check if done_tool was called
const hasDone = messages.some(msg =>
msg instanceof AIMessage &&
msg.tool_calls?.some(tc => tc.name === 'done_tool')
);
return hasDone ? 0.8 : 0.3;
}
// Simple prompt for LLM scoring
const lastMessages = messages.slice(-5);
const prompt = `Task: "${query}"
Last 5 messages:
${lastMessages.map(m => `${m.constructor.name}: ${typeof m.content === 'string' ? m.content.slice(0, 100) : '...'}`).join('\n')}
Score task completion (0-1):
1 = fully completed
0.5 = partial
0 = not done
Reply with ONLY a number:`;
try {
const response = await llm.invoke(prompt);
const content = typeof response.content === 'string' ? response.content : '0.5';
const score = parseFloat(content.trim());
return Math.min(1, Math.max(0, isNaN(score) ? 0.5 : score));
} catch {
return 0.5;
}
}
private async scorePlanCorrectness(toolCalls: ToolExecution[], query: string): Promise<number> {
const llm = await this.getLLM();
if (!llm) {
// Simple heuristic based on tool count and pattern
if (toolCalls.length === 0) return 0;
if (toolCalls.length > 20) return 0.3;
const hasPlanning = toolCalls.some(t =>
t.toolName === 'classification_tool' ||
t.toolName === 'planner_tool'
);
return hasPlanning ? 0.7 : 0.5;
}
// Simple prompt for plan quality
const toolSequence = toolCalls.slice(0, 20).map(t => t.toolName).join(' → ');
const prompt = `Task: "${query}"
Tools: ${toolSequence}
Rate efficiency (0-1):
1 = efficient
0.5 = okay
0 = inefficient
Reply with ONLY a number:`;
try {
const response = await llm.invoke(prompt);
const content = typeof response.content === 'string' ? response.content : '0.5';
const score = parseFloat(content.trim());
return Math.min(1, Math.max(0, isNaN(score) ? 0.5 : score));
} catch {
return 0.5;
}
}
private scoreErrorFreeExecution(toolCalls: ToolExecution[]): number {
if (toolCalls.length === 0) return 1.0;
const successCount = toolCalls.filter(t => t.success).length;
const errorCount = toolCalls.filter(t => !t.success).length;
const retryCount = this.countRetries(toolCalls);
// Simple formula: success ratio minus penalties
const baseRatio = successCount / toolCalls.length;
const retryPenalty = retryCount * 0.05; // 5% per retry
const errorPenalty = errorCount * 0.10; // 10% per error
return Math.max(0, baseRatio - retryPenalty - errorPenalty);
}
private scoreContextEfficiency(messages: BaseMessage[]): number {
// Simple token estimation: ~4 chars per token
const totalChars = messages.reduce((sum, msg) => {
const content = typeof msg.content === 'string' ? msg.content : JSON.stringify(msg.content);
return sum + content.length;
}, 0);
const estimatedTokens = totalChars / 4;
// Simple scoring based on requirements
if (estimatedTokens <= 32000) return 1.0; // 5/5
if (estimatedTokens <= 64000) return 0.8; // 4/5
if (estimatedTokens <= 128000) return 0.6; // 3/5
if (estimatedTokens <= 256000) return 0.4; // 2/5
return 0.2; // 1/5
}
private countRetries(toolCalls: ToolExecution[]): number {
let retries = 0;
for (let i = 1; i < toolCalls.length; i++) {
// Same tool called consecutively = likely retry
if (toolCalls[i].toolName === toolCalls[i-1].toolName) {
retries++;
}
}
return retries;
}
}
```
#### 3. SimpleBraintrustLogger.ts
**File**: `src/evals2/SimpleBraintrustLogger.ts`
**Changes**: Minimal Braintrust integration for score logging
```typescript
import { BRAINTRUST_API_KEY } from '@/config';
import { ScoreResult } from './types';
// Lazy load Braintrust to avoid module loading issues
let initLogger: any = null;
/**
* Simple Braintrust logger that only uploads scores
* No complex spans, no session management, just scores
*/
export class SimpleBraintrustLogger {
private logger: any = null;
private initialized: boolean = false;
async initialize(): Promise<boolean> {
if (this.initialized) return true;
this.initialized = true;
if (!BRAINTRUST_API_KEY) {
console.log('%c⚠ No Braintrust API key, scores won\'t be uploaded', 'color: #ff9900; font-size: 10px');
return false;
}
try {
// Lazy load braintrust module
if (!initLogger) {
const braintrust = require('braintrust');
initLogger = braintrust.initLogger;
}
// Initialize simple logger (not experiment)
this.logger = initLogger({
apiKey: BRAINTRUST_API_KEY,
projectName: 'browseros-agent-online'
});
console.log('%c✓ Braintrust logger initialized', 'color: #00ff00; font-size: 10px');
return true;
} catch (error) {
console.warn('Failed to initialize Braintrust:', error);
return false;
}
}
async logTaskScore(
query: string,
score: ScoreResult,
duration_ms: number,
metadata?: any
): Promise<void> {
if (!this.logger) {
const success = await this.initialize();
if (!success) return;
}
try {
// Log as a simple traced event with scores
await this.logger.traced(async (span: any) => {
span.log({
input: query,
output: `Task completed with score: ${score.weightedTotal.toFixed(2)}`,
scores: {
// Our 4 simplified scores
goal_completion: score.goalCompletion,
plan_correctness: score.planCorrectness,
success_ratio: score.successRatio,
context_efficiency: score.contextEfficiency,
weighted_total: score.weightedTotal
},
metadata: {
type: 'evals2_task',
duration_ms,
tool_calls: score.details.toolCalls,
failed_calls: score.details.failedCalls,
retries: score.details.retries,
...metadata
}
});
}, {
name: 'evals2_task_score'
});
console.log('%c📊 Scores uploaded to Braintrust', 'color: #4caf50; font-size: 10px');
} catch (error) {
// Silent failure - don't break execution
console.debug('Failed to log to Braintrust:', error);
}
}
async flush(): Promise<void> {
if (this.logger && this.logger.flush) {
await this.logger.flush();
}
}
}
// Export singleton instance
export const braintrustLogger = new SimpleBraintrustLogger();
```
### Success Criteria:
#### Automated Verification:
- [ ] Components compile: `npm run build`
- [ ] No type errors: `npm run typecheck`
- [ ] Unit tests pass: `npm test src/evals2`
#### Manual Verification:
- [ ] Tool wrapper adds minimal overhead (<5ms)
- [ ] Scorer extracts correct tool calls from messages
- [ ] Scores are in 0-1 range
- [ ] Scores appear in Braintrust dashboard
---
## Phase 4: Add Integration Hooks
### Overview
Add minimal hooks in existing code to enable evals2.
### Changes Required:
#### 1. BrowserAgent Integration
**File**: `src/lib/agent/BrowserAgent.ts`
**Changes**: Add conditional tool wrapping
```typescript
// Add import at top
import { wrapToolForMetrics } from '@/evals2/SimpleToolWrapper';
// In _processToolCalls method (around line 630)
let toolFunc = tool.func;
// Add evals2 wrapping (pass tool call ID for metrics tracking)
if (process.env.ENABLE_EVALS2 === 'true') {
  // NTN -- prefer the LLM-provided tool call id (e.g. toolCall.id) when available, so
  // SimplifiedScorer.extractToolCalls can match these metrics to ToolMessages by id
  const toolCallId = `call_${Date.now()}_${Math.random().toString(36).slice(2, 11)}`;
const wrappedTool = wrapToolForMetrics(tool, this.executionContext, toolCallId);
toolFunc = wrappedTool.func;
}
const toolResult = await toolFunc(args);
```
#### 2. NxtScape Integration
**File**: `src/lib/core/NxtScape.ts`
**Changes**: Add scoring and Braintrust logging after task completion
```typescript
// Add imports at top
import { SimplifiedScorer } from '@/evals2/SimplifiedScorer';
import { braintrustLogger } from '@/evals2/SimpleBraintrustLogger';
// In _executeAgent method (around line 316)
// Capture the start time just before execution so the reported duration covers the
// task itself, not just the scoring step
const taskStartTime = Date.now();
await this.browserAgent.execute(query, metadata);

// Add evals2 scoring and logging after successful execution
if (process.env.ENABLE_EVALS2 === 'true') {
  try {
// Score the task
const scorer = new SimplifiedScorer();
const score = await scorer.scoreFromMessages(
this.messageManager.getMessages(),
query,
this.executionContext.toolMetrics // Pass tool metrics for duration data
);
const duration = Date.now() - taskStartTime;
// Log to console
console.log('Evals2 Score:', {
goal: score.goalCompletion.toFixed(2),
plan: score.planCorrectness.toFixed(2),
errors: score.successRatio.toFixed(2),
context: score.contextEfficiency.toFixed(2),
total: score.weightedTotal.toFixed(2)
});
// Upload to Braintrust
await braintrustLogger.logTaskScore(
query,
score,
duration,
{
selectedTabIds: tabIds || [],
mode: mode || 'browse'
}
);
} catch (error) {
console.warn('Evals2 scoring failed:', error);
// Don't break execution if scoring fails
}
}
```
### Success Criteria:
#### Automated Verification:
- [ ] Code compiles with hooks: `npm run build`
- [ ] Extension loads: `npm run build:dev`
- [ ] Environment variable check works: `ENABLE_EVALS2=true npm test`
#### Manual Verification:
- [ ] Tool durations are tracked when enabled
- [ ] Scores are logged to console when enabled
- [ ] No impact when disabled (default)
- [ ] Scoring errors don't break execution
---
## Phase 5: Testing & Cleanup
### Overview
Test the new system and clean up any remaining old code.
### Changes Required:
#### 1. Create Test File
**File**: `src/evals2/SimplifiedScorer.test.ts`
**Changes**: Basic unit tests
```typescript
import { describe, it, expect } from 'vitest';
import { SimplifiedScorer } from './SimplifiedScorer';
import { HumanMessage, AIMessage, ToolMessage } from '@langchain/core/messages';
describe('SimplifiedScorer', () => {
it('tests that the scorer can be created', () => {
const scorer = new SimplifiedScorer();
expect(scorer).toBeDefined();
});
it('tests that scoring handles empty messages', async () => {
const scorer = new SimplifiedScorer();
const score = await scorer.scoreFromMessages([], 'test query');
expect(score.weightedTotal).toBeGreaterThanOrEqual(0);
expect(score.weightedTotal).toBeLessThanOrEqual(1);
});
it('tests that tool calls are extracted correctly', async () => {
const messages = [
new HumanMessage('test'),
new AIMessage({
content: '',
tool_calls: [{
id: 'call_1',
name: 'test_tool',
args: { input: 'test' }
}]
}),
new ToolMessage({
content: JSON.stringify({ ok: true, output: 'result' }),
tool_call_id: 'call_1'
})
];
const scorer = new SimplifiedScorer();
const score = await scorer.scoreFromMessages(messages, 'test');
expect(score.details.toolCalls).toBe(1);
expect(score.details.failedCalls).toBe(0);
});
});
```
#### 2. Remove Old Evals Directory
**Actions**:
```bash
# After confirming new system works
rm -rf src/evals/
```
#### 3. Update ExecutionContext
**File**: `src/lib/runtime/ExecutionContext.ts`
**Changes**: Add toolMetrics Map
```typescript
export class ExecutionContext {
// ... existing properties ...
// Add tool metrics Map for evals2
toolMetrics: Map<string, {
toolName: string;
duration: number;
success: boolean;
timestamp: number;
error?: string;
}> | undefined;
// In reset() method, add:
public reset(): void {
// ... existing reset code ...
this.toolMetrics?.clear();
this.toolMetrics = undefined;
}
}
```
#### 4. Update Package.json Scripts
**File**: `package.json`
**Changes**: Add evals2 test script
```json
{
"scripts": {
"test:evals2": "ENABLE_EVALS2=true vitest run src/evals2"
}
}
```
### Success Criteria:
#### Automated Verification:
- [ ] Unit tests pass: `npm run test:evals2`
- [ ] No references to old evals: `grep -r "src/evals/" src/`
- [ ] Build succeeds: `npm run build`
- [ ] Linting passes: `npm run lint`
#### Manual Verification:
- [ ] Extension works with ENABLE_EVALS2=true
- [ ] Scores are reasonable for sample tasks
- [ ] Scores appear in Braintrust dashboard at https://braintrust.dev/app/Felafax/p/browseros-agent-online/logs
- [ ] No performance degradation
- [ ] Clean console output
---
## Testing Strategy
### Unit Tests:
- Test SimplifiedScorer with mock messages
- Test SimpleToolWrapper duration tracking
- Test score calculations with edge cases
### Integration Tests:
- Run simple task with evals2 enabled
- Verify scores are in expected range
- Check duration tracking accuracy
### Manual Testing Steps:
1. Set environment variables:
```bash
export ENABLE_EVALS2=true
export BRAINTRUST_API_KEY=your-key # From config.ts
```
2. Build extension: `npm run build:dev`
3. Execute simple task: "Navigate to google.com"
4. Verify:
- Console shows 4 scores (goal, plan, errors, context)
- Braintrust dashboard shows the task with scores
5. Execute complex task: "Find the weather in San Francisco"
6. Verify:
- Plan score reflects multi-step execution
- Tool durations are reasonable
- Scores uploaded to Braintrust
## Performance Considerations
- Tool wrapping adds ~1ms overhead per call (just Date.now())
- Scoring happens after execution (no runtime impact)
- Message extraction is O(n) where n = message count
- No memory leaks (durations cleared after scoring)
## Migration Notes
- Environment variable controls migration: ENABLE_EVALS2
- Can run both systems in parallel during transition
- Old telemetry can be removed after validation
- Scores may differ slightly due to simplified heuristics
- Braintrust scores will appear under 'evals2_task_score' events
## Key Differences from Old System
| Aspect | Old Evals | New Evals2 |
|--------|-----------|------------|
| Scoring Dimensions | 6 (complex weights) | 4 (simple weights) |
| Telemetry | Complex span hierarchy | Single task event |
| Tool Tracking | Braintrust wrapTraced | Simple duration Map |
| LLM Client | Direct OpenAI | LangChainProvider |
| Initialization | Lazy singleton | Direct instantiation |
| Session Management | Yes (parent spans) | No (just task scores) |
| Code Complexity | ~2000 lines | ~500 lines |
| Dependencies | Braintrust SDK, OpenAI | Braintrust SDK only |
## Implementation Summary
This plan creates a drastically simplified evaluation system that:
1. **Reduces complexity** from 2000+ lines to ~500 lines
2. **Keeps Braintrust integration** for score visualization and tracking
3. **Simplifies scoring** from 6 dimensions to 4 clear metrics
4. **Uses existing infrastructure** (LangChainProvider) instead of creating new clients
5. **Minimizes performance impact** with simple Map-based duration tracking
6. **Maintains compatibility** with existing Braintrust dashboards
The key insight is that we don't need complex telemetry infrastructure to get valuable evaluation data. By focusing on the essential metrics (goal completion, plan quality, error rate, context efficiency) and using simple, direct Braintrust logging, we achieve the same visibility with much less code.
## References
- Original research: `thoughts/shared/research/2025-09-04_braintrust_evaluation_research.md`
- Current eval architecture: `docs/CURRENT_EVALS_ARCHITECTURE.md`
- Current eval code: `src/evals/`
- Message types: `src/lib/runtime/MessageManager.ts`
- Tool execution: `src/lib/agent/BrowserAgent.ts:630-640`
- LangChain Provider: `src/lib/llm/LangChainProvider.ts`

View File

@@ -0,0 +1,389 @@
---
date: 2025-09-04T10:00:00-08:00
researcher: Claude Code
git_commit: 16e091db20ab1c17354729c34bef5ed75a1a200c
branch: dev/evals2
repository: BrowserOS-agent
topic: "Braintrust Evaluation System Implementation"
tags: [research, codebase, braintrust, telemetry, evaluation, llm-judge, experiments]
status: complete
last_updated: 2025-09-04
last_updated_by: Claude Code
---
# Research: Braintrust Evaluation System Implementation
**Date**: 2025-09-04T10:00:00-08:00
**Researcher**: Claude Code
**Git Commit**: 16e091db20ab1c17354729c34bef5ed75a1a200c
**Branch**: dev/evals2
**Repository**: BrowserOS-agent
## Research Question
Thoroughly research and understand how the Braintrust evaluation system is currently implemented in this codebase, including telemetry, tool wrapping, scoring, and experiment running.
## Summary
The Braintrust evaluation system in BrowserOS-agent is a comprehensive telemetry and evaluation framework that tracks agent execution in real-time, scores task completion using an LLM judge, and enables A/B testing experiments. The system uses a singleton telemetry collector with lazy initialization, dynamic tool wrapping with Braintrust's `wrapTraced`, multi-dimensional scoring via OpenAI, and a replay mechanism for comparing different agent versions.
## Detailed Findings
### 1. Telemetry System Architecture
#### BraintrustEventCollector (`src/evals/BraintrustEventCollector.ts`)
- **Singleton Pattern**: Single instance across the entire application via `getInstance()`
- **Lazy Initialization**: Telemetry only initializes when first used AND when `ENABLE_TELEMETRY=true`
- **Session Management**: Creates parent spans for conversation sessions containing multiple tasks
- **Event Types**: Tracks `session_start`, `session_end`, `tool_execution`, `decision_point`, `error`, `browser_action`, `user_feedback`
- **Dual Logging**: Can log to both telemetry logger AND experiments simultaneously
Key implementation details:
```typescript
// Lazy initialization pattern - checks on every public method
private async _ensureInitialized(): Promise<void> {
if (this.initialized) return;
this.initialized = true;
this.enabled = this._checkIfEnabled();
if (this.enabled) {
await this._initialize();
}
}
// Session tracking with parent-child span relationships
async startSession(metadata: SessionMetadata): Promise<{ parent?: string }> {
const parent = await this.logger.traced(async (span: any) => {
span.log({
input: validatedMetadata.task,
metadata: { sessionId, timestamp, tabContext, browserInfo }
})
return await span.export()
}, { name: 'agent_session' })
return { parent }
}
```
#### Integration in NxtScape (`src/lib/core/NxtScape.ts`)
- **Deferred Initialization**: Telemetry session only starts on first task (not on extension open)
- **Task Tracking**: Each task gets a `task_N_start` and `task_N_[success|error|paused]` event
- **Score Aggregation**: Tracks `weighted_total` scores across tasks for session average
- **Dual Logging**: When `experimentId` is provided, logs to both telemetry AND experiment
### 2. Tool Telemetry Wrapping
#### createTrackedTool (`src/evals/tool-wrapper.ts`)
- **Dynamic Wrapping**: Tools are wrapped at execution time, not at creation
- **Braintrust Integration**: Uses Braintrust's `wrapTraced` for automatic span creation
- **Metrics Tracking**: Duration, success/failure, error counts
- **Error Handling**: Distinguishes between "soft errors" (tool returns `{ok: false}`) and exceptions
```typescript
export function createTrackedTool(tool: DynamicStructuredTool, context: ExecutionContext): DynamicStructuredTool {
const wrapTraced = telemetry.getWrapTraced()
if (!wrapTraced) return tool
const trackedFunc = wrapTraced(
async (input: any, span: any) => {
const startTime = performance.now()
try {
const result = await originalFunc(input)
// Check for soft errors (ok: false)
const parsedResult = JSON.parse(result)
if (!parsedResult.ok) {
// Log as error with structured format for Braintrust
span.log({
error: { name: 'Tool error', message: errorMessage },
metrics: { duration_ms, success: 0 },
logs: { 'Tool errors': [errorDetails] }
})
}
} catch (error) {
// Handle exceptions differently
}
},
{ type: 'tool', name: toolName, parent: context.parentSpanId }
)
}
```
#### Integration in BrowserAgent
- Tools are wrapped conditionally when telemetry is enabled
- Wrapping happens just before tool execution to capture current context (a minimal sketch follows)
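A minimal sketch of that call-site logic; the `isEnabled()` accessor and exact variable names are assumptions for illustration, not verbatim BrowserAgent code:
```typescript
// Sketch: wrap only when telemetry is enabled, immediately before execution
const telemetry = BraintrustEventCollector.getInstance();
const toolToRun = telemetry.isEnabled()
  ? createTrackedTool(tool, this.executionContext)  // adds wrapTraced span + metrics
  : tool;                                           // zero overhead when disabled
const result = await toolToRun.func(args);
```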
### 3. LLMJudge Scoring System
#### LLMJudge (`src/evals/scoring/LLMJudge.ts`)
- **Multi-Dimensional Scoring**: 6 dimensions with weighted average
- **Score Dimensions**:
- `goal_achievement` (40% weight) - Did we achieve the user's goal?
- `execution_quality` (20% weight) - Quality of execution steps
- `execution_precision` (15% weight) - No unnecessary retries
- `progress_made` (10% weight) - Amount of progress toward goal
- `plan_coherence` (8% weight) - Logic of the plan
- `error_handling` (7% weight) - How errors were handled
- **Full Context Access**: Directly accesses ExecutionContext stores (MessageManager, TodoStore, BrowserContext)
- **OpenAI Integration**: Uses raw OpenAI client (not wrapped) to avoid creating separate spans
```typescript
async scoreTaskCompletionWithContext(
userTask: string,
executionContext: ExecutionContext,
taskOutcome?: { outcome: 'success' | 'error' | 'paused', duration_ms: number }
): Promise<JudgeResult> {
// Build full context from ExecutionContext
const fullContext = await this.buildFullContext(executionContext, taskOutcome)
// Get multi-dimensional scoring prompt
const prompt = getMultiDimensionalScoringPrompt(userTask, fullContext)
// Score with OpenAI
const completion = await scoringOpenAI.chat.completions.create({
model: this.model,
messages: [{ role: 'user', content: prompt }],
response_format: { type: 'json_object' }
})
// Calculate weighted average
const weightedTotal = calculateWeightedAverage(dimensionScores)
return { score: weightedTotal, scores: dimensionScores, scoringDetails }
}
```
### 4. Experiment Runner
#### ExperimentHelper (`src/evals/ExperimentRunner.ts`)
- **Replay Mechanism**: Fetches historical logs tagged with version (e.g., "v1") and replays them
- **Baseline Comparison**: Creates two experiments - baseline (v1) and new (v2)
- **BTQL Queries**: Uses Braintrust Query Language to fetch logs by tag
- **Child Span Analysis**: Fetches child spans to find decision points with scores
- **Complete Cleanup**: Between tests, clears Chrome storage, resets singletons, closes tabs
```typescript
static async runSingleTest(log: any, index: number, v1ExperimentId: string, v2ExperimentId: string): Promise<Result> {
// Cleanup before test
await this.performCompleteCleanup()
// Run test with v2 code
const experimentNxtScape = new NxtScape({
experimentId: v2ExperimentId // Enables dual logging
})
await experimentNxtScape.run({ query: log.input })
// Fetch v1 scores from historical data
const decisionSpan = await this.fetchDecisionSpan(log, apiKey)
const v1Scores = this.extractV1Scores(decisionSpan)
// Log both v1 and v2 to experiments for comparison
// v1 uses historical scores, v2 uses new execution scores
}
private static async performCompleteCleanup(): Promise<void> {
// Clear Chrome storage
await chrome.storage.local.clear()
await chrome.storage.session.clear()
// Reset singleton instances
BraintrustEventCollector.getInstance().reset()
// Close all tabs and create fresh one
const newTab = await chrome.tabs.create()
await closeAllOtherTabs()
}
```
### 5. Data Flow
#### User Interaction → Braintrust Flow:
1. **User Query** → Side Panel → Background Script → `NxtScape.run()`
2. **Session Start** → `BraintrustEventCollector.startSession()` creates parent span
3. **Task Start** → `NxtScape._finalizeTask()` logs `task_N_start` event
4. **Tool Execution** → `createTrackedTool()` wraps tool → logs metrics via `wrapTraced`
5. **LLM Scoring** → `LLMJudge.scoreTaskCompletionWithContext()` → multi-dimensional scores
6. **Task End** → `NxtScape._finalizeTask()` logs `task_N_[outcome]` with scores
7. **Session End** → `BraintrustEventCollector.endSession()` with aggregated scores
#### Key Data Structures:
```typescript
// Event structure sent to Braintrust
{
type: 'decision_point',
name: 'task_1_success',
data: { task, duration_ms, success, phase },
scores: {
goal_achievement: 0.9,
execution_quality: 0.8,
weighted_total: 0.85,
task_completed: 1.0
},
scoring_details: { /* LLM response details */ },
error: { name, message, stack } // If error occurred
}
```
### 6. Configuration and Setup
#### Environment Variables (`src/config.ts`):
- `ENABLE_TELEMETRY=true` - Master switch for telemetry
- `BRAINTRUST_API_KEY` - Required for logging to Braintrust
- `OPENAI_API_KEY_FOR_SCORING` - Required for LLM scoring
- `OPENAI_MODEL_FOR_SCORING` - Model for scoring (default: gpt-4o)
- `BRAINTRUST_PROJECT_UUID` - Required for experiments
#### Braintrust Project Setup:
- Project name: `browseros-agent-online`
- Organization: `Felafax`
- Dashboard: `https://braintrust.dev/app/Felafax/p/browseros-agent-online`
### 7. Key Design Patterns
#### Singleton Pattern with Lazy Initialization
- `BraintrustEventCollector` uses singleton to ensure one instance
- Lazy initialization prevents overhead when telemetry is disabled
- Allows environment variables to be set after construction
#### Decorator Pattern for Tool Telemetry
- Tools are wrapped dynamically at execution time
- Preserves original tool functionality while adding telemetry
- Uses Braintrust's `wrapTraced` for proper span creation
#### Parent-Child Span Relationships
- Conversation session is parent span
- Individual tasks are child spans
- Tool executions are grandchild spans
- Creates hierarchical trace visualization in Braintrust
#### Dual Logging Pattern
- Normal execution logs to telemetry logger (`initLogger`)
- Experiment mode logs to BOTH telemetry AND experiment
- Enables A/B testing without losing regular telemetry
## Architecture Insights
### Why Tools are Wrapped Dynamically
1. **Context Availability**: Execution context (parent span, session ID) is only available at runtime
2. **Performance**: Avoids wrapping tools that won't be used
3. **Flexibility**: Different tools can be wrapped differently based on context
### Score Aggregation Strategy
- Individual tasks get multi-dimensional scores
- Session success = average of all task `weighted_total` scores (see the sketch after this list)
- Allows partial credit for incomplete sessions
- Preserves detailed scoring for analysis
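As a worked sketch of that aggregation (the example scores are made up; the field name `weighted_total` comes from the event structure above):
```typescript
// Session success = mean of per-task weighted_total scores
const taskScores = [0.85, 0.60, 0.90];  // weighted_total of each task in the session
const sessionSuccess =
  taskScores.reduce((sum, s) => sum + s, 0) / Math.max(taskScores.length, 1);
// ≈ 0.78: partial credit even though one task scored low
```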
### Experiment Isolation
- Complete cleanup between tests (storage, tabs, singletons)
- Each test runs in fresh environment
- Prevents state leakage between experiments
## Code References
- `src/evals/BraintrustEventCollector.ts:69-190` - Singleton initialization with lazy loading
- `src/evals/tool-wrapper.ts:38-227` - Dynamic tool wrapping with wrapTraced
- `src/evals/scoring/LLMJudge.ts:256-426` - Full context scoring implementation
- `src/evals/ExperimentRunner.ts:857-949` - Single test execution with cleanup
- `src/lib/core/NxtScape.ts:531-617` - Session management and score aggregation
- `src/lib/core/NxtScape.ts:619-817` - Task finalization with dual logging
- `src/background/index.ts:72-208` - Experiment UI integration
## Historical Context (from thoughts/)
No existing research documents found specifically about the Braintrust evaluation system. This appears to be a relatively new feature addition to the codebase.
## Related Research
- None found in `thoughts/shared/research/` directory related to evaluation systems
## Issues and Inefficiencies Identified
### 1. **Complex Initialization Chain**
- Telemetry initialization is spread across multiple files
- Lazy initialization pattern is complex and could be simplified
- Environment variable checking happens in multiple places
### 2. **Score Format Inconsistency**
- Multiple score field names (`success`, `task_completion`, `task_completed`)
- Score normalization logic duplicated in multiple places
- Confusion between session scores and task scores
### 3. **Error Handling Complexity**
- Different error formats for tools vs execution errors
- Error tracking duplicated between telemetry and scoring
- Structured error format not consistently applied
### 4. **Tight Coupling**
- `NxtScape` directly imports and uses `BraintrustEventCollector`
- LLM Judge directly accesses ExecutionContext internals
- Experiment runner has hardcoded cleanup logic
### 5. **Performance Overhead**
- Full context extraction for every scoring call
- Multiple API calls for experiment replay
- No caching of scores or context
## Suggestions for Cleaner Reimplementation
### 1. **Unified Telemetry Interface**
Create a clean `TelemetryService` interface that abstracts Braintrust implementation:
```typescript
interface TelemetryService {
startSession(metadata: SessionMetadata): Promise<string>
logEvent(event: TelemetryEvent): Promise<void>
endSession(sessionId: string, result: SessionResult): Promise<void>
wrapTool(tool: Tool): Tool
}
```
### 2. **Standardized Score Schema**
Use consistent Zod schemas for all scores:
```typescript
const ScoreSchema = z.object({
goal_achievement: z.number().min(0).max(1),
execution_quality: z.number().min(0).max(1),
// ... other dimensions
weighted_total: z.number().min(0).max(1)
})
```
### 3. **Event Bus for Telemetry**
Use existing PubSub system for telemetry events instead of direct coupling:
```typescript
PubSub.publish('telemetry:task:start', { task, context })
PubSub.publish('telemetry:tool:execute', { tool, input, output })
```
### 4. **Separate Scoring Service**
Extract scoring into independent service with clear interface:
```typescript
interface ScoringService {
scoreTask(task: string, context: TaskContext): Promise<Scores>
aggregateScores(scores: Scores[]): number
}
```
### 5. **Configuration Service**
Centralize all telemetry configuration:
```typescript
class TelemetryConfig {
private static instance: TelemetryConfig
isEnabled(): boolean
getApiKey(): string
getScoringModel(): string
getProjectId(): string
}
```
### 6. **Simplified Experiment Runner**
- Use factory pattern for creating test environments
- Extract cleanup logic into reusable utilities
- Use async iterators for test execution
### 7. **Type Safety Improvements**
- Use branded types for IDs (SessionId, SpanId, ExperimentId)
- Use discriminated unions for events (both are sketched below)
- Add runtime validation for all external data
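For example, a sketch of what the branded-ID and discriminated-union suggestions could look like; these types do not exist in the codebase yet:
```typescript
// Branded ID types: structurally strings, but not interchangeable with each other
type SessionId = string & { readonly __brand: 'SessionId' };
type SpanId = string & { readonly __brand: 'SpanId' };

// Discriminated union for telemetry events
type TelemetryEvent =
  | { type: 'session_start'; sessionId: SessionId; task: string }
  | { type: 'tool_execution'; sessionId: SessionId; toolName: string; durationMs: number }
  | { type: 'error'; sessionId: SessionId; message: string };

function handleEvent(event: TelemetryEvent): void {
  switch (event.type) {
    case 'tool_execution':
      // The compiler narrows `event` here, so toolName/durationMs are safely typed
      console.log(`${event.toolName} took ${event.durationMs}ms`);
      break;
    default:
      break;
  }
}
```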
## Open Questions
1. Why is telemetry initialization deferred until first task instead of on extension start?
2. How are tool error counts used beyond logging?
3. Why does experiment mode use dual logging instead of just experiment logging?
4. What determines the weights for multi-dimensional scoring?
5. How is the Braintrust project UUID determined/configured?
6. Why use raw OpenAI client for scoring instead of wrapped version?
7. What's the purpose of tracking `tool_success_rate` in session end?

View File

@@ -0,0 +1,283 @@
---
date: 2025-09-05T16:30:20Z
researcher: Claude
git_commit: 763beb159d1cd3f1d476f0112460ad5a8721af84
branch: dev/evals2
repository: BrowserOS-agent
topic: "Evals2 System Implementation Research"
tags: [research, codebase, evals2, evaluation, scoring, braintrust, telemetry]
status: complete
last_updated: 2025-09-05
last_updated_by: Claude
---
# Research: Evals2 System Implementation
**Date**: 2025-09-05T16:30:20Z
**Researcher**: Claude
**Git Commit**: 763beb159d1cd3f1d476f0112460ad5a8721af84
**Branch**: dev/evals2
**Repository**: BrowserOS-agent
## Research Question
Understanding how the evals2 system is implemented, including its architecture, evaluation flow, scoring mechanisms, and integration points with the main codebase.
## Summary
Evals2 is a simplified evaluation system that tracks agent execution metrics and scores task completion quality. It's a complete rewrite of the original evaluation system with ~75% less code complexity (500 lines vs 2000+ lines). The system focuses on lightweight tool tracking, 4-dimension scoring, session management for conversation hierarchy, and minimal integration with only 2 hooks in the existing codebase.
## Detailed Findings
### Overall Architecture and Design
The evals2 system follows a modular, lightweight architecture with clear separation of concerns:
1. **Tool Metrics Collection** ([src/evals2/SimpleToolWrapper.ts](src/evals2/SimpleToolWrapper.ts))
- Wraps tools with duration tracking
- Stores metrics in ExecutionContext.toolMetrics Map
- No complex span management, just simple timing
2. **Scoring Engine** ([src/evals2/SimplifiedScorer.ts](src/evals2/SimplifiedScorer.ts))
- Analyzes message history to extract tool calls
- Calculates 4 dimension scores (down from 6 in v1)
- Can use LLM for goal/plan scoring or fallback to heuristics
3. **Session Management** ([src/evals2/SimpleBraintrustEventManager.ts](src/evals2/SimpleBraintrustEventManager.ts))
- Singleton pattern for conversation-wide tracking
- Maintains parent span for Braintrust hierarchy
- Tracks task scores for session averaging
4. **Result Reporting** ([src/evals2/SimpleBraintrustLogger.ts](src/evals2/SimpleBraintrustLogger.ts))
- Simple Braintrust integration
- Uploads scores without complex span management
- Lazy loads Braintrust SDK to avoid module issues
### How Evaluations are Defined and Structured
Evaluations are structured around two key data types defined in [src/evals2/types.ts](src/evals2/types.ts):
1. **ToolExecution**: Tracks individual tool calls
```typescript
{
toolName: string, // Name of the tool
duration: number, // Duration in milliseconds
success: boolean, // Whether tool succeeded
timestamp: number, // When tool was executed
args?: any, // Tool arguments
error?: string // Error message if failed
}
```
2. **ScoreResult**: Contains evaluation scores
```typescript
{
goalCompletion: number, // 0-1, weighted 40%
planCorrectness: number, // 0-1, weighted 30%
errorFreeExecution: number, // 0-1, weighted 15%
contextEfficiency: number, // 0-1, weighted 15%
weightedTotal: number, // Weighted average
details: {
toolCalls: number,
failedCalls: number,
retries: number,
reasoning?: string
}
}
```
The scoring weights are configured in [src/evals2/config.ts](src/evals2/config.ts:2-7):
- Goal Completion: 40% - Most important metric
- Plan Correctness: 30% - Quality of the execution plan
- Error-Free Execution: 15% - Error handling (renamed from "errorRatio")
- Context Efficiency: 15% - Efficient use of context/tokens
### Evaluation Execution Flow
The evaluation flow follows this sequence:
1. **Session Initialization** (NxtScape._initializeTelemetrySession)
- Checks if ENABLE_EVALS2 is true
- Creates SimpleBraintrustEventManager singleton
- Starts a parent session span for the conversation
2. **Tool Wrapping** (BrowserAgent tool execution)
- Each tool is wrapped with wrapToolForMetrics ([src/lib/agent/BrowserAgent.ts:341-344](src/lib/agent/BrowserAgent.ts:341-344))
- Metrics stored in ExecutionContext.toolMetrics Map
- Tracks duration, success, errors per tool call
3. **Message Processing & Scoring** (NxtScape.run after task completion)
- SimplifiedScorer.scoreFromMessages extracts tool calls from message history
- Combines toolMetrics Map data with message parsing
- Calculates 4 dimension scores
4. **Score Upload** (SimpleBraintrustLogger)
- Scores uploaded to Braintrust with parent span reference
- Session manager tracks scores for averaging
5. **Session End** (NxtScape._endTelemetrySession)
- Calculates average score across all tasks
- Logs session summary to Braintrust
### Key Components and Their Interactions
#### SimpleToolWrapper ([src/evals2/SimpleToolWrapper.ts](src/evals2/SimpleToolWrapper.ts))
- **Purpose**: Lightweight tool duration tracking
- **Integration Point**: BrowserAgent wraps tools before execution
- **Storage**: Uses ExecutionContext.toolMetrics Map
- **Output**: Console logs with timing (⚡ for success, ❌ for failure)
#### SimplifiedScorer ([src/evals2/SimplifiedScorer.ts](src/evals2/SimplifiedScorer.ts))
- **extractToolCalls** method ([lines 70-115](src/evals2/SimplifiedScorer.ts:70-115)):
- Iterates through messages to find AIMessage with tool_calls
- Matches with ToolMessage responses
- Merges with toolMetrics Map data for accurate durations
- **Scoring Methods**:
- **scoreGoalCompletion** ([lines 117-150](src/evals2/SimplifiedScorer.ts:117-150)): Uses LLM or checks for done_tool
- **scorePlanCorrectness** ([lines 152-187](src/evals2/SimplifiedScorer.ts:152-187)): Evaluates tool sequence efficiency
- **scoreErrorFreeExecution** ([lines 189-202](src/evals2/SimplifiedScorer.ts:189-202)): Success ratio minus penalties
- **scoreContextEfficiency** ([lines 204-219](src/evals2/SimplifiedScorer.ts:204-219)): Token usage estimation
#### SimpleBraintrustEventManager ([src/evals2/SimpleBraintrustEventManager.ts](src/evals2/SimpleBraintrustEventManager.ts))
- **Singleton Pattern**: Ensures single instance across conversation
- **Session Lifecycle** (a rough sketch of the class shape follows this list):
- startSession: Creates parent span ([lines 91-133](src/evals2/SimpleBraintrustEventManager.ts:91-133))
- addTaskScore: Accumulates scores ([lines 138-142](src/evals2/SimpleBraintrustEventManager.ts:138-142))
- endSession: Calculates averages and logs ([lines 147-191](src/evals2/SimpleBraintrustEventManager.ts:147-191))
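The manager itself isn't reproduced in this document; the following is a rough sketch of its shape inferred from the lifecycle above (method bodies and field names are illustrative, not the actual implementation):
```typescript
// Rough sketch of SimpleBraintrustEventManager (inferred, not verbatim)
export class SimpleBraintrustEventManager {
  private static instance: SimpleBraintrustEventManager;
  private parentSpan: string | null = null;  // exported parent span for child events
  private taskScores: number[] = [];

  static getInstance(): SimpleBraintrustEventManager {
    if (!this.instance) this.instance = new SimpleBraintrustEventManager();
    return this.instance;
  }

  async startSession(task: string): Promise<void> {
    // Creates the conversation-level parent span in Braintrust and stores its export
  }

  addTaskScore(score: number): void {
    this.taskScores.push(score);  // accumulated for session averaging
  }

  async endSession(): Promise<void> {
    const avg = this.taskScores.length
      ? this.taskScores.reduce((sum, s) => sum + s, 0) / this.taskScores.length
      : 0;
    // Logs a session_end event with `avg`, duration and task count, then clears state
  }
}
```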
#### SimpleBraintrustLogger ([src/evals2/SimpleBraintrustLogger.ts](src/evals2/SimpleBraintrustLogger.ts))
- **Lazy Initialization**: Loads Braintrust SDK only when needed
- **Simple API**: Single logTaskScore method
- **Score Upload** ([lines 45-90](src/evals2/SimpleBraintrustLogger.ts:45-90)):
- Logs input, output, scores, and metadata
- Uses parent span for hierarchy
- Silent failure to avoid breaking execution
### Results Collection and Reporting
Results are collected at multiple levels:
1. **Per-Tool Metrics**:
- Stored in ExecutionContext.toolMetrics Map
- Console output with timing information
- Included in scoring calculations
2. **Per-Task Scores**:
- Calculated after each task completion in NxtScape
- Uploaded to Braintrust as `evals2_task_score` events
- Added to session manager for averaging
3. **Per-Session Summary**:
- Average score across all tasks
- Session duration and task count
- Logged as `session_end` event in Braintrust
4. **Braintrust Dashboard**:
- Viewable at https://braintrust.dev/app/Felafax/p/browseros-agent-online/logs
- Events tagged with `evals2_task_score` and `agent_session`
### Configuration and Setup Requirements
The system requires minimal configuration:
1. **Environment Variables** ([src/evals2/config.ts:13-17](src/evals2/config.ts:13-17)):
- `ENABLE_EVALS2=true` - Enables the evaluation system
- `BRAINTRUST_API_KEY` - Required for score upload
- `OPENAI_MODEL_FOR_SCORING` - Optional, defaults to gpt-4o-mini
2. **Integration Points** (only 2 hooks):
- **NxtScape** ([src/lib/core/NxtScape.ts](src/lib/core/NxtScape.ts)):
- Session initialization at conversation start
- Scoring after each task
- Session end on cleanup
- **BrowserAgent** ([src/lib/agent/BrowserAgent.ts:341-344](src/lib/agent/BrowserAgent.ts)):
- Tool wrapping for duration tracking
3. **ExecutionContext Extension** ([src/lib/runtime/ExecutionContext.ts](src/lib/runtime/ExecutionContext.ts)):
- Added toolMetrics Map field
- Cleared on context reset
### Differences/Improvements from Version 1
Based on the README comparison ([src/evals2/README.md:73-82](src/evals2/README.md:73-82)):
| Aspect | Old Evals (v1) | Evals2 |
|--------|----------------|---------|
| **Code Size** | ~2000 lines | ~500 lines (75% reduction) |
| **Scoring Dimensions** | 6 complex | 4 simple |
| **Tool Tracking** | Braintrust wrapTraced | Map-based duration |
| **Session Management** | Complex telemetry | Simple parent span |
| **Dependencies** | Multiple heavy deps | Minimal, lazy-loaded |
| **Integration Complexity** | Many hooks throughout | 2 hooks total |
| **Performance Overhead** | Higher with spans | ~1ms per tool call |
Key improvements:
1. **Simplicity**: Drastically reduced complexity while maintaining functionality
2. **Performance**: Lightweight Map-based tracking vs heavy span management
3. **Maintainability**: Clear separation of concerns, modular design
4. **Flexibility**: Can work with or without LLM for scoring
5. **Minimal Disruption**: Only 2 integration points in existing code
## Architecture Insights
1. **Singleton Pattern for Session Management**: Ensures consistent session tracking across the entire conversation lifecycle without passing managers through multiple layers.
2. **Map-Based Tool Metrics**: Using ExecutionContext.toolMetrics Map provides O(1) lookup performance and avoids the complexity of span-based tracking.
3. **Lazy Loading Strategy**: Both Braintrust modules are lazy-loaded to avoid initialization issues and reduce startup overhead.
4. **Graceful Degradation**: The system continues to function even if:
- No API key is provided (local scoring only)
- LLM is unavailable (falls back to heuristics)
- Braintrust upload fails (silent failure)
5. **Separation of Scoring and Reporting**: SimplifiedScorer is completely independent of Braintrust, making it testable and reusable.
## Code References
### Core Implementation Files
- `src/evals2/SimpleToolWrapper.ts` - Tool duration tracking wrapper
- `src/evals2/SimplifiedScorer.ts:29-63` - Main scoring orchestration
- `src/evals2/SimpleBraintrustEventManager.ts:91-133` - Session initialization
- `src/evals2/SimpleBraintrustLogger.ts:45-90` - Score upload logic
### Integration Points
- `src/lib/core/NxtScape.ts:293-317` - Task scoring after execution
- `src/lib/core/NxtScape.ts:378-413` - Session initialization
- `src/lib/agent/BrowserAgent.ts:341-344` - Tool wrapping
- `src/lib/runtime/ExecutionContext.ts:60-65` - toolMetrics Map definition
### Configuration
- `src/config.ts:55` - ENABLE_EVALS2 flag
- `src/evals2/config.ts:2-7` - Scoring weights
- `src/evals2/types.ts` - Data structure definitions
## Testing Strategy
The system includes both unit and integration tests:
1. **Unit Tests** ([src/evals2/SimplifiedScorer.test.ts](src/evals2/SimplifiedScorer.test.ts)):
- Tests individual scoring dimensions
- Validates tool extraction from messages
- Checks scoring calculations
2. **Integration Tests** ([src/evals2/integration.test.ts](src/evals2/integration.test.ts)):
- Verifies tool wrapping functionality
- Tests scorer with real message structures
- Validates metrics collection
3. **Config Tests** ([src/evals2/config.test.ts](src/evals2/config.test.ts)):
- Ensures configuration constants are valid
- Validates scoring weight totals
## Open Questions
1. **Scoring Model Selection**: The system defaults to gpt-4o-mini for scoring. Is this the optimal choice for balance between cost and quality?
2. **Weight Optimization**: The current weights (40/30/15/15) seem reasonable but could benefit from empirical validation against human evaluations.
3. **Retry Detection Logic**: The current retry detection ([src/evals2/SimplifiedScorer.ts:221-229](src/evals2/SimplifiedScorer.ts:221-229)) uses consecutive same-tool calls. This might miss retries with intermediate steps.
4. **Token Estimation**: The 4 chars/token estimation ([src/evals2/SimplifiedScorer.ts:211](src/evals2/SimplifiedScorer.ts:211)) is rough. Consider using a proper tokenizer for accuracy.
5. **Session Persistence**: Sessions are only tracked in-memory. Consider persisting session data for crash recovery or long-running conversations.

View File

@@ -0,0 +1,151 @@
---
date: 2025-09-05T16:50:09-07:00
researcher: Claude
git_commit: 98d55e952578932b98f1b36bfe4e29728acaa1fa
branch: dev/evals2
repository: BrowserOS-agent
topic: "Environment Variables Handling in Chrome Extension"
tags: [research, codebase, webpack, environment-variables, chrome-extension, process-env]
status: complete
last_updated: 2025-09-05
last_updated_by: Claude
---
# Research: Environment Variables Handling in Chrome Extension
**Date**: 2025-09-05T16:50:09-07:00
**Researcher**: Claude
**Git Commit**: 98d55e952578932b98f1b36bfe4e29728acaa1fa
**Branch**: dev/evals2
**Repository**: BrowserOS-agent
## Research Question
How are environment variables handled in this Chrome extension codebase, and what's causing the "process is not defined" error with GOOGLE_GENAI_API_KEY and GEMINI_API_KEY?
## Summary
The codebase uses webpack's DefinePlugin to inject environment variables at build time by replacing `process.env.VARIABLE_NAME` with actual string values. The recent "process is not defined" error occurs because `GOOGLE_GENAI_API_KEY` and `GEMINI_API_KEY` are used in `src/config.ts` but are **NOT defined in webpack.config.js's DefinePlugin configuration**. This causes webpack to leave `process.env.GOOGLE_GENAI_API_KEY` as-is in the bundle, which fails at runtime since Chrome extensions don't have a `process` object.
## Detailed Findings
### Webpack DefinePlugin Pattern
The codebase follows a specific pattern for handling environment variables in webpack.config.js:
1. **Environment variables are loaded from .env file** ([webpack.config.js:13-15](webpack.config.js#L13-L15)):
```javascript
const env = dotenv.config()
envKeys = env.parsed || {}
```
2. **Variables are explicitly defined in processEnv object** ([webpack.config.js:23-36](webpack.config.js#L23-L36)):
```javascript
const processEnv = {
'process.env.POSTHOG_API_KEY': JSON.stringify(envKeys.POSTHOG_API_KEY || ''),
'process.env.KLAVIS_API_KEY': JSON.stringify(envKeys.KLAVIS_API_KEY || ''),
'process.env.NODE_ENV': JSON.stringify(process.env.NODE_ENV || 'development'),
// ... other variables
}
```
3. **DefinePlugin replaces process.env references at build time** ([webpack.config.js:164](webpack.config.js#L164)):
```javascript
new webpack.DefinePlugin(processEnv)
```
### Current Working Examples
Several environment variables are successfully used throughout the codebase:
1. **POSTHOG_API_KEY** - Defined in webpack, used in [src/lib/utils/Logging.ts:42](src/lib/utils/Logging.ts#L42)
2. **KLAVIS_API_KEY** - Defined in webpack, used in [src/lib/mcp/KlavisAPIManager.ts:18](src/lib/mcp/KlavisAPIManager.ts#L18)
3. **ENABLE_TELEMETRY** - Defined in webpack, used in [src/config.ts:72](src/config.ts#L72)
4. **BRAINTRUST_API_KEY** - Defined in webpack, used in [src/config.ts:74](src/config.ts#L74)
### The Problem with GOOGLE_GENAI_API_KEY and GEMINI_API_KEY
In [src/config.ts:79-80](src/config.ts#L79-80), these variables are used:
```typescript
export const GOOGLE_GENAI_API_KEY = process.env.GOOGLE_GENAI_API_KEY || '';
export const GEMINI_API_KEY = process.env.GEMINI_API_KEY || '';
```
However, **these variables are NOT defined in webpack.config.js's processEnv object**. This means:
1. Webpack doesn't replace `process.env.GOOGLE_GENAI_API_KEY` with a string value
2. The code `process.env.GOOGLE_GENAI_API_KEY` remains in the bundle
3. At runtime in the Chrome extension, `process` is undefined, causing the error
### How Chrome Extensions Handle JavaScript
Chrome extensions run in a browser environment where:
- There is no Node.js `process` global object
- Environment variables don't exist at runtime
- All configuration must be injected at build time or stored in extension storage (illustrated below)
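To make the build-time replacement concrete, here is what the relevant line of `config.ts` turns into in the bundle; the key value is a made-up placeholder:
```typescript
// Source (src/config.ts):
//   export const GEMINI_API_KEY = process.env.GEMINI_API_KEY || '';

// After DefinePlugin replacement, the bundle contains a plain string literal:
export const GEMINI_API_KEY = "sk-placeholder-123" || '';

// If the variable is missing from processEnv, the reference is left untouched and
// `process.env.GEMINI_API_KEY` throws "process is not defined" at extension runtime.
```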
## Code References
- `webpack.config.js:23-36` - processEnv object definition where env vars are configured
- `webpack.config.js:164` - DefinePlugin usage
- `src/config.ts:79-80` - GOOGLE_GENAI_API_KEY and GEMINI_API_KEY usage (problematic)
- `src/config.ts:72-76` - Working examples of env var usage
- `src/evals2/SimplifiedScorer.ts:30` - Where these API keys are consumed
- `.env.example:1-16` - Documentation of expected env vars (missing Google/Gemini keys)
## Architecture Insights
1. **Build-time Replacement**: The codebase uses webpack's DefinePlugin to perform build-time string replacement, not runtime environment variable access.
2. **Explicit Declaration Required**: Every environment variable used in the codebase MUST be explicitly declared in webpack.config.js's processEnv object.
3. **String Serialization**: Values must be JSON.stringify'd to ensure they're properly formatted as string literals in the final bundle.
4. **No Dynamic Access**: You cannot dynamically access environment variables at runtime in a Chrome extension - all must be known at build time.
## The Correct Fix
To fix the "process is not defined" error, add the missing environment variables to webpack.config.js:
```javascript
const processEnv = {
'process.env.POSTHOG_API_KEY': JSON.stringify(envKeys.POSTHOG_API_KEY || ''),
'process.env.KLAVIS_API_KEY': JSON.stringify(envKeys.KLAVIS_API_KEY || ''),
'process.env.NODE_ENV': JSON.stringify(process.env.NODE_ENV || 'development'),
// Braintrust Telemetry Configuration
'process.env.ENABLE_TELEMETRY': JSON.stringify(envKeys.ENABLE_TELEMETRY || 'false'),
'process.env.ENABLE_EVALS2': JSON.stringify(envKeys.ENABLE_EVALS2 || 'false'),
'process.env.BRAINTRUST_API_KEY': JSON.stringify(envKeys.BRAINTRUST_API_KEY || ''),
'process.env.BRAINTRUST_PROJECT_UUID': JSON.stringify(envKeys.BRAINTRUST_PROJECT_UUID || ''),
'process.env.BRAINTRUST_PROJECT_NAME': JSON.stringify(envKeys.BRAINTRUST_PROJECT_NAME || 'browseros-agent-online'),
// OpenAI Configuration for Scoring
'process.env.OPENAI_API_KEY_FOR_SCORING': JSON.stringify(envKeys.OPENAI_API_KEY_FOR_SCORING || ''),
'process.env.OPENAI_MODEL_FOR_SCORING': JSON.stringify(envKeys.OPENAI_MODEL_FOR_SCORING || 'gpt-4o'),
// ADD THESE TWO LINES:
'process.env.GOOGLE_GENAI_API_KEY': JSON.stringify(envKeys.GOOGLE_GENAI_API_KEY || ''),
'process.env.GEMINI_API_KEY': JSON.stringify(envKeys.GEMINI_API_KEY || '')
}
```
Also update `.env.example` to document these new variables:
```
# Gemini/Google AI Configuration
GOOGLE_GENAI_API_KEY=""
GEMINI_API_KEY=""
```
## Recommendations
1. **Immediate Fix**: Add `GOOGLE_GENAI_API_KEY` and `GEMINI_API_KEY` to webpack.config.js's processEnv object.
2. **Update Documentation**: Add these keys to `.env.example` so developers know they're available.
3. **Build Process**: After making these changes, rebuild the extension with `npm run build` or `npm run build:dev`.
4. **Testing**: Verify the fix by checking that SimplifiedScorer can access these API keys without runtime errors.
5. **Pattern Consistency**: Always follow the pattern of adding new environment variables to BOTH:
- webpack.config.js processEnv object (for build-time replacement)
- .env.example (for documentation)
6. **Consider a Validation Step**: Add a build-time check to ensure all process.env references in src/ have corresponding entries in webpack's processEnv (a sketch follows this list).
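One possible shape for that check, as a small Node script run before the build; the file name and regexes are assumptions, not an existing script:
```typescript
// scripts/check-env-vars.ts (hypothetical): fail the build if src/ references a
// process.env.* key that webpack.config.js does not declare in processEnv
import { execSync } from 'child_process';
import { readFileSync } from 'fs';

const webpackConfig = readFileSync('webpack.config.js', 'utf8');
const declared = new Set(
  [...webpackConfig.matchAll(/'process\.env\.([A-Z0-9_]+)'/g)].map(m => m[1])
);

// Collect all process.env.FOO references under src/
const grepOutput = execSync(
  `grep -rhoE "process\\.env\\.[A-Z0-9_]+" src/ || true`,
  { encoding: 'utf8' }
);
const used = new Set(
  grepOutput.split('\n').filter(Boolean).map(s => s.replace('process.env.', ''))
);

const missing = [...used].filter(name => !declared.has(name));
if (missing.length > 0) {
  console.error('Missing from webpack processEnv:', missing.join(', '));
  process.exit(1);
}
```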
## Open Questions
- Should there be a linting rule or build step to catch undefined environment variables before runtime?
- Would it be beneficial to centralize all environment variable definitions in a single configuration file?

View File

@@ -23,7 +23,16 @@ if (!env.parsed) {
const processEnv = {
'process.env.POSTHOG_API_KEY': JSON.stringify(envKeys.POSTHOG_API_KEY || ''),
'process.env.KLAVIS_API_KEY': JSON.stringify(envKeys.KLAVIS_API_KEY || ''),
'process.env.NODE_ENV': JSON.stringify(process.env.NODE_ENV || 'development')
'process.env.NODE_ENV': JSON.stringify(process.env.NODE_ENV || 'development'),
// Braintrust Telemetry Configuration
'process.env.ENABLE_TELEMETRY': JSON.stringify(envKeys.ENABLE_TELEMETRY || 'false'),
'process.env.ENABLE_EVALS2': JSON.stringify(envKeys.ENABLE_EVALS2 || 'false'),
'process.env.BRAINTRUST_API_KEY': JSON.stringify(envKeys.BRAINTRUST_API_KEY || ''),
'process.env.BRAINTRUST_PROJECT_UUID': JSON.stringify(envKeys.BRAINTRUST_PROJECT_UUID || ''),
'process.env.BRAINTRUST_PROJECT_NAME': JSON.stringify(envKeys.BRAINTRUST_PROJECT_NAME || 'browseros-agent-online'),
// Gemini API keys for evals2 scoring
'process.env.GOOGLE_GENAI_API_KEY': JSON.stringify(envKeys.GOOGLE_GENAI_API_KEY || ''),
'process.env.GEMINI_API_KEY': JSON.stringify(envKeys.GEMINI_API_KEY || '')
}
console.log('API keys will be injected at build time (keys hidden for security)')
@@ -121,9 +130,10 @@ module.exports = {
],
},
plugins: [
// Limit chunks to only main entry points (4 total: sidepanel, background, glow-animation, newtab)
// Limit chunks to entry points only - prevents dynamic chunk creation
// This forces all imports (including dynamic) to be bundled into their parent entry
new webpack.optimize.LimitChunkCountPlugin({
maxChunks: 4
maxChunks: 4 // One chunk per entry point (sidepanel, background, glow-animation, newtab)
}),
new HtmlWebpackPlugin({
template: './src/sidepanel/index.html',

yarn.lock: 1247 changes (diff suppressed because it is too large)