mirror of https://github.com/browseros-ai/BrowserOS.git synced 2026-05-18 19:16:22 +00:00

Files

Felarof 1abbee638a Braintrust basic evals (#87 )

* implement validator eval

* implement online eval foundation

* further implementing online evals

* enhance evaluation data logging

* implement LLM scoring, remove redundant EventEnricher

* cleanup

* fix build errs from merging, extend LLM scorer context

* settled evaluation framework

* update evals documentation

* fix evals screenshots

* fix typos

* Evals config moved to env variables and tested

* test

* Update manifest to 49.1

* Removed duplciate + button

* Just use previous way of registering tools as that is not required for evals

* Add claude commands for research, plan and implement

* evals2 research and plan

implementation plan

new implementation plan

* Evals2 implementation

test test

* Removed old eval hooks

Remove old evals hooks

* evals 2 added to env

* Eval2 enhancement plan

backup

* Make Braintrust project configurable

Make Braintrust project configurable

* Enhanced scorer -- using Gemini 2.5 pro for evaluation

backup v0.1

enhancement v0.2

v0.2

backup v0.3

backup v0.4

* Deleted old evals directory

* Clean up old evals code

* Bunch of fixes and improvements

backup

fixes 0.1

more fixes

fixes

more elaborate prompts

braintrust logger fix

* Renamed files

backup

2025-09-05 18:04:07 -07:00

12 KiB

Raw Blame History

Evals2 Implementation Documentation

Overview

Evals2 is a simplified evaluation framework for the Nxtscape browser automation system. It represents a complete rewrite of the original evaluation system, achieving a 75% reduction in code complexity (500 lines vs 2000+) while maintaining full functionality.

Architecture

Core Components

The evals2 system consists of four main components:

┌─────────────────────────────────────────────────────┐
│                    NxtScape                         │
│  (Session Lifecycle & Scoring Trigger)              │
└────────────────────┬────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────┐
│                 BrowserAgent                        │
│  (Tool Wrapping & Metrics Collection)               │
└────────────────────┬────────────────────────────────┘
                     │
        ┌────────────┴────────────┬─────────────┐
        ▼                         ▼             ▼
┌───────────────┐  ┌──────────────────┐  ┌──────────────┐
│SimpleToolWrapper│  │SimplifiedScorer  │  │SimpleBraintrust│
│               │  │                  │  │EventManager   │
│ Duration      │  │ 4-Dimension      │  │              │
│ Tracking      │  │ Scoring Engine   │  │ Session Mgmt │
└───────────────┘  └──────────────────┘  └──────────────┘
                            │
                   ┌────────▼─────────┐
                   │SimpleBraintrust  │
                   │Logger           │
                   │                 │
                   │ Score Reporting │
                   └─────────────────┘

Component Details

1. SimpleToolWrapper (`src/evals2/SimpleToolWrapper.ts`)

Purpose: Lightweight tool duration tracking
Implementation: Uses Map-based storage in ExecutionContext.toolMetrics
Performance: ~1ms overhead per tool call
Key Methods:
- wrapTool(): Wraps a tool with start/end timing logic
- Stores metrics as {toolName, startTime, endTime} in Map

2. SimplifiedScorer (`src/evals2/SimplifiedScorer.ts`)

Purpose: Multi-dimensional scoring of agent performance
Scoring Dimensions:
- Goal Completion (40%): Task achievement assessment
- Plan Correctness (30%): Execution efficiency evaluation
- Error-Free Execution (15%): Error handling quality
- Context Efficiency (15%): Token usage optimization
Features:
- LLM-based scoring with GPT-4o-mini (when available)
- Heuristic fallback for offline/no-API scenarios
- Returns structured scores with explanations

3. SimpleBraintrustEventManager (`src/evals2/SimpleBraintrustEventManager.ts`)

Purpose: Session lifecycle management
Key Features:
- Parent span creation for conversation sessions
- Lazy loading of Braintrust SDK
- Graceful handling of missing API keys
- Session ID tracking

4. SimpleBraintrustLogger (`src/evals2/SimpleBraintrustLogger.ts`)

Purpose: Score reporting to Braintrust platform
Implementation:
- Uploads scores as child spans
- Includes metadata (model, prompts, metrics)
- Handles connection failures gracefully

Execution Flow

1. Session Initialization

// In NxtScape.run()
if (process.env.ENABLE_EVALS2 === 'true') {
  await SimpleBraintrustEventManager.startConversationSession({
    sessionId: executionContext.sessionId,
    userId: 'user',
    initialMessage: userMessage
  });
}

2. Tool Wrapping

// In BrowserAgent.bindToolsToLLM()
if (process.env.ENABLE_EVALS2 === 'true') {
  const wrappedTools = tools.map(tool => 
    SimpleToolWrapper.wrapTool(tool, this.executionContext)
  );
}

3. Metrics Collection

During execution, tool durations are automatically collected:

// Stored in ExecutionContext.toolMetrics Map
Map<string, {
  toolName: string;
  startTime: number;
  endTime: number;
}>

4. Scoring After Task

// In NxtScape.run() after agent.execute()
const scores = await SimplifiedScorer.scoreMessages({
  messages: executionContext.messageManager.messages,
  toolMetrics: executionContext.toolMetrics,
  userMessage: userMessage,
  finalResponse: result
});

5. Score Reporting

await SimpleBraintrustLogger.logScores({
  scores,
  metadata: {
    model: llmSettings.model,
    provider: llmSettings.provider,
    sessionId: executionContext.sessionId
  },
  parentSpan: SimpleBraintrustEventManager.getParentSpan()
});

Scoring Methodology

Four-Dimension Scoring System

Goal Completion (40% weight)
- Evaluates if the agent achieved the user's requested task
- Scored 0-10 based on completion level
- Considers partial completions and alternative solutions
Plan Correctness (30% weight)
- Assesses the efficiency of the execution plan
- Evaluates tool selection and sequencing
- Penalizes unnecessary steps or redundant actions
Error-Free Execution (15% weight)
- Tracks error handling and recovery
- Scores based on error frequency and severity
- Rewards graceful degradation
Context Efficiency (15% weight)
- Measures token usage optimization
- Evaluates message conciseness
- Rewards efficient context management

Scoring Implementation

// LLM-based scoring (preferred)
if (process.env.OPENAI_MODEL_FOR_SCORING) {
  const llmScore = await this.scoreWithLLM(messages, userMessage);
  return llmScore;
}

// Heuristic fallback
return this.scoreWithHeuristics(messages, toolMetrics);

Configuration

Environment Variables

# Enable evals2 system
ENABLE_EVALS2=true

# Braintrust API key for reporting
BRAINTRUST_API_KEY=your-braintrust-api-key

# Optional: OpenAI model for scoring
OPENAI_MODEL_FOR_SCORING=gpt-4o-mini

# Optional: OpenAI API key (if different from main)
OPENAI_API_KEY=your-openai-api-key

Integration Points

The system requires minimal integration with only two hooks:

NxtScape (src/lib/core/NxtScape.ts):
- Session start/end lifecycle
- Scoring trigger after task completion
BrowserAgent (src/lib/agent/BrowserAgent.ts):
- Tool wrapping for metrics collection

Key Improvements from V1

Code Simplification

75% reduction in codebase size (500 lines vs 2000+)
Removed complex span tree management
Simplified to Map-based tracking

Performance

~1ms overhead per tool call (vs 10-20ms in v1)
Map lookups instead of span traversal
Lazy loading of dependencies

Reliability

Graceful degradation when APIs unavailable
Works offline with heuristic scoring
No blocking operations

Maintainability

Clear separation of concerns
Testable components
Minimal coupling with main codebase

Usage Examples

Basic Usage

// Automatic - just set environment variable
process.env.ENABLE_EVALS2 = 'true';

// The system will automatically:
// 1. Track all tool executions
// 2. Score after each task
// 3. Report to Braintrust (if configured)

Programmatic Access

// Access scores directly
const scores = await SimplifiedScorer.scoreMessages({
  messages: messageHistory,
  toolMetrics: toolMetricsMap,
  userMessage: "Book a flight to Paris",
  finalResponse: agentResponse
});

console.log(`Goal Completion: ${scores.goalCompletion}/10`);
console.log(`Overall Score: ${scores.overallScore}/10`);

Custom Tool Wrapping

// Wrap a custom tool
const wrappedTool = SimpleToolWrapper.wrapTool(
  myCustomTool,
  executionContext
);

// Metrics automatically collected in executionContext.toolMetrics

Testing

Unit Tests

# Run evals2 specific tests
npm test -- src/evals2/

# Test individual components
npm test -- SimplifiedScorer.test.ts

Integration Testing

# Enable evals2 and run full integration
ENABLE_EVALS2=true npm test -- integration/

Monitoring & Debugging

Debug Output

// Enable debug logging
process.env.DEBUG_EVALS2 = 'true';

// Logs will show:
// - Tool wrapping events
// - Scoring calculations
// - Braintrust upload status

Metrics Access

// Access raw metrics during execution
const metrics = executionContext.toolMetrics;
metrics.forEach((metric, id) => {
  console.log(`Tool: ${metric.toolName}`);
  console.log(`Duration: ${metric.endTime - metric.startTime}ms`);
});

Future Improvements

Planned Enhancements

Real-time scoring - Score during execution, not just after
Custom scoring dimensions - Allow user-defined scoring criteria
Batch uploading - Aggregate scores before uploading
Local storage - Cache scores locally for offline analysis

Open Questions

Should scoring be synchronous or async with the main flow?
How to handle multi-turn conversations vs single tasks?
Should we support custom scoring providers beyond OpenAI?
How to visualize scores in the UI?

Troubleshooting

Common Issues

Evals2 not running:

Check ENABLE_EVALS2=true is set
Verify environment variables are loaded

Scores not uploading:

Verify BRAINTRUST_API_KEY is valid
Check network connectivity
Look for error logs in console

LLM scoring failing:

Verify OPENAI_MODEL_FOR_SCORING is set
Check OpenAI API key and quota
System falls back to heuristics automatically

High overhead:

Check for duplicate tool wrapping
Verify Maps are being cleared after sessions
Monitor memory usage

API Reference

SimplifiedScorer

interface ScoreResult {
  goalCompletion: number;      // 0-10
  planCorrectness: number;      // 0-10
  errorFreeExecution: number;   // 0-10
  contextEfficiency: number;    // 0-10
  overallScore: number;         // Weighted average
  explanation?: string;         // LLM reasoning
}

class SimplifiedScorer {
  static async scoreMessages(params: {
    messages: Message[];
    toolMetrics: Map<string, ToolMetric>;
    userMessage: string;
    finalResponse: any;
  }): Promise<ScoreResult>;
}

SimpleToolWrapper

class SimpleToolWrapper {
  static wrapTool(
    tool: DynamicStructuredTool,
    executionContext: ExecutionContext
  ): DynamicStructuredTool;
}

SimpleBraintrustEventManager

class SimpleBraintrustEventManager {
  static async startConversationSession(params: {
    sessionId: string;
    userId: string;
    initialMessage: string;
  }): Promise<void>;
  
  static async endConversationSession(): Promise<void>;
  static getParentSpan(): any;
}

SimpleBraintrustLogger

class SimpleBraintrustLogger {
  static async logScores(params: {
    scores: ScoreResult;
    metadata: any;
    parentSpan?: any;
  }): Promise<void>;
}

Conclusion

Evals2 represents a significant improvement in evaluation system design, prioritizing simplicity, performance, and reliability. The system's modular architecture and minimal integration requirements make it easy to maintain and extend while providing comprehensive evaluation capabilities for the Nxtscape browser automation system.

12 KiB Raw Blame History