8.4 KiB
Enhanced HTTP Get Tool with Playwright Implementation Plan
Overview
Enhance the existing http_get tool to support Single Page Applications (SPAs) by integrating Playwright with Chromium. The current implementation uses fetch, which cannot execute JavaScript or handle dynamic content. The enhanced version will:
- Use Playwright to render JavaScript-heavy websites
- Convert HTML responses to Markdown for better LLM consumption
- Bypass bot detection mechanisms using realistic browser fingerprints and behaviors
- Maintain backward compatibility with the existing
fetch-based approach for simple requests
User Review Required
Important
Fallback Strategy
The plan includes using
fetchas the primary method and falling back to Playwright only when needed (e.g., on 403 errors, detected SPAs, or explicit user request). This minimizes resource usage and latency.Alternative: Always use Playwright for all requests. This would be slower but more consistent. Please confirm which approach you prefer.
Important
HTML to Markdown Conversion Library
The plan proposes using
turndown(popular, well-maintained) for HTML-to-Markdown conversion.Alternatives:
html-to-md(lighter but less feature-rich)node-html-markdown(newer, good TypeScript support)Please confirm if
turndownis acceptable or if you prefer a different library.
Important
Bot Detection Bypass Techniques
The plan includes:
- Realistic user agents
- Randomized viewport sizes
- Stealth plugin for Playwright
- Realistic mouse movements and delays
These techniques may not work for all sites with advanced bot detection (e.g., Cloudflare Turnstile, reCAPTCHA). Please confirm if this level of bypass is sufficient or if you need more advanced techniques.
Proposed Changes
Component 1: Dependencies
Add required npm packages for Playwright and HTML-to-Markdown conversion.
[MODIFY] package.json
- Add
playwrightto dependencies for browser automation - Add
playwright-extraandpuppeteer-extra-plugin-stealthfor bot detection bypass - Add
turndownfor HTML-to-Markdown conversion - Add corresponding TypeScript type definitions to devDependencies
Component 2: HTML to Markdown Converter
Create a utility service to convert HTML content to clean, readable Markdown.
[NEW] html-to-markdown.ts
- Implement
htmlToMarkdown(html: string, options?: ConversionOptions): stringfunction - Configure Turndown to preserve important elements (links, images, code blocks, tables)
- Strip unnecessary elements (scripts, styles, navigation, footers)
- Handle edge cases (malformed HTML, empty content)
- Export configuration options for customization
Component 3: Playwright Browser Service
Create a service to manage Playwright browser instances and page interactions.
[NEW] browser-service.ts
- Implement singleton pattern for browser instance management (reuse browser across requests)
- Configure Chromium with stealth plugin to bypass bot detection
- Implement realistic user agent rotation
- Implement randomized viewport sizes
- Add methods:
fetchWithBrowser(url: string, options?: BrowserFetchOptions): Promise<{ status: number, html: string, finalUrl: string }>close(): Promise<void>for cleanup
- Handle timeouts and errors gracefully
- Add realistic delays and mouse movements to mimic human behavior
Component 4: Enhanced HTTP Get Tool
Update the existing http_get tool to intelligently choose between fetch and Playwright.
[MODIFY] tool-host.ts
- Update
httpGetToolto accept new optional parameters:useBrowser?: boolean- Force Playwright usageconvertToMarkdown?: boolean- Convert HTML to Markdown (default: true for HTML responses)
- Implement smart fallback logic:
- Try
fetchfirst (fast path) - If 403/401 or user specified
useBrowser, use Playwright - Detect HTML content type and convert to Markdown if requested
- Try
- Return enhanced response:
{ status: number, body: string, // HTML or Markdown contentType: string, finalUrl: string, // After redirects method: 'fetch' | 'browser' // Which method was used } - Handle errors from both methods and provide clear error messages
Component 5: Bot Detection Bypass Configuration
Create configuration for user agents, viewport sizes, and stealth settings.
[NEW] browser-config.ts
- Export array of realistic user agents (Chrome, Firefox, Safari on various OS)
- Export array of common viewport sizes (desktop and mobile)
- Export stealth plugin configuration
- Export function to randomly select user agent and viewport
- Document bot detection bypass techniques used
Component 6: Planner Prompt Update
Update the planner prompt to document the enhanced http_get capabilities.
[MODIFY] planner.ts
- Update
http_gettool documentation to include new parameters - Add examples showing when to use
useBrowser: true - Add examples showing Markdown conversion for web scraping tasks
- Document that HTML responses are automatically converted to Markdown
Component 7: Configuration
Add browser service configuration to the main config.
[MODIFY] config.json
- Add
browserServicesection with:headless: true- Run browser in headless modetimeout: 30000- Page load timeout in msenableStealth: true- Enable bot detection bypassreuseContext: true- Reuse browser context for performance
[MODIFY] config.ts
- Add TypeScript types for
browserServiceconfiguration - Add validation for browser service config
Verification Plan
Automated Tests
-
Unit test for HTML to Markdown converter
npm test src/utils/__tests__/html-to-markdown.test.ts- Test conversion of various HTML elements (headings, lists, tables, code blocks)
- Test stripping of unwanted elements (scripts, styles)
- Test handling of malformed HTML
-
Integration test for browser service
npm test src/services/__tests__/browser-service.test.ts- Test fetching a simple static page
- Test fetching a JavaScript-heavy SPA (e.g., React app)
- Test timeout handling
- Test browser instance reuse
- Test cleanup on shutdown
-
Integration test for enhanced http_get tool
npm test src/services/__tests__/tool-host.test.ts- Test
fetchfallback to Playwright on 403 - Test explicit
useBrowser: trueparameter - Test HTML to Markdown conversion
- Test response format with both methods
- Test
Manual Verification
-
Test with SPA website
- Start the orchestrator:
npm run dev:orchestrator - Send message via Telegram: "Fetch https://react-example-app.com and summarize the content"
- Verify that the tool uses Playwright (check logs)
- Verify that content is properly extracted and converted to Markdown
- Start the orchestrator:
-
Test bot detection bypass
- Test with a site known to block bots (e.g., some news sites)
- Send message: "Fetch https://site-with-bot-detection.com with browser"
- Verify that the request succeeds (status 200)
- Verify that content is properly extracted
-
Test fallback mechanism
- Test with a simple static site (should use
fetch) - Test with a site that returns 403 (should fallback to Playwright)
- Check logs to confirm which method was used
- Verify response times (fetch should be faster)
- Test with a simple static site (should use
-
Test Markdown conversion
- Fetch a Wikipedia article
- Verify that the Markdown output is clean and readable
- Verify that links, headings, and lists are properly formatted
- Verify that navigation, scripts, and styles are stripped
-
Test error handling
- Test with invalid URL
- Test with timeout (very slow site)
- Test with network error
- Verify that error messages are clear and helpful