FP-14: Update documentation for file processing pipeline

- COMPONENTS.md: add File Processor service description,
  add File Upload integration flow (steps 1-9)
- TECH.md: add File Processing section (nodejs-whisper, ffmpeg-static, vision)
- MESSAGE PROTOCOL SPEC.md: already updated in FP-03 commit
- README.md:
  - Add 'File Processing' to Features list
  - Add whisper.* and fileProcessor.* config keys
  - Add WHISPER_* and FILE_PROCESSOR_* env vars
  - Add 'File Processing' full section with supported types table,
    how it works, first-use notes, model pull command
  - Update Project layout with new files and services
- _board/_BOARD.md: FP-07 through FP-14 done, FP-15 remaining
- Build clean, all 156 tests pass
Author: larchanka
Date: 2026-02-21 19:11:12 +01:00
Committed by: Mikhail Larchanka
Parent: 740df3053f
Commit: ae92de7bc8
4 changed files with 213 additions and 59 deletions

README.md

@@ -13,6 +13,7 @@ A multi-process AI platform with type-safe IPC and capability-graph execution. U
- **Session-Scoped RAG**: Memory searches are session-scoped by default to prevent context leakage after `/new`, with an optional `global` scope.
- **Telegram adapter**: Commands `/start`, `/task`, `/new`, `/help`; session tracking and conversation archiving; robust message delivery with automatic plain-text fallback.
- **Reminder System**: Schedule one-time or recurring reminders via natural language; cron-based scheduling with Telegram delivery.
- **File Processing**: Upload photos, documents, voice notes, or audio files via Telegram. Images are OCR'd locally (Ollama vision model), audio is transcribed locally (Whisper), and text files are inlined or chunked into RAG — all without any cloud calls.
- **Monitoring Dashboard**: A Notion-style internal web dashboard for real-time tracking of tasks, system stats, and event logs.
## Requirements
@@ -61,6 +62,13 @@ ollama pull mistral
- **modelManager.mediumModelKeepAlive** — Keep-alive for medium model (default: `"30m"`)
- **modelManager.largeModelKeepAlive** — Keep-alive for large model after on-demand use (default: `"60m"`)
- **modelManager.warmupPrompt** — Minimal prompt sent during warmup (default: `"hello"`)
- **whisper.modelName** — Whisper model for transcription (default: `"base.en"`; downloaded on first use)
- **whisper.language** — Transcription language, `"auto"` for auto-detect (default: `"auto"`)
- **fileProcessor.uploadDir** — Temp directory for uploaded files (default: `"data/uploads"`)
- **fileProcessor.maxFileSizeBytes** — Max upload size allowed (default: `52428800` = 50 MB)
- **fileProcessor.textMaxInlineChars** — Files shorter than this are inlined in the goal (default: `8000`)
- **fileProcessor.ocrModel** — Ollama vision model for image OCR (default: `"glm-ocr:q8_0"`)
- **fileProcessor.ocrEnabled** — Enable/disable image OCR (default: `true`)
Environment variables override `config.json`. Supported env vars:
@@ -72,6 +80,8 @@ Environment variables override `config.json`. Supported env vars:
- `MODEL_ROUTER_SMALL`, `MODEL_ROUTER_MEDIUM`, `MODEL_ROUTER_LARGE`
- `BROWSER_SERVICE_HEADLESS`, `BROWSER_SERVICE_TIMEOUT`, `BROWSER_SERVICE_ENABLE_STEALTH`, `BROWSER_SERVICE_REUSE_CONTEXT`, `BROWSER_SERVICE_USER_DATA_DIR`
- `MODEL_MANAGER_SMALL_KEEP_ALIVE`, `MODEL_MANAGER_MEDIUM_KEEP_ALIVE`, `MODEL_MANAGER_LARGE_KEEP_ALIVE`, `MODEL_MANAGER_WARMUP_PROMPT`
- `WHISPER_MODEL_NAME`, `WHISPER_LANGUAGE`, `WHISPER_MODEL_DIR`
- `FILE_PROCESSOR_UPLOAD_DIR`, `FILE_PROCESSOR_MAX_FILE_SIZE_BYTES`, `FILE_PROCESSOR_TEXT_MAX_INLINE_CHARS`, `FILE_PROCESSOR_OCR_MODEL`, `FILE_PROCESSOR_OCR_ENABLED`
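As a rough illustration of the override precedence (env var wins over `config.json`, which wins over the defaults), the new `fileProcessor.*` keys could be resolved like this — a sketch, not the project's actual config loader:

```typescript
// Illustrative loader for the fileProcessor.* keys. Key names and defaults
// match the README above; the function itself is hypothetical.
interface FileProcessorConfig {
  uploadDir: string;
  maxFileSizeBytes: number;
  textMaxInlineChars: number;
  ocrModel: string;
  ocrEnabled: boolean;
}

const FILE_PROCESSOR_DEFAULTS: FileProcessorConfig = {
  uploadDir: "data/uploads",
  maxFileSizeBytes: 52_428_800, // 50 MB
  textMaxInlineChars: 8000,
  ocrModel: "glm-ocr:q8_0",
  ocrEnabled: true,
};

function loadFileProcessorConfig(
  env: Record<string, string | undefined>
): FileProcessorConfig {
  const d = FILE_PROCESSOR_DEFAULTS;
  return {
    uploadDir: env.FILE_PROCESSOR_UPLOAD_DIR ?? d.uploadDir,
    maxFileSizeBytes:
      env.FILE_PROCESSOR_MAX_FILE_SIZE_BYTES !== undefined
        ? Number(env.FILE_PROCESSOR_MAX_FILE_SIZE_BYTES)
        : d.maxFileSizeBytes,
    textMaxInlineChars:
      env.FILE_PROCESSOR_TEXT_MAX_INLINE_CHARS !== undefined
        ? Number(env.FILE_PROCESSOR_TEXT_MAX_INLINE_CHARS)
        : d.textMaxInlineChars,
    ocrModel: env.FILE_PROCESSOR_OCR_MODEL ?? d.ocrModel,
    ocrEnabled:
      env.FILE_PROCESSOR_OCR_ENABLED !== undefined
        ? env.FILE_PROCESSOR_OCR_ENABLED === "true"
        : d.ocrEnabled,
  };
}
```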
`config.json` is gitignored; do not commit secrets.
@@ -250,16 +260,51 @@ You can configure the port using the `DASHBOARD_PORT` environment variable or by
## Project layout
- **src/core/** — Core Orchestrator (process spawning, message routing, task pipeline)
- **src/core/** — Core Orchestrator (process spawning, message routing, task pipeline, file ingest)
- **src/agents/** — Planner, Executor, Critic; **prompts/** for system prompts (planner, critic, summarizer)
- **src/adapters/** — Telegram adapter
- **src/services/** — Task Memory, Logger, Ollama adapter, Model Router, Generator, RAG (SQLite), Tool Host, Cron Manager, Dashboard Service
- **src/shared/** — Protocol (Zod schemas), BaseProcess, graph-utils, config
- **src/adapters/** — Telegram adapter (including file detection and download)
- **src/services/** — Task Memory, Logger, Ollama adapter (with vision), Model Router, Generator, RAG (SQLite), Tool Host, Cron Manager, Dashboard Service, **File Processor**
- **src/utils/** — Console logger, audio-converter (ffmpeg-static), whisper-transcriber (nodejs-whisper)
- **src/shared/** — Protocol (Zod schemas), BaseProcess, graph-utils, config, **file-protocol**
- **_docs/** — Architecture and protocol specs
- **_board/** — Task board and task specs
See **AI-Agent.md** for full folder/file structure and architecture. The agent users interact with is **🧬 ManBot**.
## File Processing
ManBot can process file attachments sent directly in Telegram. No cloud services are required; all processing runs locally.
### Supported Types
| Type | Telegram attachment | Processing |
|---|---|---|
| **Text** | Any document (`.txt`, `.md`, `.json`, `.pdf`, etc.) | Content read directly; short files inlined into goal, long files chunked + summarised + indexed in RAG |
| **Image** | Photo or image document | OCR/description via Ollama vision model (`glm-ocr:q8_0`) |
| **Voice / Audio** | Voice message or audio file | Converted to WAV (ffmpeg-static) → transcribed (OpenAI Whisper, local) |
| **Video** | Video or video note | ⚠️ Not supported yet |
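The routing in the table above hinges on classifying the attachment's MIME type. A minimal sketch of that step — the real helper is `classifyMimeType()` in `src/shared/file-protocol.ts`, whose exact signature and type coverage may differ:

```typescript
// Hypothetical mime-type classifier mirroring the supported-types table.
type FileCategory = "text" | "image" | "audio" | "unknown";

function classifyMimeType(mime: string): FileCategory {
  if (mime.startsWith("image/")) return "image";
  if (mime.startsWith("audio/")) return "audio";
  if (
    mime.startsWith("text/") ||
    mime === "application/json" ||
    mime === "application/pdf"
  ) {
    return "text";
  }
  return "unknown"; // video and everything else is not supported yet
}
```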
### How it works
1. Send any supported file to the bot, optionally with a caption as your instruction
2. The bot downloads the file locally to `data/uploads/`
3. Processing runs in the dedicated `file-processor` subprocess:
- Images → `OllamaAdapter.chatWithImage()` with the configured OCR model
- Audio → `convertToWav()` (ffmpeg-static) → `transcribeAudio()` (Whisper `base.en` by default)
- Text → `readFile()`, check length against `textMaxInlineChars`
4. Extracted content is injected into the planner goal as structured context
5. Long text files are chunked, each chunk summarised, and summaries stored in RAG for semantic retrieval
6. The original file is deleted from disk after processing
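Step 4 can be sketched as a goal builder that wraps each extraction in a fenced block and enforces the 32k character cap mentioned in FP-10. The fence format and function are illustrative, not the project's actual code:

```typescript
// Hypothetical enriched-goal builder: caption plus one tagged block per
// processed file, capped so oversized extractions cannot blow up the
// planner prompt.
interface ProcessedFile {
  name: string;
  kind: "text" | "image_ocr" | "audio_transcript";
  content: string;
}

const MAX_GOAL_CHARS = 32_000;

function buildEnrichedGoal(caption: string, files: ProcessedFile[]): string {
  const blocks = files.map(
    (f) => `<file name="${f.name}" kind="${f.kind}">\n${f.content}\n</file>`
  );
  const goal = [caption, ...blocks].join("\n\n");
  return goal.length > MAX_GOAL_CHARS ? goal.slice(0, MAX_GOAL_CHARS) : goal;
}
```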
### First-use note for audio
The Whisper model (~75 MB for `base.en`) downloads automatically on the first voice/audio transcription. If that first request fails while the download is still in progress, retry once the download completes.
### Requirements for image OCR
Pull the vision model from Ollama before use:
```bash
ollama pull glm-ocr:q8_0
```
## Troubleshooting
### Browser Service Issues

_board/_BOARD.md

@@ -2,69 +2,145 @@
## To Do
### FP-15 End-to-End Verification
- tags: [todo, qa, e2e]
- defaultExpanded: false
```md
Manual verification of all file types, edge cases, and failure scenarios via Telegram.
Source: FP-15_E2E_VERIFICATION.md
```
## In Progress
### FP-14 Update Documentation
- tags: [in-progress, docs]
- defaultExpanded: true
```md
Update COMPONENTS.md, TECH.md, MESSAGE PROTOCOL SPEC.md, and README.md.
Source: FP-14_UPDATE_DOCS.md
```
## Done
### FP-13 Upload Directory Init and Cleanup
- tags: [done, orchestrator, infra]
- defaultExpanded: false
```md
initUploadDirectory() creates upload dir and purges orphaned files (>1h) on startup.
Source: FP-13_UPLOAD_DIR_CLEANUP.md
```
### FP-12 Update Planner Prompt for File Context
- tags: [done, planner, prompt]
- defaultExpanded: false
```md
Added <file_context_awareness> block to PLANNER_SYSTEM_PROMPT.
Documents text/image/audio/indexed file fences and guidance.
Source: FP-12_PLANNER_PROMPT_FILE_CONTEXT.md
```
### FP-11 Long Text Chunking and RAG Indexing
- tags: [done, orchestrator, rag]
- defaultExpanded: false
```md
indexLongText(): 2k-char chunks, 3-at-a-time summarisation, RAG insert with metadata.
Source: FP-11_LONG_TEXT_RAG_INDEXING.md
```
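The FP-11 scheme above — 2k-char chunks, summarised three at a time — can be sketched like this. The summariser is stubbed; the real code calls the model router and inserts results into RAG:

```typescript
// Illustrative chunk-and-summarise pipeline matching FP-11's parameters.
function chunkText(text: string, chunkSize = 2000): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += chunkSize) {
    chunks.push(text.slice(i, i + chunkSize));
  }
  return chunks;
}

async function summariseInBatches(
  chunks: string[],
  summarise: (chunk: string) => Promise<string>,
  concurrency = 3
): Promise<string[]> {
  const summaries: string[] = [];
  for (let i = 0; i < chunks.length; i += concurrency) {
    // Up to `concurrency` summarisation calls run in parallel per batch.
    const batch = chunks.slice(i, i + concurrency);
    summaries.push(...(await Promise.all(batch.map(summarise))));
  }
  return summaries;
}
```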
### FP-10 Orchestrator — file.ingest Handler
- tags: [done, orchestrator, core]
- defaultExpanded: false
```md
handleFileIngest(): parallel processing, enrichedGoal builder, 32k char cap, user warnings.
Source: FP-10_ORCHESTRATOR_FILE_INGEST.md
```
### FP-09 Telegram Adapter — File Detection and Download
- tags: [done, telegram, adapter]
- defaultExpanded: false
```md
Detects photo/document/voice/audio, size-guards, downloads, classifies, emits file.ingest.
Source: FP-09_TELEGRAM_FILE_DOWNLOAD.md
```
### FP-08 Register File Processor in Orchestrator
- tags: [done, orchestrator, infra]
- defaultExpanded: false
```md
Added 'file-processor' to PROCESS_SCRIPTS; spawned at startup alongside other services.
Source: FP-08_REGISTER_FILE_PROCESSOR.md
```
### FP-07 Build the File Processor Service
- tags: [done, service, core]
- defaultExpanded: false
```md
file-processor.ts BaseProcess: routes text/image/audio/unknown, deletes files, emits audit events.
Source: FP-07_FILE_PROCESSOR_SERVICE.md
```
### FP-06 Implement Whisper Transcription Utility
- tags: [done, util, audio]
- defaultExpanded: false
```md
Created src/utils/whisper-transcriber.ts. transcribeAudio() with 5-min timeout,
auto-download, first-run UX. Build clean, 156 tests pass.
Source: FP-06_WHISPER_TRANSCRIBER.md
```
### FP-05 Implement Audio Conversion Utility
- tags: [done, util, audio]
- defaultExpanded: false
```md
Created src/utils/audio-converter.ts. convertToWav() with ffmpeg-static,
60s timeout, stderr capture. Build clean, 156 tests pass.
Source: FP-05_AUDIO_CONVERTER.md
```
### FP-04 Extend OllamaAdapter with Vision Support
- tags: [done, service, ollama]
- defaultExpanded: false
```md
Added chatWithImage() with base64 image injection into Ollama multimodal messages.
Reuses fetchWithRetry. Build clean, 156 tests pass.
Source: FP-04_OLLAMA_VISION.md
```
### FP-03 Define File Processing Protocol Types
- tags: [done, infra, protocol]
- defaultExpanded: false
```md
Created src/shared/file-protocol.ts with all shared types and classifyMimeType() helper.
Updated MESSAGE PROTOCOL SPEC.md. Build and tests pass.
Source: FP-03_PROTOCOL_TYPES.md
```
### FP-02 Add Config Types and Defaults
- tags: [done, infra, config]
- defaultExpanded: false
```md
Added WhisperConfig and FileProcessorConfig interfaces, defaults, env var overrides.
Updated config.json.example. All 156 tests pass.
Source: FP-02_CONFIG_TYPES.md
```
### FP-01 Add npm Dependencies
- tags: [done, infra, deps]
- defaultExpanded: false
```md
Installed nodejs-whisper ^0.2.9, ffmpeg-static ^5.3.0, @types/ffmpeg-static ^5.1.0.
Both confirmed ESM-compatible. Build passes.
Source: FP-01_ADD_DEPENDENCIES.md
```
### DB-07 Orchestrator Integration & Notion UI
- tags: [done, ui, orchestrator]
- defaultExpanded: false
```md
Converted the dashboard to a TypeScript service, integrated it into the Orchestrator, added IPC logging, and implemented a Notion-like UI with light/dark theme support.
Source: src/services/dashboard-service.ts
```
### DB-06 Verification and Polish
- tags: [done, qa]
- defaultExpanded: false
```md
Final polish, bug fixes, and manual verification of all features.
Source: DB-06_VERIFICATION.md
```
### DB-05 Final Assembly and Integration
- tags: [done, ui]
- defaultExpanded: false
```md
Combine the data layer, visualization engine, and UI theme into the final request handler.
Source: DB-05_ASSEMBLY.md
```
### DB-04 UI Design and Theming
- tags: [done, ui]
- defaultExpanded: false
```md
Implement the CSS design system and base HTML template for the dashboard.
Source: DB-04_UI_THEMING.md
```
### DB-03 SVG Visualization Engine
- tags: [done, ui]
- defaultExpanded: false
```md
Implement helper functions to generate SVG chart strings from data arrays.
Source: DB-03_SVG_ENGINE.md
```
### DB-02 SQLite and Log Data Extraction
- tags: [done, data]
- defaultExpanded: false
```md
Implement the logic to extract data from the SQLite databases and the NDJSON log file.
Source: DB-02_DATA_LAYER.md
```
### DB-01 Initial Setup and Basic Server
- tags: [done, infra]
- defaultExpanded: false
```md
Initialize the /stats directory and create the basic Node.js HTTP server.
Source: DB-01_INITIAL_SETUP.md
```

COMPONENTS.md

@@ -146,6 +146,19 @@ Stores:
---
### File Processor
- `src/services/file-processor.ts` — independent `BaseProcess` subprocess
- Receives `file.process` envelopes from Core Orchestrator
- Routes by file category:
- **text** → reads file content; inlines if short, returns `text_long` if long (orchestrator handles RAG)
- **image** → OCR/description via `OllamaAdapter.chatWithImage()` with configured vision model (`glm-ocr:q8_0`)
- **audio** → `convertToWav()` (ffmpeg-static) → `transcribeAudio()` (Whisper local inference)
- **unknown** → returns `ignored` with reason
- Deletes every uploaded file from disk after processing (succeed or fail)
- Emits `event.file.processed` audit event for logging
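The per-file outcomes above suggest a result shape along these lines. This is a hypothetical sketch; the real types live in `src/shared/file-protocol.ts` and may differ:

```typescript
// Assumed discriminated union for a File Processor response.
type FileProcessResult =
  | { status: "inlined"; content: string }   // short text, image OCR, transcript
  | { status: "text_long"; content: string } // orchestrator indexes via RAG
  | { status: "ignored"; reason: string }    // unknown category
  | { status: "error"; message: string };    // conversion/OCR/transcription failure

function describeResult(r: FileProcessResult): string {
  switch (r.status) {
    case "inlined":
      return `inlined ${r.content.length} chars`;
    case "text_long":
      return "deferred to RAG indexing";
    case "ignored":
      return `ignored: ${r.reason}`;
    case "error":
      return `failed: ${r.message}`;
  }
}
```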
---
## Integration Flow
1. Telegram → Core
@@ -165,3 +178,15 @@ Stores:
2. Core: get tasks by `conversationId`, format history, call model-router `summarize`, insert summary into RAG
3. Core → Telegram Adapter: "Archived. Conversation summary has been stored..."
### File Upload Flow
1. User sends file(s) to Telegram (photo, document, voice, audio)
2. Telegram Adapter: detect attachment type, guard against max size, download to `data/uploads/<conversationId>/`
3. Telegram Adapter → Core: `file.ingest` envelope (FileIngestPayload)
4. Core Orchestrator: notify user "Processing N file(s)..."
5. Core → File Processor: `file.process` per file (parallel, Promise.allSettled)
6. File Processor: routes by category, calls Ollama/Whisper/readFile, deletes original, responds
7. Core: collects results, builds `enrichedGoal` (inline context + transcript + caption)
- Long text files (> textMaxInlineChars) → indexLongText() → model-router chunk summaries → rag-service
8. Core → Planner → Executor: runs normal task pipeline with `enrichedGoal`
9. Response sent to Telegram
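Steps 5–7 can be condensed into a fan-out/fold sketch: one `file.process` request per file, dispatched in parallel with `Promise.allSettled`, with failures collected for user warnings. `processOne` stands in for the real IPC round-trip:

```typescript
// Illustrative parallel ingest; the real orchestrator sends file.process
// envelopes to the file-processor subprocess instead of calling a function.
async function ingestFiles(
  paths: string[],
  processOne: (path: string) => Promise<string>
): Promise<{ contexts: string[]; failures: string[] }> {
  const settled = await Promise.allSettled(paths.map(processOne));
  const contexts: string[] = [];
  const failures: string[] = [];
  settled.forEach((res, i) => {
    if (res.status === "fulfilled") contexts.push(res.value);
    else failures.push(paths[i]); // surfaced to the user as a warning
  });
  return { contexts, failures };
}
```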

TECH.md

@@ -72,6 +72,14 @@
---
## File Processing
- **nodejs-whisper** (`^0.2.9`) — local Whisper speech-to-text inference; model auto-downloaded on first use
- **ffmpeg-static** (`^5.3.0`) — bundled ffmpeg binary for audio format conversion (any → 16 kHz mono WAV)
- **OllamaAdapter.chatWithImage()** — multimodal image OCR/description via configured vision model
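For the "any → 16 kHz mono WAV" conversion, `ffmpeg-static` only supplies the binary path; the flags below are standard ffmpeg options assembled as an illustration, not code copied from the project:

```typescript
// Hypothetical argument builder for the audio-converter's ffmpeg invocation.
function wavConversionArgs(input: string, output: string): string[] {
  return [
    "-i", input,         // source file (ogg voice note, mp3, m4a, ...)
    "-ar", "16000",      // resample to 16 kHz, the rate Whisper expects
    "-ac", "1",          // downmix to mono
    "-c:a", "pcm_s16le", // 16-bit signed PCM, i.e. plain WAV audio
    "-y", output,        // overwrite the target path if it exists
  ];
}
```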
---
## Dev Tools
- tsup or esbuild