mirror of
https://github.com/larchanka/manbot.git
| marp | theme | class |
|---|---|---|
| true | default | |
I built my own AI agent. Why?
A journey from reading about AI to building a custom agent framework
layout: image-left image: ./_images/Gemini_Generated_Image_x6y0h0x6y0h0x6y0.png
Mikhail Larchanka
- Principal Software Engineer at Sytac
- https://larchanka.com
- https://youtube.com/@larchanka
- https://github.com/larchanka
- https://x.com/mlarchanka
🌍 The AI Catch-Up
- The AI landscape is evolving at breakneck speed every single day.
- New models, new frameworks (LangChain, AutoGen), new methodologies.
- It feels like the revolution is passing by.
- The Challenge: I do not work with AI in my daily job. Staying actively involved requires intentional effort beyond standard day-to-day tasks.
layout: image-right image: ./_images/Gemini_Generated_Image_vu2lluvu2lluvu2l.png
📚 The Trap of "Reading vs. Doing"
- I read a lot of papers, articles, and documentation.
- The Reality Check: Reading builds awareness, but not genuine knowledge or intuition.
- Without hands-on practice, you don't discover the edge cases, the latency issues, or the prompt fragility.
- I spent time checking what others were building in the space, and understanding their pain points.
layout: image-right image: ./_images/dontlike.png backgroundSize: contain
💡 The Catalyst: My Own Ideas
- While observing existing solutions, I realized I had different ideas on how agents should operate.
- Existing frameworks often felt either too bloated, too confusing, or too rigid.
- I wanted to build something tailored to my intuition of how a system should reason and interact with an environment.
layout: image-right image: ./_images/ai-prices.png backgroundSize: contain
💸 Cost-Driven Architecture
- The Goal: Make learning and relentless experimentation "cheap."
- Relying on cloud APIs (GPT-4, Claude) for agentic loops—which run autonomously making dozens of calls and mistakes—gets expensive quickly.
- The Solution: Local LLMs.
- Complete freedom to experiment, fail, retry, and loop infinitely without worrying about API bills.
layout: image-left image: ./_images/llm-inference.webp backgroundSize: contain
🤖 Evaluating Local LLMs
- Not all models are created equal for agentic tasks.
- Benchmarking for my agent:
- Need strong coding and reasoning capabilities.
- Need reliable JSON/tool-calling formatting.
- Need fast inference speed (tokens/sec) for autonomous, multi-step loops.
- Explored running models locally using tools like Ollama or Lemonade.
- Tested how models handle context degradation on local machine hardware.
🔀 Dynamic Model Routing
- Running a massive model (like Mixtral) for every tiny task is slow and overkill.
- I built a Model Router that dynamically selects models based on task complexity.
- The Flow:
- Planner: Evaluates the user's intent and forces the plan into a `complexity` bucket ("small", "medium", or "large").
- Injection: The Executor injects `_complexity` into the payload for every node.
- Router Resolution: The `GeneratorService` checks the complexity and maps it purely via `config.json`.
- Small -> Llama3:8B (Fast system loops) | Medium -> Qwen2.5 (Standard) | Large -> Mixtral (Deep research).
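The routing flow above can be sketched in a few lines. This is a minimal illustration, not the framework's actual code: the `resolve_model` function, the `models` and `default_complexity` keys, and the inline stand-in for `config.json` are all assumptions; only the `_complexity` field and the three model tiers come from the slide.

```python
import json

# Hypothetical stand-in for the routing section of config.json.
# The model tiers follow the slide; the key names are assumptions.
CONFIG_JSON = """
{
  "models": {
    "small": "llama3:8b",
    "medium": "qwen2.5",
    "large": "mixtral"
  },
  "default_complexity": "medium"
}
"""


def resolve_model(payload: dict, config: dict) -> str:
    """Pick a model name from the `_complexity` field injected by the executor."""
    models = config["models"]
    complexity = payload.get("_complexity", config["default_complexity"])
    # Unknown buckets fall back to the default instead of raising.
    if complexity not in models:
        complexity = config["default_complexity"]
    return models[complexity]


config = json.loads(CONFIG_JSON)
print(resolve_model({"_complexity": "small", "prompt": "rename a file"}, config))
print(resolve_model({"prompt": "no bucket set"}, config))
```

Keeping the mapping in configuration means a model can be swapped without touching planner or executor code.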
🛡️ Context & Token Safety
- I didn't want to calculate exact tokens with heavy libraries (like `tiktoken`) on every single loop.
- My Strategy (Heuristics & Compression):
- Safety Truncation: If a tool (like a massive `http_get` web scrape) returns over 30,000 characters, the `ExecutorAgent` aggressively truncates it.
- Prompt Summarization: Instead of keeping an infinitely growing chat history, the Planner produces a `summarize` node using a specialized `SUMMARIZER_SYSTEM_PROMPT` to compress old context dynamically.
- Tracking the standard `usage` vectors from OpenAI-compatible tools (`prompt_tokens`, `total_tokens`) for observability rather than strict hard-blocking.
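A rough sketch of the character-based heuristic, assuming a simple head-only cut. The 30,000-character limit comes from the slide; the function names, the truncation marker, and the four-characters-per-token estimate are assumptions made for illustration.

```python
MAX_TOOL_OUTPUT_CHARS = 30_000  # threshold named on the slide
CHARS_PER_TOKEN = 4             # rough rule of thumb, an assumption


def truncate_tool_output(text: str, limit: int = MAX_TOOL_OUTPUT_CHARS) -> str:
    """Keep the first `limit` characters and note how much was dropped."""
    if len(text) <= limit:
        return text
    dropped = len(text) - limit
    return text[:limit] + f"\n[truncated {dropped} characters]"


def estimate_tokens(text: str) -> int:
    """Cheap token estimate that avoids running a tokenizer on every loop."""
    return len(text) // CHARS_PER_TOKEN + 1


scraped = "x" * 45_000  # pretend this came back from a web scrape
safe = truncate_tool_output(scraped)
print(len(safe), estimate_tokens(safe))
```

The marker at the end tells the model that content is missing, so it can ask for a narrower query instead of assuming it saw the whole page.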
🏗️ The Ultimate Testing Playground
- The framework wasn't just a final product; it was a testbed.
- Objectives:
- Test how to actually code with agents in different structural scenarios.
- Experiment with system architecture and modularity design.
- Learn how to construct proper, dynamic task-planning prompts.
- Create functional applications autonomously based on structured tasks.
flowchart LR
Planner[Planner] --> Executor[Executor]
Executor --> Tools[Tools]
Tools --> Executor
Executor --> Planner
⚙️ The Core Loop: Solving Communication
- The Problem: LLMs naturally output raw text. Agents need structured, executable actions.
- The Implementation:
- Forcing the local LLM to output valid JSON representations of tool calls.
- Handling parsing errors seamlessly through self-correction loops.
- Designing a robust schema that the LLM understands and adheres to.
- Distinguishing between "Thinking" (reasoning) and "Acting" (tool execution).
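One way to picture the self-correction loop, assuming the model is any callable that takes a prompt and returns text. The `tool`/`args` shape, the function names, and the retry wording are hypothetical and are not taken from the framework.

```python
import json
from typing import Callable


def parse_tool_call(raw: str) -> dict:
    """Parse model output and check the minimal shape of a tool call."""
    data = json.loads(raw)
    if not isinstance(data, dict) or "tool" not in data or "args" not in data:
        raise ValueError('expected an object with "tool" and "args" keys')
    return data


def call_with_self_correction(
    llm: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Ask the model for a tool call, feeding parse errors back until it complies."""
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = llm(current_prompt)
        try:
            return parse_tool_call(raw)
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            current_prompt = (
                f"{prompt}\n\nYour previous reply was rejected: {err}\n"
                f"Previous reply: {raw}\nReply with valid JSON only."
            )
    raise RuntimeError(f"no valid tool call after {max_attempts} attempts")


# Fake model: the first answer is prose, the second is valid JSON.
replies = iter([
    "Sure! I will read the file.",
    '{"tool": "read_file", "args": {"path": "notes.txt"}}',
])
print(call_with_self_correction(lambda _prompt: next(replies), "Read notes.txt"))
```

Bounding the attempts matters on local models, where an unbounded retry loop can burn minutes on a prompt the model will never satisfy.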
⚙️ The Core Loop: Solving Communication
Request
{
"id": "uuid",
"from": "core",
"to": "planner",
"type": "plan.create",
"version": "1.0",
"timestamp": 1704067200000,
"payload": {}
}
⚙️ The Core Loop: Solving Communication
Response
{
"id": "same-as-request",
"from": "planner",
"to": "core",
"type": "response",
"version": "1.0",
"timestamp": 1704067200000,
"payload": {
"status": "success",
"result": {}
}
}
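The request and response envelopes share the same fields, so they can be modeled with one type. The sketch below is an assumption about how that might look in Python; the `Envelope` class and `make_response` helper are hypothetical, while the field names and values mirror the JSON above.

```python
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Envelope:
    sender: str  # serialized as "from", which is a reserved word in Python
    to: str
    type: str
    payload: dict = field(default_factory=dict)
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    version: str = "1.0"
    timestamp: int = field(default_factory=lambda: int(time.time() * 1000))

    def to_dict(self) -> dict:
        """Return the wire format, renaming `sender` to `from`."""
        data = asdict(self)
        data["from"] = data.pop("sender")
        return data


def make_response(request: Envelope, result: dict) -> Envelope:
    """Build a success response that reuses the request id and swaps the route."""
    return Envelope(
        sender=request.to,
        to=request.sender,
        type="response",
        payload={"status": "success", "result": result},
        id=request.id,
    )


request = Envelope(sender="core", to="planner", type="plan.create")
print(make_response(request, {"nodes": []}).to_dict())
```

Reusing the request `id` in the response is what lets the caller match replies to pending requests.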
layout: image image: ./_images/SCR-20260305-jmkx.png
🛠️ Equipping the Agent: Tools & Skills
- Agents are useless without hands.
- I built a modular tool host system.
- Standardized interfaces for tools: `name`, `description`, `parameters`, `execute()`.
- Grouping tools into highly specialized "Skills" (e.g., File System, Terminal, Browser).
- Optimization: Injecting only relevant tool schemas into the prompt to preserve context.
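A possible shape for the tool interface and the schema-filtering optimization. The four attributes come from the slide; the `Tool` and `ToolHost` classes, the `skill` field, and the example tools are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    name: str
    description: str
    parameters: dict  # JSON-Schema-like description of the arguments
    execute: Callable[..., Any]
    skill: str  # hypothetical grouping key, e.g. "filesystem" or "terminal"

    def schema(self) -> dict:
        """The part of the tool that is shown to the model."""
        return {
            "name": self.name,
            "description": self.description,
            "parameters": self.parameters,
        }


class ToolHost:
    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def schemas_for(self, skills: set[str]) -> list[dict]:
        """Return only the schemas of the requested skills, to save prompt space."""
        return [t.schema() for t in self._tools.values() if t.skill in skills]

    def call(self, name: str, **kwargs: Any) -> Any:
        return self._tools[name].execute(**kwargs)


host = ToolHost()
host.register(Tool("upper", "Uppercase a string", {"text": "string"},
                   lambda text: text.upper(), skill="text"))
host.register(Tool("add", "Add two numbers", {"a": "number", "b": "number"},
                   lambda a, b: a + b, skill="math"))
print(host.schemas_for({"math"}))
print(host.call("upper", text="hello"))
```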
layout: image image: ./_images/SCR-20260305-jmme.png
🔌 Standardizing with MCP
- Building MCP (Model Context Protocol) Integration:
- Why reinvent the wheel for every custom tool or data source?
- Implementing MCP allowed my agent to connect to external, standardized tools seamlessly.
- Learned how to expose local environment capabilities (files, API connections) to an agent through standardized, secure boundaries.
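For orientation, MCP messages are JSON-RPC 2.0, and tool discovery and invocation use the `tools/list` and `tools/call` methods. The sketch below only builds the request bodies; the `mcp_request` helper and the `read_file` tool name are hypothetical, and transport and session setup are left out.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # simple incrementing request ids


def mcp_request(method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request as used by MCP clients."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


list_tools = mcp_request("tools/list")
call_tool = mcp_request(
    "tools/call",
    {"name": "read_file", "arguments": {"path": "README.md"}},  # hypothetical tool
)
print(list_tools)
print(call_tool)
```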
🏛️ The Layered Memory Architecture
To prevent context contamination and keep prompt sizes manageable, I separated memory into distinct tiers:
flowchart LR
classDef st fill:#fef3c7,stroke:#b45309,stroke-width:2px,color:#000
classDef mt fill:#dbeafe,stroke:#1d4ed8,stroke-width:2px,color:#000
classDef lt fill:#dcfce7,stroke:#15803d,stroke-width:2px,color:#000
subgraph ST ["Short-Term (In-Context)"]
direction TB
Conv["💬 Conversation"]:::st
Session["📝 Scratchpad"]:::st
end
subgraph MT ["Mid-Term (SQLite Task Store)"]
direction TB
Task["⚙️ DAG State"]:::mt
Reflect["Reflections"]:::mt
end
subgraph LT ["Long-Term (Persistent)"]
direction TB
RAG["🔮 Semantic RAG"]:::lt
Struct["💾 Structured Data"]:::lt
end
ST -->|Initiates tasks| MT
MT -->|Queries knowledge| LT
LT -.->|Injects context| ST
🕒 When is each memory used?
- Short-Term (Conversation & Session):
- When: Active chatting, holding the immediate goal, fast active reasoning.
- Lifecycle: Evicted rapidly to save prompt context.
- Mid-Term (Task Memory & State - SQLite):
- When: Tracking multi-step execution graphs (DAGs), pausing/resuming tasks, storing critic reflections & retry counts.
- Lifecycle: Persists across agent loops; prevents the agent from getting stuck in circles.
- Long-Term (Vector DB & File System):
- When: Finding unseen documents or entire codebase structures based on semantic meaning.
- Lifecycle: Permanent; grows over time.
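The mid-term tier can be illustrated with the standard-library `sqlite3` module. The table layout, column names, and retry limit below are assumptions for the sake of the example and do not describe the framework's real schema.

```python
import sqlite3

# Hypothetical schema: one row per node of a task's execution graph.
SCHEMA = """
CREATE TABLE IF NOT EXISTS task_nodes (
    task_id    TEXT NOT NULL,
    node_id    TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'pending',
    retries    INTEGER NOT NULL DEFAULT 0,
    reflection TEXT,
    PRIMARY KEY (task_id, node_id)
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_node(conn: sqlite3.Connection, task_id: str, node_id: str) -> None:
    conn.execute(
        "INSERT INTO task_nodes (task_id, node_id) VALUES (?, ?)",
        (task_id, node_id),
    )


def record_failure(conn: sqlite3.Connection, task_id: str, node_id: str,
                   reflection: str, max_retries: int = 3) -> bool:
    """Store the critic's reflection, bump the retry count, report whether to retry."""
    conn.execute(
        "UPDATE task_nodes SET retries = retries + 1, reflection = ?, "
        "status = 'failed' WHERE task_id = ? AND node_id = ?",
        (reflection, task_id, node_id),
    )
    (retries,) = conn.execute(
        "SELECT retries FROM task_nodes WHERE task_id = ? AND node_id = ?",
        (task_id, node_id),
    ).fetchone()
    return retries < max_retries


store = open_store()
add_node(store, "task-1", "fetch-page")
print(record_failure(store, "task-1", "fetch-page", "URL returned 404"))
```

Persisting the retry count outside the prompt is what allows the loop to stop after a fixed number of failures, even when the conversation history has been evicted.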
🧠 Long-Term Memory: RAG from Scratch
- Local LLMs have finite (and hardware-bound) context windows.
- You can't simply fit an entire large codebase into a local model's 8k context window.
- Building RAG (Retrieval-Augmented Generation):
- Used `sqlite-vss` for K-Nearest Neighbors (KNN) vector search natively inside SQLite.
- Implementing fallback to dot-product calculations if the VSS extension is unavailable.
- Generating and storing embeddings locally to retrieve only the relevant functions immediately needed.
graph LR
Ctx["📄 Context"] --> Chunk["✂️ Chunking"]
Chunk --> Embed["🔢 Embedding"]
Embed --> DB["🗄️ Vector DB"]
DB --> Search["🔍 Search"]
Search --> Retrieve["🧠 Retrieval"]
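The pipeline above, reduced to the dot-product fallback path, might look like the sketch below. The `embed` function is a toy bag-of-words stand-in, not a real embedding model, and every name here is an assumption; a real setup would call a local embedding model and store the vectors in the database.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40) -> list[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> dict[str, float]:
    """Toy embedding: an L2-normalized bag of words (stand-in for a real model)."""
    counts = Counter(re.findall(r"[a-z0-9_]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def dot(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two sparse vectors."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def top_k(query: str, index: list[tuple[str, dict]], k: int = 2) -> list[str]:
    """Brute-force nearest neighbors: score every chunk, keep the best k."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: dot(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


docs = [
    "def parse_config(path): load the JSON config file from disk",
    "def send_message(envelope): push an envelope onto the message bus",
    "def truncate(text, limit): cut tool output that is too long",
]
index = [(d, embed(d)) for d in docs]
print(top_k("how is the config file loaded", index, k=1))
```

Brute-force scoring is linear in the number of chunks, which is usually acceptable for a single codebase and explains why it works as a fallback when the vector extension is missing.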
⚖️ Mastering Context Window Management
- The hardest technical challenge: Managing prompt size dynamically.
- Combining RAG retrieval with the agent's conversational history.
- Implementing mechanics to handle max-tokens:
- Sliding windows for conversation history.
- Context summarization.
- Deciding precisely what to evict from memory without making the agent "forget" its core objective.
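One simple eviction policy, sketched under stated assumptions: the objective is always pinned, retrieved snippets come next, and the newest conversation turns fill whatever budget remains. The function names, the priority order, and the four-characters-per-token estimate are illustrative choices, not the framework's actual rules.

```python
def estimate_tokens(text: str) -> int:
    return len(text) // 4 + 1  # rough characters-per-token heuristic, an assumption


def build_context(objective: str, history: list[str],
                  retrieved: list[str], budget: int) -> list[str]:
    """Pin the objective, add retrieved snippets, then the newest turns that fit."""
    context = [objective]
    used = estimate_tokens(objective)

    for snippet in retrieved:
        cost = estimate_tokens(snippet)
        if used + cost > budget:
            break
        context.append(snippet)
        used += cost

    kept: list[str] = []
    for turn in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break  # everything older than this turn is evicted
        kept.append(turn)
        used += cost

    return context + list(reversed(kept))  # restore chronological order


history = ["old " * 50, "recent turn", "latest turn"]
print(build_context("Fix the failing test", history, ["def test(): ..."], budget=20))
```

In a fuller version the evicted turns would be summarized rather than dropped, which is where context summarization fits in.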
📊 Dashboard
🚀 The Result: Bridging Theory and Practice
- I built an entire development framework myself from the ground up.
- Moved from passive reading about AI architectures to actively solving their core engineering constraints.
- Built a system based entirely on my own ideas, uniquely tailored to my development flow.
- Resulted in a fully functional, cost-free, local agentic framework.
layout: image-right image: ./_images/Gemini_Generated_Image_x6y0h0x6y0h0x6y0.png
Thank You!
Questions & Discussion


