diff --git a/.agents/skills/openclaw-test-heap-leaks/SKILL.md b/.agents/skills/openclaw-test-heap-leaks/SKILL.md
index a2ab2878430..3a0e9c1a648 100644
--- a/.agents/skills/openclaw-test-heap-leaks/SKILL.md
+++ b/.agents/skills/openclaw-test-heap-leaks/SKILL.md
@@ -1,11 +1,11 @@
 ---
 name: openclaw-test-heap-leaks
-description: Investigate `pnpm test` memory growth, Vitest worker OOMs, and suspicious RSS increases in OpenClaw using the `scripts/test-parallel.mjs` heap snapshot tooling. Use when Codex needs to reproduce test-lane memory growth, collect repeated `.heapsnapshot` files, compare snapshots from the same worker PID, distinguish transformed-module retention from real data leaks, and fix or reduce the impact by patching cleanup logic or isolating hotspot tests.
+description: Investigate `pnpm test` memory growth, Vitest worker OOMs, and suspicious RSS increases in OpenClaw using the `scripts/test-parallel.mjs` heap snapshot tooling. Use when Codex needs to reproduce test-lane memory growth, collect repeated `.heapsnapshot` files, compare snapshots from the same worker PID, triage likely transformed-module retention versus likely runtime leaks, and fix or reduce the impact by patching cleanup logic or isolating hotspot tests.
 ---
 
 # OpenClaw Test Heap Leaks
 
-Use this skill for test-memory investigations. Do not guess from RSS alone when heap snapshots are available.
+Use this skill for test-memory investigations. Do not guess from RSS alone when heap snapshots are available. Treat snapshot-name deltas as triage evidence, not proof, until retainers or dominators support the call.
 
 ## Workflow
 
@@ -14,19 +14,23 @@ Use this skill for test-memory investigations. Do not guess from RSS alone when
    - `pnpm canvas:a2ui:bundle && OPENCLAW_TEST_MEMORY_TRACE=1 OPENCLAW_TEST_HEAPSNAPSHOT_INTERVAL_MS=60000 OPENCLAW_TEST_HEAPSNAPSHOT_DIR=.tmp/heapsnap OPENCLAW_TEST_WORKERS=2 OPENCLAW_TEST_MAX_OLD_SPACE_SIZE_MB=6144 pnpm test`
    - Keep `OPENCLAW_TEST_MEMORY_TRACE=1` enabled so the wrapper prints per-file RSS summaries alongside the snapshots.
    - If the report is about a specific shard or worker budget, preserve that shape.
+   - Before you analyze snapshots, identify the real lane names from `[test-parallel] start ...` lines or `pnpm test --plan`. Do not assume a single `unit-fast` lane; local plans often split into `unit-fast-batch-*`.
 
 2. Wait for repeated snapshots before concluding anything.
    - Take at least two intervals from the same lane.
-   - Compare snapshots from the same PID inside one lane directory such as `.tmp/heapsnap/unit-fast/`.
-   - Use `scripts/heapsnapshot-delta.mjs` to compare either two files directly or the earliest/latest pair per PID in one lane directory.
+   - Compare snapshots from the same PID inside the real lane directory such as `.tmp/heapsnap/unit-fast-batch-2/`.
+   - Use `.agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs` to compare either two files directly or the earliest/latest pair per PID in one lane directory.
+   - If the helper suggests transformed-module retention, confirm the top entries in DevTools retainers/dominators before calling it solved.
 
 3. Classify the growth before choosing a fix.
-   - If growth is dominated by Vite/Vitest transformed source strings, `Module`, `system / Context`, bytecode, descriptor arrays, or property maps, treat it as retained module graph growth in long-lived workers.
+   - If growth is dominated by Vite/Vitest transformed source strings, `Module`, `system / Context`, bytecode, descriptor arrays, or property maps, treat it as likely retained module graph growth in long-lived workers.
    - If growth is dominated by app objects, caches, buffers, server handles, timers, mock state, sqlite state, or similar runtime objects, treat it as a likely cleanup or lifecycle leak.
+   - If the names are ambiguous, stop short of a confident label and inspect retainers/dominators in DevTools for the top deltas.
 
 4. Fix the right layer.
-   - For retained transformed-module growth in shared workers:
-     - Move hotspot files out of `unit-fast` by updating `test/fixtures/test-parallel.behavior.json`.
+   - For likely retained transformed-module growth in shared workers:
+     - Prefer timing and hotspot-driven scheduling fixes first. Check whether the file is already represented in `test/fixtures/test-timings.unit.json` and whether `scripts/test-update-memory-hotspots.mjs` should refresh the measured hotspot manifest before hand-editing behavior overrides.
+     - Move hotspot files out of the real shared lane by updating `test/fixtures/test-parallel.behavior.json` only when timing-driven peeling is insufficient.
     - Prefer `singletonIsolated` for files that are safe alone but inflate shared worker heaps.
     - If the file should already have been peeled out by timings but is absent from `test/fixtures/test-timings.unit.json`, call that out explicitly. Missing timings are a scheduling blind spot.
    - For real leaks:
@@ -40,24 +40,24 @@ Use this skill for test-memory investigations. Do not guess from RSS alone when
 
 ## Heuristics
 
-- Do not call everything a leak. In this repo, large `unit-fast` growth can be a worker-lifetime problem rather than an application object leak.
+- Do not call everything a leak. In this repo, large `unit-fast` or `unit-fast-batch-*` growth can be a worker-lifetime problem rather than an application object leak.
 - `scripts/test-parallel.mjs` and `scripts/test-parallel-memory.mjs` are the primary control points for wrapper diagnostics.
 - The lane names printed by `[test-parallel] start ...` and `[test-parallel][mem] summary ...` tell you where to focus.
 - When one or two files account for most of the delta and they are missing from timings, reducing impact by isolating them is usually the first pragmatic fix.
-- When the same retained object families grow across multiple intervals in the same worker PID, trust the snapshots over intuition.
+- When the same retained object families grow across multiple intervals in the same worker PID, trust the snapshots over intuition, then confirm ambiguous calls with retainer evidence.
 
 ## Snapshot Comparison
 
 - Direct comparison:
   - `node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs before.heapsnapshot after.heapsnapshot`
 - Auto-select earliest/latest snapshots per PID within one lane:
-  - `node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs --lane-dir .tmp/heapsnap/unit-fast`
+  - `node .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs --lane-dir .tmp/heapsnap/unit-fast-batch-2`
 - Useful flags:
   - `--top 40`
   - `--min-kb 32`
   - `--pid 16133`
 
-Read the top positive deltas first. Large positive growth in module-transform artifacts suggests lane isolation; large positive growth in runtime objects suggests a real leak.
+Read the top positive deltas first. Large positive growth in module-transform artifacts suggests lane isolation; large positive growth in runtime objects suggests a real leak. If the names alone do not settle it, open the same snapshot pair in DevTools and inspect retainers/dominators for the top rows before declaring root cause.
 
 ## Output Expectations
 
@@ -66,6 +70,6 @@ When using this skill, report:
 - The exact reproduce command.
 - Which lane and PID were compared.
 - The dominant retained object families from the snapshot delta.
-- Whether the issue is a real leak or shared-worker retained module growth.
+- Whether the issue is a likely real leak or likely shared-worker retained module growth, plus whether retainers/dominators confirmed it.
 - The concrete fix or impact-reduction patch.
 - What you verified, and what snapshot overhead prevented you from verifying.
diff --git a/.agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs b/.agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs
index ccb705c4c82..0ff4ddc6eeb 100644
--- a/.agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs
+++ b/.agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs
@@ -64,6 +64,243 @@ function parseArgs(argv) {
   return options;
 }
 
+class JsonStreamScanner {
+  constructor(filePath) {
+    this.stream = fs.createReadStream(filePath, {
+      encoding: "utf8",
+      highWaterMark: 1024 * 1024,
+    });
+    this.iterator = this.stream[Symbol.asyncIterator]();
+    this.buffer = "";
+    this.offset = 0;
+    this.done = false;
+  }
+
+  compactBuffer() {
+    if (this.offset > 65536) {
+      this.buffer = this.buffer.slice(this.offset);
+      this.offset = 0;
+    }
+  }
+
+  async ensureAvailable(count = 1) {
+    while (!this.done && this.buffer.length - this.offset < count) {
+      const next = await this.iterator.next();
+      if (next.done) {
+        this.done = true;
+        break;
+      }
+      this.buffer += next.value;
+    }
+  }
+
+  async peek() {
+    await this.ensureAvailable(1);
+    return this.buffer[this.offset] ?? null;
+  }
+
+  async next() {
+    await this.ensureAvailable(1);
+    if (this.offset >= this.buffer.length) {
+      return null;
+    }
+    const char = this.buffer[this.offset];
+    this.offset += 1;
+    this.compactBuffer();
+    return char;
+  }
+
+  async skipWhitespace() {
+    while (true) {
+      const char = await this.peek();
+      if (char === null || !/\s/u.test(char)) {
+        return;
+      }
+      await this.next();
+    }
+  }
+
+  async expectChar(expected) {
+    const char = await this.next();
+    if (char !== expected) {
+      fail(`Expected ${expected} but found ${char ?? ""}`);
+    }
+  }
+
+  async find(sequence) {
+    let matched = 0;
+    while (true) {
+      const char = await this.next();
+      if (char === null) {
+        fail(`Could not find ${sequence}`);
+      }
+      if (char === sequence[matched]) {
+        matched += 1;
+        if (matched === sequence.length) {
+          return;
+        }
+        continue;
+      }
+      matched = char === sequence[0] ? 1 : 0;
+      if (matched === sequence.length) {
+        return;
+      }
+    }
+  }
+
+  async readBalancedObject() {
+    const start = await this.next();
+    if (start !== "{") {
+      fail(`Expected { but found ${start ?? ""}`);
+    }
+    let text = "{";
+    let depth = 1;
+    let inString = false;
+    let escaped = false;
+    while (depth > 0) {
+      const char = await this.next();
+      if (char === null) {
+        fail("Unexpected EOF while reading JSON object");
+      }
+      text += char;
+      if (inString) {
+        if (escaped) {
+          escaped = false;
+        } else if (char === "\\") {
+          escaped = true;
+        } else if (char === '"') {
+          inString = false;
+        }
+        continue;
+      }
+      if (char === '"') {
+        inString = true;
+      } else if (char === "{") {
+        depth += 1;
+      } else if (char === "}") {
+        depth -= 1;
+      }
+    }
+    return text;
+  }
+
+  async parseNumberArray(onValue) {
+    await this.skipWhitespace();
+    await this.expectChar("[");
+    await this.skipWhitespace();
+    if ((await this.peek()) === "]") {
+      await this.next();
+      return;
+    }
+
+    let token = "";
+    let index = 0;
+    const flush = () => {
+      if (token.length === 0) {
+        fail("Unexpected empty number token");
+      }
+      const value = Number.parseInt(token, 10);
+      if (!Number.isFinite(value)) {
+        fail(`Invalid numeric token: ${token}`);
+      }
+      onValue(value, index);
+      index += 1;
+      token = "";
+    };
+
+    while (true) {
+      const char = await this.next();
+      if (char === null) {
+        fail("Unexpected EOF while reading number array");
+      }
+      if (char === "]") {
+        flush();
+        return;
+      }
+      if (char === ",") {
+        flush();
+        continue;
+      }
+      if (/\s/u.test(char)) {
+        continue;
+      }
+      token += char;
+    }
+  }
+
+  async readJsonString() {
+    await this.expectChar('"');
+    let value = "";
+    while (true) {
+      const char = await this.next();
+      if (char === null) {
+        fail("Unexpected EOF while reading JSON string");
+      }
+      if (char === '"') {
+        return value;
+      }
+      if (char !== "\\") {
+        value += char;
+        continue;
+      }
+      const escaped = await this.next();
+      if (escaped === null) {
+        fail("Unexpected EOF while reading JSON string escape");
+      }
+      if (escaped === "u") {
+        let hex = "";
+        for (let index = 0; index < 4; index += 1) {
+          const hexChar = await this.next();
+          if (hexChar === null) {
+            fail("Unexpected EOF while reading JSON unicode escape");
+          }
+          hex += hexChar;
+        }
+        value += String.fromCharCode(Number.parseInt(hex, 16));
+        continue;
+      }
+      value +=
+        escaped === "b"
+          ? "\b"
+          : escaped === "f"
+            ? "\f"
+            : escaped === "n"
+              ? "\n"
+              : escaped === "r"
+                ? "\r"
+                : escaped === "t"
+                  ? "\t"
+                  : escaped;
+    }
+  }
+
+  async parseStringArray(onValue) {
+    await this.skipWhitespace();
+    await this.expectChar("[");
+    await this.skipWhitespace();
+    if ((await this.peek()) === "]") {
+      await this.next();
+      return;
+    }
+
+    let index = 0;
+    while (true) {
+      const value = await this.readJsonString();
+      onValue(value, index);
+      index += 1;
+      await this.skipWhitespace();
+      const separator = await this.next();
+      if (separator === "]") {
+        return;
+      }
+      if (separator !== ",") {
+        fail(`Expected , or ] but found ${separator ?? ""}`);
+      }
+      await this.skipWhitespace();
+    }
+  }
+}
+
 function parseHeapFilename(filePath) {
   const base = path.basename(filePath);
   const match = base.match(
@@ -151,38 +388,89 @@ function resolvePair(options) {
   };
 }
 
-function loadSummary(filePath) {
-  const data = JSON.parse(fs.readFileSync(filePath, "utf8"));
-  const meta = data.snapshot?.meta;
+async function parseSnapshotMeta(scanner) {
+  await scanner.find('"snapshot":');
+  await scanner.skipWhitespace();
+  const metaObjectText = await scanner.readBalancedObject();
+  const parsed = JSON.parse(metaObjectText);
+  return parsed?.meta ?? null;
+}
+
+async function buildSummary(filePath) {
+  const scanner = new JsonStreamScanner(filePath);
+  const meta = await parseSnapshotMeta(scanner);
   if (!meta) {
     fail(`Invalid heap snapshot: ${filePath}`);
   }
   const nodeFieldCount = meta.node_fields.length;
   const typeNames = meta.node_types[0];
-  const strings = data.strings;
   const typeIndex = meta.node_fields.indexOf("type");
   const nameIndex = meta.node_fields.indexOf("name");
   const selfSizeIndex = meta.node_fields.indexOf("self_size");
+  if (typeIndex === -1 || nameIndex === -1 || selfSizeIndex === -1) {
+    fail(`Unsupported heap snapshot schema: ${filePath}`);
+  }
 
-  const summary = new Map();
-  for (let offset = 0; offset < data.nodes.length; offset += nodeFieldCount) {
-    const type = typeNames[data.nodes[offset + typeIndex]];
-    const name = strings[data.nodes[offset + nameIndex]];
-    const selfSize = data.nodes[offset + selfSizeIndex];
-    const key = `${type}\t${name}`;
-    const current = summary.get(key) ?? {
-      type,
-      name,
+  const summaryByIndex = new Map();
+  let nodeCount = 0;
+  let currentTypeId = 0;
+  let currentNameId = 0;
+  let currentSelfSize = 0;
+  await scanner.find('"nodes":');
+  await scanner.parseNumberArray((value, index) => {
+    const fieldIndex = index % nodeFieldCount;
+    if (fieldIndex === typeIndex) {
+      currentTypeId = value;
+      return;
+    }
+    if (fieldIndex === nameIndex) {
+      currentNameId = value;
+      return;
+    }
+    if (fieldIndex === selfSizeIndex) {
+      currentSelfSize = value;
+    }
+    if (fieldIndex !== nodeFieldCount - 1) {
+      return;
+    }
+    const key = `${currentTypeId}\t${currentNameId}`;
+    const current = summaryByIndex.get(key) ?? {
+      typeId: currentTypeId,
+      nameId: currentNameId,
       selfSize: 0,
       count: 0,
     };
-    current.selfSize += selfSize;
+    current.selfSize += currentSelfSize;
     current.count += 1;
-    summary.set(key, current);
+    summaryByIndex.set(key, current);
+    nodeCount += 1;
+  });
+
+  const requiredNameIds = new Set(
+    Array.from(summaryByIndex.values(), (entry) => entry.nameId).filter((value) => value >= 0),
+  );
+  const nameStrings = new Map();
+  await scanner.find('"strings":');
+  await scanner.parseStringArray((value, index) => {
+    if (requiredNameIds.has(index)) {
+      nameStrings.set(index, value);
+    }
+  });
+
+  const summary = new Map();
+  for (const entry of summaryByIndex.values()) {
+    const key = `${typeNames[entry.typeId] ?? "unknown"}\t${nameStrings.get(entry.nameId) ?? ""}`;
+    summary.set(key, {
+      type: typeNames[entry.typeId] ?? "unknown",
+      name: nameStrings.get(entry.nameId) ?? "",
+      selfSize: entry.selfSize,
+      count: entry.count,
+    });
   }
+
   return {
-    nodeCount: data.snapshot.node_count,
+    nodeCount,
     summary,
   };
 }
@@ -205,11 +493,11 @@ function truncate(text, maxLength) {
   return text.length <= maxLength ? text : `${text.slice(0, maxLength - 1)}…`;
 }
 
-function main() {
+async function main() {
   const options = parseArgs(process.argv.slice(2));
   const pair = resolvePair(options);
-  const before = loadSummary(pair.before);
-  const after = loadSummary(pair.after);
+  const before = await buildSummary(pair.before);
+  const after = await buildSummary(pair.after);
   const minBytes = options.minKb * 1024;
   const rows = [];
 
@@ -262,4 +550,4 @@ function main() {
   }
 }
 
-main();
+await main();
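The patch leaves the delta computation in `main` untouched: it still consumes two summary `Map`s shaped like `buildSummary`'s output. As a reviewer's sketch of what that downstream step amounts to, under the assumption of a simple per-key subtraction (the function name `diffSummaries` and the row shape are illustrative, not code from the script):

```javascript
// Illustrative sketch only: compute a per-(type, name) delta between two
// summary Maps keyed by `${type}\t${name}`, each value holding
// { type, name, selfSize, count }. Not the script's actual implementation.
function diffSummaries(before, after) {
  const rows = new Map();
  // Entries present in "after": subtract the matching "before" sizes/counts.
  for (const [key, entry] of after) {
    const prev = before.get(key);
    rows.set(key, {
      type: entry.type,
      name: entry.name,
      deltaBytes: entry.selfSize - (prev?.selfSize ?? 0),
      deltaCount: entry.count - (prev?.count ?? 0),
    });
  }
  // Entries that vanished entirely show up as negative deltas.
  for (const [key, entry] of before) {
    if (!rows.has(key)) {
      rows.set(key, {
        type: entry.type,
        name: entry.name,
        deltaBytes: -entry.selfSize,
        deltaCount: -entry.count,
      });
    }
  }
  // Largest positive growth first, matching how the report is read.
  return [...rows.values()].sort((a, b) => b.deltaBytes - a.deltaBytes);
}
```

Reading the sorted rows top-down mirrors the SKILL.md guidance: large positive `deltaBytes` on module-transform names points at lane isolation, while large positive `deltaBytes` on runtime object names points at a likely real leak.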