test(docker): add observability smoke

Add Docker aggregate observability coverage for QA-lab OTEL and Prometheus diagnostics.
Vincent Koc
2026-04-26 16:43:56 -07:00
committed by GitHub
parent 560ddd2f9b
commit 5d7c6e6bda
7 changed files with 281 additions and 1 deletions


@@ -19,6 +19,7 @@ Docs: https://docs.openclaw.ai
- Providers/Ollama: honor `/api/show` capabilities when registering local models so non-tool Ollama models no longer receive the agent tool surface, and keep native Ollama thinking opt-in instead of enabling it by default. Fixes #64710 and duplicate #65343. Thanks @yuan-b, @netherby, @xilopaint, and @Diyforfun2026.
- Providers/Ollama: expose native Ollama thinking effort levels so `/think max` is accepted for reasoning-capable Ollama models and maps to Ollama's highest supported `think` effort. Fixes #71584. Thanks @g0st1n.
- Agents/Ollama: validate explicit `--thinking max` against catalog-discovered Ollama reasoning metadata so local agent runs accept the same native thinking levels shown in the model catalog. Fixes #71584. Thanks @g0st1n.
- Docker/QA: add observability coverage to the normal Docker aggregate so QA-lab OTEL and Prometheus diagnostics run inside Docker. Thanks @vincentkoc.
- Auto-reply: poison inbound message dedupe after replay-unsafe provider/runtime failures so retries stay safe before visible progress but cannot duplicate messages after block output, tool side effects, or session progress. Fixes #69303; keeps #58549 and #64606 as duplicate validation. Thanks @martingarramon, @NikolaFC, and @zeroth-blip.
- Agents/model fallback: jump directly to a known later live-session model redirect instead of walking unrelated fallback candidates, while preserving the already-landed live-session/fallback loop guard. Fixes #57471; related loop family already closed via #58496. Thanks @yuxiaoyang2007-prog.
- Gateway/Bonjour: keep @homebridge/ciao cancellation handlers registered across advertiser restarts so late probing cancellations cannot crash Linux and other mDNS-churned gateways. Thanks @codex.


@@ -65,6 +65,14 @@ model calls must not export `StreamAbandoned` on successful turns; raw diagnosti
`openclaw.content.*` attributes must stay out of the trace. It writes
`otel-smoke-summary.json` next to the QA suite artifacts.
The normal Docker aggregate also runs an observability lane. It builds or
reuses a source-backed Docker observability image, runs the OTEL trace smoke
inside the container, then runs the `docker-prometheus-smoke` QA scenario with the
`diagnostics-prometheus` plugin enabled. Set
`OPENCLAW_DOCKER_OBSERVABILITY_LOOPS=<count>` to repeat both checks inside one
Docker run while preserving per-loop artifacts under
`.artifacts/docker-observability/...`.
For a transport-real Matrix smoke lane, run:
```bash


@@ -617,6 +617,7 @@ The live-model Docker runners also bind-mount only the needed CLI auth homes (or
- CLI backend smoke: `pnpm test:docker:live-cli-backend` (script: `scripts/test-live-cli-backend-docker.sh`)
- Codex app-server harness smoke: `pnpm test:docker:live-codex-harness` (script: `scripts/test-live-codex-harness-docker.sh`)
- Gateway + dev agent: `pnpm test:docker:live-gateway` (script: `scripts/test-live-gateway-models-docker.sh`)
- Docker observability smoke: included in `pnpm test:docker:all` and `pnpm test:docker:local:all` (script: `scripts/e2e/docker-observability-smoke.sh`). It runs QA-lab OTEL and Prometheus diagnostics checks inside a source-backed Docker image. Set `OPENCLAW_DOCKER_OBSERVABILITY_LOOPS=<count>` to repeat both checks in one container run.
- Open WebUI live smoke: `pnpm test:docker:openwebui` (script: `scripts/e2e/openwebui-docker.sh`)
- Onboarding wizard (TTY, full scaffolding): `pnpm test:docker:onboard` (script: `scripts/e2e/onboard-docker.sh`)
- Npm tarball onboarding/channel/agent smoke: `pnpm test:docker:npm-onboard-channel-agent` installs the packed OpenClaw tarball globally in Docker, configures OpenAI via env-ref onboarding plus Telegram by default, verifies doctor repairs activated plugin runtime deps, and runs one mocked OpenAI agent turn. Reuse a prebuilt tarball with `OPENCLAW_CURRENT_PACKAGE_TGZ=/path/to/openclaw-*.tgz`, skip the host rebuild with `OPENCLAW_NPM_ONBOARD_HOST_BUILD=0`, or switch channel with `OPENCLAW_NPM_ONBOARD_CHANNEL=discord`.


@@ -0,0 +1,156 @@
# Docker Prometheus smoke
```yaml qa-scenario
id: docker-prometheus-smoke
title: Docker Prometheus smoke
surface: telemetry
coverage:
  primary:
    - telemetry.prometheus
  secondary:
    - harness.qa-lab
    - docker.e2e
objective: Verify a QA-lab gateway run emits protected, bounded Prometheus diagnostics metrics through the diagnostics-prometheus plugin.
successCriteria:
  - The diagnostics-prometheus plugin exposes the protected scrape route.
  - An unauthenticated scrape is rejected.
  - A minimal QA-channel agent turn completes.
  - The authenticated scrape includes release-critical diagnostics metric families.
  - Prometheus output omits prompt content, session keys, auth tokens, raw ids, and file paths.
plugins:
  - diagnostics-prometheus
gatewayConfigPatch:
  diagnostics:
    enabled: true
docsRefs:
  - docs/gateway/prometheus.md
  - docs/concepts/qa-e2e-automation.md
codeRefs:
  - extensions/diagnostics-prometheus/src/service.ts
  - src/diagnostics/internal-diagnostics.ts
  - extensions/qa-lab/src/suite.ts
execution:
  kind: flow
  summary: Complete a minimal QA-lab turn and scrape the protected Prometheus route.
  config:
    prompt: Reply exactly DOCKER-PROMETHEUS-OK. Do not repeat DOCKER-PROMETHEUS-SECRET.
    secretNeedle: DOCKER-PROMETHEUS-SECRET
```
```yaml qa-flow
steps:
  - name: emits protected low-cardinality prometheus metrics
    actions:
      - call: waitForGatewayHealthy
        args:
          - ref: env
          - 60000
      - call: waitForQaChannelReady
        args:
          - ref: env
          - 60000
      - call: reset
      - set: startCursor
        value:
          expr: state.getSnapshot().messages.length
      - call: runAgentPrompt
        args:
          - ref: env
          - sessionKey: agent:qa:docker-prometheus-smoke
            message:
              expr: config.prompt
            timeoutMs:
              expr: liveTurnTimeoutMs(env, 30000)
      - call: waitForCondition
        saveAs: outbound
        args:
          - lambda:
              expr: "state.getSnapshot().messages.slice(startCursor).filter((candidate) => candidate.direction === 'outbound' && candidate.conversation.id === 'qa-operator' && String(candidate.text ?? '').trim().length > 0).at(-1)"
          - expr: liveTurnTimeoutMs(env, 30000)
          - expr: "env.providerMode === 'mock-openai' ? 100 : 250"
      - assert:
          expr: "String(outbound.text ?? '').trim().length > 0"
          message: "expected non-empty qa output before scraping metrics"
      - set: prometheusUrl
        value:
          expr: "`${env.gateway.baseUrl}/api/diagnostics/prometheus`"
      - set: gatewayToken
        value:
          expr: "String(env.gateway.token ?? env.gateway.runtimeEnv.OPENCLAW_GATEWAY_TOKEN ?? '')"
      - assert:
          expr: "gatewayToken.length > 0"
          message: "expected QA gateway token to be available for protected scrape"
      - set: unauthenticatedScrape
        value:
          expr: |-
            (async () => {
              const response = await fetch(prometheusUrl);
              await response.text().catch(() => "");
              return { status: response.status };
            })()
      - assert:
          expr: "unauthenticatedScrape.status === 401 || unauthenticatedScrape.status === 403"
          message:
            expr: "`expected unauthenticated prometheus scrape to be rejected, got ${unauthenticatedScrape.status}`"
      - set: authenticatedScrape
        value:
          expr: |-
            (async () => {
              const response = await fetch(prometheusUrl, {
                headers: { authorization: `Bearer ${gatewayToken}` },
              });
              const text = await response.text();
              return {
                status: response.status,
                contentType: response.headers.get("content-type") ?? "",
                text,
              };
            })()
      - assert:
          expr: "authenticatedScrape.status === 200"
          message:
            expr: "`expected authenticated prometheus scrape to return 200, got ${authenticatedScrape.status}`"
      - assert:
          expr: "authenticatedScrape.contentType.includes('text/plain')"
          message:
            expr: "`expected prometheus text content type, got ${authenticatedScrape.contentType}`"
      - set: prometheusText
        value:
          expr: "String(authenticatedScrape.text ?? '')"
      - assert:
          expr: "prometheusText.includes('# TYPE openclaw_run_completed_total counter')"
          message: "missing run completion counter"
      - assert:
          expr: "prometheusText.includes('# TYPE openclaw_run_duration_seconds histogram')"
          message: "missing run duration histogram"
      - assert:
          expr: "prometheusText.includes('# TYPE openclaw_model_call_total counter')"
          message: "missing model call counter"
      - assert:
          expr: "prometheusText.includes('# TYPE openclaw_harness_run_total counter')"
          message: "missing harness run counter"
      - assert:
          expr: "!prometheusText.includes(config.secretNeedle)"
          message: "prometheus output leaked prompt sentinel"
      - assert:
          expr: "!prometheusText.includes('DOCKER-PROMETHEUS-OK')"
          message: "prometheus output leaked response content"
      - assert:
          expr: "!prometheusText.includes('agent:qa:docker-prometheus-smoke')"
          message: "prometheus output leaked the session key"
      - assert:
          expr: "!prometheusText.includes(gatewayToken)"
          message: "prometheus output leaked the gateway token"
      - assert:
          expr: "!/runId|sessionId|sessionKey|callId|toolCallId|messageId|providerRequestId/.test(prometheusText)"
          message: "prometheus output leaked raw diagnostic identifiers"
      - assert:
          expr: "!/\\/tmp\\/|\\/private\\/tmp\\/|\\/app\\//.test(prometheusText)"
          message: "prometheus output leaked a local file path"
      - assert:
          expr: "!prometheusText.includes('openclaw.content.')"
          message: "prometheus output leaked content attributes"
      - assert:
          expr: "!/openclaw_prometheus_series_dropped_total(?:\\{[^}]*\\})?\\s+(?!0(?:\\.0+)?(?:\\s|$))/.test(prometheusText)"
          message: "prometheus dropped series during the smoke"
```
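The leak and cardinality assertions in the flow above are plain string and regex checks on the scraped text, so they can be tried outside the QA harness. The following is a minimal JavaScript sketch and not part of this commit: `findScrapeProblems`, the token value, and the sample metric lines are hypothetical, and only a subset of the assertions is reproduced. The regular expressions are copied from the flow with the YAML escaping removed.

```javascript
// Patterns copied from the qa-flow assertions (YAML escaping removed).
const rawIdPattern =
  /runId|sessionId|sessionKey|callId|toolCallId|messageId|providerRequestId/;
const localPathPattern = /\/tmp\/|\/private\/tmp\/|\/app\//;
const droppedSeriesPattern =
  /openclaw_prometheus_series_dropped_total(?:\{[^}]*\})?\s+(?!0(?:\.0+)?(?:\s|$))/;

// Hypothetical helper: returns the names of the checks that would fail.
function findScrapeProblems(prometheusText, { secretNeedle, sessionKey, gatewayToken }) {
  const checks = [
    ["prompt sentinel", prometheusText.includes(secretNeedle)],
    ["session key", prometheusText.includes(sessionKey)],
    ["gateway token", prometheusText.includes(gatewayToken)],
    ["raw identifiers", rawIdPattern.test(prometheusText)],
    ["local file path", localPathPattern.test(prometheusText)],
    ["dropped series", droppedSeriesPattern.test(prometheusText)],
  ];
  return checks.filter(([, failed]) => failed).map(([name]) => name);
}

const secrets = {
  secretNeedle: "DOCKER-PROMETHEUS-SECRET",
  sessionKey: "agent:qa:docker-prometheus-smoke",
  gatewayToken: "example-gateway-token", // made-up value for illustration
};

// Made-up sample scrapes; real output carries more families and labels.
const cleanScrape = [
  "# TYPE openclaw_run_completed_total counter",
  'openclaw_run_completed_total{outcome="ok"} 1',
  "openclaw_prometheus_series_dropped_total 0",
].join("\n");

const leakyScrape = [
  'openclaw_run_completed_total{sessionKey="agent:qa:docker-prometheus-smoke"} 1',
  "openclaw_prometheus_series_dropped_total 2",
].join("\n");

console.log(findScrapeProblems(cleanScrape, secrets)); // []
console.log(findScrapeProblems(leakyScrape, secrets));
// [ 'session key', 'raw identifiers', 'dropped series' ]
```

One behavior worth knowing when reading the last pattern: as written, it also matches a `# TYPE openclaw_prometheus_series_dropped_total counter` comment line, because `counter` is not a zero value. Whether the exporter ever emits such a line alongside a zero sample is not shown in this commit.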


@@ -0,0 +1,55 @@
# syntax=docker/dockerfile:1.7
FROM node:24-bookworm-slim@sha256:e8e2e91b1378f83c5b2dd15f0247f34110e2fe895f6ca7719dbb780f929368eb AS observability-runner
RUN apt-get update \
  && apt-get install -y --no-install-recommends ca-certificates git \
  && rm -rf /var/lib/apt/lists/*
RUN corepack enable
RUN useradd --create-home --shell /bin/bash appuser \
  && mkdir -p /app \
  && chown appuser:appuser /app
ENV HOME="/home/appuser"
ENV NODE_OPTIONS="--disable-warning=ExperimentalWarning"
ENV OPENCLAW_DISABLE_BONJOUR="1"
USER appuser
WORKDIR /app
COPY --chown=appuser:appuser package.json pnpm-lock.yaml pnpm-workspace.yaml .npmrc ./
COPY --chown=appuser:appuser ui/package.json ./ui/package.json
COPY --chown=appuser:appuser patches ./patches
COPY --chown=appuser:appuser scripts/postinstall-bundled-plugins.mjs scripts/preinstall-package-manager-warning.mjs scripts/npm-runner.mjs scripts/windows-cmd-helpers.mjs ./scripts/
RUN --mount=type=bind,source=extensions,target=/tmp/extensions,readonly \
    find /tmp/extensions -mindepth 2 -maxdepth 2 -name package.json -print | \
    while IFS= read -r manifest; do \
      dest="${manifest#/tmp/}"; \
      mkdir -p "$(dirname "$dest")"; \
      cp "$manifest" "$dest"; \
    done
RUN --mount=type=cache,id=openclaw-pnpm-store,target=/home/appuser/.local/share/pnpm/store,sharing=locked \
    pnpm install --frozen-lockfile
COPY --chown=appuser:appuser .oxlintrc.json tsconfig.json tsconfig.plugin-sdk.dts.json tsconfig.oxlint*.json tsdown.config.ts vitest.config.ts openclaw.mjs ./
COPY --chown=appuser:appuser src ./src
COPY --chown=appuser:appuser test ./test
COPY --chown=appuser:appuser scripts ./scripts
COPY --chown=appuser:appuser docs ./docs
COPY --chown=appuser:appuser packages ./packages
COPY --chown=appuser:appuser qa ./qa
COPY --chown=appuser:appuser skills ./skills
COPY --chown=appuser:appuser ui ./ui
COPY --chown=appuser:appuser extensions ./extensions
COPY --chown=appuser:appuser vendor/a2ui/renderers/lit ./vendor/a2ui/renderers/lit
COPY --chown=appuser:appuser apps/shared/OpenClawKit/Sources/OpenClawKit/Resources ./apps/shared/OpenClawKit/Sources/OpenClawKit/Resources
COPY --chown=appuser:appuser apps/shared/OpenClawKit/Tools/CanvasA2UI ./apps/shared/OpenClawKit/Tools/CanvasA2UI
RUN pnpm build
RUN mkdir -p dist/control-ui \
  && printf '%s\n' '<!doctype html><title>OpenClaw Control UI</title>' > dist/control-ui/index.html
CMD ["bash"]


@@ -0,0 +1,52 @@
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
source "$ROOT_DIR/scripts/lib/docker-e2e-image.sh"
IMAGE_NAME="$(docker_e2e_resolve_image "openclaw-docker-observability-e2e:local" OPENCLAW_DOCKER_OBSERVABILITY_E2E_IMAGE)"
SKIP_BUILD="${OPENCLAW_DOCKER_OBSERVABILITY_E2E_SKIP_BUILD:-0}"
LOOPS="${OPENCLAW_DOCKER_OBSERVABILITY_LOOPS:-1}"
OUTPUT_DIR="${OPENCLAW_DOCKER_OBSERVABILITY_OUTPUT_DIR:-$ROOT_DIR/.artifacts/docker-observability/$(date +%Y%m%d-%H%M%S)}"
if ! [[ "$LOOPS" =~ ^[1-9][0-9]*$ ]]; then
  echo "OPENCLAW_DOCKER_OBSERVABILITY_LOOPS must be a positive integer, got: $LOOPS" >&2
  exit 1
fi
mkdir -p "$OUTPUT_DIR"
docker_e2e_build_or_reuse "$IMAGE_NAME" docker-observability "$ROOT_DIR/scripts/e2e/Dockerfile.observability" "$ROOT_DIR" "" "$SKIP_BUILD"
echo "Running Docker observability smoke with $LOOPS loop(s)..."
run_logged docker-observability docker run --rm \
  -e "OPENCLAW_DOCKER_OBSERVABILITY_LOOPS=$LOOPS" \
  -v "$OUTPUT_DIR:/app/.artifacts/docker-observability-current" \
  "$IMAGE_NAME" \
  bash -lc '
    set -euo pipefail
    loops="${OPENCLAW_DOCKER_OBSERVABILITY_LOOPS:-1}"
    artifact_root=".artifacts/docker-observability-current"
    mkdir -p "$artifact_root"
    for i in $(seq 1 "$loops"); do
      iteration_dir="$artifact_root/loop-$i"
      mkdir -p "$iteration_dir"
      echo "== docker observability loop $i/$loops: otel =="
      pnpm qa:otel:smoke \
        --provider-mode mock-openai \
        --output-dir "$iteration_dir/otel"
      echo "== docker observability loop $i/$loops: prometheus =="
      pnpm openclaw qa suite \
        --provider-mode mock-openai \
        --scenario docker-prometheus-smoke \
        --concurrency 1 \
        --fast \
        --output-dir "$iteration_dir/prometheus"
    done
  '
echo "Docker observability smoke passed. Artifacts: $OUTPUT_DIR"
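To make the loop handling in the script above easier to picture, here is a minimal JavaScript sketch of the same validation and per-loop artifact layout. It is an illustration only, not part of this commit: `planObservabilityLoops` is a hypothetical name, and the real work is done by the bash script. The regex, the error message, and the `loop-<n>/otel` and `loop-<n>/prometheus` layout are taken from the script.

```javascript
// Hypothetical helper mirroring the script's loop validation and
// in-container artifact layout.
function planObservabilityLoops(
  rawLoops = "1",
  artifactRoot = ".artifacts/docker-observability-current",
) {
  // Same pattern as the bash check: a positive integer with no leading zero.
  if (!/^[1-9][0-9]*$/.test(rawLoops)) {
    throw new Error(
      `OPENCLAW_DOCKER_OBSERVABILITY_LOOPS must be a positive integer, got: ${rawLoops}`,
    );
  }
  const loops = Number(rawLoops);
  // One directory per loop, each with an OTEL and a Prometheus output dir.
  return Array.from({ length: loops }, (_, index) => {
    const iterationDir = `${artifactRoot}/loop-${index + 1}`;
    return {
      otel: `${iterationDir}/otel`,
      prometheus: `${iterationDir}/prometheus`,
    };
  });
}

console.log(planObservabilityLoops("2"));
```

Because the host `OUTPUT_DIR` is bind-mounted at that in-container path, the same `loop-<n>` directories should appear under the host artifact directory after the run.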


@@ -25,7 +25,10 @@ function lane(name, command, options = {}) {
return {
cacheKey: options.cacheKey,
command,
-    e2eImageKind: options.e2eImageKind ?? (options.live ? undefined : "functional"),
+    e2eImageKind:
+      options.e2eImageKind === false
+        ? undefined
+        : (options.e2eImageKind ?? (options.live ? undefined : "functional")),
estimateSeconds: options.estimateSeconds,
live: options.live === true,
name,
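The `e2eImageKind` change in this hunk adds an explicit opt-out: passing `false` yields `undefined` even for a non-live lane, which is what the new `observability` lane relies on. A minimal standalone sketch of the resolution rule follows; `resolveE2eImageKind` is a hypothetical name used only for illustration, and `"bare"` is a made-up image kind.

```javascript
// Hypothetical standalone copy of the resolution expression from lane().
function resolveE2eImageKind(options = {}) {
  // Explicit opt-out: the lane manages its own image.
  if (options.e2eImageKind === false) {
    return undefined;
  }
  // Otherwise keep the previous behavior: an explicit kind wins, live lanes
  // get no shared image, and everything else defaults to "functional".
  return options.e2eImageKind ?? (options.live ? undefined : "functional");
}

console.log(resolveE2eImageKind({})); // functional
console.log(resolveE2eImageKind({ live: true })); // undefined
console.log(resolveE2eImageKind({ e2eImageKind: false, weight: 3 })); // undefined
console.log(resolveE2eImageKind({ e2eImageKind: "bare" })); // bare
```

Before this change there was no way for a non-live lane to avoid the `"functional"` default, since `undefined` and `null` both fall through the `??` operator.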
@@ -181,6 +184,10 @@ export const mainLanes = [
{ resources: ["service"], weight: 3 },
),
serviceLane("gateway-network", "OPENCLAW_SKIP_DOCKER_BUILD=1 pnpm test:docker:gateway-network"),
serviceLane("observability", "bash scripts/e2e/docker-observability-smoke.sh", {
e2eImageKind: false,
weight: 3,
}),
serviceLane(
"agents-delete-shared-workspace",
"OPENCLAW_SKIP_DOCKER_BUILD=1 pnpm test:docker:agents-delete-shared-workspace",