mirror of
https://github.com/moltbot/moltbot.git
synced 2026-05-13 15:47:28 +00:00
fix(discord): make realtime barge-in guard tunable
@@ -46,6 +46,7 @@ Docs: https://docs.openclaw.ai
- Discord/voice: make duplicate same-guild auto-join entries resolve to the last configured channel so moving an agent between voice channels does not keep joining the stale channel.
- Discord/voice: add realtime `/vc` modes so Discord voice channels can run as STT/TTS, a realtime talk buffer with the OpenClaw agent brain, or a bidi realtime session with `openclaw_agent_consult`.
- Discord/voice: add bounded realtime gateway logs for voice channel joins, realtime model/voice selection, transcripts, consult routing/answers, and playback start; allow OpenAI realtime Discord sessions to disable input-triggered response interruption in echo-heavy rooms while keeping explicit Discord barge-in available for new and already-active speakers; and allow voice turns to target an existing Discord channel agent session.
- Discord/voice: add `voice.realtime.minBargeInAudioEndMs` and let the realtime provider own playback clearing, so speaker echo no longer cuts OpenAI realtime model audio at `audioEndMs=0` while low-echo rooms can opt back into immediate barge-in with `0`.
- Discord/voice: include a bounded one-line STT transcript preview in verbose voice logs so live voice debugging shows what speakers said before the agent reply.
- Codex app-server: pin the managed Codex harness and Codex CLI smoke package to `@openai/codex@0.129.0`, defer OpenClaw integration dynamic tools behind Codex tool search by default, and accept current Codex service-tier values so legacy `fast` settings survive the stable harness upgrade as `priority`.
- Codex app-server: annotate message-tool-only direct chat turns in the dynamic `message` tool spec so visible replies are sent through `message(action="send")` instead of staying private. (#79704)
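The last-wins auto-join resolution in the first bullet can be sketched as follows (a minimal illustration; the type and function names here are hypothetical, not OpenClaw's actual code):

```typescript
// Hypothetical sketch: duplicate same-guild auto-join entries resolve to the
// last configured channel (later entries win).
type AutoJoinEntry = { guildId: string; channelId: string };

function resolveAutoJoins(entries: AutoJoinEntry[]): Map<string, string> {
  const byGuild = new Map<string, string>();
  for (const entry of entries) {
    // Map.set overwrites, so the last configured channel per guild wins.
    byGuild.set(entry.guildId, entry.channelId);
  }
  return byGuild;
}
```

Moving an agent between voice channels then updates the single entry for that guild instead of leaving a stale join target behind.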

@@ -1,4 +1,4 @@
216abd8ed137e6e92d615d88afb0d9fe0b1428cddee292439b39138ee03f9a10 config-baseline.json
632c00a35e0ed2413604ff28f5b4df0718131492208863c5d39576d76a9b7c88 config-baseline.json
7ac9eadabe0119deba4418dbaadc478092fa32617fab3f9618e0a14210720e4b config-baseline.core.json
c3e8742922d4e5ece408dd3590382285927ef86252d1a2f6f922566ea21531bb config-baseline.channel.json
42264b147fb29e0ba7017b4ec018a0793bb9cd23e58bf5fb796d6b33bf9ca829 config-baseline.channel.json
df93bfde8e3de8d6f80dbf1b0ae43ad250f216f2fc0244c5d9a19afca50806f6 config-baseline.plugin.json
@@ -1206,6 +1206,7 @@ Notes:
- In `stt-tts` mode, STT uses `tools.media.audio`; `voice.model` does not affect transcription.
- In realtime modes, `voice.realtime.provider`, `voice.realtime.model`, and `voice.realtime.voice` configure the realtime audio session. For OpenAI Realtime 2 plus the Codex brain, use `voice.realtime.model: "gpt-realtime-2"` and `voice.model: "openai-codex/gpt-5.5"`.
- `voice.realtime.bargeIn` controls whether Discord speaker-start events interrupt active realtime playback. If unset, it follows the realtime provider's input-audio interruption setting.
- `voice.realtime.minBargeInAudioEndMs` controls the minimum assistant playback duration before an OpenAI realtime barge-in truncates audio. Default: `250`. Set `0` for immediate interruption in low-echo rooms, or raise it for echo-heavy speaker setups.
- For an OpenAI voice on Discord playback, set `voice.tts.provider: "openai"` and choose a text-to-speech voice under `voice.tts.openai.voice` or `voice.tts.providers.openai.voice`. `cedar` is a good masculine-sounding choice on the current OpenAI TTS model.
- Per-channel Discord `systemPrompt` overrides apply to voice transcript turns for that voice channel.
- Voice transcript turns derive owner status from Discord `allowFrom` (or `dm.allowFrom`); non-owner speakers cannot access owner-only tools (for example `gateway` and `cron`).
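Pulling the notes above together, a minimal realtime session config might look like this (illustrative values only; the full echo-heavy example appears later in this document):

```
voice: {
  mode: "bidi",
  model: "openai-codex/gpt-5.5",
  realtime: {
    provider: "openai",
    model: "gpt-realtime-2",
    voice: "cedar",
    bargeIn: true,
    minBargeInAudioEndMs: 250,
  },
},
```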
@@ -1217,7 +1218,7 @@ Notes:
- `voice.connectTimeoutMs` controls the initial `@discordjs/voice` Ready wait for `/vc join` and auto-join attempts. Default: `30000`.
- `voice.reconnectGraceMs` controls how long OpenClaw waits for a disconnected voice session to begin reconnecting before destroying it. Default: `15000`.
- In `stt-tts` mode, voice playback does not stop just because another user starts speaking. To avoid feedback loops, OpenClaw ignores new voice capture while TTS is playing; speak after playback finishes for the next turn. Realtime modes forward speaker starts as barge-in signals to the realtime provider.
- In realtime modes, echo from speakers into an open mic can look like barge-in and interrupt playback. For echo-heavy Discord rooms, set `voice.realtime.providers.openai.interruptResponseOnInputAudio: false` to keep OpenAI from auto-interrupting on input audio. Add `voice.realtime.bargeIn: true` if you still want Discord speaker-start events to interrupt active playback.
- In realtime modes, echo from speakers into an open mic can look like barge-in and interrupt playback. For echo-heavy Discord rooms, set `voice.realtime.providers.openai.interruptResponseOnInputAudio: false` to keep OpenAI from auto-interrupting on input audio. Add `voice.realtime.bargeIn: true` if you still want Discord speaker-start events to interrupt active playback. The OpenAI realtime bridge ignores playback truncations shorter than `voice.realtime.minBargeInAudioEndMs` as likely echo/noise and logs them as skipped instead of clearing Discord playback.
- `voice.captureSilenceGraceMs` controls how long OpenClaw waits after Discord reports a speaker has stopped before finalizing that audio segment for STT. Default: `2500`; raise this if Discord splits normal pauses into choppy partial transcripts.
- When ElevenLabs is the selected TTS provider, Discord voice playback uses streaming TTS and starts from the provider response stream. Providers without streaming support fall back to the synthesized temp-file path.
- OpenClaw also watches receive decrypt failures and auto-recovers by leaving/rejoining the voice channel after repeated failures in a short window.
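The echo-guard decision described in these notes can be sketched as follows (an illustrative reduction, not the actual bridge implementation; the function name is hypothetical):

```typescript
// Sketch of the min barge-in window: truncations below the threshold are
// treated as likely echo/noise and skipped instead of clearing playback.
const DEFAULT_MIN_BARGE_IN_AUDIO_END_MS = 250;

function shouldTruncatePlayback(
  audioEndMs: number,
  minBargeInAudioEndMs?: number,
): boolean {
  const threshold = minBargeInAudioEndMs ?? DEFAULT_MIN_BARGE_IN_AUDIO_END_MS;
  return audioEndMs >= threshold;
}
```

With the default window, a barge-in at `audioEndMs=0` (typically the model hearing its own playback start) is ignored, while setting the window to `0` restores immediate interruption for low-echo rooms.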
@@ -1345,6 +1346,7 @@ Echo-heavy OpenAI Realtime example:
model: "gpt-realtime-2",
voice: "cedar",
bargeIn: true,
minBargeInAudioEndMs: 500,
consultPolicy: "always",
providers: {
openai: {
@@ -1358,16 +1360,17 @@ Echo-heavy OpenAI Realtime example:
}
```

Use this when the model hears its own Discord playback through an open mic, but you still want to interrupt it by speaking. OpenClaw keeps OpenAI from auto-interrupting on raw input audio, while `bargeIn: true` lets Discord speaker-start events and already-active speaker audio cancel active realtime responses before the next captured turn reaches OpenAI.
Use this when the model hears its own Discord playback through an open mic, but you still want to interrupt it by speaking. OpenClaw keeps OpenAI from auto-interrupting on raw input audio, while `bargeIn: true` lets Discord speaker-start events and already-active speaker audio cancel active realtime responses before the next captured turn reaches OpenAI. Very early barge-in signals with `audioEndMs` below `minBargeInAudioEndMs` are treated as likely echo/noise and ignored so the model does not cut off at the first playback frame.

Expected voice logs:

- On join: `discord voice: joining ... voiceSession=... supervisorSession=... agentSessionMode=... voiceModel=... realtimeModel=...`
- On realtime start: `discord voice: realtime bridge starting ... interruptResponse=false bargeIn=true`
- On realtime start: `discord voice: realtime bridge starting ... interruptResponse=false bargeIn=true minBargeInAudioEndMs=...`
- On realtime consult: `discord voice: realtime consult requested ... voiceSession=... supervisorSession=... question=...`
- On agent answer: `discord voice: agent turn answer ...`
- On same-speaker interruption: `discord voice: realtime barge-in from active speaker audio ...`
- On realtime interruption: `discord voice: realtime model interrupt requested client:response.cancel reason=barge-in`, followed by either `discord voice: realtime model audio truncated client:conversation.item.truncate reason=barge-in audioEndMs=...` or `discord voice: realtime model interrupt confirmed server:response.done status=cancelled ...`
- On ignored echo/noise: `discord voice: realtime model interrupt ignored client:conversation.item.truncate.skipped reason=barge-in audioEndMs=0 minAudioEndMs=250`
- On disabled barge-in: `discord voice: realtime capture ignored during playback (barge-in disabled) ...`

Credentials are resolved per component: LLM route auth for `voice.model`, STT auth for `tools.media.audio`, TTS auth for `messages.tts`/`voice.tts`, and realtime provider auth for `voice.realtime.providers` or the provider's normal auth config.
@@ -175,6 +175,7 @@ describe("discord config schema", () => {
toolPolicy: "safe-read-only",
consultPolicy: "always",
bargeIn: true,
minBargeInAudioEndMs: 500,
providers: {
openai: {
apiKey: "sk-test",
@@ -193,6 +194,7 @@ describe("discord config schema", () => {
expect(cfg.voice?.realtime?.toolPolicy).toBe("safe-read-only");
expect(cfg.voice?.realtime?.consultPolicy).toBe("always");
expect(cfg.voice?.realtime?.bargeIn).toBe(true);
expect(cfg.voice?.realtime?.minBargeInAudioEndMs).toBe(500);
});

it("rejects invalid Discord realtime voice modes", () => {
@@ -201,6 +203,8 @@ describe("discord config schema", () => {
{ mode: "bidi", realtime: { toolPolicy: "dangerous" } },
{ mode: "talk-buffer", realtime: { consultPolicy: "substantive" } },
{ mode: "talk-buffer", realtime: { debounceMs: 10_001 } },
{ mode: "talk-buffer", realtime: { minBargeInAudioEndMs: -1 } },
{ mode: "talk-buffer", realtime: { minBargeInAudioEndMs: 10_001 } },
{ agentSession: { mode: "target" } },
]) {
expectInvalidDiscordConfig({ voice });
@@ -217,6 +217,10 @@ export const discordChannelConfigUiHints = {
label: "Discord Realtime Barge-In",
help: "Allow Discord speaker-start events to interrupt active realtime playback. Set true to keep manual interruption when provider input-audio interruption is disabled for echo control.",
},
"voice.realtime.minBargeInAudioEndMs": {
label: "Discord Realtime Minimum Barge-In Audio (ms)",
help: "Minimum assistant playback duration before a Discord barge-in truncates realtime audio. Default: 250; set 0 for immediate interruption in low-echo rooms.",
},
"voice.realtime.providers": {
label: "Discord Realtime Provider Settings",
help: "Provider-specific realtime voice settings keyed by provider id.",
@@ -587,7 +587,7 @@ describe("DiscordVoiceManager", () => {
).handleSpeakingStart(entry, "u1");

expect(realtimeSessionMock.handleBargeIn).toHaveBeenCalled();
expect(player.stop).toHaveBeenCalledWith(true);
expect(player.stop).not.toHaveBeenCalled();
expect(connection.receiver.subscribe).toHaveBeenCalledWith(
"u1",
expect.objectContaining({ end: expect.any(Object) }),
@@ -642,7 +642,7 @@ describe("DiscordVoiceManager", () => {
turn?.sendInputAudio(Buffer.alloc(3840));

expect(realtimeSessionMock.handleBargeIn).toHaveBeenCalled();
expect(player.stop).toHaveBeenCalledWith(true);
expect(player.stop).not.toHaveBeenCalled();
expect(realtimeSessionMock.sendAudio).toHaveBeenCalled();
});
@@ -684,7 +684,7 @@ describe("DiscordVoiceManager", () => {
turn?.sendInputAudio(Buffer.alloc(3840));

expect(realtimeSessionMock.handleBargeIn).toHaveBeenCalled();
expect(player.stop).toHaveBeenCalledWith(true);
expect(player.stop).not.toHaveBeenCalled();
expect(realtimeSessionMock.sendAudio).toHaveBeenCalled();
});
@@ -964,6 +964,7 @@ describe("DiscordVoiceManager", () => {
realtime: {
model: "gpt-realtime-2",
voice: "cedar",
minBargeInAudioEndMs: 500,
providers: {
openai: { model: "provider-default", voice: "marin" },
},
@@ -981,7 +982,11 @@ describe("DiscordVoiceManager", () => {
providerConfigs: expect.objectContaining({
openai: { model: "provider-default", voice: "marin" },
}),
providerConfigOverrides: { model: "gpt-realtime-2", voice: "cedar" },
providerConfigOverrides: {
model: "gpt-realtime-2",
voice: "cedar",
minBargeInAudioEndMs: 500,
},
}),
);
});
@@ -41,6 +41,7 @@ const DISCORD_REALTIME_TALKBACK_DEBOUNCE_MS = 350;
const DISCORD_REALTIME_FALLBACK_TEXT = "I hit an error while checking that. Please try again.";
const DISCORD_REALTIME_PENDING_SPEAKER_CONTEXT_LIMIT = 32;
const DISCORD_REALTIME_LOG_PREVIEW_CHARS = 500;
const DISCORD_REALTIME_DEFAULT_MIN_BARGE_IN_AUDIO_END_MS = 250;

export type DiscordVoiceMode = "stt-tts" | "talk-buffer" | "bidi";
@@ -69,6 +70,9 @@ function formatRealtimeInterruptionLog(event: RealtimeVoiceBridgeEvent): string
if (event.type === "response.cancel") {
return `discord voice: realtime model interrupt requested ${event.direction}:${event.type}${detail}`;
}
if (event.type === "conversation.item.truncate.skipped") {
return `discord voice: realtime model interrupt ignored ${event.direction}:${event.type}${detail}`;
}
if (event.type === "conversation.item.truncate") {
return `discord voice: realtime model audio truncated ${event.direction}:${event.type}${detail}`;
}
@@ -260,7 +264,7 @@ export class DiscordRealtimeVoiceSession implements VoiceRealtimeSession {
realtimeConfig: this.realtimeConfig,
providerId: resolved.provider.id,
},
)}`,
)} minBargeInAudioEndMs=${resolveDiscordRealtimeMinBargeInAudioEndMs(this.realtimeConfig)}`,
);
const voiceSdk = loadDiscordVoiceSdk();
this.params.entry.player.on(voiceSdk.AudioPlayerStatus.Idle, this.playerIdleHandler);
@@ -323,7 +327,6 @@ export class DiscordRealtimeVoiceSession implements VoiceRealtimeSession {
return;
}
this.bridge?.handleBargeIn({ audioPlaybackActive: true });
this.clearOutputAudio();
}

isBargeInEnabled(): boolean {
@@ -516,10 +519,21 @@ function buildProviderConfigOverrides(
const overrides = {
...(realtimeConfig?.model ? { model: realtimeConfig.model } : {}),
...(realtimeConfig?.voice ? { voice: realtimeConfig.voice } : {}),
...(typeof realtimeConfig?.minBargeInAudioEndMs === "number"
? { minBargeInAudioEndMs: realtimeConfig.minBargeInAudioEndMs }
: {}),
};
return Object.keys(overrides).length > 0 ? overrides : undefined;
}

function resolveDiscordRealtimeMinBargeInAudioEndMs(
realtimeConfig: DiscordRealtimeVoiceConfig,
): number {
return typeof realtimeConfig?.minBargeInAudioEndMs === "number"
? realtimeConfig.minBargeInAudioEndMs
: DISCORD_REALTIME_DEFAULT_MIN_BARGE_IN_AUDIO_END_MS;
}

function buildDiscordRealtimeInstructions(params: {
mode: Exclude<DiscordVoiceMode, "stt-tts">;
instructions?: string;
@@ -703,7 +703,7 @@ describe("buildOpenAIRealtimeVoiceProvider", () => {
}),
),
);
bridge.setMediaTimestamp(1240);
bridge.setMediaTimestamp(1300);

bridge.handleBargeIn?.({ audioPlaybackActive: true });
@@ -714,7 +714,7 @@ describe("buildOpenAIRealtimeVoiceProvider", () => {
type: "conversation.item.truncate",
item_id: "item_1",
content_index: 0,
audio_end_ms: 240,
audio_end_ms: 300,
});
});
@@ -904,6 +904,7 @@ describe("buildOpenAIRealtimeVoiceProvider", () => {
"message",
Buffer.from(JSON.stringify({ type: "response.created", response: { id: "resp_1" } })),
);
bridge.setMediaTimestamp(1000);
socket.emit(
"message",
Buffer.from(
@@ -914,6 +915,7 @@ describe("buildOpenAIRealtimeVoiceProvider", () => {
}),
),
);
bridge.setMediaTimestamp(1300);

bridge.handleBargeIn?.({ audioPlaybackActive: true });
bridge.handleBargeIn?.({ audioPlaybackActive: true });
@@ -927,7 +929,106 @@ describe("buildOpenAIRealtimeVoiceProvider", () => {
expect(onEvent).toHaveBeenCalledWith({
direction: "client",
type: "conversation.item.truncate",
detail: "reason=barge-in audioEndMs=0",
detail: "reason=barge-in audioEndMs=300",
});
});

it("ignores zero-length playback barge-in without clearing audio", async () => {
const provider = buildOpenAIRealtimeVoiceProvider();
const onClearAudio = vi.fn();
const onEvent = vi.fn();
const bridge = provider.createBridge({
providerConfig: { apiKey: "sk-test" }, // pragma: allowlist secret
onAudio: vi.fn(),
onClearAudio,
onEvent,
});
const connecting = bridge.connect();
const socket = FakeWebSocket.instances[0];
if (!socket) {
throw new Error("expected bridge to create a websocket");
}

socket.readyState = FakeWebSocket.OPEN;
socket.emit("open");
socket.emit("message", Buffer.from(JSON.stringify({ type: "session.updated" })));
await connecting;
bridge.setMediaTimestamp(1000);
socket.emit(
"message",
Buffer.from(JSON.stringify({ type: "response.created", response: { id: "resp_1" } })),
);
socket.emit(
"message",
Buffer.from(
JSON.stringify({
type: "response.audio.delta",
item_id: "item_1",
delta: Buffer.from("assistant audio").toString("base64"),
}),
),
);

bridge.handleBargeIn?.({ audioPlaybackActive: true });

expect(onClearAudio).not.toHaveBeenCalled();
expect(parseSent(socket)).not.toContainEqual({ type: "response.cancel" });
expect(parseSent(socket).some((event) => event.type === "conversation.item.truncate")).toBe(
false,
);
expect(onEvent).toHaveBeenCalledWith({
direction: "client",
type: "conversation.item.truncate.skipped",
detail: "reason=barge-in audioEndMs=0 minAudioEndMs=250",
});
});

it("allows immediate playback barge-in when the minimum audio window is zero", async () => {
const provider = buildOpenAIRealtimeVoiceProvider();
const onClearAudio = vi.fn();
const bridge = provider.createBridge({
providerConfig: {
apiKey: "sk-test", // pragma: allowlist secret
minBargeInAudioEndMs: 0,
},
onAudio: vi.fn(),
onClearAudio,
});
const connecting = bridge.connect();
const socket = FakeWebSocket.instances[0];
if (!socket) {
throw new Error("expected bridge to create a websocket");
}

socket.readyState = FakeWebSocket.OPEN;
socket.emit("open");
socket.emit("message", Buffer.from(JSON.stringify({ type: "session.updated" })));
await connecting;
bridge.setMediaTimestamp(1000);
socket.emit(
"message",
Buffer.from(JSON.stringify({ type: "response.created", response: { id: "resp_1" } })),
);
socket.emit(
"message",
Buffer.from(
JSON.stringify({
type: "response.audio.delta",
item_id: "item_1",
delta: Buffer.from("assistant audio").toString("base64"),
}),
),
);

bridge.handleBargeIn?.({ audioPlaybackActive: true });

expect(onClearAudio).toHaveBeenCalledTimes(1);
expect(parseSent(socket)).toContainEqual({ type: "response.cancel" });
expect(parseSent(socket)).toContainEqual({
type: "conversation.item.truncate",
item_id: "item_1",
content_index: 0,
audio_end_ms: 0,
});
});
@@ -59,6 +59,7 @@ type OpenAIRealtimeVoiceProviderConfig = {
silenceDurationMs?: number;
prefixPaddingMs?: number;
interruptResponseOnInputAudio?: boolean;
minBargeInAudioEndMs?: number;
azureEndpoint?: string;
azureDeployment?: string;
azureApiVersion?: string;
@@ -73,6 +74,7 @@ type OpenAIRealtimeVoiceBridgeConfig = RealtimeVoiceBridgeCreateRequest & {
silenceDurationMs?: number;
prefixPaddingMs?: number;
interruptResponseOnInputAudio?: boolean;
minBargeInAudioEndMs?: number;
azureEndpoint?: string;
azureDeployment?: string;
azureApiVersion?: string;
@@ -84,6 +86,7 @@ const OPENAI_REALTIME_ACTIVE_RESPONSE_ERROR_PREFIX =
"Conversation already has an active response in progress:";
const OPENAI_REALTIME_NO_ACTIVE_RESPONSE_CANCEL_ERROR =
"Cancellation failed: no active response found";
const OPENAI_REALTIME_DEFAULT_MIN_BARGE_IN_AUDIO_END_MS = 250;

type RealtimeEvent = {
type: string;
@@ -177,12 +180,18 @@ function normalizeProviderConfig(
typeof raw?.interruptResponseOnInputAudio === "boolean"
? raw.interruptResponseOnInputAudio
: undefined,
minBargeInAudioEndMs: asNonNegativeInteger(raw?.minBargeInAudioEndMs),
azureEndpoint: trimToUndefined(raw?.azureEndpoint),
azureDeployment: trimToUndefined(raw?.azureDeployment),
azureApiVersion: trimToUndefined(raw?.azureApiVersion),
};
}

function asNonNegativeInteger(value: unknown): number | undefined {
const number = asFiniteNumber(value);
return number === undefined || number < 0 ? undefined : Math.floor(number);
}

type OpenAIRealtimeApiKeyResolution =
| { status: "available"; value: string }
| { status: "missing" };
@@ -815,6 +824,19 @@ class OpenAIRealtimeVoiceBridge implements RealtimeVoiceBridge {
responseStartTimestamp !== null &&
assistantItemId !== null &&
(this.markQueue.length > 0 || options?.audioPlaybackActive === true);
const audioEndMs = shouldInterruptProvider
? Math.max(0, this.latestMediaTimestamp - responseStartTimestamp)
: null;
const minBargeInAudioEndMs =
this.config.minBargeInAudioEndMs ?? OPENAI_REALTIME_DEFAULT_MIN_BARGE_IN_AUDIO_END_MS;
if (audioEndMs !== null && audioEndMs < minBargeInAudioEndMs) {
this.config.onEvent?.({
direction: "client",
type: "conversation.item.truncate.skipped",
detail: `reason=barge-in audioEndMs=${audioEndMs} minAudioEndMs=${minBargeInAudioEndMs}`,
});
return;
}
if (
options?.audioPlaybackActive === true &&
this.responseActive &&
@@ -824,8 +846,6 @@ class OpenAIRealtimeVoiceBridge implements RealtimeVoiceBridge {
this.responseCancelInFlight = true;
}
if (shouldInterruptProvider) {
const elapsedMs = this.latestMediaTimestamp - responseStartTimestamp;
const audioEndMs = Math.max(0, elapsedMs);
this.sendEvent(
{
type: "conversation.item.truncate",
@@ -1074,6 +1094,7 @@ export function buildOpenAIRealtimeVoiceProvider(): RealtimeVoiceProviderPlugin
prefixPaddingMs: config.prefixPaddingMs,
interruptResponseOnInputAudio:
req.interruptResponseOnInputAudio ?? config.interruptResponseOnInputAudio,
minBargeInAudioEndMs: config.minBargeInAudioEndMs,
azureEndpoint: config.azureEndpoint,
azureDeployment: config.azureDeployment,
azureApiVersion: config.azureApiVersion,
File diff suppressed because one or more lines are too long
@@ -150,6 +150,8 @@ export type DiscordVoiceRealtimeConfig = {
consultPolicy?: DiscordVoiceRealtimeConsultPolicy;
/** Allow Discord speaker-start events to interrupt active realtime playback. */
bargeIn?: boolean;
/** Minimum assistant playback duration before a barge-in truncates audio. Default: 250ms; set 0 for immediate interruption. */
minBargeInAudioEndMs?: number;
/** Debounce window before buffered transcripts are sent to the OpenClaw agent. */
debounceMs?: number;
/** Provider-specific realtime voice config keyed by provider id. */
@@ -551,6 +551,7 @@ const DiscordVoiceRealtimeSchema = z
toolPolicy: DiscordVoiceRealtimeToolPolicySchema.optional(),
consultPolicy: DiscordVoiceRealtimeConsultPolicySchema.optional(),
bargeIn: z.boolean().optional(),
minBargeInAudioEndMs: z.number().int().min(0).max(10_000).optional(),
debounceMs: z.number().int().positive().max(10_000).optional(),
providers: z.record(z.string(), z.record(z.string(), z.unknown()).optional()).optional(),
})