feat: Add STT utterance_end_latency metric for streaming STT #4966
kimdwkimdw wants to merge 2 commits into livekit:main
Track the audio push timeline in RecognizeStream.push_frame() and compute the wall-clock delay from audio push to FINAL_TRANSCRIPT receipt. The result is emitted as STTMetrics.utterance_end_latency, a pure STT-engine latency measurement analogous to LLM ttft and TTS ttfb.
Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 41f770b803
```python
audio_pos = end_time - self._start_time_offset
push_wall_clock = self._lookup_push_time(audio_pos)
if push_wall_clock is not None:
    utterance_end_latency = max(0.0, time.time() - push_wall_clock)
    self._prune_push_timestamps(audio_pos)
```
Bound push timeline when end timestamps are missing
The new latency timeline is only pruned in this FINAL_TRANSCRIPT path when end_time > 0 and lookup succeeds, while push_frame() appends on every audio frame. That means long-running streams with no final timestamps (or long silence before any final transcript) keep growing _audio_push_wall_times/_audio_push_timestamps without bound, which can steadily increase memory usage in production sessions. Add a pruning/capping path that does not depend on end_time availability.
```python
if self._last_user_final_stt_request_id:
    utterance_end_latency = self._stt_utterance_latency_by_request_id.pop(
        self._last_user_final_stt_request_id,
        None,
    )
```
Purge stale keyed STT latencies beyond the last transcript id
This logic pops only the latest transcript request ID, so earlier keyed latency entries collected in the same user turn are never removed from _stt_utterance_latency_by_request_id. Streaming providers can emit multiple final transcripts with different IDs before turn completion (for example, NVIDIA assigns request_id from each response object), so the dict can grow over time and retain stale per-request state. Clear or bound older keyed entries when a turn is committed.
```python
utterance_end_latency: float | None = None
if ev.alternatives:
    end_time = ev.alternatives[0].end_time
    if end_time > 0.0:
        audio_pos = end_time - self._start_time_offset
        push_wall_clock = self._lookup_push_time(audio_pos)
        if push_wall_clock is not None:
            ...
```
Cumulative audio timeline not reset on STT stream retry causes incorrect utterance_end_latency
After a streaming STT retry/reconnect in _main_task, the _cumulative_audio_seconds counter and the push timeline lists are never reset. The STT provider's new connection reports end_time starting from 0 (relative to the new session), plus start_time_offset. The computation audio_pos = end_time - self._start_time_offset at livekit-agents/livekit/agents/stt/stt.py:429 recovers the provider's raw position (a small number), but _audio_push_wall_times contains cumulative values from all audio ever pushed (a much larger number).
Root Cause and Impact
Consider a retry scenario:

- 10s of audio pushed: `_cumulative_audio_seconds = 10.0`, timeline = `[0.2, 0.4, ..., 10.0]`
- `_run()` fails with `APIError`; retry is triggered at livekit-agents/livekit/agents/stt/stt.py:330-358, and `_start_time_offset` is increased by the elapsed wall-clock time
- A new `_run()` starts; the provider receives only new audio and reports `end_time` relative to the new session (e.g. 3.0) plus `start_time_offset`
- `audio_pos = end_time - start_time_offset = 3.0`, but the timeline entries start at ~10.0+
- `bisect_left([10.2, 10.4, ...], 3.0)` returns 0, mapping to the push timestamp of the very first frame
- `utterance_end_latency = now() - very_old_timestamp`: a spuriously large value (many seconds)
Impact: After any STT stream retry, the reported utterance_end_latency will be wildly incorrect (inflated) rather than None. The timeline and _cumulative_audio_seconds should be reset (or an offset adjustment applied) when _main_task retries.
Prompt for agents
In livekit-agents/livekit/agents/stt/stt.py, the _cumulative_audio_seconds counter and the _audio_push_wall_times / _audio_push_timestamps lists are never reset when the stream retries via _main_task (lines 330-358). After a retry, the STT provider's reported end_time (minus start_time_offset) reflects the position in the NEW connection (starting near 0), but _cumulative_audio_seconds still holds the total from all audio ever pushed. Fix this by resetting the timeline state at the beginning of each retry iteration in _main_task. Specifically, inside the while loop at line 330 (before calling self._run()), add: self._cumulative_audio_seconds = 0.0, self._audio_push_wall_times.clear(), self._audio_push_timestamps.clear(). Alternatively, record the cumulative audio offset at the start of each _run() and adjust the audio_pos calculation accordingly.
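The reset option from the prompt above can be sketched as follows. The attribute names come from this PR; `RecognizeStreamState` and `_reset_push_timeline` are hypothetical framing, and the real call site would be the top of each retry iteration in `_main_task`, before `self._run()`:

```python
class RecognizeStreamState:
    """Illustrative container for the timeline state that must be reset on retry."""

    def __init__(self) -> None:
        self._cumulative_audio_seconds = 0.0
        self._audio_push_wall_times: list[float] = []
        self._audio_push_timestamps: list[float] = []

    def _reset_push_timeline(self) -> None:
        # Called before each _run() attempt so that audio positions reported
        # by the NEW provider connection map onto a fresh timeline instead of
        # the cumulative values from all audio ever pushed.
        self._cumulative_audio_seconds = 0.0
        self._audio_push_wall_times.clear()
        self._audio_push_timestamps.clear()
```

The alternative mentioned in the prompt, recording a cumulative-audio offset per `_run()` and subtracting it in the `audio_pos` computation, preserves latency measurement across the reconnect boundary at the cost of slightly more bookkeeping.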
```python
if isinstance(ev, STTMetrics) and ev.utterance_end_latency is not None:
    if ev.request_id:
        self._stt_utterance_latency_by_request_id[ev.request_id] = ev.utterance_end_latency
    else:
        self._last_unkeyed_stt_utterance_latency = ev.utterance_end_latency
```
_stt_utterance_latency_by_request_id dict accumulates unpopped entries across turns
Within a single user turn, multiple FINAL_TRANSCRIPT events may be received (each producing an STTMetrics with utterance_end_latency), but on_final_transcript at livekit-agents/livekit/agents/voice/agent_activity.py:1390 overwrites _last_user_final_stt_request_id with only the latest request_id. When _user_turn_completed_task runs, it only pops the entry for the last request_id at line 1612, leaving all earlier entries orphaned in _stt_utterance_latency_by_request_id.
Detailed Explanation
For STT providers that use distinct request_id values per FINAL_TRANSCRIPT (rather than a stable session-level ID), the flow is:
- FINAL_TRANSCRIPT #1 (`request_id="r1"`): `_on_metrics_collected` stores `{"r1": latency1}`; `on_final_transcript` sets `_last_user_final_stt_request_id = "r1"`
- FINAL_TRANSCRIPT #2 (`request_id="r2"`): `_on_metrics_collected` stores `{"r1": latency1, "r2": latency2}`; `on_final_transcript` overwrites to `_last_user_final_stt_request_id = "r2"`
- `_user_turn_completed_task` pops only `"r2"`, leaving `"r1"` in the dict forever
Over many turns, the dict grows without bound. Each entry is small (a str key and a float value), so the leak is slow but unbounded.
Impact: Slow memory leak proportional to the number of intermediate FINAL_TRANSCRIPT events across all turns, for providers that use per-response request IDs.
Prompt for agents
In livekit-agents/livekit/agents/voice/agent_activity.py, the _stt_utterance_latency_by_request_id dict (line 132) accumulates entries that are never cleaned up when multiple FINAL_TRANSCRIPT events occur per turn. To fix, either: (1) clear the entire dict after each turn in _user_turn_completed_task around line 1627 where _last_user_final_stt_request_id is reset to None (add self._stt_utterance_latency_by_request_id.clear()), or (2) store only the latest keyed latency (as a single value) instead of a dict, since only the last FINAL_TRANSCRIPT per turn is used.
Motivation

LiveKit's `STTMetrics` includes a `duration` field, but the official observability docs state:

This leaves a gap: every other pipeline component has a responsiveness metric (`ttft` for LLM, `ttfb` for TTS), but streaming STT has none. Operators cannot measure, monitor, or compare streaming STT engine performance.

Solution

This PR adds `utterance_end_latency` to `STTMetrics`: the wall-clock delay from when the audio at the transcript's `end_time` is locally enqueued in `RecognizeStream` to when the `FINAL_TRANSCRIPT` is received.

| Component | Responsiveness metric |
| --- | --- |
| LLM | `ttft` |
| TTS | `ttfb` |
| Streaming STT (this PR) | `utterance_end_latency` |

This metric is provider-agnostic: it works for any streaming STT plugin that populates `end_time` on speech alternatives.

How it works
Why bisect on `end_time`, not "last pushed frame"?

In streaming STT, audio flows continuously and silence frames are often enqueued after the utterance has already ended. By the time `FINAL_TRANSCRIPT` arrives, later silence chunks may already exist. Using the latest enqueue time (wall_t5) would mostly measure "recent silence enqueue -> FINAL receive", which is not useful. The bisect approach maps FINAL to the chunk containing `end_time` (chunk3 here), which is the correct anchor for this metric.

Implementation details:

- `push_frame()` appends `(cumulative_audio_seconds, wall_clock_time)` after local enqueue to `_input_ch`
- On `FINAL_TRANSCRIPT`, uses `bisect_left` to find the wall-clock time matching the transcript's `end_time`
- `utterance_end_latency = now() - matched_push_time`
- Entries up to `end_time` are pruned to keep memory bounded

`utterance_end_latency` vs `transcription_delay`

These metrics measure fundamentally different things:

| | `transcription_delay` | `utterance_end_latency` |
| --- | --- | --- |
| Measured in | `agent_activity.py` (voice pipeline) | `stt.py` (base `RecognizeStream`) |

`transcription_delay` tells you how long the user waited after they stopped speaking. `utterance_end_latency` tells you how quickly FINAL is returned after the relevant audio is locally enqueued, minimizing VAD/EOU coupling.

Changes
- `stt/stt.py`: push timeline in `push_frame()`, bisect-based lookup on FINAL_TRANSCRIPT, metric emission, timeline pruning
- `metrics/base.py`: `utterance_end_latency: float | None` field on `STTMetrics` (default `None`)
- `metrics/utils.py`: `utterance_end_latency` in structured metrics output when present
- `voice/agent_activity.py`: `utterance_end_latency` from STT metrics events, attached to the per-turn `MetricsReport`
- `llm/chat_context.py`: `utterance_end_latency` field on the `MetricsReport` TypedDict
- `cli/cli.py`: `stt_utt_end` in console mode turn metrics
- `tests/test_agent_session.py`: `stt_metrics` event with `streamed=True`

Provider plugin fixes in this branch
- `livekit-plugins/livekit-plugins-deepgram/livekit/plugins/deepgram/stt.py`: fix the `SpeechData.end_time` mapping to use the last word end timestamp
- `livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py`: with `use_realtime=True`, populate FINAL `SpeechData.start_time`/`end_time` from `audio_start_ms`/`audio_end_ms`

Manual validation (console)
Validation was performed using the provided `AgentSession` harness and provider switching.

Validated combinations:

- `google.STT(use_streaming=True)`
- `deepgram.STT`
- `rtzr.STT`
- `openai.STT(use_realtime=True, model="gpt-4o-mini-transcribe")`

Validation Example