Fix word-level timestamp overflow in Whisper chunked transcription by neonwatty · Pull Request #1486 · huggingface/transformers.js

neonwatty · 2025-12-15T17:13:30Z

This PR aims to fix #1357

The Problem

#1357 reports a problem I've had in usage as well - that when using chunk_length_s: 30 with return_timestamps: "word", timestamps can exceed the actual audio duration. For example, a 60s audio file produces timestamps up to 69.98s.

Root cause:* Digging in here's what I found: the model outputs timestamps up to ~29.98s (the maximum representable value given time_precision = 30/1500 = 0.02). For final chunks shorter than 30s, these raw timestamps are added to the accumulated time_offset, causing overflow.

Proposed Solution

I think this simple solution works: clamp word-level timestamps to the actual chunk_len from stride metadata before applying time_offset:

if (current_chunk_len !== null) {
    raw_start = Math.min(raw_start, current_chunk_len);
    raw_end = Math.min(raw_end, current_chunk_len);
}

Why not match Python's approach?

There's a similar reported issue upstream with Python transformers; this fix crops cross-attention matrices before DTW alignment (PR #25607).

To my knowledge, here in JS we receive pre-computed token_timestamps from ONNX models, so we cannot modify the DTW computation.

Hence clamping at the tokenizer level seems to be the appropriate fix here.

Testing

Added unit test for issue whisper-large-v3-turbo_timestamped has broken timestamps #1357
Validated edge cases: exactly 30s, <30s, and long multi-chunk audio
All 1891 existing tests pass
Segment-level timestamps (return_timestamps: true) unaffected

…uggingface#1357) Clamp word-level timestamps to the actual chunk_len to prevent timestamps from exceeding audio duration when the model outputs timestamps near the 30s boundary for shorter final chunks.

Fix word-level timestamp overflow in Whisper chunked transcription (h…

b75d729

…uggingface#1357) Clamp word-level timestamps to the actual chunk_len to prevent timestamps from exceeding audio duration when the model outputs timestamps near the 30s boundary for shorter final chunks.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix word-level timestamp overflow in Whisper chunked transcription#1486

Fix word-level timestamp overflow in Whisper chunked transcription#1486
neonwatty wants to merge 1 commit intohuggingface:mainfrom
neonwatty:main

neonwatty commented Dec 15, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

neonwatty commented Dec 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

The Problem

Proposed Solution

Why not match Python's approach?

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

neonwatty commented Dec 15, 2025 •

edited

Loading