Conversation

@lordofriver
Contributor

Description

Fixes incorrect segment splitting when using the `max_len` parameter with multi-byte UTF-8 characters (e.g., Chinese, Japanese, Arabic).

Problem

The current implementation in `whisper_wrap_segment()` uses `strlen()` to count bytes, not UTF-8 characters. When splitting segments at `max_len`, this can break multi-byte UTF-8 characters, resulting in invalid sequences displayed as `�` (U+FFFD replacement character).
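
To make the mismatch concrete: CJK characters occupy three bytes each in UTF-8, so `strlen()` overcounts them threefold. A minimal sketch of byte counting versus code-point counting (the `utf8_len` helper below is illustrative only, not code from this patch):

```cpp
#include <cstdio>
#include <cstring>

// Count UTF-8 code points: every byte that is NOT a continuation byte
// (bit pattern 10xxxxxx) starts a new character.
static size_t utf8_len(const char * s) {
    size_t n = 0;
    for (; *s; ++s) {
        if (((unsigned char) *s & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

int main() {
    const char * text = "这个时候"; // 4 characters, 12 bytes in UTF-8
    printf("strlen:   %zu\n", strlen(text));   // 12 -- bytes
    printf("utf8_len: %zu\n", utf8_len(text)); // 4  -- characters
    return 0;
}
```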

Example (Chinese text)

Before fix:

{"Text": "这个时候面试官会给应聘者一定的时间,由应�"},
{"Text": "�者面试结束之后,面试人立即整理记录,根据求"}

After fix:

{"Text": "这个时候面试官会给应聘者一定的时间,由应聘"},
{"Text": "者面试结束之后,面试人立即整理记录,根据求"}

In Addition

Note that this does change the meaning of the `max_len` parameter: it now limits segments by UTF-8 character count rather than by byte count.
To be honest, I only just found the problem; the code modification is Claude's recommendation, and I tested it.

@ggerganov merged commit f53dc74 into ggml-org:master on Jan 16, 2026
55 of 66 checks passed
bygreencn added a commit to bygreencn/whisper.cpp that referenced this pull request Jan 20, 2026
* ggerganov/master: (121 commits)
  whisper : Fix UTF-8 character boundary issue in segment wrapping (max_len) (ggml-org#3592)
  release : v1.8.3
  benches : update
  sync : ggml
  CUDA : fix unused argument when USE_CUDA_GRAPH=OFF (llama/18800)
  vulkan: change memory_logger to be controlled by an env var (llama/18769)
  vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (llama/18678)
  vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (llama/18763)
  Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (llama/18749)
  talk-llama : sync llama.cpp
  sync : ggml
  opencl: add SOFTPLUS op support (llama/18726)
  HIP: adjust RDNA3.5 MMQ kernel selection logic (llama/18666)
  cmake : update blas logic (llama/18205)
  Corrected: changed s13 = src1->nb[3] instead of nb[2] (llama/18724)
  opencl: add EXPM1 op (llama/18704)
  Updates to webgpu get_memory (llama/18707)
  llama: use host memory if device reports 0 memory (llama/18587)
  ggml-webgpu: Fix GGML_MEM_ALIGN to 8 for emscripten. (llama/18628)
  ggml webgpu: initial flashattention implementation (llama/18610)
  ...