Conversation

@lordofriver
Contributor

Description

Fixes incorrect segment splitting when using the `max_len` parameter with multi-byte UTF-8 characters (e.g., Chinese, Japanese, Arabic).

Problem

The current implementation in `whisper_wrap_segment()` uses `strlen()` to count bytes, not UTF-8 characters. When splitting segments at `max_len`, this can break multi-byte UTF-8 characters, resulting in invalid sequences displayed as `�` (U+FFFD replacement character).
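
To make the mismatch concrete: CJK characters occupy three bytes each in UTF-8, so `strlen()` overcounts them threefold. A minimal sketch of byte counting versus code-point counting (the `utf8_len` helper below is illustrative only, not code from this patch):

```cpp
#include <cstdio>
#include <cstring>

// Count UTF-8 code points: every byte that is NOT a continuation byte
// (bit pattern 10xxxxxx) starts a new character.
static size_t utf8_len(const char * s) {
    size_t n = 0;
    for (; *s; ++s) {
        if (((unsigned char) *s & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

int main() {
    const char * text = "这个时候"; // 4 characters, 12 bytes in UTF-8
    printf("strlen:   %zu\n", strlen(text));   // 12 -- bytes
    printf("utf8_len: %zu\n", utf8_len(text)); // 4  -- characters
    return 0;
}
```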

Example (Chinese text)

Before fix:

{"Text": "这个时候面试官会给应聘者一定的时间,由应�"},
{"Text": "�者面试结束之后,面试人立即整理记录,根据求"}

After fix:

{"Text": "这个时候面试官会给应聘者一定的时间,由应聘"},
{"Text": "者面试结束之后,面试人立即整理记录,根据求"}

In Addition

Note that this does change the meaning of the `max_len` parameter: it now limits segments by UTF-8 character count rather than by byte count.
To be honest, I only just found the problem; the code modification is Claude's recommendation, and I tested it.

@ggerganov merged commit f53dc74 into ggml-org:master on Jan 16, 2026
55 of 66 checks passed
bygreencn added a commit to bygreencn/whisper.cpp that referenced this pull request Jan 20, 2026
* ggerganov/master: (121 commits)
  whisper : Fix UTF-8 character boundary issue in segment wrapping (max_len) (ggml-org#3592)
  release : v1.8.3
  benches : update
  sync : ggml
  CUDA : fix unused argument when USE_CUDA_GRAPH=OFF (llama/18800)
  vulkan: change memory_logger to be controlled by an env var (llama/18769)
  vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (llama/18678)
  vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (llama/18763)
  Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (llama/18749)
  talk-llama : sync llama.cpp
  sync : ggml
  opencl: add SOFTPLUS op support (llama/18726)
  HIP: adjust RDNA3.5 MMQ kernel selection logic (llama/18666)
  cmake : update blas logic (llama/18205)
  Corrected: changed s13 = src1->nb[3] instead of nb[2] (llama/18724)
  opencl: add EXPM1 op (llama/18704)
  Updates to webgpu get_memory (llama/18707)
  llama: use host memory if device reports 0 memory (llama/18587)
  ggml-webgpu: Fix GGML_MEM_ALIGN to 8 for emscripten. (llama/18628)
  ggml webgpu: initial flashattention implementation (llama/18610)
  ...