fix(text-splitters): prevent start_index=-1 in token-based splitters by shtse8 · Pull Request #35199 · langchain-ai/langchain

Kyle Tse (shtse8) · 2026-02-12T23:51:22Z

Description

TextSplitter.create_documents() calculates a character offset for text.find() using self._chunk_overlap. For token-based splitters like TokenTextSplitter, chunk_overlap is expressed in tokens rather than characters, so the offset can overshoot the actual chunk position and text.find() returns -1.

Root Cause

In create_documents():

offset = index + previous_chunk_len - self._chunk_overlap
index = text.find(chunk, max(0, offset))

previous_chunk_len is in characters (from len(chunk)), but self._chunk_overlap is in tokens for TokenTextSplitter. The mismatch means the search window can start after where the chunk actually appears, causing text.find() to miss and return -1.
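The mismatch can be reproduced without a real tokenizer. The following is a minimal sketch, not the library code: `split_by_words` is a hypothetical stand-in that treats each space-separated word as one token, and `start_indices` mirrors the two lines above, subtracting a token count from a character position.

```python
def split_by_words(text, chunk_size, chunk_overlap):
    """Toy token splitter (assumption): one token per space-separated word."""
    words = text.split(" ")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


def start_indices(text, chunks, chunk_overlap):
    """Mirror of the offset logic quoted above, without any fallback."""
    index = 0
    previous_chunk_len = 0
    result = []
    for chunk in chunks:
        # previous_chunk_len is in characters, chunk_overlap is in tokens.
        offset = index + previous_chunk_len - chunk_overlap
        index = text.find(chunk, max(0, offset))
        result.append(index)
        previous_chunk_len = len(chunk)
    return result


text = "alpha bravo charlie delta echo foxtrot golf hotel"
chunks = split_by_words(text, chunk_size=4, chunk_overlap=2)
indices = start_indices(text, chunks, chunk_overlap=2)
print(indices)  # [0, -1, 26]
```

The second chunk really starts at character 12, but the search begins at 0 + 25 - 2 = 23, so text.find() misses it.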

Fix

Add a fallback: when the initial text.find() returns -1, retry the search from the beginning of the text:

index = text.find(chunk, max(0, offset))
if index == -1:
    index = text.find(chunk)

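This is safe as long as each chunk is a verbatim substring of the source text: a search from position 0 then always finds it. The fallback returns the first occurrence, so if the same chunk text appears more than once, start_index may point to an earlier copy.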
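To see the fallback in isolation, it can be wrapped in a small helper. This is a sketch under assumptions: `locate_chunk` is a hypothetical name, not a function in the library, and the sample strings are made up for illustration.

```python
def locate_chunk(text: str, chunk: str, offset: int) -> int:
    """Search from offset first; if that misses, retry from the start."""
    index = text.find(chunk, max(0, offset))
    if index == -1:
        index = text.find(chunk)
    return index


text = "alpha bravo charlie delta echo foxtrot golf hotel"
chunk = "charlie delta echo foxtrot"
overshoot = 23  # a search start that lies past the chunk's real position

naive = text.find(chunk, overshoot)           # misses, returns -1
fixed = locate_chunk(text, chunk, overshoot)  # falls back, returns 12

# With repeated text, the fallback reports the first copy.
repeated = locate_chunk("ab ab ab", "ab", 9)  # returns 0
```

The last call illustrates the trade-off: the result is always a valid position, but with duplicated text it is not necessarily the position the chunk was cut from.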

Tests

Added test_token_text_splitter_start_index_no_negative which:

- Uses TokenTextSplitter with chunk_size=10, chunk_overlap=5, add_start_index=True
- Verifies all start_index values are >= 0
- Verifies each start_index correctly locates the chunk in the source text
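The two checks can be expressed as a small helper. This is a sketch rather than the test added in this PR: `check_start_indices` is a hypothetical name, and it takes plain (chunk, start_index) pairs so it runs without tiktoken. In the real test the pairs would presumably come from each document's page_content and metadata["start_index"].

```python
def check_start_indices(text, located_chunks):
    """Assert that every (chunk, start_index) pair points at the chunk in text."""
    for chunk, start_index in located_chunks:
        # A value of -1 means text.find() failed to locate the chunk.
        assert start_index >= 0, f"negative start_index for {chunk!r}"
        # The slice at start_index must reproduce the chunk exactly.
        found = text[start_index:start_index + len(chunk)]
        assert found == chunk, f"start_index {start_index} does not locate {chunk!r}"


text = "alpha bravo charlie delta echo foxtrot golf hotel"
good_pairs = [
    ("alpha bravo charlie delta", 0),
    ("charlie delta echo foxtrot", 12),
    ("echo foxtrot golf hotel", 26),
]
check_start_indices(text, good_pairs)  # passes silently
```

Passing a pair such as ("charlie delta echo foxtrot", -1) raises AssertionError, which is the failure the new test guards against.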

Fixes langchain-ai#29884

`TextSplitter.create_documents` calculates a character offset for `text.find()` using `self._chunk_overlap`. For token-based splitters like `TokenTextSplitter`, `chunk_overlap` is expressed in tokens rather than characters, so the offset can overshoot the actual position and `text.find()` returns -1.

Add a fallback: when the initial `text.find()` misses (returns -1), retry from the beginning of the text. This ensures every chunk gets a valid `start_index` that correctly locates it within the source.

github-actions bot added the external, text-splitters, and fix labels Feb 12, 2026

fix: resolve lint failures in test (F841, PLC0415)

3871573

- Move TokenTextSplitter import to top-level imports
- Remove unused tiktoken variable assignment (use pytest.importorskip directly)
- Remove __import__('pytest') workaround (pytest already imported at top)

github-actions bot added the size: XS < 50 LOC label Mar 9, 2026

Kyle Tse (shtse8) wants to merge 2 commits into langchain-ai:master from shtse8:fix/text-splitter-start-index
