fix(text-splitters): prevent start_index=-1 in token-based splitters #35199

Open
Kyle Tse (shtse8) wants to merge 2 commits into langchain-ai:master from shtse8:fix/text-splitter-start-index
Description

Fixes #29884

TextSplitter.create_documents() calculates a character offset for text.find() using self._chunk_overlap. For token-based splitters like TokenTextSplitter, chunk_overlap is expressed in tokens rather than characters, so the offset can overshoot the actual chunk position and text.find() returns -1.

Root Cause

In create_documents():

offset = index + previous_chunk_len - self._chunk_overlap
index = text.find(chunk, max(0, offset))

previous_chunk_len is in characters (from len(chunk)), but self._chunk_overlap is in tokens for TokenTextSplitter. A token usually spans several characters, so subtracting a token count moves the offset back by too few characters. The search window can therefore start after the position where the chunk actually begins, causing text.find() to miss and return -1.
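The unit mismatch can be reproduced without any tokenizer. The following is a minimal sketch, not the library code: `split_by_words` is a hypothetical stand-in for a token-based splitter that treats each whitespace-separated word as one token, and `start_indices` mirrors the two lines quoted above with no fallback.

```python
def split_by_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Hypothetical toy splitter: one 'token' is one space-separated word."""
    words = text.split(" ")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


def start_indices(text: str, chunks: list[str], chunk_overlap: int) -> list[int]:
    """Apply the offset arithmetic from create_documents(), without a fallback."""
    indices = []
    index = 0
    previous_chunk_len = 0
    for chunk in chunks:
        # chunk_overlap counts words here, but it is subtracted from characters.
        offset = index + previous_chunk_len - chunk_overlap
        index = text.find(chunk, max(0, offset))
        indices.append(index)
        previous_chunk_len = len(chunk)
    return indices


text = "alpha bravo charlie delta echo foxtrot golf hotel"
chunks = split_by_words(text, chunk_size=4, chunk_overlap=2)
indices = start_indices(text, chunks, chunk_overlap=2)
print(indices)  # [0, -1, 26]
```

The second chunk really starts at character 12, but the search begins at 0 + 25 - 2 = 23 because the two-word overlap was subtracted as if it were two characters.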

Fix

Add a fallback: when the initial text.find() returns -1, retry the search from the beginning of the text:

index = text.find(chunk, max(0, offset))
if index == -1:
    index = text.find(chunk)

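This is safe as long as each chunk is a verbatim substring of the source text: searching from position 0 then always finds an occurrence. If the same chunk text appears more than once, the fallback returns the first occurrence, which may be earlier than the chunk's actual position.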
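As an illustration of the fallback in isolation, here is a small sketch. The helper name `find_with_fallback` is hypothetical and does not exist in the library; it only wraps the three lines shown above.

```python
def find_with_fallback(text: str, chunk: str, offset: int) -> int:
    """Search from the estimated offset first, then from the start of the text."""
    index = text.find(chunk, max(0, offset))
    if index == -1:
        # The offset overshot the chunk; retry over the whole text.
        index = text.find(chunk)
    return index


text = "alpha bravo charlie delta echo foxtrot golf hotel"
chunk = "charlie delta echo foxtrot"

missed = text.find(chunk, 23)                    # offset past the chunk start
recovered = find_with_fallback(text, chunk, 23)  # fallback finds it
print(missed, recovered)  # -1 12

# With repeated text the fallback yields the first occurrence.
repeated = find_with_fallback("ab ab ab", "ab", 7)
print(repeated)  # 0
```

The last call shows the trade-off of this design: the result is always non-negative for a chunk that occurs in the text, but for repeated content it points at the earliest match.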

Tests

Added test_token_text_splitter_start_index_no_negative which:

  • Uses TokenTextSplitter with chunk_size=10, chunk_overlap=5, add_start_index=True
  • Verifies all start_index values are >= 0
  • Verifies each start_index correctly locates the chunk in the source text
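The two checks listed above can be sketched as a standalone helper. This is an assumption about the shape of the assertions, not the test added in this PR; `check_start_indices` is a hypothetical name, and it takes plain `(chunk, start_index)` pairs so that it runs without a tokenizer installed.

```python
def check_start_indices(text: str, located_chunks: list[tuple[str, int]]) -> None:
    """Assert every start index is non-negative and points at its chunk."""
    for chunk, start_index in located_chunks:
        assert start_index >= 0, f"negative start_index for chunk {chunk!r}"
        found = text[start_index:start_index + len(chunk)]
        assert found == chunk, f"start_index {start_index} does not locate {chunk!r}"


text = "alpha bravo charlie delta echo foxtrot golf hotel"
check_start_indices(
    text,
    [
        ("alpha bravo charlie delta", 0),
        ("charlie delta echo foxtrot", 12),
        ("echo foxtrot golf hotel", 26),
    ],
)
```

With a real splitter, the pairs would be built from each returned document, for example `(doc.page_content, doc.metadata["start_index"])` for the documents from `create_documents([text])`.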

github-actions bot added the `external`, `text-splitters`, and `fix` labels on Feb 12, 2026
- Move TokenTextSplitter import to top-level imports
- Remove unused tiktoken variable assignment (use pytest.importorskip directly)
- Remove __import__('pytest') workaround (pytest already imported at top)
github-actions bot added the `size: XS` (< 50 LOC) label on Mar 9, 2026
Development

Successfully merging this pull request may close these issues.

TokenTextSplitter start indices are sometimes -1