Avoid full vocab clone in get_vocab_size() by eunseo9311 · Pull Request #2074 · huggingface/tokenizers

eunseo9311 · 2026-05-27T03:57:26Z

get_vocab_size(true) was calling get_vocab(true).len(), which clones the entire model vocabulary into a new HashMap just to count its entries.
For large vocabularies (e.g. LLaMA-3 with 128k tokens), this allocates ~10 MB on every call.

Fix by computing base + added.len() - overlapping directly, where base is the model's vocabulary size and overlapping counts the added tokens that are already present in the model (checked via token_to_id). This requires zero allocation.
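The counting approach can be sketched as follows. This is a minimal, standalone illustration and not the code from this PR: `ModelVocab`, `vocab_size_by_cloning`, and `vocab_size_by_counting` are hypothetical names, and plain `HashMap`s stand in for the model vocabulary and the added vocabulary.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a tokenizer model's base vocabulary.
struct ModelVocab {
    token_to_id: HashMap<String, u32>,
}

impl ModelVocab {
    /// Number of tokens in the base model vocabulary.
    fn get_vocab_size(&self) -> usize {
        self.token_to_id.len()
    }

    /// Look up a token's id without copying anything.
    fn token_to_id(&self, token: &str) -> Option<u32> {
        self.token_to_id.get(token).copied()
    }
}

/// Old pattern (assumed shape): build a merged copy of the whole
/// vocabulary, then take its length.
fn vocab_size_by_cloning(model: &ModelVocab, added: &HashMap<String, u32>) -> usize {
    let mut merged = model.token_to_id.clone();
    merged.extend(added.iter().map(|(token, id)| (token.clone(), *id)));
    merged.len()
}

/// New pattern: base + added.len() - overlapping, with no new map.
fn vocab_size_by_counting(model: &ModelVocab, added: &HashMap<String, u32>) -> usize {
    // Added tokens that the model already knows must not be counted twice.
    let overlapping = added
        .keys()
        .filter(|token| model.token_to_id(token.as_str()).is_some())
        .count();
    model.get_vocab_size() + added.len() - overlapping
}
```

Both functions should return the same count for any input; the second only iterates over the added tokens, which are typically far fewer than the model vocabulary.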

Applies the same fix to the Node binding, which had the identical pattern.

Adds a test covering the overlap scenario (token in both model vocab and added_vocabulary).

HuggingFaceDocBuilderDev · 2026-05-27T13:02:19Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Avoid full vocab clone in get_vocab_size()

0fa29b9

eunseo9311 force-pushed the fix/get-vocab-size-perf branch from 59d1e77 to 0fa29b9 on May 28, 2026 08:13

eunseo9311 wants to merge 1 commit into huggingface:main from eunseo9311:fix/get-vocab-size-perf
