Avoid full vocab clone in get_vocab_size() #2074

Open
eunseo9311 wants to merge 1 commit into huggingface:main from eunseo9311:fix/get-vocab-size-perf
@eunseo9311 eunseo9311 commented May 27, 2026

get_vocab_size(true) was calling get_vocab(true).len(), which clones the entire model vocabulary into a new HashMap just to count its entries.
For large vocabularies (e.g. LLaMA-3, with 128k tokens) this allocates ~10MB on every call.

The fix computes base + added.len() - overlapping directly, where overlapping is the number of added tokens already present in the model vocabulary, checked via token_to_id. No allocation is needed.

This PR applies the same fix to the Node binding, which had the identical pattern.

It also adds a test covering the overlap scenario (a token present in both the model vocab and added_vocabulary).
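
For readers skimming the thread, here is a minimal sketch of the counting approach described above, contrasted with the clone-based one. It is not the code from this PR: the `Model` and `AddedVocabulary` structs, their fields, and the function names `vocab_size_by_clone`, `vocab_size_no_alloc`, and `example` are hypothetical stand-ins that only mirror the shape of the problem (a base vocab, an added-token table, and a `token_to_id` lookup).

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the model's base vocabulary (not the library type).
struct Model {
    vocab: HashMap<String, u32>,
}

impl Model {
    /// Look up a token in the base vocabulary without copying anything.
    fn token_to_id(&self, token: &str) -> Option<u32> {
        self.vocab.get(token).copied()
    }

    /// Number of entries in the base vocabulary.
    fn vocab_size(&self) -> usize {
        self.vocab.len()
    }
}

/// Hypothetical stand-in for the table of added tokens.
struct AddedVocabulary {
    added: HashMap<String, u32>,
}

/// Clone-based count: builds a merged map only to read its length.
fn vocab_size_by_clone(model: &Model, added: &AddedVocabulary) -> usize {
    let mut merged = model.vocab.clone();
    merged.extend(added.added.iter().map(|(token, id)| (token.clone(), *id)));
    merged.len()
}

/// Allocation-free count: base + added - overlapping.
fn vocab_size_no_alloc(model: &Model, added: &AddedVocabulary) -> usize {
    // Added tokens that already exist in the base vocab must not be counted twice.
    let overlapping = added
        .added
        .keys()
        .filter(|token| model.token_to_id(token.as_str()).is_some())
        .count();
    model.vocab_size() + added.added.len() - overlapping
}

/// Small fixture: "<pad>" appears in both the base vocab and the added tokens.
fn example() -> (Model, AddedVocabulary) {
    let vocab = HashMap::from([
        ("hello".to_string(), 0),
        ("world".to_string(), 1),
        ("<pad>".to_string(), 2),
    ]);
    let added = HashMap::from([
        ("<pad>".to_string(), 2),
        ("<eos>".to_string(), 3),
    ]);
    (Model { vocab }, AddedVocabulary { added })
}
```

With the fixture from `example()`, both functions should agree: three base tokens plus two added tokens minus one overlapping token gives four. The subtraction cannot underflow because the overlap count is at most `added.len()`.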

@HuggingFaceDocBuilderDev
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@eunseo9311 eunseo9311 force-pushed the fix/get-vocab-size-perf branch from 59d1e77 to 0fa29b9 Compare May 28, 2026 08:13