Description
Summary
When `ShardedNphdIndex.add()` is called with a batch of keys where one key already exists in the index, usearch raises `RuntimeError: Duplicate keys not allowed in high-level wrappers`. The error is propagated, but:
- All new keys in the batch are silently dropped — none are added
- The internal `size` counter increments for keys processed before the duplicate was detected

This leaves the index in an inconsistent state where `size` is wrong.
Versions
- `iscc-usearch==0.6.0`
- `usearch==2.24.0`
- Python 3.12, Windows
Reproduction
```python
import tempfile

import numpy as np

from iscc_usearch import ShardedNphdIndex

with tempfile.TemporaryDirectory() as tmpdir:
    idx = ShardedNphdIndex(path=tmpdir, max_dim=256)

    # Add keys 50 and 100
    idx.add([50, 100], [
        np.random.randint(0, 256, size=32, dtype=np.uint8),
        np.random.randint(0, 256, size=32, dtype=np.uint8),
    ])
    print(f"After initial add: size={idx.size}")  # 2

    # Batch: [10, 20, 50, 30, 100] — keys 50 and 100 already exist
    keys = [10, 20, 50, 30, 100]
    vecs = [np.random.randint(0, 256, size=32, dtype=np.uint8) for _ in range(5)]
    try:
        idx.add(keys, vecs)
    except RuntimeError as e:
        print(f"Error: {e}")

    print(f"Size after failure: {idx.size}")  # 4 (WRONG — should be 2)
    for k in [10, 20, 50, 30, 100]:
        print(f"  Key {k} exists: {idx.contains(k)}")
    # Only 50 and 100 exist — 10, 20, 30 were NOT added
```

Output:

```
After initial add: size=2
Error: Duplicate keys not allowed in high-level wrappers
Size after failure: 4
  Key 10 exists: False
  Key 20 exists: False
  Key 50 exists: True
  Key 30 exists: False
  Key 100 exists: True
```
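Until this is addressed, callers can split a batch into new vs. already-present keys before calling `add()` and only submit the new ones. The sketch below shows that filtering logic with a plain dict standing in for the index; the `split_batch` helper is hypothetical and not part of iscc-usearch (the real index would use `idx.contains()` and `idx.add()` instead):

```python
# Sketch of a duplicate-safe add: partition the batch into new vs. existing
# keys before calling add(). A plain dict stands in for the index so the
# logic is runnable without iscc-usearch installed.

def split_batch(index, keys, vecs):
    """Return (new_keys, new_vecs, dup_keys) for a candidate batch."""
    new_keys, new_vecs, dup_keys = [], [], []
    for k, v in zip(keys, vecs):
        if k in index:  # stand-in for idx.contains(k)
            dup_keys.append(k)
        else:
            new_keys.append(k)
            new_vecs.append(v)
    return new_keys, new_vecs, dup_keys


index = {50: b"\x00" * 32, 100: b"\x01" * 32}  # keys 50 and 100 already exist
keys = [10, 20, 50, 30, 100]
vecs = [bytes([i]) * 32 for i in range(5)]

new_keys, new_vecs, dup_keys = split_batch(index, keys, vecs)
index.update(zip(new_keys, new_vecs))  # stand-in for idx.add(new_keys, new_vecs)

print(new_keys)  # [10, 20, 30]
print(dup_keys)  # [50, 100]
print(len(index))  # 5
```

This costs one `contains()` lookup per key but avoids the `RuntimeError` and the size drift entirely.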
Expected Behavior
Options (in order of preference):
- Best: `add()` succeeds for new keys, raises only for actual duplicates — add keys 10, 20, 30 and reject only 50 and 100. This matches `add_once()` semantics but for the error case.
- OK: Atomic rollback — if any key is a duplicate, reject the entire batch AND keep `size` consistent (don't increment for rejected keys).
- Minimum: Consistent `size` — if the batch is rejected, at least don't corrupt the size counter.
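The atomic-rollback option amounts to validate-first: detect every duplicate before mutating any state, so `size` cannot drift on a rejected batch. A minimal illustrative sketch (the `AtomicIndex` class and dict storage are invented here, not the actual ShardedIndex internals):

```python
# Illustrative sketch of the atomic-rollback option: validate the whole
# batch for duplicates before touching any bookkeeping, so size never
# drifts when a batch is rejected. A dict stands in for shard state.

class AtomicIndex:
    def __init__(self):
        self._data = {}

    @property
    def size(self):
        return len(self._data)

    def add(self, keys, vecs):
        dups = [k for k in keys if k in self._data]
        if dups:
            # Reject the entire batch before any state is mutated.
            raise RuntimeError(f"Duplicate keys not allowed: {dups}")
        self._data.update(zip(keys, vecs))


idx = AtomicIndex()
idx.add([50, 100], [b"a", b"b"])
try:
    idx.add([10, 20, 50, 30, 100], [b"c"] * 5)
except RuntimeError:
    pass
print(idx.size)  # 2 (unchanged after the rejected batch)
```

The key property is that the duplicate check and the mutation are strictly ordered: no counter, bloom filter, or tombstone update happens until the whole batch has been validated.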
Context
This was discovered while investigating a missing vector in iscc-search. The current `add_assets()` flow works around this by calling `remove(ALL_keys)` before `add(ALL_keys)` on every batch, even for first-time adds. This is wasteful but necessary to avoid the `RuntimeError`. Ideally we'd only need to remove keys that actually exist (or use `upsert()` / `add_once()` depending on semantics).
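The narrower workaround mentioned above (remove only the keys that actually exist, then add the full batch) could look like the following sketch, again with a dict standing in for the index; the `upsert` helper is hypothetical, and the real index would use `idx.contains()`, `idx.remove()`, and `idx.add()`:

```python
# Sketch of a targeted upsert: delete only the keys that already exist,
# then add the whole batch, instead of remove(ALL_keys) on every batch.
# A dict stands in for the index.

def upsert(index, keys, vecs):
    existing = [k for k in keys if k in index]  # stand-in for idx.contains(k)
    for k in existing:
        del index[k]                            # stand-in for idx.remove(existing)
    index.update(zip(keys, vecs))               # stand-in for idx.add(keys, vecs)
    return existing


index = {50: b"old", 100: b"old"}
replaced = upsert(index, [10, 20, 50, 30, 100], [b"v%d" % i for i in range(5)])
print(sorted(replaced))  # [50, 100]
print(len(index))        # 5
```

For first-time adds the `existing` list is empty, so the remove step becomes a no-op rather than a wasted full-batch removal.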
Upstream
The `RuntimeError` originates from usearch's `_add_to_compiled` → `compiled.add_many()`. The size counter corruption likely happens in `ShardedIndex.add()`, where bloom filter / tombstone bookkeeping runs before the error propagates.