ShardedNphdIndex.add() batch with duplicate key: entire batch lost + size counter corrupted #21

@titusz

Description

Summary

When ShardedNphdIndex.add() is called with a batch in which at least one key already exists in the index, usearch raises RuntimeError: Duplicate keys not allowed in high-level wrappers. The error propagates to the caller, but:

  1. All new keys in the batch are silently dropped — none are added
  2. The internal size counter increments for keys processed before the duplicate was detected

This leaves the index in an inconsistent state where size is wrong.

Versions

  • iscc-usearch==0.6.0
  • usearch==2.24.0
  • Python 3.12, Windows

Reproduction

import tempfile
import numpy as np
from iscc_usearch import ShardedNphdIndex

with tempfile.TemporaryDirectory() as tmpdir:
    idx = ShardedNphdIndex(path=tmpdir, max_dim=256)

    # Add keys 50 and 100
    idx.add([50, 100], [
        np.random.randint(0, 256, size=32, dtype=np.uint8),
        np.random.randint(0, 256, size=32, dtype=np.uint8),
    ])
    print(f"After initial add: size={idx.size}")  # 2

    # Batch: [10, 20, 50, 30, 100] — keys 50 and 100 already exist
    keys = [10, 20, 50, 30, 100]
    vecs = [np.random.randint(0, 256, size=32, dtype=np.uint8) for _ in range(5)]

    try:
        idx.add(keys, vecs)
    except RuntimeError as e:
        print(f"Error: {e}")
        print(f"Size after failure: {idx.size}")  # 4 (WRONG — should be 2)
        for k in [10, 20, 50, 30, 100]:
            print(f"  Key {k} exists: {idx.contains(k)}")
            # Only 50 and 100 exist — 10, 20, 30 were NOT added

Output:

After initial add: size=2
Error: Duplicate keys not allowed in high-level wrappers
Size after failure: 4
  Key 10 exists: False
  Key 20 exists: False
  Key 50 exists: True
  Key 30 exists: False
  Key 100 exists: True

Expected Behavior

Options (in order of preference):

  1. Best: add() succeeds for new keys and raises only for actual duplicates — add keys 10, 20, 30 and reject only 50 and 100. This mirrors add_once() semantics, extended to the error case.
  2. OK: Atomic rollback — if any key is a duplicate, reject the entire batch AND keep size consistent (don't increment for rejected keys).
  3. Minimum: Consistent size — if the batch is rejected, at least don't corrupt the size counter.
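
Option 1's per-key behavior can also be emulated on the caller side by partitioning the batch before calling add(). A minimal sketch — split_batch and the contains callable are hypothetical helper names, not part of iscc-usearch:

```python
def split_batch(keys, vectors, contains):
    """Partition a batch into (new, duplicate) key/vector pairs.

    contains is any membership test, e.g. idx.contains from the
    reproduction above. Caller-side sketch, not library API.
    """
    new_pairs, dup_pairs = [], []
    for key, vec in zip(keys, vectors):
        # Route each pair based on whether the key already exists.
        (dup_pairs if contains(key) else new_pairs).append((key, vec))
    return new_pairs, dup_pairs

# Usage with a plain set standing in for the index:
existing = {50, 100}
new, dups = split_batch(
    [10, 20, 50, 30, 100], ["a", "b", "c", "d", "e"], existing.__contains__
)
# new pairs carry keys 10, 20, 30; dups carry 50 and 100
```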

Context

This was discovered while investigating a missing vector in iscc-search. The current add_assets() flow works around this by calling remove() on all keys before add() on every batch, even for first-time adds. That is wasteful but necessary to avoid the RuntimeError. Ideally we would only remove keys that actually exist (or use upsert() / add_once(), depending on the desired semantics).
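
A cheaper interim workaround would be to remove only the keys that already exist before adding. A minimal sketch, assuming only the contains()/remove()/add() methods shown in the reproduction (safe_add is a hypothetical name):

```python
def safe_add(idx, keys, vectors):
    """Upsert-style batch add: remove only pre-existing keys first.

    Hypothetical caller-side wrapper; assumes idx exposes contains(key),
    remove(keys), and add(keys, vectors) as in the reproduction above.
    """
    existing = [k for k in keys if idx.contains(k)]
    if existing:
        idx.remove(existing)  # avoid the duplicate-key RuntimeError
    idx.add(keys, vectors)
```

This skips the remove() round-trip entirely for first-time batches, which is the common case in add_assets().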

Upstream

The RuntimeError originates in usearch's _add_to_compiled, which calls compiled.add_many(). The size counter corruption likely happens in ShardedIndex.add(), where bloom filter / tombstone bookkeeping runs before the error propagates.
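
If that ordering is indeed the cause, the minimal fix (option 3) is just to update bookkeeping after the backing call succeeds. A toy model of that invariant — class and attribute names are illustrative, not the actual ShardedIndex internals:

```python
class CounterSafeIndex:
    """Toy model: size changes only after the backing add_many() succeeds."""

    def __init__(self, backend):
        self._backend = backend  # anything exposing add_many(keys, vectors)
        self.size = 0

    def add(self, keys, vectors):
        # If add_many raises (e.g. on a duplicate key), execution never
        # reaches the size update, so the counter stays consistent.
        self._backend.add_many(keys, vectors)
        self.size += len(keys)
```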
