This document defines the fingerprint contract, duplicate taxonomy, and registry schema used for cross-run dedupe of claims/findings.
- Phase 1 scope is claim-level duplicate decisions only.
- Cross-run registry behavior is not implemented in this task.
- Existing evidence requirements and report gate semantics remain unchanged.
- Fingerprint version: `claim-fp-v1`
- Hash algorithm: SHA-256, lowercase hex digest.
- Preimage envelope (canonical JSON):
  - `fingerprint_version` (string)
  - `claim` (canonicalized object)
Canonicalization rules for claim-fp-v1:
- Deterministic object normalization:
- Keys are sorted lexicographically.
- Non-string keys are stringified.
- Excluded volatile keys are removed.
- Deterministic list normalization:
- Each item is recursively normalized.
- Normalized list is sorted by canonical JSON text of each item.
- Float normalization:
  - Round to 6 decimal places (matches `src/aiedge/determinism.py`).
- JSON serialization:
  - `sort_keys=True`
  - `separators=(",", ":")`
  - `ensure_ascii=True` (ASCII-only preimage)
- Excluded volatile/non-portable fields (minimum set):
  - Timestamps and run/session identifiers: `created_at`, `updated_at`, `started_at`, `finished_at`, `timestamp`, `run_id`, `stage_run_id`, `trace_id`, `session_id`
  - Run-variant paths and path lists: `path`, `paths`, `evidence_ref`, `evidence_refs`, `evidence_path`, `evidence_paths`, `file`, `files`
  - Raw binary payload fields: `blob`, `blobs`, `raw_blob`, `raw_blobs`, `binary`, `binary_blob`, `raw_bytes`
  - Any key ending with `_at`, `_ts`, `_timestamp`, `_path`, `_paths`, `_blob`, or `_bytes`
Reference implementation for this contract lives in src/aiedge/fingerprinting.py.
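The canonicalization and hashing rules above can be sketched in Python. This is an illustrative, non-authoritative sketch (helper names such as `_normalize` and `claim_fingerprint` are invented here); the authoritative implementation lives in `src/aiedge/fingerprinting.py`:

```python
import hashlib
import json

# Illustrative exclusion sets; the authoritative lists live in
# src/aiedge/fingerprinting.py.
EXCLUDED_KEYS = {
    "created_at", "updated_at", "started_at", "finished_at", "timestamp",
    "run_id", "stage_run_id", "trace_id", "session_id",
    "path", "paths", "evidence_ref", "evidence_refs",
    "evidence_path", "evidence_paths", "file", "files",
    "blob", "blobs", "raw_blob", "raw_blobs", "binary", "binary_blob",
    "raw_bytes",
}
EXCLUDED_SUFFIXES = ("_at", "_ts", "_timestamp", "_path", "_paths", "_blob", "_bytes")


def _normalize(value):
    """Recursively apply claim-fp-v1 normalization rules."""
    if isinstance(value, dict):
        out = {}
        for key, item in value.items():
            key = str(key)  # non-string keys are stringified
            if key in EXCLUDED_KEYS or key.endswith(EXCLUDED_SUFFIXES):
                continue  # drop volatile/non-portable fields
            out[key] = _normalize(item)
        return out  # key order is fixed at serialization time via sort_keys
    if isinstance(value, list):
        items = [_normalize(item) for item in value]
        # Sort by canonical JSON text of each item for determinism.
        return sorted(items, key=lambda i: json.dumps(i, sort_keys=True))
    if isinstance(value, float):
        return round(value, 6)  # matches src/aiedge/determinism.py
    return value


def claim_fingerprint(claim: dict) -> str:
    """SHA-256 lowercase hex digest over the claim-fp-v1 canonical preimage."""
    envelope = {"fingerprint_version": "claim-fp-v1", "claim": _normalize(claim)}
    preimage = json.dumps(envelope, sort_keys=True, separators=(",", ":"),
                          ensure_ascii=True)
    return hashlib.sha256(preimage.encode("ascii")).hexdigest()
```

Note how two claims that differ only in excluded keys or below the 6-decimal float precision hash to the same fingerprint, which is exactly what cross-run dedupe needs.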
Taxonomy version: duplicate-taxonomy-v1
Machine-checkable rule:
- Input: `current_claim`, `existing_registry_record`
- Compute `fp_current = sha256(canonical_preimage(current_claim, claim-fp-v1))`
- Compare against `existing_registry_record.fingerprint`
- Classify as `exact_fingerprint_duplicate` iff all conditions are true:
  - `existing_registry_record.fingerprint_version == "claim-fp-v1"`
  - `existing_registry_record.fingerprint == fp_current`
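The exact-duplicate rule reduces to a two-condition predicate. A minimal sketch (the record shape here is illustrative, not the registry schema itself):

```python
def is_exact_fingerprint_duplicate(fp_current: str, record: dict) -> bool:
    """Exact duplicate iff both the fingerprint version and digest match."""
    return (
        record.get("fingerprint_version") == "claim-fp-v1"
        and record.get("fingerprint") == fp_current
    )
```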
Machine-checkable placeholder rule:
- Classify as `near_duplicate` only when a future near-match scorer emits:
  - `near_duplicate_score` in `[0.0, 1.0]`
  - `near_duplicate_score >= near_duplicate_threshold`
  - `fp_current != existing_registry_record.fingerprint`
- Phase 1 behavior: always `not_evaluated`.
Machine-checkable rule:
- Let `fp_current` be exact-match equal to an existing fingerprint.
- Let `ctx_current` and `ctx_last_seen` be deterministic context digests (future Task 3 novelty context payload).
- Classify as `context_changed_reopen` iff:
  - exact fingerprint match is true, and
  - `ctx_current != ctx_last_seen`
- Operational semantics: force retriage even though the claim fingerprint is unchanged.
Phase 1 note: taxonomy is defined now; context digest production and reopen automation are implemented later.
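Putting the three taxonomy rules together, a classifier might look like the sketch below. Everything here is hypothetical scaffolding: the `ctx_last_seen` record field and `ctx_current` argument stand in for the future Task 3 context digests, the `near_duplicate_score`/`near_duplicate_threshold` parameters stand in for the future near-match scorer (the `0.9` default is invented), and in Phase 1 the near-duplicate branch is always `not_evaluated` because no scorer exists:

```python
from typing import Optional


def classify(fp_current: str, record: dict,
             ctx_current: Optional[str] = None,
             near_duplicate_score: Optional[float] = None,
             near_duplicate_threshold: float = 0.9) -> str:
    """Sketch of duplicate-taxonomy-v1 classification."""
    exact = (record.get("fingerprint_version") == "claim-fp-v1"
             and record.get("fingerprint") == fp_current)
    if exact:
        ctx_last_seen = record.get("ctx_last_seen")  # future Task 3 payload
        if ctx_current is not None and ctx_current != ctx_last_seen:
            return "context_changed_reopen"  # force retriage despite exact match
        return "exact_fingerprint_duplicate"
    # Phase 1: no near-match scorer exists, so near-duplicate is not evaluated.
    if near_duplicate_score is None:
        return "not_evaluated"
    if 0.0 <= near_duplicate_score <= 1.0 and \
            near_duplicate_score >= near_duplicate_threshold:
        return "near_duplicate"
    return "not_evaluated"
```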
Schema version: duplicate-registry-v1
Top-level fields:
- `schema_version` (required, const `duplicate-registry-v1`)
- `created_at` (required, RFC3339 date-time)
- `records` (required, object keyed by fingerprint)
Record object fields (minimum audit set):
- `fingerprint` (required, 64-char lowercase hex SHA-256)
- `fingerprint_version` (required, const `claim-fp-v1`)
- `first_seen_run_id` (required)
- `last_seen_at` (required, RFC3339 date-time)
- `sources` (required, non-empty array)
  - each source includes at minimum `run_id`
  - optional: `finding_id`, `claim_path`
- `last_classification` (optional enum): `exact_fingerprint_duplicate`, `near_duplicate`, `context_changed_reopen`
The reference JSON Schema object is exported as DUPLICATE_REGISTRY_JSON_SCHEMA from src/aiedge/fingerprinting.py.
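A minimal registry instance with the shape described above (all field values here are illustrative, not taken from a real run):

```python
# Illustrative duplicate-registry-v1 instance; values are made up.
fp = "0" * 64  # 64-char lowercase hex SHA-256 fingerprint

registry = {
    "schema_version": "duplicate-registry-v1",
    "created_at": "2024-01-01T00:00:00Z",
    "records": {
        fp: {
            "fingerprint": fp,
            "fingerprint_version": "claim-fp-v1",
            "first_seen_run_id": "run-001",
            "last_seen_at": "2024-01-02T00:00:00Z",
            "sources": [
                {"run_id": "run-001"},
                {"run_id": "run-002", "finding_id": "f-17", "claim_path": "claims/0"},
            ],
            "last_classification": "exact_fingerprint_duplicate",
        }
    },
}
```

An instance like this is what `DUPLICATE_REGISTRY_JSON_SCHEMA` is meant to validate: one record per fingerprint key, with the fingerprint repeated inside the record for auditability.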
Phase 2+ implementation should enforce bounded storage while preserving auditability:
- Keep one primary record per fingerprint key (no key fan-out).
- Keep `sources` as a capped history (for example, the most recent N entries).
- Periodically compact stale records using retention policy windows.
- Preserve aggregate counters and first-seen metadata before compaction.
- Run compaction deterministically (stable key order, deterministic cutoff criteria).
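The capped-history and deterministic-compaction points can be sketched as follows. The cap value, function names, and the choice of a lexicographic comparison on RFC3339 strings are assumptions for illustration (lexicographic comparison is only safe when all timestamps use the same UTC `Z` format):

```python
MAX_SOURCES = 10  # illustrative cap; the real value is a retention-policy choice


def append_source(record: dict, source: dict, max_sources: int = MAX_SOURCES) -> None:
    """Append a source entry, keeping only the most recent N (capped history)."""
    sources = record.setdefault("sources", [])
    sources.append(source)
    del sources[:-max_sources]  # drop the oldest entries beyond the cap


def compact(records: dict, cutoff: str) -> dict:
    """Deterministic compaction: stable key order, fixed RFC3339 cutoff.

    Records whose last_seen_at predates the cutoff are dropped; preserving
    aggregate counters and first-seen metadata is assumed to happen before
    this runs, per the retention notes above.
    """
    return {
        fp: rec
        for fp, rec in sorted(records.items())  # stable key order
        if rec["last_seen_at"] >= cutoff  # same-format RFC3339 strings compare lexicographically
    }
```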