This document defines how memory-path-engine should prove that its current retrieval and memory features are effective.
The short version:
- public benchmarks prove external comparability
- repository-owned structured benchmarks prove the architecture claims of this project
- private real-world datasets prove that the same claims still matter on realistic documents
Use this file as the strategy layer. For concrete metrics, retriever modes, and benchmark runner behavior, see evaluation.md. For current repository-owned fixtures, see ../benchmarks/structured_memory/README.md.
The project already has enough moving parts that qualitative demos are no longer sufficient:
- multiple retriever modes
- typed benchmark models
- path, semantic, contradiction, and activation diagnostics
- static vs dynamic memory experiment modes
That means the next useful question is no longer "does the demo look interesting?" but "which design claims are actually supported by repeatable evidence?"
This strategy is anchored to the research hypotheses in hypotheses.md:
- H1: graph-aware retrieval should beat flat `top-k` retrieval on multi-hop questions
- H2: weighting and anomaly-aware scoring should improve critical evidence recall without unacceptable latency
- H3: replayable `MemoryPath` output should be causally useful, not cosmetic
The current benchmark stack should also prove an additional v0.5 claim:
- dynamic memory state should produce measurable retrieval differences across ordered query sequences
Purpose:
- give the project an external comparison point
- show that graph-aware memory ideas do not collapse on standard retrieval or memory tasks
- make it possible to compare against benchmark-first projects
What this layer is good for:
- general long-context or multi-hop QA
- evidence retrieval sanity checks
- broad memory-system positioning
What this layer is not good for:
- typed edge correctness
- exception or contradiction semantics
- replayable path-shape validation
- dynamic priming effects specific to this project
Recommended use:
- treat public benchmarks as external sanity checks, not as the only source of truth
- report answer metrics, evidence metrics when available, and latency
- be explicit about what each dataset does not validate
Purpose:
- directly test the architecture claims unique to `memory-path-engine`
- keep evaluation aligned with the domain model and `MemoryPath`
- support TDD, regressions, and explainable miss analysis
This is the most important layer for the current stage because it can measure things public datasets usually cannot:
- path validity
- semantic-role coverage
- contradiction and exception surfacing
- activation-trace behavior
- static vs dynamic divergence on ordered cases
Current examples live in ../benchmarks/structured_memory:
- `example_contract_benchmark.json`
- `example_runbook_benchmark.json`
- `exception_override_benchmark.json`
- `multi_hop_chain_benchmark.json`
- `dynamic_memory_priming_benchmark.json`
- `contract_exception_priming_benchmark.json`
Purpose:
- validate external realism
- test whether synthetic conclusions survive real document noise
- support domain-specific decisions without forcing sensitive data into the public repo
This layer is where you prove the system matters on actual contracts, SOPs, runbooks, policy docs, or internal knowledge bases.
Private datasets should not replace Layer B. They complement it:
- Layer B proves the mechanism
- Layer C proves the mechanism still matters in practice
Best for:
- multi-document retrieval
- bridge-style multi-hop questions
- evidence-support evaluation
What it validates:
- whether graph-aware retrieval helps find linked evidence
What it does not validate:
- typed graph semantics
- dynamic memory
- exact `MemoryPath` correctness
Integration difficulty: medium to high
Recommended reported metrics:
- answer EM/F1
- evidence recall or support-sentence recall
- latency
Best for:
- harder compositional multi-hop retrieval
- robustness under distractors
What it validates:
- whether retrieval still works when reasoning requires several linked facts
What it does not validate:
- exception or contradiction logic
- path replay fidelity
- sequential memory effects
Integration difficulty: high
Recommended reported metrics:
- answer EM/F1
- evidence recall
- breakdown by hop count or question type
Best for:
- explicit multi-hop retrieval patterns
- relation-following style questions
What it validates:
- whether graph-style traversal gives a meaningful advantage over flat retrieval
What it does not validate:
- internal semantic roles
- dynamic priming behavior
Integration difficulty: medium to high
Recommended reported metrics:
- answer EM/F1
- evidence recall
- latency
Useful for:
- hierarchical evidence retrieval
- multi-step claim verification
Useful but incomplete because:
- it is closer to evidence verification than to replayable memory paths
Useful for:
- long conversational memory
- timeline and history-sensitive recall
Useful but incomplete because:
- it does not naturally validate edge types, path shapes, or exception semantics
Useful for:
- broad memory-system sanity checks
- external positioning against long-memory systems
Useful but incomplete because:
- it is much broader than your current architectural claims
- it reports retrieval sanity and external positioning, not path/semantic/dynamic correctness
These are worth using only as secondary background checks:
- `BEIR` or `MS MARCO`: good for flat retrieval baselines, weak for graph/path claims
- `FEVER`: useful for evidence retrieval, but not for your main graph-memory story
- `SQuAD`, `Natural Questions`, similar QA sets: weak fit for path and structured-memory validation
Use evaluation.md as the source of truth for the metric definitions. At strategy level, the project should consistently report:
- `answer_recall` or task-native answer score on public datasets
- `evidence_recall` or `evidence_hit_rate`
- `path_hit_rate` on repository-owned and private structured datasets
- `semantic_hit_rate` where semantic roles matter
- `contradiction_hit_rate` where exception or conflict cases exist
- `activation_trace_hit_rate` for spreading-based experiments
- `avg_latency_ms`
Current implemented public benchmark adapters:
- `HotpotQA` retrieval-only evidence benchmark with local/nightly full-dev support
- `LongMemEval` retrieval-only session benchmark (`R@5`, `R@10`, `NDCG@10`) for external positioning
At minimum, every benchmark report should make it easy to compare:
- `lexical_baseline`
- `embedding_baseline`
- `structure_only`
- `weighted_graph`
- `activation_spreading_v1`
- `weighted_graph_static`
- `weighted_graph_dynamic`
- `activation_spreading_static`
- `activation_spreading_dynamic`
Private datasets should be built in two buckets:
This is the high-quality, manually labeled evaluation set.
Properties:
- small to medium size
- carefully reviewed
- stable over time
- used for milestone decisions and release comparisons
Suggested first size:
- `20-50` cases per document family for the first real version
This is a larger, lower-touch set used for replay and drift detection.
Properties:
- less detailed labeling
- easier to refresh
- useful for regression replay
- not necessarily suitable for publishable claims
Start with document types that match your intended memory use cases:
- master service agreements
- policy or compliance documents
- operational runbooks
- incident postmortems
- support procedures
Keep each dataset focused. Do not mix unrelated document families into one benchmark unless cross-document retrieval is the explicit point.
Before writing cases, freeze a stable version of the source documents:
- assign a dataset version
- assign stable file names
- keep document hashes or archival copies internally
This avoids silent benchmark drift.
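One way to keep document hashes is a small manifest keyed by stable file name. The sketch below is illustrative only: `build_manifest`, `changed_files`, and the manifest layout are assumptions for this example, not part of the repository.

```python
import hashlib
from pathlib import Path


def build_manifest(doc_dir: Path, dataset_version: str) -> dict:
    """Map each file name in doc_dir to the SHA-256 digest of its bytes."""
    files = {}
    for path in sorted(doc_dir.iterdir()):
        if path.is_file():
            files[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"dataset_version": dataset_version, "files": files}


def changed_files(old: dict, new: dict) -> list[str]:
    """List file names that were added, removed, or whose hash differs."""
    names = set(old["files"]) | set(new["files"])
    return sorted(n for n in names if old["files"].get(n) != new["files"].get(n))
```

Comparing a stored manifest against a freshly built one before each benchmark run would surface any source document that changed without a dataset version bump.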
Follow the same pattern already used in the repo:
`{document_stem}:{unit_number}`

Examples:

- `msa_v3:12`
- `incident_runbook_api:7`
Every gold label should point to these stable IDs, not to raw prose fragments.
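A minimal sketch of formatting and validating IDs in the `{document_stem}:{unit_number}` pattern. The names `NodeRef`, `format_node_id`, and `parse_node_id` are hypothetical helpers for illustration, and the choice to split on the last colon is an assumption.

```python
import re
from typing import NamedTuple


class NodeRef(NamedTuple):
    document_stem: str
    unit_number: int


# Greedy stem, then a final colon and ASCII digits (assumed ID grammar).
_NODE_ID = re.compile(r"(?P<stem>.+):(?P<unit>[0-9]+)")


def format_node_id(document_stem: str, unit_number: int) -> str:
    return f"{document_stem}:{unit_number}"


def parse_node_id(node_id: str) -> NodeRef:
    """Split a node ID on its last colon and validate the unit number."""
    match = _NODE_ID.fullmatch(node_id)
    if match is None:
        raise ValueError(f"not a valid node id: {node_id!r}")
    return NodeRef(match["stem"], int(match["unit"]))
```

Running `parse_node_id` over every ID in a gold file is a cheap way to catch labels that point at prose fragments or malformed IDs before a benchmark run.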
Each case should include:
- `query`
- `evidence_node_ids`
- optional `path`
- optional `required_edge_types`
- optional `required_semantic_roles`
- optional `required_contradiction_pairs`
- optional `activation_trace`
That keeps the benchmark aligned with memory-path-engine rather than reducing everything to answer strings.
Use evidence_node_ids for the minimum nodes that must support the answer.
Guidelines:
- label direct evidence, not every related node
- for multi-hop questions, include every essential support node
- use `minimum_evidence_matches` to control whether all or only some evidence must be found
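To make the effect of `minimum_evidence_matches` concrete, here is a sketch of one plausible hit rule. It assumes a case counts as a hit when at least that many gold nodes appear among the retrieved node IDs, and that an omitted value means "all gold nodes"; the actual runner may define this differently, and `evidence_hit` here is a hypothetical standalone function.

```python
from typing import Iterable, Optional


def evidence_hit(
    retrieved_ids: Iterable[str],
    evidence_node_ids: Iterable[str],
    minimum_evidence_matches: Optional[int] = None,
) -> bool:
    """Return True when enough gold evidence nodes were retrieved."""
    gold = set(evidence_node_ids)
    # Assumption: no explicit minimum means every gold node is required.
    required = len(gold) if minimum_evidence_matches is None else minimum_evidence_matches
    matched = gold & set(retrieved_ids)
    return len(matched) >= required
```

Under this rule, retrieving only `acme_msa:14` satisfies a case labeled with `acme_msa:14` and `acme_msa:15` when the minimum is 1, but not when it is 2.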
Use path when "finding the right evidence" is not enough and "walking the right route" matters.
Good uses:
- runbook next-step chains
- contract exception override chains
- dependency chains
Avoid path labels when multiple path shapes are equally valid and the benchmark would become brittle.
Use required_semantic_roles when you want to confirm that the system surfaced the right semantic class:
- `exception`
- `remedy`
- `condition`
- `escalation`
Use required_contradiction_pairs when the value of the case depends on the system surfacing a tension between nodes:
- general rule vs exception
- baseline obligation vs override
- normal flow vs emergency override
For dynamic-memory cases, place several prime-* cases before a final probe-* case.
Guidelines:
- the prime cases should reinforce one region of the graph
- the probe should test later behavior, not restate the same question
- compare static and dynamic modes on the same ordered dataset
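The ordering rule can be checked mechanically before a run. This sketch assumes a single ordered sequence whose case IDs literally start with `prime-` or `probe-`, and it interprets "several" as at least two; both are assumptions, and `check_priming_order` is a hypothetical helper. Datasets that mark probes through tags instead would need a different predicate.

```python
def check_priming_order(case_ids: list[str]) -> list[str]:
    """Return a list of ordering problems; an empty list means the order looks right."""
    problems = []
    if not case_ids or not case_ids[-1].startswith("probe-"):
        problems.append("last case should be a probe-* case")
    earlier = case_ids[:-1]
    primes = [c for c in earlier if c.startswith("prime-")]
    if len(primes) < 2:
        problems.append("expected several prime-* cases before the probe")
    if any(c.startswith("probe-") for c in earlier):
        problems.append("probe-* case appears before the final position")
    return problems
```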
To keep private gold labels trustworthy:
- require at least two annotators for high-value cases
- resolve disagreements with a short written rationale
- write a one-page internal annotation guide before scaling
- use tags like `multi_hop`, `exception`, `contradiction`, `sequential`, `probe`
- include hard negative cases with similar wording but different evidence
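Annotator disagreement on evidence labels can be quantified with a simple set-overlap score. The sketch below uses mean Jaccard overlap of `evidence_node_ids` per case; this document does not define an agreement metric, so the choice of Jaccard and the helper names are assumptions.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap; two empty sets count as full agreement."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def evidence_agreement(labels_a: dict, labels_b: dict) -> float:
    """Mean Jaccard overlap over cases labeled by both annotators."""
    shared = sorted(set(labels_a) & set(labels_b))
    if not shared:
        raise ValueError("no cases labeled by both annotators")
    scores = [jaccard(set(labels_a[c]), set(labels_b[c])) for c in shared]
    return sum(scores) / len(scores)


def disagreements(labels_a: dict, labels_b: dict) -> list[str]:
    """Case IDs whose evidence sets differ and need a written rationale."""
    shared = set(labels_a) & set(labels_b)
    return sorted(c for c in shared if set(labels_a[c]) != set(labels_b[c]))
```

The output of `disagreements` gives the list of cases to resolve with a short rationale, while `evidence_agreement` gives a single number to track as the annotation guide matures.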
{
"case_id": "private-msa-termination-001",
"query": "If the customer does not cure a material breach after notice, what happens next?",
"tags": ["contract", "termination", "multi_hop"],
"expectation": {
"evidence_node_ids": ["acme_msa:14", "acme_msa:15"],
"minimum_evidence_matches": 1,
"path_scope": "best_path",
"path": {
"match_mode": "prefix",
"steps": [
{ "node_id": "acme_msa:15", "via_edge_type": null },
{ "node_id": "acme_msa:14", "via_edge_type": "depends_on" }
]
}
}
}

{
"case_id": "private-payment-exception-001",
"query": "Does the standard 30-day payment rule still apply when delivered goods are defective?",
"tags": ["contract", "exception", "contradiction"],
"expectation": {
"evidence_node_ids": ["supply_terms:3", "supply_terms:4"],
"minimum_evidence_matches": 2,
"required_semantic_roles": ["exception", "remedy"],
"required_contradiction_pairs": [
["supply_terms:2", "supply_terms:3"]
]
}
}

{
"case_id": "private-runbook-probe-001",
"query": "What comes after beta cache diagnostics?",
"tags": ["runbook", "sequential", "probe"],
"expectation": {
"evidence_node_ids": ["incident_beta_runbook:7"],
"minimum_evidence_matches": 1,
"path_scope": "best_path",
"path": {
"match_mode": "prefix",
"steps": [
{ "node_id": "incident_beta_runbook:7", "via_edge_type": null },
{ "node_id": "incident_beta_runbook:6", "via_edge_type": "depends_on" }
]
}
}
}

A practical first benchmark package would be:
- `1-2` public benchmarks for external sanity checking
- all current repository-owned structured benchmarks as the architectural regression suite
- `20-50` private gold cases across one contract family and one runbook family
That is enough to support a credible claim that the system is:
- benchmarked in public
- tested on architecture-specific fixtures
- validated on realistic private material
The table below is a default rolling schedule for a small team. Adjust dates to your calendar; keep the ordering: Layer B gates before Layer A scale, and Layer C in parallel once Layer B is stable.
| Week | Primary layer | What to run | Gate (pass / fail) | Notes |
|---|---|---|---|---|
| 0 | B | Full unit suite + load all `benchmarks/structured_memory/*.json` | CI green; every fixture loads | Baseline regression lock |
| 1 | B | Same + `run_suite` on key modes for each fixture (see below) | `evidence_hit_rate` and `path_hit_rate` each no worse than -2% absolute vs last tagged baseline; `avg_latency_ms` no worse than +20% | Use `docs/evaluation.md` mode list |
| 2 | A (HotpotQA v0) | Adapter + 2–5 hand-checked examples in tests (no full download) | Unit tests pass; gold `supporting_facts` → `node_id` mapping verified | See appendix HotpotQA v0 |
| 3 | A | Dev distractor tiny split (32–128 items) in CI or skip-if-missing | Pipeline runs end-to-end; `embedding_baseline.evidence_hit_rate >= lexical_baseline.evidence_hit_rate`; both latencies captured | Retrieval-only v0 |
| 4 | A | Full dev distractor locally (not necessarily CI) | Table of `evidence_hit_rate` by type (bridge / comparison) + latency | Still no EM/F1 unless you add a reader |
| 5 | B + A | Ablations from `evaluation.md` on B fixtures + HotpotQA tiny | At least one targeted Layer B fixture must show the expected direction for each ablation family: no-structure, no-weight, no-path-expansion | Document failures explicitly |
| 6 | C | Private golden pilot 20 cases | Dual annotation complete; evidence-label agreement >= 0.85; review-complete set exported to benchmark JSON | Aggregate only if policy allows |
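The Week 1 gate thresholds translate directly into a comparison against the last tagged baseline. The following is a sketch under stated assumptions: `ModeMetrics` and `week1_gate` are hypothetical names, rates are fractions in `[0, 1]`, and the metrics are assumed to be already aggregated per retriever mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeMetrics:
    evidence_hit_rate: float
    path_hit_rate: float
    avg_latency_ms: float


def week1_gate(
    baseline: ModeMetrics,
    current: ModeMetrics,
    max_rate_drop: float = 0.02,       # -2% absolute
    max_latency_growth: float = 0.20,  # +20% relative
) -> list[str]:
    """Return human-readable gate failures; an empty list means the gate passes."""
    eps = 1e-9  # tolerate float rounding at the exact threshold
    failures = []
    for name in ("evidence_hit_rate", "path_hit_rate"):
        drop = getattr(baseline, name) - getattr(current, name)
        if drop > max_rate_drop + eps:
            failures.append(f"{name} dropped by {drop:.3f} absolute")
    latency_limit = baseline.avg_latency_ms * (1 + max_latency_growth)
    if current.avg_latency_ms > latency_limit + eps:
        failures.append(
            f"avg_latency_ms {current.avg_latency_ms:.1f} exceeds {latency_limit:.1f}"
        )
    return failures
```

Returning the list of failures rather than a boolean keeps the gate output usable in CI logs and release notes.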
Recommended minimum mode matrix per Layer B run (when comparing architecture, not only smoke):
- `lexical_baseline`, `embedding_baseline`, `structure_only`, `weighted_graph`, `activation_spreading_v1`
- For dynamic claims only: `activation_spreading_static` vs `activation_spreading_dynamic` on priming fixtures
Gate philosophy
- Hard gate: anything that already has CI today (tests + JSON load) must stay green.
- Soft gate: public benchmark deltas are reported honestly; do not block merges on HotpotQA leaderboard scores until the adapter is stable.
- Release gate (optional): before tagging `v0.x`, require Week 1 B-suite comparison + Week 3 HotpotQA tiny evidence metrics logged in release notes.
- `benchmarks/structured_memory/*.json`: main CI
- `benchmarks/external/hotpotqa/hotpot_tiny_fixture.json`: main CI smoke
- `benchmarks/external/hotpotqa/data/*.json`: nightly or local only
- `benchmarks/external/longmemeval/longmemeval_tiny_fixture.json`: local smoke
- `benchmarks/external/longmemeval/data/*.json`: manual or nightly after the adapter stabilizes
For a contract-focused private rollout plan, including inventory template, sampling rules, and golden annotation workflow, see private-contract-dataset-guide.md.
The first public benchmark should be HotpotQA dev distractor (`hotpot_dev_distractor_v1.json`), not fullwiki: each question ships with a fixed context block (two gold paragraphs plus distractors), which matches per-sample `MemoryStore` construction without a corpus-wide index.
Data
- Official splits and format: HotpotQA. Typical fields: `question`, `answer`, `context` as `[[title, [sentences...]], ...]`, `supporting_facts` as `[[title, sent_id], ...]` with `sent_id` 0-based within that title’s sentence list.
- License and citation: copy the exact terms from the official release into `benchmarks/external/hotpotqa/README.md` (do not paraphrase legal text).
- HuggingFace option: dataset id such as `hotpotqa/hotpot_qa` can simplify download; pin revision for reproducibility.
Ingestion (v0)
- One sentence → one `MemoryNode`: `node_id = "{normalized_title}:{sent_idx}"` for sentences under each title in that sample’s `context`.
- Edges (v0): within the same title only, chain adjacent sentences (`title:s` → `title:s+1`). Do not add cross-title cliques in v0.
- Normalization: implement one shared `normalize_title` for both ingestion and gold mapping; watch Unicode, spacing, and punctuation.
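The v0 ingestion rules can be sketched with plain Python structures. This is not the adapter in `hotpotqa.py`: it uses a dict and tuples in place of `MemoryNode` and `MemoryStore`, whose APIs are not shown here, and the `normalize_title` body (NFKC, casefold, whitespace collapsed to underscores) is only one assumed normalization that leaves punctuation untouched.

```python
import unicodedata


def normalize_title(title: str) -> str:
    """Assumed normalization: NFKC, casefold, whitespace runs become '_'."""
    text = unicodedata.normalize("NFKC", title).casefold()
    return "_".join(text.split())


def ingest_context(context: list) -> tuple[dict, list]:
    """Turn one sample's context into sentence nodes and same-title chain edges."""
    nodes: dict[str, str] = {}
    edges: list[tuple[str, str]] = []
    for title, sentences in context:
        key = normalize_title(title)
        for idx, sentence in enumerate(sentences):
            nodes[f"{key}:{idx}"] = sentence
            if idx > 0:
                # Chain adjacent sentences within the same title only.
                edges.append((f"{key}:{idx - 1}", f"{key}:{idx}"))
    return nodes, edges
```

Because edges never cross titles, a sample with a single-sentence paragraph contributes a node but no edge for that title.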
Gold → benchmark
- For each `[title, sent_id]` in `supporting_facts`, resolve to the same `node_id` scheme; dedupe into `evidence_node_ids`.
- `minimum_evidence_matches`: pick one policy and keep it fixed in reports (`all gold sentences` vs `at least one`).
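A sketch of the gold mapping step, assuming the title normalizer is passed in so that ingestion and gold mapping share one function. `gold_evidence_ids` is a hypothetical helper; returning unresolved IDs separately, rather than dropping them silently, is a design choice of this example.

```python
from typing import Callable, Collection


def gold_evidence_ids(
    supporting_facts: list,
    known_node_ids: Collection[str],
    normalize_title: Callable[[str], str],
) -> tuple[list[str], list[str]]:
    """Resolve [title, sent_id] pairs to node IDs.

    Returns (evidence_node_ids, unresolved_ids); both are deduplicated and
    keep first-seen order.
    """
    resolved: list[str] = []
    unresolved: list[str] = []
    for title, sent_id in supporting_facts:
        node_id = f"{normalize_title(title)}:{sent_id}"
        bucket = resolved if node_id in known_node_ids else unresolved
        if node_id not in bucket:
            bucket.append(node_id)
    return resolved, unresolved
```

A non-empty `unresolved` list usually points at a title normalization mismatch or a sentence index outside the sample's context, and is worth logging per sample.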
Metrics
- Retrieval-only phase (no answer generation): report `evidence_hit` / `evidence_hit_rate` aligned with `StructuredBenchmarkRunner`, plus recall@k only after you define what “top-k” means for your retriever (best path steps vs union of paths; document the choice).
- EM / F1: report only after you add a reader that produces a span or short answer and reuse HotpotQA’s official normalization.
Repo layout (suggested)
- `benchmarks/external/hotpotqa/README.md`: download, license, metrics, commands
- `benchmarks/external/hotpotqa/hotpot_tiny_fixture.json`: two-item sanity fixture for CI-style tests
- `scripts/download_hotpotqa.py`: optional fetch + checksum (not required for v0)
- `src/memory_engine/benchmarking/adapters/hotpotqa.py`: implemented; sample → `MemoryStore`, gold mapping, `run_hotpotqa_benchmark`
CI
- Use a tiny slice (`32–128` dev examples) or `skipif` when the data is absent; the full dev set stays a local / nightly job.
Pitfalls
- Do not merge all dev questions into one global graph (leaks cross-question structure).
- Distractor setting: most nodes are negatives; compare modes under the same `top_k` policy.
- Title matching between `context` and `supporting_facts` must be identical after normalization.
When you publish or present results:
- use public benchmarks for external comparison
- use repository-owned benchmarks to prove structure, path, and dynamic-memory claims
- use private datasets to show realistic value, but report them carefully and transparently
Good practice:
- publish aggregate numbers
- describe the private data types and annotation protocol
- include a few anonymized example cases if policy allows
- never let private-only results become the sole support for a general technical claim