Commit dd8c430

Copilot and devlux76 authored
P1-X: Complete hierarchical/dialectical query pipeline — bug fixes + Williams-bound enforcement (#77)
* Initial plan
* Fix build errors, rename MetroidNeighbor→SemanticNeighbor, fix getAllVolumes syntax, fix FullNeighborRecalc budget logic
  Co-authored-by: devlux76 <[email protected]>
* P1 implementations: Williams-bound maxDegree, dynamic subgraph bounds, HierarchyBuilder quota enforcement, fix pre-existing lint errors
  Co-authored-by: devlux76 <[email protected]>
* Fix HierarchyBuilder split/delete, Query lazy page scan, FullNeighborRecalc budget floor, FastNeighborInsert default degree, PrototypeRecomputer docstring
  Co-authored-by: devlux76 <[email protected]>
* docs: Add Williams bound / curse of dimensionality explanation to DESIGN.md and README.md
  Co-authored-by: devlux76 <[email protected]>
---------
Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: devlux76 <[email protected]>
1 parent 5cde29f commit dd8c430

19 files changed, +556 -135 lines changed

DESIGN.md

Lines changed: 17 additions & 1 deletion
@@ -283,8 +283,24 @@ This mechanism enables **distributed learning without hallucination**: the syste

 ### Motivation

+#### The Geometric Root: Curse of Dimensionality
+
+CORTEX operates on high-dimensional Matryoshka embeddings. In `n`-dimensional Euclidean space the volume of the unit ball is:
+
+```
+Vol(B²ᵐ) = πᵐ / m! (n = 2m, even dimension)
+```
+
+As `m` (half the embedding dimension) grows past the first few values, this volume collapses toward zero faster than exponentially, because `m!` outgrows `πᵐ`. This is the geometric driver of the **curse of dimensionality**: pairwise distances concentrate (everything looks equally far away), interiors vanish (rejection sampling and kernel methods fail), and any linear or polynomial scaling law blows up. Naïve nearest-neighbor search, flat clustering, fixed-K neighbor graphs, and uniform fan-out become either useless or unboundedly expensive as the corpus scales.
+
+Every structural decision in CORTEX — protected Matryoshka layers, hierarchical medoids, the Metroid antithesis hunt, dimensional unwinding, Williams-derived index sizes — is a direct geometric counter-measure to this collapse.
+
+#### The Fix: Williams 2025 Sublinear Bound
+
 CORTEX applies the Williams 2025 result — S = O(√(t log t)) — as a universal sublinear growth law everywhere the system trades space against time: the resident hotpath index, per-tier hierarchy quotas, per-community graph budgets, semantic neighbor degree limits, and Daydreamer maintenance batch sizing. This single principle ensures the system stays efficient as the memory graph scales from hundreds to millions of nodes.

+Concretely: where a naïve system would grow capacity linearly (O(t)) or even quadratically (O(t²) for pairwise operations), CORTEX caps every space-or-time budget at O(√(t log t)). This is the mathematically precise bound that keeps the engine on-device forever, regardless of corpus size.
+
 ### Graph Mass Definition

 ```
@@ -797,7 +813,7 @@ relative to frozen c. Planned module: `cortex/MetroidBuilder.ts`.

 **Hotpath**: The in-memory resident index of H(t) entries spanning all four hierarchy tiers. The hotpath is the first lookup target for every query; misses spill to WARM/COLD storage. HOT membership and salience are checkpointed to the `hotpath_index` IndexedDB store by Daydreamer each maintenance cycle, allowing the RAM index to be restored after a page reload or machine reboot without full corpus replay.

-**Williams Bound**: The theoretical result S = O(√(t log t)) from Williams 2025, applied here as a universal sublinear growth law for all space-time tradeoff subsystems in CORTEX.
+**Williams Bound**: The theoretical result S = O(√(t log t)) from Williams 2025, applied here as a universal sublinear growth law for all space-time tradeoff subsystems in CORTEX. The bound is the constructive answer to the curse of dimensionality: in `n`-dimensional space the unit-ball volume collapses as `πᵐ/m!` (n = 2m), making linear-scale data structures infeasible. The Williams sublinear bound keeps every budget — hotpath capacity, hierarchy fanout, neighbor degree, maintenance batch size — proportional to √(t log t) rather than t, ensuring on-device viability at any corpus scale.

 **Graph mass (t)**: t = |V| + |E| = total pages plus all edges (Hebbian + semantic neighbor). The canonical input to all capacity and bound formulas.
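The unit-ball volume formula quoted in the DESIGN.md change can be checked numerically. The sketch below is illustrative only: `unitBallVolumeEven` is a hypothetical helper name, not something this commit adds. It builds πᵐ/m! as a running product so that neither the power nor the factorial overflows.

```typescript
// Hypothetical helper (not part of the commit): volume of the unit ball in
// even dimension n = 2m, using Vol = pi^m / m!.
function unitBallVolumeEven(m: number): number {
  if (!Number.isInteger(m) || m < 0) {
    throw new RangeError("m must be a non-negative integer");
  }
  let volume = 1; // m = 0 gives the empty product, 1
  for (let k = 1; k <= m; k++) {
    // Multiply by pi/k each step: pi^m / m! = (pi/1)(pi/2)...(pi/m)
    volume *= Math.PI / k;
  }
  return volume;
}

// Print a few dimensions to see the rise and collapse.
for (const m of [1, 2, 3, 4, 8, 16, 64]) {
  console.log(`n = ${2 * m}: ${unitBallVolumeEven(m).toExponential(3)}`);
}
```

Under this formula the volume rises briefly (about 3.14, 4.93, 5.17 for m = 1, 2, 3) and then falls; by m = 64 (n = 128) it is below 1e-50.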

README.md

Lines changed: 1 addition & 1 deletion
@@ -94,7 +94,7 @@ This is the "dreaming" phase that prevents catastrophic forgetting and forces ab
 ## Core Design Principles

 - **Biological Scarcity** — Only a fixed number of active prototypes live in memory. Everything else is gracefully demoted to disk.
-- **Sublinear Growth (Williams Bound)**The resident hotpath index is bounded to H(t) = ⌈c·√(t·log₂(1+t))⌉ where t = total graph mass (pages + edges). Memory scales sublinearly as the graph grows, trading time for space at a mathematically principled rate. See [`DESIGN.md`](DESIGN.md) for the full theorem mapping.
+- **Sublinear Growth (Williams Bound)**In `n`-dimensional embedding space the unit-ball volume collapses as `πᵐ/m!` (n = 2m). This geometric fact — the curse of dimensionality — makes linear-scale data structures infeasible as corpora grow. CORTEX counters it with the Williams 2025 result S = O(√(t log t)), used as a universal sublinear growth law: the resident hotpath index is bounded to H(t) = ⌈c·√(t·log₂(1+t))⌉, with the same formula driving hierarchy fanout limits, semantic-neighbor degree caps, and Daydreamer maintenance batch sizes. Every space-or-time budget scales sublinearly, keeping the engine on-device at any corpus size. See [`DESIGN.md`](DESIGN.md) for the full theorem mapping.
 - **Three-Zone Memory** — HOT (resident in-memory index, capacity H(t)), WARM (indexed in IndexedDB, reachable via nearest-neighbor search), COLD (metadata in IndexedDB + raw vectors in OPFS, but semantically isolated from the search path — no strong nearest neighbors in vector space at insertion time; only discoverable by a deliberate random walk). All data is retained locally forever; zones control lookup cost and discoverability, not data lifetime.
 - **Hierarchical & Sparse** — Progressive dimensionality reduction + medoid clustering keeps memory efficient at any scale, with Williams-derived fanout bounds preventing any single tier from monopolising the index.
 - **Hebbian & Dynamic** — Connections strengthen and weaken naturally. Node salience (σ = α·H_in + β·R + γ·Q) drives promotion into and eviction from the resident hotpath.
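To make the H(t) formula in the README bullet concrete, here is a minimal standalone restatement. The name `hotpathCapacity` and the explicit `c` argument are assumptions made for illustration; the project's own function (`computeCapacity`) and its default `c` are referenced in the diff but their bodies and values are not shown, so nothing here should be read as their actual implementation.

```typescript
// Hypothetical restatement of H(t) = ceil(c * sqrt(t * log2(1 + t))).
// `graphMass` is t (pages + edges); `c` is the scaling constant, passed
// explicitly because the real default is not visible in this diff.
function hotpathCapacity(graphMass: number, c: number): number {
  if (graphMass < 0 || c <= 0) {
    throw new RangeError("graphMass must be >= 0 and c must be > 0");
  }
  return Math.ceil(c * Math.sqrt(graphMass * Math.log2(1 + graphMass)));
}

// Capacity as a share of the graph shrinks as the graph grows.
for (const t of [1e3, 1e4, 1e5, 1e6]) {
  const h = hotpathCapacity(t, 1);
  console.log(`t = ${t}: H(t) = ${h} (${((100 * h) / t).toFixed(2)}% of t)`);
}
```

With `c = 1`, a graph mass of 1,000 yields a capacity of 100 (10% of the graph), while a graph mass of 1,000,000 yields 4,465 (under 0.5%), which is the sublinear behaviour the bullet describes.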

core/HotpathPolicy.ts

Lines changed: 109 additions & 0 deletions
@@ -194,3 +194,112 @@ export function deriveCommunityQuotas(
   for (let i = 0; i < n; i++) quotas[i] += floors[i];
   return quotas;
 }
+
+// ---------------------------------------------------------------------------
+// Semantic neighbor degree limit — Williams-bound derived
+// ---------------------------------------------------------------------------
+
+// Bootstrap floor for Williams-bound log formulas: ensures t_eff ≥ 2 so that
+// log₂(t_eff) > 0 and log₂(log₂(1+t_eff)) is defined and positive.
+const MIN_GRAPH_MASS_FOR_LOGS = 2;
+
+/**
+ * Compute the Williams-bound-derived maximum degree for the semantic neighbor
+ * graph given a corpus of `graphMass` total pages.
+ *
+ * The degree limit uses the same H(t) formula as the hotpath capacity but is
+ * bounded by a hard cap to keep the graph sparse. At small corpora the
+ * Williams formula naturally returns small values (e.g. 1–5 for t < 10);
+ * at large corpora the `hardCap` clamps growth to prevent the graph becoming
+ * too dense.
+ *
+ * @param graphMass Total number of pages in the corpus.
+ * @param c Williams Bound scaling constant (default from policy).
+ * @param hardCap Maximum degree regardless of formula result. Default: 32.
+ */
+export function computeNeighborMaxDegree(
+  graphMass: number,
+  c: number = DEFAULT_HOTPATH_POLICY.c,
+  hardCap = 32,
+): number {
+  const derived = computeCapacity(graphMass, c);
+  return Math.min(hardCap, Math.max(1, derived));
+}
+
+// ---------------------------------------------------------------------------
+// Dynamic subgraph expansion bounds — Williams-bound derived
+// ---------------------------------------------------------------------------
+
+export interface SubgraphBounds {
+  /** Maximum number of nodes to include in the induced subgraph. */
+  maxSubgraphSize: number;
+  /** Maximum BFS hops from seed nodes. */
+  maxHops: number;
+  /** Maximum fanout per hop (branching factor). */
+  perHopBranching: number;
+}
+
+/**
+ * Compute dynamic Williams-derived bounds for subgraph expansion (step 9 of
+ * the Cortex query path).
+ *
+ * Formulas from DESIGN.md "Dynamic Subgraph Expansion Bounds":
+ *
+ *   t_eff = max(t, 2)
+ *   maxSubgraphSize = min(30, ⌊√(t_eff · log₂(1+t_eff)) / log₂(t_eff)⌋)
+ *   maxHops = max(1, ⌈log₂(log₂(1 + t_eff))⌉)
+ *   perHopBranching = max(1, ⌊maxSubgraphSize ^ (1/maxHops)⌋)
+ *
+ * The bootstrap floor `t_eff = max(t, 2)` eliminates division-by-zero for
+ * t ≤ 1 and ensures a safe minimum of `maxSubgraphSize=1, maxHops=1`.
+ *
+ * @param graphMass Total number of pages in the corpus.
+ */
+export function computeSubgraphBounds(graphMass: number): SubgraphBounds {
+  const tEff = Math.max(graphMass, MIN_GRAPH_MASS_FOR_LOGS);
+  const log2tEff = Math.log2(tEff);
+
+  const maxSubgraphSize = Math.min(
+    30,
+    Math.floor(Math.sqrt(tEff * Math.log2(1 + tEff)) / log2tEff),
+  );
+
+  const maxHops = Math.max(1, Math.ceil(Math.log2(Math.log2(1 + tEff))));
+
+  const perHopBranching = Math.max(
+    1,
+    Math.floor(Math.pow(maxSubgraphSize, 1 / maxHops)),
+  );
+
+  return {
+    maxSubgraphSize: Math.max(1, maxSubgraphSize),
+    maxHops,
+    perHopBranching,
+  };
+}
+
+// ---------------------------------------------------------------------------
+// Williams-derived hierarchy fanout limit
+// ---------------------------------------------------------------------------
+
+/**
+ * Compute the Williams-derived fanout limit for a hierarchy node that
+ * currently has `childCount` children.
+ *
+ * Per DESIGN.md "Sublinear Fanout Bounds":
+ *   Max children = O(√(childCount · log childCount))
+ *
+ * The formula is evaluated with a bootstrap floor of t_eff = max(t, 2) to
+ * avoid log(0) and returns at least 1 child.
+ *
+ * @param childCount Current number of children for the parent node.
+ * @param c Williams Bound scaling constant.
+ */
+export function computeFanoutLimit(
+  childCount: number,
+  c: number = DEFAULT_HOTPATH_POLICY.c,
+): number {
+  const tEff = Math.max(childCount, MIN_GRAPH_MASS_FOR_LOGS);
+  const raw = c * Math.sqrt(tEff * Math.log2(1 + tEff));
+  return Math.max(1, Math.ceil(raw));
+}
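To see what the three subgraph-bound formulas in this hunk evaluate to, the sketch below restates them in a form that runs in isolation. `subgraphBounds` and `Bounds` are stand-in names chosen for this example; the arithmetic follows the docstring in the diff, but this is a reading aid rather than the project's code.

```typescript
// Stand-in restatement of the docstring formulas, so they can be run alone.
interface Bounds {
  maxSubgraphSize: number;
  maxHops: number;
  perHopBranching: number;
}

function subgraphBounds(graphMass: number): Bounds {
  const tEff = Math.max(graphMass, 2); // bootstrap floor keeps log2(tEff) > 0
  const size = Math.max(
    1,
    Math.min(
      30,
      Math.floor(Math.sqrt(tEff * Math.log2(1 + tEff)) / Math.log2(tEff)),
    ),
  );
  const maxHops = Math.max(1, Math.ceil(Math.log2(Math.log2(1 + tEff))));
  // Integer branching factor b such that b^maxHops stays within size.
  const perHopBranching = Math.max(1, Math.floor(Math.pow(size, 1 / maxHops)));
  return { maxSubgraphSize: size, maxHops, perHopBranching };
}

for (const t of [0, 1000, 1e6]) {
  console.log(t, subgraphBounds(t));
}
```

For these inputs the result is size 1 / 1 hop at `t = 0`, size 10 / 4 hops at `t = 1000`, and the capped size 30 / 5 hops at `t = 1e6`. Because the size is capped at 30 while the hop count keeps growing, the per-hop branching factor evaluates to 1 in all three cases.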

cortex/Query.ts

Lines changed: 20 additions & 5 deletions
@@ -2,6 +2,7 @@ import type { ModelProfile } from "../core/ModelProfile";
 import type { Hash, MetadataStore, Page, VectorStore } from "../core/types";
 import type { EmbeddingRunner } from "../embeddings/EmbeddingRunner";
 import { runPromotionSweep } from "../core/SalienceEngine";
+import { computeSubgraphBounds } from "../core/HotpathPolicy";
 import type { QueryResult } from "./QueryResult";
 import { rankPages, spillToWarm } from "./Ranking";
 import { buildMetroid } from "./MetroidBuilder";
@@ -14,9 +15,13 @@ export interface QueryOptions {
   vectorStore: VectorStore;
   metadataStore: MetadataStore;
   topK?: number;
-  /** BFS depth for semantic neighbor subgraph expansion. 2 hops covers direct
-   * neighbors and their neighbors, which is the minimum needed to surface
-   * bridge nodes without exploding the graph size. */
+  /**
+   * Maximum BFS depth for semantic neighbor subgraph expansion.
+   *
+   * When omitted, a dynamic Williams-derived value is computed from the
+   * corpus size via `computeSubgraphBounds(t)`. Providing an explicit value
+   * overrides the dynamic bound (useful for tests and controlled experiments).
+   */
   maxHops?: number;
 }

@@ -30,7 +35,6 @@ export async function query(
     vectorStore,
     metadataStore,
     topK = 10,
-    maxHops = 2,
   } = options;
   const nowIso = new Date().toISOString();

@@ -116,8 +120,19 @@ export async function query(
   );

   // --- Subgraph expansion ---
+  // Use dynamic Williams-derived bounds unless the caller has pinned an
+  // explicit maxHops value. Only load all pages when we actually need to
+  // compute bounds — skip the full-page scan on the hot path when maxHops is
+  // already known.
   const topPageIds = topPages.map((p) => p.pageId);
-  const subgraph = await metadataStore.getInducedNeighborSubgraph(topPageIds, maxHops);
+  let effectiveMaxHops: number;
+  if (options.maxHops !== undefined) {
+    effectiveMaxHops = options.maxHops;
+  } else {
+    const allPages = await metadataStore.getAllPages();
+    effectiveMaxHops = computeSubgraphBounds(allPages.length).maxHops;
+  }
+  const subgraph = await metadataStore.getInducedNeighborSubgraph(topPageIds, effectiveMaxHops);

   // --- TSP coherence path ---
   const coherencePath = solveOpenTSP(subgraph);
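The override-or-derive branch in this hunk can be isolated into a small function for testing. The sketch below is a simplification under stated assumptions: `resolveMaxHops` is a hypothetical name, and the page count is supplied through a synchronous callback, whereas the diff obtains it from an asynchronous `metadataStore.getAllPages()` call. The point it illustrates is only that the count is fetched lazily, when no explicit value is pinned.

```typescript
// Hypothetical helper: pick the BFS depth for subgraph expansion.
// `countPages` is only invoked when the caller did not pin maxHops.
function resolveMaxHops(
  explicitMaxHops: number | undefined,
  countPages: () => number,
): number {
  if (explicitMaxHops !== undefined) {
    return explicitMaxHops; // caller override wins, no corpus scan needed
  }
  const tEff = Math.max(countPages(), 2); // same bootstrap floor as the diff
  return Math.max(1, Math.ceil(Math.log2(Math.log2(1 + tEff))));
}

// Usage: count how often the (potentially expensive) scan runs.
let scans = 0;
const scan = (): number => {
  scans += 1;
  return 1000;
};
console.log(resolveMaxHops(2, scan), scans); // pinned value, scan skipped
console.log(resolveMaxHops(undefined, scan), scans); // derived from corpus size
```

Passing the count as a callback rather than a number keeps the expensive lookup out of the path where `maxHops` is already known, which is the behaviour the comment in the diff describes.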

daydreamer/ClusterStability.ts

Lines changed: 24 additions & 30 deletions
@@ -1,22 +1,36 @@
 // ---------------------------------------------------------------------------
-// ClusterStability — Community detection via label propagation (P2-F)
+// ClusterStability — Community detection via label propagation (P2-F) and
+// volume split/merge for balanced cluster maintenance (P2-F3)
 // ---------------------------------------------------------------------------
 //
 // Assigns community labels to pages by running lightweight label propagation
-// on the semantic (Metroid) neighbor graph. Labels are stored in
+// on the semantic neighbor graph. Labels are stored in
 // PageActivity.communityId and propagate into SalienceEngine community quotas.
 //
 // Label propagation terminates when assignments stabilise (no label changes)
 // or a maximum iteration limit is reached.
+//
+// The Daydreamer background worker also calls ClusterStability periodically to
+// detect and fix unstable volumes:
+// - HIGH-VARIANCE volumes are split into two balanced sub-volumes.
+// - LOW-COUNT volumes are merged into the nearest neighbour volume.
+// - Community labels are updated after structural changes.
 // ---------------------------------------------------------------------------

-import type { Hash, MetadataStore, PageActivity } from "../core/types";
+import { hashText } from "../core/crypto/hash";
+import type {
+  Book,
+  Hash,
+  MetadataStore,
+  PageActivity,
+  Volume,
+} from "../core/types";

 // ---------------------------------------------------------------------------
-// Options
+// Label propagation options
 // ---------------------------------------------------------------------------

-export interface ClusterStabilityOptions {
+export interface LabelPropagationOptions {
   metadataStore: MetadataStore;
   /** Maximum number of label propagation iterations. Default: 20. */
   maxIterations?: number;
@@ -55,7 +69,7 @@ async function propagationPass(
   const sorted = [...pageIds].sort();

   for (const pageId of sorted) {
-    const neighbors = await metadataStore.getMetroidNeighbors(pageId);
+    const neighbors = await metadataStore.getSemanticNeighbors(pageId);
     if (neighbors.length === 0) continue;

     // Count neighbor labels
@@ -103,7 +117,7 @@ async function propagationPass(
  * `MetadataStore.putPageActivity`.
  */
 export async function runLabelPropagation(
-  options: ClusterStabilityOptions,
+  options: LabelPropagationOptions,
 ): Promise<LabelPropagationResult> {
   const {
     metadataStore,
@@ -200,32 +214,12 @@ export function detectEmptyCommunities(
     }
   }
   return empty;
-// ClusterStability — Volume split/merge for balanced cluster maintenance
+}
+
 // ---------------------------------------------------------------------------
-//
-// The Daydreamer background worker calls ClusterStability periodically to
-// detect and fix unstable volumes:
-//
-// - HIGH-VARIANCE volumes are split into two balanced sub-volumes using
-// K-means with K=2 (one pass).
-// - LOW-COUNT volumes are merged into the nearest neighbour volume
-// (by medoid distance).
-// - Community labels on PageActivity records are updated after structural
-// changes so downstream salience computation stays coherent.
-//
-// All operations are idempotent: re-running on a stable set of volumes is a
-// no-op.
+// ClusterStability class — Volume split/merge configuration
 // ---------------------------------------------------------------------------

-import { hashText } from "../core/crypto/hash";
-import type {
-  Book,
-  Hash,
-  MetadataStore,
-  PageActivity,
-  Volume,
-} from "../core/types";
-
 // ---------------------------------------------------------------------------
 // Configuration
 // ---------------------------------------------------------------------------
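The header comment in this file describes label propagation that visits pages in sorted order, counts neighbor labels, and stops once no label changes or an iteration limit is hit. The following is a minimal in-memory sketch of that general technique, assuming an adjacency map instead of the project's `MetadataStore`; `propagateLabels` and its tie-breaking rule (smallest label wins) are choices made for this example and may differ from the real module.

```typescript
// Illustrative label propagation over an in-memory adjacency map.
// Not the project's implementation: storage, tie-breaking, and names are assumed.
function propagateLabels(
  neighbors: Map<string, string[]>,
  maxIterations = 20,
): Map<string, string> {
  const order = Array.from(neighbors.keys()).sort(); // deterministic visit order
  const labels = new Map<string, string>();
  order.forEach((id) => labels.set(id, id)); // every node starts as its own community

  for (let iter = 0; iter < maxIterations; iter++) {
    let changed = false;
    for (const id of order) {
      const adjacent = neighbors.get(id) ?? [];
      if (adjacent.length === 0) continue; // isolated nodes keep their label

      // Count the labels currently held by this node's neighbors.
      const counts = new Map<string, number>();
      for (const n of adjacent) {
        const label = labels.get(n) ?? n;
        counts.set(label, (counts.get(label) ?? 0) + 1);
      }

      // Adopt the most frequent label; break ties by smallest label.
      let best = "";
      let bestCount = 0;
      counts.forEach((count, label) => {
        if (count > bestCount || (count === bestCount && label < best)) {
          best = label;
          bestCount = count;
        }
      });

      if (best !== labels.get(id)) {
        labels.set(id, best);
        changed = true;
      }
    }
    if (!changed) break; // assignments have stabilised
  }
  return labels;
}
```

On a graph made of two disconnected triangles, this sketch assigns one shared label to each triangle and keeps the two labels distinct, converging well before the 20-iteration default.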
