feat: emit '<external>' sentinel for cross-corpus source_file#880
feat: emit '<external>' sentinel for cross-corpus source_file#880josephmcknight-bot wants to merge 2 commits into
Conversation
Pin the contract before implementing: - AST extractor must emit source_file="<external>" for cross-corpus stub nodes (currently emits "") - Validator must reject empty/None source_file as missing (currently only checks key presence) - skill.md prompt must document the <external> sentinel so the LLM extractor stays aligned with the AST extractor 5 new failing tests + 1 fixture (sample_external_base.py).
Pin the source_file contract so cross-corpus references (framework base classes referenced via inheritance, etc.) are distinguishable from extraction bugs: - AST extractor: external-base stub nodes now use source_file="<external>" instead of an empty string. Module docstring documents the contract: source_file is either a path or the literal string "<external>". - Validator: empty string and None for source_file are now flagged as missing required field (previously only flagged when the key itself was absent). The "<external>" sentinel is accepted as valid. - skill.md (LLM extraction prompt): source_file is explicitly called out as REQUIRED on every node and edge, including document/rationale-style nodes. Prompt now references the "<external>" sentinel so the LLM extractor stays aligned with the AST extractor. Why: on a polyglot corpus, every `graphify update` emits hundreds of "missing required field 'source_file'" warnings. Some are real extraction bugs (LLM omitted the field); most are framework types referenced via inheritance but never defined locally. The contract change separates real bugs from cross-corpus references at the schema level so downstream consumers can filter cleanly.
safishamsi
left a comment
There was a problem hiding this comment.
Nice contribution — the external-source detection idea is solid. A few things to address before merge:
1. Export is_external_source()
The function is currently private (or not exported from detect.py's public surface). Since callers outside the module will use it, please export it explicitly (add to __all__ or make it a top-level public function).
2. Back-compat for existing nodes
Nodes written by older versions of graphify won't have a node_type attribute. Make sure any code that reads node_type guards with .get('node_type') rather than bracket access to avoid KeyError on existing graphs.
3. Soften the PR description
The term "concept-node" isn't used anywhere in the user-facing docs — it may confuse contributors. Consider describing it as "nodes inferred from semantic content rather than source structure" in the description/changelog.
Fix those three and this is good to merge.
The noise problem
Running
graphify update .on any polyglot codebase emits hundreds ofmissing required field 'source_file'warnings. On a real-world corpusof 78,809 nodes / 87,141 edges / 5,771 communities, 193 nodes triggered
the warning. Digging in:
DbContext,Endpoint,Migration,IDesignTimeDbContextFactory, etc.) referenced via inheritance butnever defined locally. The AST extractor today emits these as stub
nodes with
source_file="", which the validator flags. These arefalse positives — the symbols genuinely don't live in the parsed
corpus.
nodes where the subagent elided
source_file). These are the ones auser actually wants to see.
Today both classes look identical in the validator output, so the
signal-to-noise ratio buries the real bugs.
What this PR does
Pin the
source_filecontract at the schema level so the two classesare distinguishable:
graphify/extract.py— the Python AST extractor's external-basestub branch now emits
source_file="<external>"instead of"".The module docstring documents the contract:
source_fileisexactly one of (a) a path string or (b) the literal
"<external>"sentinel meaning "lives outside the parsed corpus".
graphify/validate.py—validate_extractionnow treatsempty-string and
Nonesource_fileas "missing required field"(it previously only flagged when the key itself was absent), and
accepts
"<external>"as a valid value. This keeps the LLM-omissionbug visible while silencing the framework-type false positives.
graphify/skill.md— the extraction-subagent prompt templatenow explicitly marks
source_fileas REQUIRED on every node andedge (including document/rationale-style nodes) and documents the
<external>sentinel so the LLM extractor stays aligned with theAST extractor.
Tests
TDD: 5 failing tests committed first, implementation flipped them
green. Total of 8 new tests across 3 files:
tests/test_extract.py— 3 tests for the AST sentinel emission(plus a
sample_external_base.pyfixture).tests/test_validate.py— 4 tests covering sentinel acceptance andempty/None rejection on both nodes and edges.
tests/test_skill_prompt.py— 2 tests pinning the prompt wording.Full upstream suite green pre + post change. The single failure in the
graspologic-dependent tests on Python 3.13 is unrelated to thischange (build failure on Windows / Py3.13 only — CI matrix on Py3.10
Backwards compatibility
.graphify_labels.jsonor any persisted file shape.source_filepaths continue to work unchanged.and
Nonesource_filevalues are now flagged where they previouslyslipped through. Any extractor relying on the old behavior should
use
"<external>"for cross-corpus symbols and a real path otherwise.