Skip to content

feat: emit '<external>' sentinel for cross-corpus source_file#880

Open
josephmcknight-bot wants to merge 2 commits into
safishamsi:mainfrom
josephmcknight-bot:feature/external-sentinel-source-file
Open

feat: emit '<external>' sentinel for cross-corpus source_file#880
josephmcknight-bot wants to merge 2 commits into
safishamsi:mainfrom
josephmcknight-bot:feature/external-sentinel-source-file

Conversation

@josephmcknight-bot
Copy link
Copy Markdown

The noise problem

Running graphify update . on any polyglot codebase emits hundreds of
missing required field 'source_file' warnings. On a real-world corpus
of 78,809 nodes / 87,141 edges / 5,771 communities, 193 nodes triggered
the warning. Digging in:

  • ~95 are framework types (DbContext, Endpoint, Migration,
    IDesignTimeDbContextFactory, etc.) referenced via inheritance but
    never defined locally. The AST extractor today emits these as stub
    nodes with source_file="", which the validator flags. These are
    false positives — the symbols genuinely don't live in the parsed
    corpus.
  • ~98 are real LLM-extractor bugs (55 rationale + 43 document
    nodes where the subagent elided source_file). These are the ones a
    user actually wants to see.

Today both classes look identical in the validator output, so the
signal-to-noise ratio buries the real bugs.

What this PR does

Pin the source_file contract at the schema level so the two classes
are distinguishable:

  1. graphify/extract.py — the Python AST extractor's external-base
    stub branch now emits source_file="<external>" instead of "".
    The module docstring documents the contract: source_file is
    exactly one of (a) a path string or (b) the literal "<external>"
    sentinel meaning "lives outside the parsed corpus".

  2. graphify/validate.pyvalidate_extraction now treats
    empty-string and None source_file as "missing required field"
    (it previously only flagged when the key itself was absent), and
    accepts "<external>" as a valid value. This keeps the LLM-omission
    bug visible while silencing the framework-type false positives.

  3. graphify/skill.md — the extraction-subagent prompt template
    now explicitly marks source_file as REQUIRED on every node and
    edge (including document/rationale-style nodes) and documents the
    <external> sentinel so the LLM extractor stays aligned with the
    AST extractor.

Tests

TDD: 5 failing tests committed first, implementation flipped them
green. Total of 8 new tests across 3 files:

  • tests/test_extract.py — 3 tests for the AST sentinel emission
    (plus a sample_external_base.py fixture).
  • tests/test_validate.py — 4 tests covering sentinel acceptance and
    empty/None rejection on both nodes and edges.
  • tests/test_skill_prompt.py — 2 tests pinning the prompt wording.

Full upstream suite green pre + post change. The single failure in the
graspologic-dependent tests on Python 3.13 is unrelated to this
change (build failure on Windows / Py3.13 only — CI matrix on Py3.10

  • Py3.12 is unaffected).

Backwards compatibility

  • No change to .graphify_labels.json or any persisted file shape.
  • Existing source_file paths continue to work unchanged.
  • The validator becomes stricter in one specific case: empty-string
    and None source_file values are now flagged where they previously
    slipped through. Any extractor relying on the old behavior should
    use "<external>" for cross-corpus symbols and a real path otherwise.

Pin the contract before implementing:
- AST extractor must emit source_file="<external>" for cross-corpus stub
  nodes (currently emits "")
- Validator must reject empty/None source_file as missing (currently only
  checks key presence)
- skill.md prompt must document the <external> sentinel so the LLM
  extractor stays aligned with the AST extractor

5 new failing tests + 1 fixture (sample_external_base.py).
Pin the source_file contract so cross-corpus references (framework base
classes referenced via inheritance, etc.) are distinguishable from
extraction bugs:

- AST extractor: external-base stub nodes now use source_file="<external>"
  instead of an empty string. Module docstring documents the contract:
  source_file is either a path or the literal string "<external>".
- Validator: empty string and None for source_file are now flagged as
  missing required field (previously only flagged when the key itself
  was absent). The "<external>" sentinel is accepted as valid.
- skill.md (LLM extraction prompt): source_file is explicitly called out
  as REQUIRED on every node and edge, including document/rationale-style
  nodes. Prompt now references the "<external>" sentinel so the LLM
  extractor stays aligned with the AST extractor.

Why: on a polyglot corpus, every `graphify update` emits hundreds of
"missing required field 'source_file'" warnings. Some are real
extraction bugs (LLM omitted the field); most are framework types
referenced via inheritance but never defined locally. The contract
change separates real bugs from cross-corpus references at the schema
level so downstream consumers can filter cleanly.
Copy link
Copy Markdown
Owner

@safishamsi safishamsi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice contribution — the external-source detection idea is solid. A few things to address before merge:

1. Export is_external_source()
The function is currently private (or not exported from detect.py's public surface). Since callers outside the module will use it, please export it explicitly (add to __all__ or make it a top-level public function).

2. Back-compat for existing nodes
Nodes written by older versions of graphify won't have a node_type attribute. Make sure any code that reads node_type guards with .get('node_type') rather than bracket access to avoid KeyError on existing graphs.

3. Soften the PR description
The term "concept-node" isn't used anywhere in the user-facing docs — it may confuse contributors. Consider describing it as "nodes inferred from semantic content rather than source structure" in the description/changelog.

Fix those three and this is good to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants