docs: clarify code-only semantic extraction #912

Open
balloon72 wants to merge 1 commit into safishamsi:v8 from balloon72:docs/code-semantic-extraction

Conversation

@balloon72
Contributor

Summary

  • Clarify that code files are handled by the local Tree-sitter/deterministic extraction path.
  • State that code-only corpora skip the semantic LLM pass.
  • Keep semantic extraction scoped to docs, papers, images, and transcripts.

Tests

  • python -m pytest tests/test_install_strings.py

Notes

Closes #836

@balloon72 balloon72 marked this pull request as ready for review May 18, 2026 01:47
@safishamsi
Owner

Thanks for addressing #836 — the second sentence (code-only corpora skip the LLM pass entirely) is accurate and worth documenting. The first sentence needs a correction though.

"Code files are not sent to the LLM semantic extractor in the normal pipeline."

This is only true for code-only corpora. In mixed corpora (code + docs/papers/images), code files do go through the LLM — skill.md line 318 explicitly says: "Code files: focus on semantic edges AST cannot find (call relationships, shared data, arch patterns)." The current wording would mislead users with a mixed corpus.

Suggested replacement:

In a code-only corpus, the LLM semantic pass is skipped entirely — Tree-sitter handles all extraction. In mixed corpora (code alongside docs, papers, or images), code files also go through the LLM, but only to extract semantic edges that AST cannot find (call relationships, shared data, architectural patterns) — never to re-extract imports or structure.

Update the test to match and this is good to go.
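
For readers following the thread, the routing rule described in the suggested replacement can be sketched roughly as below. This is an illustrative sketch only, not the project's actual code: `CODE_EXTS`, `is_code`, `plan_extraction`, and the pass names are all hypothetical, and the real pipeline may classify files differently.

```python
from pathlib import Path

# Hypothetical set of code extensions; the project's real list may differ.
CODE_EXTS = {".py", ".js", ".ts", ".go", ".rs"}


def is_code(path: str) -> bool:
    """Classify a file as code purely by its extension (an assumption)."""
    return Path(path).suffix.lower() in CODE_EXTS


def plan_extraction(paths: list[str]) -> dict[str, list[str]]:
    """Map each file to the extraction passes it would go through.

    Pass names are illustrative labels:
      "ast"                -> local Tree-sitter / deterministic extraction
      "llm_semantic_edges" -> LLM pass limited to edges AST cannot find
      "llm_semantic"       -> full LLM semantic pass for non-code files
    """
    # A corpus counts as code-only when every file is a code file.
    code_only = all(is_code(p) for p in paths)

    plan: dict[str, list[str]] = {}
    for p in paths:
        if is_code(p):
            passes = ["ast"]
            # In a mixed corpus, code files also get the LLM pass, but only
            # for semantic edges, never to re-extract imports or structure.
            if not code_only:
                passes.append("llm_semantic_edges")
        else:
            passes = ["llm_semantic"]
        plan[p] = passes
    return plan
```

Under these assumptions, `plan_extraction(["a.py", "b.go"])` assigns only the `"ast"` pass to each file, while adding a file such as `notes.md` to the corpus causes the code files to pick up `"llm_semantic_edges"` as well.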

Development

Successfully merging this pull request may close these issues.

Do code files need semantic extract?

2 participants