fix: Generate llm docs off post-processed html #6424
mike-plummer wants to merge 4 commits into main
cypress-documentation

| Project | cypress-documentation |
|---|---|
| Branch Review | mikep/fix-markdown-processing |
| Run duration | 07m 36s |
| Committer | Mike Plummer |
| Test results | 0, 0, 0, 0, 322 |
| UI Coverage | 47.51% (337, 3) |
| Accessibility | 96.88% (1 critical, 3 serious, 4 moderate, 1 minor; 35) |
    const className = codeNode?.className || '';

    const languageMatch = className.match(/language-(\w+)/);
    const language = languageMatch ? languageMatch[1] : '';
Language regex truncates hyphenated Prism language names
Low Severity
The regex /language-(\w+)/ uses \w which matches [a-zA-Z0-9_] but not hyphens. Prism language identifiers like shell-session, css-extras, or c-like would be truncated to just the first segment (e.g., shell), producing incorrect language hints on fenced code blocks in the markdown output.
Reviewed by Cursor Bugbot for commit ed03688. Configure here.
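One possible fix for the truncation described in this comment is to widen the capture group so that it accepts hyphens. The sketch below is illustrative only: `extractLanguage` is a hypothetical helper name, not a function from this PR, and the word-boundary prefix `(?:^|\s)` is an added assumption to avoid matching inside unrelated class names.

```typescript
// Hypothetical helper: pulls the Prism language id out of a class attribute.
// [\w-]+ accepts letters, digits, underscores, and hyphens, so ids such as
// "shell-session" are captured whole instead of stopping at the first hyphen.
function extractLanguage(className: string): string {
  const match = className.match(/(?:^|\s)language-([\w-]+)/);
  return match ? match[1] : '';
}

// The pattern from the reviewed code, kept here only for comparison.
function extractLanguageOriginal(className: string): string {
  const match = className.match(/language-(\w+)/);
  return match ? match[1] : '';
}
```

With this change, `extractLanguage('language-shell-session')` yields `shell-session`, whereas the original pattern stops at `shell`.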
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 2232e68. Configure here.
          }
        },
      },
    }
Image tags silently stripped by default sanitize-html config
Medium Severity
The sanitizeOptions don't specify allowedTags, so sanitize-html falls back to its defaults, which do not include img. Because img is a void element with no text content, disallowedTagsMode: 'discard' removes images entirely, including their alt text and src URL. All documentation images are silently lost before Turndown can convert them to markdown image syntax. Additionally, any <a> wrapping an <img> becomes an empty anchor, which the exclusiveFilter then removes as well.
Reviewed by Cursor Bugbot for commit 2232e68. Configure here.
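A minimal sketch of one way to address this: add img to the allowed tag list while leaving every other option untouched. `withImagesAllowed` and `SanitizeOptionsLike` are hypothetical names introduced here for illustration. The default tag list is passed in as a parameter so that the sketch stays self-contained and does not import the library.

```typescript
// Hypothetical, simplified view of the options object; only allowedTags is typed.
interface SanitizeOptionsLike {
  allowedTags?: string[];
  [key: string]: unknown;
}

// Returns a copy of the options in which 'img' is an allowed tag.
// If the options do not set allowedTags, the supplied defaults are used as the base.
function withImagesAllowed(
  options: SanitizeOptionsLike,
  defaultAllowedTags: string[],
): SanitizeOptionsLike {
  const tags = options.allowedTags ?? defaultAllowedTags;
  const allowedTags = tags.indexOf('img') === -1 ? [...tags, 'img'] : tags;
  return { ...options, allowedTags };
}
```

With the real library, the second argument would presumably be `sanitizeHtml.defaults.allowedTags`. Whether the src and alt attributes survive also depends on the allowedAttributes setting, which the excerpt above does not show.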


The previous approach for generating our LLM docs relied on processing the pre-built MDX files and stripping content out to arrive at simpler MD files. Since the majority of JSX in these files is for styling, it wasn't a big deal to simply drop it. Unfortunately, there are a couple of pages (namely, the Plugins List) that have substantial content rendered from JSX, which left us with no meaningful content in the final MD files.
This PR changes our MD processing to instead generate markdown from the final built HTML files, stripping the HTML down to just the content area. Unfortunately, this will make it more difficult to restructure or rearrange content in the MD files if we choose to do so in the future, but for now it gets us more complete content without substantial complexity.
Comparison:
Current production
This PR
Note
Medium Risk
Changes the LLM export pipeline to depend on built dist HTML structure and a new sanitization/HTML→Markdown conversion path, which could alter exported content and link/code formatting across many docs. Failures may surface at build/export time if expected HTML files or selectors differ.
Overview
Switches LLM doc generation from MDX AST normalization to post-build HTML extraction. MarkdownExporter now locates each doc’s generated index.html (respecting frontmatter slug), sanitizes/extracts the main content, and converts it to markdown via turndown (with a custom fenced-code-block rule).
Removes partials/MDX-specific handling. Deletes mdx-normalize, PartialsRegistry, related config (partialsMode, partials section defaults), and associated tests; adds html-normalize plus a comprehensive test suite for content extraction, filtering, and link rewriting.
Updates dependencies. Drops rehype-parse/rehype-remark from relevant packages and adds sanitize-html, turndown, turndown-plugin-gfm, and @types/turndown (plus lockfile updates).
Reviewed by Cursor Bugbot for commit ed03688. Bugbot is set up for automated code reviews on this repo. Configure here.
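The overview mentions a custom fenced-code-block rule. As a rough illustration of what such a rule has to produce, the sketch below renders a code string and language hint as a fenced block. `toFencedBlock` is a hypothetical helper, not the PR's actual turndown rule. It assumes the fence should be one backtick longer than the longest backtick run inside the code, so that the content cannot close the fence early.

```typescript
// Hypothetical helper: renders code as a fenced markdown block.
function toFencedBlock(code: string, language: string = ''): string {
  // Find the longest run of consecutive backticks inside the code.
  const runs = code.match(/`+/g) || [];
  let longest = 0;
  for (const run of runs) {
    longest = Math.max(longest, run.length);
  }
  // Use at least three backticks, and always more than the longest run found.
  const fence = '`'.repeat(Math.max(3, longest + 1));
  // Drop trailing newlines so the closing fence sits directly under the code.
  const body = code.replace(/\n+$/, '');
  return `${fence}${language}\n${body}\n${fence}`;
}
```

In an actual turndown rule, the code and language would come from the `<pre><code>` node's text content and class name. Those lookups are left out here to keep the sketch free of library calls.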