fix: Generate llm docs off post-processed html#6424

Open
mike-plummer wants to merge 4 commits into main from mikep/fix-markdown-processing

mike-plummer (Contributor) commented Apr 23, 2026

The previous approach for generating our LLM docs relied on processing the pre-built MDX files and stripping content out to arrive at simpler MD files. Since the majority of JSX in these files is for styling, it wasn't a big deal to just drop it. Unfortunately, there are a couple of pages (namely, the Plugins List) that have substantial content rendered from JSX, which left us with no meaningful content in the final MD files.

This PR changes our MD processing to instead generate markdown off the final built HTML files, stripping down the HTML to just the content area. This unfortunately will make it more difficult to restructure/rearrange content in the MD files if we choose to do so in the future, but for now it gets us more complete content without substantial complexity.

Comparison:
Current production
This PR


Note

Medium Risk
Changes the LLM export pipeline to depend on built dist HTML structure and a new sanitization/HTML→Markdown conversion path, which could alter exported content and link/code formatting across many docs. Failures may surface at build/export time if expected HTML files or selectors differ.

Overview
Switches LLM doc generation from MDX AST normalization to post-build HTML extraction. MarkdownExporter now locates each doc’s generated index.html (respecting frontmatter slug), sanitizes/extracts the main content, and converts it to markdown via turndown (with a custom fenced-code-block rule).

Removes partials/MDX-specific handling. Deletes mdx-normalize, PartialsRegistry, related config (partialsMode, partials section defaults), and associated tests; adds html-normalize plus a comprehensive test suite for content extraction, filtering, and link rewriting.

Updates dependencies. Drops rehype-parse/rehype-remark from relevant packages and adds sanitize-html, turndown, turndown-plugin-gfm, and @types/turndown (plus lockfile updates).
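
As a rough illustration of the lookup step in the overview, the sketch below maps a doc route (or a frontmatter `slug` override) to the `index.html` file expected under the build output. The function names, the `DocMeta` shape, the `dist` directory, and the assumption that `slug` is already an absolute route are all hypothetical; the actual `MarkdownExporter` logic is not shown in this description and may differ.

```typescript
// Hypothetical sketch, not the PR's implementation.
// Assumes every route "/a/b" is emitted as "<distDir>/a/b/index.html".

interface DocMeta {
  // Route derived from the file location, e.g. "/app/plugins/plugins-list".
  defaultRoute: string;
  // Optional frontmatter override; assumed here to be an absolute route.
  slug?: string;
}

function htmlPathForRoute(distDir: string, route: string): string {
  // Strip leading/trailing slashes so "/a/b" and "/a/b/" resolve identically.
  const trimmedRoute = route.replace(/^\/+|\/+$/g, '');
  const trimmedDist = distDir.replace(/\/+$/, '');
  return [trimmedDist, trimmedRoute, 'index.html']
    .filter((segment) => segment.length > 0)
    .join('/');
}

function htmlPathForDoc(distDir: string, doc: DocMeta): string {
  // A frontmatter slug, when present, takes precedence over the default route.
  return htmlPathForRoute(distDir, doc.slug ?? doc.defaultRoute);
}
```

Under these assumptions, `htmlPathForDoc('dist', { defaultRoute: '/a/b', slug: '/custom' })` resolves to `dist/custom/index.html`, and a route with or without a trailing slash maps to the same file.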

Reviewed by Cursor Bugbot for commit ed03688. Bugbot is set up for automated code reviews on this repo.

Comment thread plugins/llm/src/MarkdownExporter.ts Outdated
Comment thread plugins/llm/src/html-normalize.ts
Comment thread plugins/llm/src/html-normalize.ts Outdated

cypress Bot commented Apr 23, 2026

cypress-documentation    Run #1053

Run Properties: Passed #1053 • commit 2232e68107: Fix trailing slash handling
Project: cypress-documentation
Branch Review: mikep/fix-markdown-processing
Run status: Passed #1053
Run duration: 07m 36s
Commit: 2232e68107: Fix trailing slash handling
Committer: Mike Plummer

Test results
Failures: 0
Flaky: 0
Pending: 0
Skipped: 0
Passing: 322
UI Coverage: 47.51%
  Untested elements: 337
  Tested elements: 3
Accessibility: 96.88%
  Failed rules: 1 critical, 3 serious, 4 moderate, 1 minor
  Failed elements: 35

Comment thread plugins/llm/src/html-normalize.ts
const className = codeNode?.className || '';

const languageMatch = className.match(/language-(\w+)/);
const language = languageMatch ? languageMatch[1] : '';

Language regex truncates hyphenated Prism language names

Low Severity

The regex /language-(\w+)/ uses \w which matches [a-zA-Z0-9_] but not hyphens. Prism language identifiers like shell-session, css-extras, or c-like would be truncated to just the first segment (e.g., shell), producing incorrect language hints on fenced code blocks in the markdown output.
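
A minimal sketch of one possible fix, assuming the language is read from a `class` string such as `language-shell-session`: widening the character class to `[\w-]` lets hyphenated identifiers through. The helper name `extractLanguage` is hypothetical and only illustrates the regex change; the rule in `html-normalize.ts` may be structured differently.

```typescript
// Hypothetical helper illustrating the regex change; not the PR's code.
function extractLanguage(className: string): string {
  // [\w-]+ accepts hyphens as well as word characters, so
  // "language-shell-session" is captured whole instead of stopping at "shell".
  // The (?:^|\s) prefix anchors the match to the start of a class token.
  const match = className.match(/(?:^|\s)language-([\w-]+)/);
  return match?.[1] ?? '';
}
```

With this pattern, `extractLanguage('language-shell-session')` yields `shell-session`, `extractLanguage('prism-code language-ts')` yields `ts`, and a class string with no `language-` token yields an empty string.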

Reviewed by Cursor Bugbot for commit ed03688.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 2232e68.

}
},
},
}

Image tags silently stripped by default sanitize-html config

Medium Severity

The sanitizeOptions don't specify allowedTags, so sanitize-html uses its defaults which do not include img. Since img is a void element with no text content, disallowedTagsMode: 'discard' removes images entirely — including their alt text and src URL. All documentation images are silently lost before Turndown can convert them to markdown ![alt](url) syntax. Additionally, any <a> wrapping an <img> becomes an empty anchor that the exclusiveFilter then removes too.
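
One possible way to address this, sketched under the assumption that `sanitizeOptions` is a plain `sanitize-html` options object: extend the library defaults so that `img` survives sanitization. The PR's other options are not shown here and are omitted from this fragment, and the choice of `img` attributes is illustrative.

```typescript
import sanitizeHtml from 'sanitize-html';

// Sketch only: the remaining options from the PR's sanitizeOptions are omitted.
const sanitizeOptions: sanitizeHtml.IOptions = {
  // Start from the library defaults and additionally allow <img>.
  allowedTags: sanitizeHtml.defaults.allowedTags.concat(['img']),
  allowedAttributes: {
    ...sanitizeHtml.defaults.allowedAttributes,
    // Keep the attributes needed to emit ![alt](src) in markdown.
    img: ['src', 'alt', 'title'],
  },
  disallowedTagsMode: 'discard',
};
```

Because the defaults are extended rather than replaced, every tag that was previously allowed stays allowed; only `img` is added, so anchors that wrap an image would no longer end up empty.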

Reviewed by Cursor Bugbot for commit 2232e68.
