fix: Generate llm docs off post-processed html by mike-plummer · Pull Request #6424 · cypress-io/cypress-documentation

mike-plummer · 2026-04-23T13:54:35Z

The previous approach for generating our LLM docs relied on processing the pre-built MDX files and stripping content out to arrive at simpler MD files. Since most of the JSX in these files is for styling, simply dropping it wasn't a big deal. Unfortunately, a couple of pages (namely, the Plugins List) render substantial content from JSX, which left them with no meaningful content in the final MD files.

This PR changes our MD processing to instead generate markdown off the final built HTML files, stripping down the HTML to just the content area. This unfortunately will make it more difficult to restructure/rearrange content in the MD files if we choose to do so in the future, but for now it gets us more complete content without substantial complexity.

Comparison:
Current production
This PR

Note

Medium Risk
Changes the LLM export pipeline to depend on built dist HTML structure and a new sanitization/HTML→Markdown conversion path, which could alter exported content and link/code formatting across many docs. Failures may surface at build/export time if expected HTML files or selectors differ.

Overview
Switches LLM doc generation from MDX AST normalization to post-build HTML extraction. MarkdownExporter now locates each doc’s generated index.html (respecting frontmatter slug), sanitizes/extracts the main content, and converts it to markdown via turndown (with a custom fenced-code-block rule).
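As a rough illustration of the lookup step described above, the sketch below maps a doc to the `index.html` a static site build would typically emit for it. The `DocInfo` shape, the `builtHtmlPath` helper, and the example paths are hypothetical and are not taken from the PR; it also assumes the frontmatter slug, when present, is the full route of the page.

```typescript
// Hypothetical sketch, not the PR's implementation.
interface DocInfo {
  relativePath: string; // source path relative to the docs root (example value only)
  slug?: string; // optional frontmatter slug, assumed here to be the full route
}

function builtHtmlPath(distDir: string, doc: DocInfo): string {
  // Prefer the frontmatter slug; otherwise derive the route from the file path.
  const route =
    doc.slug ??
    doc.relativePath.replace(/\.mdx?$/, '').replace(/(^|\/)index$/, '');

  // Dropping empty segments tolerates leading and trailing slashes.
  const segments = route.split('/').filter((segment) => segment.length > 0);

  return [distDir.replace(/\/+$/, ''), ...segments, 'index.html'].join('/');
}
```

Filtering out empty segments means a slug such as `/plugins/` and a slug such as `plugins` resolve to the same file, which is one simple way to stay robust against trailing slashes.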

Removes partials/MDX-specific handling. Deletes mdx-normalize, PartialsRegistry, related config (partialsMode, partials section defaults), and associated tests; adds html-normalize plus a comprehensive test suite for content extraction, filtering, and link rewriting.

Updates dependencies. Drops rehype-parse/rehype-remark from relevant packages and adds sanitize-html, turndown, turndown-plugin-gfm, and @types/turndown (plus lockfile updates).

Reviewed by Cursor Bugbot for commit ed03688. Bugbot is set up for automated code reviews on this repo.

cypress · 2026-04-23T14:12:36Z

cypress-documentation Run #1053

Run Properties: Passed #1053 • 2232e68107: Fix trailing slash handling

Project	`cypress-documentation`
Branch Review	`mikep/fix-markdown-processing`
Run status	`Passed #1053`
Run duration	`07m 36s`
Commit	`2232e68107: Fix trailing slash handling`
Committer	`Mike Plummer`

Test results
Failures	`0`
Flaky	`0`
Pending	`0`
Skipped	`0`
Passing	`322`

UI Coverage `47.51%`
Untested elements	`337`
Tested elements	`3`

Accessibility `96.88%`
Failed rules	`1 critical` `3 serious` `4 moderate` `1 minor`
Failed elements	`35`

cursor · 2026-04-23T14:24:26Z

+        const className = codeNode?.className || '';
+
+        const languageMatch = className.match(/language-(\w+)/);
+        const language = languageMatch ? languageMatch[1] : '';


Language regex truncates hyphenated Prism language names

Low Severity

The regex /language-(\w+)/ uses \w which matches [a-zA-Z0-9_] but not hyphens. Prism language identifiers like shell-session, css-extras, or c-like would be truncated to just the first segment (e.g., shell), producing incorrect language hints on fenced code blocks in the markdown output.
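One possible fix, sketched here under the assumption that the class attribute looks like `language-<name>`, is to widen the capture group so that it also accepts hyphens. The `extractLanguage` helper name is hypothetical and only wraps the two lines from the diff above.

```typescript
// Hypothetical helper wrapping the logic from the diff above.
function extractLanguage(className: string): string {
  // [\w-]+ matches letters, digits, underscores, and hyphens, so
  // "language-shell-session" yields "shell-session" instead of "shell".
  const languageMatch = className.match(/language-([\w-]+)/);
  return languageMatch?.[1] ?? '';
}

// Usage: the class attribute may carry other class names as well.
const hyphenated = extractLanguage('prism-code language-shell-session');
const simple = extractLanguage('language-ts');
const missing = extractLanguage('codeBlock');
```

With this pattern, a class attribute without a `language-` entry still falls back to an empty string, so the fenced block is emitted without a language hint.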

Reviewed by Cursor Bugbot for commit ed03688.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 2232e68.

cursor · 2026-04-23T14:57:55Z

+      }
+    },
+  },
+}


Image tags silently stripped by default sanitize-html config

Medium Severity

The sanitizeOptions don't specify allowedTags, so sanitize-html uses its defaults which do not include img. Since img is a void element with no text content, disallowedTagsMode: 'discard' removes images entirely — including their alt text and src URL. All documentation images are silently lost before Turndown can convert them to markdown ![alt](url) syntax. Additionally, any <a> wrapping an <img> becomes an empty anchor that the exclusiveFilter then removes too.
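A minimal configuration sketch of one way to keep images through sanitization is shown below. It extends the `sanitize-html` defaults rather than replacing them. The option values are assumptions and not the PR's actual `sanitizeOptions`, and the default import assumes `esModuleInterop` is enabled.

```typescript
import sanitizeHtml from 'sanitize-html';

// Assumed values for illustration only; the PR's real options may differ.
const sanitizeOptions: sanitizeHtml.IOptions = {
  // Start from the library defaults and additionally allow <img>.
  allowedTags: [...sanitizeHtml.defaults.allowedTags, 'img'],
  allowedAttributes: {
    ...sanitizeHtml.defaults.allowedAttributes,
    // Keep only what a markdown image needs: ![alt](src "title").
    img: ['src', 'alt', 'title'],
  },
  disallowedTagsMode: 'discard',
};
```

With `img` allowed, an `<a>` that wraps only an image is no longer reduced to an empty anchor by sanitization, though whether a given `exclusiveFilter` still drops it depends on how that filter decides an anchor is empty.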

Reviewed by Cursor Bugbot for commit 2232e68.

mike-plummer added 2 commits April 23, 2026 08:50

fix: Generate llm docs off post-processed html

6700e8e

remove unused dep

e4969db

cursor Bot reviewed Apr 23, 2026

Comment thread plugins/llm/src/MarkdownExporter.ts Outdated

Comment thread plugins/llm/src/html-normalize.ts

Comment thread plugins/llm/src/html-normalize.ts Outdated

address cursor comments

ed03688

cursor Bot reviewed Apr 23, 2026

Fix trailing slash handling

2232e68

cursor Bot reviewed Apr 23, 2026

mike-plummer requested review from davidr-cy, emilyrohrbough and estrada9166 April 23, 2026 14:58

fix: Generate llm docs off post-processed html #6424
mike-plummer wants to merge 4 commits into main from mikep/fix-markdown-processing
