pre-encode non-ASCII bytes in sanitize_url before parsing #23053

Open
thijsoo wants to merge 3 commits into trunk from 22903-wpseo_utilssanitize_url-fails-on-path-with-unencoded-non-latin-characters

Conversation

@thijsoo
Contributor

@thijsoo thijsoo commented Mar 6, 2026

Resolves #22903

Context

WPSEO_Utils::sanitize_url() calls wp_parse_url(), which does not handle raw multibyte UTF-8 bytes in a URL path. URLs like https://example.com/中文路径 were therefore stripped or blanked when users saved them in fields such as the canonical URL override or a social profile URL. The issue was reported in #22903.

Summary

This PR can be summarized in the following changelog entry:

  • Updates URL sanitization to work better with UTF-8.

Relevant technical choices:

  • Pre-encode bytes in the 0x80-0xFF range with rawurlencode() before calling wp_parse_url(), so multibyte characters survive parsing and end up consistently percent-encoded in the returned URL.
  • Sequences that are already validly percent-encoded (e.g. %da%af…) consist only of ASCII characters (% and hex digits), so they are left untouched and already-encoded URLs round-trip unchanged. This is covered by the existing with_encoded_url test and the new with_mixed_encoded_and_unencoded_non_latin_url case.
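
The pre-encoding step described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code in this PR: the function name example_pre_encode_non_ascii() is hypothetical, and plain parse_url() stands in for wp_parse_url() so the snippet runs outside WordPress.

```php
<?php
// Illustrative sketch only. example_pre_encode_non_ascii() is a hypothetical
// name, not the helper added by this PR.
function example_pre_encode_non_ascii( string $url ): string {
	// Without the /u modifier, PCRE matches single bytes, so each byte of a
	// multibyte UTF-8 character is percent-encoded on its own.
	$encoded = preg_replace_callback(
		'/[\x80-\xff]/',
		static function ( array $matches ): string {
			return rawurlencode( $matches[0] );
		},
		$url
	);

	// preg_replace_callback() returns null on error; fall back to the input.
	return $encoded ?? $url;
}

// parse_url() stands in for wp_parse_url() so the sketch runs outside WordPress.
$path = parse_url( example_pre_encode_non_ascii( 'https://example.com/中文路径' ), PHP_URL_PATH );
echo $path, PHP_EOL; // /%E4%B8%AD%E6%96%87%E8%B7%AF%E5%BE%84

// Already percent-encoded input is pure ASCII and passes through unchanged.
echo example_pre_encode_non_ascii( 'https://example.com/%da%af' ), PHP_EOL; // https://example.com/%da%af
```

Because only bytes at or above 0x80 are touched, plain ASCII URLs and existing percent-escapes are returned byte-for-byte identical, which is what makes the common case a no-op.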

Test instructions

Test instructions for the acceptance test before the PR gets merged

This PR can be acceptance tested by following these steps:

  • This change should have no visible impact on existing behavior, but an impact check is still worthwhile.

Relevant test scenarios

  • Changes should be tested with the browser console open
  • Changes should be tested on different posts/pages/taxonomies/custom post types/custom taxonomies
  • Changes should be tested on different editors (Default Block/Gutenberg/Classic/Elementor/other)
  • Changes should be tested on different browsers
  • Changes should be tested on multisite

The canonical URL field is the primary surface, but the same WPSEO_Utils::sanitize_url() path runs for category/tag terms — please smoke-test a category term as well.

Test instructions for QA when the code is in the RC

  • QA should use the same steps as above.

QA can test this PR by following these steps:

Impact check

This PR affects the following parts of the plugin, which may require extra testing:

  • Anything that runs URL input through WPSEO_Utils::sanitize_url(): canonical URL (post and term), OG/Twitter image URL overrides, the site-wide social profile URLs, the Elementor sidebar canonical, and the /yoast/v1/get_head?url=… REST endpoint. A regression check on plain ASCII URLs in these fields confirms nothing changed for the common case.

Other environments

  • This PR also affects Shopify. I have added a changelog entry starting with [shopify-seo], added test instructions for Shopify and attached the Shopify label to this PR.
  • This PR also affects Yoast SEO for Google Docs. I have added a changelog entry starting with [yoast-doc-extension], added test instructions for Yoast SEO for Google Docs and attached the Google Docs Add-on label to this PR.

Documentation

  • I have written documentation for this change. For example, comments in the Relevant technical choices, comments in the code, documentation on Confluence / shared Google Drive / Yoast developer portal, or other.

Quality assurance

  • I have tested this code to the best of my abilities.
  • During testing, I had activated all plugins that Yoast SEO provides integrations for.
  • I have added unit tests to verify the code works as intended.
  • If any part of the code is behind a feature flag, my test instructions also cover cases where the feature flag is switched off.
  • I have written this PR in accordance with my team's definition of done.
  • I have checked that the base branch is correctly set.
  • I have run grunt build:images and committed the results, if my PR introduces or edits images or SVGs.

Innovation

  • No innovation project is applicable for this PR.
  • This PR falls under an innovation project. I have attached the innovation label.
  • I have added my hours to the WBSO document.

Fixes #22903

wp_parse_url() (wrapping PHP's parse_url()) corrupts multibyte UTF-8
bytes in URL paths. Pre-encoding non-ASCII bytes (\x80-\xff) with
rawurlencode() converts them to percent-encoded ASCII before parsing,
fixing URLs with unencoded non-Latin characters like Farsi, Chinese,
and Cyrillic scripts.

Resolves #22903

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@thijsoo thijsoo added the changelog: non-user-facing label Mar 6, 2026
@coveralls

coveralls commented Mar 6, 2026

Coverage Report for CI Build 252920

Coverage at 50.263% (no base build to compare)

Details

  • Coverage remained the same as the base build.
  • Patch coverage: 7 of 7 lines across 1 file are fully covered (100%).
  • No coverage regressions found.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

No coverage regressions found.


Coverage Stats

Coverage Status
Relevant Lines: 41239
Covered Lines: 20728
Line Coverage: 50.26%
Coverage Strength: 4.02 hits per line

💛 - Coveralls

…eo_utilssanitize_url-fails-on-path-with-unencoded-non-latin-characters
@thijsoo thijsoo marked this pull request as ready for review May 19, 2026 08:16
Labels

changelog: non-user-facing Needs to be included in the 'Non-userfacing' category in the changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WPSEO_Utils::sanitize_url fails on path with unencoded non-latin characters

2 participants