pre-encode non-ASCII bytes in sanitize_url before parsing#23053
Open
thijsoo wants to merge 3 commits into
wp_parse_url() (wrapping PHP's parse_url()) corrupts multibyte UTF-8 bytes in URL paths. Pre-encoding non-ASCII bytes (\x80-\xff) with rawurlencode() converts them to percent-encoded ASCII before parsing, which fixes URLs containing unencoded non-Latin characters such as Farsi, Chinese, and Cyrillic.
Resolves #22903
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
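A minimal sketch of the pre-encoding step described above, for illustration only. The helper name pre_encode_non_ascii() is hypothetical, and the sketch calls plain PHP parse_url() rather than wp_parse_url() so it runs without WordPress; the actual change inside WPSEO_Utils::sanitize_url() may be structured differently.

```php
<?php
/**
 * Hypothetical helper (not the plugin's actual code): percent-encode every
 * byte in the 0x80-0xFF range and leave all ASCII bytes untouched.
 */
function pre_encode_non_ascii( string $url ): string {
    // No /u modifier: the pattern matches raw bytes, not code points.
    $result = preg_replace_callback(
        '/[\x80-\xff]+/',
        static function ( array $matches ): string {
            return rawurlencode( $matches[0] );
        },
        $url
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $url;
}

// Raw multibyte path: each UTF-8 byte becomes a %XX escape.
$encoded = pre_encode_non_ascii( 'https://example.com/中文路径' );
echo $encoded, PHP_EOL;
// https://example.com/%E4%B8%AD%E6%96%87%E8%B7%AF%E5%BE%84

// Already-encoded input is pure ASCII, so it passes through unchanged.
echo pre_encode_non_ascii( 'https://example.com/%da%af' ), PHP_EOL;
// https://example.com/%da%af

// The parser now only ever sees ASCII in the path.
echo parse_url( $encoded, PHP_URL_PATH ), PHP_EOL;
```

Because "%" and the hex digits are ASCII, the byte-range match never touches existing escapes, which is why no separate "is this already encoded?" check is needed in this sketch.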
Coverage Report for CI Build 252920
Coverage at 50.263% (no base build to compare)
Uncovered Changes: No uncovered changes found.
Coverage Regressions: No coverage regressions found.
💛 - Coveralls
…eo_utilssanitize_url-fails-on-path-with-unencoded-non-latin-characters
Resolves #22903
Context
WPSEO_Utils::sanitize_url() calls wp_parse_url(), which does not handle raw multibyte UTF-8 bytes in a URL path. URLs like https://example.com/中文路径 were therefore stripped or blanked when users saved them in fields such as the canonical URL override or a social profile URL. The issue was reported in #22903.
Summary
This PR can be summarized in the following changelog entry:
Relevant technical choices:
Bytes in the 0x80–0xFF range are pre-encoded with rawurlencode before calling wp_parse_url(), so multibyte characters survive parsing and end up consistently percent-encoded in the returned URL.
Already percent-encoded sequences (%da%af…) are ASCII (%, hex digits) and are left untouched, so already-encoded URLs round-trip unchanged. Covered by the existing with_encoded_url test and the new with_mixed_encoded_and_unencoded_non_latin_url case.
Test instructions
Test instructions for the acceptance test before the PR gets merged
This PR can be acceptance tested by following these steps:
Relevant test scenarios
The canonical URL field is the primary surface, but the same WPSEO_Utils::sanitize_url() path runs for category/tag terms; please smoke-test a category term as well.
Test instructions for QA when the code is in the RC
QA can test this PR by following these steps:
Impact check
This PR affects the following parts of the plugin, which may require extra testing:
WPSEO_Utils::sanitize_url(): canonical URL (post and term), OG/Twitter image URL overrides, the site-wide social profile URLs, the Elementor sidebar canonical, and the /yoast/v1/get_head?url=… REST endpoint. A regression check on plain ASCII URLs in these fields confirms nothing changed for the common case.
Other environments
[shopify-seo], added test instructions for Shopify and attached the Shopify label to this PR.
[yoast-doc-extension], added test instructions for Yoast SEO for Google Docs and attached the Google Docs Add-on label to this PR.
Documentation
Quality assurance
grunt build:images and committed the results, if my PR introduces or edits images or SVGs.
Innovation
innovation label.
Fixes #22903