Skip to content

Fix URL encoding for non-ASCII bytes#738

Open
fallintoplace wants to merge 1 commit into
apache:mainfrom
fallintoplace:fix/url-encoder-high-bit-bytes
Open

Fix URL encoding for non-ASCII bytes#738
fallintoplace wants to merge 1 commit into
apache:mainfrom
fallintoplace:fix/url-encoder-high-bit-bytes

Conversation

@fallintoplace

Copy link
Copy Markdown
Contributor

Summary

  • Treat URL encoder input bytes as unsigned before splitting them into hex nibbles.
  • Add UTF-8 regression coverage for café encoding to caf%C3%A9.

Why

On platforms where char is signed, bytes >= 0x80 can become negative before the hex lookup. That can lead to undefined behavior or incorrect percent encoding for non-ASCII strings.

Testing

  • cmake -S . -B build -G Ninja -DICEBERG_BUILD_BUNDLE=OFF -DICEBERG_BUILD_REST=OFF -DICEBERG_BUILD_HIVE=OFF -DICEBERG_BUILD_SQL_CATALOG=OFF
  • cmake --build build --target util_test
  • ctest --test-dir build -R util_test --output-on-failure

@zhjwpku zhjwpku left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, for non-ASCII input bytes, kHexChars[c >> 4] can lead to out-of-bounds access and undefined behavior. +1 for this fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants