fix: add UTF-16 encoding detection and conversion to prevent assertion failures #4347
gaborbernat wants to merge 5 commits into universal-ctags:master
…n failures

Universal Ctags crashed with assertion failure in vStringPutImpl() when encountering files with UTF-16 encoding. The assertion `c >= 0 && c <= 0xff` failed because ctags expected all characters to fit within single byte range, but UTF-16 files contain multi-byte sequences that violate this assumption.

This fix adds:
- Detection of UTF-16 BOM (both LE and BE) in file reading
- Automatic conversion from UTF-16 to UTF-8 using iconv when UTF-16 is detected
- Force memory stream processing for UTF-16 files to enable conversion
- Test cases for both UTF-16 LE and BE files

Resolves issue universal-ctags#4342

Signed-off-by: Bernát Gábor <bgabor8@bloomberg.net>
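The BOM check described in this commit message can be sketched as follows. This is a minimal illustration, not the code from the PR: the enum and the function name `detect_utf16_bom` are hypothetical, and the only facts taken from the conversation are the two byte patterns, FF FE for little endian and FE FF for big endian.

```c
#include <stddef.h>

/* Hypothetical names for illustration; not the identifiers used in the PR. */
typedef enum { BOM_NONE, BOM_UTF16_LE, BOM_UTF16_BE } Utf16Bom;

/* Inspect the first two bytes of a buffer for a UTF-16 byte order mark. */
Utf16Bom detect_utf16_bom(const unsigned char *buf, size_t len)
{
    if (len < 2)
        return BOM_NONE;            /* too short to hold a BOM */
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;        /* FF FE: little endian */
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;        /* FE FF: big endian */
    return BOM_NONE;
}
```

One caveat of a two-byte check like this: the UTF-32 LE BOM (FF FE 00 00) also starts with FF FE, so a stricter detector might look at four bytes before deciding.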
55764aa to c041872
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage Diff

| | master | #4347 | +/- |
|---|---|---|---|
| Coverage | 85.87% | 85.89% | +0.01% |
| Files | 252 | 252 | |
| Lines | 62597 | 62689 | +92 |
| Hits | 53755 | 53845 | +90 |
| Misses | 8842 | 8844 | +2 |
Enables test execution for existing UTF-16 test files by adding the required args.ctags configuration file. This ensures the UTF-16 LE and UTF-16 BE files are processed during test runs, improving code coverage for the UTF-16 to UTF-8 conversion functionality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Adds test for UTF-16 conversion failure path using malformed UTF-16 data with invalid surrogate sequences. This triggers the iconv() failure path and tests the fallback mechanism that preserves original data when UTF-16 to UTF-8 conversion fails. This ensures 100% coverage of the UTF-16 conversion error handling code including the eFree(converted_data) cleanup logic.

Signed-off-by: Bernát Gábor <bgabor8@bloomberg.net>
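To make the failure path concrete, here is a hedged sketch of a UTF-16 to UTF-8 conversion with POSIX iconv that frees its partial output and returns NULL on error, so the caller can keep the original bytes. It is not the code from this PR: the function name `utf16_to_utf8` and its parameters are hypothetical, and it uses plain malloc/free where ctags uses its own helpers such as eFree.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert UTF-16 bytes (BOM already skipped) to a
 * NUL-terminated UTF-8 string. Returns a malloc'd buffer, or NULL when the
 * conversion fails, in which case the caller keeps using the original data. */
char *utf16_to_utf8(const char *in, size_t in_len, int big_endian,
                    size_t *out_len)
{
    iconv_t cd = iconv_open("UTF-8", big_endian ? "UTF-16BE" : "UTF-16LE");
    if (cd == (iconv_t)-1)
        return NULL;

    /* A 2-byte UTF-16 code unit yields at most 3 UTF-8 bytes, so twice the
     * input size (plus a terminator) is always enough. */
    size_t cap = in_len * 2 + 1;
    char *out = malloc(cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *src = (char *)in;          /* iconv() takes a non-const char ** */
    size_t src_left = in_len;
    char *dst = out;
    size_t dst_left = cap - 1;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1) {          /* invalid or truncated input */
        free(out);                   /* clean up the partial output */
        return NULL;
    }

    *dst = '\0';
    if (out_len != NULL)
        *out_len = (size_t)(dst - out);
    return out;
}
```

A caller would treat NULL as "leave the buffer untouched" and continue with the original data, which matches the fallback behaviour the commit message describes.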
Adds specific test for UTF-16 Big Endian BOM detection (FE FF) to ensure complete coverage of line 899: (bom[0] == 0xFE && bom[1] == 0xFF). This test completes 100% coverage of all UTF-16 BOM detection paths including both LE (FF FE) and BE (FE FF) byte order markers.

Signed-off-by: Bernát Gábor <bgabor8@bloomberg.net>
@masatake any updates on this?
Sorry for the late response. I will work on this request next.
Ideally you can just review and accept this PR. Is anything wrong with the solution in it? 🤔
The change to getMioFull() is excellent. This change requires a new section like:
I need time to think about the new test cases.
Added.
I think this, while slightly related to the topic, is ultimately an orthogonal concern and should not block this PR, which can and should stand on its own.
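Universal Ctags crashed with assertion failure in `vStringPutImpl()` when encountering files with UTF-16 encoding. The assertion `c >= 0 && c <= 0xff` failed because ctags expected all characters to fit within single byte range, but UTF-16 files contain multi-byte sequences that violate this assumption.

This fix adds:
- Detection of UTF-16 BOM (both LE and BE) in file reading
- Automatic conversion from UTF-16 to UTF-8 using iconv when UTF-16 is detected
- Force memory stream processing for UTF-16 files to enable conversion
- Test cases for both UTF-16 LE and BE files

Resolves issue #4342

Signed-off-by: Bernát Gábor <bgabor8@bloomberg.net>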