fix: support non-UTF-8 encodings in eval data loading #4100
Open
CodeForgeNet wants to merge 1 commit into microsoft:main from
Fixes microsoft#3670

pd.read_json defaulted to UTF-8 only. Files encoded with utf-8-sig (BOM) raised ValueError: Expected object or value.

- Added _detect_encoding() BOM detection in load_data.py, _evaluate.py, _utils.py
- Added fallback encoding chain: utf-8, utf-8-sig, latin-1, cp1252
- Improved error messages to show which encodings were attempted
- Added test case and utf-8-sig encoded test data file
Author
@microsoft-github-policy-service agree
Fixes #3670
The eval SDK only read JSONL files as UTF-8. If your data had a BOM (utf-8-sig), which is common for multilingual content generated on Windows, it failed immediately with ValueError: Expected object or value. Not helpful.

The fix adds BOM detection before reading and a fallback chain (utf-8 → utf-8-sig → latin-1 → cp1252) so the loader handles real-world files without requiring users to re-encode their data.
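For reviewers who want the shape of the approach without opening the diff, here is a minimal sketch of BOM detection followed by a fallback chain. It is not the code from this PR: detect_bom_encoding and load_jsonl_with_fallback are hypothetical names, and it parses with the standard-library json module rather than pd.read_json so that it runs without pandas. Only the encoding order is taken from the description above.

```python
import codecs
import json

# Fallback order as listed in the PR description.
FALLBACK_ENCODINGS = ("utf-8", "utf-8-sig", "latin-1", "cp1252")


def detect_bom_encoding(path):
    """Return 'utf-8-sig' if the file starts with a UTF-8 BOM, else None."""
    with open(path, "rb") as f:
        head = f.read(len(codecs.BOM_UTF8))
    return "utf-8-sig" if head == codecs.BOM_UTF8 else None


def load_jsonl_with_fallback(path):
    """Hypothetical helper: parse a JSONL file, trying several encodings in turn."""
    detected = detect_bom_encoding(path)
    # Try the detected encoding first, then the rest of the chain.
    candidates = [detected] if detected else []
    candidates += [enc for enc in FALLBACK_ENCODINGS if enc != detected]

    attempted = []
    for enc in candidates:
        attempted.append(enc)
        try:
            with open(path, encoding=enc) as f:
                return [json.loads(line) for line in f if line.strip()]
        except ValueError:
            # UnicodeDecodeError and json.JSONDecodeError both subclass ValueError.
            continue
    raise ValueError(
        f"Could not load {path}; tried encodings: {', '.join(attempted)}"
    )
```

One thing worth noting about this ordering: latin-1 can decode any byte sequence, so in a chain like this cp1252 is only reached when the latin-1 decode succeeds but parsing still fails.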
Three files touched:
- promptflow/_utils/load_data.py: _pd_read_file() now detects encoding before calling pd.read_json() on .jsonl files
- evaluate/_evaluate.py: _validate_and_load_data() gets the same treatment
- evaluate/_utils.py: load_jsonl() updated with BOM detection + fallback

Added a utf-8-sig encoded test file with multilingual content and a unit test that would have caught this from the start.
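A rough stand-in for that kind of regression test is sketched below. It is not the test added in this PR: the sample rows, write_bom_jsonl, and test_bom_file_round_trips are made up for illustration, and it checks the standard-library json module instead of the SDK loaders. It shows the underlying behaviour: a BOM file decoded as plain utf-8 keeps a leading U+FEFF that the parser rejects, while utf-8-sig strips it.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample rows; the PR's actual test data file is not shown here.
ROWS = [{"question": "¿Qué es la IA?"}, {"question": "人工知能とは？"}]


def write_bom_jsonl(path, rows):
    """Write rows as JSONL with a leading UTF-8 BOM (encoding='utf-8-sig')."""
    with open(path, "w", encoding="utf-8-sig") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def test_bom_file_round_trips():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "bom.jsonl"
        write_bom_jsonl(path, ROWS)

        # Decoded as plain utf-8, the BOM survives as U+FEFF and json rejects it.
        first_line = path.read_text(encoding="utf-8").split("\n")[0]
        try:
            json.loads(first_line)
            plain_utf8_ok = True
        except ValueError:
            plain_utf8_ok = False
        assert not plain_utf8_ok

        # Decoded as utf-8-sig, the BOM is stripped and every row parses.
        with open(path, encoding="utf-8-sig") as f:
            loaded = [json.loads(line) for line in f if line.strip()]
        assert loaded == ROWS
```

Generating the file inside the test keeps the BOM explicit, which avoids an editor or tool silently stripping it from a checked-in fixture.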
Checklist