fix: support non-UTF-8 encodings in eval data loading #4100

Open
CodeForgeNet wants to merge 1 commit into microsoft:main from CodeForgeNet:fix/eval-utf8-encoding-support

@CodeForgeNet

Fixes #3670

The eval SDK only read JSONL files as UTF-8. If your data had a BOM (utf-8-sig),
which is common for multilingual content generated on Windows, it failed immediately
with ValueError: Expected object or value, an error that says nothing about encoding.

The fix adds BOM detection before reading and a fallback chain
(utf-8 → utf-8-sig → latin-1 → cp1252) so the loader handles real-world
files without requiring users to re-encode their data.
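
A minimal sketch of the approach described above, using only the standard library. The function names here (`detect_bom_encoding`, `load_jsonl_with_fallback`) are hypothetical stand-ins, not the helpers in this PR, and the real loaders go through pd.read_json() rather than `json.loads`:

```python
import codecs
import json

# Fallback order taken from the PR description; everything else is illustrative.
FALLBACK_ENCODINGS = ("utf-8", "utf-8-sig", "latin-1", "cp1252")


def detect_bom_encoding(path):
    """Return 'utf-8-sig' if the file starts with a UTF-8 BOM, else None."""
    with open(path, "rb") as f:
        head = f.read(len(codecs.BOM_UTF8))
    return "utf-8-sig" if head == codecs.BOM_UTF8 else None


def load_jsonl_with_fallback(path):
    """Parse a JSONL file, trying the detected encoding first, then the chain."""
    detected = detect_bom_encoding(path)
    candidates = [detected] if detected else []
    candidates += [enc for enc in FALLBACK_ENCODINGS if enc != detected]

    tried = []
    for enc in candidates:
        tried.append(enc)
        try:
            with open(path, "r", encoding=enc) as f:
                return [json.loads(line) for line in f if line.strip()]
        except (UnicodeDecodeError, json.JSONDecodeError):
            # Wrong encoding or unparseable text under this encoding; try the next.
            continue

    # Report every encoding that was attempted, as the PR's error messages do.
    raise ValueError(
        f"Could not load {path}; encodings attempted: {', '.join(tried)}"
    )
```

One thing worth keeping in mind about the chain order: latin-1 maps every byte to a code point, so decoding with it never raises. In a chain like this one, cp1252 is only reached if the latin-1 text then fails to parse.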

Three files touched:

  • promptflow/_utils/load_data.py: _pd_read_file() now detects encoding
    before calling pd.read_json() on .jsonl files
  • evaluate/_evaluate.py: _validate_and_load_data() gets the same treatment
  • evaluate/_utils.py: load_jsonl() updated with BOM detection + fallback

Added a utf-8-sig encoded test file with multilingual content and a unit test
that would have caught this from the start.
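
For reviewers who want to reproduce the failure mode locally, here is a rough sketch of how a utf-8-sig fixture can be generated and checked. It is not the test added in this PR; the file name, helper names, and sample rows are made up, and it uses `json` from the standard library as a stand-in for the pandas-based loader:

```python
import codecs
import json
import os
import tempfile


def write_bom_fixture(path, rows):
    """Write rows as JSONL; encoding='utf-8-sig' prepends the BOM on write."""
    with open(path, "w", encoding="utf-8-sig") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path, encoding):
    """Read a JSONL file with one fixed encoding, no fallback."""
    with open(path, "r", encoding=encoding) as f:
        return [json.loads(line) for line in f if line.strip()]


def check_bom_fixture_roundtrip():
    """Return (plain utf-8 read failed, utf-8-sig read round-tripped)."""
    rows = [
        {"question": "你好吗?", "answer": "很好"},
        {"question": "¿Cómo estás?", "answer": "Bien"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "multilingual_bom.jsonl")
        write_bom_fixture(path, rows)

        # The fixture really does start with the three BOM bytes.
        with open(path, "rb") as f:
            assert f.read(3) == codecs.BOM_UTF8

        # Plain utf-8 decodes the BOM as U+FEFF, which the JSON parser rejects.
        try:
            read_jsonl(path, "utf-8")
        except json.JSONDecodeError:
            strict_utf8_failed = True
        else:
            strict_utf8_failed = False

        return strict_utf8_failed, read_jsonl(path, "utf-8-sig") == rows
```

Note that the BOM bytes are valid UTF-8, so the failure happens at parse time rather than decode time. That is why a decode-error fallback alone would not be enough and the BOM check has to run first.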


Checklist

  • No breaking changes
  • Read the contribution guidelines
  • New dependencies are MIT compatible
  • CHANGELOG updated
  • Test coverage included for the change

Fixes microsoft#3670

pd.read_json defaulted to UTF-8 only. Files encoded with utf-8-sig
(BOM) raised ValueError: Expected object or value.

- Added _detect_encoding() BOM detection in load_data.py, _evaluate.py, _utils.py
- Added fallback encoding chain: utf-8, utf-8-sig, latin-1, cp1252
- Improved error messages to show which encodings were attempted
- Added test case and utf-8-sig encoded test data file

@CodeForgeNet (Author)

@microsoft-github-policy-service agree


Development

Successfully merging this pull request may close these issues.

[BUG] prompt flow eval only supports UTF8 encoding
