Fix/result embedding data encoding #243
104e4cd to d2019ce
Adds preserve_bytes and binary_fields parameters to search methods to prevent UTF-8 decoding from corrupting VECTOR field embeddings and other binary data.

The Result class was inappropriately applying UTF-8 decoding to all field values, including binary vector embeddings. This corrupted FLOAT32 vector data and made valkey-py unsuitable for vector search applications.

Changes:
- Add preserve_bytes parameter to search() methods (default: False for backward compatibility)
- Add binary_fields parameter for selective field preservation
- Implement to_string_or_bytes() utility for conditional binary preservation
- Update Result class to handle binary preservation during field processing
- Add comprehensive tests for binary preservation functionality

The fix maintains full backward compatibility while enabling proper vector search support when preserve_bytes=True is specified.

Fixes vector search corruption where binary embeddings were being decoded as UTF-8 strings with 'ignore' error handling, silently dropping bytes and corrupting the vector data.

Signed-off-by: Swarnaprakash Udayakumar <swarnap@amazon.com>
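To illustrate the corruption described above, here is a minimal standalone sketch. It does not touch valkey-py; the sample values and variable names are arbitrary and chosen only for illustration. It packs a few FLOAT32 values, decodes the bytes as UTF-8 with 'ignore', and compares the result with the original buffer.

```python
import struct

# Four arbitrary FLOAT32 values, packed little-endian as a vector
# embedding would typically be stored (16 bytes in total).
vector = [0.1, -2.5, 3.75, 1e10]
raw = struct.pack("<4f", *vector)

# Decode as UTF-8 while ignoring errors, which is the behavior the
# commit message attributes to the Result class.
decoded = raw.decode("utf-8", "ignore")

# Encoding the string again does not give back the original buffer:
# every byte that was not valid UTF-8 has been silently dropped.
round_tripped = decoded.encode("utf-8")
lost_bytes = len(raw) - len(round_tripped)

print(f"original: {len(raw)} bytes, after round trip: {len(round_tripped)} bytes")
print(f"bytes lost: {lost_bytes}")
```

Once bytes are dropped, the buffer length is generally no longer a multiple of 4, so the embedding cannot even be unpacked back into floats, let alone recovered.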
d2019ce to 7abf72f
@amirreza8002 is this by any chance something you got to work with?
Also keep in mind that approving this PR is slightly more problematic because valkey-search is not yet supported in the CI, so pytest skips those tests. I'll try to see if it can be easily fixed.
@mkmkme I can make an attempt to fix the tests. First I want to clarify the expectation: is valkey-py expected to be backward compatible with Redis? Specifically, for the search module, is the expectation still backward compatibility with Redis search? I understand Redis supports different data types (TEXT instead of TAG, for example). I'm wondering whether I can just delete or modify these tests, or whether I need to keep them and make changes so that both Redis search and valkey-search are supported.
Sorry, not in my area.
Pull Request check-list
Description of change
See #242
The Result class in valkey/commands/search/result.py inappropriately applies UTF-8 decoding to all field values, including binary vector data. This corrupts VECTOR field embeddings and makes valkey-py unsuitable for vector search applications.

The change adds preserve_bytes and binary_fields parameters to search methods to prevent UTF-8 decoding from corrupting VECTOR field embeddings and other binary data.
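For readers who want a feel for the idea, the following is a rough sketch of conditional decoding driven by preserve_bytes and binary_fields. It is not the code from this PR: the helper name decode_field_sketch, its signature, and the rule that preserve_bytes=True without binary_fields keeps every value as bytes are all assumptions made for illustration.

```python
import struct
from typing import Optional, Sequence, Union


def decode_field_sketch(
    name: str,
    value: Union[bytes, str],
    preserve_bytes: bool = False,
    binary_fields: Optional[Sequence[str]] = None,
) -> Union[bytes, str]:
    """Hypothetical helper: decode a field value unless it must stay binary."""
    if not isinstance(value, bytes):
        # Already a string (or another type): nothing to decode.
        return value
    if preserve_bytes:
        # Assumption: with no binary_fields given, every field stays as bytes;
        # otherwise only the listed fields are preserved.
        if binary_fields is None or name in binary_fields:
            return value
    # Legacy behavior described in this PR: lossy UTF-8 decoding.
    return value.decode("utf-8", "ignore")


# Usage with made-up field names and values.
embedding = struct.pack("<2f", 1.5, -2.0)
kept = decode_field_sketch(
    "embedding", embedding, preserve_bytes=True, binary_fields=["embedding"]
)
title = decode_field_sketch(
    "title", b"hello", preserve_bytes=True, binary_fields=["embedding"]
)
legacy = decode_field_sketch("title", b"hello")
```

In this sketch the default path is unchanged, which mirrors the backward-compatibility goal stated in the description; only callers that opt in get raw bytes back.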