I'm seeing an issue reading Arrow output on Windows, and I think it is happening in shellfs. Claude speculates that a 0-byte read from the pipe buffer triggers the termination code early (a sketch of what that would look like is below), but I haven't confirmed that at all, so treat it as potentially wild LLM speculation.
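To make the speculation concrete, here is a minimal Python sketch of that failure mode: a reader loop that treats a transient 0-byte read as end-of-stream truncates the data even though more bytes are still coming. This is purely illustrative, assumes nothing about the actual shellfs implementation, and all names in it are hypothetical.

```python
import io

def read_all_naive(stream, chunk_size=4096):
    """Hypothetical reader that stops on the first 0-byte read.
    If the underlying pipe can return 0 bytes while the writer is
    still producing data (as speculated for shellfs on Windows),
    this truncates the stream."""
    out = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:          # 0 bytes treated as EOF, possibly premature
            break
        out.extend(chunk)
    return bytes(out)

class FlakyPipe(io.RawIOBase):
    """Simulated pipe that returns one empty read partway through,
    even though more data is available afterwards."""
    def __init__(self, payload):
        self.payload = payload
        self.pos = 0
        self.hiccup_done = False
    def read(self, n=-1):
        if self.pos >= len(self.payload):
            return b""
        if self.pos > 0 and not self.hiccup_done:
            self.hiccup_done = True
            return b""         # transient 0-byte read, not real EOF
        chunk = self.payload[self.pos:self.pos + 1024]
        self.pos += len(chunk)
        return chunk

payload = b"x" * 2000          # big enough that the hiccup lands mid-stream
got = read_all_naive(FlakyPipe(payload))
print(len(got), "of", len(payload), "bytes read")  # truncated output
```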
Behavior:
read_arrow fails with:

```
Invalid Input Error: arrow_scan: get_next failed(): {"exception_type":"Serialization","exception_message":"not enough data in file to deserialize result"}
```

when the total Arrow output being piped is > ~800 bytes.
Reproduction DuckDB script, run with `./duckdb -f shellfs_bug_repro.sql`.

Contents of `shellfs_bug_repro.sql`:

```sql
-- Install required extensions
INSTALL shellfs FROM community;
INSTALL nanoarrow FROM community;
LOAD shellfs;
LOAD nanoarrow;
.print '======================================================================'
.print 'shellfs Arrow IPC Pipe Bug Reproduction (Windows)'
.print '======================================================================'
.print ''
-- Test 1: Baseline - reading from file (should always work)
.print 'Test 1: Reading Arrow IPC from file (baseline)'
.print '----------------------------------------------------------------------'
.shell uv run arrow_gen.py > test.arrow
SELECT COUNT(*) as rows_from_file, 'OK' as status
FROM read_arrow('test.arrow');
.print ''
.print 'Test 2: Reading Arrow IPC from pipe via shellfs'
.print '----------------------------------------------------------------------'
.print 'On Windows, this FAILS with:'
.print ' Invalid Input Error: arrow_scan: get_next failed():'
.print ' {"exception_type":"Serialization",'
.print ' "exception_message":"not enough data in file to deserialize result"}'
.print ''
.print 'Attempting pipe read...'
.print ''
-- Test 2: Reading from pipe (should fail on Windows)
-- The trailing | tells shellfs to execute the command and read stdout
SELECT COUNT(*) as rows_from_pipe, 'OK' as status
FROM read_arrow('uv run arrow_gen.py |');
-- Cleanup
.shell rm test.arrow 2>/dev/null || del test.arrow 2>nul
```

The script calls this Python generator (`arrow_gen.py`):

```python
#!/usr/bin/env -S uv run
# /// script
# requires-python = ">=3.10"
# dependencies = ["pyarrow"]
# ///
import sys
import pyarrow as pa
# Generate table with enough data to trigger the bug (>800 bytes)
# Simple table with 100 rows is sufficient
data = {
"id": list(range(100)),
"value_a": [f"row_{i}_col_a" for i in range(100)],
"value_b": [f"row_{i}_col_b" for i in range(100)],
}
table = pa.table(data)
# Write Arrow IPC stream to stdout
sys.stdout.flush()
with pa.ipc.new_stream(sys.stdout.buffer, table.schema) as writer:
    writer.write_table(table)
sys.stdout.buffer.flush()
```
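As a sanity check on the ~800-byte threshold mentioned above, a small Python snippet (my addition, not part of the original repro) can measure the size of the IPC stream the generator produces, and confirm that the same bytes parse fine when captured from the generator's stdout outside DuckDB, which would point at the shellfs read path rather than the Arrow data:

```python
# Hypothetical sanity check, not part of the repro: measure the IPC stream
# size and verify the generator's output parses with pyarrow directly.
import subprocess
import pyarrow as pa
import pyarrow.ipc as ipc

# Serialize the same table the generator produces and report its size
# (expected to exceed the ~800-byte threshold mentioned in the report).
data = {
    "id": list(range(100)),
    "value_a": [f"row_{i}_col_a" for i in range(100)],
    "value_b": [f"row_{i}_col_b" for i in range(100)],
}
table = pa.table(data)
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
print("IPC stream size:", sink.getvalue().size, "bytes")

# Capture the generator's stdout and parse it with pyarrow; if this
# round-trips cleanly, the stream itself is not the problem.
proc = subprocess.run(["uv", "run", "arrow_gen.py"], capture_output=True, check=True)
roundtrip = ipc.open_stream(pa.BufferReader(proc.stdout)).read_all()
print("rows read back:", roundtrip.num_rows)
```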