Potential Buffer? Read Issue on Windows #8

@greenmtnboy

I'm seeing an issue reading Arrow output on Windows, and I think it is happening in shellfs. Claude speculates that a 0-byte read from the pipe buffer triggers the termination code early, but I haven't confirmed that at all, so treat it as potentially wild LLM speculation.
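To make the speculation concrete: a pipe reader can come up short if it treats one short read as the whole stream. The sketch below shows only that general pattern and says nothing about how shellfs actually reads. The names `slow_writer`, `read_exact`, and `demo` are hypothetical, and the writer thread stands in for the child process.

```python
import os
import threading
import time

def slow_writer(fd, chunks):
    """Stand-in for a child process: write chunks with pauses, then close."""
    for chunk in chunks:
        os.write(fd, chunk)
        time.sleep(0.05)
    os.close(fd)

def read_exact(fd, size):
    """Keep reading until `size` bytes arrive or the writer closes the pipe."""
    parts = []
    remaining = size
    while remaining > 0:
        part = os.read(fd, remaining)
        if not part:  # in Python, an empty read on a pipe means EOF
            break
        parts.append(part)
        remaining -= len(part)
    return b"".join(parts)

def demo(total_chunks=4, chunk_size=500):
    """Send total_chunks * chunk_size bytes through a pipe and read them all."""
    read_fd, write_fd = os.pipe()
    chunks = [b"x" * chunk_size] * total_chunks
    writer = threading.Thread(target=slow_writer, args=(write_fd, chunks))
    writer.start()
    try:
        return read_exact(read_fd, total_chunks * chunk_size)
    finally:
        writer.join()
        os.close(read_fd)
```

A single `os.read(read_fd, 2000)` in place of `read_exact` may return only the bytes written so far. If something similar happens inside the pipe reader, the consumer would see a truncated stream, which would fit the "not enough data" error, but that is still a guess.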

Behavior:

read_arrow fails with:

Invalid Input Error:
arrow_scan: get_next failed(): {"exception_type":"Serialization","exception_message":"not enough data in file to deserialize result"}

This happens when the total Arrow output is larger than roughly 800 bytes.

Reproduction DuckDB script, run with:

./duckdb -f shellfs_bug_repro.sql

Content of shellfs_bug_repro.sql:

-- Install required extensions
INSTALL shellfs FROM community;
INSTALL nanoarrow FROM community;
LOAD shellfs;
LOAD nanoarrow;

.print '======================================================================'
.print 'shellfs Arrow IPC Pipe Bug Reproduction (Windows)'
.print '======================================================================'
.print ''

-- Test 1: Baseline - reading from file (should always work)
.print 'Test 1: Reading Arrow IPC from file (baseline)'
.print '----------------------------------------------------------------------'

.shell uv run arrow_gen.py > test.arrow

SELECT COUNT(*) as rows_from_file, 'OK' as status
FROM read_arrow('test.arrow');

.print ''
.print 'Test 2: Reading Arrow IPC from pipe via shellfs'
.print '----------------------------------------------------------------------'
.print 'On Windows, this FAILS with:'
.print '  Invalid Input Error: arrow_scan: get_next failed():'
.print '  {"exception_type":"Serialization",'
.print '   "exception_message":"not enough data in file to deserialize result"}'
.print ''
.print 'Attempting pipe read...'
.print ''

-- Test 2: Reading from pipe (should fail on Windows)
-- The trailing | tells shellfs to execute the command and read stdout
SELECT COUNT(*) as rows_from_pipe, 'OK' as status
FROM read_arrow('uv run arrow_gen.py |');


-- Cleanup
.shell rm test.arrow 2>/dev/null || del test.arrow 2>nul

The SQL script calls this Python script (arrow_gen.py):

#!/usr/bin/env -S uv run
# /// script
# requires-python = ">=3.10"
# dependencies = ["pyarrow"]
# ///
import sys
import pyarrow as pa

# Generate table with enough data to trigger the bug (>800 bytes)
# Simple table with 100 rows is sufficient
data = {
    "id": list(range(100)),
    "value_a": [f"row_{i}_col_a" for i in range(100)],
    "value_b": [f"row_{i}_col_b" for i in range(100)],
}
table = pa.table(data)

# Write Arrow IPC stream to stdout
sys.stdout.flush()
with pa.ipc.new_stream(sys.stdout.buffer, table.schema) as writer:
    writer.write_table(table)
sys.stdout.buffer.flush()
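One check that might narrow this down is to count how many bytes the command delivers over a plain pipe outside DuckDB and compare that with the size of test.arrow. This is a minimal sketch using only the standard library. `pipe_output_size` is a hypothetical helper, and the demo command is a stand-in that writes 1000 bytes.

```python
import subprocess
import sys

def pipe_output_size(cmd):
    """Run `cmd` and return how many bytes it wrote to stdout."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
    return len(result.stdout)

if __name__ == "__main__":
    # Stand-in command; replace with ["uv", "run", "arrow_gen.py"] to test the real script
    demo_cmd = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'a' * 1000)"]
    print(pipe_output_size(demo_cmd))
```

If the count for `uv run arrow_gen.py` matches the size of test.arrow, the generator and the OS pipe are delivering the full stream, which would point more toward the reading side.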
