difference in check_dtype for vlen compared to h5py #228

@kmuehlbauer

pyfive.check_dtype(vlen=var.dtype) and h5py.check_dtype(vlen=var.dtype) return different results. This breaks downstream in xarray when using engine="h5netcdf" with the pyfive backend.

In xarray, check_dtype is used to detect vlen strings, which are then decoded from object to U. With the pyfive backend that decoding does not happen and the variable stays dtype=object. See below.
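For context, here is a minimal NumPy-only sketch of the kind of lookup that check_dtype(vlen=...) is expected to perform. The helper names vlen_base_type and decodes_to_unicode are hypothetical, and the identity test against str is an assumption about the downstream check, not a copy of the xarray or h5py code.

```python
import numpy as np


def vlen_base_type(dt):
    """Hypothetical stand-in for check_dtype(vlen=dt).

    Returns the base type stored under the 'vlen' metadata key,
    or None if the dtype carries no such metadata.
    """
    metadata = dt.metadata  # None for dtypes without metadata
    if metadata is None:
        return None
    return metadata.get("vlen")


def decodes_to_unicode(dt):
    # Assumed shape of the downstream check: an identity test against str.
    return vlen_base_type(dt) is str


# Object dtype tagged like var.dtype.metadata in this report.
vlen_str = np.dtype("O", metadata={"vlen": str})
# Plain object dtype without any vlen metadata.
plain_obj = np.dtype("O")

print(decodes_to_unicode(vlen_str))
print(decodes_to_unicode(plain_obj))
```

Under these assumptions only a return value that is exactly str triggers the decoding, so any other object, even one that describes a string type, leaves the data as object.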

MCVE (latest versions of pyfive, h5netcdf and xarray; h5py=13.5.1, hdf5=1.14.6):

import h5py
import pyfive
import xarray as xr
import os

input_string = ["foó", "bár", "baź"]
original = xr.Dataset({"x": input_string})

kwargs = dict(encoding={"x": {"dtype": str}})
fname = "test.nc"
original.to_netcdf(fname, engine="h5netcdf", **kwargs)

print("----- PYFIVE --------------------")
with pyfive.File("test.nc") as fh:
    var = fh["x"]
    print(pyfive.check_dtype(vlen=var.dtype))
    print(var.dtype.metadata)
    print(fh["x"][...])

print("\n----- H5PY --------------------")
with h5py.File("test.nc") as fh:
    var = fh["x"]
    print(h5py.check_dtype(vlen=var.dtype))
    print(var.dtype.metadata)
    print(fh["x"][...])


backend = "h5py"
os.environ["H5NETCDF_READ_BACKEND"] = backend
print(f"\n----- xarray - h5netcdf - {backend} --------------------")
with xr.open_dataset("test.nc", engine="h5netcdf") as ds:
    print(ds["x"])

backend = "pyfive"
os.environ["H5NETCDF_READ_BACKEND"] = backend
print(f"\n----- xarray - h5netcdf - {backend} --------------------")
with xr.open_dataset("test.nc", engine="h5netcdf") as ds:
    print(ds["x"])

Output:

----- PYFIVE --------------------
string_info(encoding='ascii', length=None)
{'vlen': <class 'str'>}
['foó' 'bár' 'baź']

----- H5PY --------------------
<class 'str'>
{'vlen': <class 'str'>}
[b'fo\xc3\xb3' b'b\xc3\xa1r' b'ba\xc5\xba']

----- xarray - h5netcdf - h5py --------------------
<xarray.DataArray 'x' (x: 3)> Size: 36B
array(['foó', 'bár', 'baź'], dtype='<U3')
Coordinates:
  * x        (x) <U3 36B 'foó' 'bár' 'baź'

----- xarray - h5netcdf - pyfive --------------------
<xarray.DataArray 'x' (x: 3)> Size: 24B
array(['foó', 'bár', 'baź'], dtype=object)
Coordinates:
  * x        (x) object 24B 'foó' 'bár' 'baź'
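The two return values printed above can be compared directly. The sketch below uses a local namedtuple called string_info that only mirrors the printed pyfive result (it is not imported from pyfive or h5py), and is_vlen_str is a hypothetical stand-in for the downstream check, assumed to be an identity test against str.

```python
from collections import namedtuple

# Local stand-in mirroring the value printed in the PYFIVE section above.
string_info = namedtuple("string_info", ["encoding", "length"])

pyfive_result = string_info(encoding="ascii", length=None)
h5py_result = str


def is_vlen_str(result):
    # Assumed downstream check: only the str type itself counts as a vlen string.
    return result is str


print(is_vlen_str(h5py_result))
print(is_vlen_str(pyfive_result))
```

The value pyfive returns looks like the kind of result h5py gives from check_string_dtype rather than from check_dtype(vlen=...), which would explain why the variable is left as object.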
