Don't understand what h5py people are thinking

I was using h5py to read in some files, casually.

h5f = h5py.File(..., "r")
h5f['kVals'].value

then I see this deprecation warning:

dataset.value has been deprecated. Use dataset[()] instead.

I guess if you look really closely you would see what they mean, but the two syntaxes I immediately tried were h5f.dataset[('kVals')] and h5py.dataset(...). It turns out the correct one is:

h5f['kVals'][()]

I can't wrap my head around how that is supposed to be better than .value.

Based on the string syntax, you appear to be writing Python code. Is there a question on the Julia side here?

Nope, the takeaway is that I wish I were writing Julia.

I guess this is too off-topic even for off-topic?

No, it's ok, I was just seeing if there was a question to be answered :slight_smile:

I'm also baffled by this change.

I tracked down an issue about it: Deprecate the damn .value property · Issue #209 · h5py/h5py · GitHub

It sounds like they wanted to make reading the whole dataset less convenient so new users would use slicing more often. Maybe someone can extract a lesson from this?

Every other email I see with new h5py users somebody's recommending use of dataset.value, which is horrible because it dumps the entire dataset to an array. Then people complain that h5py is slow. We can't get rid of this for backwards compatibility but I'm removing it from the documentation and having it raise a warning.

Perhaps I'm somewhat emotional over this because I see posts from people who stumble upon it and don't realize that datasets support slicing operations. 'dataset.value' is exactly equivalent to 'dataset[...]'; but people do things like 'dataset.value[10:20]' and don't understand why it takes forever (or takes 8GB of memory and hangs Python).

read_direct is a little different because you supply an existing array which h5py "fills in" with the requested data; with both dataset.value and dataset[...], h5py creates a brand new array and returns it.

I resisted removing this for 2.0 because of backwards compatibility concerns.

See, that's stupid, and one can probably blame Python for not being able to provide 'multi-dispatch', because it looks like [10:20] would read the whole array and then slice, whereas a lazy_array would easily solve the issue. People have done this in Python: https://uproot.readthedocs.io/en/latest/opening-files.html#uproot-lazyarray-and-lazyarrays
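To make the difference between .value[10:20] and a direct [10:20] concrete, here is a toy sketch in plain Python. LazyDataset and elements_read are made-up names for illustration, and this is not how h5py is implemented; it only mimics the behaviour described in the quoted issue, where slicing the dataset itself touches just the requested range.

```python
class LazyDataset:
    """Toy stand-in for an on-disk dataset (hypothetical, not h5py's code).

    It counts how many elements were "read" so the cost is visible.
    """

    def __init__(self, stored):
        self._stored = list(stored)  # pretend this lives on disk
        self.elements_read = 0

    def _read(self, start, stop):
        # Simulate a partial read: only the requested range is touched.
        chunk = self._stored[start:stop]
        self.elements_read += len(chunk)
        return chunk

    @property
    def value(self):
        # Eager: materializes everything, like the deprecated .value
        return self._read(0, len(self._stored))

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(len(self._stored))
            if step != 1:
                raise NotImplementedError("toy example: step must be 1")
            return self._read(start, stop)
        if isinstance(key, tuple) and len(key) == 0:
            # ds[()] means "everything"
            return self._read(0, len(self._stored))
        raise TypeError("toy example: only slices and () are supported")


eager = LazyDataset(range(1_000_000))
a = eager.value[10:20]      # reads all 1,000,000 elements, then slices

lazy = LazyDataset(range(1_000_000))
b = lazy[10:20]             # reads only the 10 requested elements

print(eager.elements_read, lazy.elements_read)
```

Both a and b hold the same ten numbers; the only difference is how much had to be read to get them.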

Also, I feel like you can't even easily slice the dataset because h5py.File isn't a subscriptable object. Say I have a long list x and a long list y; I can't just tell h5py to extract the first 20 elements of each (which is a common use case). Instead, I have to materialize two datasets (x and y) and do the slicing.

I haven't used Julia's H5D packages; I hope they have a better way of doing this.

Julia's HDF5.jl supports hyperslabbing for bitstypes (reading only parts of an HDF5 dataset from disk). Example here.

(On the HDF5 side of things: The HDF Group - Information, Support, and Software)

I think the opposite is true. According to the linked issue, accessing .value reads the whole dataset, which is why it was deprecated. Instead, you are supposed to use h5f['kVals'][0:20,0:20], and it will only access the first elements in x and y, which avoids the unnecessary reads and allocations.

That makes more sense. But now if h5f['xx'] only contains a single scalar, h5f['xx'][0:] will fail, and you have to use [()].
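For what it's worth, here is a small sketch of both cases side by side, assuming h5py and NumPy are installed. The file name "demo.h5" and the dataset names "x", "y" and "xx" are made up for the example, and the file lives only in memory via the core driver.

```python
import h5py
import numpy as np

# In-memory file so nothing is written to disk; "demo.h5" is an arbitrary name.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("x", data=np.arange(100))
    f.create_dataset("y", data=np.arange(100) * 2)
    f.create_dataset("xx", data=3.5)   # scalar dataset, shape ()

    # Slice each dataset directly: only the requested elements are returned.
    first_x = f["x"][:20]
    first_y = f["y"][:20]

    # A scalar dataset has no axis to slice, so the empty tuple is the way in.
    scalar_shape = f["xx"].shape
    scalar = f["xx"][()]

    # The same empty-tuple index reads a whole array dataset.
    whole_x = f["x"][()]

print(first_x, first_y, scalar_shape, scalar, whole_x.shape)
```

So [()] is the one spelling that works for scalar and array datasets alike, which may be why the deprecation message points to it.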

I think this is pretty consistent if you know

assert x.ndim == 3
x[0, 0, 0]  # => scalar
x[0, 0, :]  # => vector
x[0, :, :]  # => matrix

and

x[0, 0, 0]
x[0, 0]
x[0]

and

x[(0, 0, 0)]
x[(0, 0)]
x[(0,)]  # one-element tuple; (0) would be just the int 0

are equivalent, line for line. Extending the last set of statements, it's pretty natural that x[()] selects the whole 3D array. If Python had x[] like Julia I think they would've used it.
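If it helps, you can see exactly what Python passes for each of these spellings with a tiny probe class. ShowKey is a made-up name; it is not part of h5py or NumPy and simply returns the key it receives.

```python
class ShowKey:
    """Hypothetical probe: returns whatever Python hands to __getitem__."""

    def __getitem__(self, key):
        return key


x = ShowKey()

print(x[0, 0, 0])    # (0, 0, 0)
print(x[(0, 0, 0)])  # (0, 0, 0)  same tuple, the comma form is just sugar
print(x[0, 0])       # (0, 0)
print(x[(0, 0)])     # (0, 0)
print(x[0])          # 0          a plain int, not a tuple
print(x[(0,)])       # (0,)       the one-element tuple needs a trailing comma
print(x[()])         # ()         the empty tuple: "no indices given"
print(x[0, :, :])    # (0, slice(None, None, None), slice(None, None, None))
```

So x[()] is the zero-index end of the same sequence of tuples, and x[] is simply not valid Python syntax.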

By the way, Julia doesn't have an equivalence like x[0, :, :] == x[0], so you need to explicitly specify the exact number of :s. Although this leads to stricter code (which I like), as with many API decisions, it comes with a trade-off. It would be nice to have a syntax that avoids hard-coding repeated : (see https://github.com/JuliaLang/julia/issues/5405). An interface like Array(::HDF5Dataset) or collect(::HDF5Dataset) is more idiomatic in Julia for "materializing" a dataset as an array. But I think we need a solution to #5405 for a notation as "handy" as x[()], e.g., x[...].
