Don't understand what h5py people are thinking

I was using h5py to read in some files, casually.

h5f = h5py.File(..., "r")
h5f['kVals'].value

then I see this deprecation warning:

dataset.value has been deprecated. Use dataset[()] instead.

I guess if you look really closely you would see what they mean, but the two syntaxes I immediately tried were h5f.dataset[('kVals')] and h5py.dataset(...). It turns out the correct one is:

h5f['kVals'][()]

I could not wrap my head around how that is supposed to be better than .value

Based on the string syntax, you appear to be writing Python code. Is there a question on the Julia side here?


Nope, the takeaway is that I wish I were writing Julia.

I guess this is too off-topic even for off-topic?

No, it’s ok, I was just seeing if there was a question to be answered :)


I’m also baffled by this change.

I tracked down an issue about it: https://github.com/h5py/h5py/issues/209

It sounds like they wanted to make reading the whole dataset less convenient so new users would use slicing more often. Maybe someone can extract a lesson from this?

Every other email I see with new h5py users somebody’s recommending use of dataset.value, which is horrible because it dumps the entire dataset to an array. Then people complain that h5py is slow. We can’t get rid of this for backwards compatibility but I’m removing it from the documentation and having it raise a warning.

Perhaps I’m somewhat emotional over this because I see posts from people who stumble upon it and don’t realize that datasets support slicing operations. ‘dataset.value’ is exactly equivalent to ‘dataset[...]’; but people do things like ‘dataset.value[10:20]’ and don’t understand why it takes forever (or takes 8GB of memory and hangs Python).

read_direct is a little different because you supply an existing array which h5py “fills in” with the requested data; with both dataset.value and dataset[...], h5py creates a brand new array and returns it.
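For anyone skimming, here is a minimal sketch of the three read styles the quote describes. It assumes h5py and NumPy are installed; the file name and dataset contents are made up for illustration, and the file lives in memory only (core driver), so nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory HDF5 file; "demo.h5" and the dataset contents are made up for this sketch.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as h5f:
    h5f.create_dataset("kVals", data=np.arange(1000, dtype="f8"))
    dset = h5f["kVals"]        # a handle to the dataset; no data has been read yet

    part = dset[10:20]         # reads only these 10 elements
    whole = dset[()]           # reads the entire dataset, like the old dset.value
    also_whole = dset[...]     # same as dset[()] for a non-scalar dataset

    # read_direct fills an array you already own instead of allocating a new one.
    buf = np.empty(10, dtype="f8")
    dset.read_direct(buf, source_sel=np.s_[10:20], dest_sel=np.s_[0:10])
```

The slow pattern from the quote is whole[10:20] (or dset.value[10:20]): the full read happens first and the slice is applied to the in-memory array afterwards.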

I resisted removing this for 2.0 because of backwards compatibility concerns.

See, that’s stupid, and one can probably blame Python for not being able to provide ‘multi-dispatch’: it looks like [10:20] would read the whole array and then slice, whereas a lazy array would easily solve the issue. People have done this in Python: https://uproot.readthedocs.io/en/latest/opening-files.html#uproot-lazyarray-and-lazyarrays

Also, I feel like you can’t even easily slice the dataset, because h5py.File isn’t a subscriptable object. Say I have a long list x and a long list y: I can’t just tell h5py to extract the first 20 elements of each (which is a common use case). Instead, I have to materialize two datasets (x and y) and then slice them.

I haven’t used Julia’s HDF5 packages; I hope they have a better way of doing this.

Julia’s HDF5.jl supports hyperslabbing for bits types (reading only parts of an HDF5 dataset from disk). Example here.

(On the HDF5 side of things: https://support.hdfgroup.org/HDF5/Tutor/select.html)


I think the opposite is true. According to the linked issue, accessing .value reads the whole dataset into memory, which is why it was deprecated. Instead, you are supposed to use h5f['kVals'][0:20,0:20], which only accesses the first 20 elements along x and y and avoids the unnecessary reads and allocations.

That makes more sense. But now if h5f['xx'] only contains a single scalar, h5f['xx'][0:] will fail, and you have to use [()]
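A small sketch of the scalar case, assuming h5py and NumPy are installed. The dataset names "xx" and "yy" and their values are invented, and the file is in-memory only. The exact exception type may differ between h5py versions, so the sketch only records that slicing fails.

```python
import h5py
import numpy as np

# In-memory file; dataset names and values are made up for this sketch.
with h5py.File("scalar_demo.h5", "w", driver="core", backing_store=False) as h5f:
    h5f.create_dataset("xx", data=3.5)           # scalar dataset, shape ()
    h5f.create_dataset("yy", data=np.arange(5))  # ordinary 1-D dataset

    scalar_shape = h5f["xx"].shape
    scalar_value = h5f["xx"][()]   # works for a scalar dataset
    vector_value = h5f["yy"][()]   # works for an array dataset too

    try:
        h5f["xx"][0:]              # there is no axis to slice along
        slice_failed = False
    except Exception:
        slice_failed = True
```

So [()] is the one spelling that reads "everything" regardless of the dataset's number of dimensions.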

I think this is pretty consistent once you know that

assert x.ndim == 3
x[0, 0, 0]  # => scalar
x[0, 0, :]  # => vector
x[0, :, :]  # => matrix

and

x[0, 0, 0]
x[0, 0]
x[0]

and

x[(0, 0, 0)]
x[(0, 0)]
x[(0)]

are equivalent line by line. Extending the last set of statements, it’s pretty natural that x[()] selects the whole 3D array. If Python had x[] like Julia, I think they would’ve used it.
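To see why the comma form and the tuple form are the same thing, here is a toy class (ShowKey is a made-up name, not part of h5py or NumPy) that just returns whatever key Python passes to __getitem__:

```python
class ShowKey:
    """Toy object that returns the key Python hands to __getitem__."""

    def __getitem__(self, key):
        return key


x = ShowKey()
keys = {
    "x[0, 0, 0]": x[0, 0, 0],
    "x[(0, 0, 0)]": x[(0, 0, 0)],
    "x[0, 0]": x[0, 0],
    "x[(0, 0)]": x[(0, 0)],
    "x[0]": x[0],
    "x[(0)]": x[(0)],
    "x[()]": x[()],
    "x[...]": x[...],
}
for expr, key in keys.items():
    print(f"{expr:14} -> {key!r}")
```

Comma-separated indices arrive as a tuple, so x[0, 0] and x[(0, 0)] are literally the same call. One wrinkle: (0) is just the integer 0 and not a one-element tuple, although array libraries treat both the same way. The empty tuple () is then "zero indices given", which is why it reads as "select everything".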

By the way, you don’t have a similar equivalence like x[0, :, :] == x[0] in Julia; you need to explicitly write the exact number of :s. Although this leads to stricter code (which I like), as with many API decisions, it comes with a trade-off. It would be nice to have a syntax that avoids hard-coding repeated :s (see https://github.com/JuliaLang/julia/issues/5405). An interface like Array(::HDF5Dataset) or collect(::HDF5Dataset) is more idiomatic in Julia for “materializing” a dataset as an array. But I think we need a solution to #5405 for a notation as “handy” as x[()], e.g., x[...].
