Don't understand what h5py people are thinking

jling · July 15, 2019, 1:40pm

I was using h5py to read in some files, casually.

import h5py
h5f = h5py.File(..., "r")
h5f['kVals'].value

then I see this deprecation warning:

dataset.value has been deprecated. Use dataset[()] instead.

I guess if you look really closely you would see what they mean, but the two syntaxes I immediately tried were h5f.dataset[('kVals')] and h5py.dataset(...). It turns out the correct one is:

h5f['kVals'][()]

I could not wrap my head around how that is supposed to be better than .value.

StefanKarpinski · July 15, 2019, 1:48pm

Based on the string syntax, you appear to be writing Python code. Is there a question on the Julia side here?

jling · July 15, 2019, 1:49pm

Nope, the takeaway is I wish I were writing Julia.

I guess this is too off-topic even for off-topic?

StefanKarpinski · July 15, 2019, 1:50pm

No, it’s OK. I was just checking whether there was a question to be answered.

ggggggggg · July 15, 2019, 3:26pm

I’m also baffled by this change.

I tracked down an issue about it: Deprecate the damn .value property · Issue #209 · h5py/h5py · GitHub

It sounds like they wanted to make reading the whole dataset less convenient so that new users would use slicing more often. Maybe someone can extract a lesson from this? Quoting from the issue:

Every other email I see with new h5py users somebody’s recommending use of dataset.value, which is horrible because it dumps the entire dataset to an array. Then people complain that h5py is slow. We can’t get rid of this for backwards compatibility but I’m removing it from the documentation and having it raise a warning.

Perhaps I’m somewhat emotional over this because I see posts from people who stumble upon it and don’t realize that datasets support slicing operations. ‘dataset.value’ is exactly equivalent to ‘dataset[...]’; but people do things like ‘dataset.value[10:20]’ and don’t understand why it takes forever (or takes 8GB of memory and hangs Python).

read_direct is a little different because you supply an existing array which h5py “fills in” with the requested data; with both dataset.value and dataset[...], h5py creates a brand new array and returns it.

I resisted removing this for 2.0 because of backwards compatibility concerns.
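
The cost difference described in the quoted issue can be illustrated with a toy stand-in for a dataset. This is only a sketch: ToyDataset and its elements_read counter are hypothetical names invented for illustration, not part of h5py, and a plain Python list plays the role of the file on disk.

```python
class ToyDataset:
    """Hypothetical stand-in for an on-disk dataset (not the h5py API)."""

    def __init__(self, stored):
        self._stored = list(stored)   # pretend this lives on disk
        self.elements_read = 0        # how many elements were "read from disk"

    @property
    def value(self):
        # Mimics the deprecated property: always materializes everything.
        return self[()]

    def __getitem__(self, key):
        if key == () or key is Ellipsis:
            out = list(self._stored)          # whole dataset
        elif isinstance(key, slice):
            out = self._stored[key]           # only the requested range
        else:
            raise TypeError("this toy only supports slices, ... and ()")
        self.elements_read += len(out)
        return out


ds = ToyDataset(range(1_000_000))

a = ds.value[10:20]          # reads everything, then slices in memory
cost_value = ds.elements_read

ds.elements_read = 0
b = ds[10:20]                # reads only the ten requested elements
cost_slice = ds.elements_read

print(cost_value, cost_slice)
```

Both expressions return the same ten elements; only the amount of data pulled from the backing store differs, which is the complaint in the issue.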

jling · July 15, 2019, 3:32pm

See, that’s stupid, and one can probably blame Python for not being able to provide ‘multi-dispatch’: it looks like [10:20] would read the whole array and then slice, whereas a lazy array would easily solve the issue. People have done this in Python: https://uproot.readthedocs.io/en/latest/opening-files.html#uproot-lazyarray-and-lazyarrays

Also, I feel like you can’t even easily slice the dataset because h5py.File isn’t a subscriptable object. Say I have a long list of x and a long list of y: I can’t just tell h5py to extract the first 20 elements of each (which is a common use case). Instead, I have to materialize two datasets (x and y) and then slice them.

I haven’t used Julia’s H5D packages; I hope they have a better way of doing this.

carstenbauer · July 15, 2019, 5:42pm

Julia’s HDF5.jl supports hyperslabbing for bits types (reading only parts of an HDF5 dataset from disk). Example here.

(On the HDF5 side of things: The HDF Group - Information, Support, and Software)

fabiangans · July 16, 2019, 8:53am

I think the opposite is true. According to the linked issue, accessing .value would read the whole dataset, which is why it was deprecated. Instead, you are supposed to use h5f['kVals'][0:20,0:20], and it will only access the first elements in x and y, which avoids the unnecessary reads and allocations.
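
For the use case raised earlier in the thread (the first 20 elements of both x and y), a small helper is enough. This is a sketch under assumptions: head is a hypothetical function name, and a plain dict of lists stands in for the open file so the snippet runs without any HDF5 file. It relies only on container[name] lookup followed by slicing, the same pattern as h5f['kVals'][0:20,0:20].

```python
def head(container, names, n=20):
    """Return the first n elements of each named entry.

    Only container[name][:n] is used, so nothing beyond the requested
    slice is asked for from each entry.
    """
    return {name: container[name][:n] for name in names}


# A dict of lists stands in for an open file with datasets "x" and "y".
fake_file = {"x": list(range(100)), "y": list(range(100, 200))}

first = head(fake_file, ["x", "y"])
print(len(first["x"]), first["y"][:3])
```

With h5py the call shape would be head(h5f, ['x', 'y']), assuming both datasets exist and have at least one dimension.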

jling · July 16, 2019, 8:58am

That makes more sense. But now if h5f['xx'] only contains a single scalar, h5f['xx'][0:] will fail, and you have to use [()].

tkf · July 17, 2019, 12:28am

I think this is pretty consistent if you know

assert x.ndim == 3
x[0, 0, 0]  # => scalar
x[0, 0, :]  # => vector
x[0, :, :]  # => matrix

and

x[0, 0, 0]
x[0, 0]
x[0]

and

x[(0, 0, 0)]
x[(0, 0)]
x[(0,)]

are all equivalent. Extending the last set of statements, it’s pretty natural that x[()] selects the whole 3d array. If Python had x[] like Julia I think they would’ve used it.
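
One way to see why x[()] is the natural end of that sequence is to look at what Python actually passes to __getitem__. A minimal sketch, where KeyRecorder is a hypothetical helper class (not part of h5py or NumPy) that simply returns the key it receives:

```python
class KeyRecorder:
    """Hypothetical helper: indexing returns the key object Python passes in."""

    def __getitem__(self, key):
        return key


x = KeyRecorder()

# Comma-separated indices arrive as one tuple, so these pairs are identical.
assert x[0, 0, 0] == x[(0, 0, 0)] == (0, 0, 0)
assert x[0, 0] == x[(0, 0)] == (0, 0)

# (0) is just the integer 0; the one-element tuple is written (0,).
assert x[(0)] == 0
assert x[(0,)] == (0,)

# Dropping the last index leaves the empty tuple: "no indices at all".
assert x[()] == ()

# x[] is a syntax error in Python, so () or ... has to be spelled out.
assert x[...] is Ellipsis
```

The key shrinks by one element at each step, and the empty tuple is what remains when no axis is indexed at all.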

By the way, Julia has no equivalence like x[0, :, :] == x[0]; you need to specify the exact number of :s explicitly. Although this leads to stricter code (which I like), as with many API decisions, it comes with a trade-off. It would be nice to have a syntax that avoids hard-coding repeated : (see https://github.com/JuliaLang/julia/issues/5405). An interface like Array(::HDF5Dataset) or collect(::HDF5Dataset) is more idiomatic in Julia for “materializing” a dataset as an array. But I think we need a solution to #5405 for a notation as “handy” as x[()], e.g., x[...].
