Passing a Julia variable to a callback

I have a test in HDF5.jl that fails randomly. Upstream HDF5 started testing HDF5.jl and noticed the same random failure, though at times it seems quite consistent.

The failing tests involve iteration, where the HDF5 C library calls a user function provided as a C function pointer. For example, the H5Aiterate call iterates through the attributes of an object.

https://docs.hdfgroup.org/hdf5/v1_12/group___h5_a.html#ga9315a22b60468b6e996559b1b8a77251

To make this work in Julia for HDF5.jl we create a C function pointer to a helper function.

The helper function can use and return some user data through the final argument.

For this user data, we store a tuple consisting of a Julia function and a Ref{Any} that holds any error caught while running the Julia user function.
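To make the scheme concrete, here is a minimal sketch of the pattern that runs without HDF5. All names (iterate_helper, fake_c_iterate, iterate_with_callback) are hypothetical, and fake_c_iterate is only a stand-in for the C library's H5Aiterate, so this is not the actual HDF5.jl code. It assumes the user function returns 0 to continue iterating.

```julia
# Hypothetical helper: the function whose C pointer is handed to the library.
# `data` is the user data tuple (Julia function, Ref{Any} for a caught error).
function iterate_helper(index::Cint, data)::Cint
    f, err_ref = data
    try
        return Cint(f(index))
    catch err
        err_ref[] = err      # stash the error for the original caller
        return Cint(-1)      # negative status tells the iterator to stop
    end
end

# Stand-in for the C library (assumption): calls the pointer once per index
# and stops at the first nonzero status.
function fake_c_iterate(fptr::Ptr{Cvoid}, n::Integer, data)
    for i in 0:n-1
        status = ccall(fptr, Cint, (Cint, Any), i, data)
        status != 0 && return status
    end
    return Cint(0)
end

function iterate_with_callback(f, n::Integer)
    fptr = @cfunction(iterate_helper, Cint, (Cint, Any))
    data = (f, Ref{Any}(nothing))
    # GC.@preserve is an extra precaution in this sketch; `data` is also
    # referenced again below, after the call returns.
    status = GC.@preserve data fake_c_iterate(fptr, n, data)
    err = data[2][]
    err === nothing || throw(err)   # rethrow what the user function threw
    status < 0 && error("iteration failed with status $status")
    return status
end
```

With this shape, the failure described in this thread corresponds to iterate_with_callback returning normally even though the user function threw.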

The random nature of the bug makes me think the issue is related to garbage collection. I wonder whether this tuple, which is created in the function that initially calls the C function H5Aiterate and then passed back to the Julia helper function, survives garbage collection until control returns to that initial calling function.

The tests that are failing involve throwing an error in the Julia user function. This error should then be caught and returned to the function that initially called H5Aiterate. The test occasionally fails because the error is not returned.

Is it possible that the tuple or its contents, after being passed to the HDF5 C library and then back to a Julia function, gets garbage collected before control returns to the initial function?

I’m considering replacing the current scheme with one where I explicitly allocate and free the memory involved and perhaps use serialization for the Julia components involved.

It shouldn’t be. Passing it as the Any argument in the initial h5a_iterate function call should root the object until h5a_iterate returns.


If this is not a Julia GC bug, perhaps there is a problem in the HDF5 C library?

h5a_iterate should throw an error if H5Aiterate2 returns a negative value.

Apparently, that is not happening on some occasions.

Does that mean the C library must be losing the return value somehow?

The failing test is here:

I think @assert is not a good way to throw an error for testing. However, at the moment this should not be causing an issue because there are no optimization levels which currently elide asserts.

Does anyone have any thoughts on how to collect information about the bug, should I find a way to occasionally reproduce it?

The random nature of the bug makes me think the issue is related to garbage collection.

Maybe, but it doesn’t have the GC smell to it. Normally GC bugs are a lot louder and involve segmentation faults.

I would recommend building an assert + debug build of Julia and HDF5 and, on Linux, running the test under rr record -h until you catch the error in rr.

You could place a ccall to jl_breakpoint to have an easy way of finding when the function is called.

The question I would seek to answer is: was the Julia function called in the first place?
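As a rough sketch of both suggestions, a user function could bump a counter and call jl_breakpoint before throwing. The names CALLBACK_CALLS and traced_callback are hypothetical and only for illustration. jl_breakpoint does nothing by itself, so outside a debugger this only records that the function ran.

```julia
# Hypothetical counter: records how many times the user function actually ran.
const CALLBACK_CALLS = Ref(0)

function traced_callback(index)
    CALLBACK_CALLS[] += 1
    # Under gdb or an rr replay, `break jl_breakpoint` stops here on every call.
    ccall(:jl_breakpoint, Cvoid, (Any,), index)
    error("expected failure from the user function")
end
```

If the counter is still zero after a failing run, the C library never reached the Julia function. If it is nonzero, the error was thrown and then lost somewhere on the way back.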

Since all the manipulation of Julia objects is done in Julia, you shouldn’t encounter a missing write barrier.

Oh, and lastly, to see whether it is a GC error, disable the GC during the test and check whether it still reproduces on occasion.
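One way to do that, as a minimal sketch: wrap the test body in a small helper built on GC.enable, which returns the previous GC state. The helper name with_gc_disabled is hypothetical.

```julia
# Hypothetical helper: run `f` with the garbage collector switched off,
# then restore whatever state the GC was in before.
function with_gc_disabled(f)
    previous = GC.enable(false)   # returns the previous GC state
    try
        return f()
    finally
        GC.enable(previous)       # restore even if `f` throws
    end
end
```

If the test still fails intermittently inside with_gc_disabled, a collection during the callback is unlikely to be the cause.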
