In HDF5.jl, I have a randomly failing test. Upstream HDF5 started testing HDF5.jl and noticed that the test also fails randomly for them, though at times the failure seems quite consistent.
The failing tests involve iteration, where the HDF5 C library calls back into a user function supplied as a C function pointer. For example, H5Aiterate iterates through the attributes of an object.
To make this work in Julia for HDF5.jl, we create a C function pointer to a helper function.
The helper function can access and pass back user data through its final argument.
For this user data, we store a tuple consisting of the Julia user function and a Ref{Any} that holds any error caught while running that function.
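To make the question concrete, here is a minimal sketch of the scheme (the names, type definitions, and exact call signature are illustrative simplifications, not the actual HDF5.jl internals):

```julia
# Simplified stand-ins for the HDF5 C types (not the HDF5.jl definitions).
const herr_t = Cint
const hid_t = Int64

# Helper invoked by the C library once per attribute. The final `Any`
# argument is the (user_function, error_ref) tuple forwarded by H5Aiterate.
function attr_iterate_helper(loc_id::hid_t, attr_name::Ptr{Cchar},
                             ainfo::Ptr{Cvoid}, data::Any)::herr_t
    f, err_ref = data
    try
        # The user function is expected to return an integer status.
        return herr_t(f(loc_id, unsafe_string(attr_name)))
    catch err
        err_ref[] = err    # stash the error for the original caller
        return herr_t(-1)  # negative return tells the C library to stop
    end
end

function iterate_attributes(f, obj_id::hid_t)
    err_ref = Ref{Any}(nothing)
    fptr = @cfunction(attr_iterate_helper, herr_t,
                      (hid_t, Ptr{Cchar}, Ptr{Cvoid}, Any))
    # H5Aiterate2 receives the tuple through its `void *op_data` parameter
    # and hands it back to the helper on every invocation.
    status = ccall((:H5Aiterate2, "libhdf5"), herr_t,
                   (hid_t, Cint, Cint, Ptr{UInt64}, Ptr{Cvoid}, Any),
                   obj_id, 0, 0, Ref(UInt64(0)), fptr, (f, err_ref))
    err_ref[] === nothing || throw(err_ref[])
    return status
end
```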
The random nature of the bug makes me think the issue is related to garbage collection. I wonder whether this tuple, created in the initial call to the C function H5Aiterate and then passed back to the Julia helper function, survives garbage collection until control returns to the initial calling function.
The tests that are failing involve throwing an error in the Julia user function. This error should then be caught and returned to the function that initially called H5Aiterate. The tests occasionally fail because the error is never returned.
Is it possible that the tuple, or its contents, passed to the HDF5 C library and then passed back to a Julia function, gets garbage collected before control returns to the initial function?
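As far as I understand, ccall gc-roots its arguments for the duration of the call, so passing the tuple directly as an `Any` argument should keep it alive while H5Aiterate runs. One experiment to rule GC in or out would be to root everything explicitly (again a sketch, reusing the hypothetical names above):

```julia
function iterate_attributes_preserved(f, obj_id::hid_t)
    err_ref = Ref{Any}(nothing)
    data = (f, err_ref)
    fptr = @cfunction(attr_iterate_helper, herr_t,
                      (hid_t, Ptr{Cchar}, Ptr{Cvoid}, Any))
    # Belt and braces: GC.@preserve pins `data` across the whole C call,
    # guarding against any refactor where the tuple stops being a direct
    # ccall argument. If the failures vanish with this, rooting was the bug.
    status = GC.@preserve data begin
        ccall((:H5Aiterate2, "libhdf5"), herr_t,
              (hid_t, Cint, Cint, Ptr{UInt64}, Ptr{Cvoid}, Any),
              obj_id, 0, 0, Ref(UInt64(0)), fptr, data)
    end
    err_ref[] === nothing || throw(err_ref[])
    return status
end
```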
I’m considering replacing the current scheme with one where I explicitly allocate and free the memory involved, perhaps using serialization for the Julia components.
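Before going as far as manual allocation plus serialization, a middle ground (sketched here with hypothetical names; this is not existing HDF5.jl code, and it is a different technique from malloc/free) would be to root the payload in a global table and hand the C library only an opaque integer handle:

```julia
const PAYLOADS = Dict{UInt,Any}()      # roots payloads against the GC
const PAYLOAD_LOCK = ReentrantLock()   # the table is global mutable state
const NEXT_HANDLE = Ref(UInt(0))

register_payload(payload) = lock(PAYLOAD_LOCK) do
    h = (NEXT_HANDLE[] += 1)
    PAYLOADS[h] = payload
    h
end

unregister_payload(h) = lock(() -> delete!(PAYLOADS, h), PAYLOAD_LOCK)

# At the call site:
#   handle = register_payload((f, err_ref))
#   try
#       # ccall as before, but pass Ptr{Cvoid}(handle) as op_data
#   finally
#       unregister_payload(handle)  # explicit free, independent of the GC
#   end
# and in the helper, recover the payload with:
#   f, err_ref = PAYLOADS[UInt(data_ptr)]
```

Since nothing Julia-owned crosses the C boundary in that scheme, no object lifetime depends on the GC at all.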
I think @assert is not a good way to throw an error for testing. However, at the moment this should not be causing an issue, because no optimization level currently elides @assert.
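For completeness, the test's user function could sidestep the question entirely by throwing explicitly, since an unconditional error path cannot depend on assertion settings (a hypothetical test callback):

```julia
# Hypothetical user function for the test: it throws explicitly instead of
# via @assert, so the error path can never be compiled away.
function checking_user_fn(loc_id, name)
    name == "expected_attr" || error("unexpected attribute: $name")
    return 0
end
```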
Does anyone have any thoughts on how to collect information about the bug if I find a way to reproduce it, even occasionally?
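One trick I might try for both reproducing and localizing GC-lifetime bugs: stress the collector inside the exact window in question, and conversely disable it entirely to see whether the failures vanish (a sketch; the user function and test body are hypothetical stand-ins):

```julia
# Force a full collection while the C library is mid-iteration; if the
# bug is GC-related, this should make it far more reproducible.
function gc_stressing_user_fn(loc_id, name)
    GC.gc()
    error("deliberate failure")
end

# Conversely, run the failing test with the collector switched off; if
# the failures disappear, that strongly implicates object lifetimes.
old = GC.enable(false)
try
    # run the failing iteration test here
finally
    GC.enable(old)
end
```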