In HDF5.jl, I have a randomly failing test. Upstream HDF5 started testing HDF5.jl and noticed that the test also fails randomly for them, though at times the failure seems quite consistent.
The failing tests involve iteration, where the HDF5 C library calls back into a user function supplied as a C function pointer. For example, H5Aiterate iterates through the attributes of an object.
To make this work in Julia for HDF5.jl, we create a C function pointer to a helper function.
The helper function can access and pass back user data through its final argument.
For this user data, we store a tuple consisting of the Julia user function and a Ref{Any} that holds any error caught while running that function.
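To make the question concrete, here is a minimal sketch of the scheme (the names, type definitions, and exact call signature are illustrative simplifications, not the actual HDF5.jl internals):

```julia
# Simplified stand-ins for the HDF5 C types (not the HDF5.jl definitions).
const herr_t = Cint
const hid_t = Int64

# Helper invoked by the C library once per attribute. The final `Any`
# argument is the (user_function, error_ref) tuple forwarded by H5Aiterate.
function attr_iterate_helper(loc_id::hid_t, attr_name::Ptr{Cchar},
                             ainfo::Ptr{Cvoid}, data::Any)::herr_t
    f, err_ref = data
    try
        # The user function is expected to return an integer status.
        return herr_t(f(loc_id, unsafe_string(attr_name)))
    catch err
        err_ref[] = err    # stash the error for the original caller
        return herr_t(-1)  # negative return tells the C library to stop
    end
end

function iterate_attributes(f, obj_id::hid_t)
    err_ref = Ref{Any}(nothing)
    fptr = @cfunction(attr_iterate_helper, herr_t,
                      (hid_t, Ptr{Cchar}, Ptr{Cvoid}, Any))
    # H5Aiterate2 receives the tuple through its `void *op_data` parameter
    # and hands it back to the helper on every invocation.
    status = ccall((:H5Aiterate2, "libhdf5"), herr_t,
                   (hid_t, Cint, Cint, Ptr{UInt64}, Ptr{Cvoid}, Any),
                   obj_id, 0, 0, Ref(UInt64(0)), fptr, (f, err_ref))
    err_ref[] === nothing || throw(err_ref[])
    return status
end
```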
The random nature of the bug makes me think the issue is related to garbage collection. I wonder whether this tuple, created in the initial call to the C function H5Aiterate and then passed back to the Julia helper function, survives garbage collection until control returns to the initial calling function.
The tests that are failing involve throwing an error in the Julia user function. This error should then be caught and returned to the function that initially called H5Aiterate. The tests occasionally fail because the error is never returned.
Is it possible that the tuple, or its contents, passed to the HDF5 C library and then passed back to a Julia function, gets garbage collected before control returns to the initial function?
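As far as I understand, ccall gc-roots its arguments for the duration of the call, so passing the tuple directly as an `Any` argument should keep it alive while H5Aiterate runs. One experiment to rule GC in or out would be to root everything explicitly (again a sketch, reusing the hypothetical names above):

```julia
function iterate_attributes_preserved(f, obj_id::hid_t)
    err_ref = Ref{Any}(nothing)
    data = (f, err_ref)
    fptr = @cfunction(attr_iterate_helper, herr_t,
                      (hid_t, Ptr{Cchar}, Ptr{Cvoid}, Any))
    # Belt and braces: GC.@preserve pins `data` across the whole C call,
    # guarding against any refactor where the tuple stops being a direct
    # ccall argument. If the failures vanish with this, rooting was the bug.
    status = GC.@preserve data begin
        ccall((:H5Aiterate2, "libhdf5"), herr_t,
              (hid_t, Cint, Cint, Ptr{UInt64}, Ptr{Cvoid}, Any),
              obj_id, 0, 0, Ref(UInt64(0)), fptr, data)
    end
    err_ref[] === nothing || throw(err_ref[])
    return status
end
```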
I’m considering replacing the current scheme with one where I explicitly allocate and free the memory involved, perhaps using serialization for the Julia components.
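Before going as far as manual allocation plus serialization, a middle ground (sketched here with hypothetical names; this is not existing HDF5.jl code, and it is a different technique from malloc/free) would be to root the payload in a global table and hand the C library only an opaque integer handle:

```julia
const PAYLOADS = Dict{UInt,Any}()      # roots payloads against the GC
const PAYLOAD_LOCK = ReentrantLock()   # the table is global mutable state
const NEXT_HANDLE = Ref(UInt(0))

register_payload(payload) = lock(PAYLOAD_LOCK) do
    h = (NEXT_HANDLE[] += 1)
    PAYLOADS[h] = payload
    h
end

unregister_payload(h) = lock(() -> delete!(PAYLOADS, h), PAYLOAD_LOCK)

# At the call site:
#   handle = register_payload((f, err_ref))
#   try
#       # ccall as before, but pass Ptr{Cvoid}(handle) as op_data
#   finally
#       unregister_payload(handle)  # explicit free, independent of the GC
#   end
# and in the helper, recover the payload with:
#   f, err_ref = PAYLOADS[UInt(data_ptr)]
```

Since nothing Julia-owned crosses the C boundary in that scheme, no object lifetime depends on the GC at all.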
I think @assert is not a good way to throw an error for testing. However, at the moment this should not be causing an issue, because no optimization level currently elides @assert.
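For completeness, the test's user function could sidestep the question entirely by throwing explicitly, since an unconditional error path cannot depend on assertion settings (a hypothetical test callback):

```julia
# Hypothetical user function for the test: it throws explicitly instead of
# via @assert, so the error path can never be compiled away.
function checking_user_fn(loc_id, name)
    name == "expected_attr" || error("unexpected attribute: $name")
    return 0
end
```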
Does anyone have any thoughts on how to collect information about the bug if I find a way to reproduce it, even occasionally?
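One trick I might try for both reproducing and localizing GC-lifetime bugs: stress the collector inside the exact window in question, and conversely disable it entirely to see whether the failures vanish (a sketch; the user function and test body are hypothetical stand-ins):

```julia
# Force a full collection while the C library is mid-iteration; if the
# bug is GC-related, this should make it far more reproducible.
function gc_stressing_user_fn(loc_id, name)
    GC.gc()
    error("deliberate failure")
end

# Conversely, run the failing test with the collector switched off; if
# the failures disappear, that strongly implicates object lifetimes.
old = GC.enable(false)
try
    # run the failing iteration test here
finally
    GC.enable(old)
end
```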