How to debug non-deterministic behavior involving ccall

I’m involved with a package that provides a Julia wrapper (SCIP.jl) to a C library (https://scip.zib.de/).
In a current pull request (#100), I found that our tests pass most of the time, but not always. This non-deterministic behavior shows up both locally and on Travis.

See for example these two builds: #424.3 passed, #425.3 failed, both on Julia nightly, which correspond to the pr and pull runs of Travis, but use identical code, since this PR starts off master.

When I run the tests locally, I would guess that they fail in about 1 of 10 cases.

I have been able to narrow down the problem to a specific function of the libscip.so library that I call: SCIPexprCreate. One of the arguments is a Cdouble, which is stored in a struct by the C library. When I retrieve it right afterwards, I can (sometimes!) see that the value is different and looks like uninitialized memory. I have set up an MWE (first in C, which always works, and then a faithful(?) recreation in Julia, which sometimes shows the failing behavior) at this gist.

At first, I thought that the problem might be caused by Julia GC, and me failing to protect some of the Julia objects that are passed into the ccalls. But I found that even if I put GC.enable(false) and GC.enable(true) around the call of the main(runs) function in my MWE script, the behavior is non-deterministic.

By the way, it does not matter whether a new Julia process is started several times on this script, or an existing session is reused to include that script several times.

So, finally, my question would be: How would I go about debugging this problem?

I already tried running valgrind using the recommended flags, but I don’t learn much from its output that is related to my own code.

Two more details that might be relevant:

  1. The C function that I call is actually defined to be variadic.
    Here is the signature from the header:
SCIP_RETCODE SCIPexprCreate(
   BMS_BLKMEM*           blkmem,             /**< block memory data structure */
   SCIP_EXPR**           expr,               /**< pointer to buffer for expression address */
   SCIP_EXPROP           op,                 /**< operand of expression */
   ...                                       /**< arguments of operand */
   );

And this is the version (special case) that I use in Julia:

function SCIPexprCreate(blkmem_, expr__, op, value)
    ccall((:SCIPexprCreate, libscip), SCIP_RETCODE,
          (Ptr{BMS_BLKMEM}, Ref{Ptr{SCIP_EXPR}}, SCIP_EXPROP, Cdouble),
          blkmem_, expr__, op, value)
end
# with these types:
const BMS_BLKMEM = Cvoid
const SCIP_EXPR = Cvoid
const SCIP_RETCODE = Cint # enum
const SCIP_EXPROP = Cint # enum

  2. SCIP (the C library) can be compiled in different memory modes (“block memory” or “standard”).

Block memory is the default, and more efficient, as it allows SCIP to reuse allocated memory internally.
Standard memory is simpler and uses system calls to allocate and free memory as needed (I believe).

I could only reproduce the failing behavior with the Julia code and SCIP compiled in block-memory mode. But the (equivalent) C code works with either memory mode.

You need special syntax for ccalling C varargs functions. See:

So in this case,

function SCIPexprCreate(blkmem_, expr__, op, value)
    ccall((:SCIPexprCreate, libscip), SCIP_RETCODE,
          (Ptr{BMS_BLKMEM}, Ref{Ptr{SCIP_EXPR}}, SCIP_EXPROP, Cdouble...),
                                                              # Note ^^^
          blkmem_, expr__, op, value)
end

To add some commentary… the intermittent failure mode you’re seeing here is super confusing. I thought it would be something like the value=2.0 which you are passing being aliased on the stack by some stack variables internal to the SCIP block memory allocator (due to incorrect use of the ABI), and thus only being clobbered seemingly at random, depending on whether SCIP decided to allocate or not.

But after reading about the complexity of the x86_64 va_arg calling convention over at amd64 and va_arg - Made of Bugs I’m slightly horrified and a bit surprised this ever worked at all! I think what has happened is that the value in the rax register (strictly, its low byte al) is left over from some work on the Julia side and may or may not be zero. When calling a varargs function, this is meant to be set to an upper bound on the number of vector registers used for floating point arguments, but it won’t be set at all if Julia doesn’t know it’s calling a varargs function. In your particular case you have one floating point argument, so the call will work when rax>=1 but fail when rax==0.
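To make the callee side of this concrete, here is a minimal sketch in C of a variadic function written in the same style as SCIPexprCreate, where the op argument decides how the trailing arguments are read. All names (demo_expr, demo_expr_create, DEMO_OP_CONST, DEMO_OP_VARIDX) are hypothetical stand-ins and not SCIP's actual definitions.

```c
#include <stdarg.h>

/* Hypothetical stand-ins for illustration; not SCIP's real types. */
enum { DEMO_OP_CONST = 0, DEMO_OP_VARIDX = 1 };

struct demo_expr {
    int    op;
    double value;  /* filled in for DEMO_OP_CONST */
    int    index;  /* filled in for DEMO_OP_VARIDX */
};

/* Variadic function: the type of the trailing argument depends on `op`.
 * Returns 1 on success and 0 for an unknown operand. */
int demo_expr_create(struct demo_expr *expr, int op, ...)
{
    va_list ap;
    int ok = 1;

    va_start(ap, op);
    expr->op = op;
    expr->value = 0.0;
    expr->index = -1;

    switch (op) {
    case DEMO_OP_CONST:
        /* The callee can only fetch the double through va_arg. On x86-64
         * System V, compilers typically serve this read from a register
         * save area that the prologue fills from the vector registers
         * only when the caller reported a nonzero count in %al. */
        expr->value = va_arg(ap, double);
        break;
    case DEMO_OP_VARIDX:
        expr->index = va_arg(ap, int);
        break;
    default:
        ok = 0;
        break;
    }

    va_end(ap);
    return ok;
}
```

A C caller that sees the variadic prototype sets up the call correctly without any extra effort, which would be consistent with the C version of the MWE always working. A caller that believes the function has a fixed signature passes the double in the same register but gives the callee no reliable hint that it is there.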


Thanks for your reply, I will try that ASAP. I have read about that syntax, but thought that it would only be relevant when I wanted to actually pass multiple arguments (of the same type).

Because I actually need to call that function with different argument types, I’ve created a whole list of Julia wrappers.

Most of them take only one argument, or two of the same type, but unfortunately, there is one that takes a Ptr{Cvoid} and a Cdouble as the two arguments in the variadic list. So I understand that I cannot call this version from Julia through ccall?
Or should I work around the same-type restriction by converting/reinterpreting the Cdouble as a pointer as well?

Unfortunately it seems like this workaround violates the calling convention. I think the only surefire way of doing this correctly right now is to create a tiny C wrapper for this particular case.

However, if you’re willing to live dangerously and the Cdouble is passed last in the argument list, you might get away with annotating only that final argument as variadic, i.e. as Cdouble... in the ccall signature. That could easily break on a new/different platform though, even if it works on x86_64 and x86.
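For reference, the tiny C wrapper mentioned above could look roughly like the following sketch. The idea is that the C compiler sees the variadic prototype and emits the correct call, while the wrapper itself exposes a fixed signature that is safe to ccall. Everything here is a hypothetical stand-in: demo_create plays the role of the variadic library function, and demo_create_ptr_real is the wrapper. A real version would include the SCIP headers and forward to SCIPexprCreate instead.

```c
#include <stdarg.h>

/* Hypothetical record that the stand-in library function fills in. */
struct demo_record {
    void  *data;
    double value;
};

/* Stand-in for the variadic library function. For this sketch it always
 * expects a pointer followed by a double after `op`. */
int demo_create(struct demo_record *out, int op, ...)
{
    va_list ap;

    va_start(ap, op);
    out->data  = va_arg(ap, void *);
    out->value = va_arg(ap, double);
    va_end(ap);
    return 1;
}

/* Fixed-signature wrapper: no varargs are visible to the foreign caller.
 * The C compiler knows demo_create is variadic and sets up the call
 * according to the platform's calling convention. */
int demo_create_ptr_real(struct demo_record *out, int op,
                         void *data, double value)
{
    return demo_create(out, op, data, value);
}
```

From Julia, such a wrapper would then be called with an ordinary, non-variadic ccall signature, with one entry per wrapper parameter.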


That’s what I feared, too. I hope it will not be too difficult to do that in a portable way as part of the Julia build.jl.

Yeah, that way seems to work reliably, for now.

BinaryBuilder should make this fairly easy. It’s too bad that SCIP is distributed under the somewhat bizarre terms of the ZIB academic license, or you could just maintain a tiny patch which could be applied to the SCIP source before it was built with BinaryBuilder. (Actually the redistribution terms for SCIP look mostly quite reasonable, except for 3.c which is impractical for use with the julia Pkg system.)

Yes, we expect our users to already have SCIP installed. Now that pre-compiled binaries, and even Debian packages are available on the website, this is no longer that much of a hurdle.

So I think I will ship a small C file with the SCIP.jl package that will be compiled during the Pkg.build.

By the way, your “dangerous” trick of only treating the last argument as variadic has worked, so far.

If you’ve got mainly Linux users, that should work fine, I guess.

That’s cool :) I don’t know how dangerous it is. To know that, you’d need to survey the details of the calling conventions on the supported platforms.
