I have a snippet of code that runs just fine in C, but when I try to make the same code work in Julia, I get memory errors: either Illegal instruction or Segmentation fault. It’s a part of a larger project, creating a Julia wrapper for an Apple Accelerate sparse matrix binary. I’ve created a minimum working example, but you’ll be unable to run it unless you’re on an Apple Silicon Mac.
mwe.jl (5.0 KB)
Here’s the equivalent C, which runs just fine. The site won’t let me upload it with a .c extension, so here it is with a .jl extension. Compile with clang -Wall -framework Accelerate -x c mwe-c.jl. mwe-c.jl (1.0 KB)
However, my question isn’t so much “what’s wrong here” as “how does one go about troubleshooting such things in a comprehensive manner?” I’ve tried several things (wrapping pointed-to objects in GC.@preserve, reading and re-reading the docs, comparing against the C header, checking that the inner structs are isbits), but to no avail. Is there some way to compare the arguments received by the library from the pure C versus from the Julia @ccall? Part of it is also my lack of familiarity with Julia’s garbage collector and memory model. I don’t understand the relationship between the scope of an object and the validity of its (possibly heap-allocated) data in Julia, which makes dealing with pointers feel mildly hazardous. And because I’m interfacing with a C library, I have no choice but to deal with pointers (at least a little bit).
Usually this just means that you’re misunderstanding how to translate types from one language to another. If it happens every time you do the ccall, it’s not about GC safety.
I would focus on a single call that is crashing. Delete all extraneous code except for this call, to get a minimal reproducible example. What is the type signature in C, and how are you calling it in Julia?
PS. You don’t generally need GC.@preserve unless you are manually constructing your own pointers to Julia objects. If you use ccall to do your conversions (correctly), all the GC safety is handled for you.
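As a minimal sketch of that distinction, the snippet below uses libc’s memcpy as a stand-in for the real library; the stand-in function and the variable names are assumptions for illustration and do not come from the original code.

```julia
# Case 1: let @ccall do the conversion. The arrays are kept alive for the
# duration of the call, so no GC.@preserve is needed here.
src = [1.0, 2.0, 3.0]
dst = zeros(3)
@ccall memcpy(dst::Ptr{Cdouble}, src::Ptr{Cdouble}, sizeof(src)::Csize_t)::Ptr{Cvoid}

# Case 2: a manually constructed pointer. `p` is just an address, so `src`
# must be preserved explicitly for as long as `p` is in use.
second = GC.@preserve src begin
    p = pointer(src)
    unsafe_load(p, 2)   # read the second element through the raw pointer
end
```

The second pattern is the one where GC.@preserve matters, for example when a raw pointer is stored inside a struct that is then handed to C.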
First of all, figure out which call fails. I suppose it’s one of the @ccalls. You can find out with a println before and after each call, or by running julia under a low-level debugger. I’d guess that one of the structs is bad, or should or should not be passed as a pointer, or something like that. I’m not familiar with the Mac, so I don’t have anything to offer in terms of debuggers. Perhaps there is a gdb there? Or something more visual. With a decent debugger you can examine the contents of the arguments passed to whichever routine fails, and compare the Julia and the C versions.
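The println approach could be sketched roughly as follows. The helper name `traced` is hypothetical, and libc’s strlen stands in for the call under suspicion. The explicit flush is the important part: without it, buffered output may be lost when the process dies.

```julia
# Hypothetical helper: announce a call before and after it runs, flushing
# each time so the message survives a hard crash inside the call.
function traced(f, label)
    println(stderr, "before ", label)
    flush(stderr)
    result = f()
    println(stderr, "after ", label)
    flush(stderr)
    return result
end

# Stand-in for the suspect @ccall.
n = traced("strlen") do
    @ccall strlen("abc"::Cstring)::Csize_t
end
```

If "before" is printed without a matching "after", the crash happened inside that call.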
I’d guess that one of the structs is bad, or should or should not be passed as a pointer, or something like that.
Aha! Change mutable struct to struct and it’s fine. My educated guess: all objects passed by value to C need to be isbits. Otherwise, the C gets a Julia reference (basically a pointer) when it expects a value. EDIT: nope, not my only issue. Ran it a dozen times and I still got memory errors a couple times. See bottom for updated minimum working example.
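A quick way to check this property before passing a struct by value is isbitstype. The two struct definitions below are hypothetical and only illustrate the difference; they are not the Accelerate types.

```julia
# Hypothetical structs with identical fields; only mutability differs.
struct ImmutableOpts
    kind::Cuint
    flag::Cuchar
end

mutable struct MutableOpts
    kind::Cuint
    flag::Cuchar
end

immutable_ok = isbitstype(ImmutableOpts)   # plain data, stored inline
mutable_ok   = isbitstype(MutableOpts)     # mutable types are never isbits
```

Here `immutable_ok` is true and `mutable_ok` is false, which is consistent with the immutable version being the one that is safe to hand to C by value.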
You don’t generally need GC.@preserve unless you are manually constructing your own pointers to Julia objects.
Good to know. I guess I went a bit overboard with the GC.@preserves.
Pared-down minimum working example mwe.jl (3.7 KB)
I’ve discovered and fixed one issue: my strategy of using a Cshort to emulate a C struct of packed bitfields was messing with field alignment. Changing that to a Cuint makes it usually run successfully, but it still gives memory errors every now and then. Checking the offsets of the struct fields (as in the help entry for fieldoffset), they now all match between the C and the Julia, so that’s not the problem.
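The offset check can be sketched like this. The struct names and fields are made up for illustration; the point is only that swapping a 16-bit field for a 32-bit one moves every later field, and the resulting numbers can be compared against offsetof and sizeof on the C side.

```julia
# Hypothetical helper: (name, byte offset, byte size) for every field of T.
layout(T::Type) =
    [(fieldname(T, i), Int(fieldoffset(T, i)), sizeof(fieldtype(T, i)))
     for i in 1:fieldcount(T)]

# Two made-up structs that differ only in the width of the middle field.
struct WithShort
    tag::UInt8
    attrs::Cushort
    flag::UInt8
end

struct WithUInt
    tag::UInt8
    attrs::Cuint
    flag::UInt8
end

short_layout = layout(WithShort)
uint_layout  = layout(WithUInt)
```

With the 16-bit field, `attrs` sits at offset 2 and the struct occupies 6 bytes; with the 32-bit field, `attrs` moves to offset 4 and the struct grows to 12 bytes.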
Useful things learned while debugging: ccall(:jl_, Cvoid, (Any,), x) is a nice way to get low-level info about x. Julia’s -g debug flag made the memory error more consistent. lldb is a neat tool, but not very useful in this case (no debug symbols in the compiled C library).
I’ve updated my minimum working example to incorporate the above comments and a few other things: e.g., to compare apples to apples, I ought to heap-allocate the arrays in the C. You may need to run the Julia file a couple of times in order to get the memory error (-g flag recommended).
If I may, what C struct definition are you trying to emulate? I’ve built FieldFlags.jl for that exact purpose (with some caveats due to the semantics of julia), which I’ve been using for data from registers on embedded devices. Maybe it’s useful for you here.
If you’re talking about SparseKind_t, that is indeed best matched by a Cuint, since the underlying data seems to be UInt32.
Here SparseKind_t and SparseTriangle_t are enums, declared to be of type unsigned int and unsigned char, with values 0-3 and 0-1, respectively. i.e. kind does actually fit in 2 bits, and triangle in 1. Can your FieldFlags.jl library handle this sort of nesting? I briefly tried to figure this out myself: writing x:1 in the body of an @bitfield struct is okay, but writing x::MyEnum:1 (trying to specify the type) I get errors.
I added a call that tries to symbolically factor the matrix, with an @assert afterward to make sure the symbolic factorization is valid (status >= 0). No change. I believe the memory error happens during the call to SparseFactor, so it doesn’t even reach the @assert.
Do you know if SparseFactor touches on the memory of one of its arguments?
The SparseMatrix_Double structs contain various pointers to memory managed by Julia’s GC, and if the C side re-allocs some then you will have troubles.
Your C program passes DenseVector_Double to SparseSolve, whereas your Julia version passes a DenseMatrix_Double. They aren’t the same, are they?
Correct, they aren’t quite the same. (mwe.c started out as a copy-paste of an example from the documentation.) But I can easily fix that. mwe.c.jl (1.9 KB)
Do you know if SparseFactor touches on the memory of one of its arguments? The SparseMatrix_Double structs contain various pointers to memory managed by Julia’s GC, and if the C side re-allocs some then you will have troubles.
I’m not sure how SparseFactor operates on the memory. From the headers, I can see that the call eventually becomes _SparseFactorQR_Double(type, &Matrix, &options, &nfoptions) where the last 2 args are some default algorithm parameters, but that’s about it. If the SparseFactor call re-allocates some memory, wouldn’t that cause issues in the C as well? Or is it that there’s some incompatibility between Julia’s GC and C reallocations? The error thrown by the library typically mentions SparseWriteMatrix or _SparseSymbolicFactorQR, if that helps.
Aside: I purposefully made A in AX = B square, so that SparseSolve can overwrite B with X safely (no buffer overflow).
Edit: to test the hypothesis that Julia’s GC is messing with things, I switched to managing all the buffers through @ccalls: i.e., @ccall malloc(...), unsafe_copyto!, use, then finally @ccall free(...). Same result: it works 80% of the time and gives memory errors 20% of the time.
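For reference, the malloc, copy, use, free pattern described above might look roughly like the sketch below. The function name `roundtrip` is hypothetical, and reading the values back stands in for the actual library call.

```julia
# Hypothetical sketch: copy a Julia vector into C-owned memory, use it,
# then release it. The GC never sees or moves the malloc'd buffer.
function roundtrip(src::Vector{Cdouble})
    n = length(src)
    buf = @ccall malloc(sizeof(src)::Csize_t)::Ptr{Cdouble}
    buf == C_NULL && error("malloc failed")
    try
        # `src` must stay alive while its raw pointer is in use.
        GC.@preserve src unsafe_copyto!(buf, pointer(src), n)
        # Stand-in for the real library call: read the buffer back.
        return [unsafe_load(buf, i) for i in 1:n]
    finally
        @ccall free(buf::Ptr{Cvoid})::Cvoid
    end
end

copied = roundtrip([1.0, 2.0, 3.0])
```

Because the buffer is allocated and freed by C, a crash that persists with this pattern points away from the garbage collector.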
It’s curious that you need Cuint for this, since this struct is nominally 16 bits long, and not 32. There are also lots of potential issues regarding the order of the fields in that bitfield, since IIRC that is not specified by the standard and is instead implementation-defined…
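To make the ordering concern concrete, here is a rough sketch of packing such fields by hand into a Cuint. It assumes the compiler allocates bit-fields starting from the least significant bit, which is an assumption about the target ABI and not something the C standard guarantees. The field names and positions are hypothetical: a 1-bit flag, then the 1-bit triangle, then the 2-bit kind.

```julia
# ASSUMPTION: bit-fields are allocated from the least significant bit upward.
# Hypothetical layout: flag in bit 0, triangle in bit 1, kind in bits 2-3.
pack_attrs(; flag::Bool=false, triangle::Integer=0, kind::Integer=0) =
    Cuint(flag) | (Cuint(triangle & 0x1) << 1) | (Cuint(kind & 0x3) << 2)

# Extract the individual fields again.
triangle_of(bits::Cuint) = (bits >> 1) & 0x1
kind_of(bits::Cuint)     = (bits >> 2) & 0x3

bits = pack_attrs(triangle=1, kind=2)
```

If the real compiler used the opposite order, the same values would land in different bits, which is why the layout needs to be verified against the compiled C.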
Ah, it’s nested… That complicates things a bit, since I’ve put off implementing that sort of assertion because I haven’t needed it yet. The issue to track for that is Custom field type annotations · Issue #9 · Seelengrab/FieldFlags.jl · GitHub, which itself isn’t difficult to implement apart from the mutability caveat (I might be able to do this on the weekend, if I find a few hours). Other than that, though, the bit sizes should work out just fine; this is what it’s made for, after all.
Problem resolved! The library was behaving as intended (after this fix): rather, my error catching was the problem. Apparently Apple’s implementation of the sparse QR factorization throws an error if the matrix is singular. (Why don’t they mention that in the documentation…) In the Julia, I use sprand to create a random 3x3 sparse matrix: the error occurs precisely when that matrix is singular. The default error handling depends on Objective C functionality (nil, os_error_log), so Julia reports illegal instruction/bad access/etc instead of the intended “Your matrix is singular” message.
In hindsight, I should’ve put the exact numerical example that caused the error into the C, instead of hard-coding a fixed arbitrary matrix structure.
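A rough sketch of both lessons follows: detecting a singular test matrix before handing it to the library, and dumping the exact arrays so the same example can be pasted into the C program. The helper names are hypothetical, and the zero-based compressed-column output is an assumption about what the C side expects.

```julia
using LinearAlgebra, SparseArrays

# Hypothetical helper: true when a square sparse matrix is numerically
# singular. Going through a dense copy is fine for tiny test matrices.
is_singular(A::SparseMatrixCSC) = rank(Matrix(A)) < size(A, 1)

# Hypothetical helper: the CSC arrays shifted to zero-based indexing
# (ASSUMPTION: the C library uses zero-based compressed-column storage).
c_arrays(A::SparseMatrixCSC) =
    (colstarts = A.colptr .- 1, rowidx = A.rowval .- 1, values = copy(A.nzval))

# A deliberately singular 3x3 matrix: its third row and column are empty.
bad = sparse([1, 2], [1, 2], [1.0, 2.0], 3, 3)
# Adding 3I gives every diagonal entry a nonzero value, making it nonsingular.
good = bad + 3I

arrays = c_arrays(bad)
```

Seeding the random number generator, or checking `is_singular` on each sprand result, would have made the intermittent failures reproducible.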