Creating a new value using `jl_new_struct_uninit`

In CxxWrap, C++ objects are wrapped in a struct like this:

struct CxxPtr{T} <: CxxBaseRef{T}
  cpp_object::Ptr{T}
end

Completely new objects are stored in a mutable version of the above struct, with exactly the same layout. On the C++ side, I use jl_new_struct to allocate the new struct, with as field argument the boxed pointer to store. However, this requires two allocations, so I tried to work around that using jl_new_struct_uninit, as seen here:

https://github.com/JuliaInterop/libcxxwrap-julia/pull/53/commits/5118cd9b07064b85a943f9835fceacb64ab714ce

Unfortunately, this appears to trigger a segfault. So my question is: what is the correct way to allocate a struct like the above one, using only a single allocation (if possible), for both the mutable and immutable version (immutable is for boxing when needed)?

1 Like

Do you have a MWE of the segfault? Many times I saw unexplainable segfault was because I was forgetting to protect a variable from the garbage collector.

2 Likes

No MWE unfortunately, I can’t even reproduce the problem anywhere except on Travis Linux (Mac is fine).

Checking for a GC problem is indeed always a good idea. Unfortunately running the tests without GC gives exactly the same error.

The log is here, but it doesn’t really say much: Travis CI - Test and Deploy Your Code with Confidence

Update: it seems that using memcpy to the directly dereferenced the pointer to the Julia object fixes it. Not sure what the difference is or if it is correct now, so having someone who is familiar with the Julia C code look at this would reassure me:

https://github.com/JuliaInterop/libcxxwrap-julia/pull/53/commits/733f403350a7c5ed477905e97328d41bfc4122d5

1 Like

Another update: I think this commit makes it more equivalent with the way jl_new_struct behaves: https://github.com/JuliaInterop/libcxxwrap-julia/pull/53/commits/40db46874de18e8f09603d81f8eb18a011051400

However, testing what avoiding the extra allocation actually brings in speed gain is making me question the usefulness of the change: the original version takes 0.0066 s to allocate 100000 C++ objects (2 Julia allocations per C++ object), while the new version takes … 0.0058 s, at indeed only one Julia allocation per C++ object.

@yuyichao Do you have any thoughts on this, please? Is the new version safe, or is tinkering at this low level simply not worth it?

I can only think of any issue here to be caused by aliasing and if that’s the case using memcpy should be enough.

I’m pretty sure travis support remote login that you can use to debug and download the binary from there.

1 Like

I think previous crashes were caused by some kind of alignment-related UB, which I don’t understand. In Julia they seem to cast to the integer types because of that, or at least the git comments led me to this:

https://github.com/JuliaLang/julia/issues/26512

What make you think it is related to alignment and no the julia commit has nothing to do with integer typeps.

The crash was fixed by using (on 64 bit)

uint64_t val = *(uint64_t*)(&cpp_ptr);
memcpy(dest, &val, 8);

instead of:

memcpy(dest, &cpp_ptr, sizeof(T*));

That change was inspired by the implementation of jl_store_unaligned_i64, which is called by set_nth_field in jl_new_struct, and which were modified in the commit that fixed the issue I cited.

What troubles me is that I don’t understand the difference between the crashing and the working implementation, so I have no idea why this fixes it (or even if it fixes it properly and always).

Again, the only think I can think of is that you are triggering some UB (or compiler bug) and the best way to debug it would still be having a debugger attached to it one way or another.

And FYI, your “fixed” version made MORE assumptions about both alignment and aliasing, which is exactly the opposite of the julia commit you linked.

Possibly, part of the problem is that I don’t understand what those assumptions are and why they are needed, though. But the same casting to uint happens in Julia, why is that done?

When I had similar crashes locally, valgrind and the debugger didn’t point to anything special, except that accessing the created Julia object results in a crash, and creating it using the higher level jl_new_struct fixed it (or at least the crash went away), but introduced an extra allocation for boxing the pointer.

Because it is making the assumption that the alignment and type are correct. Also, there’s no casting to uint anywhere, the cast is on the pointer. These are just sligntly more optimized case of the default branch. They are there for the sole purpose of giving the compiler some hint to produce better code for the most likely usecase. It will be completely correct and should not crash without any branchs other than the default one.

Well, I’m not sure what do you mean by “debugger didn’t point to anything special”. I assume the debugger will point to where it crashes which will tell you what instruction and point it is and should allow you to inspect the condition it happens, the normal thing you’ll use a debugger to do.

Thanks for the clarification, and indeed after switching to the default branch implementation and also exactly the same implementation that crashed before, it doesn’t crash anymore. Which is actually bad news, because it means there is probably some kind of UB elsewhere.

What I mean with “nothng special” is that when I did manage to reproduce the crash in a debugger, it always segfaulted on this seemingly innocent line, and valgrind didn’t show anything invalid up to that point:

https://github.com/JuliaInterop/libcxxwrap-julia/blob/fix-double-alloc/include/jlcxx/const_array.hpp#L94

Here, ptr is a Julia object created by the code we discussed.

And while writing this, it crashed again on this change, which should actually be no change: Comparing 035d575df6b2e30d03d1d8142222aef1e035b39f..a4dce9720f32ba9bb4d06efe0c54deeb65f0a6c2 · JuliaInterop/libcxxwrap-julia · GitHub

(log here)

The debugger has way more functions than just line numbers. As I said above, you should look at the instructions and variables.

OK, I have inspected the value returned by both implementations, and it is always the same, I haven’t been able to reproduce the crash on Travis either anymore. My best guess at this point is that our discussion properly aligned the planets so it works now :wink:

I’ve merged the change, since this part of the code was probably never the cause of the crashes in the first place. Thanks for the help!