Without something runnable, I don’t think people here can help you much (well, except by pointing at random things and making general statements). For starters, it does not need to be an MWE, just something that is easy to set up and run in some form.
Concerning multithreading: I think your workloads are quite short right now, so the overhead of threading (spawning a Task and scheduling it) might outweigh the speedup. I found this old thread where the overhead of Threads.@threads is on the order of 10 µs. To get good speedups in short sections there are different libraries, such as Polyester.jl. But I personally think we should try to optimize the single-threaded case first, to get back to the speed of your C++ implementation.
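To get a feeling for that overhead on your own machine, here is a rough sketch that times a nearly empty loop with and without Threads.@threads. The function names (serial_noop, threaded_noop, time_per_call) are made up for this illustration and the numbers will vary with Julia version and thread count, so treat it as a sanity check rather than a proper benchmark.

```julia
using Base.Threads

# Nearly empty loop body: almost all of the time is loop and scheduling overhead.
function serial_noop(out)
    for i in eachindex(out)
        out[i] = i
    end
    return out
end

# Same loop, but each call spawns and schedules tasks via Threads.@threads.
function threaded_noop(out)
    @threads for i in eachindex(out)
        out[i] = i
    end
    return out
end

# Average seconds per call; one warm-up call keeps compilation out of the timing.
function time_per_call(f, out; reps = 10_000)
    f(out)
    t = @elapsed for _ in 1:reps
        f(out)
    end
    return t / reps
end

out = zeros(Int, nthreads())
t_serial = time_per_call(serial_noop, out)
t_threaded = time_per_call(threaded_noop, out)

println("threads:  ", nthreads())
println("serial:   ", round(t_serial * 1e6; digits = 3), " µs per call")
println("threaded: ", round(t_threaded * 1e6; digits = 3), " µs per call")
```

If one call of your real kernel takes about as long as the threaded number printed here, threading it with Threads.@threads is unlikely to pay off.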
Of course, finding the segfault would be very interesting! A quick and easy check would be to remove all @inbounds from your code base. If the segfault still occurs, then it really might be a bug in Julia, which will be harder to isolate.
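An alternative to editing the code by hand is to start Julia with --check-bounds=yes, which forces bounds checking even inside @inbounds blocks. The sketch below uses a made-up buggy kernel (buggy_sum) to show the idea. It reads Base.JLOptions().check_bounds, which is an internal field and not a stable API, only to avoid running the out-of-bounds read when checks are off.

```julia
# Hypothetical buggy kernel: the loop runs one element past the end of `x`,
# and @inbounds hides the error in a default Julia session.
function buggy_sum(x)
    s = zero(eltype(x))
    @inbounds for i in 1:length(x) + 1
        s += x[i]
    end
    return s
end

# check_bounds is 1 when Julia was started with --check-bounds=yes (internal field).
if Base.JLOptions().check_bounds == 1
    try
        buggy_sum([1.0, 2.0, 3.0])
    catch err
        println("caught: ", typeof(err))
    end
else
    println("Start Julia with --check-bounds=yes to force bounds checks inside @inbounds.")
end
```

If your program throws a BoundsError under that flag instead of segfaulting, the bug is an out-of-bounds access in your code rather than in Julia itself.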