Help speed up a tight triple loop

It’s worth noting that the @simd annotation on the innermost for loop doesn’t do anything because of the if branch.
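To illustrate the point about the branch, here is a minimal sketch of one common workaround: replacing the `if` with `ifelse`, which evaluates both arguments and selects one, so the loop body has no control flow. The kernel below is hypothetical (the names `sum_above_branchy`, `sum_above_branchless`, and the threshold logic are my own stand-ins, not your actual loop), and whether the compiler really vectorizes either version depends on the loop body and the Julia/LLVM version.

```julia
# Hypothetical kernel: sum the elements of x that exceed a threshold t.
# Version with a branch in the loop body, which can prevent vectorization.
function sum_above_branchy(x::Vector{Float64}, t::Float64)
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        if x[i] > t
            s += x[i]
        end
    end
    return s
end

# Same computation without a branch: ifelse selects between two
# already-computed values, so every iteration does identical work.
function sum_above_branchless(x::Vector{Float64}, t::Float64)
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += ifelse(x[i] > t, x[i], 0.0)
    end
    return s
end
```

You can check what actually happens on your machine with `@code_llvm sum_above_branchless(rand(100), 0.5)` and look for vector types such as `<4 x double>` in the output. Note that `ifelse` only helps when both alternatives are cheap and safe to evaluate unconditionally.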

Now, I know you said this function is being called in a multi-threaded loop, but have you considered using threads inside this function as well? If the outer multi-threaded loop is using Threads.@spawn (or anything derived from that) instead of Threads.@threads, then you shouldn’t get any destructive interference from the nested multi-threading and can see performance improvements if there’s any waiting happening in the outer threaded loop.
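As a rough sketch of what I mean by nesting, assuming the outer loop is built on Threads.@spawn: the inner function splits its own work into chunks and spawns a task per chunk, and the scheduler distributes all tasks over the available threads. Everything here is hypothetical (`inner_work`, `outer_loop`, the `nchunks` keyword, and the sum-of-squares workload are placeholders for your real code).

```julia
using Base.Threads: @spawn

# Hypothetical stand-in for the function being called in the outer loop.
# It splits its index range into chunks and processes each chunk in its own task.
function inner_work(data::Vector{Float64}; nchunks::Int = Threads.nthreads())
    chunk_len = max(1, cld(length(data), nchunks))
    chunks = Iterators.partition(eachindex(data), chunk_len)
    tasks = [@spawn sum(abs2, view(data, r)) for r in chunks]
    # Combine the partial results; init covers the empty-input case.
    return sum(fetch, tasks; init = 0.0)
end

# Hypothetical outer loop: one task per dataset, each of which spawns
# its own inner tasks.
function outer_loop(datasets::Vector{Vector{Float64}})
    tasks = [@spawn inner_work(d) for d in datasets]
    return map(fetch, tasks)
end
```

Start Julia with more than one thread (for example `julia -t 4`) to see any effect. Whether this is a net win depends on how much work each chunk does relative to the task overhead, so it is worth benchmarking both ways.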