Can you add checkbounds=true to the ODEProblem? It might be indexing out of bounds in the multithreaded sharded form? That’s my guess.
That’s because of the SIMD you’d get from the Cartesian iteration. The MTK generated code will lose that and will be slower, changing to Array{Float64, 3} will not help that in the current code gen since it would be more about the SLP vectorization pass. For this case, it would be most optimal to build the sparse Jacobian code and add that to the original f, though I wouldn’t be surprised if the difference in the f’s don’t make a major difference in the final ODE time since it’s probably dominated by Jacobian time.
That makes sense. I’m mostly pushing you down this route to reduce the kinds of possible bugs to hone in on what’s actually going on. For your real case, that split approach is a good idea.
It sounds like the issue is the multithreaded sharded form probably “completes” the indexing, so if it splits to 4 parts with 4 threads on 15 ODEs, it indexes 1:16 with even parts in all 4, and errors out with a bounds error (because it should’ve cut the last one early). Right now that’s just a guess though given your last post.