How to match performance of sum(A, dims=1)?

I don’t think race conditions are an issue here since, IIUC, no two tasks access the same memory location. What @simd does is allow the compiler to reorder operations as if they were associative, so it may evaluate a + (b+c) instead of (a+b) + c. Floating-point addition is not exactly associative, so using @simd in a reduction can change the result by a rounding error. We can check that mysum(arr) typically differs from vec(sum(arr, dims=1)) under ==, even though the two are approximately equal (≈).
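
Here is a minimal sketch of that check. The mysum from the thread isn't shown in this post, so mysum_simd below is a hypothetical stand-in (a plain column-wise loop with @simd on the inner reduction), and the array size is arbitrary:

```julia
# Floating-point addition is not associative: regrouping changes the rounding.
println((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # false

# Hypothetical stand-in for the thread's mysum: sums each column of A.
function mysum_simd(A::AbstractMatrix{T}) where {T}
    out = zeros(T, size(A, 2))
    @inbounds for j in axes(A, 2)
        s = zero(T)
        # @simd lets the compiler reassociate the additions in this loop
        @simd for i in axes(A, 1)
            s += A[i, j]
        end
        out[j] = s
    end
    return out
end

arr = rand(1000, 100)
ref = vec(sum(arr, dims=1))

# Exact equality is not guaranteed: the two may add the terms in different orders.
println(mysum_simd(arr) == ref)

# Approximate equality should hold, since only the rounding differs.
println(mysum_simd(arr) ≈ ref)
```

Whether the == comparison prints true or false can depend on the Julia version, the CPU, and how the compiler vectorizes the loop, so I wouldn't rely on either outcome. The ≈ comparison is the one to use in tests.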