I can reproduce the `code_native` and `code_llvm`, but not the slower benchmark times. For me they benchmark exactly equal. Probably the additional overhead in the native code is not executed on every loop iteration.
However, it seems that the addition of `@simd` significantly benefits the one-dimensional code, while not being very useful in the two-dimensional code. This makes about a 16x difference on my machine.
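To make the comparison concrete, here is a minimal sketch of the two loop shapes being discussed. The functions `sum_1d` and `sum_2d` are hypothetical stand-ins, not the code from the question, and a plain sum is assumed as the loop body; timings will differ by machine and Julia version.

```julia
# Hypothetical kernels for illustration; the original functions are not shown here.

# One-dimensional: a single linear loop over all elements, annotated with @simd.
function sum_1d(a::AbstractArray{Float64})
    s = 0.0
    @inbounds @simd for i in eachindex(a)
        s += a[i]
    end
    return s
end

# Two-dimensional: nested loops, with @simd on the innermost loop only
# (the first index varies fastest in Julia's column-major layout).
function sum_2d(a::AbstractMatrix{Float64})
    s = 0.0
    @inbounds for j in axes(a, 2)
        @simd for i in axes(a, 1)
            s += a[i, j]
        end
    end
    return s
end

a = rand(64, 64)
sum_1d(a); sum_2d(a)          # warm up so compilation is not timed
t1 = @elapsed sum_1d(a)
t2 = @elapsed sum_2d(a)
println("1-D: ", t1, " s, 2-D: ", t2, " s")
```

A single `@elapsed` call is noisy, so a proper benchmark should repeat the measurement many times. In the REPL, `@code_llvm sum_1d(a)` can be used to check whether vector types such as `<4 x double>` appear in the generated code.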