Summing matrix elements is >1000X slower than summing vector elements

I can reproduce the code_native and code_llvm output, but not the slower benchmark times; for me the two versions benchmark exactly equal. The additional overhead in the native code is probably not executed on every loop iteration.

However, it seems that the addition of @simd significantly benefits the one-dimensional code, while not being very useful in the two-dimensional code. This makes about a 16x difference on my machine.
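
For reference, here is a minimal sketch of the kind of comparison I mean. The function names (sum_vec_simd, sum_mat_simd) and the array sizes are my own placeholders, not the code from the original post, and the timings will vary by machine and Julia version. In the two-dimensional version @simd is applied only to the inner loop, since as far as I know it is meant for a loop over a one-dimensional range.

```julia
# Hypothetical helpers for illustration; not the functions from the original post.

# One-dimensional sum: a single loop over linear indices, annotated with @simd,
# which allows the compiler to reorder the floating-point additions.
function sum_vec_simd(v::AbstractVector)
    s = zero(eltype(v))
    @simd for i in eachindex(v)
        @inbounds s += v[i]
    end
    return s
end

# Two-dimensional sum: nested loops in column-major order.
# @simd annotates only the innermost loop over the rows.
function sum_mat_simd(A::AbstractMatrix)
    s = zero(eltype(A))
    for j in axes(A, 2)
        @simd for i in axes(A, 1)
            @inbounds s += A[i, j]
        end
    end
    return s
end

# Assumed sizes: the same 10^6 elements viewed as a vector and as a 1000x1000 matrix.
v = rand(10^6)
A = reshape(v, 1000, 1000)

# Call each function once first so compilation time is not measured.
sum_vec_simd(v)
sum_mat_simd(A)

t_vec = @elapsed sum_vec_simd(v)
t_mat = @elapsed sum_mat_simd(A)
println("vector: ", t_vec, " s, matrix: ", t_mat, " s")
```

A single @elapsed call is noisy, so for real measurements something like BenchmarkTools is a better choice. Removing the @simd annotations from both functions and timing again should show how much each version gains from it.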