I’ve been exploring reduction operations on GPUs recently, and this led me to investigate CPU performance as well. I discovered that we can improve mapreduce in Base while maintaining full precision for floating-point operations.
My analysis suggests there may be an alignment issue with SIMD in the current Base implementation, leading to suboptimal performance for common computational patterns. Additionally, I’ve implemented specialized functions for small arrays (fewer than 32 elements) that provide significant speedups.
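To illustrate the kind of small-array specialization I mean, here is a minimal sketch, not the actual code; the name mapreduce_small and the length-32 cutoff in the wrapper are just assumptions for the example. The idea is that for very short arrays the generic blocked machinery has fixed overhead, so a plain sequential loop can win:

```julia
# Hypothetical small-array fast path (illustrative only, not the Base implementation).
# For short arrays the generic blocked/pairwise reduction has noticeable setup
# overhead, so a simple sequential loop can be faster at these lengths.
function mapreduce_small(f, op, A::AbstractArray)
    @assert !isempty(A)
    i = firstindex(A)
    acc = f(A[i])
    @inbounds for j in i+1:lastindex(A)
        acc = op(acc, f(A[j]))
    end
    return acc
end

# A wrapper could then dispatch on length, e.g.:
mysum(A) = length(A) < 32 ? mapreduce_small(identity, +, A) : sum(A)
```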
The most challenging part here is getting speedups that are as uniform as possible across all array types, shapes, sizes, element types, computer architectures, etc.
For example, a change to mapreduce that makes it 20% faster on Array{Float64} might accidentally cause a 10x regression on a ReshapedArray{BigInt, 2, SubArray{...}} (not that type in particular, I just made something up for dramatic effect).
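One way to guard against that kind of regression is to time the same reduction over several container types, not just Array{Float64}. A rough sketch using BenchmarkTools; the particular containers here are only illustrative:

```julia
using BenchmarkTools

# Illustrative check: benchmark the same reduction over a few container types
# to make sure a fast path for dense arrays doesn't hurt views, reshapes, or
# non-bitstype element types.
A = rand(Float64, 1_000)                       # dense, contiguous
V = view(A, 1:2:length(A))                     # strided view
R = reshape(view(rand(100, 100), 1:10, :), :)  # reshaped view, non-contiguous
B = big.(rand(1:10, 100))                      # BigInt elements, no SIMD possible

for x in (A, V, R, B)
    @btime sum($x)
end
```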
I’m looking into the post you mention. In any case, this is a very interesting problem, both for optimizing Julia and for learning new things about Julia.
As you say, using @simd looks to be a performance killer for views and more complex types. However, I cannot see how to do better (at least on my computer) than @simd for small concrete types and coalesced memory – see my reply on the performance challenge post.
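For reference, this is the kind of @simd loop I have in mind; it is a sketch, not anyone's final implementation, but it shows the pattern that is hard to beat for contiguous arrays of small bitstype elements while losing most of its benefit on strided views:

```julia
# Straightforward @simd reduction. Very fast for contiguous arrays of bitstype
# elements (Float64, Float32, Int, ...); the same loop over a strided view
# typically cannot vectorize effectively and is much slower per element.
# Note that @simd permits reassociation, so floating-point results may differ
# slightly from a strictly sequential sum.
function simd_sum(A::AbstractArray{T}) where {T}
    acc = zero(T)
    @inbounds @simd for i in eachindex(A)
        acc += A[i]
    end
    return acc
end

simd_sum(rand(Float64, 10_000))           # vectorizes well
simd_sum(view(rand(10_000), 1:2:10_000))  # same code, much slower per element
```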