Improving Base.mapreduce by 20% on vectors

Hello everyone,

I’ve been exploring reduction operations on GPUs recently, and this led me to investigate CPU performance as well. I discovered that we can improve mapreduce in Base while maintaining full precision for floating-point operations.

My analysis suggests there may be an alignment issue with SIMD in the current Base implementation, which leads to suboptimal performance for common computational patterns. Additionally, I’ve implemented specialized functions for small arrays (size < 32) that provide significant speedups.
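To make the two ideas concrete, here is a hedged sketch of my own (an illustration of the general technique, not the notebook's actual implementation): a plain sequential loop for small arrays, where the generic machinery's overhead can dominate, and an `@simd` loop otherwise, which lets the compiler reassociate and vectorize the accumulation.

```julia
# Sketch only: a sum with a specialized small-array path (threshold 32,
# matching the post) and an @simd path for larger inputs. The function
# name `small_aware_sum` is made up for this example.
function small_aware_sum(v::Vector{T}) where {T<:AbstractFloat}
    isempty(v) && return zero(T)
    s = zero(T)
    if length(v) < 32
        # Straight loop: no blocking/dispatch overhead for tiny inputs.
        @inbounds for i in eachindex(v)
            s += v[i]
        end
    else
        # @simd permits floating-point reassociation, enabling vectorization.
        @inbounds @simd for i in eachindex(v)
            s += v[i]
        end
    end
    return s
end
```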

I've documented the entire analysis and implementation in a Pluto notebook (more details will be added to it later):

https ://epilliat.github.io/coding/notebooks/mapreduce_cpu_perf.html

(Remove the space after https; I cannot post a link to the HTML file or upload it.)

The key results:

  • ~20% performance improvement over Base for most array sizes
  • Maintained numerical precision for floating-point types
  • Better performance across multiple numeric types including Float32
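One quick way to sanity-check the precision claim (my own check, not taken from the notebook) is to compare a Float32 reduction against a reference accumulated in Float64:

```julia
# Accumulating in Float64 and rounding once at the end gives a
# higher-precision reference to measure the Float32 sum against.
x = Float32[0.1f0 * i for i in 1:10_000]
ref = Float32(sum(Float64, x))   # extended-precision reference
err = abs(sum(x) - ref)          # error of the plain Float32 reduction
```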

I’d appreciate any feedback on the approach and results!


looks like a fun experiment!

you may be interested in this thread: Performance challenge: can you write a faster sum?

and there are lots of benchmarks in the associated PR WIP: The great pairwise reduction refactor by mbauman · Pull Request #58418 · JuliaLang/julia · GitHub

the most challenging part here is getting (as) uniform (as possible) speedups across all array types, shapes, sizes, element types, computer architectures, etc.

for example, a change to mapreduce that makes it 20% faster on Array{Float64} might accidentally cause 10x regressions on a ReshapedArray{BigInt, 2, SubArray{...}} (not particularly that type, just made something up for dramatic effect)
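A minimal sketch of that concern: any replacement should at least be checked for correctness (and then benchmarked, e.g. with BenchmarkTools.jl) across several array wrappers, not just `Vector`. The cases below are my own illustrative picks, not an exhaustive suite.

```julia
# Smoke-test a reduction across a few common wrapper types to catch
# cases where a fast path silently misbehaves or falls off a cliff.
base = rand(10_000)
cases = (
    base,                              # plain Vector
    view(base, 1:2:lastindex(base)),   # strided SubArray
    reshape(base, 100, 100),           # matrix shape over the same data
)
for A in cases
    # Each wrapper should agree with summing a plain copy of itself.
    @assert sum(A) ≈ sum(collect(A))
end
```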