Improving Base.mapreduce by 20% on vectors

Hello everyone,

I’ve been exploring reduction operations on GPUs recently, and this led me to investigate CPU performance as well. I discovered that we can improve mapreduce in Base while maintaining full precision for floating-point operations.

My analysis suggests there may be an alignment issue with SIMD in the current Base implementation, which leads to suboptimal performance for common computational patterns. Additionally, I’ve implemented specialized functions for small arrays (size < 32) that provide significant speedups.
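To make the two ideas concrete, here is a hedged sketch of my own (an illustration of the general technique, not the notebook's actual implementation): a plain sequential loop for small arrays, where the generic machinery's overhead can dominate, and an `@simd` loop otherwise, which lets the compiler reassociate and vectorize the accumulation.

```julia
# Sketch only: a sum with a specialized small-array path (threshold 32,
# matching the post) and an @simd path for larger inputs. The function
# name `small_aware_sum` is made up for this example.
function small_aware_sum(v::Vector{T}) where {T<:AbstractFloat}
    isempty(v) && return zero(T)
    s = zero(T)
    if length(v) < 32
        # Straight loop: no blocking/dispatch overhead for tiny inputs.
        @inbounds for i in eachindex(v)
            s += v[i]
        end
    else
        # @simd permits floating-point reassociation, enabling vectorization.
        @inbounds @simd for i in eachindex(v)
            s += v[i]
        end
    end
    return s
end
```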

I've documented the entire analysis and implementation in a Pluto notebook (more details will be added to it later):

https ://epilliat.github.io/coding/notebooks/mapreduce_cpu_perf.html

(Remove the space after https; I cannot post a link to the HTML file or upload it.)

The key results:

  • ~20% performance improvement over Base for most array sizes
  • Maintained numerical precision for floating-point types
  • Better performance across multiple numeric types including Float32
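One quick way to sanity-check the precision claim (my own check, not taken from the notebook) is to compare a Float32 reduction against a reference accumulated in Float64:

```julia
# Accumulating in Float64 and rounding once at the end gives a
# higher-precision reference to measure the Float32 sum against.
x = Float32[0.1f0 * i for i in 1:10_000]
ref = Float32(sum(Float64, x))   # extended-precision reference
err = abs(sum(x) - ref)          # error of the plain Float32 reduction
```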

I’d appreciate any feedback on the approach and results!


looks like a fun experiment!

you may be interested in this thread: Performance challenge: can you write a faster sum?

and there are lots of benchmarks in the associated PR WIP: The great pairwise reduction refactor by mbauman · Pull Request #58418 · JuliaLang/julia · GitHub

the most challenging part here is getting (as) uniform (as possible) speedups across all array types, shapes, sizes, element types, computer architectures, etc.

for example, a change to mapreduce that makes it 20% faster on Array{Float64} might accidentally cause 10x regressions on a ReshapedArray{BigInt, 2, SubArray{...}} (not particularly that type, just made something up for dramatic effect)
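A minimal sketch of that concern: any replacement should at least be checked for correctness (and then benchmarked, e.g. with BenchmarkTools.jl) across several array wrappers, not just `Vector`. The cases below are my own illustrative picks, not an exhaustive suite.

```julia
# Smoke-test a reduction across a few common wrapper types to catch
# cases where a fast path silently misbehaves or falls off a cliff.
base = rand(10_000)
cases = (
    base,                              # plain Vector
    view(base, 1:2:lastindex(base)),   # strided SubArray
    reshape(base, 100, 100),           # matrix shape over the same data
)
for A in cases
    # Each wrapper should agree with summing a plain copy of itself.
    @assert sum(A) ≈ sum(collect(A))
end
```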