Why is sum slower than multiplying by vector of ones?

Jad_Zeitouni · September 28, 2023, 11:51am

So when I try to sum a 10000 x 100 matrix x on its second dimension, using sum(x, dims=2), it takes about 200 microseconds. If I instead do x * ones(100), this takes only 40 microseconds. What gives?

Here’s the full code

using BenchmarkTools
x = rand(10000, 100);
@benchmark sum($x, dims=2) # mean ≈ 200 μs
@benchmark $x * ones(100) # mean ≈ 40 μs

Oscar_Smith · September 28, 2023, 12:52pm

The matrix-vector product adds the elements in an order that takes better advantage of the cache. We should probably make sum use this sort of ordering in Base.
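To make "an order that takes better advantage of the cache" concrete, here is a minimal sketch of the same row sum written with two loop orders. The function names rowsum_by_columns and rowsum_by_rows are hypothetical and made up for illustration; this is not how Base implements sum or how BLAS computes the product, only a picture of what the access pattern means for a column-major array.

```julia
# Row sums of a matrix, written with two different loop orders.
# Both functions are hypothetical illustrations, not Base or BLAS code.

# Column in the outer loop: the inner loop reads x[1, j], x[2, j], ...,
# which sit next to each other in memory because Julia arrays are column-major.
function rowsum_by_columns(x::Matrix)
    m, n = size(x)
    res = zeros(eltype(x), m)
    for j in 1:n
        @inbounds @simd for i in 1:m
            res[i] += x[i, j]
        end
    end
    return res
end

# Row in the outer loop: consecutive reads in the inner loop are m elements apart.
function rowsum_by_rows(x::Matrix)
    m, n = size(x)
    res = zeros(eltype(x), m)
    for i in 1:m
        acc = zero(eltype(x))
        for j in 1:n
            @inbounds acc += x[i, j]
        end
        res[i] = acc
    end
    return res
end

x = rand(10_000, 100)
a = rowsum_by_columns(x)
b = rowsum_by_rows(x)
```

Both versions should return the same vector; timing them with @benchmark on your own machine is the way to see how much the ordering matters there.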

Jad_Zeitouni · September 28, 2023, 1:09pm

Would you mind elaborating? As someone who knows very little about lower-level programming, I’m eager to learn.

Would it even be possible to implement this faster ordering using for loops in Julia? A simple for loop implementation does a bit better than sum but not much.

function f(x)
    m, n = size(x)
    res = zeros(m)
    # Columns in the outer loop, so the inner loop walks contiguous memory
    # (Julia arrays are column-major).
    for j in 1:n
        @simd for i in 1:m
            @inbounds res[i] += x[i, j]
        end
    end
    return res
end

HanD · September 28, 2023, 1:42pm

I’m guessing this is because matrix operations are handled by BLAS, which uses multiple threads for parallel computations by default. Whereas sum is implemented in Julia, and runs on a single thread by default.

Watching CPU usage during the benchmarks seems to confirm this hypothesis. If you set the environment variable OPENBLAS_NUM_THREADS=1 before running the benchmarks, the time difference becomes significantly smaller (although it is still there).
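The same check can be sketched from inside a Julia session, assuming a Julia version that provides BLAS.get_num_threads (1.6 or later). The variable names below are made up for illustration; wrap each product in @benchmark to compare timings yourself.

```julia
using LinearAlgebra

x = rand(10_000, 100)
v = ones(100)

# Remember the current BLAS thread count so it can be restored afterwards.
nthreads_before = BLAS.get_num_threads()

# Restrict BLAS to one thread and compute the product.
BLAS.set_num_threads(1)
y_single = x * v

# Restore the previous setting and compute it again.
BLAS.set_num_threads(nthreads_before)
y_multi = x * v
```

The results should agree up to floating-point rounding in both cases; only the run time is expected to change with the thread count.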

tbeason · September 28, 2023, 2:05pm

Funny, I was just writing some code and considered whether I should use sum or the more math-looking operation and I chose sum. Looks like I chose slow!
