# Why is sum slower than multiplying by vector of ones?

So when I try to sum a 10000 x 100 matrix `x` on its second dimension, using `sum(x, dims=2)`, it takes about 200 microseconds. If I instead do `x * ones(100)`, this takes only 40 microseconds. What gives?

Here’s the full code

``````using BenchmarkTools
x = rand(10000, 100);
@benchmark sum(\$x, dims=2) # mean ≈ 200 μs
@benchmark \$x * ones(100) # mean ≈ 40 μs
``````
3 Likes

it’s adding in an order that takes better advantage of the cache. we should probably make sum use this sort of ordering in base.

Would you mind elaborating? As someone who knows very little of lower level programming I’m eager to learn.

Would it even be possible to implement this faster ordering using for loops in Julia? A simple for loop implementation does a bit better than `sum` but not much.

``````function f(x)
m, n = size(x)
res = zeros(m)
@simd for j in 1:n
@simd for i in 1:m
@fastmath @inbound res[i] += x[i, j]
end
end
end
return res
end
``````

I’m guessing this is because matrix operations are handled by BLAS, which uses multiple threads for parallel computations by default. Whereas `sum` is implemented in Julia, and runs on a single thread by default.

Watching CPU usage during the benchmarks seems to confirm this hypothesis. If you set the envvar `OPENBLAS_NUM_THREADS=1` before running the benchmarks, the time difference becomes significantly smaller (albeit it’s still there).

2 Likes

Funny, I was just writing some code and considered whether I should use `sum` or the more math-looking operation and I chose `sum`. Looks like I chose slow!