So when I try to sum a 10000 x 100 matrix x on its second dimension, using sum(x, dims=2), it takes about 200 microseconds. If I instead do x * ones(100), this takes only 40 microseconds. What gives?

Here’s the full code

using BenchmarkTools
x = rand(10000, 100);
@benchmark sum($x, dims=2) # mean ≈ 200 μs
@benchmark $x * ones(100) # mean ≈ 40 μs

Would you mind elaborating? As someone who knows very little of lower level programming I’m eager to learn.

Would it even be possible to implement this faster ordering using for loops in Julia? A simple for loop implementation does a bit better than sum but not much.

function f(x)
m, n = size(x)
res = zeros(m)
@simd for j in 1:n
@simd for i in 1:m
@fastmath @inbound res[i] += x[i, j]
end
end
end
return res
end

I’m guessing this is because matrix operations are handled by BLAS, which uses multiple threads for parallel computations by default. Whereas sum is implemented in Julia, and runs on a single thread by default.

Watching CPU usage during the benchmarks seems to confirm this hypothesis. If you set the envvar OPENBLAS_NUM_THREADS=1 before running the benchmarks, the time difference becomes significantly smaller (albeit it’s still there).

Funny, I was just writing some code and considered whether I should use sum or the more math-looking operation and I chose sum. Looks like I chose slow!