So when I try to sum a 10000 x 100 matrix
x on its second dimension, using
sum(x, dims=2), it takes about 200 microseconds. If I instead do
x * ones(100), this takes only 40 microseconds. What gives?
Here’s the full code
x = rand(10000, 100);
@benchmark sum($x, dims=2) # mean ≈ 200 μs
@benchmark $x * ones(100) # mean ≈ 40 μs
it’s adding in an order that takes better advantage of the cache. we should probably make sum use this sort of ordering in base.
Would you mind elaborating? As someone who knows very little of lower level programming I’m eager to learn.
Would it even be possible to implement this faster ordering using for loops in Julia? A simple for loop implementation does a bit better than
sum but not much.
m, n = size(x)
res = zeros(m)
@simd for j in 1:n
@simd for i in 1:m
@fastmath @inbound res[i] += x[i, j]
I’m guessing this is because matrix operations are handled by BLAS, which uses multiple threads for parallel computations by default. Whereas
sum is implemented in Julia, and runs on a single thread by default.
Watching CPU usage during the benchmarks seems to confirm this hypothesis. If you set the envvar
OPENBLAS_NUM_THREADS=1 before running the benchmarks, the time difference becomes significantly smaller (albeit it’s still there).
Funny, I was just writing some code and considered whether I should use
sum or the more math-looking operation and I chose
sum. Looks like I chose slow!