# Summing matrix elements is >1000X slower than summing vector elements

I find that summing elements using a for loop is more than 1000 times slower for a matrix than for a vector. Consider the following two functions that sum the elements of a 100×100 matrix and the elements of a vector of the same size:

```
julia> VERSION
v"0.6.0-pre.beta.112"

julia> function sumelem_mat(X)
           s = 0.0
           for j = 1:100, i = 1:100
               @inbounds s += X[i,j]
           end

           return s
       end
sumelem_mat (generic function with 1 method)

julia> function sumelem_vec(x)
           s = 0.0
           for n = 1:10000
               @inbounds s += x[n]
           end

           return s
       end
sumelem_vec (generic function with 1 method)
```

If these two functions are applied to a 100×100 matrix and a vector linearized from the matrix, they return the same value:

```
julia> X = rand(100,100); x = X[:];

julia> sumelem_mat(X)
5005.445543677584

julia> sumelem_vec(x)
5005.445543677584
```

However, the matrix version is more than 1000 times slower than the vector version:

```
julia> using BenchmarkTools

julia> @btime sumelem_mat($X)
  3.777 μs (0 allocations: 0 bytes)

julia> @btime sumelem_vec($x)
  3.046 ns (0 allocations: 0 bytes)
```

Note the unit difference in the above result: one is in microseconds and the other is in nanoseconds.

Here are my questions:

1. What is the origin of such difference in performance between the two functions?

2. Is there a way to make the matrix version as fast as the vector version?

The vector timing is way too fast to be realistic. 3 nanoseconds is only enough time for ~10 clock cycles, so there’s no way it’s actually summing 10,000 elements in that time. Are you sure you’re testing with the right `x`? I can’t reproduce your result.

I see:

```
julia> @btime sumelem_mat($X)
  9.128 μs (0 allocations: 0 bytes)

julia> @btime sumelem_vec($x)
  8.584 μs (0 allocations: 0 bytes)
```

Note that the vector version is slightly faster. That’s because the matrix code is slightly sub-optimal: since matrices in Julia are column-major, it is faster to iterate along the first axis in the innermost loop. Edit: nope, I was the one who had it backwards; see the correction below.

Since remembering that can be annoying, Julia provides the `eachindex()` function, which will just “do the right thing” to give fast indexing on arrays of any dimension, including vectors. So you can replace both of your functions with this one:

```julia
function sum_generic(x::AbstractArray)
    s = zero(eltype(x))
    for I in eachindex(x)
        s += x[I]
    end
    return s
end
```

which is fast for vectors and matrices:

```
julia> @btime sum_generic($x)
  8.581 μs (0 allocations: 0 bytes)
5069.695139306912

julia> @btime sum_generic($X)
  8.588 μs (0 allocations: 0 bytes)
5069.695139306912
```

Actually, you got the `i, j` order right in your code, and mine was the one that was wrong. The difference in benchmark times seems to have been a flake. Sorry for the confusion! (And all the more reason to use `eachindex` and avoid the problem altogether.)
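To make the loop-order point concrete, here is a minimal sketch (the function names are mine, not from the thread) contrasting the two orders on a column-major matrix. At 100×100 both arrays fit in cache, so the difference is small at this size, but for large matrices the stride-1 version wins:

```julia
# Julia matrices are column-major: elements within a column are contiguous.
function sum_col_order(X)    # i innermost: stride-1, cache-friendly access
    s = 0.0
    for j = 1:size(X,2), i = 1:size(X,1)
        @inbounds s += X[i,j]
    end
    return s
end

function sum_row_order(X)    # j innermost: stride-size(X,1) access
    s = 0.0
    for i = 1:size(X,1), j = 1:size(X,2)
        @inbounds s += X[i,j]
    end
    return s
end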

Thanks for the reply! I am on Julia v0.6, and I suspect a compatibility issue between BenchmarkTools and v0.6 might have produced the wrong benchmark results. When I ran the same benchmark on Julia v0.5, the two functions indeed ran at similar speeds.

Perhaps not. In theory, multi-dimensional access should have more overhead: nested loops, bounds checking on multiple indices (not relevant here because of `@inbounds`), and conversion of the subscripts to a linear index. Presumably this overhead, albeit small, increases with dimension.

Of course as you pointed out, using linear indexing where possible is the way to go.
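To spell out the "conversion to a linear index" cost mentioned above, here is a sketch (the function name is hypothetical, for illustration only): for a column-major matrix, `X[i,j]` lives at linear position `i + (j-1)*size(X,1)`, so every 2-D access implies a multiply-add that a straight linear loop avoids.

```julia
# Sums a matrix via explicit linear indexing; equivalent to sumelem_mat,
# but with the subscript-to-linear conversion written out by hand.
function sumelem_manual(X)
    m = size(X, 1)
    s = 0.0
    for j = 1:size(X, 2), i = 1:m
        @inbounds s += X[i + (j-1)*m]   # same element as X[i,j]
    end
    return s
end
```

In simple loops like this one, LLVM can typically hoist the `(j-1)*m` part out of the inner loop, which is why the benchmarks above come out nearly equal.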

I can’t reproduce this. I copied and pasted the code from your OP, and I get:

```
julia> VERSION
v"0.6.0-pre.beta.11"

julia> @btime sumelem_mat($X)
  9.724 μs (0 allocations: 0 bytes)
5003.372603528837

julia> @btime sumelem_vec($x)
  9.719 μs (0 allocations: 0 bytes)
5003.372603528837
```

Could you please try again on v0.6, just in case? If there is actually a BenchmarkTools problem on v0.6, it’d be great to know about it, since Julia’s performance testing CI uses BenchmarkTools.

You’re right in general, but LLVM should certainly be able to remove the overhead in a simple case like this. Indeed, it seems that it does.

Yes, this seems like a nice optimization.
However, I can’t see where it happens on v0.5.1 or an early 0.6.0-alpha.
Should this optimization show up in `@code_llvm` or `@code_native`?

Also, I consistently get slightly slower benchmark times for `sumelem_mat` compared to `sumelem_vec`.

Here’s the output of `@code_llvm` and `@code_native` on 0.5.1 (similar for an early 0.6.0-alpha):

```
Julia-0.5.1> @code_llvm sumelem_mat(X)

; Function Attrs: uwtable
define double @julia_sumelem_mat_72714(%jl_value_t*) #0 {
top:
%1 = bitcast %jl_value_t* %0 to double**
%2 = load double*, double** %1, align 8
%3 = getelementptr inbounds %jl_value_t, %jl_value_t* %0, i64 3, i32 0
%4 = bitcast %jl_value_t** %3 to i64*
%5 = load i64, i64* %4, align 8
br label %if

L.loopexit:                                       ; preds = %if6
%6 = add nuw nsw i64 %"#temp#.010", 1
%7 = icmp eq i64 %6, 101
br i1 %7, label %L5, label %if

L5:                                               ; preds = %L.loopexit
ret double %15

if:                                               ; preds = %top, %L.loopexit
%s.011 = phi double [ 0.000000e+00, %top ], [ %15, %L.loopexit ]
%"#temp#.010" = phi i64 [ 1, %top ], [ %6, %L.loopexit ]
%8 = add nsw i64 %"#temp#.010", -1
%9 = mul i64 %8, %5
%10 = add i64 %9, -1
br label %if6

if6:                                              ; preds = %if, %if6
%s.19 = phi double [ %s.011, %if ], [ %15, %if6 ]
%"#temp#1.08" = phi i64 [ 1, %if ], [ %11, %if6 ]
%11 = add nuw nsw i64 %"#temp#1.08", 1
%12 = add i64 %10, %"#temp#1.08"
%13 = getelementptr double, double* %2, i64 %12
%14 = load double, double* %13, align 8
%15 = fadd double %s.19, %14
%16 = icmp eq i64 %11, 101
br i1 %16, label %L.loopexit, label %if6
}
```
```
Julia-0.5.1> @code_llvm sumelem_vec(x)

; Function Attrs: uwtable
define double @julia_sumelem_vec_72836(%jl_value_t*) #0 {
top:
%1 = bitcast %jl_value_t* %0 to double**
%2 = load double*, double** %1, align 8
br label %if

L2:                                               ; preds = %if
ret double %7

if:                                               ; preds = %top, %if
%s.04 = phi double [ 0.000000e+00, %top ], [ %7, %if ]
%"#temp#.03" = phi i64 [ 1, %top ], [ %3, %if ]
%3 = add nuw nsw i64 %"#temp#.03", 1
%4 = add nsw i64 %"#temp#.03", -1
%5 = getelementptr double, double* %2, i64 %4
%6 = load double, double* %5, align 8
%7 = fadd double %s.04, %6
%8 = icmp eq i64 %3, 10001
br i1 %8, label %L2, label %if
}
```
```
Julia-0.5.1> @code_native sumelem_mat(X)
.text
Filename: REPL[2]
pushq   %rbp
movq    %rsp, %rbp
movq    (%rcx), %r8
movq    24(%rcx), %r9
Source line: 3
shlq    $3, %r9
xorpd   %xmm0, %xmm0
movl    $1, %edx
nopl    (%rax,%rax)
L32:
movq    %r8, %rax
movl    $100, %ecx
nopl    (%rax,%rax)
Source line: 4
L48:
Source line: 3
decq    %rcx
jne     L48
incq    %rdx
cmpq    $101, %rdx
jne     L32
Source line: 7
popq    %rbp
retq
nopl    (%rax,%rax)
```
```
Julia-0.5.1> @code_native sumelem_vec(x)
.text
Filename: REPL[3]
pushq   %rbp
movq    %rsp, %rbp
Source line: 4
movq    uv_tcp_non_ifs_lsp_ipv4(%rcx), %rax
xorpd   %xmm0, %xmm0
xorl    %ecx, %ecx
nopl    (%rax)
L16:
```

I can reproduce the `code_native` and `code_llvm` output, but not the slower benchmark times; for me they benchmark exactly equal. Probably the additional overhead in the native code is not executed on every loop iteration.

However, it seems that adding `@simd` significantly benefits the one-dimensional code, while not being very useful in the two-dimensional code. This makes about a 16× difference on my machine.
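For reference, a sketch of what that `@simd` annotation looks like on the one-dimensional loop (the function name is mine, not from the thread):

```julia
function sumelem_vec_simd(x)
    s = 0.0
    @simd for n = 1:length(x)
        # @simd licenses reassociation of the floating-point sum,
        # which is what lets LLVM vectorize the reduction
        @inbounds s += x[n]
    end
    return s
end
```

Note that because floating-point addition is not associative, the vectorized sum can differ from the sequential one in the last few bits.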