Summing matrix elements is >1000X slower than summing vector elements

performance

#1

I find that summing elements using a for loop is more than 1000 times slower for a matrix than for a vector. Consider the following two functions that sum the elements of a 100×100 matrix and the elements of a vector with same size:

julia> VERSION
v"0.6.0-pre.beta.112"

julia> function sumelem_mat(X)
           s = 0.0
           for j = 1:100, i = 1:100
               @inbounds s += X[i,j]
           end

           return s
       end
sumelem_mat (generic function with 1 method)

julia> function sumelem_vec(x)
           s = 0.0
           for n = 1:10000
               @inbounds s += x[n]
           end

           return s
       end
sumelem_vec (generic function with 1 method)

If these two functions are applied to a 100×100 matrix and a vector linearized from the matrix, they return the same value:

julia> X = rand(100,100); x = X[:];

julia> sumelem_mat(X)
5005.445543677584

julia> sumelem_vec(x)
5005.445543677584

However, the matrix version is more than 1000 times slower than the vector version:

julia> using BenchmarkTools

julia> @btime sumelem_mat($X)
  3.777 μs (0 allocations: 0 bytes)

julia> @btime sumelem_vec($x)
  3.046 ns (0 allocations: 0 bytes)

Note the unit difference in the above result: one is in microseconds and the other is in nanoseconds.

Here are my questions:

  1. What is the origin of such difference in performance between the two functions?

  2. Is there a way to make the matrix version as fast as the vector version?


#2

The vector timing is way too fast to be realistic. 3 nanoseconds is only enough time for ~10 clock cycles, so there’s no way it’s actually summing 10,000 elements in that time. Are you sure you’re testing with the right x? I can’t reproduce your result.

I see:

julia> @btime sumelem_mat($X)
  9.128 μs (0 allocations: 0 bytes)

julia> @btime sumelem_vec($x)
  8.584 μs (0 allocations: 0 bytes)

Note that the vector version is slightly faster. That’s because the matrix code is slightly sub-optimal. Since matrices in Julia are column-major, it will be faster to iterate along the first axis in the innermost loop: Edit: nope, I was the one who had it backwards.

Since remembering that can be annoying, Julia provides the eachindex() function, which will just “do the right thing” to give fast indexing on arrays of any dimension, including vectors. So you can replace both of your functions with this one:

function sum_generic(x::AbstractArray)
    s = zero(eltype(x))
    for I in eachindex(x)
        s += x[I]
    end
    s
end

which is fast for vectors and matrices:

julia> @btime sum_generic($x)
  8.581 μs (0 allocations: 0 bytes)
5069.695139306912

julia> @btime sum_generic($X)
  8.588 μs (0 allocations: 0 bytes)
5069.695139306912

#3

Actually, you got the i, j order right in your code, and mine was actually wrong. The difference in benchmark times seems to have been a flake. Sorry for the confusion! (and all the more reason to use eachindex and avoid the problem all together).


#4

Thanks for the reply! I am using Julia v0.6, and I suspect some compatibility issue between BenchmarkTools and Julia v0.6 might have produced the wrong benchmark results. When I ran the same benchmark in Julia v0.5, indeed the two functions ran at similar speeds.


#5

Perhaps not. In theory multi-dimensional access should have more overhead: multiple loops, multiple index bounds checking (not relevant here because of @inbounds) and conversion to linear index. Presumably this albeit small overhead increases with dimension.

Of course as you pointed out, using linear indexing where possible is the way to go.


#6

I can’t reproduce this - copying and pasted the code from your OP, and I get:

julia> VERSION
v"0.6.0-pre.beta.11"

julia> @btime sumelem_mat($X)
  9.724 μs (0 allocations: 0 bytes)
5003.372603528837

julia> @btime sumelem_vec($x)
  9.719 μs (0 allocations: 0 bytes)
5003.372603528837

Could you please try again on v0.6, just in case? If there is actually a BenchmarkTools problem on v0.6, it’d be great to know about it, since Julia’s performance testing CI uses BenchmarkTools.


#7

You’re right in general, but LLVM should certainly be able to remove the overhead in a simple case like this. Indeed, it seems that it does.


#8

Yes, this seems like a nice optimization.
However, I can’t see where it happens on v0.5.1 or early 0.6.0-alpha.
Should this optimization show up in @code_llvm or @code_native?

Also, I consistently get slightly slower benchmark times for sumelem_mat compared to sumelem_vec.

Here’s the output of @code_llvm and @code_native on 0.5.1 (similar for early 0.6.0-alpha):

Julia-0.5.1> @code_llvm sumelem_mat(X)

; Function Attrs: uwtable
define double @julia_sumelem_mat_72714(%jl_value_t*) #0 {
top:
  %1 = bitcast %jl_value_t* %0 to double**
  %2 = load double*, double** %1, align 8
  %3 = getelementptr inbounds %jl_value_t, %jl_value_t* %0, i64 3, i32 0
  %4 = bitcast %jl_value_t** %3 to i64*
  %5 = load i64, i64* %4, align 8
  br label %if

L.loopexit:                                       ; preds = %if6
  %6 = add nuw nsw i64 %"#temp#.010", 1
  %7 = icmp eq i64 %6, 101
  br i1 %7, label %L5, label %if

L5:                                               ; preds = %L.loopexit
  ret double %15

if:                                               ; preds = %top, %L.loopexit
  %s.011 = phi double [ 0.000000e+00, %top ], [ %15, %L.loopexit ]
  %"#temp#.010" = phi i64 [ 1, %top ], [ %6, %L.loopexit ]
  %8 = add nsw i64 %"#temp#.010", -1
  %9 = mul i64 %8, %5
  %10 = add i64 %9, -1
  br label %if6

if6:                                              ; preds = %if, %if6
  %s.19 = phi double [ %s.011, %if ], [ %15, %if6 ]
  %"#temp#1.08" = phi i64 [ 1, %if ], [ %11, %if6 ]
  %11 = add nuw nsw i64 %"#temp#1.08", 1
  %12 = add i64 %10, %"#temp#1.08"
  %13 = getelementptr double, double* %2, i64 %12
  %14 = load double, double* %13, align 8
  %15 = fadd double %s.19, %14
  %16 = icmp eq i64 %11, 101
  br i1 %16, label %L.loopexit, label %if6
}
Julia-0.5.1> @code_llvm sumelem_vec(x)

; Function Attrs: uwtable
define double @julia_sumelem_vec_72836(%jl_value_t*) #0 {
top:
  %1 = bitcast %jl_value_t* %0 to double**
  %2 = load double*, double** %1, align 8
  br label %if

L2:                                               ; preds = %if
  ret double %7

if:                                               ; preds = %top, %if
  %s.04 = phi double [ 0.000000e+00, %top ], [ %7, %if ]
  %"#temp#.03" = phi i64 [ 1, %top ], [ %3, %if ]
  %3 = add nuw nsw i64 %"#temp#.03", 1
  %4 = add nsw i64 %"#temp#.03", -1
  %5 = getelementptr double, double* %2, i64 %4
  %6 = load double, double* %5, align 8
  %7 = fadd double %s.04, %6
  %8 = icmp eq i64 %3, 10001
  br i1 %8, label %L2, label %if
}
Julia-0.5.1> @code_native sumelem_mat(X)
        .text
Filename: REPL[2]
        pushq   %rbp
        movq    %rsp, %rbp
        movq    (%rcx), %r8
        movq    24(%rcx), %r9
Source line: 3
        shlq    $3, %r9
        xorpd   %xmm0, %xmm0
        movl    $1, %edx
        nopl    (%rax,%rax)
L32:
        movq    %r8, %rax
        movl    $100, %ecx
        nopl    (%rax,%rax)
Source line: 4
L48:
        addsd   (%rax), %xmm0
Source line: 3
        addq    $8, %rax
        decq    %rcx
        jne     L48
        incq    %rdx
        addq    %r9, %r8
        cmpq    $101, %rdx
        jne     L32
Source line: 7
        popq    %rbp
        retq
        nopl    (%rax,%rax)
Julia-0.5.1> @code_native sumelem_vec(x)
        .text
Filename: REPL[3]
        pushq   %rbp
        movq    %rsp, %rbp
Source line: 4
        movq    uv_tcp_non_ifs_lsp_ipv4(%rcx), %rax
        xorpd   %xmm0, %xmm0
        xorl    %ecx, %ecx
        nopl    (%rax)
L16:
        addsd   (%rax,%rcx,8), %xmm0
Source line: 3
        incq    %rcx
        cmpq    $10000, %rcx            # imm = 0x2710
        jne     L16
Source line: 7
        popq    %rbp
        retq
        nopw    %cs:(%rax,%rax)

#9

I can reproduce the code_native and code_llvm, but not the slower benchmark times. For me they benchmark exactly equal. Probably the additional overhead in the native code is not executed on every loop iteration.

However, it seems that the addition of @simd significantly benefits the one-dimensional code, while not being very useful in the two-dimensional code. This makes about a 16x difference on my machine.