Puzzled by the difference in performance when summing an array in different ways

Hi everybody,

While trying to test my understanding of “views vs slice of arrays” and “row-major vs column major”, I time (using @time) the following functions :

A = rand(10000,10000);
function sum_1(x)
    counter = 0.0
    for i in axes(x,2)
        counter += sum(x[:,i])
    end
    return counter
end

function sum_2(x)
    counter = 0.0
    for i in axes(x,2)
        counter += sum(view(x,:,i))
    end
    return counter
end

function sum_3(x)
    counter = 0.0
    for i in axes(A,2)
        counter += sum(view(x,i,:))
    end
    return counter
end

function sum_4(x)
    counter = 0.0
    for I in CartesianIndices(x)
        counter += x[I]
    end
end

function sum_5(x)
    counter = 0.0
    for i in x
        counter += i
    end
    return counter
end

function sum_6(x)
    return sum(x)
end

I was surprised to consistently observe the following time and memory allocation:
0.485504 seconds (88.47 k allocations: 764.900 MiB, 18.79% gc time)
0.044661 seconds (78.47 k allocations: 1.655 MiB)
0.015908 seconds (68.98 k allocations: 1.510 MiB)
0.110523 seconds (4 allocations: 160 bytes)
0.117049 seconds (5 allocations: 176 bytes)
0.043822 seconds (5 allocations: 176 bytes)

This is not what I would have predicted. I would have predicted the column-major view to be faster than the row-major view, and in both cases I wouldn’t have predicted such high memory allocation.

Can anyone shed some light on the above?
Thank you!

There are a couple of issues with your benchmarks that might be affecting the results. First of all, try using BenchmarkTools.jl to get more accurate timing results (and automatically avoid accidentally measuring compilation time instead of runtime). For example:

julia> using BenchmarkTools

julia> @btime sum_1($A)  # check out the BenchmarkTools.jl manual for why the $ is needed here
  130.168 ms (20000 allocations: 763.70 MiB)
5.0001936938218646e7

Also, your sum_3 function looks like it accidentally references the global variable A in axes(A, 2) rather than the local variable x.

Do those changes affect your results?

Thank you for your swift response. It indeed made a difference, I am now getting:
538.940 ms (20000 allocations: 763.70 MiB)
50.070 ms (10000 allocations: 468.75 KiB)
1.092 s (10000 allocations: 468.75 KiB)
111.176 ms (0 allocations: 0 bytes)
109.877 ms (0 allocations: 0 bytes)
48.378 ms (0 allocations: 0 bytes)

Which makes a lot more sense!
Do you understand why sum_2 is competitive with sum_6 in terms of time despite the many memory allocations?

I’m not sure, but I would guess that it’s because each column of your matrix is quite large. There is some cost to creating a view (unless the compiler can decide not to actually allocate it), but you are only paying that cost once per 10,000 elements. I bet that if you tried the same benchmark with a 10x10000 matrix you would see sum_6 perform better relative to sum_2.

1 Like

You are right, with a 10x10000 matrix, sum_6 out performs sum_2. Thanks again!