Performance deviations when summing an integer matrix

goerch · October 23, 2021, 10:53am

The following program

using Tables
using CSV
using BenchmarkTools

function process1(a::Matrix{Int})
    count = 0
    for i in 1:size(a, 1)
        for j in 1:size(a, 2)
            count += a[i, j]
        end
    end
    count
end

function process2(a::Matrix{Int})
    count = 0
    for j in 1:size(a, 2)
        for i in 1:size(a, 1)
            count += a[i, j]
        end
    end
    count
end

function process3(a::Matrix{Int})
    count = 0
    for i in 1:size(a, 1)
        count += sum(a[i, :])
    end
    count
end

function process4(a::Matrix{Int})
    count = 0
    for j in 1:size(a, 2)
        count += sum(a[:, j])
    end
    count
end

function process5(a::Matrix{Int})
    count = sum(a[:, :])
end

function test(f)
    a = CSV.File("src/pandas_zeros.csv", type=Int) |> Tables.matrix
    f(convert(Matrix{Int}, a))
end

x = test(process1)
@assert test(process2) == x
@assert test(process3) == x
@assert test(process4) == x
@assert test(process5) == x

@btime test($process1)
@btime test($process2)
@btime test($process3)
@btime test($process4)
@btime test($process5)

reports on Julia 1.6.3

  32.214 ms (7698 allocations: 77.07 MiB)
  29.601 ms (7698 allocations: 77.07 MiB)
  40.952 ms (107698 allocations: 124.37 MiB)
  34.759 ms (7798 allocations: 115.22 MiB)
  39.082 ms (7700 allocations: 115.22 MiB)

Is this expected behaviour or suspicious (or am I doing it wrong)?

NB: I’m using pandas_zeros.csv from the CSV.jl test set.

stevengj · October 23, 2021, 12:55pm

It will be a lot easier to see what is going on if you benchmark the summation computation separately from importing the CSV file and converting it to a Tables.matrix.

julia> a = rand(Int,100,200);

julia> @btime process1($a);
  7.625 μs (0 allocations: 0 bytes)

julia> @btime process2($a);
  1.933 μs (0 allocations: 0 bytes)

process1 is slower because it has worse cache locality (worse spatial locality = less-consecutive access = poor cache-line utilization), because Matrix is column major (columns are contiguous).

julia> @btime process3($a);
  16.393 μs (100 allocations: 176.56 KiB)

julia> @btime process4($a);
  21.190 μs (200 allocations: 175.00 KiB)

julia> @btime process5($a);
  24.343 μs (2 allocations: 156.33 KiB)

These are much slower because a slice like sum(a[i, :]) creates a copy in Julia, so you are allocating little 1d arrays over and over. You can speed things up by using a view, e.g. by putting @views in front of your function declaration. If I do that for all three of these functions, then the allocations go away:

julia> @btime process3($a);
  8.856 μs (0 allocations: 0 bytes)

julia> @btime process4($a);
  3.696 μs (0 allocations: 0 bytes)

julia> @btime process5($a);
  2.217 μs (0 allocations: 0 bytes)

and process4 is faster than process3 because of the above-mentioned spatial locality, while process5 is fastest because it calls an optimized sum routine on the whole array at once.

(Note also that the ::Matrix{Int} argument-type declaration is irrelevant for performance. You would get exactly the same performance if the argument type were not declared at all.)

goerch · October 23, 2021, 1:08pm

Yikes.

This I expected.

That I should have imagined, sorry. And thanks very much for the help.

function process3(a::Matrix{Int})
    count = 0
    for i in 1:size(a, 1)
        count += sum(@view a[i, :])
    end
    count
end

function process4(a::Matrix{Int})
    count = 0
    for j in 1:size(a, 2)
        count += sum(@view a[:, j])
    end
    count
end

function process5(a::Matrix{Int})
    count = sum(@view a[:, :])
end

Seif_Shebl · October 23, 2021, 5:29pm

Here you only want sum(a), views are only required before arrays slices, using the whole array does not allocate here.

Topic		Replies	Views
Puzzled by the difference in performance when summing an array in different ways Performance array , memory-allocation	4	876	July 19, 2019
Reduce memory allocated in array view and in place sum Performance question	12	694	November 10, 2023
Summing matrix elements is >1000X slower than summing vector elements General Usage performance	8	1330	April 17, 2017
Faster sum loop when looping through vector multiple times than once Performance	5	291	September 29, 2023
Why is sum slower than multiplying by vector of ones? Performance	4	478	September 28, 2023

Performance deviations when summing an integer matrix

Related topics