# Performance deviations when summing an integer matrix

The following program

``````julia
using Tables
using CSV
using BenchmarkTools

function process1(a::Matrix{Int})
    count = 0
    for i in 1:size(a, 1)
        for j in 1:size(a, 2)
            count += a[i, j]
        end
    end
    count
end

function process2(a::Matrix{Int})
    count = 0
    for j in 1:size(a, 2)
        for i in 1:size(a, 1)
            count += a[i, j]
        end
    end
    count
end

function process3(a::Matrix{Int})
    count = 0
    for i in 1:size(a, 1)
        count += sum(a[i, :])
    end
    count
end

function process4(a::Matrix{Int})
    count = 0
    for j in 1:size(a, 2)
        count += sum(a[:, j])
    end
    count
end

function process5(a::Matrix{Int})
    count = sum(a[:, :])
end

function test(f)
    a = CSV.File("src/pandas_zeros.csv", type=Int) |> Tables.matrix
    f(convert(Matrix{Int}, a))
end

x = test(process1)
@assert test(process2) == x
@assert test(process3) == x
@assert test(process4) == x
@assert test(process5) == x

@btime test($process1)
@btime test($process2)
@btime test($process3)
@btime test($process4)
@btime test($process5)
``````

reports on Julia 1.6.3

``````
32.214 ms (7698 allocations: 77.07 MiB)
29.601 ms (7698 allocations: 77.07 MiB)
40.952 ms (107698 allocations: 124.37 MiB)
34.759 ms (7798 allocations: 115.22 MiB)
39.082 ms (7700 allocations: 115.22 MiB)
``````

Is this expected behaviour or suspicious (or am I doing it wrong)?

NB: I’m using pandas_zeros.csv from the CSV.jl test set.

It will be a lot easier to see what is going on if you benchmark the summation computation separately from importing the CSV file and converting it to a `Tables.matrix`.

``````julia
julia> a = rand(Int, 100, 200);

julia> @btime process1($a);
  7.625 μs (0 allocations: 0 bytes)

julia> @btime process2($a);
  1.933 μs (0 allocations: 0 bytes)
``````

`process1` is slower because it has worse cache locality: Julia's `Matrix` is stored column-major (each column is contiguous in memory), so putting the column index `j` in the inner loop accesses non-consecutive memory locations and makes poor use of cache lines, whereas `process2` walks each column in memory order.
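The column-major layout is easy to check directly (a small sketch, not from the thread):

``````julia
# reshape fills columns first, so the memory order is 1,2,3,4,5,6:
a = reshape(1:6, 2, 3)           # 2×3 matrix: columns are [1,2], [3,4], [5,6]
@assert a[2, 1] == 2             # the next element in memory after a[1,1]
@assert a[1, 2] == 3             # moving to the next column skips ahead
@assert vec(collect(a)) == 1:6   # flattening follows column order
``````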

``````julia
julia> @btime process3($a);
  16.393 μs (100 allocations: 176.56 KiB)

julia> @btime process4($a);
  21.190 μs (200 allocations: 175.00 KiB)

julia> @btime process5($a);
  24.343 μs (2 allocations: 156.33 KiB)
``````

These are much slower because a slice like `sum(a[i, :])` creates a copy in Julia, so you are allocating little 1d arrays over and over. You can speed things up by using a view, e.g. by putting `@views` in front of your `function` declaration. If I do that for all three of these functions, then the allocations go away:

``````julia
julia> @btime process3($a);
  8.856 μs (0 allocations: 0 bytes)

julia> @btime process4($a);
  3.696 μs (0 allocations: 0 bytes)

julia> @btime process5($a);
  2.217 μs (0 allocations: 0 bytes)
``````

and `process4` is faster than `process3` because of the above-mentioned spatial locality, while `process5` is fastest because it calls an optimized `sum` routine on the whole array at once.
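A hand-written loop can get close to the built-in `sum` by iterating in memory order and enabling SIMD (a sketch; `process2_simd` is a name invented here, not from the thread):

``````julia
function process2_simd(a::Matrix{Int})
    count = 0
    # eachindex walks the array in memory (column-major) order;
    # @inbounds skips bounds checks and @simd permits vectorization.
    @inbounds @simd for k in eachindex(a)
        count += a[k]
    end
    count
end
``````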

(Note also that the `::Matrix{Int}` argument-type declaration is irrelevant for performance. You would get exactly the same performance if the argument type were not declared at all.)
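For example (a sketch; the `_untyped` name is made up here), dropping the declaration changes nothing, because Julia compiles a specialized method for the concrete argument type at the first call:

``````julia
function process2_untyped(a)
    count = 0
    for j in 1:size(a, 2), i in 1:size(a, 1)
        count += a[i, j]
    end
    count
end

a = rand(Int, 100, 200)
@assert process2_untyped(a) == sum(a)   # same result; same specialized code
``````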


Yikes.

The cache-locality difference I expected.

The slice copies I should have imagined, sorry. And thanks very much for the help.

``````julia
function process3(a::Matrix{Int})
    count = 0
    for i in 1:size(a, 1)
        count += sum(@view a[i, :])
    end
    count
end

function process4(a::Matrix{Int})
    count = 0
    for j in 1:size(a, 2)
        count += sum(@view a[:, j])
    end
    count
end

function process5(a::Matrix{Int})
    count = sum(@view a[:, :])
end
``````

Here you only want `sum(a)`: views are only needed for array slices, and summing the whole array does not allocate.
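In other words (a quick sketch), the whole-array case collapses to:

``````julia
a = rand(Int, 100, 200)
@assert sum(@view a[:, :]) == sum(a)   # the view adds nothing here; sum(a) suffices
``````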
