It is important to understand what the actual bottleneck is, and to address that root cause of slowness. Let’s take the example of filling a matrix with some computed values:
julia> @noinline function fillmat_good!(M)
           # inner loop over the first index: walks memory contiguously (Julia is column-major)
           for j = 1:size(M, 2)
               for i = 1:size(M, 1)
                   M[i, j] = i * j
               end
           end
       end;
julia> @noinline function fillmat_bad!(M)
           # inner loop over the second index: strides by size(M, 1) elements per store
           for i = 1:size(M, 1)
               for j = 1:size(M, 2)
                   M[i, j] = i * j
               end
           end
       end;
julia> @noinline alloc_mat(n)=Matrix{Int}(undef, n, n);
julia> @noinline function alloc_fill_good(n)
           M = alloc_mat(n)
           fillmat_good!(M)
           M
       end;
julia> @noinline function alloc_fill_bad(n)
           M = alloc_mat(n)
           fillmat_bad!(M)
           M
       end;
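As a quick sanity check (my addition, not part of the original timings) that both iteration orders compute the same matrix:

julia> A = alloc_mat(4); B = alloc_mat(4);

julia> fillmat_good!(A); fillmat_bad!(B);

julia> A == B
true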
To figure out how expensive each part is, we test these things separately. First, array allocation (which involves GC pressure):
julia> using BenchmarkTools
julia> n=5000;
julia> @btime alloc_mat(n);
124.841 μs (2 allocations: 190.73 MiB)
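A quick aside on those numbers: 190 MiB allocated in 125 μs would be an implied rate of about 1.6 TB/s, far beyond DRAM bandwidth, so the allocation cannot actually be touching the memory it hands back:

julia> n^2 * sizeof(Int) / 2^20                 # matrix size in MiB, matching the report
190.73486328125

julia> n^2 * sizeof(Int) / 124.841e-6 / 1e12;   # implied "allocation bandwidth": ≈ 1.6 TB/s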
That’s cheap. Next, filling the matrix:

julia> M = alloc_mat(n);

julia> @btime fillmat_good!(M);
22.680 ms (0 allocations: 0 bytes)
julia> @btime fillmat_bad!(M);
227.534 ms (0 allocations: 0 bytes)
We observe that the right iteration order is 10x faster. (Julia arrays are column-major, so fillmat_good! with its inner loop over the first index walks memory contiguously, while fillmat_bad! jumps a full column length between consecutive stores.) If memory bandwidth were the only concern, we would expect 8x, because the bad iteration order uses only one of the eight 8-byte integers in every 64-byte cache-line transfer. 8x, 10x: close enough for me, and compatible with a memory-bandwidth limit.
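A back-of-the-envelope check (my arithmetic, assuming 64-byte cache lines and counting only store traffic):

julia> n^2 * sizeof(Int) / 22.680e-3 / 1e9;   # good order: ≈ 8.8 GB/s of useful data written
julia> n^2 * 64 / 227.534e-3 / 1e9;           # bad order: ≈ 7.0 GB/s if every store transfers a full cache line

Both land in the same ballpark, which is what you would expect if both versions are saturating the same memory bandwidth.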
Now let’s observe them in combination:
julia> @btime alloc_fill_good(n);
108.312 ms (2 allocations: 190.73 MiB)
julia> @btime alloc_fill_bad(n);
323.044 ms (2 allocations: 190.73 MiB)
Ok. The naive viewpoint (alloc_fill costs alloc + fill) is obviously wrong. Something else is going on, and that something is the dominant cost when using the right iteration order.
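In numbers (timings from above, in milliseconds):

julia> 0.125 + 22.680;    # naive prediction for alloc_fill_good: ≈ 22.8 ms, not 108 ms
julia> 108.312 - 22.680;  # cost left unexplained by alloc + fill: ≈ 85.6 ms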
Note that the data cache (M still hot from the last iteration of the benchmark loop) cannot be the explanation, because at 190 MiB, M is far too large for that to matter.
The essential and somewhat unobvious part is that we can get a 5x speed-up by re-using the matrix, and this is entirely unrelated to garbage collection or allocator overhead (which we measured separately at 0.1 ms). I stated my theory about what the issue is (kernel overhead from too many page faults, caused by too small a page size chosen by glibc; a hot TLB in the benchmark loop is a second possibility), but the proper thing to do would be to perf, strace, and look at the memory map of the julia process.
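Short of profiling (perf stat -e page-faults on the Julia process would be the direct check), the two situations can at least be separated from inside Julia. A sketch using BenchmarkTools’ setup/evals machinery, with evals=1 so that every sample writes into a freshly allocated, untouched matrix:

julia> @btime fillmat_good!(M) setup=(M = alloc_mat(n)) evals=1;  # fresh pages: pays the first-touch page faults

julia> M = alloc_mat(n); fillmat_good!(M);                        # touch every page once

julia> @btime fillmat_good!($M);                                  # warm pages (and warm TLB): pure fill cost

If the page-fault theory holds, the first number should land near the alloc_fill_good timing and the last near the bare fillmat_good! timing above.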
The entire above post was triggered by me getting confused by seeing only a 3x speedup from the iteration order in alloc_fill, which looked entirely wrong to me (because the mental model of this being bandwidth-constrained predicts an 8x speedup).