Hello,
Introduction
I’ve been trying to process some data and I’ve run into crippling allocations when using the built-in `sum()`. I have a mid-level understanding of how to improve performance in Julia, but for the issue presented below I am not sure how to search the web or analyze the code for missed potential. The usual `@code_warntype`, `@profview` and `@profview_allocs` do not reveal anything to me; maybe they would to somebody more experienced.
Problem Statement
The very short version is that I need to compute a series of arrays A[i, t]
which I then use to produce a number of results. Among those results, one of them has the form B[t] = sum(A[i, t]*v[i] for i in 1:n)
.
Since `A[i, t] = f(A[i, t-1])`, I initially wrote a function that computes all of `A`, using `@threads` to speed up the process. This works well until `A` fills all available RAM. So I decided to write a version that computes `A[i, t]` in chunks, which I use to compute `B[t]` and then overwrite with the next chunk.
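Schematically, the intent is something like the following single-function sketch (sequential, no `@threads`; `demo_chunked_B` and the buffer names are just for illustration, using the same update `A[i, τ] = A[i, τ-1] + sin(i)` as the MWE below):

function demo_chunked_B(n, t, chunk_size, v)
    B = zeros(t)
    buf = zeros(n, chunk_size)   # one reusable chunk of columns of A
    prev = zeros(n)              # last column of the previous chunk
    τ = 2
    while τ <= t
        w = min(chunk_size, t - τ + 1)   # width of the current chunk
        for i in 1:n
            buf[i, 1] = prev[i] + sin(i)
            for k in 2:w
                buf[i, k] = buf[i, k - 1] + sin(i)
            end
        end
        for k in 1:w                     # consume the chunk into B
            B[τ + k - 1] = sum(buf[i, k] * v[i] for i in 1:n)
        end
        prev .= @view buf[:, w]
        τ += w
    end
    return B
end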
Here’s the problem: when I write the function along the lines of the minimal example below, wrapping the `sum()` in a `while` loop and a `for` loop (the latter with or without `@threads`), the memory allocations grow by roughly 11x.
MWE
1. Run With sum()
1.1 Full Run
using BenchmarkTools
import Base.Threads: @threads

const n = 128^3
const t = 500
const v = rand(n)

# Fill the full n × t matrix A, one row per thread.
function A_complete(n = n, t = t)
    A = zeros(n, t)
    @threads for i in 1:n
        for τ in 2:t
            A[i, τ] = A[i, τ - 1] + sin(i)
        end
    end
    return A
end

# B[τ] is the dot product of column τ of A with v.
function B_complete(n = n, t = t)
    B = zeros(t)
    A = A_complete(n, t)
    @threads for τ in 2:t
        B[τ] = sum(A[i, τ] * v[i] for i in 1:n)
    end
    return B
end
1.2 Benchmark: Full Run
Running this version with the `@btime` macro shows allocations in line with the expectation set by the preallocation at the start of `A_complete()`, `A = zeros(n, t)`: at 8 bytes per `Float64`, an `n`-by-`t` array comes to $8nt/1024^3 = 7.8125\ \mathrm{GiB}$, in accordance with the macro output.
julia> @btime B_complete();
19.605 s (170 allocations: 7.81 GiB)
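As a quick sanity check on that figure (plain arithmetic, nothing machine-specific):

8 * 128^3 * 500 / 1024^3   # = 7.8125, so the reported 7.81 GiB is essentially all of A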
1.3 Run With Chunks
# Compute one chunk of columns of A, starting from the last column
# of the previous chunk (A_previous).
function A_chunk(n = n; chunk_size, A_previous)
    A = zeros(n, chunk_size)
    @threads for i in 1:n
        A[i, 1] = A_previous[i] + sin(i)
        for τ in 2:chunk_size
            A[i, τ] = A[i, τ - 1] + sin(i)
        end
    end
    return A
end
function B_chunk(n = n, t = t; chunk_size)
    B = zeros(t)
    t_lo, t_hi = 2, chunk_size + 1     # time range covered by the current chunk
    A_previous = zeros(n)
    A = zeros(n, chunk_size)
    while t_hi < t
        A = A_chunk(; chunk_size, A_previous)
        A_previous = A[:, end]         # carry the last column into the next chunk
        @threads for τ in t_lo:t_hi
            B[τ] = sum(A[i, τ - t_lo + 1] * v[i] for i in 1:n)
        end
        t_lo, t_hi = t_hi + 1, t_hi + chunk_size
    end
    # Remainder: the final (possibly shorter) chunk up to t.
    A = A_chunk(; chunk_size = t - t_lo + 1, A_previous)
    @threads for τ in t_lo:t
        B[τ] = sum(A[i, τ - t_lo + 1] * v[i] for i in 1:n)
    end
    return B
end
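As a sanity check before benchmarking, the two versions should agree; the reductions visit the same elements in the same order:

B_chunk(; chunk_size = 125) ≈ B_complete()   # expected to hold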
1.4 Benchmark: Run With Chunks
The above implementation is very wasteful, with roughly 11x the allocated memory. The allocation count is the striking part: about 5.2 billion allocations for roughly 1.05 billion terms summed (n terms for each of 499 columns), i.e. about five small allocations per term of the reduction:
julia> @btime B_chunk(; chunk_size=125)
56.042 s (5232139470 allocations: 87.78 GiB)
julia> @btime B_chunk(; chunk_size=248);
56.273 s (5232139292 allocations: 89.68 GiB)
2. Run With ⋅
2.1 Full Run
Notably, this version doesn’t use `@threads`, because `⋅` is already multithreaded and the macro would just add overhead.
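(For anyone who wants to verify the threading claim, the BLAS thread count can be inspected and set independently of Julia’s own threads:)

import LinearAlgebra: BLAS
BLAS.get_num_threads()    # threads available to BLAS kernels such as dot
BLAS.set_num_threads(16)  # adjustable independently of Threads.nthreads()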
import LinearAlgebra: ⋅

# Same as before, but with ⋅ instead of the generator sum.
function B_complete(n = n, t = t)
    B = zeros(t)
    A = A_complete(n, t)
    for τ in 2:t
        B[τ] = A[:, τ] ⋅ v    # note: A[:, τ] copies the column
    end
    return B
end
2.2 Benchmark: Full Run
This approach runs just as fast, while using double the allocations: each `A[:, τ]` slice copies its column, which accounts for the extra ~7.8 GiB.
julia> @btime B_complete();
19.169 s (1585 allocations: 15.61 GiB)
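A view avoids that copy; the following variant is just a sketch I haven’t benchmarked here (`B_complete_views` is an illustrative name), but it should shave the slice allocations off:

import LinearAlgebra: ⋅

function B_complete_views(n = n, t = t)
    B = zeros(t)
    A = A_complete(n, t)
    for τ in 2:t
        B[τ] = view(A, :, τ) ⋅ v    # view: no column copy
    end
    return B
end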
2.3 Run With Chunks
(`A_chunk` is unchanged from 1.3.)
function B_chunk(n = n, t = t; chunk_size)
    B = zeros(t)
    t_lo, t_hi = 2, chunk_size + 1
    A_previous = zeros(n)
    A = zeros(n, chunk_size)
    while t_hi < t
        A = A_chunk(; chunk_size, A_previous)
        A_previous = A[:, end]
        for τ in t_lo:t_hi
            B[τ] = A[:, τ - t_lo + 1] ⋅ v
        end
        t_lo, t_hi = t_hi + 1, t_hi + chunk_size
    end
    # Remainder: the final (possibly shorter) chunk up to t.
    A = A_chunk(; chunk_size = t - t_lo + 1, A_previous)
    for τ in t_lo:t
        B[τ] = A[:, τ - t_lo + 1] ⋅ v
    end
    return B
end
2.4 Benchmark: Run With Chunks
It can be the same as, worse than, or better than the full run with `⋅`, depending on `chunk_size`. Interestingly, this version seems to strongly prefer a small `chunk_size = 14`, possibly related to the 16 threads available on this machine.
julia> @btime B_chunk(chunk_size = 14);
6.950 s (4670 allocations: 16.38 GiB)
julia> @btime B_chunk(chunk_size = 25);
11.118 s (3262 allocations: 16.30 GiB)
julia> @btime B_chunk(; chunk_size=125)
21.819 s (1854 allocations: 17.61 GiB)
Conclusions?
For now, I’ve refactored my code to use `⋅`. But I am still rather puzzled as to why the version of `B_chunk()` presented in 1.3 (using `sum()`) should be so slow, at least relative to `B_complete()` at the same size. At a human-reading level, `B_chunk()` performs the same steps as `B_complete()`, except that every `chunk_size` steps it renews the `A` matrix.
Initially I thought the issue was my lack of experience with multithreading, but I also ran `@btime` on a version of `B_chunk()` without `@threads`, and the allocations remained the same.
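If anyone wants to poke at it, the next diagnostic step I can think of is benchmarking the reduction in isolation, outside the `while` loop (the helper `sum_col` and the test arrays below are hypothetical, just for this micro-test):

using BenchmarkTools

# Hypothetical helper: the same generator reduction, but with A and v
# passed as plain arguments rather than captured inside a while loop.
function sum_col(A, v, k)
    return sum(A[i, k] * v[i] for i in 1:size(A, 1))
end

Atest = rand(128^3, 4); vtest = rand(128^3);
@btime sum_col($Atest, $vtest, 1)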