A simple benchmark is shown at the end of the post. It can be seen that for the smaller vectors, the multithreading in @turbo makes no difference. For the larger vectors, increasing the number of threads improves performance, but 4 and 8 threads differ only slightly. (I am curious about the roughly 5× performance difference between 1 and 2 threads, but that is another story.)
That is to say, the multithreading performance boost depends strongly on the size of the vector, which is quite natural. The problem is: given this, how should we finely tune the performance of a piece of code?
My actual use case involves handling a large N*M matrix. The algorithm is roughly the following:
for i_m = 1:M
    v = view(mat, :, i_m)
    @turbo for i_n = 1:N
        v[i_n] = aux1[i_n] * aux2[i_n]  # the actual code is more complex
    end
end
That is to say, the treatment of each column along the M dimension is completely independent.
The matrix may be so large that memory becomes a serious problem. Thus, users are allowed to specify a maximum memory, and the matrix will be split into n*M blocks (blocked only in the N dimension), each small enough to fit within that limit; a sketch of this splitting follows the list below. All of the following quantities may differ from run to run:
- N and M values
- the maximum memory
- number of threads used for calculation
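To make the blocking concrete, here is a minimal sketch of how the block height n could be derived from the memory budget. The helper name and the assumption that only the matrix itself counts toward the budget are mine, not from the actual code:

```julia
# Hypothetical helper: the largest block height n (rows per block) such
# that an n*M block of element type T fits within max_bytes.
function block_height(N, M, max_bytes; T = Float64)
    return min(N, max(1, max_bytes ÷ (M * sizeof(T))))
end

n = block_height(2^20, 512, 256 * 2^20)  # 65536 rows per block under 256 MiB
nblocks = cld(2^20, n)                   # 16 blocks along the N dimension
```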
Combining these arguments, I suspect the best performance can only be achieved by finely tuning how the threads are split between the M dimension and @turbo, roughly like the following:
chunks = get_start_end_M_for_threads(M, nt_M)
tasks = map(chunks) do chunk
    Threads.@spawn my_f_thread(chunk[1], chunk[2], nn, block_mat, nt_N)
end
_ = fetch.(tasks)  # no need to reduce here
function my_f_thread(ms, me, nn, block_mat, nt_N)
    for i_m = ms:me
        v = view(block_mat, :, i_m)
        # aux1 and aux2 are captured from the surrounding scope in this sketch
        @turbo thread=nt_N for i_n = 1:nn
            v[i_n] = aux1[i_n] * aux2[i_n]
        end
    end
end
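For completeness, here is a minimal sketch of what the chunking helper above could look like; the even-split strategy is my assumption:

```julia
# Split 1:M into nt_M contiguous (start, stop) pairs, spreading the
# remainder over the first chunks.
function get_start_end_M_for_threads(M, nt_M)
    base, rem = divrem(M, nt_M)
    chunks = Vector{Tuple{Int,Int}}(undef, nt_M)
    start = 1
    for t = 1:nt_M
        len = base + (t <= rem ? 1 : 0)
        chunks[t] = (start, start + len - 1)
        start += len
    end
    return chunks
end
```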
However, I ran into difficulties when actually implementing it:
- How does @turbo multithreading behaviour differ across different hardware? That is to say, is it even possible to find a good heuristic for determining the best nt_M and nt_N? (A rough attempt is sketched just before the benchmark below.)
- As a macro, @turbo thread=nt_N raises an error when nt_N is a variable, since its value is only known at run time. Is it even possible to implement my_f_thread without eval, which I suspect would cause even greater performance problems? (See the sketch right after this list.)
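Regarding the second question: since the macro apparently needs a literal (compile-time) value for thread=, one possible workaround is a top-level @eval loop that generates one kernel per candidate thread count once at definition time, followed by an ordinary run-time branch to pick among them. This is a sketch under that assumption, not a vetted solution:

```julia
using LoopVectorization

# Generate kernel_1!, kernel_2!, kernel_4!, kernel_8!, each with a
# literal thread count baked into @turbo at definition time.
for nt in (1, 2, 4, 8)
    @eval function $(Symbol(:kernel_, nt, :!))(v, aux1, aux2, nn)
        @turbo thread=$nt for i_n = 1:nn
            v[i_n] = aux1[i_n] * aux2[i_n]
        end
        return nothing
    end
end

# Run-time selection over the fixed candidates.
function apply_kernel!(v, aux1, aux2, nn, nt_N)
    nt_N >= 8 ? kernel_8!(v, aux1, aux2, nn) :
    nt_N >= 4 ? kernel_4!(v, aux1, aux2, nn) :
    nt_N >= 2 ? kernel_2!(v, aux1, aux2, nn) :
                kernel_1!(v, aux1, aux2, nn)
end
```

The branch itself is cheap next to the inner loop, and no eval happens while the pipeline is running.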
I cannot combine the N and M dimensions into one, because some auxiliary vectors only have the N dimension.
I cannot block the large matrix in the M dimension, because the aforementioned function is only one part of a larger processing pipeline, and later stages require the full M dimension. Only the N dimension can be blocked.
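As for the first question, the benchmark below at least suggests a crude starting point on this machine: @turbo threading pays off only for long vectors, and its returns diminish past 4 threads. A purely illustrative heuristic (the threshold and the split rule are assumptions that would need per-machine tuning):

```julia
# Illustrative only: give all threads to the M loop when columns are
# short; otherwise split them between the M loop and @turbo.
function choose_threads(N, M, nthreads; n_threshold = 2^16)
    if N < n_threshold
        return (nt_M = min(M, nthreads), nt_N = 1)
    else
        nt_N = min(4, nthreads)      # little gain past 4 in the benchmark
        nt_M = max(1, nthreads ÷ nt_N)
        return (nt_M = nt_M, nt_N = nt_N)
    end
end
```

Here is the simple benchmark mentioned at the beginning of the post: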
julia> using MKL, LoopVectorization, BenchmarkTools, Random, LinearAlgebra
julia> n1 = 2 ^ 14; n2 = 2 ^ 17;
julia> a1 = zeros(n1); b1 = zeros(n1); rand!(a1); rand!(b1);
julia> a2 = zeros(n2); b2 = zeros(n2); rand!(a2); rand!(b2);
julia> function my_dot_1(x, y, n)
           s = zero(eltype(x))
           @turbo for i = 1:n
               s += x[i] * y[i]
           end
           return s
       end
my_dot_1 (generic function with 1 method)
julia> function my_dot_2(x, y, n)
           s = zero(eltype(x))
           @turbo thread=2 for i = 1:n
               s += x[i] * y[i]
           end
           return s
       end
my_dot_2 (generic function with 1 method)
julia> function my_dot_4(x, y, n)
           s = zero(eltype(x))
           @turbo thread=4 for i = 1:n
               s += x[i] * y[i]
           end
           return s
       end
my_dot_4 (generic function with 1 method)
julia> function my_dot_8(x, y, n)
           s = zero(eltype(x))
           @turbo thread=8 for i = 1:n
               s += x[i] * y[i]
           end
           return s
       end
my_dot_8 (generic function with 1 method)
julia> @btime my_dot_1($a1, $b1, $n1)
1.487 μs (0 allocations: 0 bytes)
4086.504852746797
julia> @btime my_dot_2($a1, $b1, $n1)
1.573 μs (0 allocations: 0 bytes)
4086.504852746797
julia> @btime my_dot_4($a1, $b1, $n1)
1.576 μs (0 allocations: 0 bytes)
4086.504852746797
julia> @btime my_dot_8($a1, $b1, $n1)
1.492 μs (0 allocations: 0 bytes)
4086.504852746797
julia> @btime dot($a1, $b1)
1.615 μs (1 allocation: 16 bytes)
4086.5048527467975
julia> @btime my_dot_1($a2, $b2, $n2)
75.687 μs (0 allocations: 0 bytes)
32625.7939963322
julia> @btime my_dot_2($a2, $b2, $n2)
16.270 μs (0 allocations: 0 bytes)
32625.79399633219
julia> @btime my_dot_4($a2, $b2, $n2)
6.024 μs (0 allocations: 0 bytes)
32625.7939963322
julia> @btime my_dot_8($a2, $b2, $n2)
5.492 μs (0 allocations: 0 bytes)
32625.7939963322
julia> @btime dot($a2, $b2)
75.348 μs (1 allocation: 16 bytes)
32625.7939963322
julia> using VectorizationBase
julia> VectorizationBase.num_cores()
static(24)
julia> Threads.nthreads()
8