Julia multithreading no performance improvement

Please refer to the 5th post for the English version.

You’ll probably have more success on

or translate it into English and post here again :grinning:

Sorry about posting a Chinese version. I thought I was visiting juliacn :rofl:

I wrote some interpolation code based on the ScatteredInterpolation.jl package and tried multithreading, but failed to get any performance improvement.

After getting the interpolator with

rbx = ScatteredInterpolation.interpolate(ThinPlate(),
                                         xy_m, ox,
                                         smooth = smooth)

I calculated the interpolation result by:

@time oxp = map(x -> evaluate(rbx, x), grids)

where grids holds the target grid points and rbx is the interpolator.
This took about 220 s when length(xy_m) = 20000 and length(grids) = 2300000.
I also rewrote it with multithreading as:

@time begin
    Threads.@threads for i in 1:length(grids)
        evaluate(rbx, grids[i])
    end
end

or

@time @sync begin
    for i in 1:length(grids)
        Threads.@spawn evaluate(rbx, grids[i])
    end
end

but the multi-threaded code took 250–300 s.
The OS is Windows 10 and 16 threads are used; total CPU usage was about 100%.
Is there any possible way to improve?

Performance depends on many things: the CPU type and the different resource-scheduling strategies of different operating systems.
For example, on my laptop with a 5800H, 4 threads is best on Windows (weird, since the 5800H is 8c/16t), but on WSL (the Linux subsystem) 8 threads is best.
I recommend testing your program with different thread counts to find the best value.
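A minimal, self-contained sketch of such a test (with a stand-in workload instead of the real interpolation; the script name bench_threads.jl is made up). Since the thread count is fixed at startup, save the script and launch it several times, e.g. julia -t 4 bench_threads.jl, julia -t 8 bench_threads.jl, and compare the reported times:

println("threads = ", Threads.nthreads())

work(v) = sum(sin, v)                      # stand-in for the real per-point evaluation
data = [rand(1_000) for _ in 1:20_000]

foreach(work, data[1:10])                  # warm up so compilation time is not measured

@time Threads.@threads for i in eachindex(data)
    work(data[i])
end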

It is also a good idea to profile your code to see where it spends the most time. For instance, if memory access is the bottleneck, then the threads may fight over memory instead of performing computations independently.
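A minimal sketch of such profiling with the built-in Profile standard library (the function heavy below is a made-up, allocation-heavy stand-in; substitute your own evaluation loop):

using Profile

heavy(n) = sum(abs2, rand(n, n) * rand(n, n))   # allocation-heavy stand-in workload

heavy(100)                      # run once so compilation does not dominate the profile
@profile heavy(2_000)
Profile.print(maxdepth = 12)    # frames dominated by allocation/GC are a warning sign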

1 Like

when length(grids) = 8000,

  • the @time result for the non-threaded version is

1.850699 seconds (1.79 M allocations: 1.232 GiB, 12.27% gc time, 14.91% compilation time)

  • the multi-threaded version is

0.574647 seconds (116.74 k allocations: 1.140 GiB, 48.57% gc time, 5.38% compilation time)

but when the size of grids increases, the multi-threaded version gets slower and slower.

What I meant by profiling was actually a more detailed analysis of the time spent in each subroutine of your code. If you're developing in VS Code, for example, the @profview macro is perfect for that purpose.
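For reference, in the VS Code Julia REPL (where the extension makes @profview available) a single call like the one below opens an interactive flame graph; rbx and grids are the objects from the earlier posts:

@profview map(x -> evaluate(rbx, x), grids)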

One of the most common mistakes new users make is not starting Julia with -t auto. This really should be the default.

What does Threads.nthreads() say?
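For example, after starting Julia with julia -t auto (or julia --threads=16):

Threads.nthreads()   # should match the requested thread count; 1 means the loops ran single-threaded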

1 Like

Thank you very much for introducing the useful tool @profview.
With @profview, I found that when the grid size increases, the multithreaded version begins to spend much more time in these functions:

  • Distances/6E33b/src/generic.jl:320 & 321
    r = Matrix{result_type(metric, a, b)}(undef, m, n)
    pairwise!(r, metric, a, b, dims=dims)

as well as

  • base/boot.jl: 459
Array{T,2}(::UndefInitializer, m::Int, n::Int) where {T} =
    ccall(:jl_alloc_array_2d, Array{T,2}, (Any, Int, Int), Array{T,2}, m, n)

The Array function is flagged as "GC", and its time cost increased from 15% to 83% when using multithreading.

Threads.nthreads() gives 16.

This is typically a sign that memory usage is the bottleneck of your code. You spend too much time allocating large matrices (inside the Array function) and subsequently cleaning them up (that's where the garbage collector comes in).
Is it necessary to re-create a new matrix r every time, or could you modify it in place?
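To illustrate the in-place idea (a sketch only, not the package's actual code path; the metric and array sizes below are made up): with Distances.jl the distance matrix can be allocated once and refilled with pairwise! for every new batch of points.

using Distances

metric = Euclidean()
a = rand(2, 20_000)     # interpolation points, one per column
b = rand(2, 1_000)      # one batch of evaluation points, one per column

r = Matrix{Float64}(undef, size(a, 2), size(b, 2))   # allocated once
pairwise!(r, metric, a, b, dims = 2)                 # refilled in place, no new allocation per call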

2 Likes

You might try out DIVAnd, which also provides a thin-plate smoothed interpolation of scattered points and, on my laptop, takes 45 s for a problem of the same size.

using DIVAnd
using BenchmarkTools


# test function to interpolate
fun(x,y) = sin.(6x) * cos.(6y)

# observations

x = rand(75000);
y = rand(75000);
f = fun.(x,y).+randn(75000)

# final grid
xi,yi = ndgrid(range(0,stop=1,length=2800),range(0,stop=1,length=1000));


# all points are valid points
mask = trues(size(xi));

# this problem has a simple cartesian metric
# pm is the inverse of the resolution along the 1st dimension
# pn is the inverse of the resolution along the 2nd dimension

pm = ones(size(xi)) / (xi[2,1]-xi[1,1]);
pn = ones(size(xi)) / (yi[1,2]-yi[1,1]);

# correlation length
len = 0.1;

# obs. error variance normalized by the background error variance
epsilon2 = 1.;

# fi is the interpolated field
@btime fi,s = DIVAndrun(mask,(pm,pn),(xi,yi),(x,y),f,len,epsilon2;alphabc=2);

I rewrote the code as

    function mythread(df)
        n1 = length(df)
        nstep = 1000
        nseg = n1 ÷ nstep
        res = Vector{Float64}[]
        Threads.@threads for i = 1:nseg
            push!(res, evaluate(rbx, reduce(hcat, grids[((i-1)*nstep+1):(i*nstep)])))
        end
        push!(res, evaluate(rbx, reduce(hcat, grids[((nseg-1)*nstep+1):end])))
        return rbx
    end

and the @profview result showed a reduction of GC time for the Array function (from 83% to 10%). The total time decreased from ~250 s to 48 s.
I guess this is because the pairwise function now processes a 1000-point chunk at a time and only creates nseg+1 matrices, whereas the previous version created a new matrix for every single one of the 2300000 points.
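To make the difference concrete (a sketch, assuming rbx and grids as defined in the earlier posts, with each element of grids being a coordinate vector):

# per point: evaluate is called 2300000 times, allocating a small distance matrix each time
oxp_single = map(x -> evaluate(rbx, x), grids)

# batched: one call handles a whole chunk, so only one (larger) distance matrix
# is allocated per 1000-point chunk
chunk = reduce(hcat, grids[1:1000])    # matrix with one point per column
oxp_chunk = evaluate(rbx, chunk)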

2 Likes

Beware of pushing to res within a thread; I think you might open yourself up to race conditions. It would be safer to specify the indices beforehand:

    function mythread(df)
        n1 = length(df)
        nstep = 1000
        nseg = n1 ÷ nstep
        res = Vector{Vector{Float64}}(undef, nseg + 1)
        Threads.@threads for i = 1:nseg
            res[i] = evaluate(rbx, reduce(hcat, grids[((i-1)*nstep+1):(i*nstep)]))
        end
        res[end] = evaluate(rbx, reduce(hcat, grids[((nseg-1)*nstep+1):end]))
        return rbx
    end

Out of curiosity, do you really need to fill res if you don’t return it?

Mistyping, I should return res. I knew about the race condition here, but I was getting off work.
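Putting the pieces together, a possible final version (a sketch only: it assumes rbx, grids and evaluate from the earlier posts, renames the argument to points, returns res instead of rbx, writes into preallocated slots to avoid the race, and only processes the leftover points after the last full chunk):

function mythread(points; nstep = 1000)
    n1 = length(points)
    nseg = n1 ÷ nstep
    nrem = n1 - nseg * nstep                        # points left over after the full chunks
    res = Vector{Vector{Float64}}(undef, nseg + (nrem > 0 ? 1 : 0))
    Threads.@threads for i = 1:nseg
        res[i] = evaluate(rbx, reduce(hcat, points[((i-1)*nstep+1):(i*nstep)]))
    end
    if nrem > 0
        res[end] = evaluate(rbx, reduce(hcat, points[(nseg*nstep+1):end]))
    end
    return res
end

oxp = reduce(vcat, mythread(grids))    # flatten the per-chunk results into one vector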