Julia multithreading no performance improvement

Please refer to the 5th post for the English version.

You’ll probably have more success on

or translate it into English and post here again :grinning:

Sorry about posting a Chinese version. I thought I was visiting juliacn :rofl:

I wrote some interpolation code based on the ScatteredInterpolation.jl package and tried multithreading, but failed to get any performance improvement.

After getting the interpolator with

rbx = ScatteredInterpolation.interpolate(ThinPlate(),
                                         xy_m, ox,
                                         smooth = smooth)

I calculated the interpolation result by:

@time oxp = map(x -> evaluate(rbx, x), grids)

where grids holds the target grid points and rbx is the interpolator.
This took about 220 s when length(xy_m) = 20000 and length(grids) = 2300000.
I also rewrote it with multithreading as:

@time begin
    Threads.@threads for i in 1:length(grids)
        evaluate(rbx, grids[i])
    end
end

or

@time @sync begin
    for i in 1:length(grids)
        Threads.@spawn evaluate(rbx, grids[i])
    end
end

but the multi-threaded code took 250–300 s.
The OS is Windows 10 and 16 threads are used; total CPU usage was about 100%.
Is there any possible way to improve?

Performance depends on many things: the CPU type and the different resource-scheduling strategies of different operating systems.
For example, on my laptop with a 5800H, 4 threads is best on Windows (weird, since the 5800H is 8c/16t), but on WSL (the Linux subsystem) 8 threads is best.
I recommend testing your program with different thread counts to find the best value.
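A minimal, self-contained sketch of such a test (with a stand-in workload instead of the real interpolation; the script name bench_threads.jl is made up). Since the thread count is fixed at startup, save the script and launch it several times, e.g. julia -t 4 bench_threads.jl, julia -t 8 bench_threads.jl, and compare the reported times:

println("threads = ", Threads.nthreads())

work(v) = sum(sin, v)                      # stand-in for the real per-point evaluation
data = [rand(1_000) for _ in 1:20_000]

foreach(work, data[1:10])                  # warm up so compilation time is not measured

@time Threads.@threads for i in eachindex(data)
    work(data[i])
end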

It is also a good idea to profile your code to see where it spends the most time. For instance, if memory access is the bottleneck, then the threads may fight over memory instead of performing computations independently.
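A minimal sketch of such profiling with the built-in Profile standard library (the function heavy below is a made-up, allocation-heavy stand-in; substitute your own evaluation loop):

using Profile

heavy(n) = sum(abs2, rand(n, n) * rand(n, n))   # allocation-heavy stand-in workload

heavy(100)                      # run once so compilation does not dominate the profile
@profile heavy(2_000)
Profile.print(maxdepth = 12)    # frames dominated by allocation/GC are a warning sign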

1 Like

when length(grids) = 8000,

  • the @time result for the non-threaded version is

1.850699 seconds (1.79 M allocations: 1.232 GiB, 12.27% gc time, 14.91% compilation time)

  • the multi-threaded version is

0.574647 seconds (116.74 k allocations: 1.140 GiB, 48.57% gc time, 5.38% compilation time)

but when the size of grids increases, the multi-threaded version gets slower and slower.

What I meant by profiling was actually a more detailed analysis of the time spent in each subroutine of your code. If you're developing in VS Code, for example, the @profview macro is perfect for that purpose.
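For reference, in the VS Code Julia REPL (where the extension makes @profview available) a single call like the one below opens an interactive flame graph; rbx and grids are the objects from the earlier posts:

@profview map(x -> evaluate(rbx, x), grids)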

One of the most common mistakes new users make is not starting Julia with -t auto. This really should be the default.

What does Threads.nthreads() say?
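For example, after starting Julia with julia -t auto (or julia --threads=16):

Threads.nthreads()   # should match the requested thread count; 1 means the loops ran single-threaded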

1 Like

Thank you very much for introducing the useful tool @profview.
With @profview, I found that when the grid size increases, the multithreaded version begins to spend much more time in these functions:

  • Distances/6E33b/src/generic.jl:320 & 321
    r = Matrix{result_type(metric, a, b)}(undef, m, n)
    pairwise!(r, metric, a, b, dims=dims)

as well as

  • base/boot.jl: 459
Array{T,2}(::UndefInitializer, m::Int, n::Int) where {T} =
    ccall(:jl_alloc_array_2d, Array{T,2}, (Any, Int, Int), Array{T,2}, m, n)

The Array function is flagged as "GC", and its time cost increased from 15% to 83% when using multithreading.

Threads.nthreads() gives 16.

This is typically a sign that memory usage is the bottleneck of your code. You spend too much time allocating large matrices (inside the Array function) and subsequently cleaning them up (that's where the garbage collector comes in).
Is it necessary to re-create a new matrix r every time, or could you modify it in place?
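To illustrate the in-place idea (a sketch only, not the package's actual code path; the metric and array sizes below are made up): with Distances.jl the distance matrix can be allocated once and refilled with pairwise! for every new batch of points.

using Distances

metric = Euclidean()
a = rand(2, 20_000)     # interpolation points, one per column
b = rand(2, 1_000)      # one batch of evaluation points, one per column

r = Matrix{Float64}(undef, size(a, 2), size(b, 2))   # allocated once
pairwise!(r, metric, a, b, dims = 2)                 # refilled in place, no new allocation per call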

2 Likes

You might try out DIVAnd, which also provides a thin-plate smoothed interpolation of scattered points and, on my laptop, takes 45 s for a problem of the same size.

using DIVAnd
using BenchmarkTools


# test function to interpolate
fun(x,y) = sin.(6x) * cos.(6y)

# observations

x = rand(75000);
y = rand(75000);
f = fun.(x,y).+randn(75000)

# final grid
xi,yi = ndgrid(range(0,stop=1,length=2800),range(0,stop=1,length=1000));


# all points are valid points
mask = trues(size(xi));

# this problem has a simple cartesian metric
# pm is the inverse of the resolution along the 1st dimension
# pn is the inverse of the resolution along the 2nd dimension

pm = ones(size(xi)) / (xi[2,1]-xi[1,1]);
pn = ones(size(xi)) / (yi[1,2]-yi[1,1]);

# correlation length
len = 0.1;

# obs. error variance normalized by the background error variance
epsilon2 = 1.;

# fi is the interpolated field
@btime fi,s = DIVAndrun(mask,(pm,pn),(xi,yi),(x,y),f,len,epsilon2;alphabc=2);

I rewrote the code as

    function mythread(df)
        n1 = length(df)
        nstep = 1000
        nseg = n1 ÷ nstep
        res = Vector{Float64}[]
        Threads.@threads for i = 1:nseg
            push!(res, evaluate(rbx, reduce(hcat, grids[((i-1)*nstep+1):(i*nstep)])))
        end
        push!(res, evaluate(rbx, reduce(hcat, grids[((nseg-1)*nstep+1):end])))
        return rbx
    end

and the @profview result showed a reduction of GC time for the Array function (from 83% to 10%). The total time decreased from ~250 s to 48 s.
I guess this is because the pairwise function now processes a 1000-point chunk at a time and only creates nseg+1 matrices, whereas the previous version created a new matrix for every single one of the 2300000 points.
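To make the difference concrete (a sketch, assuming rbx and grids as defined in the earlier posts, with each element of grids being a coordinate vector):

# per point: evaluate is called 2300000 times, allocating a small distance matrix each time
oxp_single = map(x -> evaluate(rbx, x), grids)

# batched: one call handles a whole chunk, so only one (larger) distance matrix
# is allocated per 1000-point chunk
chunk = reduce(hcat, grids[1:1000])    # matrix with one point per column
oxp_chunk = evaluate(rbx, chunk)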

2 Likes

Beware of pushing to res within a thread; I think you might open yourself up to race conditions. It would be safer to specify the indices beforehand:

    function mythread(df)
        n1 = length(df)
        nstep = 1000
        nseg = n1 ÷ nstep
        res = Vector{Vector{Float64}}(undef, nseg + 1)
        Threads.@threads for i = 1:nseg
            res[i] = evaluate(rbx, reduce(hcat, grids[((i-1)*nstep+1):(i*nstep)]))
        end
        res[end] = evaluate(rbx, reduce(hcat, grids[((nseg-1)*nstep+1):end]))
        return rbx
    end

Out of curiosity, do you really need to fill res if you don’t return it?

Mistyping, I should return res. I knew about the race condition here, but I was getting off work.
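Putting the pieces together, a possible final version (a sketch only: it assumes rbx, grids and evaluate from the earlier posts, renames the argument to points, returns res instead of rbx, writes into preallocated slots to avoid the race, and only processes the leftover points after the last full chunk):

function mythread(points; nstep = 1000)
    n1 = length(points)
    nseg = n1 ÷ nstep
    nrem = n1 - nseg * nstep                        # points left over after the full chunks
    res = Vector{Vector{Float64}}(undef, nseg + (nrem > 0 ? 1 : 0))
    Threads.@threads for i = 1:nseg
        res[i] = evaluate(rbx, reduce(hcat, points[((i-1)*nstep+1):(i*nstep)]))
    end
    if nrem > 0
        res[end] = evaluate(rbx, reduce(hcat, points[(nseg*nstep+1):end]))
    end
    return res
end

oxp = reduce(vcat, mythread(grids))    # flatten the per-chunk results into one vector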