Threaded loop far slower than sequential loop (+ high compilation time)

StefanMathis · September 17, 2021, 2:24pm

Hello,

When I tried to parallelize a loop, performance collapsed by roughly a factor of 1000. I tried to reproduce the problem with an MWE and found something interesting:

main() = loop(Vector{Float64}(undef, 100_000))
function loop(arr)
    for k in 1:length(arr)
        k_float = float(k)
        arr[k] = sin(k_float)*cos(k_float)*tan(k_float)/sqrt(k_float) # Just some calculation to sink time into
    end
end

mainthreaded() = loopthreaded(Vector{Float64}(undef, 100_000))
function loopthreaded(arr)
    Threads.@threads for k in 1:length(arr)
        k_float = float(k)
        arr[k] = sin(k_float)*cos(k_float)*tan(k_float)/sqrt(k_float) # Just some calculation to sink time into
    end
end

println("===============")
@time main()
@time mainthreaded()

When running this script four times in a row, I get the following REPL output:

  0.004578 seconds (2 allocations: 781.328 KiB)
  0.041883 seconds (47.42 k allocations: 3.599 MiB, 95.31% compilation time)
===============
  0.004192 seconds (2 allocations: 781.328 KiB)
  0.022073 seconds (18.21 k allocations: 1.812 MiB, 91.24% compilation time)
===============
  0.004347 seconds (2 allocations: 781.328 KiB)
  0.023520 seconds (18.21 k allocations: 1.812 MiB, 91.38% compilation time)
===============
  0.005623 seconds (2 allocations: 781.328 KiB)
  0.032980 seconds (18.21 k allocations: 1.812 MiB, 29.23% gc time, 95.21% compilation time)

So after the first execution, the threaded loop is roughly 5 times slower than the sequential loop (not counting GC time). I am aware that for such a trivial example, the cost of managing threads can be much higher than the cost of the loop itself, which would explain the loss in speed. In my real-world application, the calculation is expensive enough for multithreading to make sense. However, two things really stood out to me:

The high number of allocations (from my understanding, due to the thread spawning?)
Every time the mainthreaded() function is invoked, most of the time is reported as compilation time. Why?

In my real-world application, the Gtk package is used (not in the loop, just in general). Profiling reports that almost all time is spent in gtk_main(), which seems to be the same issue as reported here:

However, even when completely removing Gtk from the project, the issue simply shifts to threading-setup functions.

One suggestion I found was using the ThreadPools package, see here:

github.com/JuliaLang/julia

inspectdr trashes threading performance

opened 11:36PM - 19 Apr 21 UTC

mattcbro

This is a rather obscure but difficult to find interaction between the Threads library and the inspectdr() backend for the Plots library. I have some code that attempts to use a simple @threads for loop to parallelize multiple QR decompositions of a block of raw data. Calling the inspectdr() initializer will cause the threaded code to run anywhere from 200 to 7000 times slower in this example. The difference for the genprojmat() function timings can be seen by commenting and uncommenting the inspectdr() line. I don't do any plotting in this script. I'm on linux mint 20.1, 8 core processor. The code follows:

```julia
# test simple thread idea
using Plots
# uncommenting inspectdr() causes the threaded version of genprojmat to run up to 7000 times slower
inspectdr()
using LinearAlgebra
using BenchmarkTools
#using QThread
##
"""
A matlab version of Julias QR decomposition.
Q,R = flatqr(X)
"""
function flatqr(X)
    F = qr(X)
    return(Matrix(F.Q), F.R)
end

"""
Circularly symmetric Complex noise
"""
function cgauss(varargin...)
    Z = (1 ./ sqrt(2.)).* (randn(varargin)+im.*randn(varargin))
end

"""
complex zeros functions since I forget how to write the type signatures
"""
function czeros(x...)
    y = zeros(Complex{Float64}, x) ;
    return(y)
end # function czeros

"""
Generate all the projection matrices. Use threaded loop to exploit embarassing parallel problem
"""
function genprojmat(xdata)
    Mants, Nf, Ns = size(xdata)
    Qall = czeros(Ns, Mants, Nf)
    Threads.@threads for k = 1:Nf
        Qx, Rx = flatqr(xdata[:,k,:]')
        Qall[:, :, k] = Qx
    end
    return(Qall)
end

"""
Generate all the projection matrices. Non threaded case is much faster. Why?
"""
function genprojmatnt(xdata)
    Mants, Nf, Ns = size(xdata)
    Qall = czeros(Ns, Mants, Nf)
    for k = 1:Nf
        Qx, Rx = flatqr(xdata[:,k,:]')
        Qall[:, :, k] = Qx
    end
    return(Qall)
end

##
# Set up the input data
Mants =4
Nf = 24
nsgn = 0.1
apr = cgauss(4)
chan = cgauss(Nf)
ac = apr * transpose(chan)
# Mants, Nf, Ns = size(xdata)
Ns = 1024
K = 3
xdata = nsgn .* cgauss(Mants, Nf, Ns)
st = randn(Ns)
for q=1:Ns
    xdata[:,:,q] = xdata[:,:,q] + ac .* st[q]
end

# threaded version
@btime genprojmat(xdata) ;
# not threaded version
@btime genprojmatnt(xdata) ;
```

The output of running the script, first with inspectdr() uncommented:

```
matt@Hope /mnt/WorkSpace/projects/Maestro/QThread $ julia -t 8
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.0 (2021-03-24)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> include("testthreaded.jl")
Gtk-Message: 16:20:03.457: Failed to load module "xapp-gtk3-module"
  5.003 s (332 allocations: 7.54 MiB)
  2.380 ms (290 allocations: 7.54 MiB)
1024×4×24 Array{ComplexF64, 3}:
.....

julia> Threads.nthreads()
8
```

Now with inspectdr() commented out and after restarting julia.

```
matt@Hope /mnt/WorkSpace/projects/Maestro/QThread $ julia -t 8
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.0 (2021-03-24)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> Threads.nthreads()
8

julia> include("testthreaded.jl")
  705.049 μs (331 allocations: 7.54 MiB)
  2.547 ms (290 allocations: 7.54 MiB)
1024×4×24 Array{ComplexF64, 3}:
```

However, this didn’t help either.

Therefore, I am wondering whether an individual thread is spawned for each value k and the loop body is recompiled for each single iteration. Could this be true? This could also explain the multitude of threading-related issues here on Discourse, e.g.:

Any insights would be appreciated very much!

s-broda · September 17, 2021, 2:43pm

When you say “running the script 4 times in a row”, do you mean you are restarting julia in between runs? That would explain why you see these compilation times, since the compiled code isn’t cached across sessions.

When I run the code twice in the same session, I get the following:

julia> @time main()
  0.003548 seconds (2 allocations: 781.328 KiB)

julia> @time mainthreaded()
  0.037268 seconds (47.78 k allocations: 3.629 MiB, 47.28% compilation time)

julia> @time main()
  0.003816 seconds (2 allocations: 781.328 KiB)

julia> @time mainthreaded()
  0.001187 seconds (127 allocations: 793.719 KiB)

So the threaded code is, in fact, faster (this is a 12 core machine).
Cheers
Simon
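
On the question of whether a separate thread is spawned for each k: as far as I understand, Threads.@threads splits the iteration range into about Threads.nthreads() contiguous chunks and runs one task per chunk, so there is neither one thread nor one compilation per iteration. Below is a hand-written sketch of that idea. The name loopchunked! is hypothetical, and this only illustrates the chunking; it is not the actual implementation of the macro.

```julia
# Hypothetical helper: a hand-rolled stand-in for Threads.@threads,
# meant only to illustrate chunking (not the macro's real implementation).
function loopchunked!(arr)
    n = length(arr)
    nchunks = min(n, Threads.nthreads())
    nchunks == 0 && return arr          # nothing to do for an empty array
    chunksize = cld(n, nchunks)
    # One task per contiguous chunk of indices, not one per iteration.
    tasks = map(Iterators.partition(1:n, chunksize)) do chunk
        Threads.@spawn for k in chunk
            k_float = float(k)
            arr[k] = sin(k_float) * cos(k_float) * tan(k_float) / sqrt(k_float)
        end
    end
    foreach(wait, tasks)                # block until every chunk is done
    return arr
end

arr = loopchunked!(Vector{Float64}(undef, 100_000))
```

With 12 threads this creates at most 12 tasks for the 100_000 iterations, which would fit the 127 allocations seen in the warm run above better than a per-iteration cost would.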

StefanMathis · September 17, 2021, 2:57pm

Thank you for your fast answer! Well, this is embarrassing… I ran the entire script four times, but of course this redefined the functions each time, so they had to be recompiled on every run. Doing it properly solves the issue. It doesn’t solve the issue in my real-world application, but I think I have to come up with a new MWE for that.
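
For reference, a minimal sketch of what "doing it properly" can look like: define the function once, call it once as a warm-up, and only then time it. The name loopthreaded! and the variables t_first and t_warm are just illustrative choices, and the actual numbers depend on the machine and the thread count.

```julia
function loopthreaded!(arr)
    Threads.@threads for k in 1:length(arr)
        k_float = float(k)
        arr[k] = sin(k_float) * cos(k_float) * tan(k_float) / sqrt(k_float)
    end
    return arr
end

arr = Vector{Float64}(undef, 100_000)

# First call: the measured time includes compiling loopthreaded! and the
# closure that @threads generates for the loop body.
t_first = @elapsed loopthreaded!(arr)

# Later calls in the same session reuse the compiled code, as long as the
# function is not redefined in between (re-running the script redefines it).
t_warm = minimum(@elapsed(loopthreaded!(arr)) for _ in 1:5)

println("first call: ", t_first, " s, warm call: ", t_warm, " s")
```

Taking the minimum over a few warm calls is only a crude way to reduce noise; a package such as BenchmarkTools does this more carefully.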

raminammour · September 17, 2021, 3:16pm

The real issue is that the function without multithreading is precompiled when you run the script whereas the threaded function is not. That is why you don’t see the compilation time of main() every time you re-run the script (even though, as you said, you are redefining the function). I tried adding precompile statements to no avail.

Someone more knowledgeable than me should chime in on why this is the case.
