Speed up for-loop with multithreading

Hi Guys,

I’m trying to speed up my for-loop with multithreading like the following (test3.jl code):

time0 = @elapsed begin
    n = 250
    a1 = Array{Array{Float64, 1}, 1}(undef, n)
    Threads.@threads for i in 1:n 
        a1[i] = [i+0.12345, 0]
        # push!(a1, [i+0.12345, 0])
    end
end

# without multithreading
time = @elapsed try
    a2 = Array{Array{Float64, 1}, 1}()
    for row in a1
        tmp = [round(x * 0.00001, digits=5) for x in row]
        push!(a2, tmp)
    end
catch
end

# with multithreading
time2 = @elapsed try
    a3 = Array{Array{Float64, 1}, 1}(undef, n)
    Threads.@threads for i in 1:n
        tmp = [round(x * 0.00001, digits=5) for x in a1[i]]
        a3[i] = tmp
    end
catch
end

println("time0=$time0, time=$time, time2=$time2")

and I run:

julia> include("test/test3.jl")
time0=0.072520981, time=0.020055303, time2=0.070619797

I have set the number of threads to 4.

I know time2 > time because of overhead, which means that starting 4 threads costs more time than directly running the loop without parallelism. But I still want to ask: is there any chance to speed up this for-loop? Maybe I should write more efficient code? But how do I do that?

Best Regards

Hi, it would be nice if you could edit your post and format the code with triple backticks. That way it would be easier for us to copy and paste it. Thanks!

Thanks for your hint! I’ve done it!

Correct, here is what I see using @btime:

  27.400 μs (1023 allocations: 43.67 KiB)
  5.283 μs (2 allocations: 64 bytes)
  5.267 μs (1 allocation: 16 bytes)
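
For reference, one way to get numbers like these is to wrap each block in a small function and call @btime on it. A minimal sketch, where fill_threaded is just a hypothetical wrapper around your first block:

using BenchmarkTools

# hypothetical wrapper around the first block, so @btime measures a
# function call instead of global-scope code
fill_threaded(n) = begin
    a = Vector{Vector{Float64}}(undef, n)
    Threads.@threads for i in 1:n
        a[i] = [i + 0.12345, 0.0]
    end
    a
end

@btime fill_threaded(250)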

If you need more speed, this MWE doesn’t seem representative of your real problem? I can’t estimate whether your problem size is bigger, or whether this is a hot loop that is executed many times?


Thanks for your answer! This is only a part of a function (f1). In f1 there is only one loop, with 250 iterations. I want to speed up f1 by speeding up this loop with multithreading, but it doesn’t seem to work.
The whole program is very big, and there are many different functions. My final goal is to speed up the whole program, and speeding up f1 is only my first step. Maybe I should move on and try to speed up other functions.

That is what I suspected. Did you try to profile your program? I’d recommend using a visual profiler in VSCode or Atom/Juno (which I still slightly prefer).

Edit: another idea. If your MWE is representative of your problem, allocations could be part of the issue. I see some room for improvement there.
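
For example, here is a minimal sketch (assuming it is acceptable to keep the 250 pairs in a single Matrix and to round in place), which makes the loop itself allocation-free:

using BenchmarkTools

# hypothetical low-allocation variant: one Matrix instead of a Vector of
# Vectors, rounded in place so the loop does not allocate
round_inplace!(m::AbstractMatrix{Float64}) = begin
    @inbounds for i in eachindex(m)
        m[i] = round(m[i], digits=5)
    end
    m
end

m = rand(2, 250)
@btime round_inplace!($m)    # expect 0 allocations inside the loop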


I don’t know about profiling yet. I’ll read the docs :).

But the original benchmark is problematic: the try-catch blocks mask that a1 is not defined. Corrected benchmark:

using BenchmarkTools 

init() = begin
    n = 250
    a1 = Vector{Vector{Float64}}(undef, n)
    for i in 1:n 
        a1[i] = [rand(), rand()]
    end
    a1
end

# without multithreading
round_serial(a1) = begin
    n = length(a1)
    a2 = Vector{Vector{Float64}}(undef, n)
    for i in 1:n 
        a2[i] = [round(x, digits=5) for x in a1[i]]
    end
    a2
end

# with multithreading
round_parallel(a1) = begin
    n = length(a1)
    a3 = Vector{Vector{Float64}}(undef, n)
    Threads.@threads for i in 1:n 
        a3[i] = [round(x, digits=5) for x in a1[i]]
    end
    a3
end

a1 = init()
a2 = @btime round_serial($a1)
a3 = @btime round_parallel($a1)
@assert isapprox(a3, a2)

shows some speedup for me:

  29.100 μs (251 allocations: 21.59 KiB)
  10.500 μs (270 allocations: 24.00 KiB)
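
The speedup of course depends on how many threads the session actually has. A quick sanity check, assuming you want 4 threads (start Julia with julia -t 4 or set JULIA_NUM_THREADS=4):

julia> Threads.nthreads()
4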

OK, thanks for your check! Until now I have still used @elapsed, because I need the returned time variable and the total run time is what matters to me.
I have read the docs about profiling, but I really don’t understand well how to interpret the output of

ProfileView.@profview

Code for visualizing the profile:

using ProfileView

time0 = @elapsed begin
    n = 250
    a1 = Array{Array{Float64, 1}, 1}(undef, n)
    Threads.@threads for i in 1:n 
        a1[i] = [i+0.12345, 0]
        # push!(a1, [i+0.12345, 0])
    end
end

# without multithreading
time1 = @elapsed @profview begin
    a2 = Array{Array{Float64, 1}, 1}()
    for row in a1 
        tmp = [round(x * 0.00001, digits=5) for x in row]
        push!(a2, tmp)
    end
end

# with multithreading
time2 = @elapsed @profview begin
    a3 = Array{Array{Float64, 1}, 1}(undef, n)
    Threads.@threads for i in 1:n
        tmp = [round(x * 0.00001, digits=5) for x in a1[i]]
        a3[i] = tmp
    end
end

println("time0=$time0, time1=$time1, time2=$time2")

There’s nothing in this picture that I’m familiar with. What is the crucial info I should extract from those plots?

Using

using Profile
Profile.clear()
a1 = init()
round_serial(a1)
@profile for i in 1:1000; round_serial(a1); end
Juno.profiler()

For my example I see something like this in Juno/Atom,

where you can navigate from the profile pane to the source code. In the source-code pane, bigger bars mean a larger share of the runtime. Red indicates parts of the program that allocate; yellow indicates dynamic dispatch.
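
If you stay in VSCode, the Julia extension also provides an @profview macro, so something like the following sketch (untested on my side, and assuming the extension is installed) should give you a similar flame graph there:

using Profile
Profile.clear()
a1 = init()
round_serial(a1)               # run once first so compilation is not profiled
@profview for i in 1:1000; round_serial(a1); end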

Cool for you :stuck_out_tongue:

I’m using the VSCode IDE; Juno is not set up or installed for me. Hmm, maybe I should start a new topic on Discourse?