Speed up for-loop with multithreading

Hi Guys,

I’m trying to speed up my for-loop with multithreading like the following (test3.jl code):

time0 = @elapsed begin
    n = 250
    a1 = Array{Array{Float64, 1}, 1}(undef, n)
    Threads.@threads for i in 1:n 
        a1[i] = [i+0.12345, 0]
        # push!(a1, [i+0.12345, 0])
    end
end

# without multithreading
time = @elapsed try
    a2 = Array{Array{Float64, 1}, 1}()
    for row in a1
        tmp = [round(x * 0.00001, digits=5) for x in row]
        push!(a2, tmp)
    end
catch
end

# with multithreading
time2 = @elapsed try
    a3 = Array{Array{Float64, 1}, 1}(undef, n)
    Threads.@threads for i in 1:n
        tmp = [round(x * 0.00001, digits=5) for x in a1[i]]
        a3[i] = tmp
    end
catch
end

println("time0=$time0, time=$time, time2=$time2")

and I run:

julia> include("test/test3.jl")
time0=0.072520981, time=0.020055303, time2=0.070619797

I have set the number of threads to 4.

I know time2 > time because of overhead, which means that starting 4 threads costs more time than directly running the loop without parallelism. But I still want to ask: is there any chance to speed up this for-loop? Maybe I should write more efficient code? But how do I do that?

Best Regards

Hi, it would be nice if you could edit your post and format the code with triple backticks. That way it would be easier for us to copy and paste it. Thanks!

Thanks for your hint! I’ve done it!

Correct, here is what I see using @btime:

  27.400 μs (1023 allocations: 43.67 KiB)
  5.283 μs (2 allocations: 64 bytes)
  5.267 μs (1 allocation: 16 bytes)
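
For reference, one way to get numbers like these is to wrap each block in a small function and call @btime on it. A minimal sketch, where fill_threaded is just a hypothetical wrapper around your first block:

using BenchmarkTools

# hypothetical wrapper around the first block, so @btime measures a
# function call instead of global-scope code
fill_threaded(n) = begin
    a = Vector{Vector{Float64}}(undef, n)
    Threads.@threads for i in 1:n
        a[i] = [i + 0.12345, 0.0]
    end
    a
end

@btime fill_threaded(250)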

If you need more speed, this MWE doesn’t seem representative of your real problem? I can’t estimate whether your problem size is bigger, or whether this is a hot loop that is executed many times?


Thanks for your answer! This is only a part of a function (f1). In f1 there is only one loop, with 250 iterations. I want to speed up f1 by speeding up this loop with multithreading, but it doesn’t seem to work.
The whole program is very big, and there are many different functions. My final goal is to speed up the whole program, and speeding up f1 is only my first step. Maybe I should move on and try to speed up other functions.

That is what I suspected. Did you try to profile your program? I’d recommend using a visual profiler in VSCode or Atom/Juno (which I still slightly prefer).

Edit: another idea. If your MWE is representative of your problem, allocations could be part of the issue. I see some room for improvement there.
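
For example, here is a minimal sketch (assuming it is acceptable to keep the 250 pairs in a single Matrix and to round in place), which makes the loop itself allocation-free:

using BenchmarkTools

# hypothetical low-allocation variant: one Matrix instead of a Vector of
# Vectors, rounded in place so the loop does not allocate
round_inplace!(m::AbstractMatrix{Float64}) = begin
    @inbounds for i in eachindex(m)
        m[i] = round(m[i], digits=5)
    end
    m
end

m = rand(2, 250)
@btime round_inplace!($m)    # expect 0 allocations inside the loop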


I don’t know about profiling yet. I’ll read the docs :).

But the original benchmark is problematic: the try-catch blocks mask that a1 is not defined. Corrected benchmark:

using BenchmarkTools 

init() = begin
    n = 250
    a1 = Vector{Vector{Float64}}(undef, n)
    for i in 1:n 
        a1[i] = [rand(), rand()]
    end
    a1
end

# without multithreading
round_serial(a1) = begin
    n = length(a1)
    a2 = Vector{Vector{Float64}}(undef, n)
    for i in 1:n 
        a2[i] = [round(x, digits=5) for x in a1[i]]
    end
    a2
end

# with multithreading
round_parallel(a1) = begin
    n = length(a1)
    a3 = Vector{Vector{Float64}}(undef, n)
    Threads.@threads for i in 1:n 
        a3[i] = [round(x, digits=5) for x in a1[i]]
    end
    a3
end

a1 = init()
a2 = @btime round_serial($a1)
a3 = @btime round_parallel($a1)
@assert isapprox(a3, a2)

shows some speedup for me:

  29.100 μs (251 allocations: 21.59 KiB)
  10.500 μs (270 allocations: 24.00 KiB)
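
The speedup of course depends on how many threads the session actually has. A quick sanity check, assuming you want 4 threads (start Julia with julia -t 4 or set JULIA_NUM_THREADS=4):

julia> Threads.nthreads()
4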

OK, thanks for your check! Until now I have still used @elapsed, because I need the returned time variable and the total run time is what matters to me.
I have read the docs about profiling, but I really don’t understand well how to interpret the output of

ProfileView.@profview

Code for visualizing the profile:

using ProfileView

time0 = @elapsed begin
    n = 250
    a1 = Array{Array{Float64, 1}, 1}(undef, n)
    Threads.@threads for i in 1:n 
        a1[i] = [i+0.12345, 0]
        # push!(a1, [i+0.12345, 0])
    end
end

# without multithreading
time1 = @elapsed @profview begin
    a2 = Array{Array{Float64, 1}, 1}()
    for row in a1 
        tmp = [round(x * 0.00001, digits=5) for x in row]
        push!(a2, tmp)
    end
end

# with multithreading
time2 = @elapsed @profview begin
    a3 = Array{Array{Float64, 1}, 1}(undef, n)
    Threads.@threads for i in 1:n
        tmp = [round(x * 0.00001, digits=5) for x in a1[i]]
        a3[i] = tmp
    end
end

println("time0=$time0, time1=$time1, time2=$time2")

There’s nothing in this picture that I’m familiar with. What is the crucial info I should extract from those plots?

Using

using Profile
Profile.clear()
a1 = init()
round_serial(a1)
@profile for i in 1:1000; round_serial(a1); end
Juno.profiler()

For my example I see something like this in Juno/Atom,

where you can navigate from the profile pane to the source code. In the source-code pane, bigger bars mean a larger share of the runtime. Red indicates parts of the program that allocate; yellow indicates dynamic dispatch.
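
If you stay in VSCode, the Julia extension also provides an @profview macro, so something like the following sketch (untested on my side, and assuming the extension is installed) should give you a similar flame graph there:

using Profile
Profile.clear()
a1 = init()
round_serial(a1)               # run once first so compilation is not profiled
@profview for i in 1:1000; round_serial(a1); end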

Cool for you :stuck_out_tongue:

I’m using the VSCode IDE; Juno is not set up or installed for me. Hmm, maybe I should start a new topic on Discourse?