I’m trying to speed up my for loop with multithreading, as follows (code from test3.jl):

time0 = @elapsed begin
    n = 250
    a1 = Array{Array{Float64, 1}, 1}(undef, n)
    Threads.@threads for i in 1:n
        a1[i] = [i+0.12345, 0]
        # push!(a1, [i+0.12345, 0])
    end
end
# without multithreading
time = @elapsed try
    a2 = Array{Array{Float64, 1}, 1}()
    for row in a1
        tmp = [round(x * 0.00001, digits=5) for x in row]
        push!(a2, tmp)
    end
catch
end
# with multithreading
time2 = @elapsed try
    a3 = Array{Array{Float64, 1}, 1}(undef, n)
    Threads.@threads for i in 1:n
        tmp = [round(x * 0.00001, digits=5) for x in a1[i]]
        a3[i] = tmp
    end
catch
end
println("time0=$time0, time=$time, time2=$time2")

I know that time2 > time because of overhead: starting 4 threads costs more time than running the loop directly without parallelism. But I still want to ask: is there any chance to speed up this for loop? Maybe I should write more efficient code, but how?

Hi, it would be nice if you could edit your post and wrap the code in triple backticks. That way it would be easier for us to copy and paste it. Thanks!

If you need more speed, this MWE may not be representative of your real problem. I can’t tell whether your real problem size is bigger, or whether this is a hot loop that is executed many times.

Thanks for your answer! This is only one part of a function (f1). f1 contains only one loop, with 250 iterations. I want to speed up f1 by speeding up this loop with multithreading, but it does not seem to work.
The whole program is very big and contains many different functions. My final goal is to speed up the whole program, and speeding up f1 is only my first attempt. Maybe I should move on and try to speed up other functions.

That is what I suspected. Did you try profiling your program? I’d recommend using a visual profiler in VSCode or Atom/Juno (which I still slightly prefer).

Edit: another idea. If your MWE is representative of your problem, allocations could be part of the problem. I see some room for improvement there.
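As a rough illustration of the allocation point, here is a minimal sketch that assumes the rows could be stored as columns of a single 2×n matrix instead of n small vectors. That layout, and the names `A`, `out`, `round_alloc` and `round_inplace!`, are hypothetical and not taken from the original program; whether this fits the real data is something only profiling can tell.

```julia
# Hypothetical alternative layout: one 2×n matrix instead of n two-element vectors.
n = 250
A = rand(2, n)

# Allocating version: builds one new matrix per call.
round_alloc(A) = round.(A; digits=5)

# In-place version: writes into a preallocated output, so the loop
# itself allocates no per-row vectors.
function round_inplace!(out, A)
    out .= round.(A; digits=5)
    return out
end

out = similar(A)          # allocate the output buffer once
round_inplace!(out, A)    # reuse it on every call
```

With a layout like this, the buffer `out` can be reused across calls, which avoids creating 250 small arrays each time the loop runs.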

But the original benchmark is problematic: the try-catch blocks mask the fact that a1 is not defined. Corrected benchmark:

using BenchmarkTools
init() = begin
    n = 250
    a1 = Vector{Vector{Float64}}(undef, n)
    for i in 1:n
        a1[i] = [rand(), rand()]
    end
    a1
end
# without multithreading
round_serial(a1) = begin
    n = length(a1)
    a2 = Vector{Vector{Float64}}(undef, n)
    for i in 1:n
        a2[i] = [round(x, digits=5) for x in a1[i]]
    end
    a2
end
# with multithreading
round_parallel(a1) = begin
    n = length(a1)
    a3 = Vector{Vector{Float64}}(undef, n)
    Threads.@threads for i in 1:n
        a3[i] = [round(x, digits=5) for x in a1[i]]
    end
    a3
end
a1 = init()
a2 = @btime round_serial($a1)
a3 = @btime round_parallel($a1)
@assert isapprox(a3, a2)
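To make the try-catch point concrete, here is a small sketch of how an empty `catch` can hide a failure inside a timed block. The variable `undefined_name_for_demo` is deliberately undefined and exists only for this illustration.

```julia
# An empty `catch` silently swallows the error, so the measured time
# says nothing about the work that was supposed to run.
t = @elapsed try
    undefined_name_for_demo + 1   # throws UndefVarError
catch
end

# Capturing the exception instead makes the failure visible.
err = try
    undefined_name_for_demo + 1
    nothing
catch e
    e
end
println("elapsed = $t s, error = $(typeof(err))")
```

Dropping the try-catch while benchmarking, or at least logging the caught exception, would surface this kind of problem immediately.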

OK, thanks for checking! Until now I have still used @elapsed, because I need the returned time value and the total run time is what matters to me.
I have read the documentation on profiling, but I really don’t understand how to interpret the output of

ProfileView.@profview

Code for visualizing the profile:

using ProfileView
time0 = @elapsed begin
    n = 250
    a1 = Array{Array{Float64, 1}, 1}(undef, n)
    Threads.@threads for i in 1:n
        a1[i] = [i+0.12345, 0]
        # push!(a1, [i+0.12345, 0])
    end
end
# without multithreading
time1 = @elapsed @profview begin
    a2 = Array{Array{Float64, 1}, 1}()
    for row in a1
        tmp = [round(x * 0.00001, digits=5) for x in row]
        push!(a2, tmp)
    end
end
# with multithreading
time2 = @elapsed @profview begin
    a3 = Array{Array{Float64, 1}, 1}(undef, n)
    Threads.@threads for i in 1:n
        tmp = [round(x * 0.00001, digits=5) for x in a1[i]]
        a3[i] = tmp
    end
end
println("time0=$time0, time1=$time1, time2=$time2")

You can navigate from the profile pane to the source code. In the source code pane, bigger bars mean a larger share of the runtime. Red indicates parts of the program that allocate; yellow indicates dynamic dispatch.
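If the graphical view is hard to read, a text summary from the `Profile` standard library can be a useful starting point. The following is only a sketch: `round_rows` is a hypothetical stand-in for the loop in f1, and the repetition count is arbitrary, chosen so that the sampling profiler collects enough samples from such a short loop.

```julia
using Profile

# Hypothetical stand-in for the loop inside f1.
function round_rows(a1)
    a2 = Vector{Vector{Float64}}(undef, length(a1))
    for i in eachindex(a1)
        a2[i] = [round(x * 0.00001, digits=5) for x in a1[i]]
    end
    return a2
end

a1 = [[rand(), rand()] for _ in 1:250]

round_rows(a1)     # run once first so that compilation is not profiled
Profile.clear()    # discard any samples collected earlier
@profile for _ in 1:10_000
    round_rows(a1) # repeat the short loop so the sampler sees it
end

# Flat listing sorted by sample count; lines with fewer than 10 samples are hidden.
Profile.print(format=:flat, sortedby=:count, mincount=10)
```

The lines with the highest counts are where most samples landed, which is the same information the bar widths convey in the graphical view.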