I am struggling with how @threads works. I believe I understand what it does but I don’t get the expected result in the following MWE:
struct MyStr
    arr::Array{Int,1}
end

function MyStr(N::Int)
    return MyStr(collect(1:N))
end

function loop_seq(N::Int)
    a = Array{Int,1}(undef, N)
    B = MyStr(N)
    f(i::Int)::Int = (B.arr[i]) - 1
    for i in 1:N
        a[i] = f(i)
    end
    return a
end

function loop_par(N::Int)
    a = Array{Int,1}(undef, N)
    B = MyStr(N)
    f(i::Int)::Int = (B.arr[i]) - 1
    Threads.@threads for i in 1:N
        a[i] = f(i)
    end
    return a
end

function main()
    N = 10^8
    println("Seq")
    @time loop_seq(N)
    @time loop_seq(N)
    println("Par")
    @time loop_par(N)
    @time loop_par(N)
end

main()
I execute this file as: julia -t 2 test.jl.
Output for N=10^8:
For small values of N, loop_seq runs faster than loop_par, which I assume is the expected behavior. However, for values of N this large, I would expect the multi-threaded version to run faster.
I would really appreciate an explanation of this phenomenon and a possible workaround.
I think this function is so simple that you may see little benefit from multi-threading it. Nevertheless, I see some benefit from about N = 10^5 (with 4 cores), but this will depend on the details of your computer.
At size 10^9, notice that you are using 15 GB of memory, and that the times fluctuate wildly. My guess is that you are timing something like how fast the OS can move things to disk.
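For reference, a minimal sketch of how one might sweep over problem sizes to find the crossover point. This assumes the BenchmarkTools package is installed; it is not part of your MWE:

using BenchmarkTools   # @btime runs the expression repeatedly and reports the minimum time

for N in (10^4, 10^5, 10^6, 10^7)
    println("N = $N")
    print("  seq: "); @btime loop_seq($N)   # $N interpolates the value to avoid global-variable overhead
    print("  par: "); @btime loop_par($N)
end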
In this particular case you seem to be limited by other things (I cannot even run your code with N=10^9). But as general advice, it is better to launch fewer threads and do more work on each thread when possible. That could be done with, for example:
function loop_par(N::Int)
    a = Array{Int,1}(undef, N)
    B = MyStr(N)
    f(i::Int)::Int = (B.arr[i]) - 1
    n_per_thread = N ÷ Threads.nthreads()  # ÷ is \div
    Threads.@threads for i in 1:Threads.nthreads()
        for j in 1:n_per_thread
            ii = (i - 1) * n_per_thread + j
            a[ii] = f(ii)
        end
    end
    return a
end
(this one will only work if N is a multiple of the number of threads)
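If N is not a multiple of the number of threads, the same idea works by clipping the last chunk. A sketch of that variation (loop_par_chunks is just a name I made up for it):

function loop_par_chunks(N::Int)
    a = Array{Int,1}(undef, N)
    B = MyStr(N)
    f(i::Int)::Int = (B.arr[i]) - 1
    nt = Threads.nthreads()
    n_per_thread = cld(N, nt)  # ceiling division, so nt chunks always cover 1:N
    Threads.@threads for t in 1:nt
        lo = (t - 1) * n_per_thread + 1
        hi = min(t * n_per_thread, N)  # the last chunk may be shorter (or empty)
        for ii in lo:hi
            a[ii] = f(ii)
        end
    end
    return a
end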
Thanks for the answer. Ok, I got your point. Yes, I have also noticed that the times are not “stable” and vary between runs. I guess it is related to the memory usage, as you point out. Thanks!
Thanks for the answer. I see what you did there… that’s interesting! I will try to implement it in my real use case. Thank you very much for your reply!
@mcabbott , I changed the function f to the sine of the sine of the sine … and got improved timings for the parallel version. Indeed, you were right about what I was actually timing.
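For reference, a sketch of the kind of change I mean (the exact nesting depth is not important; the point is that each iteration now does real computation instead of one array load and a subtraction):

# Heavier per-iteration work. Note f now returns a Float64,
# so a must be allocated as Array{Float64,1}(undef, N):
f(i::Int) = sin(sin(sin(B.arr[i])))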
@lmiq , is it safe to write at position a[ii]? I guess it is, because each thread writes to a different memory position. However, is it time-consuming? I thought of creating a sub-array for each of the chunks associated with each thread, but sooner or later I have to “collect” the results into a single array. Would I need a lock then?
In that MWE it is. But if different threads may write to the same position, you need to take care of that.
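To illustrate the difference, a minimal sketch (N is a placeholder size I chose for the example):

N = 10^6
a = zeros(Int, N)
Threads.@threads for i in 1:N
    a[i] = i - 1          # each index is written by exactly one thread: no lock needed
end

total = Ref(0)            # shared state, updated by every thread
lck = ReentrantLock()
Threads.@threads for i in 1:N
    lock(lck) do          # without the lock, total[] += a[i] would be a data race
        total[] += a[i]
    end
end

Of course, taking a lock on every iteration is slow; the usual pattern is to accumulate a partial result per chunk and combine the partials at the end.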
That might be better (faster) or not; it depends. I am a somewhat naive user of parallelization, and I end up splitting and allocating everything per-thread by hand, which is not optimal. There are better solutions using, for example, FLoops; you would just need to see how to use it. It does that in an optimized way for you.
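For example, a minimal sketch of the same loop written with FLoops (assuming the package is installed; loop_floop is just a name for this sketch):

using FLoops  # install with: ] add FLoops

function loop_floop(N::Int)
    a = Array{Int,1}(undef, N)
    B = MyStr(N)
    @floop ThreadedEx() for i in 1:N  # ThreadedEx() runs the loop across all threads
        a[i] = B.arr[i] - 1           # disjoint indices, so this mutation is safe
    end
    return a
end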
Currently I’m doing this splitting and allocation by hand, as you mention, but I cannot see a real improvement on my personal laptop. I will try it later with more CPUs on another machine.
I didn’t know about the FLoops package. I will see if I can integrate it with my code.
I think Threads.@threads already does roughly what you suggest – its help says:
“The only currently supported value is :static, which creates one task per thread and divides the iterations equally among them.”
Using all nthreads() can still be too many for small tasks, and it won’t try to balance tasks of varying (or unknown) size. This is where I think FLoops/FoldsThreads.jl etc. are smarter.
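For illustration, one way to get some of that balancing in plain Base Julia is to spawn more tasks than threads and let the scheduler place them. A sketch (the chunk count 4 * nthreads() is an arbitrary choice of mine):

function loop_spawn(N::Int)
    a = Array{Int,1}(undef, N)
    B = MyStr(N)
    nchunks = 4 * Threads.nthreads()            # more tasks than threads, for balancing
    ranges = Iterators.partition(1:N, cld(N, nchunks))
    tasks = map(ranges) do r
        Threads.@spawn for i in r               # the scheduler assigns tasks to free threads
            a[i] = B.arr[i] - 1
        end
    end
    foreach(wait, tasks)                        # wait for all chunks to finish
    return a
end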
What part? Splitting the loop? I understood that it launches one thread per loop iteration (not all simultaneously). I have examples where splitting the loop manually makes a huge difference (if wanted, I can show those later).
Yeah… admittedly, I will have to revisit the situations in which that seemed to work. I am not able to reproduce it in an MWE (and I even have that improvement from manual splitting recorded in a video, from a lecture…). It was something more complex than that (computing all the distances between the particles of a simulation), but anyway, I would have to understand why the manual split made a difference in that particular case.