Simple multi-thread loop with array

Hello!

I am struggling with how @threads works. I believe I understand what it does, but I don’t get the expected result in the following MWE:

struct MyStr
    arr::Array{Int,1}
end
function MyStr(N::Int)
    return MyStr(collect(1:N))
end

function loop_seq(N::Int)
    a = Array{Int,1}(undef,N)
    B = MyStr(N)
    f(i::Int)::Int = (B.arr[i])-1
    for i in 1:N
        a[i] = f(i)
    end
    return a
end

function loop_par(N::Int)
    a = Array{Int,1}(undef,N)
    B = MyStr(N)
    f(i::Int)::Int = (B.arr[i])-1
    Threads.@threads for i in 1:N
        a[i] = f(i)
    end
    return a
end

function main()
    N = 10^8
    println("Seq")
    @time loop_seq(N)
    @time loop_seq(N)
    println("Par")
    @time loop_par(N)
    @time loop_par(N)
end

main()

I execute this file as: julia -t 2 test.jl.
Output for N=10^8:

Seq
  0.849747 seconds (4 allocations: 1.490 GiB, 1.02% gc time)
  0.891866 seconds (4 allocations: 1.490 GiB, 10.91% gc time)
Par
  0.868159 seconds (50.78 k allocations: 1.493 GiB, 11.89% gc time)
  0.865398 seconds (15 allocations: 1.490 GiB, 0.16% gc time)

Output for N=10^9:

Seq
 26.144827 seconds (4 allocations: 14.901 GiB, 0.03% gc time)
 17.237483 seconds (4 allocations: 14.901 GiB, 7.76% gc time)
Par
 15.957512 seconds (50.78 k allocations: 14.904 GiB, 7.18% gc time)
 25.792571 seconds (16 allocations: 14.901 GiB, 0.34% gc time)

For small values of N, loop_seq runs faster than loop_par, which I assume is the expected behavior. However, for values of N this large, I would expect the multi-threaded version to run faster.

I would really appreciate an explanation of this phenomenon and a possible workaround.

Thanks in advance.

Regards

I think this function is so simple that you may see little benefit from multi-threading it. Nevertheless, I see some benefit from about N = 10^5 (with 4 cores), but this will depend on the details of your computer:

julia> @btime loop_seq(10^3);
  975.000 ns (2 allocations: 15.88 KiB)
julia> @btime loop_par(10^3);
  6.292 μs (23 allocations: 17.50 KiB)

julia> @btime loop_seq(10^8);
  425.799 ms (4 allocations: 1.49 GiB)
julia> @btime loop_par(10^8);
  296.791 ms (26 allocations: 1.49 GiB)

At size 10^9, notice that you are using ~15 GB of memory, and that the times fluctuate wildly. My guess is that you’re timing something like how fast the OS can swap things out to disk.

In this particular case you seem to be limited by other things (I cannot even run your code with N=10^9). But as general advice, it is better to launch fewer threads and do more work on each thread when possible. That could be done, for example, like this:

function loop_par(N::Int)
    a = Array{Int,1}(undef,N)
    B = MyStr(N)
    f(i::Int)::Int = (B.arr[i])-1
    n_per_thread = N÷Threads.nthreads() # ÷ is \div
    Threads.@threads for i in 1:Threads.nthreads()
        for j in 1:n_per_thread
            ii = (i-1)*n_per_thread + j
            a[ii] = f(ii)
        end
    end
    return a
end

(this one will only work if N is a multiple of the number of threads)
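
A variant that handles any N is to chunk the range with Iterators.partition, which simply makes the last chunk shorter (a sketch; loop_par_chunks is just an illustrative name):

function loop_par_chunks(N::Int)
    a = Vector{Int}(undef, N)
    B = MyStr(N)
    f(i::Int)::Int = B.arr[i] - 1
    # cld rounds the chunk size up, so every index lands in exactly one chunk
    chunks = collect(Iterators.partition(1:N, cld(N, Threads.nthreads())))
    Threads.@threads for chunk in chunks
        for i in chunk   # each task fills a disjoint block of a
            a[i] = f(i)
        end
    end
    return a
end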

Thanks for the answer. OK, I got your point. Yes, I have also noticed that the times are not “stable” and vary; I guess it is related to the memory usage, as you point out. Thanks!

Thanks for the answer. I see what you did there… That’s interesting! I will try to implement it in my real use case. Thank you very much for your reply!

@mcabbott, I changed the function f to the sine of the sine of the sine… and I got improved timings for the parallel version. Indeed, you were right about what I was actually timing.
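
Concretely, the compute-bound variant was something like this (a sketch; note that the output element type becomes Float64):

function loop_par_sin(N::Int)
    a = Vector{Float64}(undef, N)
    B = MyStr(N)
    # with enough arithmetic per element, the loop is no longer
    # memory-bandwidth-bound, so the threads actually help
    f(i::Int)::Float64 = sin(sin(sin(B.arr[i])))
    Threads.@threads for i in 1:N
        a[i] = f(i)
    end
    return a
end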

@lmiq, is it safe to write at position a[ii]? I guess it is, because each thread writes to a different memory position. However, is it time consuming? I thought of creating a sub-array for each of the chunks associated with each thread, but sooner or later I have to “collect” the results into a single array. Would I need a lock then?

Thank you!

In that MWE it is, because each position is written by exactly one thread. But if different threads may write to the same position, you need to take care of that.
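
If, instead, the threads were accumulating into a single variable, one option is to give each chunk its own slot in an array of partial results and combine them at the end, so no lock is needed (a minimal sketch; threaded_sum is just an illustrative name):

function threaded_sum(x::Vector{Int})
    nt = Threads.nthreads()
    chunks = collect(Iterators.partition(eachindex(x), cld(length(x), nt)))
    partials = zeros(Int, length(chunks))
    Threads.@threads for c in eachindex(chunks)
        s = 0
        for i in chunks[c]
            s += x[i]
        end
        partials[c] = s   # each task writes only its own slot
    end
    return sum(partials)  # sequential combine of the per-chunk results
end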

That might be better (faster) or not; it depends. I am a somewhat naive user of parallelization, and I end up splitting and allocating everything per-thread by hand, which is not optimal. There are better solutions using, for example, FLoops.jl; you would just need to see how to use it. It does that splitting in an optimized way for you.
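
For the MWE in this thread, a FLoops version would look roughly like this (an untested sketch; ThreadedEx is the threaded executor that FLoops exports):

using FLoops

function loop_floop(N::Int)
    a = Vector{Int}(undef, N)
    B = MyStr(N)
    @floop ThreadedEx() for i in 1:N
        a[i] = B.arr[i] - 1   # FLoops picks the chunking/scheduling
    end
    return a
end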

Currently I’m doing this splitting and allocation by hand, as you mention, but I cannot see a real improvement on my personal laptop. I will try it later with more CPUs on another machine.

I didn’t know about the FLoops package. I will see if I can integrate it with my code.

Thank you again.

I think Threads.@threads already does roughly what you suggest – its help says:

“The only currently supported value is :static, which creates one task per thread and divides the iterations equally among them.”

Using all nthreads() can still be too many for small tasks, and it won’t try to balance tasks of varying (or unknown) size. This is where I think FLoops/FoldsThreads.jl etc. are smarter.
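
For instance, something like the work-stealing executor from FoldsThreads should cope with iterations of very uneven cost better than an even static split would (an untested sketch):

using FLoops, FoldsThreads

function uneven(xs)
    # WorkStealingEx lets idle threads steal pending work, which helps
    # when iterations have very different costs
    @floop WorkStealingEx() for x in xs
        y = sum(abs2, randn(x))   # per-iteration cost grows with x
        @reduce(total += y)
    end
    return total
end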

What part? Splitting the loop? I understood that it launches one thread per loop iteration (not all simultaneously). I have examples where splitting the loop manually makes a huge difference (I can show those later if that helps).

I mean that tt here is more like tp than like ts:

julia> function tt(n::Int)
       z = zeros(Int, n); Threads.@threads for i in 1:n; z[i] = Threads.threadid(); end; z'
       end;

julia> function tp(n::Int)
       z = zeros(Int, n); ps = collect(Iterators.partition(1:n, cld(n,Threads.nthreads())))
       Threads.@sync for p in ps; Threads.@spawn for i in p; z[i] = Threads.threadid(); end; end; z'
       end;

julia> function tp2(n::Int)  # very similar
       z = zeros(Int, n); ps = collect(Iterators.partition(1:n, cld(n,Threads.nthreads())))
       Threads.@threads for p in ps; for i in p; z[i] = Threads.threadid(); end; end; z'
       end;

julia> function ts(n::Int)
       z = zeros(Int, n); Threads.@sync for i in 1:n; Threads.@spawn z[i] = Threads.threadid(); end; z'
       end;

julia> @btime tt(10^6)  # @threads
  279.834 μs (23 allocations: 7.63 MiB)
1×1000000 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  …  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4

julia> @btime tp(10^6)  # partitioned by hand
  284.458 μs (39 allocations: 7.63 MiB)
1×1000000 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  …  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3

julia> @btime tp2(10^6)  # ditto
  265.333 μs (24 allocations: 7.63 MiB)
1×1000000 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  …  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4

julia> @btime ts(10^6)  # spawn for every iteration: 2000x slower
  502.687 ms (6000032 allocations: 474.39 MiB)
1×1000000 adjoint(::Vector{Int64}) with eltype Int64:
 2  4  2  4  2  4  2  4  3  3  3  2  2  3  2  2  …  2  4  2  3  4  3  4  3  3  2  4  2  3  3  3



Yeah… I admittedly will have to revisit the situations in which that seemed to have worked; I am not able to reproduce it in an MWE (and I even have that improvement from manual splitting recorded in a video, during a lecture… :man_shrugging:). It was something more complex than that (the computation of all the distances between the particles of a simulation), but anyway, I would have to understand why that made a difference in that particular case.
