Simple multi-thread loop with array

Hello!

I am struggling with how @threads works. I believe I understand what it does, but I don’t get the expected result in the following MWE:

struct MyStr
    arr::Array{Int,1}
end
function MyStr(N::Int)
    return MyStr(collect(1:N))
end

function loop_seq(N::Int)
    a = Array{Int,1}(undef,N)
    B = MyStr(N)
    f(i::Int)::Int = (B.arr[i])-1
    for i in 1:N
        a[i] = f(i)
    end
    return a
end

function loop_par(N::Int)
    a = Array{Int,1}(undef,N)
    B = MyStr(N)
    f(i::Int)::Int = (B.arr[i])-1
    Threads.@threads for i in 1:N
        a[i] = f(i)
    end
    return a
end

function main()
    N = 10^8
    println("Seq")
    @time loop_seq(N)
    @time loop_seq(N)
    println("Par")
    @time loop_par(N)
    @time loop_par(N)
end

main()

I execute this file as: julia -t 2 test.jl.
Output for N=10^8:

Seq
  0.849747 seconds (4 allocations: 1.490 GiB, 1.02% gc time)
  0.891866 seconds (4 allocations: 1.490 GiB, 10.91% gc time)
Par
  0.868159 seconds (50.78 k allocations: 1.493 GiB, 11.89% gc time)
  0.865398 seconds (15 allocations: 1.490 GiB, 0.16% gc time)

Output for N=10^9:

Seq
 26.144827 seconds (4 allocations: 14.901 GiB, 0.03% gc time)
 17.237483 seconds (4 allocations: 14.901 GiB, 7.76% gc time)
Par
 15.957512 seconds (50.78 k allocations: 14.904 GiB, 7.18% gc time)
 25.792571 seconds (16 allocations: 14.901 GiB, 0.34% gc time)

For small values of N, loop_seq runs faster than loop_par, which I assume is the expected behavior. However, for values of N this large, I would expect the multi-threaded version to run faster.

I would really appreciate an explanation of this phenomenon and a possible workaround.

Thanks in advance.

Regards

I think this function is so simple that you may see little benefit from multi-threading it. Nevertheless, I see some benefit from about N = 10^5 (with 4 cores), but this will depend on the details of your computer:

julia> @btime loop_seq(10^3);
  975.000 ns (2 allocations: 15.88 KiB)
julia> @btime loop_par(10^3);
  6.292 μs (23 allocations: 17.50 KiB)

julia> @btime loop_seq(10^8);
  425.799 ms (4 allocations: 1.49 GiB)
julia> @btime loop_par(10^8);
  296.791 ms (26 allocations: 1.49 GiB)

At size 10^9, notice that you are using ~15 GB of memory, and that the times fluctuate wildly. My guess is that you’re timing something like how fast the OS can swap things out to disk.

In this particular case you seem to be limited by other things (I cannot even run your code with N=10^9). But as general advice, it is better to launch fewer threads and do more work on each thread when possible. That could be done, for example, like this:

function loop_par(N::Int)
    a = Array{Int,1}(undef,N)
    B = MyStr(N)
    f(i::Int)::Int = (B.arr[i])-1
    n_per_thread = N÷Threads.nthreads() # ÷ is \div
    Threads.@threads for i in 1:Threads.nthreads()
        for j in 1:n_per_thread
            ii = (i-1)*n_per_thread + j
            a[ii] = f(ii)
        end
    end
    return a
end

(this one will only work if N is a multiple of the number of threads)
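
A variant that handles any N is to chunk the range with Iterators.partition, which simply makes the last chunk shorter (a sketch; loop_par_chunks is just an illustrative name):

function loop_par_chunks(N::Int)
    a = Vector{Int}(undef, N)
    B = MyStr(N)
    f(i::Int)::Int = B.arr[i] - 1
    # cld rounds the chunk size up, so every index lands in exactly one chunk
    chunks = collect(Iterators.partition(1:N, cld(N, Threads.nthreads())))
    Threads.@threads for chunk in chunks
        for i in chunk   # each task fills a disjoint block of a
            a[i] = f(i)
        end
    end
    return a
end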

Thanks for the answer. OK, I got your point. Yes, I have also noticed that the times are not “stable” and vary; I guess it is related to the memory usage, as you point out. Thanks!

Thanks for the answer. I see what you did there… That’s interesting! I will try to implement it in my real use case. Thank you very much for your reply!

@mcabbott, I changed the function f to the sine of the sine of the sine… and I got improved timings for the parallel version. Indeed, you were right about what I was actually timing.
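
Concretely, the compute-bound variant was something like this (a sketch; note that the output element type becomes Float64):

function loop_par_sin(N::Int)
    a = Vector{Float64}(undef, N)
    B = MyStr(N)
    # with enough arithmetic per element, the loop is no longer
    # memory-bandwidth-bound, so the threads actually help
    f(i::Int)::Float64 = sin(sin(sin(B.arr[i])))
    Threads.@threads for i in 1:N
        a[i] = f(i)
    end
    return a
end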

@lmiq, is it safe to write at position a[ii]? I guess it is, because each thread writes to a different memory position. However, is it time consuming? I thought of creating a sub-array for each of the chunks associated with each thread, but sooner or later I have to “collect” the results into a single array. Would I need a lock then?

Thank you!

In that MWE it is, because each position is written by exactly one thread. But if different threads may write to the same position, you need to take care of that.
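
If, instead, the threads were accumulating into a single variable, one option is to give each chunk its own slot in an array of partial results and combine them at the end, so no lock is needed (a minimal sketch; threaded_sum is just an illustrative name):

function threaded_sum(x::Vector{Int})
    nt = Threads.nthreads()
    chunks = collect(Iterators.partition(eachindex(x), cld(length(x), nt)))
    partials = zeros(Int, length(chunks))
    Threads.@threads for c in eachindex(chunks)
        s = 0
        for i in chunks[c]
            s += x[i]
        end
        partials[c] = s   # each task writes only its own slot
    end
    return sum(partials)  # sequential combine of the per-chunk results
end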

That might be better (faster) or not; it depends. I am a somewhat naive user of parallelization, and I end up splitting and allocating everything per-thread by hand, which is not optimal. There are better solutions using, for example, FLoops.jl; you would just need to see how to use it. It does that splitting in an optimized way for you.
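
For the MWE in this thread, a FLoops version would look roughly like this (an untested sketch; ThreadedEx is the threaded executor that FLoops exports):

using FLoops

function loop_floop(N::Int)
    a = Vector{Int}(undef, N)
    B = MyStr(N)
    @floop ThreadedEx() for i in 1:N
        a[i] = B.arr[i] - 1   # FLoops picks the chunking/scheduling
    end
    return a
end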

Currently I’m doing this splitting and allocation by hand, as you mention, but I cannot see a real improvement on my personal laptop. I will try it later with more CPUs on another machine.

I didn’t know about the FLoops package. I will see if I can integrate it with my code.

Thank you again.

I think Threads.@threads already does roughly what you suggest – its help says:

“The only currently supported value is :static, which creates one task per thread and divides the iterations equally among them.”

Using all nthreads() can still be too many for small tasks, and it won’t try to balance tasks of varying (or unknown) size. This is where I think FLoops/FoldsThreads.jl etc. are smarter.
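
For instance, something like the work-stealing executor from FoldsThreads should cope with iterations of very uneven cost better than an even static split would (an untested sketch):

using FLoops, FoldsThreads

function uneven(xs)
    # WorkStealingEx lets idle threads steal pending work, which helps
    # when iterations have very different costs
    @floop WorkStealingEx() for x in xs
        y = sum(abs2, randn(x))   # per-iteration cost grows with x
        @reduce(total += y)
    end
    return total
end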

What part? Splitting the loop? I understood that it launches one thread per loop iteration (not all simultaneously). I have examples where splitting the loop manually makes a huge difference (I can show those later if that helps).

I mean that tt here is more like tp than like ts:

julia> function tt(n::Int)
       z = zeros(Int, n); Threads.@threads for i in 1:n; z[i] = Threads.threadid(); end; z'
       end;

julia> function tp(n::Int)
       z = zeros(Int, n); ps = collect(Iterators.partition(1:n, cld(n,Threads.nthreads())))
       Threads.@sync for p in ps; Threads.@spawn for i in p; z[i] = Threads.threadid(); end; end; z'
       end;

julia> function tp2(n::Int)  # very similar
       z = zeros(Int, n); ps = collect(Iterators.partition(1:n, cld(n,Threads.nthreads())))
       Threads.@threads for p in ps; for i in p; z[i] = Threads.threadid(); end; end; z'
       end;

julia> function ts(n::Int)
       z = zeros(Int, n); Threads.@sync for i in 1:n; Threads.@spawn z[i] = Threads.threadid(); end; z'
       end;

julia> @btime tt(10^6)  # @threads
  279.834 μs (23 allocations: 7.63 MiB)
1×1000000 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  …  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4

julia> @btime tp(10^6)  # partitioned by hand
  284.458 μs (39 allocations: 7.63 MiB)
1×1000000 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  …  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3

julia> @btime tp2(10^6)  # ditto
  265.333 μs (24 allocations: 7.63 MiB)
1×1000000 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  …  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4

julia> @btime ts(10^6)  # spawn for every iteration: 2000x slower
  502.687 ms (6000032 allocations: 474.39 MiB)
1×1000000 adjoint(::Vector{Int64}) with eltype Int64:
 2  4  2  4  2  4  2  4  3  3  3  2  2  3  2  2  …  2  4  2  3  4  3  4  3  3  2  4  2  3  3  3



Yeah… I admittedly will have to revisit the situations in which that seemed to have worked; I am not able to reproduce it in an MWE (and I even have that improvement from manual splitting recorded in a video, during a lecture… :man_shrugging:). It was something more complex than that (the computation of all the distances between the particles of a simulation), but anyway, I would have to understand why that made a difference in that particular case.
