@threads vs @spawn

Hi, I’d appreciate some feedback about this issue. I’m trying to understand the difference between @threads and @spawn.

I think the former is easy to understand. I have an experiment that shuffles cards many, many times to see if a royal flush appears. I have supplied --threads=8 on the command line, so the following loop processes 8 batches in parallel at any one time and finishes only when all 20 batches complete.

    batches = 20
    @threads for i in 1:batches
        batch!(...)
    end

Indeed, that’s what it does.

But I had read about @spawn before I understood the simpler syntax of @threads, so I had originally written the loop this way:

    @sync for i in 1:batches
        Threads.@spawn batch!(...)
    end

I thought it would do the same thing as above… Are they equivalent? This spawns 20 tasks at once, and I guess 8 of them will run at any one time; @sync should wait until they have all completed.

Apparently, performance-wise they are not the same; the second one doesn’t behave like the first. In “top” I can see the julia process start at 800% CPU (correct: 8 threads). But after two batches completed, CPU dropped to 400%… then, after a much longer time, a couple more batches completed and CPU dropped to 200%. Eventually CPU dropped to 100% without any more batches completing. That’s obviously not what I intended, and the experiment ran so long I had to Ctrl-C it.

How come?

Thanks

Happy to supply source code (about 100 lines). Let me know if anyone wants to see it.


In order to see what happens, I suggest two little test functions returning the thread load:

    using .Threads

    function threaded(batches)
        ret = zeros(Int, nthreads())
        @threads for i in 1:batches
            ret[threadid()] += 1
        end
        return ret
    end

    function spawned(batches)
        ret = zeros(Int, nthreads())
        @sync for i in 1:batches
            Threads.@spawn ret[threadid()] += 1
        end
        return ret
    end

then:

    julia> threaded(20)
    8-element Array{Int64,1}:
     3
     3
     3
     3
     2
     2
     2
     2

    julia> spawned(20)
    8-element Array{Int64,1}:
     0
     8
     9
     1
     2
     0
     0
     0

If you want uniformly loaded threads for parallel computation, use @threads for ....

@spawn is more for dynamically spawning concurrent tasks. The manual says of it:

Create and run a Task on any available thread.

There is no guarantee of balanced load. In a bad case you can even get:

    julia> spawned(20)
    8-element Array{Int64,1}:
      1
     13
      1
      1
      1
      1
      1
      1

Thanks a lot. I knew someone would know right away. This is an awesome language. So fast, so easy to write threaded code, so many experts willing to help.

Thanks again.


Sorry for the bump, but from this (super-enlightening) example, it sounds like @threads is just like @spawn, but more constrained. By that logic, it would seem that @spawn is generally preferable, but I’m pretty sure that’s wrong, right?

I agree that @spawn (or Task) is indeed more flexible than @threads and provides more fine-grained control. But if the situation allows it, you can just stick with @threads.
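For example, @spawn makes it natural to launch unevenly sized tasks and collect their results with fetch, which doesn’t fit the uniform-iteration model of @threads. A minimal sketch of my own (uneven_sum is just an illustrative name, not from this thread):

    using Base.Threads: @spawn

    function uneven_sum(ns)
        # one task per piece of work; the workloads differ wildly in size
        tasks = [@spawn sum(sin, 1:n) for n in ns]
        return sum(fetch, tasks)   # fetch blocks until each task has finished
    end

    uneven_sum([10^3, 10^7, 10^5])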

Note that the performance difference does not seem to exist anymore. Sure, @pbayer’s example still shows less balanced distribution across threads, but it is also not performing any calculations, so that does not matter. For a computationally more intensive example, consider

    using .Threads
    using BenchmarkTools

    function threaded(batches, A)
        ret = zeros(Int, nthreads())
        s = 0  # Do something with our calculations so they don't get compiled away
        @threads for i in 1:batches
            M = view(A, :, :, i)
            ret[threadid()] += 1
            s += sum(M * M')  # Note: race condition, so incorrect final s, but not important
        end
        return ret, s
    end

    function spawned(batches, A)
        ret = zeros(Int, nthreads())
        s = 0
        @sync for i in 1:batches
            @spawn begin
                M = view(A, :, :, i)
                ret[threadid()] += 1
                s += sum(M * M')
            end
        end
        return ret, s
    end

    batches = 20
    A = rand(1000, 1000, batches)

    # For the timings below nthreads() == 8, and BLAS.get_num_threads() is at its default of 4.
    # (Using BLAS.set_num_threads(1) approximately halves the execution time, both for threaded and spawned.)
    @btime threaded($batches, $A)
    #    222.530 ms (124 allocations: 152.59 MiB)
    #  ([2, 3, 3, 3, 2, 3, 2, 2], 7.508622562098708e8)

    @btime spawned($batches, $A)
    #    216.786 ms (193 allocations: 152.60 MiB)
    #  ([2, 3, 3, 2, 3, 2, 3, 2], 7.505509242284892e8)

I think this is a bad example: each task takes so little time that it typically finishes before the next one is even spawned, so they might as well run serially. To better model a real computation, you should make each task hold on to its thread for a while, e.g. with Libc.systemsleep:

    function spawned(batches)
        ret = zeros(Int, nthreads())
        @sync for i in 1:batches
            Threads.@spawn begin
                ret[threadid()] += 1
                Libc.systemsleep(0.001)
            end
        end
        return ret
    end

Please don’t use threadid: PSA: Thread-local state is no longer recommended

While good advice in general, how else would you obtain the (initial) distribution of tasks across threads?

For this particular case (measuring the distribution of tasks across threads), that advice arguably does not apply. But I agree it’s a bad idea in general, since tasks may migrate between threads. They don’t in these particular model runs, but they might in the real shuffling problem.
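If the goal is only to observe where tasks run, one alternative that avoids mutating shared state from inside the tasks is to have each task return the threadid() it sees and tally the results afterwards. A sketch of my own variant (spawn_distribution is just an illustrative name):

    using Base.Threads: @spawn, nthreads, threadid

    function spawn_distribution(batches)
        tasks = [@spawn threadid() for _ in 1:batches]   # each task reports the thread it runs on
        ret = zeros(Int, nthreads())
        for id in fetch.(tasks)
            ret[id] += 1
        end
        return ret
    end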

No, that’s not wrong. Generally speaking, @spawn is preferable. However, it doesn’t have a static option. That is, if you want a static task-thread assignment (i.e. no task migration), the only way to get it (without using packages or making ccalls) is through @threads :static.
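For reference, the static schedule is just an option on the macro; a minimal sketch (do_work is a placeholder of mine for the loop body):

    using Base.Threads

    @threads :static for i in 1:20
        # iterations are split into fixed chunks, one per thread, with no migration
        do_work(i)   # `do_work` is a placeholder for the actual computation
    end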

Packages like OhMyThreads.jl try to provide a more consistent and configurable API. For example, @threads becomes @tasks (which is configurable), and there is also @spawnat.
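Roughly along these lines; this is a hedged sketch from memory, so please check the OhMyThreads.jl documentation for the exact option names (do_work is again a placeholder):

    using OhMyThreads: @tasks, @set

    @tasks for i in 1:20
        @set scheduler = :static   # option name from memory; see the package docs
        do_work(i)                 # `do_work` is a placeholder
    end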
