Confusion regarding Threads.@spawn performance

I’m not able to use Threads.@spawn with the kind of performance I’d expect:

samuela@n64:~/dev/research/julia/odecontrol$ JULIA_NUM_THREADS=32 julia --project
   _       _ _(_)_     |  Documentation:
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.4.2 (2020-05-23)
 _/ |\__'_|_|_|\__'_|  |  Official release
|__/                   |

julia> using LinearAlgebra

julia> using BenchmarkTools

julia> @time svdvals(rand(1000, 1000))
  0.102739 seconds (33 allocations: 15.859 MiB)
1000-element Array{Float64,1}:

julia> @time svdvals(rand(1000, 1000))
  0.104076 seconds (12 allocations: 15.839 MiB, 3.85% gc time)
1000-element Array{Float64,1}:

Ok, so the SVD takes about 100ms. Got it.

julia> @time Threads.@spawn svdvals(rand(1000, 1000))
  0.003804 seconds (6.58 k allocations: 380.660 KiB)
Task (runnable) @0x00007f829b07b340

julia> @time Threads.@spawn svdvals(rand(1000, 1000))
  0.000123 seconds (27 allocations: 2.203 KiB)
Task (runnable) @0x00007f829d945870

We can spawn a thread for the SVD and get back a Task in ~0.1ms. It stands to reason then that we should be able to spin up 1000 Tasks in the same way in about 100ms. But instead this code just hangs forever:

julia> @time [Threads.@spawn svdvals(rand(1000, 1000)) for _ in 1:1000]

Why is that? Am I hitting some kind of thread limit?

svdvals should already be multithreaded (it calls into BLAS/LAPACK), so I presume that spawning Julia tasks on top of it just isn’t a good idea: BLAS threads don’t currently compose with Julia’s threads, so you’re probably running thousands of threads simultaneously there.

Note that you are only starting the computations, not waiting for them. Maybe what you wanted to write was @time @sync Threads.@spawn svdvals(rand(1000, 1000)) and map(fetch, [Threads.@spawn svdvals(rand(1000, 1000)) for _ in 1:1000]). However, spawning 1000 tasks at once does not sound like a good idea (although it doesn’t matter much here, since svdvals is a compute-intensive function). It’d be better to chunk the iteration space.
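For example, chunking could look something like this sketch. chunked_svdvals is a hypothetical helper (not from this thread), the chunk splitting is hand-rolled, and the matrices are shrunk to 100×100 just to keep the example quick:

```julia
using LinearAlgebra

# Split n jobs into roughly nthreads() chunks and spawn one task per chunk,
# so we never have more runnable tasks than threads.
function chunked_svdvals(n)
    nt = Threads.nthreads()
    step = cld(n, nt)  # ceiling division: chunk size
    ranges = [i:min(i + step - 1, n) for i in 1:step:n]
    tasks = map(ranges) do r
        # Each task processes its whole chunk serially.
        Threads.@spawn [svdvals(rand(100, 100)) for _ in r]
    end
    # fetch waits for each task and returns its result; vcat flattens.
    reduce(vcat, map(fetch, tasks))
end
```

Calling chunked_svdvals(1000) then returns 1000 vectors of singular values while keeping the number of in-flight tasks bounded by the thread count.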

In general, I highly recommend not using @spawn directly unless you are in very low-level programming mode. For high-level computations like this, consider one of the higher-level data-parallelism packages.

But if you only use svdvals (or other threaded BLAS functions) in your tasks, and not other functions, it doesn’t make sense to combine it with Julia’s threads, as @ChrisRackauckas mentioned.
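Concretely, if you do want to wrap BLAS calls in Julia tasks, the usual workaround is to pin BLAS to a single thread so the two thread pools don’t oversubscribe the machine. A sketch using the stock LinearAlgebra.BLAS API (matrix size reduced to 200×200 for speed):

```julia
using LinearAlgebra

# BLAS has its own thread pool that doesn't compose with Julia's scheduler.
# Restricting BLAS to one thread avoids oversubscription when we spawn tasks:
BLAS.set_num_threads(1)

tasks = [Threads.@spawn svdvals(rand(200, 200)) for _ in 1:8]
results = map(fetch, tasks)  # wait for all tasks and collect results

# Restore multithreaded BLAS afterwards if the rest of the program needs it.
BLAS.set_num_threads(Sys.CPU_THREADS)
```

With this setup the parallelism comes from Julia’s tasks rather than from inside each BLAS call.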

Just in case you did mean to measure an overhead of just starting a task, a better way is probably something like

julia> function onlystart()
           t = Task(() -> nothing)
           t.sticky = false
           schedule(t)
       end
onlystart (generic function with 1 method)

julia> @time onlystart()
  0.000005 seconds (4 allocations: 704 bytes)
Task (done) @0x00007f6e31dbb5b0

julia> @time onlystart()
  0.000003 seconds (4 allocations: 704 bytes)
Task (done) @0x00007f6e31dbdae0

Note that I don’t need to use svdvals or anything to measure the time for “spinning up” a task; it’s the same for any task.

Repeatedly calling @time Threads.@spawn ... is not a good idea because it includes compile time, and Threads.@spawn creates a new closure every time. Using @btime is always a good idea, but doing that here is a bit tricky:

julia> const TASKREF = Ref(onlystart())
Base.RefValue{Task}(Task (done) @0x00007f41d66e2710)

julia> @btime (TASKREF[] = onlystart()) setup = wait(TASKREF[])
  162.337 ns (4 allocations: 723 bytes)
Task (done) @0x00007f41d4370010

I’m not sure if this is a relevant performance characteristic, though. You’d probably want to know the overhead of starting and waiting for a task:

julia> @btime wait(Threads.@spawn nothing)
  1.712 μs (5 allocations: 726 bytes)

julia> @btime foreach(wait, [Threads.@spawn nothing for _ in 1:Threads.nthreads()]);
  2.937 μs (21 allocations: 2.93 KiB)

Just in case you’ve understood everything already but were wondering why @time [Threads.@spawn svdvals(rand(1000, 1000)) for _ in 1:1000] doesn’t finish quickly: I think this is because Julia uses cooperative multitasking. Once all the tasks hit the BLAS function and ccall into the external library, they block the Julia scheduler from doing anything, so it looks like the scheduler just hangs. My guess is that all Julia threads hit the ccall once @time tries to print something to stdout (which yields to the scheduler).


Yeah I need to get a better “work” function. Sleeping would prob be easier I suppose…

Ah, yes I think this is the answer I was looking for. I’m still not sure I understand why @time is so problematic… Each thread will invoke a ccall to BLAS, but why does the scheduler block before printing out the array of tasks?

Also thanks for the pointers to higher-level packages! I’ll check those out.

This is because the scheduler stops the current task (which is executing the REPL) and starts executing the other tasks waiting for CPU resources.

Here is a self-contained MWE:

function intensive_computation()
    started = time_ns()
    while time_ns() - started < 1_000_000_000
    end
end

@show Threads.nthreads()
let t0 = time_ns()
    tasks = [Threads.@spawn(intensive_computation()) for _ in 1:Threads.nthreads()]
    t1 = time_ns()
    print("all ")
    t2 = time_ns()
    println("spawned")
    t3 = time_ns()
    foreach(wait, tasks)
    t4 = time_ns()
    @show (t1 - t0) / 1_000_000
    @show (t2 - t1) / 1_000_000
    @show (t3 - t2) / 1_000_000
    @show (t4 - t3) / 1_000_000
end

With julia1.5 -t2, it prints

Threads.nthreads() = 2
all spawned
(t1 - t0) / 1000000 = 38.178896
(t2 - t1) / 1000000 = 1002.921738
(t3 - t2) / 1000000 = 0.052758
(t4 - t3) / 1000000 = 7.701817

Notice that the t1-to-t2 span (which just calls print("all ")) is the longest, and it matches the total computation time required for this MWE. If you run this, you can see that it “hangs” briefly just after printing all. That is to say, the I/O call print("all ") appears to block execution. The Julia runtime tries to share CPU resources fairly among the tasks and switches to one of the Threads.@spawn(intensive_computation()) tasks, hoping it lets go of the CPU “soon enough.” However, this doesn’t happen with compute-intensive tasks, especially when you are ccalling into external libraries.
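One way to see that this is about cooperative yielding rather than task count: replace the busy loop with sleep, which does yield back to the scheduler, and the print no longer stalls. sleepy_mwe below is a hypothetical variant of the MWE, not from the original thread:

```julia
function sleepy_mwe()
    t0 = time_ns()
    # sleep() suspends the task and yields, unlike the busy loop in
    # intensive_computation(), which never gives the CPU back.
    tasks = [Threads.@spawn sleep(1) for _ in 1:Threads.nthreads()]
    t1 = time_ns()
    print("all ")            # returns promptly: the sleeping tasks yield
    t2 = time_ns()
    foreach(wait, tasks)     # the actual ~1 s of sleeping happens here
    t3 = time_ns()
    (print_ms = (t2 - t1) / 1_000_000, wait_ms = (t3 - t2) / 1_000_000)
end
```

Running sleepy_mwe() should show a tiny print_ms and a wait_ms around 1000, the opposite of the busy-loop MWE, where nearly all the time lands on the print.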

I think this is what is happening in the @time example in the OP. Indeed, Base.time_print, called from @time, typically first prints empty spaces, and that I/O is what yields to the scheduler.

So it is the print("all ") in the MWE that blocks printing the rest of the result. My guess is that you’d see what you expected (~100 ms) if you waited long enough for all the tasks to complete (~100 seconds?). Here is what I see with 10 tasks:

julia> @time svdvals(rand(1000, 1000));
  0.171515 seconds (11 allocations: 15.839 MiB)

julia> t0 = time_ns()
       @time [Threads.@spawn svdvals(rand(1000, 1000)) for _ in 1:10]
       t1 = time_ns()
       (t1 - t0) / 10^9
  0.032784 seconds (51.69 k allocations: 3.062 MiB)

Notice that the time to complete the full @time ... expression is much longer than what @time itself reports.