I did a test on concurrency using @spawn and found what the title suggests. I don’t know where the twist is. Here is my code:
import Base.Threads.@spawn
Threads.nthreads() # use only one thread in this test
s(x) = sin(π * x/2);
c(x) = cos(π * x/2);
function cpu_intensive_work(N)
    a = 0.5
    for i = 1:Int(N)
        a = a < 0.5 ? s(a) : c(a)
    end
    a
end
function cpu_eased_work()
    sleep(5)
end
function test()
    wtask = @spawn cpu_eased_work()
    t0 = time()
    ret = cpu_intensive_work(3e8) # about 4.5 seconds
    Δt1 = time() - t0
    wait(wtask)
    Δt2 = time() - t0 - Δt1
    Δt1, Δt2, ret
end
# First, we run this directly in the REPL; the desired concurrency is achieved as expected
wtask = @spawn cpu_eased_work()
t0 = time();
ret = cpu_intensive_work(3e8);
Δt1 = time() - t0 # 4.722
wait(wtask)
Δt2 = time() - t0 - Δt1 # 0.258
# [comment] We can infer that concurrency happened, since only about 5 seconds were consumed in total
# Second, we run this inside a function; the concurrency fails (why?)
test() # Notice the returned values: Δt1 = 4.436000108718872, Δt2 = 5.007999897003174
# [comment] Δt2 alone takes 5 seconds!
How can I rewrite my test() function so that concurrency happens, just as in the 6-line REPL experiment above?
The typical approach is to @spawn all the tasks, not leave work in the originating function call. To avoid cheating, I wait/fetch the tasks to completion before leaving the originating function call. The order in which the tasks are waited/fetched matters for some reason, and @sync is how to get around that.
julia> @time test() # reproduces example
9.782188 seconds (857 allocations: 44.219 KiB, 0.05% compilation time)
(4.766000032424927, 5.019000053405762, 0.5316954226664747)
julia> function test2()
           wtask = @spawn cpu_eased_work() # sleep(5)
           wtask2 = @spawn cpu_intensive_work(3e8) # about 4.5 seconds
           wait(wtask)
           fetch(wtask2)
       end
test2 (generic function with 1 method)
julia> @time test2()
9.754124 seconds (5.46 k allocations: 275.234 KiB, 0.11% compilation time)
0.5316954226664747
julia> function test3()
           wtask = @spawn cpu_eased_work() # sleep(5)
           wtask2 = @spawn cpu_intensive_work(3e8) # about 4.5 seconds
           ret = fetch(wtask2)
           wait(wtask)
           ret
       end
test3 (generic function with 1 method)
julia> @time test3()
5.014386 seconds (5.46 k allocations: 275.047 KiB, 0.21% compilation time)
0.5316954226664747
julia> function test4()
           @sync begin # introduces a let block!!
               wtask = @spawn cpu_eased_work() # sleep(5)
               wtask2 = @spawn cpu_intensive_work(3e8) # about 4.5 seconds
               fetch(wtask2)
           end
       end
test4 (generic function with 1 method)
julia> @time test4()
5.011570 seconds (5.47 k allocations: 275.469 KiB, 0.21% compilation time)
0.5316954226664747
Why? Is this stipulated somewhere? I don’t think this is a reasonable requirement (the coding style in my test() makes sense for being concise), and it can be realized in the REPL, thereby raising the question of inconsistency, as the title suggests.
I see, as shown in your test2() vs test3(). But I don’t think this is reasonable behavior (is it?).
Let me restate this point to make it easier to notice:
function test5(wait_cpu_task_first)
    sleep_task = @spawn cpu_eased_work() # sleep(5)
    cpu_task = @spawn cpu_intensive_work(3e8) # about 4.5 seconds
    if wait_cpu_task_first
        wait(cpu_task)
        wait(sleep_task)
    else
        wait(sleep_task)
        wait(cpu_task)
    end
end
julia> @time test5(true)
5.015835 seconds (5.45 k allocations: 274.453 KiB, 0.23% compilation time)
julia> @time test5(false)
9.436463 seconds (13 allocations: 1.250 KiB)
The thing you’re missing, and which I still don’t really understand because I don’t do concurrency all that much, is that tasks don’t automatically begin execution at @spawn. They’re scheduled by the task manager and are thus considered to be running concurrently, but they only really begin once they reach a thread. So the 2 possible orders of the 2 tasks are:
a.
The scheduler puts cpu_eased_work() on the thread. sleep(5) sets the start time and yields to the scheduler (many of Julia’s I/O methods are designed to do this).
Now the scheduler puts cpu_intensive_work(3e8) on the thread, and it runs to completion for ~4.5 seconds.
The scheduler puts the preexisting cpu_eased_work() back on the thread, and it only lasts for ~0.5 seconds, a total of ~5 seconds since the start time. The total time of the 2 tasks is thus ~5 seconds.
b.
The scheduler puts cpu_intensive_work(3e8) on the thread, and it runs to completion for ~4.5 seconds.
The scheduler puts cpu_eased_work() on the thread. sleep(5) sets the start time and yields to the scheduler. Whatever the scheduler does next, this task ends 5 or more seconds later. The total time of the 2 tasks is thus 9.5 or more seconds.
In the original test with only one wtask, it scheduled the task then continued onward to run cpu_intensive_work(3e8), only yielding to the task at wait(wtask) for effectively the same timeline as (b). The same thing would occur in a let or begin block; on the other hand, evaluating lines as separate expressions in the global scope did actually yield at @spawn.
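You can see this directly with istaskstarted. A minimal sketch, assuming a single thread as in the original test (spawn_probe is a made-up name; with a free thread available the task could start right away):
julia> function spawn_probe()
           t = @spawn cpu_eased_work() # scheduled, but not yet running
           before = istaskstarted(t)
           yield()                     # give the scheduler a turn with the thread
           after = istaskstarted(t)
           before, after
       end;

julia> spawn_probe()  # the task only begins its sleep once we yield
(false, true)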
I’ll amend the previous comment too, but I was wrong about @sync handling it for us. We get the slower scenario (b) if the @spawn statements are swapped:
julia> function test4ii()
           @sync begin
               wtask2 = @spawn cpu_intensive_work(3e8) # about 4.5 seconds
               wtask = @spawn cpu_eased_work() # sleep(5)
               fetch(wtask2)
           end
       end
test4ii (generic function with 1 method)
julia> @time test4ii()
9.789642 seconds (5.47 k allocations: 275.734 KiB, 0.12% compilation time)
0.5316954226664747
Switching the @spawn lines doesn’t change test2 but it does slow down test3. As you demonstrate, switching which task to wait/fetch first also matters, which I inadvertently do in test2 vs test3. I don’t really know what the scheduler is doing, but maybe there is some principle.
The easiest adjustment is to make every scheduled task yield immediately after starting any timers so they all get a quick turn with the thread right after the originating function waits for them. We could change cpu_intensive_work, but @spawn and other task-making macros already make zero-argument closures, so:
julia> function test4iii()
           @sync begin
               wtask2 = @spawn begin yield(); cpu_intensive_work(3e8) end # about 4.5 seconds
               wtask = @spawn cpu_eased_work() # sleep(5)
               fetch(wtask2)
           end
       end
test4iii (generic function with 1 method)
julia> @time test4iii()
5.018634 seconds (5.52 k allocations: 277.922 KiB, 0.23% compilation time)
0.5316954226664747
The change doesn’t help test5’s separate wait calls though.
What I guess happens here is that the wtask in test() is scheduled on the main thread, which immediately becomes busy with cpu_intensive_work. The wtask won’t even begin its sleep before the cpu_intensive_work starts. The wtask won’t be rescheduled to another thread because it doesn’t even reach a point where it can be rescheduled.
This particular pattern, with some CPU-bound tasks and some waiting tasks, is problematic. It’s mentioned in the manual (Multi-Threading · The Julia Language), in the context of garbage collection:
Compute-bound, non-memory-allocating tasks can prevent garbage collection from running in other threads that are allocating memory. In these cases it may be necessary to insert a manual call to GC.safepoint() to allow GC to run. This limitation will be removed in the future.
To alleviate this, you should @spawn all tasks, and wait for them (or do a yield() now and then in the cpu-bound task).
In the REPL case, the wtask will manage to go to sleep before cpu_intensive_work starts, and the wtask will be scheduled on a different thread when woken up. In general, a task can only migrate to another thread when it interacts with the julia runtime (io, sleep, lock, allocation, etc.), because tasks use cooperative scheduling, i.e. the julia runtime can’t pause a cpu-bound task and move it to another thread. For example, with an occasional yield() inserted into the CPU-bound loop, the original test() works:
function cpu_intensive_work(N)
    a = 0.5
    for i = 1:Int(N)
        iszero(i & 0xff) && yield()
        a = a < 0.5 ? s(a) : c(a)
    end
    a
end
julia> test()
(5.12465500831604, 2.1457672119140625e-6, 0.5316954226664747)
On a more general level, when running concurrent Julia code, you should never write code that runs for a significant amount of time and doesn’t yield. The problem in this thread is just one of the many bad things that can follow from not yielding.
The reason is that Julia uses cooperative scheduling. Each thread yields itself, and is not preempted by the scheduler.
I recommend not trying to game the scheduler and e.g. try to make sure that one task is run before an other. Instead, always, always make your tasks yield regularly to allow the scheduler to do its job.
The 5 seconds actually suggests the sleep-task was started first. I still don’t know how @spawn/wait/fetch affect when scheduled tasks get onto threads, and I don’t even know if that’s even supposed to be consistent. As I mentioned, my lack of need for and inexperience with concurrency explains a lot of it, but some other things I don’t really know:
does @spawn/@async put tasks into the queue in an order, and does that order also determine which gets dibs when the scheduler looks for a task to run on a free thread?
when I wait/fetch a task, does it yield to the whole task queue or run/restart the input task first?
what do yielding and blocking really do? “Tell the scheduler to run another task” and “stop this task until a condition is met” seem clear in isolation, but they don’t say much about how the scheduler handles this task among the other queued tasks or what’s supposed to happen when you do either from outside of a task (e.g. sleep obviously delays more from a non-task than a task).
Which methods yield and which don’t? yield, sleep vs Libc.systemsleep, and print/println are well-known, but there’s usually not much indication. A print that doesn’t yield would actually be very helpful for visualizing the progress of concurrent tasks in these sorts of educational experiments (see the sketch after this list).
Some “settings” functions like g/set_zero_subnormals can be documented to “affect the current thread”. Does this mean they’re unsafe for task migration like threadid, and how would we accomplish the same effect in a migrate-able task?
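On the non-yielding print, one possible option, with the caveat that this is my assumption rather than documented behavior: Core.println writes directly to the stream without going through the usual task-switching IO machinery, so it should not yield. A sketch (cpu_intensive_work_verbose is a made-up name):
function cpu_intensive_work_verbose(N)
    a = 0.5
    for i = 1:Int(N)
        # Core.println is assumed not to yield, unlike Base.println;
        # verify on your version before relying on it
        iszero(i & 0x00ffffff) && Core.println("progress: i = ", i) # every ~16.8M iterations
        a = a < 0.5 ? s(a) : c(a)
    end
    a
end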
I’m sure I could eventually figure this out if I put my mind to it, but the manual isn’t the best 101 source because it throws out a lot of generic terms without explaining them, and other sources may use these terms to mean different things, or different terms to mean the same thing (concurrent vs parallel, task vs coroutine vs green thread vs thread vs process). That’s understandable, because they may not be dealing with the same contexts (Julia, BLAS, and the OS each have their own “threads”, and I wish I understood it better). I know people have mentioned that the manual needs to be updated to recommend @spawn over @async, but I think it generally needs to be rewritten to be more informative to people starting from scratch. As much as it helps, making experiments like this to discover how concurrency works is time-consuming, error-prone, and doesn’t clarify which behaviors are actually reliable.
I agree with what Benny said (I’ve also expressed my opinion about julia’s documentation many times: it is not as effective as its code).
I don’t have enough knowledge about what counts as CPU-intensive work, which is why I devised the sin/cos one in my first post in this topic.
Now I speculate that an easier (perhaps the easiest) piece of CPU-intensive work could be
function cpu_intensive_sleep(Δt)
    t0 = time()
    while time() - t0 < Δt
    end
end
Am I correct?
I then ran a further test using the newly proposed cpu_intensive_sleep function above.
That reaches a different concept: busy waiting, basically making the processor do otherwise pointless work until a condition is met. The risk is that the compiler may decide to optimize everything away for a trivial result, but you can at least time it on your machine. If you just want to halt a Julia process or task without yielding to another task, Libc.systemsleep lets eligible operating systems do that exact job without spending more energy on the work. However, a busy processor is less liable to be interrupted by other processes, and staying in the same process could be more precise than sleeping and waking, especially when considering hardware timer precision. For concurrency, such precision is overkill; sleeping tasks still have to wait for an available thread after their requested time has passed, and that delay is a small price for our tasks taking turns on threads over time.
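To make the contrast concrete, a sketch assuming a single thread as in these tests (sleep_comparison is a made-up name; with more threads the spawned task could run immediately): sleep yields the thread to other tasks, while Libc.systemsleep blocks the whole thread without yielding.
function sleep_comparison()
    t = @spawn println("task ran")
    Libc.systemsleep(1)     # blocks the whole thread; the task gets no turn
    before = istaskdone(t)  # false: systemsleep never yielded
    sleep(1)                # sleep yields; the task now runs and prints
    after = istaskdone(t)   # true
    before, after
end

sleep_comparison()  # prints "task ran" only after the systemsleep; returns (false, true)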
It took me some time to understand what this means. Let me restate it, maybe as:
Julia uses cooperative scheduling: each thread will not be pre-empted automatically by the scheduler. Therefore, if needed, you should let each thread yield by itself.
(Well, the wording “pre-empt” is quite obscure for a noob at first. In my understanding, a similar verb could be “obstruct” (okay?).)
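If I test my understanding with a tiny sketch (takes_turns is a made-up name; with @async both the task and the loop stay on one thread, and they interleave only because each side volunteers the thread via yield()):
function takes_turns()
    order = String[]
    t = @async for i in 1:3
        push!(order, "task $i")
        yield()             # voluntarily hand the thread back
    end
    for i in 1:3
        push!(order, "main $i")
        yield()             # without this, the task gets no turn until wait()
    end
    wait(t)
    order
end

takes_turns()  # ["main 1", "task 1", "main 2", "task 2", "main 3", "task 3"]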
Well, the word “thread” is even more tricky and complex. Can anyone summarize how many meanings there are for the word “thread”? There’s OS thread, julia thread, green thread… is there a thread in the hardware sense?? The concepts are very complicated.
There is a thread in the hardware sense. It’s often called hyperthread. The idea is that two instruction streams are run simultaneously on the same cpu core. They share some resources, e.g. a multiply can be done in one of the threads simultaneously with an xor in the other thread. However, hardware is anyway very parallel nowadays. Even a single instruction stream (e.g. a julia program) can be executed partly in parallel and out of order. Both forks in an if-then-else can be executed simultaneously, awaiting evaluation of the if-condition, then the irrelevant one is discarded. Etc. etc.
An OS-thread is the unit of execution of the OS. Unless running in a real-time OS, OS-threads can be pre-empted by the OS, i.e. put forcefully to sleep, and the OS will continue with another thread, or with its own urgent business (e.g. serving a physical interrupt). An OS will receive a physical timer-interrupt every 0.01 or 0.001 second, loop through the running threads, and might pause one thread and continue another. That’s “time sharing”.
A julia thread is, at least for now, an OS-thread, which the OS controls. However, julia controls which julia task runs in the thread. But julia can’t forcefully pause a task and put another task to run on the thread. Rather, the running task must call yield() at some point. This will, at least conceptually, put the task in some julia queue, and some other (or the same) task can be started on the thread. A yield() is done inside julia io/sleep/lock routines, and can also be done manually in loops which do not call into any such julia routines (typically a tight loop).
Julia threads are persistent throughout the execution of a julia program, but julia tasks are created on the fly and queued for execution on some thread.
“Green thread” has been used for tasks created by @async, which can only run on the main thread, and then only when nobody else uses it. Typically it can be used for logging purposes when the main thread does various IO things, e.g. to report progress.
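For instance, something like this sketch (made-up names; the logger’s println and sleep calls yield, so the main thread’s IO work and the logger take turns):
done = Ref(false)
logger = @async while !done[]
    println("still working…")  # println and sleep both yield the thread
    sleep(1)
end
# ... the main thread does its various IO things here ...
done[] = true
wait(logger)   # the logger exits on its next wake-up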
In C++20 (probably elsewhere too), there’s a concept of a “stackless thread” or “coroutine”, which is a very lightweight asynchronous animal with limited functionality, but with some uses.
When programming parallel algorithms in julia, it’s tasks which are important. You can spawn millions of them, and the julia scheduler will start one on each thread. When started, the task will run until it’s finished, or at some point it calls yield() (explicitly in user code, or implicitly by library functions).
The julia way of doing tasks was inspired by an extension to C called “cilk”, conceived at MIT in the 1990s (Cilk - Wikipedia). It’s unfortunate that the word “thread” is used in julia contexts; it’s not really about threads. In julia, threads are a static resource, used for running tasks.
The spawn keyword, when preceding a function call (spawn f(x)), indicates that the function call (f(x)) can safely run in parallel with the statements following it in the calling function.
This quote from the Cilk Wikipedia link above corresponds squarely to my idea above.
This pattern means the same in julia, but no spawning pattern guarantees concurrent execution. It depends on subtle details of julia’s scheduler. Scheduling workloads of unknown size is difficult. E.g. in linux a lot of work has gone into various scheduling policies, both for threads and io.
Julia has one scheduling policy whose details admittedly are a bit obscure (just like the typical OS-thread scheduler, be it in windows, macos or elsewhere). I sometimes use the pattern you’ve tried, i.e.
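t = @spawn background_work()  # sketch with placeholder functions: one spawned task,
r = foreground_work()         # the rest of the work kept in the calling function
wait(t)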
with success, whereas in other cases it’s something like
r = sum(t -> fetch(t)::Int, [@spawn work(i) for i in alli])
where all the work is spawned. In neither case is there any guarantee that certain tasks actually run concurrently. The julia scheduler does the best it can; sometimes its strategy isn’t optimal. Just like any scheduler, which in reality attempts to solve the knapsack problem with unknown sizes. Batch systems on supercomputers have the same problem.
I just want to mention that the number of threads can be specified on the julia command line or by some function calls, and depending on your julia version there are different defaults, e.g. for interactive threads.
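For example (a sketch; the exact defaults and flags depend on your version, and the N,M form for an extra interactive pool needs julia 1.9 or newer):
# on the command line:
#   julia --threads 4            start with 4 threads
#   julia -t auto                let julia pick a count
#   julia -t 4,1                 4 default threads + 1 interactive thread (1.9+)
#   JULIA_NUM_THREADS=4 julia    via the environment variable

julia> Threads.nthreads()        # query the count from inside julia
4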