Blog post: Asynchronous programming in Julia

I’ve made a new blog post, this time an introduction to asynchronous programming in Julia. You can find it here: https://viralinstruction.com/posts/threads/

I’m interested in feedback, especially:

  • Any misunderstandings in the blog post
  • Typos and other errors
  • Content I should have covered but didn’t

I’ve found that async is a neat subject, but the knowledge of how to do it in Julia is scattered around, unlike in other ecosystems, where there are thorough resources on the topic. I wrote the post not just to coalesce the information in one place, but also to motivate myself to understand the topic.

Even though the post is complete as-is, I’ll probably continue adding to and modifying it for some time.

54 Likes

Nice post! Created a bunch of PRs for typos etc.

One issue I’d raise is that in many places you mention how fetch(task) has bad return type inference, and advise using a type assert right after the fetch(task) call. That’s correct; however, I think it may be better to say that fetch is a very low-level tool, and advise users to reach for Channel instead. Unless fetch is only being used for simple synchronization, with the return value ignored, which seems OK to do.
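
For illustration, a minimal sketch of the two approaches (a toy example of mine, not code from the post):

# fetch(task) infers as Any; a type assert recovers inference:
t = Threads.@spawn sum(1:1_000_000)
x = fetch(t)::Int  # now inferred as Int

# A typed Channel carries the element type itself, so no assert is needed:
ch = Channel{Int}(1)
Threads.@spawn put!(ch, sum(1:1_000_000))
y = take!(ch)  # inferred as Int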

7 Likes

Written 2025-12-26

:eyes: Do you have more news from the future?

5 Likes

Thanks! I’m learning a lot. But I think update_bytes_good is incorrect as written. Here’s my take:

function update_bytes_good(v::Vector{UInt8}, offset::Int)
    # Split `v` into 8 contiguous chunks; this task updates chunk `offset`.
    len = cld(length(v), 8)
    chunkoffset = len * offset
    for _ in 1:1000, i in (1 + chunkoffset):min(len + chunkoffset, length(v))
        v[i] += 0x01
    end
end

You’re right - it was a simple typo. It should be offset*len+1:(offset+1)*len, not offset*len+1:(offset+1):len.
This also reveals a much, much larger difference in timing.
I’ll have it fixed soon.
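
To spell the typo out, a quick REPL illustration with made-up numbers:

julia> len, offset = 100, 2;

julia> offset*len+1:(offset+1)*len          # intended: the contiguous chunk
201:300

julia> isempty(offset*len+1:(offset+1):len) # the typo: a step range, empty here
true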

1 Like

A big confounder for the timings here is that the unit-stride loop can SIMD, unlike the interleaved loops. Here’s a screenshot showing the difference.

I modified the code to take the number of tasks as a variable, such that for n=1 the two variants are effectively identical; still, good is more than 17 times faster than bad. The difference is that in good, the loop is known to have unit stride at compile time. (A factor of 17 is still a lot, though; probably something more than just SIMD gets enabled. This is on an Apple M4.)

This is even more reason to prefer contiguous rather than interleaved chunks, but it implies that false sharing is not the main explanation for the timing discrepancy.

EDIT: The code in the screenshot is only valid when n is a power of 2.

Very cool! Now I can finally understand async programming properly :grin:

One little thing: I find it strange to see the code block outputs at the top of the code blocks instead of at the bottom. Is this intended?

One way to get around the SIMD confounder is to only update every other element, like this:
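
(The snippet itself isn’t shown in this export, so here is a reconstructed sketch of the idea; the function name and the task-count argument are mine.)

function update_every_other!(v::Vector{UInt8}, offset::Int, ntasks::Int)
    len = cld(length(v), ntasks)
    lo = 1 + len * offset
    hi = min(len + len * offset, length(v))
    # A stride of 2 defeats vectorization, so SIMD no longer favors the
    # contiguous variant over the interleaved one.
    for _ in 1:1000, i in lo:2:hi
        v[i] += 0x01
    end
end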

It’s remarkable how much performance degrades when there’s false sharing between multiple tasks, even when compared to a single task doing all the work.

Me too :slight_smile: But this is a design decision of the (author of the) particular tool used to create those notebooks: https://plutojl.org. Some discussion and explanation here: Why is cell output above code? · Issue #205 · fonsp/Pluto.jl · GitHub.

Thank you! I have always wished there was a comprehensive tutorial for learning async programming in Julia. I have read the low-level part (up to the tasks section), and I found it quite challenging to understand what the text tries to convey. Maybe adding some illustrations (such as what the before-after thing is, how registers and stacks work, etc.) would be a great help for beginners like me.

I believe that code also SIMDs in the good case. I’ve changed the example (in the git repo, not yet on the blog) such that the step size is passed in as an argument.
Another change I’ve made is to make the vector smaller, but have it loop over it more times. The reason is that, with a long vector, false sharing will slow down one task, causing it to fall behind the others and operate on elements in another section of the array. This then reduces the impact of false sharing, making the example self-limiting.
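
Roughly along these lines (a sketch of the shape of the change, not the repo’s exact code):

# Passing the stride at runtime hides it from the compiler, so the
# contiguous (step 1) and interleaved (step n) variants compile the same way:
function update_bytes!(v::Vector{UInt8}, start::Int, step::Int, stop::Int)
    for _ in 1:1000, i in start:step:stop
        v[i] += 0x01
    end
end

# Contiguous, task t of n: update_bytes!(v, 1 + (t - 1) * len, 1, min(t * len, length(v)))
# Interleaved, task t of n: update_bytes!(v, t, n, length(v))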

Some optimization certainly vanished when going from stride 1 to stride 2, since, as you can see, the difference when using a single task shrank from a factor of 17 to within the noise. I’m by no means an expert on this, but the docstring for the @simd macro says

  • Accesses must have a stride pattern and cannot be “gathers” (random-index reads) or “scatters” (random-index writes).
  • The stride should be unit stride.

1 Like

Nice blog!

A pet peeve of mine is the use of “asynchronous” when concurrency is what’s being described :wink:

A CPU core can only run one or two threads at a time, so the number of threads is usually a small, fixed number corresponding to the core count of the CPU. You can check the number of current threads with the function Threads.nthreads():

I assume you are referring to Simultaneous multithreading
or what Intel calls “hyper-threading”. Technically, a “CPU core” as presented to the OS can only run one thread at a time; this is sometimes called a “logical CPU core”, and that is what the OS schedules threads onto. The hardware then internally multiplexes logical cores onto hardware cores, and instructions (depending on slot utilization) may or may not execute in parallel.

Probably the most helpful thing is to say:

A CPU core can only run one thread at a time, so the number of threads is usually a small, fixed number corresponding to the core count of the CPU. You can check the number of current threads with the function Threads.nthreads():

And then discuss the complexity of SMT later on.
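
For example (illustrative numbers from a hypothetical 8-core machine with SMT):

julia> Threads.nthreads()  # threads this Julia session was started with (-t / JULIA_NUM_THREADS)
8

julia> Sys.CPU_THREADS     # logical cores the OS reports; counts SMT siblings separately
16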

Therefore, calls to GC.safepoint() are peppered across various functions in the Julia runtime, like memory allocation or IO.

For a while now we have been doing “safepoint-on-entry”, i.e. every non-inlined function contains a safepoint that is triggered at the beginning of the function.
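
To make that concrete, a small example of my own: a tight loop that neither allocates nor makes non-inlined calls contains no safepoints, so an explicit GC.safepoint() is the only way to let the GC interrupt it.

function spin(n)
    acc = 0
    for i in 1:n
        acc += i
        # No allocation and no non-inlined call in this loop body, so without
        # this explicit safepoint the GC could not stop this thread until the
        # loop finishes.
        GC.safepoint()
    end
    acc
end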

Mostly, people shouldn’t use spinlocks.

I think that deserves to be in bold, flashing colors xD.

A function that doesn’t allocate, or only allocates in a foreign function call, does not need to be stopped for the GC to run.

Also see Allow for :foreigncall to transition to GC safe automatically by vchuravy · Pull Request #49933 · JuliaLang/julia · GitHub, which makes it safe to run the GC concurrently with foreign calls too.

6 Likes

  • Memory allocation, including during dynamic dispatch, will occasionally yield

Is this actually correct? This was definitely not my understanding.

(“yield”: allow the scheduler to mount a different task onto the current thread. Notably different from a GC safepoint, which allows the GC to put the thread to sleep until a stop-the-world pause is finished, but, AFAIU, does not allow the scheduler to hand the thread to a different task.)
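
In code, the contrast I mean (toy example):

t = Threads.@spawn println("other task")
yield()          # a yield: the scheduler may run the spawned task here
GC.safepoint()   # a safepoint: blocks only while a stop-the-world pause is
                 # in progress; never switches this thread to another task
wait(t)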

Is it expected that the function observe_overwrite from the blog post will sometimes return false in practice?

I have set up the following simple experiment,

function overwrite(a::Ref{Bool}, b::Ref{Bool})
    a[] = true
    b[] = a[]
end;

function observe_overwrite()
    a = Ref(false)
    b = Ref(false)
    t = Threads.@spawn overwrite(a, b)
    # If b is observed as true, a should be true too, unless the writes
    # were reordered; this returns false only in that case.
    b[] ? a[] : true
end;


i = 0
while observe_overwrite()
    global i += 1  # `global` is needed when this runs at top level in a script
end
@show i

and it never left the loop.

Probably not in practice, no. I would expect that the statement b[] ? a[] : true will be executed before the spawned task begins, such that b[] is false when the statement runs.

It would be fun to try and find an example where the lack of happens-before between tasks can be directly observed, but I haven’t been able to find one.

2 Likes