Blog post: Asynchronous programming in Julia

I’ve made a new blog post, this time an introduction to asynchronous programming in Julia. You can find it here: https://viralinstruction.com/posts/threads/

I’m interested in feedback, especially:

  • Any misunderstandings in the blog post
  • Typos and other errors
  • Content I should have covered but didn’t

I’ve found that async is a neat subject, but the knowledge of how to do it in Julia is scattered around, unlike in other ecosystems, where there are thorough resources on the topic. I wrote the post not just to coalesce the information in one place, but also to motivate myself to understand the topic.

Even though the post is complete as-is, I’ll probably continue adding to and modifying it for some time.

54 Likes

Nice post! Created a bunch of PRs for typos etc.

One issue I’d raise is that in many places you mention how fetch(task) has bad return type inference, and advise using a type assert right after the fetch(task) call. That’s correct; however, I think it may be better to say that fetch is a very low-level tool, and advise users to reach for Channel instead. Unless fetch is only being used for simple synchronization, with the return value ignored, which seems OK to do.
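
For illustration, a minimal sketch of the two approaches (a toy example of mine, not code from the post):

# fetch(task) infers as Any; a type assert recovers inference:
t = Threads.@spawn sum(1:1_000_000)
x = fetch(t)::Int  # now inferred as Int

# A typed Channel carries the element type itself, so no assert is needed:
ch = Channel{Int}(1)
Threads.@spawn put!(ch, sum(1:1_000_000))
y = take!(ch)  # inferred as Int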

7 Likes

Written 2025-12-26

:eyes: Do you have more news from the future?

5 Likes

Thanks! I’m learning a lot. But I think update_bytes_good is incorrect as written. Here’s my take:

function update_bytes_good(v::Vector{UInt8}, offset::Int)
    # Split `v` into 8 contiguous chunks; this task updates chunk `offset`.
    len = cld(length(v), 8)
    chunkoffset = len * offset
    for _ in 1:1000, i in (1 + chunkoffset):min(len + chunkoffset, length(v))
        v[i] += 0x01
    end
end

You’re right - it was a simple typo. It should be offset*len+1:(offset+1)*len, not offset*len+1:(offset+1):len.
This also reveals a much, much larger difference in timing.
I’ll have it fixed soon.
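
To spell the typo out, a quick REPL illustration with made-up numbers:

julia> len, offset = 100, 2;

julia> offset*len+1:(offset+1)*len          # intended: the contiguous chunk
201:300

julia> isempty(offset*len+1:(offset+1):len) # the typo: a step range, empty here
true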

1 Like

A big confounder for the timings here is that the unit-stride loop can SIMD, unlike the interleaved loops. Here’s a screenshot showing the difference.

I modified the code to take the number of tasks as a variable, such that for n=1 the two variants are effectively identical; still, good is more than 17 times faster than bad. The difference is that in good, the loop is known to have unit stride at compile time. (A factor of 17 is still a lot, though; probably something more than just SIMD gets enabled. This is on an Apple M4.)

This is even more reason to prefer contiguous rather than interleaved chunks, but it implies that false sharing is not the main explanation for the timing discrepancy.

EDIT: The code in the screenshot is only valid when n is a power of 2.

Very cool! Now I can finally understand async programming properly :grin:

One little thing: I find it strange to see the code block outputs at the top of the code blocks instead of at the bottom. Is this intended?

One way to get around the SIMD confounder is to only update every other element, like this:
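
(The snippet itself isn’t shown in this export, so here is a reconstructed sketch of the idea; the function name and the task-count argument are mine.)

function update_every_other!(v::Vector{UInt8}, offset::Int, ntasks::Int)
    len = cld(length(v), ntasks)
    lo = 1 + len * offset
    hi = min(len + len * offset, length(v))
    # A stride of 2 defeats vectorization, so SIMD no longer favors the
    # contiguous variant over the interleaved one.
    for _ in 1:1000, i in lo:2:hi
        v[i] += 0x01
    end
end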

It’s remarkable how much performance degrades when there’s false sharing between multiple tasks, even when compared to a single task doing all the work.

Me too :slight_smile: But this is a design decision of the (author of the) particular tool used to create those notebooks: https://plutojl.org. Some discussion and explanation here: Why is cell output above code? · Issue #205 · fonsp/Pluto.jl · GitHub.

Thank you! I have always wished there was a comprehensive tutorial for learning async programming in Julia. I have read the low-level part (up to the tasks section), and I found it quite challenging to understand what the text tries to convey. Maybe adding some illustrations (such as what the before-after thing is, how registers and stacks work, etc.) would be a great help for beginners like me.

I believe that code also SIMDs in the good case. I’ve changed the example (in the git repo, not yet on the blog) such that the step size is passed in as an argument.
Another change I’ve made is to make the vector smaller, but have it loop over it more times. The reason is that, with a long vector, false sharing will slow down one task, causing it to fall behind the others and operate on elements in another section of the array. This then reduces the impact of false sharing, making the example self-limiting.
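
Roughly along these lines (a sketch of the shape of the change, not the repo’s exact code):

# Passing the stride at runtime hides it from the compiler, so the
# contiguous (step 1) and interleaved (step n) variants compile the same way:
function update_bytes!(v::Vector{UInt8}, start::Int, step::Int, stop::Int)
    for _ in 1:1000, i in start:step:stop
        v[i] += 0x01
    end
end

# Contiguous, task t of n: update_bytes!(v, 1 + (t - 1) * len, 1, min(t * len, length(v)))
# Interleaved, task t of n: update_bytes!(v, t, n, length(v))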

Some optimization certainly vanished when going from stride 1 to stride 2, since, as you can see, the difference when using a single task shrank from a factor of 17 to within the noise. I’m by no means an expert on this, but the docstring for the @simd macro says

  • Accesses must have a stride pattern and cannot be “gathers” (random-index reads) or “scatters” (random-index writes).
  • The stride should be unit stride.

1 Like

Nice blog!

A pet peeve of mine is the use of “asynchronous” when concurrency is what’s being described :wink:

A CPU core can only run one or two threads at a time, so the number of threads is usually a small, fixed number corresponding to the core count of the CPU. You can check the number of current threads with the function Threads.nthreads():

I assume you are referring to Simultaneous multithreading
or what Intel calls “hyper-threading”. Technically, a “CPU core” as presented to the OS can only run one thread at a time; this is sometimes called a “logical CPU core”, and that is what the OS schedules threads onto. The hardware then internally multiplexes logical cores onto hardware cores, and instructions (depending on slot utilization) may or may not execute in parallel.

Probably the most helpful thing is to say:

A CPU core can only run one thread at a time, so the number of threads is usually a small, fixed number corresponding to the core count of the CPU. You can check the number of current threads with the function Threads.nthreads():

And then discuss the complexity of SMT later on.
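
For example (illustrative numbers from a hypothetical 8-core machine with SMT):

julia> Threads.nthreads()  # threads this Julia session was started with (-t / JULIA_NUM_THREADS)
8

julia> Sys.CPU_THREADS     # logical cores the OS reports; counts SMT siblings separately
16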

Therefore, calls to GC.safepoint() are peppered across various functions in the Julia runtime, like memory allocation or IO.

For a while now we have been doing “safepoint-on-entry”, i.e. every non-inlined function contains a safepoint that is triggered at the beginning of the function.
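
To make that concrete, a small example of my own: a tight loop that neither allocates nor makes non-inlined calls contains no safepoints, so an explicit GC.safepoint() is the only way to let the GC interrupt it.

function spin(n)
    acc = 0
    for i in 1:n
        acc += i
        # No allocation and no non-inlined call in this loop body, so without
        # this explicit safepoint the GC could not stop this thread until the
        # loop finishes.
        GC.safepoint()
    end
    acc
end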

Mostly, people shouldn’t use spinlocks.

I think that deserves to be in bold, flashing colors xD.

A function that doesn’t allocate, or only allocates in a foreign function call, does not need to be stopped for the GC to run.

Also see Allow for :foreigncall to transition to GC safe automatically by vchuravy · Pull Request #49933 · JuliaLang/julia · GitHub, which makes it safe to run the GC concurrently with foreign calls too.

6 Likes

  • Memory allocation, including during dynamic dispatch, will occasionally yield

Is this actually correct? This was definitely not my understanding.

(“yield”: allow the scheduler to mount a different task onto the current thread. Notably different from a GC safepoint, which allows the GC to put the thread to sleep until a stop-the-world pause is finished, but, AFAIU, does not allow the scheduler to hand the thread to a different task.)
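
In code, the contrast I mean (toy example):

t = Threads.@spawn println("other task")
yield()          # a yield: the scheduler may run the spawned task here
GC.safepoint()   # a safepoint: blocks only while a stop-the-world pause is
                 # in progress; never switches this thread to another task
wait(t)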

Is it expected that the function observe_overwrite from the blog post will sometimes return false in practice?

I have set up the following simple experiment,

function overwrite(a::Ref{Bool}, b::Ref{Bool})
    a[] = true
    b[] = a[]
end;

function observe_overwrite()
    a = Ref(false)
    b = Ref(false)
    t = Threads.@spawn overwrite(a, b)
    # If b is observed as true, a should be true too, unless the writes
    # were reordered; this returns false only in that case.
    b[] ? a[] : true
end;


i = 0
while observe_overwrite()
    global i += 1  # `global` is needed when this runs at top level in a script
end
@show i

and it never left the loop.

Probably not in practice, no. I would expect that the statement b[] ? a[] : true will be executed before the spawned task begins, such that b[] is false when the statement runs.

It would be fun to try and find an example where the lack of happens-before between tasks can be directly observed, but I haven’t been able to find one.

2 Likes