There’s a bit of confusion here about what things work, how and why. Let’s see if we can’t crack this, starting with the original function:
```julia
julia> function f(running)
           i = 0
           while running[] == true
               i = i + 1
           end
           println("I'm done")
       end
f (generic function with 1 method)
```
and looking at what LLVM IR is produced when we pass in a `Ref` (don't worry, I'm going to go through every single line):
```llvm
; Function Signature: f(Base.RefValue{Bool})    ; this is the signature of this code
; @ REPL[5]:1 within `f`                        ; where it was defined
define void @julia_f_2169({}* noundef nonnull align 1 dereferenceable(1) %"running::RefValue") #0 {
top:
  %0 = bitcast {}* %"running::RefValue" to i8*  ; cast to i8*
  %1 = load i8, i8* %0, align 1                 ; load the Bool data
  %2 = and i8 %1, 1                             ; mask out the truthiness value
  %.not = icmp eq i8 %2, 0                      ; compare to false
; @ REPL[5]:3 within `f`
  br i1 %.not, label %L8, label %L2             ; if it was false, goto L8, else goto L2

; this is our loop!!!!
L2:                                             ; preds = %L2, %top
  br label %L2                                  ; jump to L2 _unconditionally_

L8:                                             ; preds = %top
; @ REPL[5]:6 within `f`
  call void @j_println_2175({}* @"jl_global#2176.jit")
  ret void
}
```
The first two lines are purely informational comments, stating first what the Julia function signature is and second where it was defined. The third line is the LLVM function definition of our `f`. The `Ref` is a `Base.RefValue`, which in LLVM ends up as a pointer (note that the argument is called `running::RefValue`). LLVM IR is built up of blocks of code, each containing single expressions; jumping between blocks is how control flow like branches and loops is implemented.

The first block is called `top`; it represents the entry point of the function. The second block is called `L2`, in this case representing our `while` loop. The third block is called `L8`, in this case representing the code after the loop.
In Julia, boolean values are (currently) stored as a whole byte, of which only the lowest bit holds the `true`/`false` value. If the lowest bit is `1`, the value is `true`; if it's `0`, the value is `false`. So starting from `top`, we immediately cast the incoming pointer of our `running` variable to an `i8*` (a pointer to a byte), storing that cast pointer in the variable `%0`. We then load the truthiness value (stored in `%1`) and mask out the only part of the boolean that actually encodes the data (stored in `%2`). Now the `== false` comparison of our original loop begins: first comparing to `false`, and subsequently jumping either to the actual loop `L2` or to the `println` after the loop, depending on whether the comparison was true.
If you examine `L2` very closely, you'll notice that except for jumping right back to `L2`, it's completely empty! So the loop spins forever and ever, never checking `running` again.
This may sound odd; after all, the `Ref` may change value! However, this is a consequence of the semantics of `Ref`: it doesn't guarantee that storing data into it is immediately (or ever) visible to other tasks, or that other tasks must check for a new value every time they look at it. `Ref` is chiefly a single-threaded concept. To a `Ref`, there is only ever one thread of execution (its own), and since no one else is running and the `Ref` is never written to inside of the loop, loading its value inside of the loop is superfluous: the semantics dictate that the value loaded from that `Ref` ought to be the same, no matter whether it's loaded before the loop runs or while the loop runs. Because reloading the same value from a pointer over and over again is quite obviously redundant work, the load is moved to before the loop. This optimization is called "Loop Invariant Code Motion" [1], because the operation (in this case, loading from the `Ref`) is a never-changing invariant from iteration to iteration.
So this is a very important point: `RefValue` cannot be used for synchronization across threads, because it doesn't guarantee that a changed value is ever seen by other tasks reading from it. [2][3] `RefValue` is a single-threaded concept.
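To make the fix concrete, here's a minimal sketch of the safe pattern (the name `f_atomic` is mine; assumes Julia was started with multiple threads, e.g. `julia -t 2`, so the spinning task actually runs concurrently with the main task):

```julia
using Base.Threads

function f_atomic(running::Atomic{Bool})
    i = 0
    while running[]        # atomic acquire load, re-done every iteration
        i += 1
    end
    println("I'm done")
end

running = Atomic{Bool}(true)
t = @spawn f_atomic(running)
sleep(0.1)
running[] = false          # atomic store, guaranteed to become visible to the loop
wait(t)
```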
So what does `Atomic` enforce differently here? Again, let's look at the LLVM IR, this time passing in an `Atomic{Bool}`:
```llvm
; Function Signature: f(Base.Threads.Atomic{Bool})
; @ REPL[5]:1 within `f`
define void @julia_f_2439({}* noundef nonnull align 1 dereferenceable(1) %"running::Atomic") #0 {
top:
  %0 = bitcast {}* %"running::Atomic" to i8*            ; cast to i8*
  br label %L2

; this is our loop!!!!
L2:                                                     ; preds = %L2, %top
; @ REPL[5]:3 within `f`
; ┌ @ atomics.jl:358 within `getindex`
   %rv.i = load atomic i8, i8* %0 acquire, align 16     ; load the (possibly different) data
; └
; ┌ @ promotion.jl:620 within `==`
   %.not = icmp eq i8 %rv.i, 1                          ; comparison
; └
  br i1 %.not, label %L2, label %L12                    ; loop again/branch?

L12:                                                    ; preds = %L2
; @ REPL[5]:6 within `f`
  call void @j_println_2447({}* @"jl_global#2448.jit")
  ret void
}
```
Right away, we can see a bunch of differences: `top` only contains the cast to `i8*`, and immediately branches to `L2`. Our loop now actually contains some instructions! If you compare the `load` instructions from this version to the one above, you'll see two differences:
```llvm
%1 = load i8, i8* %0, align 1                     ; Base.RefValue{Bool}
%rv.i = load atomic i8, i8* %0 acquire, align 16  ; Threads.Atomic{Bool}
```
- The load is now marked `atomic`, which indicates that it must be performed as a single atomic operation.
- The load has the `acquire` keyword after the variable we're loading from (that's the `i8*`), indicating that this load has cross-thread memory-ordering implications.
These two in combination mean that the optimization from above (Loop Invariant Code Motion) no longer applies: the `acquire` keyword indicates that some memory ordering is taking place (in this case, equivalent to acquiring a lock without actually having a lock), which in turn means that the loaded value can legitimately differ between loads. As a consequence, the load can no longer be hoisted outside of the loop; it must be done over and over again.
You may ask: why is the behavior of `Atomic` not the default? Well, this same optimization (Loop Invariant Code Motion) also hoists unnecessary loads from `Ref`s and `view`s in other loops, and the same semantics allow the compiler to concretely evaluate such loads at compile time. If the behavior of `Atomic` were the default, the compiler could not assume that the load will yield the same data every time, and hence could not concretely evaluate it at compile time. This would have a very big impact on performance & optimization overall.
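As an illustration of the kind of load LICM is allowed to hoist (a minimal sketch; the function name is mine): since `r` is never written inside the loop, the compiler may legally load it once, up front, instead of on every iteration:

```julia
function scaled_sum(r::Base.RefValue{Int}, xs)
    s = 0
    for x in xs
        s += x * r[]   # r[] is loop-invariant; LICM may hoist this load out of the loop
    end
    return s
end
```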
So why does adding `yield()` help? It's a good question, and one I don't have a good answer to. My gut says "because it's an explicit point where control is given away from the current task", and that causes the compiler to bail out of these sorts of optimizations. It shouldn't need to, though: in principle, the compiler is still allowed to move the load, because there is nothing tying the `yield()` to "the `Ref` may change". So I'd classify "adding `yield` makes it work" as just as unsafe as not having a synchronization primitive at all.
[1] Anticipating some comments: Julia itself does not currently (to my knowledge) do LICM on the Julia level (though there is a PR introducing that). All mentions of LICM in this post refer to what happens on the LLVM IR level.
[2] @jakobnissen asked on Slack whether this means the semantics of the language are being violated:
> You do see how this disturbs me, right? It sure looks like the semantics of the language is being violated: The user mutates an object and this mutation is deleted by the compiler, changing the semantics through an optimisation
This is not the case; `Ref` doesn't guarantee squat about multithreading. It's not true that the write to the `Ref` is being deleted; it's that the written value is never seen by the other tasks, because they (rightly so) thought it was a variable from their own memory, one that only they can mutate (and hence no synchronization is needed). That's why Rust has `Arc` (an atomically reference-counted pointer) to use instead of `Rc` (its non-atomic counterpart) when you want to share data/memory across threads, and in fact forces you to use it through its `Send`/`Sync` marker traits.
[3] @jakobnissen also asked whether this is true for any mutable struct:
> Are you then saying that actually, it's illegal in Julia to mutate a Ref in one task while another task is reading it? And if so, does this extend to all mutable, non-atomic structs?
In Julia, it is not illegal to do that, no; we're not Rust (though I would personally like for this to be illegal! That would require a much more stringent compiler than we have, and would likely make a lot of `@threads` loops immediately much more difficult to write). You do get the same immediate weird/odd/counterintuitive behavior as in this thread, no matter whether it's a `RefValue` or a custom user-defined `struct`; the semantics (and thus the potential for data races etc.) are the same. Barring any `@atomic` fields, the struct is assumed to be confined entirely to the task that originally allocated it. This is why synchronization primitives like locks are required to safeguard accesses to non-atomic data structures and make accessing them threadsafe.