Stopping a task based on a variable

Hi, I am trying to stop a @spawn’ed task based on the value of a variable. Here is an example:

using Base.Threads
function f(running)
    i = 0
    while running[] == true
        i = i + 1
    end
    println("I'm done")
end
running = Ref(true)
t = @spawn f(running)
running[] = false

My expectation was that I would see “I’m done” printed once I run the running[] = false line, but it isn’t working.

The real application is starting up a server, having it go to the background (so I can do other things in the REPL, like querying it) and then nicely shutting the server down from the REPL (i.e., finishing the task running the server).

Probably I am approaching the problem in the completely wrong way. Any pointers are appreciated.


That is more or less exactly something I have tried lately, with success. Two things:

  1. Use Threads.Atomic instead of Ref. It is very similar, but thread-safe.
  2. Does Threads.nthreads() return 1? If so, the issue could be that your one and only thread gets stuck in the while loop and never gets a chance to yield. Adding a yield() manually might help in this case. However, if that is indeed the case, @async would be more appropriate than Threads.@spawn, and Ref is fine.

Thank you. This helped immensely!

For the record, here is a summary of different trials:


using Base.Threads

#try 1 - does not work
function f(running)
    i = 0
    while running[] == true
        i = i + 1
    end
    println("I'm done")
end

running = Ref(true)
t = @spawn f(running)
running[] = false

#try 2 - works
function f(running)
    i = 0
    while running[] == true
        i = i + 1
    end
    println("I'm done")
end

running = Threads.Atomic{Bool}(true)
t = @spawn f(running)
running[] = false

#try 3 - works
function f(running)
    i = 0
    while running[] == true
        i = i + 1
        yield()
    end
    println("I'm done")
end

running = Ref(true)
t = @async f(running)
running[] = false


#try 4 - does not work
function f(running)
    i = 0
    while running[] == true
        i = i + 1
    end
    println("I'm done")
end

running = Ref(true)
t = @async f(running)
running[] = false

Try 2 (Atomic, @spawn) and Try 3 (Ref, @async, with yield) work as expected, while Try 4 (like try 3, but without yield() in there) does not.

It’s noteworthy that the docstring of @async seems to strongly discourage using it (it reads a little like “don’t use this if you don’t know what you are doing” :slight_smile: …).


Right, so perhaps Threads.@spawn is indeed almost always better…

Why did you switch to @async? I think you can use yield() no matter whether you used @async or @spawn. IIUC, both @async and @spawn create a new Task. It’s just that @async also messes with scheduling and is thus discouraged.


There’s a bit of confusion here about what things work, how and why. Let’s see if we can’t crack this, starting with the original function:

julia> function f(running)
           i = 0
           while running[] == true
               i = i + 1
           end
           println("I'm done")
       end
f (generic function with 1 method)

and looking at what LLVM IR is produced here, when we put in a Ref (don’t worry, I’m going to go through every single line):

; Function Signature: f(Base.RefValue{Bool})    ;  this is the signature of this code
;  @ REPL[5]:1 within `f`    ; where it was defined
define void @julia_f_2169({}* noundef nonnull align 1 dereferenceable(1) %"running::RefValue") #0 {
top:
  %0 = bitcast {}* %"running::RefValue" to i8*   ; cast to i8*
  %1 = load i8, i8* %0, align 1                  ; load the Bool data
  %2 = and i8 %1, 1                              ; mask out the truthiness value
  %.not = icmp eq i8 %2, 0                       ; compare to false
;  @ REPL[5]:3 within `f`
  br i1 %.not, label %L8, label %L2    ; if it was false, goto L8, else goto L2

; this is our loop!!!!
L2:                                               ; preds = %L2, %top
  br label %L2      ; jump to L2 _unconditionally_


L8:                                               ; preds = %top
;  @ REPL[5]:6 within `f`
  call void @j_println_2175({}* @"jl_global#2176.jit")
  ret void
}

The first two lines are purely informational comments, stating first what the Julia function signature is and where it was defined. The third line is the LLVM function definition of our f. The Ref is a Base.RefValue, which in LLVM ends up as a pointer (note that the argument is called running::RefValue). LLVM IR is built up of basic blocks, each containing a sequence of instructions. Jumping between blocks is how control flow like branching or loops is done.

The first block is called top, it represents the entry point of the function. The second block is called L2, in this case representing our while loop. The third block is called L8, in this case representing the code after the loop.

In Julia, boolean values are (currently) stored as a whole byte, of which only the lowest bit holds the true/false value. If the lowest bit is 1, the value is true; if it’s 0, the value is false. So starting from the top, we immediately cast the incoming pointer of our running variable to an i8* (a pointer to a byte), storing that cast pointer in the variable %0. We then load the byte (stored in %1) and mask out the single bit that actually encodes the data (stored in %2). Now the loop condition of our original code is checked: the masked bit is compared to false, and we subsequently jump to either the actual loop L2 or to the println after the loop, depending on the result.
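As a quick sanity check of that byte layout (hedged: the exact encoding is an implementation detail and may change), you can reinterpret a Bool in the REPL:

julia> sizeof(Bool)              # a Bool occupies a whole byte
1

julia> reinterpret(UInt8, true)  # only the lowest bit carries the value
0x01

julia> reinterpret(UInt8, false)
0x00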

If you examine L2 very closely ( :wink: ) you’ll notice that except for jumping right back to L2, it’s completely empty! So the loop spins forever and ever, never checking running again.

This may sound odd - after all, the Ref may change value! However, this is a consequence of the semantics of Ref - it doesn’t guarantee that storing data into it is immediately (or ever) visible to other tasks, or that other tasks must check for a new value every time they look at it. Ref is chiefly a single-threaded concept - to Ref, there is only ever one thread of execution (its own), and since no one else is running and the Ref is never written to inside of the loop, loading its value inside of the loop is superfluous - the semantics dictate that the value loaded from that Ref ought to be the same, no matter whether it’s loaded before the loop runs or while the loop is running. Because reloading the same value from a pointer over and over again is quite obviously redundant work, the load is moved to before the loop. This is an optimization called “Loop Invariant Code Motion” [1], because the operation (in this case, loading from the Ref) is a never-changing invariant from iteration to iteration.
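In source-level terms, here’s a hypothetical sketch of what LICM effectively turns the function into (the compiler works on the IR, not on Julia source, but the effect is the same):

# Hypothetical source-level view of the LICM'd function:
function f_hoisted(running)
    i = 0
    r = running[]     # the load is hoisted out of the loop (LICM)...
    while r == true   # ...so the condition can never change,
        i = i + 1     # and once entered, the loop spins forever
    end
    println("I'm done")
end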

So this is a very important point - RefValue cannot be used for synchronization across threads because it doesn’t guarantee that any changed value is actually read by other tasks reading from it. [2][3] RefValue is a single threaded concept.

So what does Atomic enforce differently here? Again, let’s look at the LLVM IR, by passing in an Atomic{Bool}:

; Function Signature: f(Base.Threads.Atomic{Bool})
;  @ REPL[5]:1 within `f`
define void @julia_f_2439({}* noundef nonnull align 1 dereferenceable(1) %"running::Atomic") #0 {
top:
  %0 = bitcast {}* %"running::Atomic" to i8*     ; cast to i8*
  br label %L2


; this is our loop!!!!
L2:                                               ; preds = %L2, %top
;  @ REPL[5]:3 within `f`
; ┌ @ atomics.jl:358 within `getindex`
   %rv.i = load atomic i8, i8* %0 acquire, align 16    ; load the (possibly different) data
; └
; ┌ @ promotion.jl:620 within `==`
   %.not = icmp eq i8 %rv.i, 1                ; comparison
; └
  br i1 %.not, label %L2, label %L12          ; loop again/branch?

L12:                                              ; preds = %L2
;  @ REPL[5]:6 within `f`
  call void @j_println_2447({}* @"jl_global#2448.jit")
  ret void
}

Right away, we can see a bunch of differences - top only contains the cast to i8*, and immediately branches to L2. Our loop now actually contains some instructions! If you compare the load instructions from this version to the one above, you’ll see two differences:

%1 = load i8, i8* %0, align 1                       ; Base.RefValue{Bool}
%rv.i = load atomic i8, i8* %0 acquire, align 16    ; Threads.Atomic{Bool}

  1. The load is now marked atomic, which indicates that it must be performed as a single, indivisible operation.
  2. The load has this acquire keyword after the variable we’re loading from (that’s the i8*), specifying a memory ordering: an acquire load synchronizes with matching release stores from other threads.

These two in combination mean that the optimization from above (Loop Invariant Code Motion) no longer applies - the acquire keyword indicates that there is some memory ordering taking place (in this case, equivalent to taking a lock without actually having a lock), which in turn means that there is the potential for the loaded value to be different between these loads. As a consequence, the load can no longer be hoisted outside of the loop - it must be done over and over again.

You may ask, why is the behavior of Atomic not the default? Well, this same optimization (Loop Invariant Code Motion) also hoists unnecessary loads from Refs and views in other loops, and the same semantics allow the compiler to concretely evaluate such loads at compile time. If the behavior of Atomic were the default, the compiler could not assume that the load returns the same data - and hence could not concretely evaluate it at compile time. This would have a very big impact on performance & optimization overall.
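To make the performance argument concrete, here’s a hypothetical loop where hoisting the Ref load is exactly the behavior you want - reloading c[] for every element would be pure overhead:

# Hypothetical example: LICM lets the compiler load c[] once, not once per element.
function scale!(xs::Vector{Float64}, c::Ref{Float64})
    for i in eachindex(xs)
        xs[i] *= c[]
    end
    return xs
end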

So why does adding yield() help? It’s a good question, and one I don’t have a good answer to. My gut says “because it’s an explicit point where control is given away from the current task”, and that causes the compiler to bail out of any of these sorts of optimizations. It shouldn’t need to though - in principle, it is still allowed to move the load, because there is nothing tying the yield() to “the Ref may change”. So I’d classify “adding yield makes it work” as just as unsafe as not having a synchronization primitive at all.


[1] Anticipating some comments: Julia itself does not currently (to my knowledge) do LICM on the Julia level (though there is a PR introducing that). All mentions of LICM in this post refer to what happens on the LLVM IR level.

[2] @jakobnissen asked on Slack whether this means the semantics of the language are being violated:

You do see how this disturbs me, right? It sure looks like the semantics of the language is being violated: The user mutates an object and this mutation is deleted by the compiler, changing the semantics through an optimisation

This is not the case; Ref doesn’t guarantee squat about multithreading. It’s not true that the write to the Ref is being deleted; it’s that the written value is never seen by the other task, because it (rightly so) treated the Ref as memory of its own that only it can mutate (and hence no synchronization is needed). That’s why Rust has Arc (atomic reference counting) to use instead of Rc (plain, non-atomic reference counting) when you want to share data/memory across threads - and in fact forces you to use it via the Send/Sync marker traits.

[3] @jakobnissen also asked whether this is true for any mutable struct:

Are you then saying that actually, it’s illegal in Julia to mutate a Ref in one task while another task is reading it? And if so, does this extend to all mutable, non-atomic structs?

In Julia, it is not illegal to do that, no - we’re not Rust (though I would personally like for this to be illegal! This requires a much more stringent compiler than we have though, and would likely make a lot of @threads loops immediately much more difficult to write). You do get immediate weird/odd/counterintuitive behavior like the one in this thread, no matter whether it’s RefValue or a custom user-defined struct - the semantics (and thus the potential for data races etc) are the same. Barring any @atomic fields, the struct is assumed to live and be contained entirely within the same task that originally allocated it. This is why synchronization primitives like locks are required to safeguard accesses to non-atomic datastructures and make accessing them threadsafe.
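For completeness, here’s a minimal, hypothetical sketch of what such safeguarding looks like: a plain (non-atomic) mutable struct shared across tasks, with every access wrapped in a lock (all names here are made up for illustration):

# Hypothetical sketch: guarding a non-atomic mutable struct with a lock.
mutable struct State
    running::Bool
end

const state = State(true)
const state_lock = ReentrantLock()

function f_guarded()
    while lock(() -> state.running, state_lock)  # every read goes through the lock
        yield()  # let other tasks run if we only have one thread
    end
    println("I'm done")
end

t = Threads.@spawn f_guarded()
lock(() -> (state.running = false), state_lock)  # every write, too
wait(t)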


https://llvm.org/docs/Atomics.html is a pretty good read, even for people not working on compilers.

Secondly, the Julia Atomics Manifesto (Julia Atomics Manifesto - HackMD) explains why we needed to add @atomic to the language in 1.7.

The first realization is that for everything non-atomic, the semantics only guarantee that operations are ordered with respect to a single thread. Doing two loads from the same location without doing a write to the location in between is the same as just doing one load, and as @Sukera showed above, LLVM will happily optimize that for you.
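A hedged sketch of the load side of that statement: two reads from the same Ref with no write in between may be folded into a single load:

# The compiler may reuse the value loaded for `a` instead of loading twice.
function g(r)
    a = r[]
    b = r[]
    a + b
end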

This also holds true for stores!

function f(r)
    r[] += 1
    r[] += 1
    nothing
end
define void @julia_f_153({}* noundef nonnull align 8 dereferenceable(8) %0) #0 {
top:
;  @ REPL[1]:2 within `f`
; ┌ @ refvalue.jl:59 within `getindex`
; │┌ @ Base.jl:37 within `getproperty`
    %1 = bitcast {}* %0 to i64*
    %2 = load i64, i64* %1, align 8
; └└
;  @ REPL[1]:3 within `f`
; ┌ @ int.jl:87 within `+`
   %3 = add i64 %2, 2
; └
; ┌ @ refvalue.jl:60 within `setindex!`
; │┌ @ Base.jl:41 within `setproperty!`
    store i64 %3, i64* %1, align 8
; └└
;  @ REPL[1]:4 within `f`
  ret void
}

So we optimized two loads and two stores down to one load and one store.
This is a nifty optimization, and in a serial program it is obviously legal.

Now in a concurrent program you may ask “what happens if another task writes to that location?” - and this code already has multiple data races (even without optimizations). So in the presence of concurrency we need to use either atomic operations or locks. Locks are often easier to reason about, but atomics are more fun xD
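As a hedged sketch of the atomic route, the double increment from f above can be made race-free with Threads.Atomic and atomic_add!:

using Base.Threads

r = Atomic{Int}(0)
atomic_add!(r, 1)  # a single, indivisible read-modify-write
atomic_add!(r, 1)
r[]                # 2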

One complaint you may have is that while these optimizations may be nice, wouldn’t it be easier if the compiler didn’t do them for you? The problem is that modern CPUs play even more tricks like these under the hood. So as a language we would need to promote everything to atomic operations, and people would really quickly complain about the real performance loss.

Mara Bos’s book on atomics is a great read and requires only very little Rust knowledge: Rust Atomics and Locks by Mara Bos.


From my POV, there are two issues with the manifesto:

That being said, the kinds of issues like those in this thread are exactly the reason why @atomic and other such constructs now exist on a language level, and not just as a particular struct implementation. The manifesto is a good resource for how that is done, I just wish it were better documented in the manual…
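For reference, here’s a minimal sketch of that language-level construct applied to this thread’s example (the struct and names are made up for illustration):

# Hypothetical sketch: an @atomic field instead of Threads.Atomic.
mutable struct Server
    @atomic running::Bool
end

function serve(s::Server)
    i = 0
    while @atomic(s.running)  # an atomic load; cannot be hoisted out of the loop
        i += 1
    end
    println("I'm done")
end

s = Server(true)
t = Threads.@spawn serve(s)
@atomic s.running = false     # an atomic store; the loop will observe it
wait(t)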

Yes, this really is a great reference. One caveat is that Rust can do some things that we can’t - for example, in Julia it’s tremendously difficult to guarantee semantically that only one mutable reference to a value exists, making some datastructures “build & pray” that no one messes with their internals. This property is used in a bunch of places in the book, and is pretty much the raison d’être for Rust as a language.

I would have assumed this was because yield() allows for task switching, hence the compiler can no longer assume that a Ref won’t change values in a single-threaded context either?

Well, that’s why I mentioned that nothing inherently ties the Ref to that yield though. The IR doesn’t suggest anything of the sort, so LLVM should have moved the load. We do run passes for dead code elimination (that’s why the counter got removed), so evidently not all optimizations are disabled by this.

My only guess left is that the LICM pass on the LLVM level in our pipeline is only run if there isn’t any yielding primitive in the examined code, as a safety net of sorts. This makes sense as a heuristic, because loops that yield are unlikely to be extremely hot themselves, making the penalty of having to load unnecessarily not as bad.

It’s just a guess though, I haven’t looked at the passes in detail. Semantically, the yield is not tied to the Ref, so in theory it is still possible to be moved.

No, it’s simpler than that.

LLVM models the side-effects of functions, currently the Julia compiler does not always communicate the side-effects particularly well.

Take f:


julia> function f(r)
           while r[]
               yield()
           end
       end

Using: @code_llvm raw=true dump_module=true f(Ref(true))

we see a loop:

L3:                                               ; preds = %top, %L3
;  @ REPL[2]:3 within `f`
  %9 = call swiftcc nonnull {} addrspace(10)* @j_yield_201({}*** nonnull swiftself %0), !dbg !35
;  @ REPL[2]:2 within `f`
; ┌ @ refvalue.jl:59 within `getindex`
; │┌ @ Base.jl:37 within `getproperty`
    %10 = load i8, i8 addrspace(11)* %6, align 1, !dbg !16, !tbaa !23, !alias.scope !27, !noalias !30
    %11 = and i8 %10, 1, !dbg !16
    %.not = icmp eq i8 %11, 0, !dbg !16
; └└
  br i1 %.not, label %L5, label %L3, !dbg !22

The attributes on j_yield are:

declare swiftcc nonnull {} addrspace(10)* @j_yield_201({}*** nonnull swiftself) #0
attributes #0 = { "frame-pointer"="all" "probe-stack"="inline-asm" }

So we tell LLVM nothing about this function’s memory semantics. LLVM can’t assume that the yield function (or any other function, for that matter) doesn’t have a way to reach and modify the memory, so even in a single-threaded program hoisting the load is disallowed. Now it’s debatable what effects yield in particular should have, but there may be an argument for it to be inaccessiblememonly.

TL;DR: most functions that are not inlined currently have under-specified memory effects and thus will block optimizations. This is also the reason why LLVM can’t perform LICM on Julia functions like sin.
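A hedged illustration of that last point: the sin(x) call in this loop is loop-invariant, yet is not hoisted at the LLVM level, because LLVM is told nothing about sin’s memory effects:

# sin(x) is invariant across iterations, but LLVM may not hoist it,
# since it cannot prove the call doesn't touch memory.
function g(x, n)
    s = 0.0
    for _ in 1:n
        s += sin(x)
    end
    return s
end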


The flip side of this is that if LLVM would know about the effects of yield, the pointer load would again be eligible for LICM, right?

Ignoring for a moment the precise effects for yield & co, and substituting yield with sin for this discussion: yes. If https://github.com/JuliaLang/julia/pull/50188 succeeds or we manage to infer effects of called functions on the LLVM level, many more optimizations may occur, including this loop being folded away without atomics. yield is tricky…


If you are trying to also communicate with the running task, you might want to consider a Channel.

julia> function f(ch)
           i = 0
           while isopen(ch)
               i += 1
               isempty(ch) ||
                   println(take!(ch))
               yield()
           end
           println("I'm done")
       end
f (generic function with 1 method)

julia> ch = Channel{String}(f)
Channel{String}(0) (empty)

julia> put!(ch, "Hello world");
Hello world

julia> put!(ch, "Time to die");
Time to die

julia> close(ch)
I'm done

Note that Channel uses atomics under the hood and has a spawn keyword for multithreading.
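For example (a hedged sketch, reusing the same f from above), passing spawn=true lets the channel’s task run on any available thread rather than the current one:

julia> ch = Channel{String}(f; spawn=true)
Channel{String}(0) (empty)

julia> put!(ch, "Hello from another thread");
Hello from another thread

julia> close(ch)
I'm done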


Thanks for all the replies. They not only solved my immediate issue, I also learnt a lot about the why/background. Great stuff!


Sounds like a typical Julia Discourse experience to me :wink:
