Elimination of duplicate atomic loads (more of an llvm question)

I am wondering why llvm doesn't optimize the following:

julia> mutable struct A
       @atomic x::Int
       end

julia> f(a) = (@atomic :acquire a.x) + (@atomic :acquire a.x);

julia> @code_llvm f(A(1))
; Function Signature: f(Main.A)
;  @ REPL[6]:1 within `f`
define i64 @julia_f_1489(ptr noundef nonnull align 8 dereferenceable(8) %"a::A") #0 {
top:
; ┌ @ Base_compiler.jl:78 within `getproperty`
   %"a::A.x" = load atomic i64, ptr %"a::A" acquire, align 8
   %"a::A.x1" = load atomic i64, ptr %"a::A" acquire, align 8
; └
; ┌ @ int.jl:87 within `+`
   %0 = add i64 %"a::A.x1", %"a::A.x"
   ret i64 %0
; └
}

The two adjacent loads look obviously redundant – it should be possible to transform this into

f_optim(a) = begin tmp = @atomic :acquire a.x; tmp + tmp end

However, as far as I understand, none of llvm/clang, gcc, icc, or msvc performs this optimization.

So my questions are two-fold:

  1. Would that be an admissible optimization under the prevailing memory model?
  2. Why is that not done?

Now, the actual thing I want is a thread-safe lazy value that can be hoisted out of loops.

That is, something like the following:

julia> mutable struct Lazy{T,Init}
           const init::Init
           @atomic hasrun::UInt8 # 0: uninit, 1: running, 2: error, 3: ready
           item::T # put something invalid here before init
       end

julia> @inline function getLazy(lazy::Lazy)
           if 3 != @atomic :acquire lazy.hasrun
               @noinline initLazy(lazy) # will throw on error and spin/wait if necessary
               @atomic :release lazy.hasrun = 3
           end
           return lazy.item
       end
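
For concreteness, initLazy could look roughly like this (a sketch of what I have in mind; the exact state machine is an illustration, not load-bearing for the question):

@noinline function initLazy(lazy::Lazy)
    # try to claim the running state: 0 (uninit) -> 1 (running)
    claimed = @atomicreplace :acquire_release :acquire lazy.hasrun UInt8(0) => UInt8(1)
    if claimed.success
        try
            lazy.item = lazy.init()
        catch
            @atomic :release lazy.hasrun = UInt8(2) # error
            rethrow()
        end
        # getLazy publishes 3 (ready) with its store-release after we return
    else
        while true # another task is initializing: spin until ready or error
            s = @atomic :acquire lazy.hasrun
            s == UInt8(3) && return
            s == UInt8(2) && error("initialization failed on another task")
        end
    end
end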

The internals of initLazy should not matter at all: If I have a sequence

getLazy(lazy)
#something that llvm understands
getLazy(lazy)

then llvm should do its very best to reorder the second load @atomic :acquire lazy.hasrun upwards, across the #something block (this is allowed as long as the something-block doesn't forbid it with barriers of its own; if I didn't want that, I would have written @atomic :sequentially_consistent lazy.hasrun), see that the value can be forwarded either from the store or from the previous load, dead-code-eliminate the entire second getLazy call, and reuse the output of the first getLazy.
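
Hand-fusing the two calls (with the empty something-block) would give something like this - written by hand, i.e. what I wish the optimizer produced:

function two_getLazy(lazy::Lazy)
    if 3 != @atomic :acquire lazy.hasrun
        @noinline initLazy(lazy)
        @atomic :release lazy.hasrun = 3
    end
    item = lazy.item # the second acquire load is forwarded and folded away
    return (item, item)
end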

But llvm doesn’t even do this for the simplest #something-block, namely the empty one.

Am I insane? Can anyone link me to literature about that?

No, because another thread can semantically modify A between these two loads. That’s why they are not redundant! See also LLVM Atomic Instructions and Concurrency Guide — LLVM 21.0.0git documentation. The compiler must optimize this in the same way, no matter whether the A instance you pass in is actually shared across threads or not.

More intuitively, :acquire corresponds semantically to taking a lock (in fact, this is how locks are implemented), and you wouldn’t want the compiler to reorder your locks, would you?
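
To make that intuition concrete, here is a minimal spinlock sketch (illustrative only - the names are made up, and real code should use Threads.SpinLock or ReentrantLock):

mutable struct MySpinLock
    @atomic held::Bool
end

function mylock!(l::MySpinLock)
    while true
        # acquire side: everything after a successful compare-and-swap
        # happens-after it, i.e. sits "inside" the lock
        r = @atomicreplace :acquire_release :monotonic l.held false => true
        r.success && return
    end
end

# release side: everything before this store happens-before it,
# so it also stays "inside" the lock
myunlock!(l::MySpinLock) = @atomic :release l.held = false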

You probably want :monotonic in this case, since non-atomics and :unordered operations are permitted to be reordered around this (i.e., the code in your loop is permitted to be “moved past” the monotonic load, which is effectively the same as hoisting the atomic).
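
A hand-written before/after sketch of the hoist this describes (not actual compiler output; check_all is a made-up placeholder):

# before: the flag is re-loaded on every iteration
function check_all(lazy::Lazy, n)
    for _ in 1:n
        hr = @atomic :monotonic lazy.hasrun
        hr == 3 || error("not ready")
    end
end

# after: the :monotonic load may legally move past the non-atomic loop body
function check_all_hoisted(lazy::Lazy, n)
    hr = @atomic :monotonic lazy.hasrun
    for _ in 1:n
        hr == 3 || error("not ready")
    end
end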


Moving an acquire upwards is equivalent to moving more code inside the lock, which is obviously valid. (moving an acquire downwards would be very invalid, since it moves code outside the lock)

If I can move it upwards enough to hit the previous acquire/release, then I have achieved lock coarsening, which is not invalid if the compiler can prove that this doesn’t prevent forward progress.
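
In lock terms, the transformation I am after is the analogue of the following (work1/work2 are placeholders):

l = ReentrantLock()
work1() = nothing
work2() = nothing

# original: two separate critical sections
lock(l); work1(); unlock(l)
lock(l); work2(); unlock(l)

# coarsened: one critical section - valid as long as nothing between the two
# sections needs another thread to take the lock for forward progress
lock(l); work1(); work2(); unlock(l)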

As far as I understand :monotonic, this only imposes ordering guarantees with respect to the same memory address. So the compiler would be permitted to give me

julia> @inline function getLazy(lazy::Lazy)
           badload = lazy.item
           if 3 != @atomic :monotonic lazy.hasrun
               @noinline initLazy(lazy) # will throw on error and spin/wait if necessary
               @atomic :release lazy.hasrun = 3
               return lazy.item
           else
               return badload
           end
       end

which is very incorrect!

But this is somewhat academic, since the compiler doesn’t even eliminate duplicate adjacent :monotonic loads.

(but please correct me if my understanding of :monotonic is wrong and that would be enough to make the thing safe)

PS. If you want to check godbolt,

#include <atomic>

int atomicDuplicate(std::atomic<int>* some_atomic) {
    int s = 0;
    s += some_atomic->load(std::memory_order_relaxed);
    s += some_atomic->load(std::memory_order_relaxed);
    return s;
}

this doesn't remove the duplicate load on any of icc, msvc, gcc, or clang, on either arm64 or x86_64. C++ relaxed is pretty much llvm/julia :monotonic.
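
The direct Julia analogue (with :monotonic, the counterpart of C++ relaxed) shows the same pair of loads:

julia> mutable struct B
           @atomic x::Int
       end

julia> g(b) = (@atomic :monotonic b.x) + (@atomic :monotonic b.x);

julia> @code_llvm g(B(1)) # still two separate monotonic loads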

Maybe there is a misunderstanding - I was commenting on your first f(a) example, which has no :release matching the previous :acquire. That's why the example I gave only builds an intuition; it's not exactly equivalent, since you can't take a lock twice without deadlocking, unless there is explicit deadlock prevention that makes the acquisition succeed when you already hold the lock (which is not the same as loading some UInt8! That requires the lock to know who currently holds it).

To spin the example further, in order to be able to merge the :acquire loads into one, the compiler would have to prove that both matching :releases can also be merged. Your f(a) example doesn’t even have :release in it, so there is no chance for anything to be analyzed in the first place.

No, that is a wrong use of :release. Without :acquire, :release doesn't provide any guarantees (see here). I mentioned :monotonic in regard to your first example, since reusing a load is a form of CSE, which is permitted with :monotonic (but notably not with :acquire).


It seems to me like there is a bit of confusion about what @atomic is for. You mention in your second example that initLazy should "throw/spin/wait if necessary", but why would that be the case? You've already "taken the lock" with :acquire. The entire purpose of @atomic is to have thread-safe data structures without an additional lock that you spin/wait on. Since you've already acquired access (i.e. no regular store will be moved into or out of the :acquire/:release region), you can safely init & return lazy.item. Of course, this can't be optimized in the same way as non-atomic code can be.

Take a look at base/lock.jl:145 to see how this is implemented for ReentrantLock, and at base/lock.jl:343 onwards for Base.Lockable (which seems to me to be what you're trying to build).
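
For reference, Base.Lockable usage looks roughly like this (a sketch; Lockable is public API as of Julia 1.11):

cache = Base.Lockable(Dict{Symbol,Int}())
# `lock` with a do-block acquires the lock and yields the protected value
lock(cache) do d
    get!(d, :answer) do
        42 # expensive one-time computation goes here
    end
end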

Yes, why should it be permitted to remove the load? In between these two loads, a different thread may have stored a new value into the atomic. It’s impossible to tell without analyzing all uses of the incoming atomic object and inspecting all possible interleavings of threaded execution. That’s precisely why :monotonic is the weakest synchronization mechanism.

Maybe I do misunderstand! The way I understood acquire/release pairs is the following: Stuff after an acquire in program order happens-after it. Stuff before a release in program order happens-before it.

Hence, we must not ever transform

y = @atomic :acquire foo.y
x = foo.x

into

x = foo.x
y = @atomic :acquire foo.y

but the reverse transformation would be OK.

Likewise, it is forbidden to turn

foo.x = x
@atomic :release foo.y = y

into

@atomic :release foo.y = y
foo.x = x

but the reverse is admissible.

These are exactly the barriers we need for the Lazy value above: the load-acquire on the hasrun check is paired with the store-release after initialization, and they order memory such that we can only read the value after initialization (which happens in the noinline function, before the store-release in program order).

Because, in the C++ "as-if rule" world, the compiler should be able to assume the most beneficial of all worlds. Because there was no other synchronization between the loads, there is a possible execution order in which nobody wrote to some_atomic between the loads, so the compiler can happily assume that:

    int s = 0;
    int v = some_atomic->load(std::memory_order_relaxed);
    s += v;
    s += v;
    return s;

The reverse transformation would of course be very invalid!

You have not been very clear on this: do you think that eliminating the duplicate load would be forbidden by the memory model, or is it a compiler decision that such an optimization would be too annoying or surprising to users?

Yes, that is correct.

No, that is not correct. You need to be more specific - beneficial to whom/what goal, exactly? And under what semantics? Even C/C++ compilers don’t pick “the most beneficial of all worlds”.

That is the reasoning in a single-threaded world, but it doesn't hold in the multithreaded context that atomic operations imply. In a multithreaded context, "happens after" in an execution is a globally visible property, on all threads, not just the one executing the current function. Hence, looking at two atomic loads in isolation, you cannot decide whether the second load is superfluous - it's the global timeline and all possible interleavings that need to obey the atomic semantics, not just the one that happens to be your desired path.

In a whole-program compilation, with the addition of knowing how many possible interleavings there are, it's not technically forbidden. It's just that this is neither the Julia nor the C/C++ memory model, all of which currently assume single-threaded execution in the absence of atomic operations and potentially unlimited parallel execution in their presence (remember, the number of tasks/threads entering any critical region is a runtime value!). On top of this, C/C++ compilers traditionally don't perform whole-program analysis (or only in rare circumstances), so the kind of analysis needed to establish that this transform is legal in the first place has only recently come into fashion.

This can get very hairy and I’d love to link some Julia-specific resource, but alas, all I can do is point you to Document memory model of Per-field atomics · Issue #46739 · JuliaLang/julia · GitHub, which annoys me very much.

It's not a generally applicable optimization; I think it is incorrect to perform this on the julia/LLVM level without further knowledge about global temporal properties. Not even Rust "optimizes" this (select "LLVM IR" in the hamburger-menu dropdown in the top left of the playground).

Since Rust also doesn’t know about how limited the parallelism involved here is, there is no way to know on a purely per-function level that folding the loads is legal. While this function is running, a different thread could hold a mutable reference to the atomic and modify it midway through, which would change the result.


Both of your questions boil down to the same thing: How much parallelism must the compiler assume at any given point, and would a code transformation here lead to more correct or more incorrect programs? Purely looking at the function in isolation doesn’t give enough information to answer that question (it depends very much on what you’re doing), so it’s the safer choice not to transform the code. Julia currently says data races are UB, and introducing a race by removing a synchronization point (i.e., folding atomic loads without knowing that it’s safe to do so) would probably fall under that point.

I would be with you for any optimizations of

local a = @atomic :acquire foo.x
local b = @atomic :acquire foo.y
local c = @atomic :acquire foo.x

We cannot coalesce the two loads of x, because another thread could do

@atomic :release foo.x = 1
@atomic :release foo.y = 2

or

@atomic :release foo.y = 2
@atomic :release foo.x = 1

and we can’t know which without global program analysis.

(edit: if x and y were 8 bytes each and known to be adjacent in memory, it would be valid to do a single 16-byte-wide atomic load of both of them together; or on a hypothetical architecture that permitted atomically loading from two addresses at the same time)

However, in

local a = @atomic :acquire foo.x
local b = @atomic :acquire foo.x

it is obviously a valid execution/interleaving if nothing else runs in between both loads. The optimizer is permitted to refine the behavior of the program, i.e. each valid execution of the optimized program must be a valid execution of the original program – but the reverse is not true, i.e. it is OK if some valid executions of the original program cannot happen in the optimized program anymore.

The original program could observe a != b, the optimized program can’t, that’s still a valid optimization that refines the semantics.

(“interleavings” is a very single-threaded way of thinking – that can’t even model stuff like store buffering on x86, or the much weaker architectures like ARM)

One optimization we could perform (but don't currently) is to try and detect the scope of your allocation and, if we can prove it's single-threaded, remove the atomics (and potentially even forward the stores to the loads directly, removing the allocation).


Yes, that is one valid interleaving, but so is performing the first load, spinning for ten years while other threads read & write to foo.x, and only then performing the second load. You don’t even need to spin for ten years - a cache eviction at just the right time is enough.

The point is that the compiler cannot assume that nothing else runs between those two loads, hence they cannot be combined into one. By only performing one actual load, the compiler would remove a synchronization point from the program, which is not a legal transform in the general case where you’re looking at more than just the local execution context.

In a setting with whole-program optimization including a fixed set of threads of execution (which, again, none of Julia, C, C++ or Rust have), you could optimize that, as long as you can prove that indeed, nothing else can write to that variable between those two loads. It’s just that none of these languages allow for those semantics. If you want to have that, look at something like TLA+, which can model & solve this just fine.

No, it doesn’t follow the atomic semantics, so it’s not a valid optimization to perform.

I don’t know why you think this - you can interleave threads of execution as much as you want, you can even overlap them. If the “interleaving of threads of execution” model is sufficient to show that your optimization is not valid, why not use that terminology?

There is no allocation here, this topic is purely about a single atomic load from an address.

FWIW, I would also expect it to be legal to merge two directly adjacent acquire loads, but I’m no expert on memory models, so I may be wrong. That said, it’s a very specific case, so I wouldn’t be surprised if the compilers just don’t have the legality check for it, even if it is permissible.

If you use the C-translated version of the example and ask over on the LLVM forums, somebody might be able to give you an authoritative answer.


Because it doesn’t show that the optimization is not valid.

Let’s first fix some very basic facts:

  1. The outcome of many programs, especially multi-threaded ones, is not deterministic. Instead, there is a set of allowed outcomes, typically called "executions". Many, but not all, of the valid outcomes of multi-threading can be interpreted as interleavings.
  2. Some program transformations are "equivalences": the sets of allowed outcomes/executions of the original and transformed programs are exactly the same.
  3. Some program transformations are "miscompilations": there exist outcomes/executions that are valid/possible for the transformed program but were not possible/valid for the original program. We want to avoid these!
  4. Some program transformations are "refinements": all valid executions of the transformed program are also valid for the original program, but some executions that were originally valid are no longer allowed for the transformed program.

Coalescing two adjacent atomic loads from the same address is, I am claiming, a refinement.
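
As a toy illustration of the claim (hand-written; :monotonic just to keep additional ordering effects out of the picture):

mutable struct Foo
    @atomic x::Int
end

# original: the set of valid outcomes includes a != b
function original(foo::Foo)
    a = @atomic :monotonic foo.x
    b = @atomic :monotonic foo.x
    return (a, b)
end

# refined: the only outcome shape, a == b, was already a valid outcome
# of `original`; the transformation merely shrinks the set of executions
function refined(foo::Foo)
    a = @atomic :monotonic foo.x
    b = a
    return (a, b)
end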

You answered “but what if an intervening store happens on another thread”? Well, you have just demonstrated that it’s not an equivalence transformation. Your entire argument structure is to take a valid execution of the original program, and start from there. This cannot show that something is a miscompilation! To do that, you would either have to start from a valid execution of the transformed program and show that it’s invalid for the original program, or if you like indirect proofs, start from an execution that is invalid for the original program and show that it’s valid for the transformed program.

Or is our disagreement that you would prefer if the compiler only did “equivalence” transformations and no “refinement” transformations? (then how would you handle UB?)


We have to make a clear distinction between actual non-determinism where you source some data from outside your model (e.g. by fetching entropy from a hardware RNG) and branch based on that, and perceived non-determinism because of internal machine state that you don’t control.

The vast majority of programs are very much deterministic, in the sense that they don't query a hardware RNG (no, /dev/urandom is not usually a hardware RNG) and their execution can be completely replayed given the same inputs (data loaded from disk, the network, etc). The machine you're executing your program on just has state that you traditionally don't control, which ends up influencing the execution of your code. Luckily, there's stuff like https://antithesis.com, who run your code in a deterministic (yes, down to cache misses etc.) hypervisor, eliminating exactly the uncontrollable state that is the source of your perceived "non-determinism".

Here’s the thing - I generally don’t want the compiler to “optimize” or transform anything in the presence of UB. The input program is already invalid - in an ideal world, it should be rejected. Practically, this isn’t always possible, at least in the “just let the compiler pick the happy path” sense without introducing exactly those miscompilations you claim you want to avoid, in particular when seemingly benign transformations combine together to introduce a miscompilation where each individual transform does not.

If you want to look at your example/argument through the lens of UB, it could go like so:

  1. I have a piece of code with a data race (loading twice from some address that another thread may write to concurrently), which could change the result depending on the execution context.
  2. Data Races are UB in Julia.
  3. The compiler should pick the “most benevolent” world of execution that could occur.
  4. Therefore, my chosen transformation is valid.

The issue is in step 3: you still haven't defined a general principle by which this should happen. Currently, your position seems to be "because that suits my use case", which is neither generally applicable nor leading to a generally desirable outcome, because it removes a synchronization point between threads from the program. Even more troubling, it relies on the fact that there is a data race in the first place! So it seems to me that your stance is that UB gives the compiler license to do whatever it desires.

On the other hand, my argument boils down to “there’s a data race here - if you permit that at all, don’t optimize it without certain knowledge that doing so is benign/invisible”. If you can prove that all other threads, at all points in time & over all possible interleavings/executions/weavings/however you want to call it, cannot prevent the thread executing f(a) from reading the same value twice, then you can remove the second load. Unfortunately, this means that you cannot look at these two loads in isolation and decide that you can combine them, because the available information local to f(a) is not sufficient to prove the above property.

This is notably different from volatile, which must persist, even if the program itself never writes to that address, because volatile is explicitly about external modifications to an address (which you cannot reason about, even with whole-program analysis), whereas atomic is about internal, i.e. modifications to an address introduced by the program itself (which you can in principle reason about with whole-program analysis, it’s often just prohibitively expensive).

Slight self-correction here: Because the signature of f(a) in Rust would take an immutable reference (after all, we only read), the borrow checker already catches this failure mode. In order to modify the atomic, we need a mutable reference, but if we have that, we can’t have a live immutable reference to read from at the same time. We could of course have f(a) take a mutable reference, but then we can’t take another mutable reference to modify the variable. So Rust correctly rejects a program with such a data race, and doesn’t even get to the optimization stage where this could become an issue.

That’s nonsense. We’re talking about the following kind of situation:

mutable struct Flag
    @atomic start_the_race::Bool
end
mutable struct IsAtomic
    @atomic x::Int
end

const flag = Flag(false)
const atomic = IsAtomic(0)
res1 = nothing
res2 = nothing

writer = Threads.@spawn begin
    while !(@atomic flag.start_the_race); end
    @atomic atomic.x = 1
end

reader = Threads.@spawn begin
    while !(@atomic flag.start_the_race); end
    a = @atomic atomic.x
    b = @atomic atomic.x
    global res1 = a
    global res2 = b
end

@atomic flag.start_the_race = true

wait(reader); wait(writer)

@show res1, res2

This is a data race. If IsAtomic.x were not atomic, it would be UB. In the java world, it would be "something sane", aka no "out of thin air" values, but afaiu the details (i.e. the semantics of @atomic :unordered) are an area of active research and debate, with the llvm langref intentionally vague (afaiu it's not even settled whether :unordered has any coherent semantics beyond a hand-wavy "try to be sane").

Since it is atomic, it is not UB; it is a valid program. Nevertheless the outcome is indeterminate: first, the language semantics of llvm/julia do not uniquely determine the output; second, in the obvious compilation, not even the processor manual or "well-known" micro-architectural state will determine the outcome; vagaries like manufacturing details and the temperature of your cores will probably come into play.

But nevertheless, the language semantics do restrict the set of possible outcomes. The permissible outcomes for (res1, res2) are (0, 0), (0, 1), (1, 1), and nothing else.

If the compiler decided to deduplicate the atomic loads, this would restrict the set of possible outcomes to (0, 0) and (1, 1). Each of these outcomes was permissible for the original program; so this is a "refinement", and we're generally happy with such transformations.

Oh, the general principle is: the compiler people have a bunch of heuristics for what code runs faster. Doing one load instead of two is obviously faster; and the "best of all worlds" argument was just me being handwavy about the following: in order to show that a transform is valid, you need to show that all outcomes of the transformed program are permissible for the original program. The back-to-back atomic loads seeing the same value is certainly permissible, so it's OK to fix it to "guaranteed" and exclude (0, 1) from the set of possible outcomes.
