Automatic Compiler Optimizations and Multithreading

I learned in my HPC class that for languages such as C/C++ and Fortran, invoking automatic compiler optimizations with the (gcc) -On flags can lead to unintended behavior when compiling multithreaded programs; something along the lines of

The compiler will automatically optimize out a piece of code that would be unnecessary or sub-optimal for a single-threaded program, but necessary for multithreaded programs.

I really haven’t seen anything on the usage of Julia’s automatic compiler optimizations, but I’m curious how Julia’s JIT (or PackageCompiler.jl’s AOT) compilation process handles multithreaded programs compared to the others. If I didn’t know any better, I’d just assume it’s the same, but since Julia was designed from the ground up with parallelism in mind, I assume it’s better tailored? Or are these things handled by the underlying LLVM and therefore less customizable?

The only resources I could find are the dev/JIT docs and this Discourse post from 2020.

Maybe your course is saying that you need to use things like the volatile keyword in C/C++, which forces a variable to be loaded from memory every time it is read.
In all these languages, memory reads can often be optimized away. This is extremely desirable.

But, it could potentially break multithreaded code, e.g., you may have

julia> function foo(x, a)
          b = x[]
          c = a + b
          b == x[] ? a : 5c
       end
foo (generic function with 2 methods)

julia> @code_llvm debuginfo=:none foo(Ref(3), 5)
define i64 @julia_foo_426({}* noundef nonnull align 8 dereferenceable(8) %0, i64 signext %1) #0 {
top:
  ret i64 %1
}

This is a pretty fake example, but the idea is: the code looks like it considers the fact that x[] could be written to by another thread (and thus that the answer may change), yet the code gets optimized in a way that makes that impossible. The whole function turns into return a, because the compiler assumes b == x[] is always true.

If you’re writing code like that, you need to use things like Threads.Atomic or Base.@atomic. Check the docstring (? Base.@atomic in the REPL) for more info.
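
As a minimal sketch (the name foo_atomic and the use of Threads.Atomic here are my own illustration, not from this thread), the same shape as foo above but with atomic loads could look like this:

# Illustrative only: because `x` is a Threads.Atomic, concurrent writes from
# other threads are well-defined behavior rather than a data race.
function foo_atomic(x::Threads.Atomic{Int}, a)
    b = x[]            # sequentially consistent atomic load
    c = a + b
    b == x[] ? a : 5c
end

foo_atomic(Threads.Atomic{Int}(3), 5)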

However, most of us write very little code like that.
You must have a very deliberate purpose if you’re having multiple threads read and write to the same memory, and must control this very carefully (with locks or atomics).
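
For the lock-based side of that, a rough sketch (the function name locked_sum is made up for illustration) might look like:

using Base.Threads

# Hypothetical example: many threads update shared state, guarded by a lock.
function locked_sum(n)
    total = Ref(0)
    lk = ReentrantLock()
    @threads for i in 1:n
        lock(lk) do
            total[] += i   # only one thread touches the Ref at a time
        end
    end
    return total[]
end

locked_sum(1000)  # == sum(1:1000)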


This is a truism even without compilers: precise memory models of CPU architectures are complex and in flux.

Assembly / architectural machine code (e.g. “amd64 with the following extensions…”) is itself a high-level language on top of a proprietary, incompletely documented micro-architecture, which is itself not completely stable (e.g. micro-code updates, or the recent retroactive spec change that made aligned 16-byte vector ops atomic on Intel/AMD – these happened to be atomic on all extant implementations, and are now guaranteed to be atomic on all extant and future implementations).

But the general point is correct: The most common race-conditions are unconstrained UB in C++ and julia, and are somewhat constrained indeterminate behavior in e.g. JVM.

This is because julia chooses to emit non-atomic memory ops by default, instead of “unordered” ones.

Afaiu the main reason for that is C++: non-clang frontends are somewhat second-class citizens in llvm, and clang uses NotAtomic. Hence many optimizations needlessly fail with unordered, and it would be quite (unnecessarily) costly to use unordered instead of NotAtomic. :person_shrugging:
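
If you want to see that difference from the Julia side, here is a small sketch (struct and function names are made up); inspecting both functions with @code_llvm should show a plain load for the first and a “load atomic” with an explicit ordering for the second:

mutable struct PlainCell
    x::Int
end

mutable struct AtomicCell
    @atomic x::Int
end

read_plain(c::PlainCell)   = c.x                     # plain (NotAtomic) load
read_atomic(c::AtomicCell) = @atomic :monotonic c.x  # atomic load with ordering

# @code_llvm debuginfo=:none read_plain(PlainCell(1))
# @code_llvm debuginfo=:none read_atomic(AtomicCell(1))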


This is almost surely a misunderstanding or miscommunication.

Ah, I see. As my course was primarily focused on using C (not C++), atomics (as far as I know) weren’t covered; the focus was more on manually creating, using, and managing mutex locks.

So I guess it boils down to “unless you’re writing a potential data race, automatic compiler optimizations shouldn’t be a problem”?

For programming languages in general, compiler optimizations by definition are not supposed to be observable in the behavior of the program (except for performance etc.). So if there is a problem, it’s either a compiler bug or a user error; of the two, user errors are more prevalent.

It is very useful to have another category for “questionable optimizations”:

Optimizations based on undefined behavior that are arguably correct under the prevailing interpretation of the language standards by compiler authors (rules-as-written! user error!), while being arguably counter-intuitive and counter-productive under reasonable-ish interpretations of those standards by practitioners (man, I just want to code, not language-lawyer :frowning: ).

This is a quite contentious category. But I think that we should all be able to agree that it is, in fact, a sensible category to talk about, and to agree on what class of optimizations belongs in it, even if we may disagree about whether these optimizations are desirable and whether it is a sub-category of “user error” or “compiler error”.

The most egregious offender in that category is, in fact, C, and the second most egregious offender is C++. Since LLVM and GCC both primarily serve C/C++, we do inherit some of the same issues.

I recommend reading the variety of Linus Torvalds’s rants about that, for the side of the thesis that “compilers are unpragmatically strict in their language interpretation, and too willing to blame users”.

I fully concur with these rants, to the point of seeing C as something that vaguely resembles an awesome, cool language while actually being a catastrophic mess.

Arguably the language of the linux kernel is not “C”; it is a dialect, “linux kernel C”, with its own memory model. And arguably the language of LLVM is not C++; it’s a dialect, “LLVM C++”, with its own variant of inheritance.


You say “questionable optimizations”, but what you’re really complaining about are the semantics of the programming language, not the optimizations.

A good example of questionable semantics is the forward progress guarantee, which both Julia and C++ got rid of.
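
Concretely, a sketch of what that guarantee is about (my own example, not from the post): under a C++-style forward-progress assumption a compiler may treat a side-effect-free infinite loop as unreachable and delete it, whereas Julia, as far as I understand, keeps such a loop and the call simply never returns.

# Side-effect-free infinite loop; don't actually call this.
# Under a forward-progress assumption it is UB and may be optimized away;
# without one, it just spins forever.
function spin()
    while true
    end
end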

I mostly see this like:

  1. programmer doesn’t know about some part of the language, or misunderstands some part
  2. programmer introduces a bug as a result
  3. programmer complains about the compiler and/or language spec instead of just fixing the bug

Not a fan of his rants. Sometimes he seems simply misinformed or outdated. In the case of his complaints about C, I guess he’s in the right, in the sense that C isn’t a very good language, making it hard to do some things within C. But that doesn’t excuse the silly attacks on compiler authors.

Besides, Torvalds intensely dislikes C++, barring it from the kernel, while sticking to the rusty C. The worst thing about C++ is its standard library, which wouldn’t even apply to using C++ in the kernel.

That is a take I haven’t seen before.

We had a user at a supercomputer centre who fell victim to this. He had a habit of always compiling with the highest optimization level, despite warnings that this could do some dubious things. His program ran orders of magnitude slower than expected. The supercomputer was useless, he claimed. It turned out that his Fortran program tried to estimate the machine precision, and used that estimate to decide how many Chebyshev polynomials he needed in his basis. Higher precision meant more polynomials.

To estimate the precision he did something like this:

delta = 1.0
prec = 0
while 1 + delta > 1    # stops once 1 + delta rounds to 1
    delta = delta / 2
    prec = prec + 1
end

At the highest optimization level, the compiler was happy to replace while 1 + delta > 1 with while delta > 0. He ended up with 340 polynomials instead of 15, or something.
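
(Side note, a sketch of the Julia angle: the whole estimate is unnecessary there, since the machine epsilon is exposed directly.)

eps(Float64)                     # 2.220446049250313e-16, i.e. 2.0^-52
round(Int, -log2(eps(Float64)))  # 52 explicit mantissa bits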


You can view it that way for each specific thing. The other implicit complaint is of course about language culture:

There is a view that rules-as-written-and-interpreted, i.e. language semantics, are paramount and these are all user errors.

There is a different culture in e.g. the java sphere: That it is the obligation of compiler/runtime maintainers, as ecosystem stewards, to engage with common practice and to not turn “broken” programs and common practices into security vulnerabilities. That there is shared responsibility to find good solutions to empirically counter-intuitive semantics (i.e. misuse is wide-spread), instead of telling the users off.

Let me give a fun example from julia/llvm. One of my favorite x86 instructions is pmovmskb. The natural julia / llvmcall idiom for that is to bitcast an <N x i1> into an iN, and then truncate / zext it to i32. It’s often also what clang generates with -emit-llvm.
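
To make that concrete, here is a rough sketch of the idiom (my own code, not the author’s, using N = 16, one of the sizes that does lower correctly to pmovmskb on x86):

# Sketch only: compare 16 bytes against zero, bitcast the <16 x i1> mask to
# i16, then zext to i32. For this N, LLVM should pattern-match it to pmovmskb.
function movemask(x::NTuple{16,Core.VecElement{UInt8}})
    Base.llvmcall("""
        %m = icmp slt <16 x i8> %0, zeroinitializer
        %b = bitcast <16 x i1> %m to i16
        %r = zext i16 %b to i32
        ret i32 %r
        """, UInt32, Tuple{NTuple{16,Core.VecElement{UInt8}}}, x)
end

movemask(ntuple(i -> Core.VecElement(0x80), 16))  # 0x0000ffff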

Sounds reasonable, right?

Lol nope. This works for some specific blessed values of N, and otherwise llvm silently emits broken garbage code that produces incorrect results.

Whose bug is that? Let’s look at the spec:

The conversion is done as if the value had been stored to memory and read back as type ty2.

Yeah, i1 does not have a memory representation. As the llvm mailing list patiently explains, this is obviously broken UB code, not an llvm misbehavior.

And language specs are all like this! They are not written for people like me who lack the reading comprehension to immediately see that.

This is a “me” problem, and not necessarily a “spec” problem, but I hope you can empathize with the fact that people like me (and Torvalds) whine about it.
