Automatic Compiler Optimizations and Multithreading

I learned in my HPC class that for languages such as C/C++ and Fortran, invoking automatic compiler optimizations with the (gcc) -On flags can lead to unintended behavior when compiling multithreaded programs; something along the lines of

The compiler will automatically optimize out a piece of code that would be unnecessary or sub-optimal for a single-threaded program, but necessary for multithreaded programs.

I really haven’t seen anything on the usage of Julia’s automatic compiler optimizations, but I’m curious how Julia’s JIT (or PackageCompiler.jl’s AOT) compilation process handles multithreaded programs compared to the others. If I didn’t know any better, I’d just assume it’s the same, but since Julia was designed from the ground up with parallelism in mind, I assume it’s better tailored? Or are these things handled by the underlying LLVM and therefore less customizable?

The only resources I could find are the dev/JIT docs and this Discourse post from 2020.

Maybe your course is saying that you need to use things like the volatile keyword in C/C++, which forces a variable to be loaded from memory every time it is read.
In all these languages, memory reads can often be optimized away. This is extremely desirable.

But, it could potentially break multithreaded code, e.g., you may have

julia> function foo(x, a)
          b = x[]
          c = a + b
          b == x[] ? a : 5c
       end
foo (generic function with 2 methods)

julia> @code_llvm debuginfo=:none foo(Ref(3), 5)
define i64 @julia_foo_426({}* noundef nonnull align 8 dereferenceable(8) %0, i64 signext %1) #0 {
top:
  ret i64 %1
}

This is a pretty fake example, but the idea is: the code looks like it considers the fact that x[] could be written to by another thread (and thus that the answer may change), yet the code gets optimized in a way that makes that impossible. The whole function turns into return a, because the compiler assumes b == x[] is always true.

If you’re writing code like that, you need to use things like Threads.Atomic or Base.@atomic. Check the docstring (? Base.@atomic in the REPL) for more info.
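
As a minimal sketch (the name foo_atomic and the use of Threads.Atomic here are my own illustration, not from this thread), the same shape as foo above but with atomic loads could look like this:

# Illustrative only: because `x` is a Threads.Atomic, concurrent writes from
# other threads are well-defined behavior rather than a data race.
function foo_atomic(x::Threads.Atomic{Int}, a)
    b = x[]            # sequentially consistent atomic load
    c = a + b
    b == x[] ? a : 5c
end

foo_atomic(Threads.Atomic{Int}(3), 5)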

However, most of us write very little code like that.
You must have a very deliberate purpose if you’re having multiple threads read and write to the same memory, and must control this very carefully (with locks or atomics).
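
For the lock-based side of that, a rough sketch (the function name locked_sum is made up for illustration) might look like:

using Base.Threads

# Hypothetical example: many threads update shared state, guarded by a lock.
function locked_sum(n)
    total = Ref(0)
    lk = ReentrantLock()
    @threads for i in 1:n
        lock(lk) do
            total[] += i   # only one thread touches the Ref at a time
        end
    end
    return total[]
end

locked_sum(1000)  # == sum(1:1000)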


This is a truism even without compilers: precise memory models of CPU architectures are complex and in flux.

Assembly / architectural machine code (e.g. “amd64 with the following extensions…”) is itself a high-level language on top of a proprietary, incompletely documented micro-architecture, which is itself not completely stable (e.g. micro-code updates, or the recent retroactive spec change that made aligned 16-byte vector ops atomic on Intel/AMD – these happened to be atomic on all extant implementations, and are now guaranteed to be atomic on all extant and future implementations).

But the general point is correct: The most common race-conditions are unconstrained UB in C++ and julia, and are somewhat constrained indeterminate behavior in e.g. JVM.

This is because julia chooses to emit non-atomic memory ops by default, instead of “unordered” ones.

Afaiu the main reason for that is C++: non-clang frontends are somewhat second-class citizens in llvm, and clang uses NotAtomic. Hence many optimizations needlessly fail with unordered, and it would be quite (unnecessarily) costly to use unordered instead of NotAtomic. :person_shrugging:
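
If you want to see that difference from the Julia side, here is a small sketch (struct and function names are made up); inspecting both functions with @code_llvm should show a plain load for the first and a “load atomic” with an explicit ordering for the second:

mutable struct PlainCell
    x::Int
end

mutable struct AtomicCell
    @atomic x::Int
end

read_plain(c::PlainCell)   = c.x                     # plain (NotAtomic) load
read_atomic(c::AtomicCell) = @atomic :monotonic c.x  # atomic load with ordering

# @code_llvm debuginfo=:none read_plain(PlainCell(1))
# @code_llvm debuginfo=:none read_atomic(AtomicCell(1))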


This is almost surely a misunderstanding or miscommunication.

Ah, I see. As my course was primarily focused on using C (not C++), atomics (as far as I know) weren’t covered; the focus was more on manually creating, using, and managing mutex locks.

So I guess it boils down to “unless you’re writing a potential data race, automatic compiler optimizations shouldn’t be a problem”?

For programming languages in general, compiler optimizations by definition are not supposed to be observable in the behavior of the program (except for performance etc.). So if there is a problem, it’s either a compiler bug or a user error; of the two, user errors are more prevalent.

It is very useful to have another category for “questionable optimizations”:

Optimizations based on undefined behavior that are arguably correct under the prevailing interpretation of the language standards by compiler authors (rules-as-written! user error!), while being arguably counter-intuitive and counter-productive under reasonable-ish interpretations of those standards by practitioners (man, I just want to code, not language-lawyer :frowning: ).

This is a quite contentious category. But I think that we should all be able to agree that it is, in fact, a sensible category to talk about, and to agree on what class of optimizations belongs in it, even if we may disagree about whether these optimizations are desirable and whether it is a sub-category of “user error” or “compiler error”.

The most egregious offender in that category is, in fact, C, and the second most egregious offender is C++. Since LLVM and GCC both primarily serve C/C++, we do inherit some of the same issues.

I recommend reading the variety of Linus Torvalds’s rants about that, for the side of the thesis that “compilers are unpragmatically strict in their language interpretation, and too willing to blame users”.

I fully concur with these rants, to the point of seeing C as something that vaguely resembles an awesome, cool language while actually being a catastrophic mess.

Arguably the language of the linux kernel is not “C”; it is a dialect, “linux kernel C”, with its own memory model. And arguably the language of LLVM is not C++; it’s a dialect, “LLVM C++”, with its own variant of inheritance.


You say “questionable optimizations”, but what you’re really complaining about are the semantics of the programming language, not the optimizations.

A good example of questionable semantics is the forward progress guarantee, which both Julia and C++ got rid of.
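
Concretely, a sketch of what that guarantee is about (my own example, not from the post): under a C++-style forward-progress assumption a compiler may treat a side-effect-free infinite loop as unreachable and delete it, whereas Julia, as far as I understand, keeps such a loop and the call simply never returns.

# Side-effect-free infinite loop; don't actually call this.
# Under a forward-progress assumption it is UB and may be optimized away;
# without one, it just spins forever.
function spin()
    while true
    end
end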

I mostly see this like:

  1. programmer doesn’t know about some part of the language, or misunderstands some part
  2. programmer introduces a bug as a result
  3. programmer complains about the compiler and/or language spec instead of just fixing the bug

Not a fan of his rants. Sometimes he seems simply misinformed or outdated. In the case of his complaints about C, I guess he’s in the right, in the sense that C isn’t a very good language, making it hard to do some things within C. But that doesn’t excuse the silly attacks on compiler authors.

Besides, Torvalds intensely dislikes C++, barring it from the kernel, while sticking to the rusty C. The worst thing about C++ is its standard library, which wouldn’t even apply to using C++ in the kernel.

That is a take I haven’t seen before.

We had a user at a supercomputer centre who fell victim to this. He had a habit of always compiling with the highest optimization level, despite warnings that this could do some dubious things. His program ran orders of magnitude slower than expected. The supercomputer was useless, he claimed. It turned out that his Fortran program tried to estimate the machine precision, and used that estimate to decide how many Chebyshev polynomials he needed in his basis. Higher precision meant more polynomials.

To estimate the precision he did something like this:

delta = 1.0
prec = 0
while 1 + delta > 1    # stops once 1 + delta rounds to 1
    delta = delta / 2
    prec = prec + 1
end

At the highest optimization level, the compiler was happy to replace while 1 + delta > 1 with while delta > 0. He ended up with 340 polynomials instead of 15, or something.
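
(Side note, a sketch of the Julia angle: the whole estimate is unnecessary there, since the machine epsilon is exposed directly.)

eps(Float64)                     # 2.220446049250313e-16, i.e. 2.0^-52
round(Int, -log2(eps(Float64)))  # 52 explicit mantissa bits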


You can view it that way for each specific thing. The other implicit complaint is of course about language culture:

There is a view that rules-as-written-and-interpreted, i.e. language semantics, are paramount and these are all user errors.

There is a different culture in e.g. the java sphere: That it is the obligation of compiler/runtime maintainers, as ecosystem stewards, to engage with common practice and to not turn “broken” programs and common practices into security vulnerabilities. That there is shared responsibility to find good solutions to empirically counter-intuitive semantics (i.e. misuse is wide-spread), instead of telling the users off.

Let me give a fun example from julia/llvm. One of my favorite x86 instructions is pmovmskb. The natural julia / llvmcall idiom for that is to bitcast an <N x i1> into an iN, and then truncate / zext it to i32. It’s often also what clang generates with -emit-llvm.
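
To make that concrete, here is a rough sketch of the idiom (my own code, not the author’s, using N = 16, one of the sizes that does lower correctly to pmovmskb on x86):

# Sketch only: compare 16 bytes against zero, bitcast the <16 x i1> mask to
# i16, then zext to i32. For this N, LLVM should pattern-match it to pmovmskb.
function movemask(x::NTuple{16,Core.VecElement{UInt8}})
    Base.llvmcall("""
        %m = icmp slt <16 x i8> %0, zeroinitializer
        %b = bitcast <16 x i1> %m to i16
        %r = zext i16 %b to i32
        ret i32 %r
        """, UInt32, Tuple{NTuple{16,Core.VecElement{UInt8}}}, x)
end

movemask(ntuple(i -> Core.VecElement(0x80), 16))  # 0x0000ffff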

Sounds reasonable, right?

Lol nope. This works for some specific blessed values of N, and otherwise llvm silently emits broken garbage code that produces incorrect results.

Whose bug is that? Let’s look at the spec:

The conversion is done as if the value had been stored to memory and read back as type ty2.

Yeah, i1 does not have a memory representation. As the llvm mailing list patiently explains, this is obviously broken UB code, not an llvm misbehavior.

And language specs are all like this! They are not written for people like me who lack the reading comprehension to immediately see that.

This is a “me” problem, and not necessarily a “spec” problem, but I hope you can empathize with the fact that people like me (and Torvalds) whine about it.
