How I learned to stop worrying about being fastest and love microbenchmarks

I love microbenchmarks.

Once in a while on this forum, some Julian discovers a GitHub repo with a microbenchmark that compares the speed of different programming languages. Typically, Julia does poorly compared to C or Rust. The horror!

Fortunately, Julians from all over the world quickly pour in to right the wrong, and make a PR to improve Julia’s timings. After which typically follows a long discussion in the thread, where people try to squeeze the last microseconds out of the benchmark using advanced techniques like LoopVectorization, unsafe code, or what have you.

And I get that, I do! Getting Julia on the number one spot in the benchmarks is fun because it feels like winning. It’s also a fun puzzle to figure out how to squeeze out the microseconds, and on top of that, you learn more about Julia and low-level computing when you study how to push your code to the limits.

But none of those things is why I love microbenchmarks: I love them because they pull the wool from your eyes and reveal just how fast your favorite package or programming language really is, after you’re done with all the excuses.

For sure, simply looking at the numbers in the microbenchmark isn’t all that informative. It’s not the numbers themselves that matter, it’s what the numbers tell you when you pay close attention to what actually takes the time. To show you what I mean, let me give some examples:

The Seq benchmark

In 2019, a paper was published announcing the Seq programming language - a high-performance language intended for bioinformatics. The paper included microbenchmarks that showed Seq handily beating BioJulia. Oh no! As authors of BioJulia, Sabrina Ward and I had to look into this, and indeed, we were able to write a faster Julia implementation of the microbenchmark and beat Seq. Hooray! Job done?

No! The Seq authors were fundamentally right that idiomatic Seq beat idiomatic BioJulia. The fact that it was possible for us to beat Seq using Julia was beside the point. Fact is, we believed that BioJulia was fast, and someone showed us that it wasn’t.

We eventually subjected BioSequences.jl to a performance review after the challenge, which we wouldn’t have done if it wasn’t for the Seq microbenchmark. You can read more about this in a BioJulia blogpost

The biofast benchmark

In 2020, famous bioinformatician Heng Li wrote a blog post about fast, high level languages, and included a microbenchmark where Julia featured, and, as usual, was beaten by other compiled languages.

One of the benchmarks consists of parsing a 1.1 GB FASTQ file, where FASTX.jl’s FASTQ parser got absolutely destroyed by Rust’s needletail. Unbelievably, Rust manages to parse the file in 600 ms(!), whereas FASTX takes 10.6 seconds (2.35s when disregarding compilation). That’s nearly 20 times longer!

Again, I was shocked - I thought FASTX had a really fast parser - I thought we would be hard to beat. How was it possible for needletail to be four times faster, even when disregarding Julia’s compilation time?
That rabbit hole took me deep into the details of how to write fast parsers. I eventually ended up being the maintainer of FASTX.jl and Automa.jl, and rewriting large parts of those packages.

Here too, the benchmark gave me a rude awakening: Two years later, after excruciating optimization and months of work in my free time, FASTX still takes 2.2 seconds to parse the file, although latency improvements in Julia mean the total time has been reduced from 10.6 to 3.2 seconds.

Digging deep into the microbenchmarks revealed that it was possible to close a large part of the gap between Rust and Julia by writing a kernel in x86 assembly, but that Julia unfortunately does not provide a reliable way to write platform-specific code, whereas Rust does.

There are more examples of these kinds of lessons, of course.

One time, I found my Julia implementation got trounced by Rust in a simple Advent of Code microbenchmark, because SubArray remarkably didn’t have an efficient iterate method.
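This is the kind of gap a two-line benchmark surfaces. The sketch below is not the Advent of Code solution, just a made-up comparison of summing a view via the iteration protocol versus via explicit indexing; if iterate on SubArray is slow, the two timings diverge.

using BenchmarkTools

x = rand(10^6)
v = view(x, 2:lastindex(x))   # a SubArray

# Sums via the iteration protocol (iterate), like a plain for-loop over v
function itersum(a)
    s = 0.0
    for el in a
        s += el
    end
    return s
end

# Sums via explicit indexing, for comparison
function indexsum(a)
    s = 0.0
    for i in eachindex(a)
        s += a[i]
    end
    return s
end

@btime itersum($v)
@btime indexsum($v)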
Another time, someone (I can’t remember who) found a microbenchmark where Julia decompressed gzipped files slower than other languages. It turned out that Zlib_jll was compiled with the wrong flags.

You get the idea.

Ask not what you can do to make Julia look good, ask what Julia can do for you

Regrettably, when I read threads on microbenchmarks on this forum, I mostly just see people trying to get Julia to number one. Rarely do people ask: Why isn’t Julia already number one in the benchmark? Why is it that whenever outsiders to Julia write a microbenchmark, Julia usually ends up near the middle or the bottom?

Next time you see Julia perform not-so-well in a microbenchmark, I encourage you to compare the Julia implementation to one of the fastest in the benchmark. If the implementations are not significantly different, see if you can track down where the inefficiency of Julia comes from, and if this can be addressed in Julia itself.
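A minimal sketch of what that investigation can look like in practice - the function run_benchmark and the variable data below are just stand-ins for whatever the benchmark actually runs:

using BenchmarkTools, Profile, InteractiveUtils

@btime run_benchmark($data)                # reproduce the reported timing
@code_warntype run_benchmark(data)         # check for type instabilities
Profile.clear()
@profile for _ in 1:100
    run_benchmark(data)
end
Profile.print(format = :flat, sortedby = :count)   # see where the time actually goes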

It’s the same optimisation challenge, but the stakes are higher. Now, you don’t just get to make a PR to put Julia on the top of some list in a guy’s repo, you get to actually make Julia, the language, faster. Who knows - maybe your efforts will be what makes some functionality in Base run lightning fast, to the benefit of thousands of other Julia users.

96 Likes

:100: . This is why I bow out of microbenchmarks. I’ll teach people how to optimize code, but going through every microbenchmark isn’t a good use of anybody’s time.

Real programmers use libraries. The purpose of benchmarks is to make libraries better. Whether it’s Base Julia, the standard library, or some more obscure package, the purpose of the benchmarks should be to understand algorithms and make things better.

People far too often focus on raw code performance. But raw code performance isn’t even what makes algorithms fast. The formula is more complex, but not that complex:

Fast Implementations + Fast Algorithms = Fast Code

People often ask me how SciML solvers pretty routinely outperform C and Fortran codes. A large part of that is just better algorithms: better time stepping adaptivity methods, better RK tableaus, etc. Someone can make a Fortran version of the solver that matches performance. Obviously someone can, I know how to do it. Things compile to the same code at the end of the day, it’s not surprising. But what matters is that it’s actually done in Julia. It’s not just a single explicit Runge Kutta method to show off in Fortran that “yeah we can match those Julia guys!”, it’s a whole cohesive library that mixes the newest algorithms, ecosystem integrations (GPUs, sparse linear solvers, plotting, etc.)… with the performance.

Raw code performance is one piece of the puzzle. It’s a necessary part of really good libraries. We need it to scale, and bad performance will cause a library to never scale no matter the features around it. But the reason why Julia is good for building libraries is not because it can hit raw C performance. It’s because it can get to 90% of C performance with 10% of the work, and therefore if you allocate your time appropriately you will get a faster algorithm in Julia because it’s 90% efficient but a 10x better algorithm.

Getting the completely optimal implementation can take 125% of the effort it would take in C, but getting to 90% takes 10% of the time. Use that Pareto curve to your advantage and focus on the end goal: solving problems.

44 Likes

Doesn’t inline assembly in LLVM.jl let us do that?

No. That lets us write code that crashes when run on a machine that doesn’t have those assembly instructions. And HostCPUFeatures.jl only works as long as the compiled code is allowed to use all features of the host CPU, which is not true in some HPC contexts, where code may be compiled on one CPU but executed on another.

5 Likes

Was assembly code being used in the Rust implementation, and was that a reason for the performance difference?

1 Like

The Rust implementation in needletail uses a different algorithm than FASTX.
Needletail calls memchr to identify a set of key delimiters in the file buffer, and then returns a record which contains views into this buffer at positions determined by memchr.
The Rust implementation of memchr is highly optimised, and indeed uses platform-specific assembly code.
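A toy version of that idea in Julia might look like the sketch below - this is not needletail’s actual code, just an illustration of the scan-for-delimiters-and-return-views approach, using newlines as the delimiter:

# Find newline delimiters in a byte buffer and return views between them,
# without inspecting any of the bytes in between.
function toy_records(buf::Vector{UInt8})
    records = typeof(view(buf, 1:0))[]     # a vector of views into buf
    start = 1
    while (stop = findnext(==(UInt8('\n')), buf, start)) !== nothing
        push!(records, view(buf, start:stop-1))
        start = stop + 1
    end
    return records
end

toy_records(Vector{UInt8}("@read1\nACGT\n+\nIIII\n"))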

We don’t use the same algorithm in Julia, for two reasons:
First, we can’t return views into the file buffer - partly because the buffer of IOStream is not even available in Julia, and partly because it would be totally reckless to return a view into a buffer that will be mutated at the next IO operation. This is safe in Rust, because mutating the buffer while the view exists is prevented by the borrow checker.
Second, I just don’t like the algorithm. It’s not rigorous enough - for BioJulia, I want the parser to validate every single byte, not just skip to a few delimiting bytes using memchr.

FASTX instead compiles a state machine (using Automa) that examines each byte to find the delimiters, then copies the bytes into the record (after first having copied the bytes from the IOStream into the buffer that Automa reads from). However, there is nothing algorithmically preventing this state machine from emitting highly optimised code approaching that of memchr. Indeed, good regex implementations work exactly that way. So one could expect FASTX to reach performance close to, if not quite matching, needletail.

To reach this goal, I wrote ScanByte.jl, a memchr implementation in Julia, which uses x86 intrinsics. I then rewrote parts of Automa to enable it to emit this memchr code, and used this in FASTX.jl.
However, I was only able to reach around half the speed of Libc’s memchr. And I eventually had to revert all my memchr optimizations in FASTX after a user reported miscompilations that arose when not compiling to the native target. I then concluded that there is no real way in Julia of checking the available instructions for the target.
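For reference, the baseline I was comparing against is Libc’s memchr, which you can call directly via ccall - a sketch, not ScanByte.jl’s code:

# Find the first occurrence of byte in buf at or after index from, using the
# C library's memchr. Returns the index, or nothing if not found.
function libc_memchr(buf::Vector{UInt8}, byte::UInt8, from::Int = 1)
    n = length(buf) - from + 1
    n <= 0 && return nothing
    p = GC.@preserve buf ccall(:memchr, Ptr{UInt8}, (Ptr{UInt8}, Cint, Csize_t),
                               pointer(buf, from), byte, n)
    return p == C_NULL ? nothing : from + Int(p - pointer(buf, from))
end

libc_memchr(Vector{UInt8}("hello\nworld"), UInt8('\n'))   # returns 6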

19 Likes

I find that interesting. In a way, with any programming language that encodes sufficient information in its types, I guess you should be able to reach similar speeds given a sufficiently “clever” compiler. But in practice that’s almost never the case. So people, like you in ScanByte.jl for example, have to drop down to a closer-to-the-metal level at some point, to help out by specifying the best SIMD instructions etc. And I don’t think that level is ever really “idiomatic” because it’s not abstract enough anymore.

One question then is, given an “idiomatic” and abstract top-level interface to a library, whether people can still make that work for their own special use cases with high performance. And I would say Julia is pretty good at that; the whole SciML idea is basically that: putting user-defined functions into predefined solvers and compiling that to really fast machine code. So it might be a bit less common in Julia to find highly optimized, “fully vertically integrated” libraries, because people are so used to injecting their own code deep into whatever a library is doing. Which also makes it a bit harder to optimize, maybe?

2 Likes

Thanks for another insightful post, Jakob!

It’s usually possible to get good performance out of Julia in principle. In practice, Julia regularly gets its ass kicked by C, C++, or Rust code, typically because Julia makes performance-oriented code a huge PITA. In Rust, the compiler and IDE tooling make it easy to write high-performance code–slow code either fails to compile or throws an error. In Julia, there’s no warning unless you explicitly opt-in to JET. (Even then, I usually find its annotations less-than-helpful, since it usually marks dynamic dispatches rather than the origins of type instabilities, and it’s often hard to tell what caused the problem.)
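For readers who haven’t tried it, the opt-in check being referred to looks roughly like this - the toy function is made up, and @report_opt is the JET mode that hunts for runtime dispatch:

using JET

# Deliberately unstable toy: each element is Any, so the += is a runtime dispatch
function sum_any(xs::Vector{Any})
    s = 0.0
    for x in xs
        s += x
    end
    return s
end

@report_opt sum_any(Any[1, 2.0, 3])   # reports the dynamic dispatch inside the loop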

I’m starting to think that Julia’s dynamic-by-default nature is what’s held it back for so long. In the entire niche of “Fast, easy-to-use compiled languages,” the only dynamic language I’m aware of is Julia. Go, Kotlin, Swift, Elm, and Nim all use static inferred types, which cover 95% of the ease-of-use (arguably 125%, since static typing also makes it easier to write bug-free code) without the performance penalty of accidental dynamic dispatches.

I don’t think there’s anything wrong with having optional dynamic dispatch, but the way it’s handed out as an easy-to-miss footgun is the #1 complaint I hear from experienced Julia developers, and IME it’s also something new users used to MyPy and TypeScript don’t enjoy having to put up with. Possibly the most eye-opening experience I had was a few days ago, when an ML developer I was talking to said that they were switching from Julia to Rust, because trying to write code in Julia was “too complicated.” It’s not that the language itself was complicated–it’s that the complexity of Rust is managed very well by the compiler and IDE, which together make it clear why your code is misbehaving, whenever it is.

8 Likes

Why are you using intrinsics for pmovmskb in ScanByte, like e.g. https://github.com/jakobnissen/ScanByte.jl/blob/e8b2c7ae80477f1113e1668195a92d99cdf34b12/src/codegen.jl#L131 ?

Afaiu the “natural” idiom looks like

%trunc = trunc <64 x i8> %0 to <64 x i1>
%cast = bitcast <64 x i1> %trunc to i64
ret i64 %cast

Or am I wrong on that?

2 Likes

You’re right! I made a PR. Thanks for noticing this.
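For anyone curious, a minimal sketch of that idiom wrapped in Base.llvmcall - assuming a 32-byte vector whose lanes are 0x00 or 0xff, as comparison results are - might look like:

const V32 = NTuple{32, VecElement{UInt8}}

# Collect the low bit of every byte into an integer bitmask; on a 0x00/0xff
# comparison mask this is equivalent to vpmovmskb.
@inline movemask(v::V32) = Base.llvmcall("""
    %t = trunc <32 x i8> %0 to <32 x i1>
    %r = bitcast <32 x i1> %t to i32
    ret i32 %r
    """, UInt32, Tuple{V32}, v)

movemask(ntuple(i -> VecElement(0xff), 32))   # == 0xffffffff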

3 Likes

More generally, I think you could use just “portable” (lol) llvm vectorized code, without any intrinsics.

That way, everything should automatically compile correctly regardless of uarch (esp. avx512, avx2, sse and arm-neon). Everybody whines about the missing pmovmskb on arm, but the compiler-emitted workarounds are supposedly surprisingly non-terrible, performance-wise.

Your current code is avx2 targeted? Try out using 64 wide vectors and let the compiler worry about that. If there are no register spills that’s probably same perf as 32 wide vectors on avx2 (or even faster because more ilp) and much faster on avx512. (but benchmark / consult @code_native before – you really cannot afford register spills)

edit: My trick for discovering llvm idioms is immintrin.h + godbolt. Write a small C function using the intrinsic, compile to llvm bitcode, and read the result.

5 Likes

I don’t think it’s possible to use the vpshufb-based algorithm that I use with only LLVM code. But I’m not deep into LLVM, so I’m not sure.

1 Like

Support for shuffles in llvm is really really good. I am willing to bet that there is no problem expressing avx2 shuffles in pure llvm. Consider using SIMD.jl as a dependency and maybe ask @schnetter for help.

The bitcast idiom for pmovmskb is a little sketchy: It works, but afaiu there is no consensus on the LLVM mailing lists whether that is guaranteed by the spec or coincidental (“the result is the same as memcpy” is somewhat underspecced for types like i1 that have no in-memory representation). But clang/llvm use that idiom internally, so it can’t be that bad :wink:

1 Like

Thanks for your post, I learn a lot from this forum.

I faced the problem of “how can I branch my code depending on whether the CPU that executes it has AVX2 or AVX512?”. I ended up not doing it (I don’t have enough expertise to do this properly), but I would love to learn how to do it. I’ve seen your code

# Discover if the system CPU has SSSE3 or AVX2 instruction sets
let
    llvmpath = if VERSION ≥ v"1.6.0-DEV.1429"
        Base.libllvm_path()
    else
        only(filter(lib->occursin(r"LLVM\b", basename(lib)), Libdl.dllist()))
    end
    libllvm = Libdl.dlopen(llvmpath)
    gethostcpufeatures = Libdl.dlsym(libllvm, :LLVMGetHostCPUFeatures)
    features_cstring = ccall(gethostcpufeatures, Cstring, ())
    features = split(unsafe_string(features_cstring), ',')
    Libc.free(features_cstring)

    # Need both SSE2 and SSSE3 to process v128 vectors.
    @eval const SSSE3 = $(any(isequal("+ssse3"), features) & any(isequal("+sse2"), features))
    @eval const AVX2 = $(any(isequal("+avx2"), features))

    # Prefer 32-byte vectors because larger vectors = higher speed
    @eval const DEFVEC = if get(ENV, "JULIA_CPU_TARGET", "native") != "native"
        nothing
    elseif AVX2
        v256
    elseif SSSE3
        v128
    else
        nothing
    end
end

would it make sense to have this sort of logic in a package, something like CpuInstructions.jl, that developers can use to check the availability of certain instructions?

What source/blog/book (or code in a package) would you recommend to learn how to leverage Base.llvmcall or ccall to do low level code optimizations?

In my case I wanted to port this package to Julia: GitHub - matsui528/rii: Fast and memory-efficient ANN with a subset-search functionality, but I had issues trying to figure out how to port certain parts, such as https://github.com/matsui528/rii/blob/main/src/distance.h, precisely because of low level code that is optimized with particular instructions.

We already have HostCPUFeatures.jl that does this. The problem is that users may choose to compile code to a target that is not the host CPU.
Really, I think Julia, the language, should expose the target to the user.
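The closest thing available today is checking whether a non-native target was requested at all, along the lines of the sketch below - note that JLOptions().cpu_target is an internal field, so this is a workaround rather than a supported API:

# True if no explicit, non-native --cpu-target / JULIA_CPU_TARGET was requested,
# i.e. when host-feature detection is at least plausible.
function native_target_requested()
    opt = Base.JLOptions().cpu_target              # Ptr{UInt8}; C_NULL if unset
    cli = opt == C_NULL ? "native" : unsafe_string(opt)
    env = get(ENV, "JULIA_CPU_TARGET", "native")
    return cli in ("", "native") && env in ("", "native")
end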

9 Likes

Could you share the link to this package? Did you mean HostCPUFeatures.jl ?

2 Likes

I have been thinking about this, but it’s not easy to add that to the language.
I do think we should have something like 2045-target-feature - The Rust RFC Book, and also use that to get away from

if Sys.iswindows()
   f() = ...
else
   f() = ...
end

Which may allow us to get to cross-compilation. But that’s a big lift.
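In the meantime, one way to at least keep both method bodies in the precompiled image is to branch inside a function instead of at the top level - just a sketch of the idea, not a solution for cross-compilation:

_f_windows() = "Windows-specific implementation"
_f_generic() = "generic implementation"

# Both methods are always defined; the branch is resolved per call
# (and typically constant-folded, since Sys.iswindows() is a constant).
f() = Sys.iswindows() ? _f_windows() : _f_generic()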

8 Likes

Yup. Hard agree with OP and Chris. Another reason I quite like reading benchmark threads and PRs is because I learn a ton. Having very experienced programmers bending the edges of what’s possible in Julia is a great way to discover new tricks.

I would not program like that in my use cases, as I usually need fast prototyping, maintainability, robustness, readability, future-proofing, …, more than pure speed. But I like to know what can be done eventually.

And, a lot of times, knowing that I’m not too far from extremely optimized code is more than enough. If Julia was 100x slower than Rust, I would be worried. If it is 2x, and readable as the comic book version of a math textbook, I’m happy.

8 Likes

Totally get and agree with this. It requires users to be experts in the language, and comfortable making changes and opening PRs to Base and the standard libraries in order to do this, though. I’ve found a number of speed issues within those modules, and raised issues on GitHub, but I’ve looked into what would be required to make the improvements myself, and from both a process and understanding perspective, it’s beyond me at present.

Again, I totally get and agree with this. It requires users to be experts in the algorithms, and have the ability to improve those algorithms, though.

These are both high bars, and most users won’t be able to meet them. That’s fine - good to have some high bars. But in order to attract and foster the level of users required to meet these bars, Julia (by which I mean the language, development environment and community) needs to be something they want to use and engage in.

In short, I see 3 things that together would/do make Julia a compelling language for people to adopt:

  • A language & ecosystem that is a joy to use and easy to contribute to
  • Libraries that offer state of the art performance
  • User code that generally runs fast

Not sure about the order of priority. Perhaps there are more, too.

This wasn’t what the topic was about, specifically, but I feel that both @jakobnissen and @ChrisRackauckas’ posts hinted at what would help make Julia better, and hence more appealing.

4 Likes

I think quite the opposite - the developers of the ecosystem need to be experts, not the users. One of the benefits of being able to abstract away the details and write generic code is that as a user, I can plug into the same system and get the latest solver without really understanding the implementation.

And at a different level, experts in the algorithms don’t need to be experts in the language. Since the underlying ecosystem is robust, a mathy person that has a great new idea doesn’t need to understand all of the nuances of writing performant code.

15 Likes