Due to how modern CPUs and memory work, branching (e.g. using an if statement) can carry a high performance cost and cause cache misses. I understand that at a basic level, but I'm definitely not an expert in HPC.
More specifically, I don't know what the cost of branching in Julia is, or when we should even start to bother about such things. I quickly checked the Performance Tips and didn't find relevant information.
The cost of branching in Julia is the same as in languages like C++; as the articles mention, you can increase speed by not branching. However, this is an optimization that will generally save on the order of nanoseconds, so it is probably one of the last things you would do to improve performance.
When you start to talk about branchless performance you are counting CPU cycles: an L3 cache miss penalty, for example, is 50–70 CPU cycles, or around 10 nanoseconds on modern CPUs. Before arriving at this level, you should already have thought about avoiding kernel space as much as possible by making the fewest possible syscall()s (each costing from 20 to 300 nanoseconds on average). These are hardware-level/OS-specific details that are beyond a programming language's scope. Julia's performance tips will take you to about 80% of the absolute performance; beyond that you need to tinker with low-level stuff, at which point the code is not that different from C-like code.
Edit: Also to add, when you are repeatedly fetching data from L1/L2 cache it is usually worth considering going branchless; if you are fetching data from RAM, your branchful code will be faster most of the time.
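To make "branchless" concrete, here is a minimal toy sketch (my own example, with made-up function names): the same reduction written with a branch and with `ifelse`, which evaluates both arms eagerly and selects one, letting LLVM emit a conditional move (and often vectorize) instead of a jump.

```julia
# Sum of the positive entries, written with an explicit branch.
function sum_pos_branchy(xs)
    s = zero(eltype(xs))
    for x in xs
        if x > 0
            s += x
        end
    end
    return s
end

# Same kernel, branchless: `ifelse` always evaluates both arguments
# and selects one, so the loop body has no conditional jump.
function sum_pos_branchless(xs)
    s = zero(eltype(xs))
    for x in xs
        s += ifelse(x > 0, x, zero(x))
    end
    return s
end
```

Both return the same result; whether the second is actually faster depends on the data and should be measured, not assumed.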
Yes, branches in Julia are a problem for performance (as they are in all high-performance programming languages). You can read more about how Julia and hardware interact in an old, but still relevant, blog post of mine.
However, note that, like other compiled languages, Julia’s compiler will often remove branches where possible. For example, consider the step function of the Collatz Conjecture:
f(x) = isodd(x) ? 3x + 1 : x ÷ 2
Here, Julia’s LLVM backend will compile the if/else away to a cmove (conditional move) instruction, producing:
push rbp
mov rcx, rdi
lea rax, [rdi + 2*rdi + 1]
mov rbp, rsp
shr rcx
test dil, 1
cmove rax, rcx
pop rbp
ret
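You can check this yourself with `@code_native` from the InteractiveUtils standard library (a sketch; the exact assembly varies by CPU, OS, and Julia version):

```julia
using InteractiveUtils  # provides @code_native (preloaded in the REPL)

f(x) = isodd(x) ? 3x + 1 : x ÷ 2

# Prints the native assembly for f(::Int); on x86-64 you should see a
# cmov rather than a conditional jump.
@code_native f(10)
```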
Also, I don’t think branches themselves produce cache misses. They can produce branch mispredictions, which are also a source of CPU latency, but those cost much less than L3 cache misses.
Also also, modern CPU branch predictors are really, really good, and I’ve often found that removing a branch in a tight loop doesn’t improve timings markedly, probably because it’s predicted with almost 100% accuracy.
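A sketch of the classic way to see this (the kernel and data here are my own toy example): run the same branchy loop over shuffled vs. sorted data. Sorted input makes the branch direction nearly perfectly predictable, so any timing gap you observe between the two is down to the predictor.

```julia
# Count entries above a threshold with an explicit branch.
function count_big(v, t)
    c = 0
    for x in v
        if x > t
            c += 1
        end
    end
    return c
end

data = rand(1:100, 1_000_000)   # branch direction is random
sorted_data = sort(data)        # branch direction is almost perfectly predictable

# Compare, e.g. with BenchmarkTools:
#   @btime count_big($data, 50)
#   @btime count_big($sorted_data, 50)
# On modern predictors the gap is often smaller than you'd expect.
```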
Cache misses can occur when the CPU mispredicts which memory to fetch. It won’t do anything that writes back to memory until it knows which branch is actually taken, but it will speculatively fetch memory and execute instructions, then throw out the work if a predicted branch turns out to have been wrong. Those speculative fetches can cause cache misses when the prediction is wrong.
And I’m not sure whether this applies to Windows, but under Linux you can pass isolcpus as a kernel parameter to keep the scheduler from placing tasks on certain cores unless explicitly asked. You could then manually pin julia onto the excluded cores.
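For example (a sketch; the core numbers and script name are placeholders), after adding `isolcpus=2,3` to the kernel command line, you can pin a Julia process to the isolated cores with taskset:

```shell
# In the bootloader's kernel command line (reserves cores 2 and 3):
#   isolcpus=2,3
# Then pin Julia onto the isolated cores:
taskset -c 2,3 julia --threads=2 myscript.jl
```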
Branchless code can be faster, but often isn’t faster at all, because branch predictors can usually predict which way you’re gonna branch anyway. You can measure branch misses with perf stat; it’s good to measure them before trying to remove branches.
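Concretely, something like this (the script name is a placeholder; requires Linux perf with hardware counters available):

```shell
# Count branches and branch misses for a whole Julia run.
# A miss rate well under 1% usually means going branchless won't help much.
perf stat -e branches,branch-misses julia myscript.jl
```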