Are branches in Julia a problem for performance?

Due to how modern CPU and computer memory works, branching, like using if statement, can produce high cost for the performance and produce cache misses. At basic level I understand that, but I’m deficiently not expert in HPC.

More specific, I don’t know what is the cost of branching in Julia and when we should start even bother about such things. I quickly checked Performance Tips and didn’t find relevant informations.

The cost of branching in Julia is the same as in languages like C++; as the articles mention you can increase speed by not branching. However this is an optimization that generally will be on the order of nanoseconds and so is probably one of the last things you would do to improve performance.


When you start to talk about branchless performance you are counting CPU cycles, i.e. L3 cache miss penalty is 50~70 CPU cycles or around 10 nanoseconds in modern CPUs. Before arriving at this level, you had to have thought about avoiding kernel space as much as possible by making the least amount of syscall()s(ranging from 20 nanoseconds to 300 nanoseconds on average). Now these are hardware-level/OS-specific protocols that are beyond a programming language’s scope. Julia’s performance tips will take you to 80% of the absolute performance, more than that you need to tinker with low-level stuff that is not that different than a C look-a-like code.

Edit: Also to add, usually when you are fetching data from L1 L2 cache repeatedly is worth the idea to go branchless, if you are fetching data from RAM your branchfull code will be faster most of the times.

1 Like

Yes, branches in Julia are a problem for performance (as they are in all high-performance programming languages). You can read more about how Julia and hardware interacts in an old, but still relevant blogpost of mine.

However, note that, like other compiled languages, Julia’s compiler will often remove branches where possible. For example, consider the step function of the Collatz Conjecture:

f(x) = isodd(x) ? 3x + 1 : x ÷ 2

Here, Julia’s LLVM backend will compile away the if/else statement to a cmove instruction, compiling the function to:

        push    rbp
        mov     rcx, rdi
        lea     rax, [rdi + 2*rdi + 1]
        mov     rbp, rsp
        shr     rcx
        test    dil, 1
        cmove   rax, rcx
        pop     rbp

Also, I don’t think branches can produce cache misses. They can produce branch mispredictions, which are also a source of CPU latency, but they are much less a delay than L3 cache misses.

Also also, modern CPU branch predictors are really really good, and I’ve often found that removing a branch in a tight loop doesn’t improve timing markedly, probably because it’s predicted with almost 100% accuracy.


Cache misses can occur when the CPU mispredicts what memory to fetch. It won’t do anything that writes back to memory until it knows which branch is actually taken, but it will fetch memory and execute instructions speculatively and then just throw out the work if a predicted branch happens to have been wrong. This can lead to cache misses when the prediction is wrong.


For reference, to those who haven’t seen it yet:


Fantastic! I will show it my students in C class! :grinning:

I first discovered it as it was linked on one of @ChrisRackauckas 's online courses.

1 Like

My 2 cents worth. Look at CPU core pinning before you start to think about branches.

Ie. Keep those processes busy on the same core and don’t let the Is stall them and move to different cores

1 Like

And I’m not sure if this applies to Windows, but under Linux you can pass isolcpus as a kernel parameter to prevent some cores from being used by the scheduler without intervention. You could then manually force julia onto the excluded cores.

1 Like

Branchless can be faster but often isn’t faster at all because branch predictors can predict which way you’re gonna branch anyway. You can measure branch misses with perf stat. It’s good to measure them before trying to remove them.