Due to how modern CPUs and memory work, branching (e.g. using an if statement) can carry a high performance cost and cause cache misses. I understand that at a basic level, but I'm definitely not an expert in HPC.
More specifically, I don't know what the cost of branching in Julia is, or when we should even start to bother about such things. I quickly checked the Performance Tips and didn't find relevant information.
The cost of branching in Julia is the same as in languages like C++; as the articles mention, you can increase speed by not branching. However, this is an optimization that will generally save on the order of nanoseconds, so it is probably one of the last things you would do to improve performance.
When you start to talk about branchless performance you are counting CPU cycles: an L3 cache miss penalty, for example, is 50–70 CPU cycles, or around 10 nanoseconds on modern CPUs. Before arriving at this level, you should already have thought about avoiding kernel space as much as possible by making the fewest possible syscall()s (each costing from 20 to 300 nanoseconds on average). These are hardware-level/OS-specific details that are beyond a programming language's scope. Julia's performance tips will take you to about 80% of the absolute performance; beyond that you need to tinker with low-level stuff, at which point the code is not that different from C-like code.
Edit: Also to add, when you are repeatedly fetching data from L1/L2 cache it is usually worth considering going branchless; if you are fetching data from RAM, your branchful code will be faster most of the time.
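To make "branchless" concrete, here is a minimal toy sketch (my own example, with made-up function names): the same reduction written with a branch and with `ifelse`, which evaluates both arms eagerly and selects one, letting LLVM emit a conditional move (and often vectorize) instead of a jump.

```julia
# Sum of the positive entries, written with an explicit branch.
function sum_pos_branchy(xs)
    s = zero(eltype(xs))
    for x in xs
        if x > 0
            s += x
        end
    end
    return s
end

# Same kernel, branchless: `ifelse` always evaluates both arguments
# and selects one, so the loop body has no conditional jump.
function sum_pos_branchless(xs)
    s = zero(eltype(xs))
    for x in xs
        s += ifelse(x > 0, x, zero(x))
    end
    return s
end
```

Both return the same result; whether the second is actually faster depends on the data and should be measured, not assumed.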
Yes, branches in Julia are a problem for performance (as they are in all high-performance programming languages). You can read more about how Julia and hardware interact in an old, but still relevant, blog post of mine.
However, note that, like other compiled languages, Julia’s compiler will often remove branches where possible. For example, consider the step function of the Collatz Conjecture:
f(x) = isodd(x) ? 3x + 1 : x ÷ 2
Here, Julia’s LLVM backend will compile the if/else away to a cmove (conditional move) instruction, producing:
push rbp
mov rcx, rdi
lea rax, [rdi + 2*rdi + 1]
mov rbp, rsp
shr rcx
test dil, 1
cmove rax, rcx
pop rbp
ret
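You can check this yourself with `@code_native` from the InteractiveUtils standard library (a sketch; the exact assembly varies by CPU, OS, and Julia version):

```julia
using InteractiveUtils  # provides @code_native (preloaded in the REPL)

f(x) = isodd(x) ? 3x + 1 : x ÷ 2

# Prints the native assembly for f(::Int); on x86-64 you should see a
# cmov rather than a conditional jump.
@code_native f(10)
```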
Also, I don’t think branches themselves produce cache misses. They can produce branch mispredictions, which are also a source of CPU latency, but those cost much less than L3 cache misses.
Also also, modern CPU branch predictors are really, really good, and I’ve often found that removing a branch in a tight loop doesn’t improve timings markedly, probably because it’s predicted with almost 100% accuracy.
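A sketch of the classic way to see this (the kernel and data here are my own toy example): run the same branchy loop over shuffled vs. sorted data. Sorted input makes the branch direction nearly perfectly predictable, so any timing gap you observe between the two is down to the predictor.

```julia
# Count entries above a threshold with an explicit branch.
function count_big(v, t)
    c = 0
    for x in v
        if x > t
            c += 1
        end
    end
    return c
end

data = rand(1:100, 1_000_000)   # branch direction is random
sorted_data = sort(data)        # branch direction is almost perfectly predictable

# Compare, e.g. with BenchmarkTools:
#   @btime count_big($data, 50)
#   @btime count_big($sorted_data, 50)
# On modern predictors the gap is often smaller than you'd expect.
```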
Cache misses can occur when the CPU mispredicts which memory to fetch. It won’t do anything that writes back to memory until it knows which branch is actually taken, but it will speculatively fetch memory and execute instructions, then throw out the work if a predicted branch turns out to have been wrong. Those speculative fetches can cause cache misses when the prediction is wrong.
And I’m not sure whether this applies to Windows, but under Linux you can pass isolcpus as a kernel parameter to keep the scheduler from placing tasks on certain cores unless explicitly asked. You could then manually pin julia onto the excluded cores.
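For example (a sketch; the core numbers and script name are placeholders), after adding `isolcpus=2,3` to the kernel command line, you can pin a Julia process to the isolated cores with taskset:

```shell
# In the bootloader's kernel command line (reserves cores 2 and 3):
#   isolcpus=2,3
# Then pin Julia onto the isolated cores:
taskset -c 2,3 julia --threads=2 myscript.jl
```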
Branchless code can be faster, but often isn’t faster at all, because branch predictors can usually predict which way you’re gonna branch anyway. You can measure branch misses with perf stat; it’s good to measure them before trying to remove branches.
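Concretely, something like this (the script name is a placeholder; requires Linux perf with hardware counters available):

```shell
# Count branches and branch misses for a whole Julia run.
# A miss rate well under 1% usually means going branchless won't help much.
perf stat -e branches,branch-misses julia myscript.jl
```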