Do C++ performance optimisations apply to Julia?

jacobusmmsmit · June 10, 2022, 2:19pm

Hello everyone

I came across a set of tips for optimising C++ raytracing code written by Chris Wyman and wondered how much of it applies to Julia. In total there are 27 tips which I’ll share below along with my thoughts on them. What I would like to gain from this post is to ask to fill in the numerous gaps in my knowledge about many of the optimisations as they relate to lower level or core components of the Julia language or computer architecture in general.

My ultimate goal is to make a blog/Forem post about some more subtle, uncommon, or counterintuitive (coming from other languages) optimisations that one can make to Julia code. The bonus goal of this post is to compare how these optimisations apply to Python. A lot of newer programmers and college students (myself included) love to make code go fast and I think it would be fun way to introduce them to Julia: favourably comparing something they are familiar with, namely Python and/or C++, to something that most wouldn’t have considered otherwise.

The original pdf can be found here but I will summarise (sometimes copying sentences directly, obviously I wouldn’t do that for the article) the points without further ado:

Some general rules that apply to all languages:

Remember Ahmdal’s Law: \text{speedup} := \frac{\text{time}_\text{old}}{\text{time}_\text{new}} = \frac{1}{(1-\text{func}_\text{cost}) + \text{func}_\text{cost}/\text{func}_\text{speedup}}
- Clearly this still applies in Julia as it’s just a good thing to keep in mind before foolishly trying to optimise a piece of code that is rarely called.
- “Make the common case fast and the rare case correct”.
Code for correctness first, then optimize!
- Write for correctness, then if you know the function will be called frequently perform obvious optimisations then find and remove bottlenecks. (paraphrasing the original)
People I know who write very efficient code say they spend at least twice as long optimizing code as they spend writing code.

Some more specific tips:

Jumps/branches are expensive. Minimize their use whenever possible.
- In C++ function calls require two jumps and some stack memory manipulation so Chris advises to use iteration instead of recursion as well as inlining short functions.
- One comment that immediately struck me as potentially suboptimal in Julia is the advice that
  for(i=0; i<100; i++) DoSomething();
  should be changed into
  DoSomething(){for(i=0; i<100; i++){ ... }}.
  In Julia, I would prefer the first as dispatch happens at the function barrier, and so you could run into some suboptimal code generation or runtime slowdown doing the second way if you weren’t actively thinking about the type of variables in the loop.
- Another piece of advice is the use of switch statements to replace if...elseif...elseif chains as these also require jumps, whereas a switch can be optimised into a table lookup with a single jump.
- What I’m not sure about is how many jumps does a function call have in Julia? Is iteration still preferred (for performance) over recursion when you keep everything type stable? Does @inline behave exactly the same as in C++? What about the comment on switch statements? I’ve seen chains of ternary operators recommended before and I think it looks pretty, but is it optimal?
Think about the order of array indices.
- Here’s one that almost everyone using Julia knows (and for good reason). Obviously, the advice for C++ is the opposite to that of Julia as C++ is row-major.
Think about instruction-level-parallelism.
- Modern CPUs can simultaneously execute a bunch of instructions so we should encourage parallelism by making blocks of code (between jumps) have enough independent instructions to allow the CPU to be fully utilized.
- One suggestion Chris has is to think about loop unrolling, but I’ve seen this discussed before in reference to Julia and I’m pretty sure it doesn’t make such a huge difference given how unroll-happy and SIMD-aggressive Clang/LLVM is. We have higher level tools like @turbo to do a lot of this kind of stuff but I don’t know enough about it to say whether there’s ever a use case for manual unrolling.
Avoid/reduce the number of local variables.
- In C++, local variables are stored on the stack unless there are few enough that they can be stored in registers which speeds up memory accesses and avoids overhead of setting up a stack frame.
- The advice given encourages the use of global variables to avoid using too many local ones, which appears to go directly goes against the first rule of Julian law. However, if we are already inside a locally isolated scope, is there reason to avoid allocating local variables in - for example - a function defined in this scope?
Reduce the number of function parameters
- Paraphrasing: “Same reason as above, these are also stored on the stack.”

(From here on out my knowledge gets really spotty as I’ve never really learned about memory or used C/C++ etc.)

Pass structures by reference, not by value
- Here’s where we definitely reach the end of my low-ish level knowledge. Pointers/Refs definitely don’t work the same way in C++ and Julia so I’d appreciate as much help on this one as possible.
If you do not need a return value from a function, do not define one
- Definitely not the Julian way: functions are everywhere because of dispatch and many exist purely to mutate state of pre-existing variables.
- If there’s any deeper truth to this that I’m missing or some additional context that would be useful please let me know!
Try to avoid casting where possible
- “Integer and floating point instructions often operate on different registers, so a cast requires a copy.”
- “Shorter integer types (char and short) still require the use of a full-sized register, and they need to be padded to 32/64-bits and then converted back to the smaller size before storing back in memory. (However, this cost must be weighed against the additional memory cost of a larger data type.)”
- This is incredibly low level but not fundamentally language specific. Still applies to Julia but probably not worth thinking about as the benefit of dispatching on the best type outweighs the cost of a copy.
Be very careful when declaring C++ object variables.
- “Use initialization instead of assignment (Color c(black); is faster than Color c; c = black;).”
- I’ve never seen someone do this in Julia. Possibly because it’s dumb but certainly in part because structs are immutable by default so initialisation in this way is not even an option.
Make default class constructors as lightweight as possible
- Applies directly to Julia as well. Make commonly used initialisation as lightweight as possible.
Use shift operations instead of integer multiplication and division, where possible.
- It probably makes more sense to tell the compiler what you want to do and let it optimise the code for the specific architecture.
- Benchmarking I find that 2 << 3 is 11% faster than 2 ^ 4.
Be careful using table-lookup functions
- Chris states that, while precomputed value lookups are encouraged by many, the memory lookup is often more expensive than just recomputing the value.
- This is just a general piece of programming advice but applies more to fast languages than slower ones.
For most classes, use the operators += , -= , *= , and /= , instead of the operators + , - , * , and / .
- Certainly applies to mutable structs and structs that contain mutable objects, however sometimes things that live entirely on the stack can be optimised away and are never initialised in the first place. If someone has any more in depth or specific knowledge in this area I’d appreciate it!
For basic data types, use the operators + , - , * , and / instead of the operators += , -= , *= , and /= .
- Using += assumes mutability which is generally something to be avoided unless you’re working with large arrays where it doesn’t matter much.
Delay declaring local variables
- Chris advises that variables should only be declared on branches where they are necessary, but I’m sceptical about this line specifically. I’m not sure if branches are handled by LLVM or only the machine itself, but am I correct in saying that the more similar branches are, the more likely the computer is to be able to evaluate both and not have to wait to see which branch is reached? I probably just said something dumb but please educate me!
For objects, use the prefix operator (++obj) instead of the postfix operator (obj++).
- We don’t have these in Julia so no problemo. The justification Chris gives for C++ is that copies of objects are made (and destroyed) when calling a postfix, but prefix operators don’t need a temporary copy.
Be careful using templates
- In Julia, users are encouraged to use packages for what they are trying to do rather than implement it themselves (finite difference methods for differentiation, for example) and then raise any issues with the packages when they discover something suboptimal. This is in part because most Julia packages are written entirely in Julia and I absolutely love this.
- Unfortunately, most Julia packages are written by academics with specific use-cases. Specifically, this means that code is sometimes hard to read even though it’s written in a high-level language. This has been discussed to death so I probably won’t write much about it for the article.
- Focussing on the positives, Julia makes it easy to read source code to understand exactly what a certain function is doing and lets you easily extend the language itself to optimise (almost) anything to your use case.
Avoid dynamic memory allocation during computation.
- This is another commonly given piece of advice in Julia. Prefer immutability over mutability unless it is unavoidable, in which case embrace it entirely and do as much as possible in-place.
- The fact that stack allocations can sometimes disappear entirely is even more reason to avoid dynamic memory allocations in Julia.
Find and utilize information about your system’s memory cache
- “If a data structure fits in a single cache line, only a single fetch from main memory is required to process the entire class”
- This is really cool stuff but I’ve never thought about it in Julia. How lightweight are structs in Julia? That is to say, how much overhead is there when allocating a struct compared to allocating its contents individually? Can I reliably predict how many bytes a struct will be?
Avoid unnecessary data initialization.
- “If you must initialize a large chunk of memory, consider using memset().”
- From what I know, memset is a fundamental part of C++ optimisation, but Julians don’t need to worry about it.
- If anything, preallocating large arrays is one of the most Julian things to do and is something that even MATLAB users have in their optimisation toolkit.
Try to early loop termination and early function returns.
- It was funny to see this sort of simple stuff so late in the list after all of the hardcore memory and low-level tips.
- Eliminating common cases first to avoid computation is just general good practice, especially in the raytracing context that the original list was written where most rays will miss most objects.
Simplify your equations on paper!
- In Julia, we can do one better: Simplify your equations with ModelingToolkit.jl!
The difference between math on integers, fixed points, 32-bit floats, and 64-bit doubles is not as big as you might think.
- I’ve seen many applications where using 32-bit floats has resulted in huge speedups without loss of precision. In fact, it’s the standard in some domains (deep learning).
- What about SIMD? Even if the original operations aren’t that much faster, the fact that we can do more of them is surely a bonus. Given that Julia (and everything compiled by Clang) heavily relies on SIMD code the benefit should be noticeable or at least tested for.
Consider ways of rephrasing your math to eliminate expensive operations.
- Again, this is general advice, but the difference between cheap and expensive operations on modern hardware isn’t that big so this could result in code being less readable for a minimal benefit.

And that’s everything. I wish I had more to say about many of these, but I hope that with the community’s advice I can write something insightful about most of these comments, and hopefully give people looking to scratch their optimisation itch while using a high-level language more of a reason to use Julia!

As I’m sure you can tell, my knowledge of the inner workings of Julia, C++, and the very computers that I use every day are limited so I’m definitely not the best, nor most qualified person to write this article. That said, the more content there is about my favourite language the better!

gbaraldi · June 10, 2022, 2:41pm

In summary, most of those apply, julia doesn’t generate code that much different from C++ when you compare assembly. Specifically about inlining, julia is very agressive when it comes to inlining, so you generally don’t need to worry about it. There are some edge cases where julia get’s it wrong but they are just that.

The thing about local variables is odd, I would avoid doing that even in C++ since global variables make code harder regardless of performance.

As you said smaller floats and integers can make massive differences due to being less memory and better SIMD.

Julia structs are basically like C/C++ structs, and when parametric they are similar to templates.

Initializing memory is not allocating memory, we preallocate, but sometimes allocating with undef is faster than initializing it with something like zeros.

In relation to templates, julia code uses generics for everything and emits specialized code if it’s type stable so making sure your code is type stable is enough.

Some of the optimizations mentioned are routinely done by the compiler so I wouldn’t worry too much.

Edit: About python, the optimization is to call code that isn’t python .

Henrique_Becker · June 10, 2022, 2:57pm

I needs someone else to confirm this, but I believe that these chains can be optimised by the compiler to the equivalent of a switch statements if all conditionals compare the same variable to a single value each.

This is pretty nonsensical to Julia, I believe. In C++ each local variable is kinda guaranteed to be a real memory position (either in the stack or the registers), in Julia the variables are bindings, just names for values, that do not give the same guarantees (both for better and for worse).

From what I have seen in this forum, in Julia we often are worried about putting things in the stack instead of the heap, not in the registers instead of the stack. This is basically going one extra level of optimization that I do not know if it can reliably be achieved by Julia.

You do not have the same level of control over this in Julia than in other languages, and I do not really agree with this general statement. Very large structs may be better passed as reference, but (i) you will be paying the cost of de-reference and (ii) this is decided by Julia itself (mutable are always passed by reference, and immutable are left to the compiler to decide). You can wrap values with Ref but I think the extra indirection layer will almost always be detrimental to performance. I also believe this comment may be a little old, many compiler optimisations nowadays take advantage of the invariants that come from immutability.

I am not sure if I did not understand the advice or your comment. It seems like the advice is about returning void (i.e., not having a return value) but I may be wrong. If it is about not defining functions per se it seems a bit nonsensical to me. You never really need the “return value from a function” you can always expand the function at your context using the variables that would be the arguments (if it is the call cost you are worried).

I really think that in Julia, the rare local i; i = 1 will be no different than local i = 1, the compiler is not braindead.

Again, I would be surprised if the highest level of compiler optimization did not do this by you.

Again, this seems like something the compiler should be able to deal with, and in Julia the if statements are leaky, so I do not think it applies (in C/C++ they aren’t if my memory do not fail me).

I think the Julian equivalent is using Vector{T}(undef, n) instead of fill(value, n) when the initialization is unnecessary.

vchuravy · June 10, 2022, 4:34pm

This is very subtle. Recursion is easier for the inference algorithm to reason about, so sometimes you will see rather tricky recursion patterns used in code that in particular deals with tuples. But often iteration is better if you are dealing with homogenous types and you avoid that the user gives you a very large input and thus blow the stack-limit.

The closest corollary in Julia is the use of StaticArrays. Sometimes the question arises: What is to big for an SArray, SArray tend to live on the stack and so there is a usable limit here and it might be more efficient to use a sized array.

Functions · The Julia Language I think the thing to discuss here is struct vs mutable struct.

The equivalent of templates is parametric types/type parameters in a function/and to a more extreme extend generated functions.

mkitti · June 10, 2022, 6:37pm

You are conflating memory allocation and initialization. They are not the same.

In Julia, we have nice tools to track memory allocation. To optimize, one should avoid allocating memory as much as possible. Preallocation is mainly used to create buffers and to avoid frequent allocation such as in loops.

In Julia, memset is called fill!:

github.com

JuliaLang/julia/blob/ab4c8cf616f222ae39767a46957ba06c4b77a0f9/base/array.jl#L425-L428


      
          function fill!(a::Union{Array{UInt8}, Array{Int8}}, x::Integer)
              ccall(:memset, Ptr{Cvoid}, (Ptr{Cvoid}, Cint, Csize_t), a, convert(eltype(a), x), length(a))
              return a
          end

This call could be optimized away by the compiler. Because of this, we have Base.securezero!:

help?> Base.securezero!
  securezero!(o)

  securezero! fills the memory associated with an object o with zeros. Unlike fill!(o,0) and similar code, which might be optimized away by the compiler for objects about to be discarded, the
  securezero! function will always be called.

Like in C++, it preferable to not initialize if not needed. Array{T}(undef, ...) is preferred over zeros(T, ...). If needed, fill! (aka memset) is a good way to do so. It may be optimized out, which in some cases may be a problem.

lmiq · June 10, 2022, 6:50pm

At the same time, it is rarely the case that memory allocation or initialization of an allocatable array is a major part of a performance critical function. If it is, one should probably be preallocating outside the critical part, and thus initializing or not is probably irrelevant. Most times this is likely just a premature optimization of very minor importance.

oxinabox · June 10, 2022, 7:34pm

I am dubious as to if avoiding memory on the stack does anything meaningful to optimize C++.
Especially with modern optimizing compilers.

Ensuring memory ends up on the stack and not the heap is a good optimization in julia.
Since you speed things up by removing the time taken to allocate, and deallocate and garbage collect (the former two also apply in C++, the last doesn’t).

Elrod · June 10, 2022, 8:50pm

Because SArray are immutable and the conversion to and from mutable MArrays are expensive, and also because operations on SArray tend to be defined to be entirely unrolled, they scale very poorly.
If it weren’t for these two factors, the usable size range of SArrays could be larger.
Although, if enough people were to stack allocate 50x50 arrays, we’d likely start blowing the stack, so maybe it’s better we don’t encourage people to put big arrays on the stack.

I am dubious as to if avoiding memory on the stack does anything meaningful to optimize C++.

I do think using stack memory helps in C++, which is why LLVM makes extensive use of data types like llvm::SmallVector internally. The llvm::SmallVector is a dynamically sized array that allocates some memory on the stack. If the size of the array is small enough to fit on this stack space, it is stack allocated. Otherwise, it’ll allocate on the heap.

It is very common for Julia program runtimes be dominated by allocations.
It’s not uncommon to see >50% GC time.
If one intends to write fast Julia code, thinking about memory management is a critical part of the design.
But I think that is true of any language where writing fast code is on the table.

This is a big part of the difference in performance between SimpleChains.jl and Flux.jl.

jacobusmmsmit · June 10, 2022, 9:40pm

Thank you so much everyone for the replies so far. Apologies that I posted and then didn’t engage at all! I’ll start collating ideas and write a first draft which you can all pick apart if you like : - )

Benny · June 11, 2022, 12:40am

So how much of this applies to immutable structs? Nobody would bother writing one with 2500 fields, but I am wondering how immutables are implemented. Probably off-topic, but my curiosity started over whether a compiler could edit an individual field in an immutable struct instance (which may also be an Array element rather than living on the stack) under the hood rather than the semantic action of creating a new instance. It seems like a useful optimization for setfield!-like syntax and operations if there’s no need for the change to be shared by multiple references (a job for mutable structs with all their downsides). I know about Setfield.jl but I have yet to figure out how to edit a field of an Array element without copying the whole array.

mkitti · June 11, 2022, 9:55am

Thank you for checking in with the community on this. My general sense is that the 27 points are likely more applicable to Julia than not. The main question is how important are they. If they are not applicable to Julia, I would point to a specific Julia feature rather than philosophical differences.

In Julia, the advice to avoid is global dynamically-typed variables. In some cases, global constant references may be a useful optimization. Constant references with concrete types are useful because they can be statically typed in that they can take new values but cannot change type. Another approach would be passing reusing an argument rather creating a new variable. Methods ending with ! particularly emphasize this pattern.

Note that structs in Julia are by default immutable. This is an important language feature.

Mutable structs are passed by reference in Julia.

Having type-stable stable functions is important in Julia. If the compiler can determine the return type from the argument types, then less dynamic dispatch is needed. Thus, this advice applies to Julia as well. If no return value is needed, consider returning nothing.

Some operations can be optimized through fusion. In Julia also consider is the use of .= and operations such as .*=. For example, note the allocation when using f! as opposed to g!.

julia> f!(A) =  A .= A*2
f! (generic function with 1 method)

julia> g!(A) =  A .*= 2
g! (generic function with 1 method)

...

julia> @time f!(A);
  0.000018 seconds (1 allocation: 8.125 KiB)

julia> @time g!(A);
  0.000011 seconds

I always find this comment weird. Academics write a lot of code in many languages and often release it as open source. The overall prevalence of open source code written by academics is high. It really comes down to the individual experience of the each author.

Julia structs are often directly mappable to C structs. We rely on this for interoperability with C via @ccall. See the function sizeof.

We think about this a lot in Julia because numerical computing is a major application of the language. We also have a type promotion system. On a small scale, it probably does not matter. On a large scale, memory can be important due to processor memory caches.

If you use smaller types, you can potential fit more of them into a 256 or 512 bit SIMD register and thus do more operations per cycle. Not all of this is as automatic as one would like though, so sometimes using SIMD explicitly is helpful.

giordano · June 11, 2022, 10:16am

No, Julia (like many lisps, Java, Perl, Python, Ruby, and others) uses passing by sharing. This is also documented.

paulmelis · June 11, 2022, 10:28am

How is that different from pass-by-reference?

Function arguments themselves act as new variable bindings (new locations that can refer to values), but the values they refer to are identical to the passed values.

giordano · June 11, 2022, 10:35am

Quoting from the linked Wikipedia article:

The semantics of call by sharing differ from call by reference: “In particular it is not call by value because mutations of arguments performed by the called routine will be visible to the caller. And it is not call by reference because access is not given to the variables of the caller, but merely to certain objects”.[36] So, for example, if a variable was passed, it is not possible to simulate an assignment on that variable in the callee’s scope.[37] However, since the function has access to the same object as the caller (no copy is made), mutations to those objects, if the objects are mutable, within the function are visible to the caller, which may appear to differ from call by value semantics. Mutations of a mutable object within the function are visible to the caller because the object is not copied or cloned—it is shared.

paulmelis · June 11, 2022, 10:54am

Okay, I see the subtle difference between pass-by-reference and pass-by-sharing, but I still think the original quote that mutable structs are passed by reference is true (and more insightful). In that specific case a reference to the struct is passed (and so the struct can be mutated in the caller), even though pass-by-sharing is more accurate when conceptually describing how values of all types are passed in Julia.

skleinbo · June 11, 2022, 11:10am

It’s not though. If it were, the following would return [1,2,3,]

function modx(x)
  x = [1,2,3]
end
function f()
  x = [0]
  modx(x)
   ## returns [0] because x is passed-by-sharing, not reference
  return x
end

Compare pass-by-reference in C++

#include <iostream>
#include <string>
#include <vector>

void modx(std::vector<int>& x){
    x = {1,2,3};
}
void f() {
    std::vector<int> x = {0};
    modx(x);  
    for (int n: x) {
        std::cout<<n<<",";
    }
}

int main()
{
  f(); ## prints '1,2,3,'
}

While the former behavior is sane and passing-by-reference as the default will lead to disaster (IMHO), telling someone who comes from e.g. C that Julia is pass-by-reference, might mislead and confuse them.

paulmelis · June 11, 2022, 11:28am

Hmm, seems I overlooked the case of assigning to the reference. I was mostly thinking of this usage (in which I see no difference between pass-by-sharing and pass-by-reference):

julia> mutable struct S
           f::Float32
       end

julia> s = S(1.234)
S(1.234f0)

julia> function mods(s)
           s.f = 456
       end
mods (generic function with 1 method)

julia> s
S(1.234f0)

julia> mods(s)
456

julia> s
S(456.0f0)

But okay, I learned something today on pass-by-sharing

giordano · June 11, 2022, 11:43am

Note that assignment never modifies the memory object pointed to by a variable but rebinds the variable to something else. Your modx doesn’t really modify in-place anything. If for example you push! to x you’re actually modifying it and the caller would see it.

paulmelis · June 11, 2022, 11:50am

I think that was the point @skleinbo was trying to make about Julia’s pass-by-sharing. In contrast, in C++ you can use a reference variable to modify in-place (as the second example shows).

giordano · June 11, 2022, 11:52am

But that doesn’t have anything to do with evaluation strategy, but with assignment semantic.

Topic		Replies	Views
Performance optimisation: Julia vs C++ Performance unrolling	31	2671	April 21, 2021
Ray Tracing in a week-end - Julia vs SIMD-optimized C++ Performance gpu , graphics , benchmark , pluto , staticarrays	81	13878	August 9, 2024
I translated Peter Shirley's Raytracer to Julia. C++ 1min -> Julia 2m30s -> 15s (@threads) Visualization images , graphics	27	3157	February 21, 2022
Looking for Some Best Practices for Optimizing Julia Code Performance? General Usage	5	370	December 23, 2024
Are there any fundamental parts of Julia's design or implementation that could limit its performance? Performance	6	833	June 13, 2019

Do C++ performance optimisations apply to Julia?

Related topics