When Julia gets within 1-3x of C/C++ speed, why is C/C++ usually faster?

A notion that I’m getting from blogs and published papers comparing Julia with other languages in practical use cases is that Julia gets within 1-3x of the speed of the fastest C/C++ implementation (though it’s worth mentioning that Julia is invariably more readable and, on rare occasions, even beats C/C++). To be clear, I’m talking specifically about writers who made the effort to read the performance tips and do things the Julia way: type stability, concrete struct fields, limited allocations. If they hadn’t, they’d see slowdowns of several orders of magnitude, as many other topics here demonstrate.

It seems that we can already give the compiler the sort of information that people broadly credit for making compiled languages like C/C++ efficient. The only big difference I can think of is Julia’s garbage collector, but I can’t say how that factors in because I don’t know any low-level languages. I’m hoping someone who does can give general reasons for the remaining gap in performance.
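To illustrate what I mean by doing things the Julia way, here is the kind of change the performance tips ask for (a toy sketch of my own; the struct names are made up, not taken from any of the benchmarks):

```julia
# Abstract-typed fields: the compiler cannot infer a concrete layout,
# so every field access is boxed and dynamically dispatched.
struct SlowPoint
    x::Real
    y::Real
end

# Concrete fields: fixed layout, fully specialized machine code.
struct FastPoint
    x::Float64
    y::Float64
end

norm2(p) = p.x^2 + p.y^2    # same generic function for both

norm2(SlowPoint(1.0, 2.0))  # works, but hits the slow, boxed path
norm2(FastPoint(1.0, 2.0))  # compiles to a handful of instructions
```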

7 Likes

It’s very hard to speak about this in generalities because every individual little problem is actually a neverending rabbit hole of potential microoptimizations, special cases, and weird details you would never have guessed.

However, in general I would say that in my experience (as someone who reads a fair amount about this, but doesn’t use or know C) the two main reasons are

  1. It’s awkward. You can basically write C-flavoured julia code if you really want to; it’s just ugly and awkward. It feels like cheating when the goal is to compare julia to C, but you then write horrific, unidiomatic, unsafe julia code that is basically just C.

  2. Missing optimizations. There are some optimizations that are possible in julia but not yet implemented, because they’re hard or nobody has gotten around to them yet. Many of these are missing optimizations in LLVM, or things that are awkward for us to communicate properly to LLVM. A great example is vectorization and SIMD. It turns out you can squeeze some pretty insane performance out of julia loops if you use https://github.com/chriselrod/LoopVectorization.jl, and you can often completely smoke all but the most clever handwritten custom assembly iteration schemes (see the sketch after this list). This basically happens by bypassing LLVM’s loop machinery and getting Chris Elrod to do your code generation instead.
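For a sense of what that looks like in practice, here is a minimal sketch (my own, assuming LoopVectorization.jl’s @turbo macro, which was named @avx in older releases):

```julia
using LoopVectorization  # provides the @turbo macro (called @avx before v0.12)

# A plain dot product. @turbo takes over code generation for the loop,
# emitting explicitly unrolled SIMD code instead of hoping LLVM's
# auto-vectorizer figures it out.
function dot_turbo(x::AbstractVector{Float64}, y::AbstractVector{Float64})
    s = 0.0
    @turbo for i in eachindex(x)
        s += x[i] * y[i]
    end
    return s
end

x, y = rand(10_000), rand(10_000)
dot_turbo(x, y)
```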

29 Likes

If it’s within a factor of 3, then GC is not the issue.

4 Likes

the wording of that last sentence made me chuckle

3 Likes

Only the finest handmade artisanal SIMD code from @Elrod.

9 Likes

With substantial software, when Julia gets within 1-3x of C/C++ speed, why does Julia get you working, reliable, collaboratively written code 2-5x faster?

4 Likes

I think the biggest reason for this is that to get the power of multiple dispatch in C++, you would need to write all your code using templates. Doing so to the extent Julia does would absolutely kill your compile times, due to the lack of a JIT. Julia’s macros are also a huge part of the story here. Tools like LoopVectorization mean that idiots like me can write the equivalent of hand-optimized assembly for any loop that is even a vague hotspot.
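To make the comparison concrete, here is a rough sketch (mine, not from the post above) of the specialization Julia performs at call time, which C++ would express as a function template instantiated at compile time:

```julia
# One generic definition; Julia JIT-compiles a separate specialized native
# method for each concrete argument type it is called with, roughly what a
# C++ function template instantiation gives you ahead of time.
function total(v::AbstractVector{T}) where {T<:Number}
    s = zero(T)
    for x in v
        s += x
    end
    return s
end

total(rand(Float64, 8))            # first call compiles a Float64 specialization
total(rand(Int32(1):Int32(9), 8))  # a separate Int32 specialization
```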

4 Likes

Are there any plans to integrate LoopVectorization into Julia?

2 Likes

Not in the short term. LoopVectorization is improving very quickly, and incorporating it into Base would almost certainly slow that progress massively. Furthermore, there is not a ton of benefit to adding it to Base.

4 Likes

I don’t know if this will help you, but Jeff Bezanson talks about Julia vs. C/C++ speed around 16:45 of the State of Julia talk by Jeff Bezanson & Stefan Karpinski. The whole presentation is worth hearing for many reasons.

2 Likes

I read the v1.5 release highlights, but I didn’t think the allocation optimization would apply to how tuples are stored. Just to clarify, am I correct in thinking that that part of the video shows that while v1.4 allocated a million tuples plus an array of pointers to them, v1.5 allocates just the array, with the tuples stored directly inline?

2 Likes

Exactly.
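For concreteness, a small illustrative sketch of the kind of case that change covers (my own example, not the exact one from the talk):

```julia
# Each element is a Tuple{Int, String}: a plain Int plus a heap reference.
v = [(i, string(i)) for i in 1:1_000_000]

# On Julia 1.4, each tuple was a separate heap allocation and `v` held
# pointers to them; on 1.5+, the tuples are stored inline in the array's
# memory, so the tuples themselves no longer cost one allocation apiece
# (the Strings, of course, still do).
```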