Performance comparison of a custom gcd() function in Julia, C, and Rust

Here is a naive implementation of gcd (greatest common divisor) in Julia:

function mygcd(a::Int64, b::Int64)::Int64
   if a == b 
       return a
   elseif a > b
       return mygcd(a-b, b)
   else
       return mygcd(a, b-a)
   end
end

Note that this is not the best implementation; Julia's built-in gcd() function is far better. My motivation here is just to make a fair (is it really fair?) comparison of the languages.

The C version of the same code is

long gcd(long a, long b) {
    if (a == b) {
        return a;
    } else if (a > b) {
        return gcd(a - b, b);
    } else {
        return gcd(a, b - a);
    }
}

and here is the Rust implementation of the same function:

#[no_mangle]
pub unsafe extern "C" fn gcd(a: i64, b: i64) -> i64 {
    if a == b {
        return a;
    } else if a > b {
        return gcd(a - b, b);
    } else {
        return gcd(a, b - a);
    }
}

For a fair comparison, I start Julia with the highest optimization level, 3:

julia --optimize=3

and the Makefile for the C and Rust libraries has similar compilation options:

cc -shared -fPIC -O3 -Wall -o libgcd.so gcd.c
cd rustgcd && cargo build --release && cd ..

and the Rust Cargo.toml file includes the option

[profile.release]
opt-level = 3
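
For reference, getting Cargo to emit a C-callable shared library (the librustgcd.so loaded via ccall) typically also requires declaring the crate type in Cargo.toml; a minimal sketch, assuming the crate is named rustgcd:

```toml
[lib]
name = "rustgcd"
crate-type = ["cdylib"]
```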

so I think the optimization levels match well.

Thanks to Julia's beautiful FFI, I can call the compiled functions using ccall. Here is how I wrap the C and Rust implementations and compare all three in Julia:

function cversion(a::Int64, b::Int64)::Int64
    ccall((:gcd, "./libgcd.so"), Int64, (Int64, Int64), a, b)
end

function rustversion(a::Int64, b::Int64)::Int64
    ccall((:gcd, "./rustgcd/target/release/librustgcd.so"), Int64, (Int64, Int64), a, b)
end

And here is the comparison of functions:

using BenchmarkTools

a = 10000 * 12 * 13 * 14
b = 512 * 12 * 13 * 14

@info mygcd(a, b) == cversion(a, b) == rustversion(a, b) == gcd(a, b)

@info "Custom Julia"
@btime mygcd(a, b);

@info "C"
@btime cversion(a, b);

@info "Rust"
@btime rustversion(a, b);

Note that the line

@info mygcd(a, b) == cversion(a, b) == rustversion(a, b) == gcd(a, b)

is not mandatory; I just wanted to be sure the functions behave the same way, or at least produce the same output.

I get the following output:

[ Info: true
[ Info: Custom Julia
  34.763 ns (0 allocations: 0 bytes)
[ Info: C
  15.764 ns (0 allocations: 0 bytes)
[ Info: Rust
  15.988 ns (0 allocations: 0 bytes)

Of course, each run gives slightly different results, but the ratios remain almost the same. (I enabled the performance mode in my Ubuntu distribution.)

I had expected the times consumed by C and Rust to be similar, but the Julia implementation is consistently more than 2x slower than its C and Rust counterparts.

Are these results normal? If so, are they acceptable? There are no allocations and no external libraries. What is the reason for these time differences? Is repeatedly calling a function slow in Julia?

My Julia version used here is v1.11.1.

Thank you in advance.

2 Likes

A cursory look at the generated assembly reveals that gcc rewrites the recursion into a loop. Julia does not. I haven't checked, but I guess that Rust also does this kind of tail-recursion optimization.

So, this huge difference is typical for tail-recursive functions doing very little at each stage.

8 Likes

Thank you for your response.

It’s clearer to me now that while it’s important for a language to be elegant, having a smart compiler is just as crucial. In fact, a well-designed compiler can even compensate for some of a language’s shortcomings, as we see with C.

There have been numerous discussions about this particular rewriting in Julia. Here is one of them: Tail-call recursion

6 Likes

You win some, you lose some; 2x slower here, 2x faster there. :)

julia> function whilegcd(a, b)
           while true
               if a > b
                   a -= b
               elseif b > a
                   b -= a
               else
                   return a
               end
           end
       end
whilegcd (generic function with 1 method)

julia> run(pipeline(`cc -shared -fPIC -O3 -Wall -o libwhilegcd.dylib -x c -`, stdin=IOBuffer("""
       long gcd(long a, long b) {
           while (1) {
               if (a > b) {
                   a -= b;
               } else if (b > a) {
                   b -= a;
               } else {
                   return a;
               }
           }
       }""")))
Process(`cc -shared -fPIC -O3 -Wall -o libwhilegcd.dylib -x c -`, ProcessExited(0))

julia> using BenchmarkTools

julia> a = 10000 * 12 * 13 * 14
21840000

julia> b = 512 * 12 * 13 * 14
1118208

julia> @btime whilegcd($a,$b)
  12.929 ns (0 allocations: 0 bytes)
34944

julia> function cwhileversion(a::Int64, b::Int64)::Int64
           ccall((:gcd, "./libwhilegcd.dylib"), Int64, (Int64, Int64), a, b)
       end
cwhileversion (generic function with 1 method)

julia> @btime cwhileversion($a, $b)
  24.072 ns (0 allocations: 0 bytes)
34944
3 Likes

In short, it’s better to use loops, and recursion and repeated self-calls should be avoided as much as possible, simply because the compiler optimizations don’t help much there. I think an author of performant Julia code should always think about the assembly counterpart of the code (this is equally true in many cases for other languages).

No. Recursion is often a great way to code, if it is used appropriately. Recursion in Julia — bad idea? - #5 by stevengj

But the interesting and useful cases for recursion (in an imperative language) are mostly not tail recursive, so they won’t be optimized away by any compiler in any language. Instead, you often have to make them fast by coarsening the base case (which no production compiler will do for you, AFAIK).

“Yes” in a more narrow sense: use loops instead of self tail calls, which are trivially rewritten as loops in an imperative language like Julia. Recursion adds no value in this case.

2 Likes

Now it’s actually 4.5 times faster:

(@v1.11) pkg> add https://github.com/TakekazuKATO/TailRec.jl

using TailRec

@tailrec function mygcd(a, b)
   if a == b 
       return a
   elseif a > b
       return mygcd(a-b, b)
   else
       return mygcd(a, b-a)
   end
end

i.e. for

julia> @benchmark mygcd(10000 * 12 * 13 * 14, 512 * 12 * 13 * 14)

Same speed:

julia> @btime mygcd($a, $b);
  37.755 ns (0 allocations: 0 bytes)

julia> @btime mygcd(a, b);  # incorrect way without $, also adds one allocation:
  63.804 ns (1 allocation: 16 bytes)

You don’t need all the type annotations; they don’t speed anything up. And:

@tailrec function mygcd(a::Int64, b::Int64)::Int64

actually the last one, the return-type annotation, breaks things (because of the macro?), but not if it's the only one given:

julia> @benchmark mygcd(10000 * 12 * 13 * 14, 512 * 12 * 13 * 14)
ERROR: UndefVarError: `Int64` not defined in local scope
Suggestion: check for an assignment to a local variable that shadows a global of the same name.
Hint: a global variable of this name also exists in Core.

Julia could choose to do such TCO by default, but doesn't (and people are against it: it has some pros but mostly cons, at least during development).

You can always get the same speed by rewriting into a loop manually, or just apply the macro from the package. But that example is pretty much the worst case (or, conversely, the biggest speed-up) for tail-recursive Julia code without TCO; most real-world code would do more work and have some memory accesses.

I tested with -O3 and Julia's default -O2, and you get the same speed. You likely get worse compilation time with the former, and it's very rarely needed (maybe never for integer code; it might help floating-point code, but then could it be dangerous? I've never seen exactly which optimizations it adds; I think all of them can, and should, be applied locally, e.g. with @simd).