Why is Julia faster than C++ for quicksort?

See the following pieces of code for quicksort, adapted from GitHub - JuliaLang/Microbenchmarks: Microbenchmarks comparing the Julia Programming language with other languages. As you can see in the benchmark output, julia runs the sorting in about 62 ms while C++ does it in about 83 ms. So julia is about 25% faster than C++. How is this possible?
julia version is 1.9.2
I have a Windows10 machine and compiled c++ with VisualStudio2019 in release mode.

using BenchmarkTools

# quicksort version with tail recursion: https://stackoverflow.com/questions/19094283/quicksort-and-tail-recursive-optimization
function qsort!(a,lo,hi)
    i, j = lo, hi
    @inbounds while i < hi
        pivot = a[(lo+hi)>>>1]
        while i <= j
            while a[i] < pivot; i += 1; end
            while a[j] > pivot; j -= 1; end
            if i <= j
                a[i], a[j] = a[j], a[i]
                i, j = i+1, j-1
            end
        end
        if lo < j; qsort!(a,lo,j); end
        lo, j = i, hi
    end
    return a
end

const N::Int32 = 1000000

@benchmark qsort!(a, 1, N) setup=(a = rand(Int32.(1:N), N))

Julia output:

BenchmarkTools.Trial: 69 samples with 1 evaluation.
 Range (min … max):  58.021 ms … 72.308 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     62.288 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   62.607 ms Β±  2.842 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

       β–ƒ      ▁   β–ƒ β–† β–β–ˆ    β–ˆ
  β–„β–β–β–„β–‡β–ˆβ–„β–„β–„β–‡β–β–‡β–ˆβ–β–„β–„β–ˆβ–β–ˆβ–β–ˆβ–ˆβ–‡β–‡β–„β–„β–ˆβ–β–β–‡β–β–„β–‡β–β–‡β–„β–β–‡β–„β–β–„β–β–„β–β–β–β–„β–β–β–β–β–β–„β–„β–„β–β–β–β–„ ▁
  58 ms           Histogram: frequency by time        69.6 ms <

 Memory estimate: 208 bytes, allocs estimate: 13.

C++ code:

#include <chrono>
#include <vector>
#include <iostream>
#include <iomanip>
#include <random>

void quicksort(std::vector<int>& a, int lo, int hi) {
    int i = lo;
    int j = hi;
    while (i < hi) {
        double pivot = a[(lo + hi) / 2];
        // Partition
        while (i <= j) {
            while (a[i] < pivot) {
                i = i + 1;
            }
            while (a[j] > pivot) {
                j = j - 1;
            }
            if (i <= j) {
                double t = a[i];
                a[i] = a[j];
                a[j] = t;
                i = i + 1;
                j = j - 1;
            }
        }

        // Recursion for quicksort
        if (lo < j) {
            quicksort(a, lo, j);
        }
        lo = i;
        j = hi;
    }
}

int main() {
    int cnt = 60;
    float duration = 0.0;
    for (int i = 0; i < cnt; ++i) {
        int N = 1000000;
        std::vector<int> d(N);

        // Seed with a real random value, if available
        std::random_device r;
        // Choose a random mean between 1 and 6
        std::default_random_engine e1(r());
        std::uniform_int_distribution<int> uniform_dist(1, N);

        for (int i = 0; i < d.size(); i++)
        {    
            d[i] = uniform_dist(e1);
        }

        auto start = std::chrono::high_resolution_clock::now();
        quicksort(d, 0, N - 1);
        auto end = std::chrono::high_resolution_clock::now();

        double time_taken =
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

        duration += time_taken * 1e-06; 
    }

    std::cout << "Average time taken by program is : " << std::fixed
        << duration/cnt << std::setprecision(3) << " ms" << std::endl;
}
    

C++ output:

Average time taken by program is : 83.069237 ms
3 Likes

There were some double to int conversions in the c++ code. pivot and the temp variable t shall be integers, just like the values in the array. Yes, we shall always check the compiler warnings :slight_smile: . The corrected c++ code runs as fast as Julia:

Average time taken by program is : 63.753696 ms

The corrected code:

#include <chrono>
#include <vector>
#include <iostream>
#include <iomanip>
#include <random>

void quicksort(std::vector<int>& a, int lo, int hi) {
    int i = lo;
    int j = hi;
    while (i < hi) {
        int pivot = a[(lo + hi) / 2]; ######### pivot was of type double before
        // Partition
        while (i <= j) {
            while (a[i] < pivot) {
                i = i + 1;
            }
            while (a[j] > pivot) {
                j = j - 1;
            }
            if (i <= j) {
                int t = a[i]; ############ t was of type double before
                a[i] = a[j];
                a[j] = t;
                i = i + 1;
                j = j - 1;
            }
        }

        // Recursion for quicksort
        if (lo < j) {
            quicksort(a, lo, j);
        }
        lo = i;
        j = hi;
    }
}

int main() {
    int cnt = 60;
    float duration = 0.0;
    for (int i = 0; i < cnt; ++i) {
        int N = 1000000;
        std::vector<int> d(N);

        // Seed with a real random value, if available
        std::random_device r;
        // Choose a random mean between 1 and 6
        std::default_random_engine e1(r());
        std::uniform_int_distribution<int> uniform_dist(1, N);

        for (int i = 0; i < d.size(); i++)
        {  
            d[i] = uniform_dist(e1);
        }

        auto start = std::chrono::high_resolution_clock::now();
        quicksort(d, 0, N - 1);
        auto end = std::chrono::high_resolution_clock::now();

        double time_taken =
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

        duration += time_taken * 1e-06; 
    }

    std::cout << "Average time taken by program is : " << std::fixed
        << duration/cnt << std::setprecision(3) << " ms" << std::endl;
}
18 Likes

Dang, so Julia is only exactly as fast as C++?

18 Likes

You found a flaw in the C++ code, so they’re now the same speed. But there’s no reason why Julia can’t be faster than C++ in some cases, there’s nothing impossible about that.

Well-written (especially non-allocating) Julia code should be roughly the same speed as C++, some times a bit slower, some times a bit faster.

13 Likes

I’m not sure this is equivalent even after the type fixes. The C++ quicksort takes in a vector of int and int lo and int hi, which is all signed 32-bit integers IIRC. The Julia qsort! however was benchmarked with Vector{Int32}, Int, Int, the latter two being 64-bit integers on 64-bit systems. I’m not sure how indexing with Int64 versus Int32 is handled in Julia vs C++ though.

The int pivot = a[(lo + hi) / 2]; and pivot = a[(lo+hi)>>>1] lines looked different, and it doesn’t seem like the Julia compiler generates the same instructions for (lo+hi)>>>1 as div((lo+hi), 2) given signed integers, so maybe the C++ compiler isn’t either, is it possible to check the instructions like @code_llvm or @code_native does in Julia?

https://godbolt.org/ is the C++ equivalent (it also works for julia, but since @code_native exists, it’s not as useful there)

1 Like

The output looks different, which is a problem because I’m the opposite of versed in lower level instructions. But it looks like the C++ integer division has an add step that Julia’s div has but >>> doesn’t. The algorithm is calling for a division by 2, and the benchmark is stated to compare algorithms not hand-optimized code, so maybe it’s fairer to change _ >>> 1 to div(_, 2) in the Julia version.

oh the extra instruction is that for negative numbers you need to do something a little different for divide by 2.

Only 1.7.3 and 1.8.5 - I tried to get it ready for 1.9+, but ran into an issue with the way they pass arguments to julia:

I have modified the code on both sides such that they are nearly identical. Julia code uses uint32 indexes and the C++ code to uses int32 indexes. Its 62 ms for Julia vs 65 ms for C++. In my point of view, the difference is minor and I do consider this result as a tie.
I have also tried Julia with int32 indexes and it’s sligthly slower (64ms). C++ with uint32 indexes is slower than int32 (70ms).
The (lo+hi) >> 1 vs (lo+hi)/2 do not matter as this instruction is executed less often in our code.

Julia code:

using BenchmarkTools

# quicksort version with tail recursion: https://stackoverflow.com/questions/19094283/quicksort-and-tail-recursive-optimization
function qsort!(a,lo,hi)
    i, j = lo, hi
    _1 = eltype(hi)(1)
    @inbounds while i < hi
        pivot = a[(lo+hi) >> 1]
        while i <= j
            while a[i] < pivot; i += _1; end
            while a[j] > pivot; j -= _1; end
            if i <= j
                a[i], a[j] = a[j], a[i]
                i, j = i+_1, j-_1
            end
        end
        if lo < j; qsort!(a,lo,j); end
        lo, j = i, hi
    end
    return a
end

const N::UInt32 = 1000000

@benchmark qsort!(a, UInt32(1), UInt32(N)) setup=(a = rand(Int32.(1:N), N))
# @benchmark qsort!(a, Int32(1), Int32(N)) setup=(a = rand(Int32.(1:N), N)) # this is for int32 indices

C++ code:

typedef int ind_type;

void quicksort(std::vector<int>& a, ind_type lo, ind_type hi) {
    ind_type i = lo;
    ind_type j = hi;
    
    while (i < hi) {
        int pivot = a[(lo + hi) >> 1];
        // Partition
        while (i <= j) {
            while (a[i] < pivot) {
                i = i + 1;
            }
            while (a[j] > pivot) {
                j = j - 1;
            }
            if (i <= j) {
                int t = a[i];
                a[i] = a[j];
                a[j] = t;
                i = i + 1;
                j = j - 1;
            }
        }

        // Recursion for quicksort
        if (lo < j) {
            quicksort(a, lo, j);
        }
        lo = i;
        j = hi;
    }
}

int main() {
    #..........................................
    std::vector<int> d(N+1); # one more element       
    #..........................................
    quicksort(d, 1, N); # sort elements 1 to N; needed in case index type is uint32
    #..........................................
}

The runtimes are as follows:
Julia:

BenchmarkTools.Trial: 70 samples with 1 evaluation.
 Range (min … max):  59.434 ms … 65.182 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     61.876 ms              β”Š GC (median):    0.00%        
 Time  (mean Β± Οƒ):   62.056 ms Β±  1.364 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

             ▁▁▄▄  ▁▄ β–ˆ   β–„β–ˆ  ▄▁▁▁   β–„      ▁▁ ▁   ▁     ▁ β–„   
  β–†β–β–†β–β–β–†β–†β–β–†β–β–β–ˆβ–ˆβ–ˆβ–ˆβ–β–β–ˆβ–ˆβ–β–ˆβ–†β–β–†β–ˆβ–ˆβ–†β–†β–ˆβ–ˆβ–ˆβ–ˆβ–β–†β–β–ˆβ–β–†β–β–†β–†β–†β–ˆβ–ˆβ–β–ˆβ–†β–β–β–ˆβ–†β–†β–β–β–β–ˆβ–β–ˆβ–† ▁
  59.4 ms         Histogram: frequency by time        64.6 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

C++
Average time taken by program is : 65.399590 ms

4 Likes

I’d also really recommend Cthulhu for your introspection needs.
It’s great to be able to switch between typed Julia IR, LLVM, and native code, and then descend into non-inlined functions.

It’s really nice when you want to look at functions where the action is hidden behind a function call or two, like sum

using Cthulhu
x = rand(127);
@code_native sum(x)
@descend_code_typed debuginfo = :none annotate_source =
      false iswarn = true sum(x)

The @turbo macro also generally places the interesting code behind a __turbo__!.

I don’t like the β€œnew” @descend, so I use my own macro @d that’s equivalent to the above.
The important argument is annotate_source = false, you can ignore the rest.

It should show mapreduce_impl as the only option to descend into. Descend, and then hit N to view the native code of sum.

I’m not sure how to actually use @descend itself to see the code of interest…

2 Likes

Is this using MSVC?
The only experience I have with it is looking at godbolt every now and then.
It’s generally considered to produce worse code than GCC or Clang.
Clang uses LLVM, and will thus likely give comparable results to Julia for similar code.

1 Like

I’d argue that ”is Julia faster than C++ in a single-threaded Quicksort” is the wrong question.

The right question is β€œis Julia faster than C++ in sorting an array”.

You can see that I’m opening up multi-threaded execution and GPUs (if any).

Or single threaded SIMD… I found some active projects on Google and it looks like a rabbit hole to go down.

Note that Julia’s default sort uses a radix sort for large suitable inputs, which is algorithmically better, and should thus win for large enough sizes.

For SIMD c++ sorts, there are:

Some comparisons:

IIRC, some of vqsort’s small size performance issues have been fixed (vqsort is the first of the two links above).

Has anyone tried implementing a SIMD sort in Julia?

4 Likes

I did some work with SIMD sorting, check out my presentation at JuliaCon 2019. https://www.youtube.com/watch?v=_bvb8X4DT90&t=402s

Code on GitHub: GitHub - nlw0/ChipSort.jl: Sorting deeds done down the chip

RadixSort really tends to work great when it can be used, which is most cases in my experience. I’m still not sure what makes it so great, though, I suspect some compiler optimizations may be applied (such as vectorization), but I never checked what exactly happens. In my research I found out this CombSort algorithm gets very well optimized, but it can’t beat Radix.

OP might be interested in looking at that work. I’d like to point out that practical quicksort tends to work better if you switch to insertion sort for small lists, so you might like to modify both implementations and see if it gets even faster on both languages.