Why is Julia faster than C++ for quicksort?

mariusd · August 10, 2023, 9:46am

See the following pieces of code for quicksort, adapted from GitHub - JuliaLang/Microbenchmarks: Microbenchmarks comparing the Julia Programming language with other languages. As you can see in the benchmark output, julia runs the sorting in about 62 ms while C++ does it in about 83 ms. So julia is about 25% faster than C++. How is this possible?
julia version is 1.9.2
I have a Windows10 machine and compiled c++ with VisualStudio2019 in release mode.

using BenchmarkTools

# quicksort version with tail recursion: https://stackoverflow.com/questions/19094283/quicksort-and-tail-recursive-optimization
function qsort!(a,lo,hi)
    i, j = lo, hi
    @inbounds while i < hi
        pivot = a[(lo+hi)>>>1]
        while i <= j
            while a[i] < pivot; i += 1; end
            while a[j] > pivot; j -= 1; end
            if i <= j
                a[i], a[j] = a[j], a[i]
                i, j = i+1, j-1
            end
        end
        if lo < j; qsort!(a,lo,j); end
        lo, j = i, hi
    end
    return a
end

const N::Int32 = 1000000

@benchmark qsort!(a, 1, N) setup=(a = rand(Int32.(1:N), N))

Julia output:

BenchmarkTools.Trial: 69 samples with 1 evaluation.
 Range (min … max):  58.021 ms … 72.308 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     62.288 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   62.607 ms ±  2.842 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▃      ▁   ▃ ▆ ▁█    █
  ▄▁▁▄▇█▄▄▄▇▁▇█▁▄▄█▁█▁██▇▇▄▄█▁▁▇▁▄▇▁▇▄▁▇▄▁▄▁▄▁▁▁▄▁▁▁▁▁▄▄▄▁▁▁▄ ▁
  58 ms           Histogram: frequency by time        69.6 ms <

 Memory estimate: 208 bytes, allocs estimate: 13.

C++ code:

#include <chrono>
#include <vector>
#include <iostream>
#include <iomanip>
#include <random>

void quicksort(std::vector<int>& a, int lo, int hi) {
    int i = lo;
    int j = hi;
    while (i < hi) {
        double pivot = a[(lo + hi) / 2];
        // Partition
        while (i <= j) {
            while (a[i] < pivot) {
                i = i + 1;
            }
            while (a[j] > pivot) {
                j = j - 1;
            }
            if (i <= j) {
                double t = a[i];
                a[i] = a[j];
                a[j] = t;
                i = i + 1;
                j = j - 1;
            }
        }

        // Recursion for quicksort
        if (lo < j) {
            quicksort(a, lo, j);
        }
        lo = i;
        j = hi;
    }
}

int main() {
    int cnt = 60;
    float duration = 0.0;
    for (int i = 0; i < cnt; ++i) {
        int N = 1000000;
        std::vector<int> d(N);

        // Seed with a real random value, if available
        std::random_device r;
        // Choose a random mean between 1 and 6
        std::default_random_engine e1(r());
        std::uniform_int_distribution<int> uniform_dist(1, N);

        for (int i = 0; i < d.size(); i++)
        {    
            d[i] = uniform_dist(e1);
        }

        auto start = std::chrono::high_resolution_clock::now();
        quicksort(d, 0, N - 1);
        auto end = std::chrono::high_resolution_clock::now();

        double time_taken =
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

        duration += time_taken * 1e-06; 
    }

    std::cout << "Average time taken by program is : " << std::fixed
        << duration/cnt << std::setprecision(3) << " ms" << std::endl;
}

C++ output:

Average time taken by program is : 83.069237 ms

mariusd · August 10, 2023, 10:50am

There were some double to int conversions in the c++ code. pivot and the temp variable t shall be integers, just like the values in the array. Yes, we shall always check the compiler warnings . The corrected c++ code runs as fast as Julia:

Average time taken by program is : 63.753696 ms

The corrected code:

#include <chrono>
#include <vector>
#include <iostream>
#include <iomanip>
#include <random>

void quicksort(std::vector<int>& a, int lo, int hi) {
    int i = lo;
    int j = hi;
    while (i < hi) {
        int pivot = a[(lo + hi) / 2]; ######### pivot was of type double before
        // Partition
        while (i <= j) {
            while (a[i] < pivot) {
                i = i + 1;
            }
            while (a[j] > pivot) {
                j = j - 1;
            }
            if (i <= j) {
                int t = a[i]; ############ t was of type double before
                a[i] = a[j];
                a[j] = t;
                i = i + 1;
                j = j - 1;
            }
        }

        // Recursion for quicksort
        if (lo < j) {
            quicksort(a, lo, j);
        }
        lo = i;
        j = hi;
    }
}

int main() {
    int cnt = 60;
    float duration = 0.0;
    for (int i = 0; i < cnt; ++i) {
        int N = 1000000;
        std::vector<int> d(N);

        // Seed with a real random value, if available
        std::random_device r;
        // Choose a random mean between 1 and 6
        std::default_random_engine e1(r());
        std::uniform_int_distribution<int> uniform_dist(1, N);

        for (int i = 0; i < d.size(); i++)
        {  
            d[i] = uniform_dist(e1);
        }

        auto start = std::chrono::high_resolution_clock::now();
        quicksort(d, 0, N - 1);
        auto end = std::chrono::high_resolution_clock::now();

        double time_taken =
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

        duration += time_taken * 1e-06; 
    }

    std::cout << "Average time taken by program is : " << std::fixed
        << duration/cnt << std::setprecision(3) << " ms" << std::endl;
}

tbeason · August 10, 2023, 12:42pm

Dang, so Julia is only exactly as fast as C++?

DNF · August 10, 2023, 1:43pm

You found a flaw in the C++ code, so they’re now the same speed. But there’s no reason why Julia can’t be faster than C++ in some cases, there’s nothing impossible about that.

Well-written (especially non-allocating) Julia code should be roughly the same speed as C++, some times a bit slower, some times a bit faster.

Benny · August 10, 2023, 7:35pm

I’m not sure this is equivalent even after the type fixes. The C++ quicksort takes in a vector of int and int lo and int hi, which is all signed 32-bit integers IIRC. The Julia qsort! however was benchmarked with Vector{Int32}, Int, Int, the latter two being 64-bit integers on 64-bit systems. I’m not sure how indexing with Int64 versus Int32 is handled in Julia vs C++ though.

The int pivot = a[(lo + hi) / 2]; and pivot = a[(lo+hi)>>>1] lines looked different, and it doesn’t seem like the Julia compiler generates the same instructions for (lo+hi)>>>1 as div((lo+hi), 2) given signed integers, so maybe the C++ compiler isn’t either, is it possible to check the instructions like @code_llvm or @code_native does in Julia?

Oscar_Smith · August 10, 2023, 7:43pm

https://godbolt.org/ is the C++ equivalent (it also works for julia, but since @code_native exists, it’s not as useful there)

Benny · August 10, 2023, 8:00pm

The output looks different, which is a problem because I’m the opposite of versed in lower level instructions. But it looks like the C++ integer division has an add step that Julia’s div has but >>> doesn’t. The algorithm is calling for a division by 2, and the benchmark is stated to compare algorithms not hand-optimized code, so maybe it’s fairer to change _ >>> 1 to div(_, 2) in the Julia version.

Oscar_Smith · August 10, 2023, 8:02pm

oh the extra instruction is that for negative numbers you need to do something a little different for divide by 2.

Sukera · August 10, 2023, 8:09pm

Only 1.7.3 and 1.8.5 - I tried to get it ready for 1.9+, but ran into an issue with the way they pass arguments to julia:

mariusd · August 10, 2023, 11:58pm

I have modified the code on both sides such that they are nearly identical. Julia code uses uint32 indexes and the C++ code to uses int32 indexes. Its 62 ms for Julia vs 65 ms for C++. In my point of view, the difference is minor and I do consider this result as a tie.
I have also tried Julia with int32 indexes and it’s sligthly slower (64ms). C++ with uint32 indexes is slower than int32 (70ms).
The (lo+hi) >> 1 vs (lo+hi)/2 do not matter as this instruction is executed less often in our code.

Julia code:

using BenchmarkTools

# quicksort version with tail recursion: https://stackoverflow.com/questions/19094283/quicksort-and-tail-recursive-optimization
function qsort!(a,lo,hi)
    i, j = lo, hi
    _1 = eltype(hi)(1)
    @inbounds while i < hi
        pivot = a[(lo+hi) >> 1]
        while i <= j
            while a[i] < pivot; i += _1; end
            while a[j] > pivot; j -= _1; end
            if i <= j
                a[i], a[j] = a[j], a[i]
                i, j = i+_1, j-_1
            end
        end
        if lo < j; qsort!(a,lo,j); end
        lo, j = i, hi
    end
    return a
end

const N::UInt32 = 1000000

@benchmark qsort!(a, UInt32(1), UInt32(N)) setup=(a = rand(Int32.(1:N), N))
# @benchmark qsort!(a, Int32(1), Int32(N)) setup=(a = rand(Int32.(1:N), N)) # this is for int32 indices

C++ code:

typedef int ind_type;

void quicksort(std::vector<int>& a, ind_type lo, ind_type hi) {
    ind_type i = lo;
    ind_type j = hi;
    
    while (i < hi) {
        int pivot = a[(lo + hi) >> 1];
        // Partition
        while (i <= j) {
            while (a[i] < pivot) {
                i = i + 1;
            }
            while (a[j] > pivot) {
                j = j - 1;
            }
            if (i <= j) {
                int t = a[i];
                a[i] = a[j];
                a[j] = t;
                i = i + 1;
                j = j - 1;
            }
        }

        // Recursion for quicksort
        if (lo < j) {
            quicksort(a, lo, j);
        }
        lo = i;
        j = hi;
    }
}

int main() {
    #..........................................
    std::vector<int> d(N+1); # one more element       
    #..........................................
    quicksort(d, 1, N); # sort elements 1 to N; needed in case index type is uint32
    #..........................................
}

The runtimes are as follows:
Julia:

BenchmarkTools.Trial: 70 samples with 1 evaluation.
 Range (min … max):  59.434 ms … 65.182 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     61.876 ms              ┊ GC (median):    0.00%        
 Time  (mean ± σ):   62.056 ms ±  1.364 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▁▁▄▄  ▁▄ █   ▄█  ▄▁▁▁   ▄      ▁▁ ▁   ▁     ▁ ▄   
  ▆▁▆▁▁▆▆▁▆▁▁████▁▁██▁█▆▁▆██▆▆████▁▆▁█▁▆▁▆▆▆██▁█▆▁▁█▆▆▁▁▁█▁█▆ ▁
  59.4 ms         Histogram: frequency by time        64.6 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

C++
Average time taken by program is : 65.399590 ms

Elrod · August 11, 2023, 3:59am

I’d also really recommend Cthulhu for your introspection needs.
It’s great to be able to switch between typed Julia IR, LLVM, and native code, and then descend into non-inlined functions.

It’s really nice when you want to look at functions where the action is hidden behind a function call or two, like sum

using Cthulhu
x = rand(127);
@code_native sum(x)
@descend_code_typed debuginfo = :none annotate_source =
      false iswarn = true sum(x)

The @turbo macro also generally places the interesting code behind a __turbo__!.

I don’t like the “new” @descend, so I use my own macro @d that’s equivalent to the above.
The important argument is annotate_source = false, you can ignore the rest.

It should show mapreduce_impl as the only option to descend into. Descend, and then hit N to view the native code of sum.

I’m not sure how to actually use @descend itself to see the code of interest…

Elrod · August 11, 2023, 4:05am

Is this using MSVC?
The only experience I have with it is looking at godbolt every now and then.
It’s generally considered to produce worse code than GCC or Clang.
Clang uses LLVM, and will thus likely give comparable results to Julia for similar code.

StatisticalMouse · August 12, 2023, 7:44am

I’d argue that ”is Julia faster than C++ in a single-threaded Quicksort” is the wrong question.

The right question is “is Julia faster than C++ in sorting an array”.

You can see that I’m opening up multi-threaded execution and GPUs (if any).

greatpet · August 12, 2023, 9:27am

Or single threaded SIMD… I found some active projects on Google and it looks like a rabbit hole to go down.

Elrod · August 12, 2023, 2:15pm

Note that Julia’s default sort uses a radix sort for large suitable inputs, which is algorithmically better, and should thus win for large enough sizes.

For SIMD c++ sorts, there are:

Some comparisons:

github.com

Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md

# 10~17x faster than what? A performance analysis of Intel's x86-simd-sort (AVX-512)

Author: Lukas Bergdoll @Voultapher  
Date: 10-06-2023 (DD-MM-YYYY)

This is a performance analysis of the recently popularized [[1](https://www.phoronix.com/news/Intel-AVX-512-Quicksort-Numpy)] Intel AVX-512 sort implementation.

> Intel Publishes Blazing Fast AVX-512 Sorting Library, Numpy Switching To It For 10~17x Faster Sorts

This analysis will look at the performance of Intel' x86-simd-sort and how it compares to various other generic sort implementations such as the C++ standard library `std::sort` and vqsort another high-performance manually vectorized sort implementation. Breaking down complex performance characteristics into a single number is tricky and might have little predictive power. This analysis aims to put that '10~17x' number into perspective and how it relates to other high-performance implementations.

TL;DR: Benchmarking is tricky. If you are using x86-simd-sort, you can get better overall performance and avoid catastrophic scaling for certain input patterns by using vqsort + Clang. In addition it is shown that hardware specific manual vectorization with wide AVX-512 SIMD is not the only way to write efficient software. ipnsort demonstrates comparable performance to x86-simd-sort while being generic, optimized for more than only peak performance and only using up to SSE2 instructions.

---

Bias disclaimer. The author of this analysis is the author of ipnsort.

The words sort implementation and sort algorithm, are expressly *not* used interchangeably. Practically all high-performance implementations are hybrids, using multiple sort algorithms. As such, the words 'sort algorithm' will only be used to talk about the algorithmic nature of specific building blocks.

Graphs with logarithmic axis are marked as such, these are primarily useful to examine the change of a property, *not* its absolute values.

This file has been truncated. show original

IIRC, some of vqsort’s small size performance issues have been fixed (vqsort is the first of the two links above).

Has anyone tried implementing a SIMD sort in Julia?

xor0110 · August 15, 2023, 3:24pm

I did some work with SIMD sorting, check out my presentation at JuliaCon 2019. https://www.youtube.com/watch?v=_bvb8X4DT90&t=402s

Code on GitHub: GitHub - nlw0/ChipSort.jl: Sorting deeds done down the chip

RadixSort really tends to work great when it can be used, which is most cases in my experience. I’m still not sure what makes it so great, though, I suspect some compiler optimizations may be applied (such as vectorization), but I never checked what exactly happens. In my research I found out this CombSort algorithm gets very well optimized, but it can’t beat Radix.

OP might be interested in looking at that work. I’d like to point out that practical quicksort tends to work better if you switch to insertion sort for small lists, so you might like to modify both implementations and see if it gets even faster on both languages.

Topic		Replies	Views
When Julia gets within 1-3x of C/C++ speed, why is C/C++ usually faster? Performance	11	2279	November 2, 2020
Recursive Fibonacci Benchmark using top languages on Github Performance recursion	11	2802	September 29, 2018
Julia Performance vs. C Performance	8	4106	June 15, 2018
`sort` vs `copy`+`sort!` Performance sort	3	381	December 20, 2022
Compare julia sum to a cpp implementation - julia is extremely slow?! Performance question	35	1760	October 7, 2019

Why is Julia faster than C++ for quicksort?

Related topics