# Why is Julia faster than C++ for quicksort?

See the following pieces of code for quicksort, adapted from GitHub - JuliaLang/Microbenchmarks: Microbenchmarks comparing the Julia Programming language with other languages. As you can see in the benchmark output, Julia runs the sort in about 62 ms while C++ takes about 83 ms, so Julia is about 25% faster than C++. How is this possible?
The Julia version is 1.9.2.
I have a Windows 10 machine and compiled the C++ code with Visual Studio 2019 in release mode.

```julia
using BenchmarkTools

# quicksort version with tail recursion: https://stackoverflow.com/questions/19094283/quicksort-and-tail-recursive-optimization
function qsort!(a, lo, hi)
    i, j = lo, hi
    @inbounds while i < hi
        pivot = a[(lo+hi)>>>1]
        while i <= j
            while a[i] < pivot; i += 1; end
            while a[j] > pivot; j -= 1; end
            if i <= j
                a[i], a[j] = a[j], a[i]
                i, j = i+1, j-1
            end
        end
        if lo < j; qsort!(a, lo, j); end
        lo, j = i, hi
    end
    return a
end

const N::Int32 = 1000000

@benchmark qsort!(a, 1, N) setup=(a = rand(Int32.(1:N), N))
```

Julia output:

```
BenchmarkTools.Trial: 69 samples with 1 evaluation.
 Range (min … max):  58.021 ms … 72.308 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     62.288 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   62.607 ms ±  2.842 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  58 ms           Histogram: frequency by time        69.6 ms <

 Memory estimate: 208 bytes, allocs estimate: 13.
```

C++ code:

```cpp
#include <chrono>
#include <vector>
#include <iostream>
#include <iomanip>
#include <random>

void quicksort(std::vector<int>& a, int lo, int hi) {
    int i = lo;
    int j = hi;
    while (i < hi) {
        double pivot = a[(lo + hi) / 2];
        // Partition
        while (i <= j) {
            while (a[i] < pivot) {
                i = i + 1;
            }
            while (a[j] > pivot) {
                j = j - 1;
            }
            if (i <= j) {
                double t = a[i];
                a[i] = a[j];
                a[j] = t;
                i = i + 1;
                j = j - 1;
            }
        }

        // Recursion for quicksort
        if (lo < j) {
            quicksort(a, lo, j);
        }
        lo = i;
        j = hi;
    }
}

int main() {
    int cnt = 60;
    float duration = 0.0;
    for (int i = 0; i < cnt; ++i) {
        int N = 1000000;
        std::vector<int> d(N);

        // Seed with a real random value, if available
        std::random_device r;
        // Uniformly distributed values in 1..N
        std::default_random_engine e1(r());
        std::uniform_int_distribution<int> uniform_dist(1, N);

        for (int i = 0; i < d.size(); i++) {
            d[i] = uniform_dist(e1);
        }

        auto start = std::chrono::high_resolution_clock::now();
        quicksort(d, 0, N - 1);
        auto end = std::chrono::high_resolution_clock::now();

        double time_taken =
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

        duration += time_taken * 1e-06;
    }

    std::cout << "Average time taken by program is : " << std::fixed
              << duration / cnt << std::setprecision(3) << " ms" << std::endl;
}
```

C++ output:

```
Average time taken by program is : 83.069237 ms
```
3 Likes

There were some `double` to `int` conversions in the C++ code: `pivot` and the temporary `t` should be `int`, just like the values in the array. Yes, we should always check the compiler warnings. The corrected C++ code runs as fast as Julia:

`Average time taken by program is : 63.753696 ms`

The corrected code:

```cpp
#include <chrono>
#include <vector>
#include <iostream>
#include <iomanip>
#include <random>

void quicksort(std::vector<int>& a, int lo, int hi) {
    int i = lo;
    int j = hi;
    while (i < hi) {
        int pivot = a[(lo + hi) / 2];  // pivot was of type double before
        // Partition
        while (i <= j) {
            while (a[i] < pivot) {
                i = i + 1;
            }
            while (a[j] > pivot) {
                j = j - 1;
            }
            if (i <= j) {
                int t = a[i];  // t was of type double before
                a[i] = a[j];
                a[j] = t;
                i = i + 1;
                j = j - 1;
            }
        }

        // Recursion for quicksort
        if (lo < j) {
            quicksort(a, lo, j);
        }
        lo = i;
        j = hi;
    }
}

int main() {
    int cnt = 60;
    float duration = 0.0;
    for (int i = 0; i < cnt; ++i) {
        int N = 1000000;
        std::vector<int> d(N);

        // Seed with a real random value, if available
        std::random_device r;
        // Uniformly distributed values in 1..N
        std::default_random_engine e1(r());
        std::uniform_int_distribution<int> uniform_dist(1, N);

        for (int i = 0; i < d.size(); i++) {
            d[i] = uniform_dist(e1);
        }

        auto start = std::chrono::high_resolution_clock::now();
        quicksort(d, 0, N - 1);
        auto end = std::chrono::high_resolution_clock::now();

        double time_taken =
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();

        duration += time_taken * 1e-06;
    }

    std::cout << "Average time taken by program is : " << std::fixed
              << duration / cnt << std::setprecision(3) << " ms" << std::endl;
}
```
18 Likes

Dang, so Julia is only exactly as fast as C++?

18 Likes

You found a flaw in the C++ code, so they're now the same speed. But there's no reason why Julia can't be faster than C++ in some cases; there's nothing impossible about that.

Well-written (especially non-allocating) Julia code should be roughly the same speed as C++, sometimes a bit slower, sometimes a bit faster.

13 Likes

Iβm not sure this is equivalent even after the type fixes. The C++ `quicksort` takes in a vector of `int` and `int lo` and `int hi`, which is all signed 32-bit integers IIRC. The Julia `qsort!` however was benchmarked with `Vector{Int32}, Int, Int`, the latter two being 64-bit integers on 64-bit systems. Iβm not sure how indexing with `Int64` versus `Int32` is handled in Julia vs C++ though.

The `int pivot = a[(lo + hi) / 2];` and `pivot = a[(lo+hi)>>>1]` lines looked different, and it doesn't seem like the Julia compiler generates the same instructions for `(lo+hi)>>>1` as for `div((lo+hi), 2)` given signed integers, so maybe the C++ compiler isn't either. Is it possible to check the instructions, the way `@code_llvm` or `@code_native` do in Julia?

https://godbolt.org/ is the C++ equivalent (it also works for Julia, but since `@code_native` exists, it's not as useful there)

1 Like

The output looks different, which is a problem because I'm the opposite of versed in low-level instructions. But it looks like the C++ integer division has an add step that Julia's `div` has but `>>>` doesn't. The algorithm calls for a division by 2, and the benchmark is stated to compare algorithms, not hand-optimized code, so maybe it's fairer to change `_ >>> 1` to `div(_, 2)` in the Julia version.

Oh, the extra instruction is there because dividing by 2 works a little differently for negative numbers: signed division rounds toward zero, while an arithmetic right shift rounds toward negative infinity, so the compiler adds a fix-up before shifting.

Only 1.7.3 and 1.8.5 are available - I tried to get it ready for 1.9+, but ran into an issue with the way they pass arguments to Julia:

https://github.com/compiler-explorer/compiler-explorer/pull/4595#issuecomment-1411891590

I have modified the code on both sides so that they are nearly identical: the Julia code uses UInt32 indices and the C++ code uses int32 indices. It's 62 ms for Julia vs 65 ms for C++. In my view, the difference is minor and I consider this result a tie.
I also tried Julia with Int32 indices, and it's slightly slower (64 ms). C++ with uint32 indices is slower than with int32 (70 ms).
The `(lo+hi) >> 1` vs `(lo+hi)/2` choice does not matter, as this instruction is executed relatively rarely in our code.

Julia code:

```julia
using BenchmarkTools

# quicksort version with tail recursion: https://stackoverflow.com/questions/19094283/quicksort-and-tail-recursive-optimization
function qsort!(a, lo, hi)
    i, j = lo, hi
    _1 = eltype(hi)(1)
    @inbounds while i < hi
        pivot = a[(lo+hi) >> 1]
        while i <= j
            while a[i] < pivot; i += _1; end
            while a[j] > pivot; j -= _1; end
            if i <= j
                a[i], a[j] = a[j], a[i]
                i, j = i+_1, j-_1
            end
        end
        if lo < j; qsort!(a, lo, j); end
        lo, j = i, hi
    end
    return a
end

const N::UInt32 = 1000000

@benchmark qsort!(a, UInt32(1), UInt32(N)) setup=(a = rand(Int32.(1:N), N))
# @benchmark qsort!(a, Int32(1), Int32(N)) setup=(a = rand(Int32.(1:N), N)) # this is for Int32 indices
```

C++ code:

```cpp
typedef int ind_type;

void quicksort(std::vector<int>& a, ind_type lo, ind_type hi) {
    ind_type i = lo;
    ind_type j = hi;

    while (i < hi) {
        int pivot = a[(lo + hi) >> 1];
        // Partition
        while (i <= j) {
            while (a[i] < pivot) {
                i = i + 1;
            }
            while (a[j] > pivot) {
                j = j - 1;
            }
            if (i <= j) {
                int t = a[i];
                a[i] = a[j];
                a[j] = t;
                i = i + 1;
                j = j - 1;
            }
        }

        // Recursion for quicksort
        if (lo < j) {
            quicksort(a, lo, j);
        }
        lo = i;
        j = hi;
    }
}

int main() {
    // ...
    std::vector<int> d(N + 1);  // one more element
    // ...
    quicksort(d, 1, N);  // sort elements 1 to N; needed in case the index type is uint32
    // ...
}
```

The runtimes are as follows:
Julia:

```
BenchmarkTools.Trial: 70 samples with 1 evaluation.
 Range (min … max):  59.434 ms … 65.182 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     61.876 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   62.056 ms ±  1.364 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  59.4 ms         Histogram: frequency by time        64.6 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

C++
`Average time taken by program is : 65.399590 ms`

4 Likes

Iβd also really recommend Cthulhu for your introspection needs.
Itβs great to be able to switch between typed Julia IR, LLVM, and native code, and then descend into non-inlined functions.

Itβs really nice when you want to look at functions where the action is hidden behind a function call or two, like `sum`

```julia
using Cthulhu
x = rand(127);
@code_native sum(x)
@descend_code_typed debuginfo = :none annotate_source = false iswarn = true sum(x)
```

The `@turbo` macro also generally places the interesting code behind a `__turbo__!`.

I donβt like the βnewβ `@descend`, so I use my own macro `@d` thatβs equivalent to the above.
The important argument is `annotate_source = false`, you can ignore the rest.

It should show `mapreduce_impl` as the only option to descend into. Descend, and then hit `N` to view the native code of `sum`.

Iβm not sure how to actually use `@descend` itself to see the code of interestβ¦

2 Likes

Is this using MSVC?
The only experience I have with it is looking at godbolt every now and then.
Itβs generally considered to produce worse code than GCC or Clang.
Clang uses LLVM, and will thus likely give comparable results to Julia for similar code.

1 Like

Iβd argue that βis Julia faster than C++ in a single-threaded Quicksortβ is the wrong question.

The right question is βis Julia faster than C++ in sorting an arrayβ.

You can see that I'm opening up multi-threaded execution and GPUs (if any).

Or single-threaded SIMD… I found some active projects on Google and it looks like a rabbit hole to go down.

Note that Juliaβs default sort uses a radix sort for large suitable inputs, which is algorithmically better, and should thus win for large enough sizes.

For SIMD C++ sorts, there are:

Some comparisons:

IIRC, some of vqsort's small-size performance issues have been fixed (vqsort is the first of the two links above).

Has anyone tried implementing a SIMD sort in Julia?

4 Likes

I did some work with SIMD sorting, check out my presentation at JuliaCon 2019. https://www.youtube.com/watch?v=_bvb8X4DT90&t=402s

Code on GitHub: GitHub - nlw0/ChipSort.jl: Sorting deeds done down the chip

RadixSort really tends to work great when it can be used, which is most cases in my experience. I'm still not sure what makes it so great, though; I suspect some compiler optimizations may be applied (such as vectorization), but I never checked what exactly happens. In my research I found out this CombSort algorithm gets very well optimized, but it can't beat Radix.

OP might be interested in looking at that work. I'd like to point out that practical quicksort tends to work better if you switch to insertion sort for small lists, so you might like to modify both implementations and see if it gets even faster in both languages.