Squeeze out the last 10% of performance for a sorting function?

I wrote a straightforward implementation of the insertion sort algorithm in Julia below. Then I wrote the exact translation in C to compare with Julia. I noticed a performance gap of about 10% in favor of C implementation. I can’t figure out what is causing this gap or what prevents Julia from reaching C here. I guess the difference comes from arrays in C being fixed-size and stack allocated vs. dynamic arrays in Julia but I’m not sure.

I use -O3 --inline=yes --check-bounds=no in Julia benchmark and enable aggresive optimizations -Ofast -marh=native in C too.

function insertion_sort!(x)
    for i = 2:length(x)
        c = x[i]
        j = i
        while j > 1 && c < x[j-1] 
            x[j] = x[j-1]
            j -= 1
        end
        x[j] = c
    end
end

m = 10^5
x = rand(1:m,m)
@btime insertion_sort!(y) setup = y = copy(x)
  # 1.090 s (0 allocations: 0 bytes)

The C code is below for reference.

Summary
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define M 100000

void insertion_sort(long* x) {
    for (int i = 1; i < M; ++i) {
        int c = x[i];
        int j = i;
        while (j > 0 && c < x[j-1]) {
            x[j] = x[j-1];
            j -= 1;
        }
        x[j] = c;
    }
}

int main(int argc, char *argv[]) {
    long* x;
    x = malloc(M * sizeof(long));
    for (int i = 0; i < M; ++i)
        x[i] = M * (double)rand() / RAND_MAX + 1;
    for (int i = 0; i < 10; ++i) printf("%ld  ", x[i]);
    clock_t start = clock();
    insertion_sort(x);
    clock_t end = clock();
    double total = (double)(end - start) / CLOCKS_PER_SEC;
    printf("\nExecution time: %f\n", total);
    for (int i = 0; i < 10; ++i) printf("%ld  ", x[i]);
}

// 84019  39439  78310  79845  91165  19756  33523  76823  27778  55397  
// Execution time: 1.000000
// 1  2  2  3  3  5  5  6  9  10

What happens in the C code if instead of hard-coding M you set it at runtime as a parameter? The performance remains the same?

1 Like

That is interesting, I see a small difference of about 3% oscillating.

84019  39439  78310  79845  91165  19756  33523  76823  27778  55397  
Execution time: 1.031250
1  2  2  3  3  5  5  6  9  10

FWIW, for me (on a MacBook Pro 16) the C code and the Julia code are equally fast:

➜  ~/Desktop/asd  gcc-11 -Ofast -march=native code.c -o code

➜  ~/Desktop/asd  ./code
1  13154  75561  45866  53277  21896  4705  67887  67930  93470  
Execution time: 1.535767
1  1  1  1  4  4  5  5  6  7  %                                                                                                                                                                                                     

➜  ~/Desktop/asd  julia --check-bounds=no -E 'include("code.jl")'
  1.509 s (0 allocations: 0 bytes)
nothing

Oh, and of course there fluctuations for these timings on a noisy machine. In particular, you seem to be timing the C function just once.

1 Like

Thanks for sharing the results on Mac. I tested on both Windows 10 and WSL2 and got the same difference.

FWIW, I compared with Julia’s InsertionSort and got similar timings, so I guess the Julia version I wrote has no obvious flaws.

@btime sort!(y, alg=InsertionSort) setup = y = copy(x)
  # 1.122 s (0 allocations: 0 bytes)

Aren’t you missing an @inbounds?

1 Like

I set it globally --check-bounds=no when calling Julia command. Should it make any difference?

1 Like

I don’t see any difference here:

% gcc -Ofast -march=native insertion.c -o insertion
leandro@pitico:~/Drive/Work/JuliaPlay% ./insertion 
84019  39439  78310  79845  91165  19756  33523  76823  27778  55397  
Execution time: 1.277376
1  2  2  3  3  5  5  6  9  10 

% julia -E 'include("./insertion.jl")'
  1.299 s (0 allocations: 0 bytes)
  1.087 s (0 allocations: 0 bytes)
nothing

The second timing is providing the size of the loop as a parameter of the type (just checking that). It seems to make a difference.

Code:

Summary
function insertion_sort!(x)
    @inbounds for i = 2:length(x)
        c = x[i]
        j = i
        while j > 1 && c < x[j-1] 
            x[j] = x[j-1]
            j -= 1
        end
        x[j] = c
    end
end

struct X{M}
  x::Vector{Int}
end
Base.getindex(x::X,i) = x.x[i]
Base.setindex!(x::X,i,j) = x.x[j] = i

function insertion_sort2!(x::X{M}) where M
    xx = x.x
    @inbounds for i = 2:M
        c = xx[i]
        j = i
        while j > 1 && c < xx[j-1] 
            xx[j] = xx[j-1]
            j -= 1
        end
        xx[j] = c
    end
end

m = 10^5
x0 = rand(1:m,m);

using BenchmarkTools, Test

y = deepcopy(x0)
y2 = X{m}(deepcopy(x0))
insertion_sort!(y)
insertion_sort2!(y2) 
@test y == y2.x

@btime insertion_sort!(y) setup = y = deepcopy(x0)
@btime insertion_sort2!(y) setup = y = X{m}(deepcopy(x0))

Curiously (??), if I add --check-bounds=no to the execution, the performance is inverted:

%julia  --check-bounds=no -E 'include("./insertion.jl")'
  977.365 ms (0 allocations: 0 bytes)
  1.257 s (0 allocations: 0 bytes)

I have no idea why.

I get identical timings in both cases, that is weird!

  1.096 s (0 allocations: 0 bytes)
  1.095 s (0 allocations: 0 bytes)

I still feel that there is a need for fixed-size arrays in Julia. I’m aware of StaticArrays which are especially useful for short arrays. In the early days of Julia development, I remember a proposal by @timholy Do we want fixed-size arrays? #5857. Then, packages like FixedSizeArrays and ImmutableArrays appeared but were superceded by StaticArrays. For large arrays, though, there is no package (to my knowledge) providing such functionality.

I tested the same code in Nim and Fortran and got identical timings to the C code.

Nim: nim c -d:danger --passC:\"-march=native\" -t:-ffast-math -r ${file_name}
# @[75648, 14156, 70350, 57700, 2108, 52880, 99371, 30407, 34558, 58463]
# Execution time: 0.9907166957855225
# @[1, 1, 2, 2, 3, 5, 5, 7, 8, 9]
Fortran: gfortran -Ofast ${file}
! Time: 1.00000
!     1    1    2    4    5    6    6    6    7    8

which suggest there is a room for improvement of Julia arrays. For Nim, it’s not a surprise since it uses the gcc backend, but I guess nothing should prevent Julia from acheiving the same performance.

Nim and Fortran codes for reference:

import times, random

template runTimed(body: untyped) =
  let t1 = epochTime(); body; echo "Execution time: ", epochTime() - t1

proc insertion_sort[T](a: var openArray[T]) =
  for i in 1..a.high:
    let c = a[i]
    var j = i
    while j > 0 and c < a[j-1]:
      a[j] = a[j-1]
      dec j
    a[j] = c

proc main(m: int) =
  var X = newSeq[int](m)
  for i in 0 ..< m:
    X[i] = rand(m)
  echo X[0..9]
  runTimed:
    insertion_sort X
  echo X[0..9]

main(100_000)
program test_insertion
    implicit none
    integer, parameter :: m = 100000, i64 = selected_int_kind(18)
    integer(i64) :: x(m)
    integer :: i, t0, t1, count_rate, count_max

    x = [(m * rand() + 1, i = 1,m)]
    call system_clock(t0, count_rate, count_max)
    call insertion_sort(x)
    call system_clock(t1)
    print '(a,f7.5)', 'Time: ', real(t1-t0)/count_rate
    print *, x(1:10)

contains

subroutine insertion_sort(x)
    integer(i64) :: x(:)
    integer :: i, j, c
    do i = 2,size(x)
        c = x(i)
        j = i
        do while (j > 1) 
            if (c >= x(j-1)) exit
            x(j) = x(j-1)
            j = j - 1
        end do
        x(j) = c
    end do
end subroutine insertion_sort

end

What are the limitations of StaticArrays.jl for large arrays?

Btw, looks like you are just trying to understand performance implications, but for the fastest sorting algorithm you can look at

SortingAlgorithms.jl and SortingLab.fsort

Use RadixSort e.g.

using SortingAlgorithms: RadixSort

sort(x, alg = RadixSort)

using SortingLab: fsort

fsort(x)

StaticArrays are useful until some threshold (say 100 elements?) is reached, above which StaticArrays behave like normal arrays performance-wise. I mean, to my understanding, Julia arrays are more like allocatable arrays in Fortran. Whereas normal arrays in Fortran are fixed-size and stack allocated, which may enable further compiler optimizations. I’m searching for a counterpart in Julia. I remember once I heared Keno talking about some proposal along those lines for Julia 2.0.

1 Like

Yes, I know that SortingAlgorithms.jl has much better algorithms suitable for this problem. Even Julia’s standard QuickSort takes only 4 ms to sort this 100k vector. For benchmarking purposes, people usually test the same poor algorithms across different languages. I try to understand why Julia couldn’t close that 10% gap in this benchmark, and my gut feeling pointed to arrays.

1 Like

https://chriselrod.github.io/StrideArrays.jl/dev/stack_allocation/

4 Likes

I cannot really reproduce these differences, the time measures here, of any of the codes, fluctuate far more than the differences. In your machine do you get reliably an execution time of exactly 1s for every Fortran run? That is somewhat weird.

The fluctuations can happen due to randomness of the array being sorted. I unified the array by reading from the same file in all languages and I get consistent timings. The gap is more like only 6%.

Julia:
#  1.096 s (0 allocations: 0 bytes)
Others:
# @[20644, 38472, 78042, 7082, 90072, 6454, 13527, 37036, 82249, 20941]
# Execution time: 1.030217885971069
# @[2, 3, 3, 4, 5, 6, 6, 7, 7, 8] 
2 Likes

StrideArrays seems very interesting. In my case though, I find no real change in timings, at most 1.5% in some runs or within noise. Besides, the construction of the x vector in this case is weird. I wrote this x = trunc.(Int, m * (@StrideArray rand(m)) .+ 1 ) and couldn’t do better. Also, it takes too long to compile.

Why not do setup = (y = rand(1:m,m))? Would make the benchmark more varied and not reliant on a specific order on the julia side.

1 Like

Yes, good idea. I get consistent results in Julia 1.096 ± 0.002 s (0 allocations: 0 bytes).