Squeeze out the last 10% of performance for a sorting function?

Here, I get:

% gcc -Ofast -march=native insertion.c -o insertion
% ./insertion 
84019  39439  78310  79845  91165  19756  33523  76823  27778  55397  
Execution time: 1.028668

% julia --math-mode=fast --check-bounds=no insertion.jl 
  977.059 ms (0 allocations: 0 bytes)

% julia --math-mode=fast insertion.jl 
  1.011 s (0 allocations: 0 bytes)

% julia --check-bounds=no insertion.jl 
  983.802 ms (0 allocations: 0 bytes)

% julia insertion.jl 
  1.255 s (0 allocations: 0 bytes)

I don’t understand why --check-bounds=no makes such a difference, given that I am using @inbounds:

function insertion_sort!(x)
    @inbounds for i = 2:length(x)
        c = x[i]
        j = i
        while j > 1 && c < x[j-1] 
            x[j] = x[j-1]
            j -= 1
        end
        x[j] = c
    end
end
m = 10^5
using BenchmarkTools
@btime insertion_sort!(y) setup = (y = rand(1:m,m));

I’m surprized you find any differences at all. On my computer, all combinations have the same performance. BTW, in recent versions of Julia --math-mode=fast has no effect, it’s overriden by the @fastmath macro. On the other hand, --check-bounds=no sets @inbounds globally everywhere.
My results on Windows 10 and WSL2:

Windows 10
  # 1.095 s (0 allocations: 0 bytes) : @inbounds, "--check-bounds=no"
  # 1.096 s (0 allocations: 0 bytes) : "--check-bounds=no"
  # 1.094 s (0 allocations: 0 bytes) : @inbounds
  # 1.093 s (0 allocations: 0 bytes) : @inbounds @fastmath
  # 1.093 s (0 allocations: 0 bytes) : "--check-bounds=no --math-mode=fast"
WSL2
  # 1.094 s (0 allocations: 0 bytes) : @inbounds, "--check-bounds=no"
  # 1.094 s (0 allocations: 0 bytes) : "--check-bounds=no"
  # 1.095 s (0 allocations: 0 bytes) : @inbounds
  # 1.096 s (0 allocations: 0 bytes) : @inbounds @fastmath
  # 1.093 s (0 allocations: 0 bytes) : "--check-bounds=no --math-mode=fast"

Whether you use @inbounds or --check-bounds=no really shouldn’t make a difference. (Can confirm it locally, similar to what @Seif_Shebl posted.)

1 Like

Yes, that is strange. But here it does (Linux Mint 20.2 on a Samsung i7):

% julia insertion.jl 
  1.254 s (0 allocations: 0 bytes)

% julia --check-bounds=no insertion.jl 
  1.004 s (0 allocations: 0 bytes)

% julia insertion.jl 
  1.266 s (0 allocations: 0 bytes)

% julia --check-bounds=no insertion.jl 
  975.647 ms (0 allocations: 0 bytes)

% more insertion.jl 
function insertion_sort!(x)
    @inbounds for i = 2:length(x)
        c = x[i]
        j = i
        while j > 1 && c < x[j-1] 
            x[j] = x[j-1]
            j -= 1
        end
        x[j] = c
    end
end
m = 10^5
using BenchmarkTools
@btime insertion_sort!(y) setup = (y = rand(1:m,m));

% julia --version
julia version 1.6.1

This is kind of a naive question, but could this have anything to do with the type of integer for i and j? I believe Julia Int is defined by machine, e.g.

julia> typeof(1)
Int64

whereas gcc might default to 32-bit int?

These are kind of trivial operations with i and j, but everything is trivial here, so maybe it matters? I am at least a couple decades out of date, but it used to be that Int32 was faster for many operations. No idea about modern architectures and compilers though. It might even go the other way?

1 Like

For C and Flang, it makes no difference if I use int64 for all ints, but gfortran is now the same as Julia and Julia now beats Nim. So now, only the C compiler and Fortran’s LLVM Flang beat Julia.

Times using int64 in all languages.

Nim : Execution time: 1.255711317062378
gfortran :      Time: 1.09300
Julia :               1.09400 s (0 allocations: 0 bytes)
Flang :         Time: 1.02821
C :   Execution time: 0.984375
1 Like

Well, defining x as an MVector eliminated that gap to C and Fortran completely. I was hesitant to try StaticArrays at first because the vector length is big, but was surprised that encoding the vector length besides the element type as the MVector does made all the difference. Thank you all for the valuable feedback in this thread.

using StaticArrays

function insertion_sort!(x)
    for i = 2:length(x)
        c = x[i]
        j = i
        while j > 1 && c < x[j-1] 
            x[j] = x[j-1]
            j -= 1
        end
        x[j] = c
    end
end

m = 10^5
x = MVector{m,Int}(undef)
@btime insertion_sort!(y) setup = (y = rand!(x,1:m)) 
  1.028 s (0 allocations: 0 bytes)
4 Likes