Squeeze out the last 10% of performance for a sorting function?

lmiq · July 13, 2021, 1:30pm

Here, I get:

% gcc -Ofast -march=native insertion.c -o insertion
% ./insertion 
84019  39439  78310  79845  91165  19756  33523  76823  27778  55397  
Execution time: 1.028668

% julia --math-mode=fast --check-bounds=no insertion.jl 
  977.059 ms (0 allocations: 0 bytes)

% julia --math-mode=fast insertion.jl 
  1.011 s (0 allocations: 0 bytes)

% julia --check-bounds=no insertion.jl 
  983.802 ms (0 allocations: 0 bytes)

% julia insertion.jl 
  1.255 s (0 allocations: 0 bytes)

I don’t understand why --check-bounds=no makes such a difference, given that I am using @inbounds:

function insertion_sort!(x)
    @inbounds for i = 2:length(x)
        c = x[i]
        j = i
        while j > 1 && c < x[j-1] 
            x[j] = x[j-1]
            j -= 1
        end
        x[j] = c
    end
end
m = 10^5
using BenchmarkTools
@btime insertion_sort!(y) setup = (y = rand(1:m,m));

Seif_Shebl · July 13, 2021, 5:28pm

I’m surprized you find any differences at all. On my computer, all combinations have the same performance. BTW, in recent versions of Julia --math-mode=fast has no effect, it’s overriden by the @fastmath macro. On the other hand, --check-bounds=no sets @inbounds globally everywhere.
My results on Windows 10 and WSL2:

Windows 10
  # 1.095 s (0 allocations: 0 bytes) : @inbounds, "--check-bounds=no"
  # 1.096 s (0 allocations: 0 bytes) : "--check-bounds=no"
  # 1.094 s (0 allocations: 0 bytes) : @inbounds
  # 1.093 s (0 allocations: 0 bytes) : @inbounds @fastmath
  # 1.093 s (0 allocations: 0 bytes) : "--check-bounds=no --math-mode=fast"
WSL2
  # 1.094 s (0 allocations: 0 bytes) : @inbounds, "--check-bounds=no"
  # 1.094 s (0 allocations: 0 bytes) : "--check-bounds=no"
  # 1.095 s (0 allocations: 0 bytes) : @inbounds
  # 1.096 s (0 allocations: 0 bytes) : @inbounds @fastmath
  # 1.093 s (0 allocations: 0 bytes) : "--check-bounds=no --math-mode=fast"

carstenbauer · July 13, 2021, 5:38pm

Whether you use @inbounds or --check-bounds=no really shouldn’t make a difference. (Can confirm it locally, similar to what @Seif_Shebl posted.)

lmiq · July 13, 2021, 5:49pm

Yes, that is strange. But here it does (Linux Mint 20.2 on a Samsung i7):

% julia insertion.jl 
  1.254 s (0 allocations: 0 bytes)

% julia --check-bounds=no insertion.jl 
  1.004 s (0 allocations: 0 bytes)

% julia insertion.jl 
  1.266 s (0 allocations: 0 bytes)

% julia --check-bounds=no insertion.jl 
  975.647 ms (0 allocations: 0 bytes)

% more insertion.jl 
function insertion_sort!(x)
    @inbounds for i = 2:length(x)
        c = x[i]
        j = i
        while j > 1 && c < x[j-1] 
            x[j] = x[j-1]
            j -= 1
        end
        x[j] = c
    end
end
m = 10^5
using BenchmarkTools
@btime insertion_sort!(y) setup = (y = rand(1:m,m));

% julia --version
julia version 1.6.1

apo383 · July 13, 2021, 7:18pm

This is kind of a naive question, but could this have anything to do with the type of integer for i and j? I believe Julia Int is defined by machine, e.g.

julia> typeof(1)
Int64

whereas gcc might default to 32-bit int?

These are kind of trivial operations with i and j, but everything is trivial here, so maybe it matters? I am at least a couple decades out of date, but it used to be that Int32 was faster for many operations. No idea about modern architectures and compilers though. It might even go the other way?

Seif_Shebl · July 14, 2021, 12:28am

For C and Flang, it makes no difference if I use int64 for all ints, but gfortran is now the same as Julia and Julia now beats Nim. So now, only the C compiler and Fortran’s LLVM Flang beat Julia.

Times using int64 in all languages.

Nim : Execution time: 1.255711317062378
gfortran :      Time: 1.09300
Julia :               1.09400 s (0 allocations: 0 bytes)
Flang :         Time: 1.02821
C :   Execution time: 0.984375

Seif_Shebl · July 18, 2021, 11:55pm

Well, defining x as an MVector eliminated that gap to C and Fortran completely. I was hesitant to try StaticArrays at first because the vector length is big, but was surprised that encoding the vector length besides the element type as the MVector does made all the difference. Thank you all for the valuable feedback in this thread.

using StaticArrays

function insertion_sort!(x)
    for i = 2:length(x)
        c = x[i]
        j = i
        while j > 1 && c < x[j-1] 
            x[j] = x[j-1]
            j -= 1
        end
        x[j] = c
    end
end

m = 10^5
x = MVector{m,Int}(undef)
@btime insertion_sort!(y) setup = (y = rand!(x,1:m)) 
  1.028 s (0 allocations: 0 bytes)

Topic		Replies	Views
Programming Language Benchmark 2 Performance	25	3450	April 8, 2024
Strange performance of a loop Performance	37	3238	July 21, 2018
Why is Julia faster than C++ for quicksort? Performance performance , quicksort	15	2011	August 15, 2023
The cost of size() in for-loops Performance	11	1817	July 20, 2018
Julia fn clearer than C++/Fortran (examples sought) General Usage	38	3224	May 30, 2021

Squeeze out the last 10% of performance for a sorting function?

Related topics