% gcc -Ofast -march=native insertion.c -o insertion
% ./insertion
84019 39439 78310 79845 91165 19756 33523 76823 27778 55397
Execution time: 1.028668
% julia --math-mode=fast --check-bounds=no insertion.jl
977.059 ms (0 allocations: 0 bytes)
% julia --math-mode=fast insertion.jl
1.011 s (0 allocations: 0 bytes)
% julia --check-bounds=no insertion.jl
983.802 ms (0 allocations: 0 bytes)
% julia insertion.jl
1.255 s (0 allocations: 0 bytes)
I don’t understand why --check-bounds=no makes such a difference, given that I am using @inbounds:
function insertion_sort!(x)
    @inbounds for i = 2:length(x)
        c = x[i]
        j = i
        while j > 1 && c < x[j-1]
            x[j] = x[j-1]
            j -= 1
        end
        x[j] = c
    end
end
m = 10^5
using BenchmarkTools
@btime insertion_sort!(y) setup = (y = rand(1:m,m));
I’m surprised you find any differences at all. On my computer, all combinations have the same performance. BTW, in recent versions of Julia --math-mode=fast has no effect; it has been superseded by the @fastmath macro. On the other hand, --check-bounds=no applies @inbounds globally.
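To make that distinction concrete, here is a small sketch (all function names are made up for illustration): an @boundscheck block is only elided when @inbounds applies at an inlined call site, whereas --check-bounds=no strips such blocks everywhere:

```julia
# Sketch: @inbounds only elides @boundscheck blocks in callees that
# inline into the annotated scope; --check-bounds=no removes them
# globally. Run this file with and without the flag to compare.
checked_calls = Ref(0)

@inline function getelt(x, i)
    @boundscheck checked_calls[] += 1   # stands in for a bounds check
    return @inbounds x[i]
end

hits_check(x) = getelt(x, 1)             # plain call: check runs
skips_check(x) = @inbounds getelt(x, 1)  # @inbounds caller: check elided

v = [10, 20, 30]
hits_check(v)
skips_check(v)
println(checked_calls[])  # 1 by default; 0 under --check-bounds=no
```

Running the same file with `julia --check-bounds=no` should print 0 instead, since the flag removes the @boundscheck block regardless of call context (and `--check-bounds=yes` prints 2).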
My results on Windows 10 and WSL2:
Windows 10
# 1.095 s (0 allocations: 0 bytes) : @inbounds, "--check-bounds=no"
# 1.096 s (0 allocations: 0 bytes) : "--check-bounds=no"
# 1.094 s (0 allocations: 0 bytes) : @inbounds
# 1.093 s (0 allocations: 0 bytes) : @inbounds @fastmath
# 1.093 s (0 allocations: 0 bytes) : "--check-bounds=no --math-mode=fast"
WSL2
# 1.094 s (0 allocations: 0 bytes) : @inbounds, "--check-bounds=no"
# 1.094 s (0 allocations: 0 bytes) : "--check-bounds=no"
# 1.095 s (0 allocations: 0 bytes) : @inbounds
# 1.096 s (0 allocations: 0 bytes) : @inbounds @fastmath
# 1.093 s (0 allocations: 0 bytes) : "--check-bounds=no --math-mode=fast"
Yes, that is strange. But here it does make a difference (Linux Mint 20.2 on a Samsung laptop with an i7):
% julia insertion.jl
1.254 s (0 allocations: 0 bytes)
% julia --check-bounds=no insertion.jl
1.004 s (0 allocations: 0 bytes)
% julia insertion.jl
1.266 s (0 allocations: 0 bytes)
% julia --check-bounds=no insertion.jl
975.647 ms (0 allocations: 0 bytes)
% more insertion.jl
function insertion_sort!(x)
    @inbounds for i = 2:length(x)
        c = x[i]
        j = i
        while j > 1 && c < x[j-1]
            x[j] = x[j-1]
            j -= 1
        end
        x[j] = c
    end
end
m = 10^5
using BenchmarkTools
@btime insertion_sort!(y) setup = (y = rand(1:m,m));
% julia --version
julia version 1.6.1
This is kind of a naive question, but could this have anything to do with the type of integer used for i and j? I believe Julia’s Int is defined by the machine, e.g.
julia> typeof(1)
Int64
whereas gcc might default to a 32-bit int?
The operations on i and j are trivial, but everything is trivial here, so maybe it matters? I am at least a couple of decades out of date, but it used to be that Int32 was faster for many operations. I have no idea about modern architectures and compilers, though; it might even go the other way.
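This is easy to probe from the Julia side without touching the loop indices (which stay machine-Int either way): a sketch that benchmarks the same sort on Int32 versus Int64 elements. insertion_sort! is the function from this thread; the smaller m is only to keep the runs short:

```julia
# Sketch: compare element widths directly. Only the element type of
# the input changes; the indices i and j are Int in both runs.
using BenchmarkTools

function insertion_sort!(x)
    @inbounds for i = 2:length(x)
        c = x[i]
        j = i
        while j > 1 && c < x[j-1]
            x[j] = x[j-1]
            j -= 1
        end
        x[j] = c
    end
end

m = 10^4  # smaller than the thread's 10^5, so each trial is quick
@btime insertion_sort!(y) setup = (y = rand(Int32(1):Int32(m), m))
@btime insertion_sort!(y) setup = (y = rand(Int64(1):Int64(m), m))
```

I would not expect a large difference for these scalar compare-and-move operations on x86-64, though the Int32 version does halve the memory traffic.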
For C and Flang, it makes no difference whether I use int64 for all ints, but gfortran is now the same as Julia, and Julia now beats Nim. So now only the C compiler and LLVM Flang beat Julia.
Times using int64 in all languages.
Nim : Execution time: 1.255711317062378
gfortran : Time: 1.09300
Julia : 1.09400 s (0 allocations: 0 bytes)
Flang : Time: 1.02821
C : Execution time: 0.984375
Well, defining x as an MVector eliminated the gap to C and Fortran completely. I was hesitant to try StaticArrays at first because the vector length is large, but was surprised that encoding the vector length in the type, alongside the element type, as MVector does, made all the difference. Thank you all for the valuable feedback in this thread.
using StaticArrays
using BenchmarkTools
using Random  # rand! lives in Random

function insertion_sort!(x)
    for i = 2:length(x)
        c = x[i]
        j = i
        while j > 1 && c < x[j-1]
            x[j] = x[j-1]
            j -= 1
        end
        x[j] = c
    end
end

m = 10^5
x = MVector{m,Int}(undef)
@btime insertion_sort!(y) setup = (y = rand!(x, 1:m))
1.028 s (0 allocations: 0 bytes)
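The speedup is consistent with the length living in the type: for an MVector the size is a type parameter, so length(x) is a compile-time constant and the loop trip counts are visible to the compiler. A minimal illustration (assuming StaticArrays.jl is installed):

```julia
# Sketch: an MVector's length is part of its type, so `length` needs
# no runtime lookup; the compiler sees the loop bounds directly.
using StaticArrays

v = MVector{5,Int}(undef)
println(length(v))   # 5, read from the type parameter

# A plain Vector's length, by contrast, is stored in the object and
# must be loaded at run time on every call:
w = Vector{Int}(undef, 5)
println(length(w))   # 5, but not a compile-time constant
```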