Ironic observation about `sort` and `sortperm` speed for "small integers" vs R


#1

I found something ironic about sort performance in Julia vs R. I am trying to sort a vector with integer values ranging from 1 to 1_000_000. Julia uses counting sort, which is much faster than R’s radix sort, but the situation is completely reversed for sortperm: R’s algorithm seems blazingly fast! I am racking my brain to try and think of a way to beat R here. I don’t want to look at R’s source (as I don’t understand the licensing, and it’s a nice challenge!). I have posted my implementation of sortperm below, which is SLOW, and I think it’s due to the loop

@inbounds for i in la:-1:1
        ai = a[i] - minval + 1
        c = cnts[ai]
        cnts[ai] -= 1
        res[c] = i
    end

and in particular the line res[c] = i; could it be bad cache performance? I tried my code on v0.7 and it’s also slow. But R has a fast implementation, so there might be a better algorithm, or perhaps Julia can generate better code?

My full implementation of sortperm, which is basically the same as the one in Base but easier for me to understand:

function sortperm_int_range2(a, rangelen, minval)
    cnts = fill(0, rangelen)

    # create a count first
    @inbounds for ai in a
        cnts[ai - minval + 1] +=  1
    end

    # create cumulative sum
    cumsum!(cnts, cnts)

    la = length(a)
    res = Vector{Int}(la)
    @inbounds for i in la:-1:1
        ai = a[i] - minval + 1
        c = cnts[ai]
        cnts[ai] -= 1
        res[c] = i
    end
    res
end

# to test the code
rangelen = 1_000_000
minval = 1
a = rand(1:1_000_000, 100_000_000)
@time x = sortperm_int_range2(a, 1_000_000, 1); # 20 seconds; TOO SLOW

Code to generate the graph:

js = @elapsed sort(a)
jsp = @elapsed sortperm(a)

using RCall
@rput a
r = R"""
list(system.time(sort(a)),  system.time(order(a)))
"""

using StatPlots

groupedbar(
    ["sort", "sort", "sortperm", "sortperm"],
    [js, r[1][3], jsp, r[2][3]],
    group = ["Julia","R","Julia","R"],
    title = "Julia vs R sort and perm (values = 1:1m, vector len = 100m)",
    ylabel="seconds"
)
savefig("julia_vs_r_sortandperm_integer.png")

#2

This seems related to issue #939, “sort/sortperm has poor performance”.


#3

I’ve also found the mismatch in performance between Julia’s sort and sortperm quite puzzling. I never really understood how sort could be so much faster than sortperm. Are you making the case that a different default sorting algorithm should be used for sort and sortperm?


#4

No, not really. The sortperm I have implemented is quite simple and should be fast, but it is not. I wonder why R is so fast? Presumably it’s using some form of radix sort, which shouldn’t be faster than my counting sort.


#5

Looking at the code (the orderVector1 function in src/main/sort.c), it seems that R is using Shell sort for integer vectors. AFAICT they are using the following 16 gaps: 1073790977, 268460033, 67121153, 16783361, 4197377, 1050113, 262913, 65921, 16577, 4193, 1073, 281, 77, 23, 8, 1. I have no idea why it would be faster than your approach.
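For reference, a generic Shell sort over an explicit gap sequence looks roughly like this (a sketch with a made-up short gap list, not R’s actual C code; `shellsort!` is a hypothetical name):

```julia
# Sketch of Shell sort: gapped insertion sort over a decreasing gap sequence.
# The gap list here is illustrative, not the 16 gaps R uses.
function shellsort!(v::AbstractVector, gaps = (701, 301, 132, 57, 23, 10, 4, 1))
    n = length(v)
    for gap in gaps
        gap >= n && continue
        # insertion sort on each gap-separated subsequence
        for i in gap+1:n
            x = v[i]
            j = i
            while j > gap && v[j-gap] > x
                v[j] = v[j-gap]
                j -= gap
            end
            v[j] = x
        end
    end
    v
end
```

The large precomputed gaps mean early passes move elements long distances cheaply, which is one way a comparison sort can stay competitive on big arrays.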


#6

I thought I had tried manually setting the method to “shell” and found it to be slow. So the fast order method is using “radix”; see below.

a = sample(1e6, 1e8, replace=T)
ao <- order(a) # fast
ao_shell <- order(a, method="shell") # slow

#7

I am starting to think that Julia is generating suboptimal code. The algorithm is quite simple and should be fast. The cnts array can be quite long, and it’s being indexed with a random distribution of indices, so access jumps around. Perhaps that’s causing bad cache performance?

Can someone with more expertise please look at the code generated for sortperm_int_range2(a, 1_000_000, 1)? I have a feeling that Julia could generate faster code.

I used a bit of bit twiddling to get something that is only about 2x faster than Base, but I should be able to get R’s performance.

function sortperm_int_range4(a, minval)
    ula = UInt32(length(a))
    lz = leading_zeros(ula)   # bits not needed for the index part

    ca = Vector{UInt64}(ula)

    tz = trailing_zeros(UInt64(minval))

    ltz = lz - tz
    # pack the value into the high bits and the original index into the low bits
    for i = UInt32(1):ula
        ca[i] = (UInt64(a[i]) << (32-ltz)) | i
    end

    # sorting the packed words sorts by value, with index as the tie-breaker
    sort!(ca, alg = RadixSort)

    # mask off the high bits to recover the permutation
    Int.(ca .& (UInt64(2)^(32-lz)-1))
end

using SortingAlgorithms, BenchmarkTools, SortingLab

a = rand(1:1_000_000, 100_000_000)
@time x=sortperm_int_range4(a, 1)
issorted(a[x])

@benchmark sortperm_int_range4($a,1) samples = 5 seconds = 120
# BenchmarkTools.Trial:
#   memory estimate:  2.24 GiB
#   allocs estimate:  91
#   --------------
#   minimum time:     9.271 s (3.17% GC)
#   median time:      9.575 s (3.07% GC)
#   mean time:        9.577 s (2.95% GC)
#   maximum time:     9.979 s (1.99% GC)
#   --------------
#   samples:          5
#   evals/sample:     1
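The packing trick can be seen in isolation on a tiny example (a sketch using a fixed 32-bit split rather than the computed shift above):

```julia
# Pack-sort-unpack: key in the high 32 bits, original index in the low 32
# bits; sort the packed words, then mask out the indices. The result is a
# stable permutation because equal keys compare by index.
a = [30, 10, 20, 10]
packed = [UInt64(a[i]) << 32 | UInt64(i) for i in eachindex(a)]
sort!(packed)
perm = Int.(packed .& 0xffffffff)
@show perm      # [2, 4, 3, 1]
@show a[perm]   # [10, 10, 20, 30]
```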

#8

R’s code for integer sorting is not a plain counting sort. Once you get to a million unique values, that’s not very efficient. R’s integer sorting is based on a radix sort in Matt Dowle’s data.table package. It breaks up the problem and uses several tricks. See the following presentation:

http://user2015.math.aau.dk/presentations/234.pdf

And, Matt writes super-fast code.


#9

Looks like the R/data.table code is based on C code in the following.

http://stereopsis.com/radix.html

It looks like the license allows adaptation/conversion to Julia.


#10

Yeah, I read through that presentation. I have a hard time understanding why counting sort would be slower. Radix sort is nothing but counting sort applied multiple times; it’s used to sort vectors with too many possible unique values, e.g. UInt64.

Here the number of uniques is known in advance, so one counting sort step can beat radix sort; btw that’s the whole idea behind Base.Sort.sort_int_range!, but somehow the sortperm counterpart is slow even though the ideas behind them are the same. I highly suspect something is off with the line res[c] = i. Basically c is a random value between 1 and 100m, so it “jumps around” a lot when done 100m times.

I still think it’s a good idea to look at the code generated by the function. It should be efficient unless there is something really happening with the cache.
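To make “radix sort is counting sort applied multiple times” concrete, here is a minimal LSD radix sort over 8-bit digits for non-negative integers (a sketch, not Base’s or R’s implementation; `lsd_radixsort` is a hypothetical name). Each pass is exactly the counting sort loop from sortperm_int_range2, keyed on one byte:

```julia
# Minimal LSD radix sort for non-negative integers, 8 bits per pass.
# Each of the 8 passes is a stable counting sort on one byte of the key.
function lsd_radixsort(a::Vector{<:Integer})
    src = copy(a)
    dst = similar(src)
    for shift in 0:8:56
        # count occurrences of this byte value
        cnts = zeros(Int, 256)
        for x in src
            cnts[Int((x >> shift) & 0xff) + 1] += 1
        end
        cumsum!(cnts, cnts)
        # place elements by running count, back to front for stability
        for i in length(src):-1:1
            d = Int((src[i] >> shift) & 0xff) + 1
            dst[cnts[d]] = src[i]
            cnts[d] -= 1
        end
        src, dst = dst, src
    end
    src
end
```

With only 256 counters per pass, the counting vector stays cache-resident, which is the trade radix sort makes against counting sort’s single pass over a huge counter array.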


#11

The webpage above outlines some of the approaches to beating a straight counting sort as does the following:

http://codercorner.com/RadixSortRevisited.htm

Another big issue is that in R, an integer is 32 bits, so that will speed things up (relative to Julia’s Int64). In Julia, sorting your a vector is twice as fast for me if I use Int32 instead of Int64. sortperm wasn’t any faster, but then the indexing vector is still Int64 even when using an Int32 vector.
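A quick way to check the element-width effect (timings are machine-dependent; the second sort moves half as many bytes through memory):

```julia
# Compare sort time for Int64 vs Int32 elements holding the same values.
a64 = rand(1:1_000_000, 10_000_000)
a32 = Int32.(a64)
t64 = @elapsed sort(a64)
t32 = @elapsed sort(a32)
println("Int64: $t64 s, Int32: $t32 s")
```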


#13

They don’t detail techniques that beat counting sort; I think they detail how to optimize radix sort to beat a naive application of radix sort, and they have some tricks to deal with partially sorted data better. I have read them as part of my research. Again, radix sort is nothing but applying counting sort multiple times: each pass iterates through all the elements and assigns them to a new location. They can’t use counting sort in one go for 64-bit types, as there are 2^64 possible unique values, so they iterate through the elements 8 bits (or 11 bits) at a time, so that each pass only has 2^8 (or 2^11) unique values to count.

In this particular case, there are only 1_000_000 unique values, known in advance, hence a straight counting sort should beat radix sort, if things like cache performance and generated code don’t come into the picture. Theoretically, counting sort will always be faster, for the simple reason that it only needs one pass over the data versus radix sort’s 64/8 (or 64/11) passes, and that’s why Base uses counting sort in sort_int_range!, which is demonstrably faster than R’s radix sort as shown in the original post. It’s just sortperm that has poor performance, even though it’s using the same underlying “counting sort”-like algorithm.

Getting back to the real world, in practice things like cache (which I don’t understand very well) and other CPU- and hardware-related factors matter, and the low-level code being generated also matters. That’s why I am leaning towards checking the generated code for any issues. This might even help towards closing the issue that was opened 5-6 years ago regarding poor sortperm performance.

function sortperm_int_range2(a, rangelen, minval)
    cnts = fill(0, rangelen)

    # create a count first
    @inbounds for ai in a
        cnts[ai - minval + 1] +=  1
    end

    # create cumulative sum
    cumsum!(cnts, cnts)

    la = length(a)
    res = Vector{Int}(la)
    @inbounds for i in la:-1:1
        ai = a[i] - minval + 1
        c = cnts[ai]
        cnts[ai] -= 1
        res[c] = i
    end

    res
end

Output of @code_lowered, @code_typed, @code_llvm and @code_native:


Main> @code_lowered sortperm_int_range2(a, 1_000_000, 1)
CodeInfo(:(begin
        nothing
        NewvarNode(:(la))
        NewvarNode(:(res))
        cnts = (Main.fill)(0, rangelen) # line 4:
        $(Expr(:inbounds, true))
        SSAValue(0) = a
        #temp#@_6 = (Base.start)(SSAValue(0))
        9:
        unless !((Base.done)(SSAValue(0), #temp#@_6)) goto 20
        SSAValue(1) = (Base.next)(SSAValue(0), #temp#@_6)
        ai@_5 = (Core.getfield)(SSAValue(1), 1)
        #temp#@_6 = (Core.getfield)(SSAValue(1), 2) # line 5:
        SSAValue(2) = (ai@_5 - minval) + 1
        SSAValue(3) = (Main.getindex)(cnts, SSAValue(2)) + 1
        (Main.setindex!)(cnts, SSAValue(3), SSAValue(2))
        18:
        goto 9
        20:
        $(Expr(:inbounds, :pop)) # line 8:
        (Main.cumsum!)(cnts, cnts) # line 9:
        la = (Main.length)(a) # line 10:
        res = ((Core.apply_type)(Main.Vector, Main.Int))(la) # line 11:
        $(Expr(:inbounds, true))
        SSAValue(4) = (Main.colon)(la, -1, 1)
        #temp#@_10 = (Base.start)(SSAValue(4))
        32:
        unless !((Base.done)(SSAValue(4), #temp#@_10)) goto 48
        SSAValue(5) = (Base.next)(SSAValue(4), #temp#@_10)
        i = (Core.getfield)(SSAValue(5), 1)
        #temp#@_10 = (Core.getfield)(SSAValue(5), 2) # line 12:
        ai@_9 = ((Main.getindex)(a, i) - minval) + 1 # line 13:
        c = (Main.getindex)(cnts, ai@_9) # line 14:
        SSAValue(6) = (Main.getindex)(cnts, ai@_9) - 1
        (Main.setindex!)(cnts, SSAValue(6), ai@_9) # line 15:
        (Main.setindex!)(res, i, c)
        46:
        goto 32
        48:
        $(Expr(:inbounds, :pop)) # line 17:
        return res
    end))

Main> @code_typed sortperm_int_range2(a, 1_000_000, 1)
CodeInfo(:(begin
        NewvarNode(:(la))
        NewvarNode(:(res))
        cnts = $(Expr(:invoke, MethodInstance for fill!(::Array{Int64,1}, ::Int64), :(Base.fill!), :($(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Int64,1}, svec(Any, Int64), Array{Int64,1}, 0, :(rangelen), 0))), 0)) # line 4:
        $(Expr(:inbounds, true))
        #temp#@_6 = 1
        7:
        unless (Base.not_int)((#temp#@_6 === (Base.add_int)((Base.arraylen)(a)::Int64, 1)::Int64)::Bool)::Bool goto 19
        SSAValue(10) = (Base.arrayref)(a, #temp#@_6)::Int64
        SSAValue(11) = (Base.add_int)(#temp#@_6, 1)::Int64
        ai@_5 = SSAValue(10)
        #temp#@_6 = SSAValue(11) # line 5:
        SSAValue(2) = (Base.add_int)((Base.sub_int)(ai@_5, minval)::Int64, 1)::Int64
        SSAValue(3) = (Base.add_int)((Base.arrayref)(cnts, SSAValue(2))::Int64, 1)::Int64
        (Base.arrayset)(cnts, SSAValue(3), SSAValue(2))::Array{Int64,1}
        17:
        goto 7
        19:
        $(Expr(:inbounds, :pop)) # line 8:
        $(Expr(:invoke, MethodInstance for cumsum!(::Array{Int64,1}, ::Array{Int64,1}), :(Main.cumsum!), :(cnts), :(cnts))) # line 9:
        la = (Base.arraylen)(a)::Int64 # line 10:
        res = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Int64,1}, svec(Any, Int64), Array{Int64,1}, 0, :(la), 0)) # line 11:
        $(Expr(:inbounds, true))
        $(Expr(:inbounds, false))
        # meta: location range.jl colon 30
        $(Expr(:inbounds, false))
        # meta: location range.jl _colon 32
        # meta: location range.jl Type 145
        # meta: location range.jl Type 93
        SSAValue(9) = $(Expr(:invoke, MethodInstance for steprange_last(::Int64, ::Int64, ::Int64), :(Base.steprange_last), :(la), -1, 1))
        # meta: pop location
        # meta: pop location
        # meta: pop location
        $(Expr(:inbounds, :pop))
        # meta: pop location
        $(Expr(:inbounds, :pop))
        SSAValue(12) = la
        #temp#@_10 = SSAValue(12)
        44:
        unless (Base.not_int)((Base.or_int)((Base.and_int)((Base.not_int)((SSAValue(12) === SSAValue(9))::Bool)::Bool, (Base.not_int)(((Base.slt_int)(0, -1)::Bool === (Base.slt_int)(SSAValue(12), SSAValue(9))::Bool)::Bool)::Bool)::Bool, (#temp#@_10 === (Base.add_int)(SSAValue(9), -1)::Int64)::Bool)::Bool)::Bool goto 61
        SSAValue(13) = #temp#@_10
        SSAValue(14) = (Base.add_int)(#temp#@_10, -1)::Int64
        i = SSAValue(13)
        #temp#@_10 = SSAValue(14) # line 12:
        ai@_9 = (Base.add_int)((Base.sub_int)((Base.arrayref)(a, i)::Int64, minval)::Int64, 1)::Int64 # line 13:
        c = (Base.arrayref)(cnts, ai@_9)::Int64 # line 14:
        SSAValue(6) = (Base.sub_int)((Base.arrayref)(cnts, ai@_9)::Int64, 1)::Int64
        (Base.arrayset)(cnts, SSAValue(6), ai@_9)::Array{Int64,1} # line 15:
        (Base.arrayset)(res, i, c)::Array{Int64,1}
        59:
        goto 44
        61:
        $(Expr(:inbounds, :pop)) # line 17:
        return res
    end))=>Array{Int64,1}

Main> @code_llvm sortperm_int_range2(a, 1_000_000, 1)

; Function Attrs: uwtable
define i8** @julia_sortperm_int_range2_64498(i8** dereferenceable(40), i64, i64) #0 !dbg !5 {
top:
  %3 = call i8**** @jl_get_ptls_states() #4
  %4 = alloca [13 x i8**], align 8
  %.sub = getelementptr inbounds [13 x i8**], [13 x i8**]* %4, i64 0, i64 0
  %5 = getelementptr [13 x i8**], [13 x i8**]* %4, i64 0, i64 2
  %6 = getelementptr [13 x i8**], [13 x i8**]* %4, i64 0, i64 3
  %7 = bitcast [13 x i8**]* %4 to i64*
  %8 = bitcast i8*** %5 to i8*
  call void @llvm.memset.p0i8.i64(i8* %8, i8 0, i64 88, i32 8, i1 false)
  store i64 22, i64* %7, align 8
  %9 = getelementptr [13 x i8**], [13 x i8**]* %4, i64 0, i64 1
  %10 = bitcast i8**** %3 to i64*
  %11 = load i64, i64* %10, align 8
  %12 = bitcast i8*** %9 to i64*
  store i64 %11, i64* %12, align 8
  store i8*** %.sub, i8**** %3, align 8
  %13 = call i8** inttoptr (i64 1801121808 to i8** (i8**, i64)*)(i8** inttoptr (i64 162829584 to i8**), i64 %1)
  store i8** %13, i8*** %5, align 8
  %14 = call i8** @"jlsys_fill!_42240"(i8** %13, i64 0)
  store i8** %14, i8*** %6, align 8
  %15 = getelementptr inbounds i8*, i8** %0, i64 1
  %16 = bitcast i8** %15 to i64*
  %17 = load i64, i64* %16, align 8
  %18 = icmp eq i64 %17, 0
  br i1 %18, label %L19, label %if.lr.ph

if.lr.ph:                                         ; preds = %top
  %19 = getelementptr [13 x i8**], [13 x i8**]* %4, i64 0, i64 4
  %20 = getelementptr [13 x i8**], [13 x i8**]* %4, i64 0, i64 5
  %21 = bitcast i8** %0 to i64**
  %22 = load i64*, i64** %21, align 8
  %23 = bitcast i8** %14 to i64**
  %24 = load i64*, i64** %23, align 8
  br label %if

if:                                               ; preds = %if.lr.ph, %if
  %"#temp#.07" = phi i64 [ 1, %if.lr.ph ], [ %28, %if ]
  %25 = add i64 %"#temp#.07", -1
  %26 = getelementptr i64, i64* %22, i64 %25
  %27 = load i64, i64* %26, align 8
  %28 = add i64 %"#temp#.07", 1
  %29 = sub i64 %27, %2
  %30 = getelementptr i64, i64* %24, i64 %29
  %31 = load i64, i64* %30, align 8
  %32 = add i64 %31, 1
  store i64 %32, i64* %30, align 8
  %33 = icmp eq i64 %"#temp#.07", %17
  br i1 %33, label %L7.L19_crit_edge, label %if

L7.L19_crit_edge:                                 ; preds = %if
  store i8** %14, i8*** %19, align 8
  store i8** %14, i8*** %20, align 8
  br label %L19

L19:                                              ; preds = %L7.L19_crit_edge, %top
  %34 = getelementptr [13 x i8**], [13 x i8**]* %4, i64 0, i64 6
  %35 = getelementptr [13 x i8**], [13 x i8**]* %4, i64 0, i64 7
  %36 = getelementptr [13 x i8**], [13 x i8**]* %4, i64 0, i64 8
  store i8** %14, i8*** %34, align 8
  store i8** %14, i8*** %35, align 8
  %37 = call i8** @"julia_cumsum!_64499"(i8** %14, i8** %14)
  %38 = load i64, i64* %16, align 8
  %39 = call i8** inttoptr (i64 1801121808 to i8** (i8**, i64)*)(i8** inttoptr (i64 162829584 to i8**), i64 %38)
  store i8** %39, i8*** %36, align 8
  %40 = call i64 @jlsys_steprange_last_58768(i64 %38, i64 -1, i64 1)
  %41 = icmp slt i64 %38, %40
  %42 = zext i1 %41 to i8
  %43 = add i64 %40, -1
  %44 = icmp eq i64 %38, %43
  %45 = zext i1 %44 to i8
  %46 = or i8 %45, %42
  %47 = icmp eq i8 %46, 1
  br i1 %47, label %L61, label %if3.lr.ph

if3.lr.ph:                                        ; preds = %L19
  %48 = getelementptr [13 x i8**], [13 x i8**]* %4, i64 0, i64 9
  %49 = getelementptr [13 x i8**], [13 x i8**]* %4, i64 0, i64 10
  %50 = getelementptr [13 x i8**], [13 x i8**]* %4, i64 0, i64 11
  %51 = getelementptr [13 x i8**], [13 x i8**]* %4, i64 0, i64 12
  %52 = bitcast i8** %0 to i64**
  %53 = load i64*, i64** %52, align 8
  %54 = bitcast i8** %14 to i64**
  %55 = load i64*, i64** %54, align 8
  %56 = bitcast i8** %39 to i64**
  %57 = load i64*, i64** %56, align 8
  br label %if3

if3:                                              ; preds = %if3.lr.ph, %if3
  %"#temp#2.06" = phi i64 [ %38, %if3.lr.ph ], [ %58, %if3 ]
  %58 = add i64 %"#temp#2.06", -1
  %59 = getelementptr i64, i64* %53, i64 %58
  %60 = load i64, i64* %59, align 8
  %61 = sub i64 %60, %2
  %62 = getelementptr i64, i64* %55, i64 %61
  %63 = load i64, i64* %62, align 8
  %64 = add i64 %63, -1
  store i64 %64, i64* %62, align 8
  %65 = getelementptr i64, i64* %57, i64 %64
  store i64 %"#temp#2.06", i64* %65, align 8
  %66 = icmp eq i64 %58, %43
  %67 = zext i1 %66 to i8
  %68 = or i8 %67, %42
  %69 = icmp eq i8 %68, 1
  br i1 %69, label %L44.L61_crit_edge, label %if3

L44.L61_crit_edge:                                ; preds = %if3
  store i8** %14, i8*** %48, align 8
  store i8** %14, i8*** %49, align 8
  store i8** %14, i8*** %50, align 8
  store i8** %39, i8*** %51, align 8
  br label %L61

L61:                                              ; preds = %L44.L61_crit_edge, %L19
  %70 = load i64, i64* %12, align 8
  store i64 %70, i64* %10, align 8
  ret i8** %39
}

Main> @code_native sortperm_int_range2(a, 1_000_000, 1)
        .text
Filename: none
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rsi
        pushq   %rdi
        pushq   %rbx
        subq    $152, %rsp
        movq    %r8, %r14
        movq    %rdx, %rsi
        movq    %rcx, %r15
        movabsq $jl_get_ptls_states, %rax
        callq   *%rax
        xorps   %xmm0, %xmm0
        movups  %xmm0, -88(%rbp)
        movups  %xmm0, -104(%rbp)
        movups  %xmm0, -120(%rbp)
        movups  %xmm0, -136(%rbp)
        movups  %xmm0, -152(%rbp)
        movq    $0, -72(%rbp)
        movq    $22, -168(%rbp)
        movq    (%rax), %rcx
        movq    %rcx, -160(%rbp)
        leaq    -168(%rbp), %rcx
        movq    %rax, -64(%rbp)
        movq    %rcx, (%rax)
Source line: 2
        movl    $jl_alloc_array_1d, %eax
        movl    $162829584, %ecx        # imm = 0x9B49510
        movq    %rsi, %rdx
        callq   *%rax
        movq    %rax, -152(%rbp)
        movabsq $"fill!", %rbx
        xorl    %edx, %edx
        movq    %rax, %rcx
        callq   *%rbx
        movq    %rax, %r13
        movq    %r13, -144(%rbp)
Source line: 4
        movq    8(%r15), %rax
        testq   %rax, %rax
        je      L222
        movq    (%r15), %rcx
Source line: 5
        movq    (%r13), %rdx
        nopw    %cs:(%rax,%rax)
Source line: 4
L192:
        movq    (%rcx), %rbx
Source line: 5
        subq    %r14, %rbx
        incq    jl_system_image_data(%rdx,%rbx,8)
Source line: 4
        addq    $8, %rcx
        decq    %rax
        jne     L192
Source line: 5
        movq    %r13, -136(%rbp)
        movq    %r13, -128(%rbp)
Source line: 8
L222:
        movq    %r13, -120(%rbp)
        movq    %r13, -112(%rbp)
        movabsq $"cumsum!", %rax
        movq    %r13, %rcx
        movq    %r13, %rdx
        callq   *%rax
Source line: 9
        movq    8(%r15), %rbx
Source line: 10
        movl    $jl_alloc_array_1d, %eax
        movl    $162829584, %ecx        # imm = 0x9B49510
        movq    %rbx, %rdx
        callq   *%rax
        movq    %rax, %r12
        movq    %r12, -104(%rbp)
Source line: 93
        movabsq $steprange_last, %rax
        movq    $-1, %rdx
        movl    $1, %r8d
        movq    %rbx, %rcx
        callq   *%rax
Source line: 11
        cmpq    %rax, %rbx
        setl    %r9b
        setge   %r8b
        leaq    -1(%rax), %rdi
        cmpq    %rdi, %rbx
        setne   %dl
        andb    %r8b, %dl
        cmpb    $1, %dl
        jne     L411
Source line: 12
        movq    (%r15), %r8
Source line: 13
        movq    (%r13), %rsi
Source line: 15
        movq    (%r12), %r10
        nopw    %cs:(%rax,%rax)
Source line: 12
L352:
        movq    -8(%r8,%rbx,8), %rcx
        subq    %r14, %rcx
Source line: 13
        movq    (%rsi,%rcx,8), %rdx
Source line: 14
        leaq    -1(%rdx), %rdi
        movq    %rdi, (%rsi,%rcx,8)
Source line: 15
        movq    %rbx, -8(%r10,%rdx,8)
Source line: 11
        cmpq    %rbx, %rax
        leaq    -1(%rbx), %rbx
        sete    %cl
        orb     %r9b, %cl
        cmpb    $1, %cl
        jne     L352
Source line: 13
        movq    %r13, -96(%rbp)
Source line: 14
        movq    %r13, -88(%rbp)
        movq    %r13, -80(%rbp)
Source line: 15
        movq    %r12, -72(%rbp)
Source line: 17
L411:
        movq    -160(%rbp), %rax
        movq    -64(%rbp), %rcx
        movq    %rax, (%rcx)
        movq    %r12, %rax
        addq    $152, %rsp
        popq    %rbx
        popq    %rdi
        popq    %rsi
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
        popq    %rbp
        retq

#14

More than you realize. On one of my systems, your sortperm_int_range2 (or the original function in Base) runs in 2.5s on your standard problem (vs 2.0s for R’s order on the same system). On another, sortperm_int_range2 takes 6.5s, and sortperm_int_range4 takes 9s. The systems differ only in the generation of Xeon processor (with associated hardware) and the version of Linux in use.

It may well come down to really gritty things like what clocking the memory uses and how that affects latency. Perhaps even how one’s OS deals with the recently publicized processor vulnerabilities.

In short, there’s nothing wrong with your algorithm: nearly all the time is in a pair or two of loads and stores per iteration, and the cost of those is a benchmarker’s nightmare.


#15

Just to confirm, you ran exactly the above and got 2.5 seconds? But then how can R be so much faster on the same machine? I am amazed! Theoretically, Julia should be faster. If nothing is wrong with my algorithm, then can Julia generate faster code? Or is LLVM not generating optimized code for my hardware for some reason? It’s just a standard i7 laptop. R’s implementation is in C.


#16

Yes, 2.5s on one and 6.5s on the other, with the problem size as you show it. As you say, the speed of the R code is surprising, but they presumably use fully optimized compilation, which might do magic with instruction ordering. (So perhaps I was too pessimistic about the essential importance of the low-level factors.)


#17

Maybe I’m missing something, but isn’t it obvious that counting sort is going to be slow when the number of possible values gets so large that the counting vector doesn’t fit in the cache? I would expect counting sort to be very fast for a small number of values, but radix sort to be faster when it gets large because of these cache issues. Depending on the size of your CPU’s cache (i.e. whether the counting vector fits into it or not), the timings might vary a lot.

That should be easy to prove by passing the range of possible values (and therefore the size of the counting vector) to the function, and increasing it progressively.
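A sketch of that experiment (`csort_perm` is a hypothetical compact rewrite of the counting-sort sortperm from earlier in the thread; timings are machine-dependent, so none are shown):

```julia
# Counting-sort-based sortperm, timed while the counting vector grows
# from cache-resident (kilobytes) to far beyond typical cache sizes.
function csort_perm(a, rangelen, minval)
    cnts = zeros(Int, rangelen)
    @inbounds for ai in a
        cnts[ai - minval + 1] += 1
    end
    cumsum!(cnts, cnts)
    res = similar(a, Int)
    @inbounds for i in length(a):-1:1
        ai = a[i] - minval + 1
        res[cnts[ai]] = i
        cnts[ai] -= 1
    end
    res
end

for rangelen in (2^10, 2^14, 2^18, 2^22)
    a = rand(1:rangelen, 10_000_000)
    t = @elapsed csort_perm(a, rangelen, 1)
    println("rangelen = $rangelen: $t seconds")
end
```

If the cache hypothesis holds, the per-element cost should jump once 8 * rangelen bytes of counters exceed the last-level cache.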


#18

No, it’s not obvious to me, for the reason that Julia’s sort is a lot faster, as shown in my original post, while using a counting vector of 1_000_000 entries (which clearly doesn’t fit in cache), yet things get slower for sortperm. If it has to do with cache, then I think it’s the bad cache performance in the creation of the output, which is a vector containing 100 million elements.


#19

The problem is not so much that the arrays don’t fit in cache as that the accesses are scattershot, so the cache is in the way. The in-place sort makes things more local even for big arrays. So let’s try turning sortperm into the solved problem of sort!: we will replace pairs of slow loads/stores with fast loads/stores of Pairs.

const RADIX_SIZE = 10
if RADIX_SIZE < 9
    const RADIX_MASK = UInt8(2<<(RADIX_SIZE-1)-1)
elseif RADIX_SIZE < 17
    const RADIX_MASK = UInt16(2<<(RADIX_SIZE-1)-1)
else
    const RADIX_MASK = UInt32(2<<(RADIX_SIZE-1)-1)
end

const VT = Pair{UInt32,UInt32}
const VTV = Vector{VT}
function sortperm_int_range_p1(a, rangelen, minval)
    @assert 2^32 > rangelen
    @assert 2^32 >= length(a)
    vs::VTV = Vector{VT}(length(a))
    @inbounds for i in eachindex(a)
        vs[i] = Pair(UInt32(i),UInt32(a[i]))
    end
    _sortperm_int_range_p1(vs, rangelen, minval)
end

function _sortperm_int_range_p1(vs, rangelen, minval)
    ts = similar(vs)

    # Init
    lo = 1
    hi = length(vs)
    iters = 1
    while (1<<(iters*RADIX_SIZE)) <= rangelen
        iters += 1
    end
    println("iters: $iters")
    @assert iters <= ceil(Integer, sizeof(typeof(vs[1].second))*8/RADIX_SIZE)
    bin = zeros(UInt32, 2^RADIX_SIZE, iters)

    # Histogram for each element, radix
    for i = lo:hi
        v = vs[i].second
        for j = 1:iters
            idx = Int((v >> (j-1)*RADIX_SIZE) & RADIX_MASK) + 1
            @inbounds bin[idx,j] += 1
        end
    end

    # Sort!
    swaps = 0
    len = hi-lo+1
    for j = 1:iters
        # Unroll first data iteration, check for degenerate case
        v = vs[hi].second
        idx = Int((v >> (j-1)*RADIX_SIZE) & RADIX_MASK) + 1

        # are all values the same at this radix?
        if bin[idx,j] == len;  continue;  end

        cbin = cumsum(bin[:,j])
        ci = cbin[idx]
        ts[ci] = vs[hi]
        cbin[idx] -= 1

        # Finish the loop...
        @inbounds for i in hi-1:-1:lo
            v = vs[i].second
            idx = Int((v >> (j-1)*RADIX_SIZE) & RADIX_MASK) + 1
            ci = cbin[idx]
            ts[ci] = vs[i]
            cbin[idx] -= 1
        end
        vs,ts = ts,vs
        swaps += 1
    end

    if isodd(swaps)
        vs,ts = ts,vs
        @inbounds for i = lo:hi
            vs[i] = ts[i]
        end
    end
    res = Vector{Int}(length(vs))
    @inbounds for i in eachindex(vs)
        res[i] = Int(vs[i].first)
    end
    res
end

This is 2.7s on my slower machine (probably close to R, but I don’t have R here).

Credit: most of this is copied from SortingAlgorithms.jl.

Notes:

  1. As you can see, I had to beat the compiler up to get type stability.
  2. Optimal radix size and cutover where this beats counting sort will depend on machine details.

#20

That’s pretty cool, and it is even a stable sort. May I ask where specifically you had to beat up the compiler for type stability? The code looks pretty natural to me; I’d be interested in the pitfalls you encountered.

BTW, the code gets significantly faster on my machine if you add some more @inbounds. That is,

function _sortperm_int_range_p1(vs, rangelen, minval)
    ts = similar(vs)

    # Init
    lo = 1
    hi = length(vs)
    iters = 1
    while (1<<(iters*RADIX_SIZE)) <= rangelen
        iters += 1
    end
    println("iters: $iters")
    @assert iters <= ceil(Integer, sizeof(typeof(vs[1].second))*8/RADIX_SIZE)
    bin = zeros(UInt32, 2^RADIX_SIZE, iters)

    # Histogram for each element, radix
    @inbounds for i = lo:hi
        v = vs[i].second
        for j = 1:iters
            idx = Int((v >> (j-1)*RADIX_SIZE) & RADIX_MASK) + 1
            @inbounds bin[idx,j] += 1
        end
    end

    # Sort!
    swaps = 0
    len = hi-lo+1
    @inbounds for j = 1:iters
        # Unroll first data iteration, check for degenerate case
        v = vs[hi].second
        idx = Int((v >> (j-1)*RADIX_SIZE) & RADIX_MASK) + 1

        # are all values the same at this radix?
        if bin[idx,j] == len;  continue;  end

        cbin = cumsum(bin[:,j])
        ci = cbin[idx]
        ts[ci] = vs[hi]
        cbin[idx] -= 1

        # Finish the loop...
        @inbounds for i in hi-1:-1:lo
            v = vs[i].second
            idx = Int((v >> (j-1)*RADIX_SIZE) & RADIX_MASK) + 1
            ci = cbin[idx]
            ts[ci] = vs[i]
            cbin[idx] -= 1
        end
        vs,ts = ts,vs
        swaps += 1
    end

    if isodd(swaps)
        vs,ts = ts,vs
        @inbounds for i = lo:hi
            vs[i] = ts[i]
        end
    end
    res = Vector{Int}(length(vs))
    @inbounds for i in eachindex(vs)
        res[i] = Int(vs[i].first)
    end
    res
end

#21

Have you tried the sortperm_int_range4 function that I wrote? I’m curious how it performs on your machine.