Ironic observation about `sort` and `sortperm` speed for "small integers" vs R

sortperm_int_range4 takes almost 7s for the standard problem on my systems. Let it be noted that my sortperm_int_range_p1 is very similar in design. I thought that the difference in performance occurred because

  1. I forced type-stability (partially true, more on this below)
  2. there are fewer iterations of the radix mask loop (true, and possibly helpful for your project; see the sketch below).
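As a rough illustration of point 2 (hypothetical helper, not code from this thread): an LSD radix sort makes one mask-and-count pass over the data per radix chunk, so a wider radix directly cuts the number of passes.

# passes = ceil(total key bits / radix width in bits)
radix_passes(total_bits, bits_per_pass) = cld(total_bits, bits_per_pass)

radix_passes(64, 8)    # 8 passes over the data with an 8-bit radix
radix_passes(64, 11)   # 6 passes with an 11-bit radix
radix_passes(16, 11)   # 2 passes if the keys are known to fit in 16 bits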

I noticed that SortingAlgorithms.sort!(a::Vector{UInt},...,RadixSort,...) is type-unstable and I assumed this was critical. (Fixing type stability was important for my version.) However, I recently found that making the ca variable in sortperm_int_range4 a Vector{Int} instead of Vector{UInt} (which has no effect on the algorithm for this problem) removes the type instability but does not fix the performance, so there is something sick in the compilation of the method, hidden from the friendly tools I have used so far.
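For reference, this is roughly how one can look for that instability (assuming SortingAlgorithms still extends the internal five-argument `sort!` method of the Base sorting API; the call below is an illustration, not a transcript from my session):

using SortingAlgorithms, Base.Order

a = rand(UInt(1):UInt(2000), 1_000_000)
# abstractly typed variables (e.g. ::Any) in this output are the sign of the instability
@code_warntype sort!(a, 1, length(a), RadixSort, Forward)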

I need to mention that I am using v0.6.2. The type instability of radix sort! seems to be an inference bug which may be fixed in master.


Regarding making my function type-stable

I found it necessary to add extra type assertions and a function barrier, and to replace generators and broadcast expressions with for loops. Some day the compiler will presumably not need so much help.
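As an illustration of the pattern (hypothetical names, not the actual sortperm_int_range_p1 code): when a buffer's type depends on a runtime value, the outer function cannot infer it, so the hot loop goes behind a function barrier and is written as a plain `for` loop.

function sort_outer(a, use_wide::Bool)
    # the element type of `buf` depends on a runtime value, so it is not inferrable here
    buf = use_wide ? zeros(UInt64, 256) : zeros(UInt32, 256)
    return count_kernel!(buf, a)        # function barrier: the kernel specializes on typeof(buf)
end

function count_kernel!(buf, a)
    @inbounds for x in a                # explicit for loop instead of a generator/broadcast
        buf[(x % UInt8) + 1] += 1       # count the low byte of each element
    end
    return buf
end

Calling count_kernel! through the barrier means the loop is compiled once for Vector{UInt32} and once for Vector{UInt64}, instead of running on an abstractly typed buf.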

If this is true then it’s actually a good story and would confirm my hypothesis. And hopefully this will lead to an improvement in the compiler, and then everybody benefits!

Totally. I actually implemented a version using the sort32! function in SortingLab.jl, which only sorts the first 32 bits. That shaved 2 seconds off the benchmark but it was still slower than R, so I decided not to publish it. I actually got an idea for a modified counting sort that should be much faster, if your radix sort benchmarks also hold on my machine. Will get to it once I get home.

For what it is worth, here are the numbers I get.
I hope it is clear which algorithms I mean with p1 and p4.

I note that I get an error for the line which should save the png:

Main> savefig("julia_vs_r_sortandperm_integer.png")
ERROR: png output from the plotly backend is not supported.  Please use plotlyjs instead.

[figure: sorting_pic — Julia vs R sort and sortperm timings]

You gotta run gr() to use another backend.
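Something like this (assuming GR is installed; the plot call is just a stand-in for rebuilding the real one):

using Plots
gr()                                  # switch Plots to the GR backend, which can write PNG files
plot(rand(10), label = "example")     # stand-in; rebuild the actual plot here
savefig("julia_vs_r_sortandperm_integer.png")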


Thank you. Interestingly I had to Pkg.add("GR"), which makes me wonder whether that package should be in any of the REQUIRE files. Admittedly I have absolutely no knowledge about Plots and so on.

It’d probably be a little much to have every backend as a dependency, so I think it normally installs plotly by default, and lets you install any other backends you would like to use.

@Ralph_Smith and @foobar_lv2 The algorithms you helped write there will not only help sortperm for small integer ranges but also sortperm for any integer range. Is it ok if I incorporate your code into SortingLab.jl and then find a way to contribute them back to Base and/or SortingAlgorithms.jl?

I have done a lot of testing and have read through “What Every Programmer Should Know About Memory”. In particular, this quote from the paper made me realize how important understanding the cache is:

A simple computation can show how effective caches can theoretically be. Assume access to main memory takes 200 cycles and access to the cache memory take 15 cycles. Then code using 100 data elements 100 times each will spend 2,000,000 cycles on memory operations if there is no cache and only 168,500 cycles if all data can be cached. That is an improvement of 91.5%.
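Writing out the arithmetic behind that quote (nothing here is specific to the sorting code; it just reproduces the paper’s example):

elements, uses = 100, 100
mem_cycles, cache_cycles = 200, 15

no_cache = elements * uses * mem_cycles                                    # 2_000_000 cycles
cached   = elements * mem_cycles + elements * (uses - 1) * cache_cycles   # 168_500 cycles
1 - cached / no_cache                                                      # ≈ 0.916, i.e. the ~91.5% improvement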

This is quite new to me and I assume to many other Julia programmers as well. Basically, the performance of counting sort beats radix sort up to around 2^11 = 2048 unique values. In counting sort, the size of the counting array is # uniques * sizeof(UInt32), which is around 8192 bytes, so it’s getting close to maxing out the L1d cache on most PCs. So perhaps @nalimilan’s point about the slowness of counting sort is “obvious” once you understand the importance of cache, but not to beginners like me.
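For concreteness, here is a minimal counting-sortperm sketch (hypothetical; it is not the actual sortperm_int_range2, which is not reproduced in this thread). The `cnts` vector is the counting array whose size grows with the number of uniques; the benchmark below times the real implementations against each other.

function counting_sortperm(a::Vector{<:Integer}, rangelen::Int, minval::Integer)
    cnts = zeros(UInt32, rangelen + 1)      # the counting array that wants to stay in L1
    @inbounds for x in a
        cnts[x - minval + 2] += 1           # count into slot key+1 so the cumsum gives offsets
    end
    cumsum!(cnts, cnts)                     # cnts[k] = number of elements with key < k
    perm = zeros(Int, length(a))
    @inbounds for (i, x) in enumerate(a)
        k = x - minval + 1
        cnts[k] += 1
        perm[cnts[k]] = i                   # stable: equal keys keep their original order
    end
    return perm
end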

code

using BenchmarkTools   # for @belapsed

function counting_vs_radix(n)
    a = rand(1:n, 100_000_000)
    # elapsed time of the counting-based vs the radix-based sortperm on the same input
    (@belapsed(sortperm_int_range2($a, $n, 1)), @belapsed(sortperm_int_range_p1($a, $n, 1, 10)))
end

tic()
res = [counting_vs_radix(n) for n in 2.^(2:20)]
toc()

using Plots
# plot against log2(n) = 2:20 so the x axis matches the tick labels
plot(2:20, [r[1] for r in res], title="Counting-sortperm vs Radix-sortperm (# of elements = 100m)", label = "Counting-sortperm",
    ylabel = "seconds", xlabel = "log_2(n) unique values", ylim = (0, 15))
plot!(2:20, [r[2] for r in res], label = "Radix-sortperm")
plot!(xticks = (2:20))
savefig("Counting-sortperm vs Radix-sortperm (# of elements = 100m).png")

Please do. I can think of at least one elaboration which I could propose as a PR.

Don’t ask me; Ralph_Smith did the work. (So this means “yes”, of course it is OK as far as I’m concerned, but I literally added two @inbounds to Ralph_Smith’s cool thing and don’t deserve credit here.)

Minor elaborations if you put this somewhere for general consumption: regarding `cbin = cumsum(bin[:,j])`, the `cbin` can be re-used for all radices; the cumsum wants to be done via a view on 0.7 and by hand on 0.6.

Actually you don’t need `cbin`; just reuse `bin`.
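A tiny sketch of that suggestion (hypothetical function; `bin` is assumed to be the count matrix with one column per radix pass): do the prefix sum in place over column j, so no separate `cbin` is ever allocated, and the hand-written loop behaves the same on 0.6 and 0.7.

function cumsum_column!(bin::Matrix{<:Integer}, j::Int)
    @inbounds for i in 2:size(bin, 1)      # prefix sum "by hand", directly in column j
        bin[i, j] += bin[i - 1, j]
    end
    return bin
end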