Dispatching by value is very slow - Feature or Problem?

After learning about sorting networks in Faster default Base.sort! for Julia, I thought that we surely must have a package in Julia for that, and surely we do: SortingNetworks.jl

The problem is this:

using SortingNetworks, BenchmarkTools
n = 4
@benchmark for i in 1:10000 sort(rand(n)) end      #  1.547 ms ± 211.471 μs
@benchmark for i in 1:10000 swapsort(rand(n)) end  #  2.393 ms ± 356.063 μs

The whole point of using sorting networks is that they're so much faster than Julia's default sort function for short vectors, and yet, in the above benchmark, sorting networks are hopelessly slow.

As it turns out, SortingNetworks.jl is dispatching by value, where the value is the length of the vector. Although that makes sense, it is precisely what makes it so slow. Compare with the implementation below, which uses plain branching instead of dispatching by value.

function sortn!(x::Vector)
    n = length(x)
    n == 2 && return sort2!(x)
    n == 3 && return sort3!(x)
    n == 4 && return sort4!(x)
    sort!(x)  # fall back to Base.sort! for other lengths
end

function sort4!(x)
    swap(i,j) = x[j], x[i] = x[i], x[j]
    x[2]>x[3] && swap(2,3)
    x[1]>x[2] && swap(1,2); x[3]>x[4] && swap(3,4)
    x[2]>x[3] && swap(2,3)
    x[1]>x[2] && swap(1,2); x[3]>x[4] && swap(3,4)
    true
end
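The network above can be sanity-checked with the zero-one principle: a comparator network that sorts every 0/1 sequence of length 4 sorts every length-4 input. A self-contained sketch (sort4! repeated, with an unambiguous swap, so the block runs on its own):

```julia
# sort4! as above: an odd-even transposition network for 4 elements
function sort4!(x)
    swap(i, j) = ((x[i], x[j]) = (x[j], x[i]))
    x[2] > x[3] && swap(2, 3)
    x[1] > x[2] && swap(1, 2); x[3] > x[4] && swap(3, 4)
    x[2] > x[3] && swap(2, 3)
    x[1] > x[2] && swap(1, 2); x[3] > x[4] && swap(3, 4)
    true
end

# zero-one principle: check all 2^4 binary inputs
for bits in 0:15
    x = [Float64((bits >> k) & 1) for k in 0:3]
    sort4!(x)
    @assert issorted(x)
end
```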

And now the benchmarks show that, indeed, Sorting Networks are faster than Julia’s default sort!:

n = 4
@benchmark for i in 1:10000 sort!(rand(n)) end   #  1.253 ms ± 172.871 μs
@benchmark for i in 1:10000 sortn!(rand(n)) end  #  1.141 ms ± 201.472 μs

This raises some questions about dispatching by value; in particular: is dispatching by value inherently slow, and should it be avoided when performance is critical?


@benchmark already loops for you, and you're also counting the creation time of the random vector, so that's going to skew your benchmark significantly. I recommend using the setup keyword for @benchmark (e.g. setup=(x=rand(n))) and passing x to sort! without interpolation, as well as evals=1 to make sure each vector is only ever used once.

If the performance difference persists, it's probably because dispatching on dynamically constructed Val objects loses all benefit of using them in the first place (static knowledge of some value), and branch predictors are really good at predicting your branchy version. You can also try an array/vector from StaticArrays, where the length is already part of the type, which should avoid the dynamic problems.
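As a sketch of that route (assuming SortingNetworks.jl and BenchmarkTools.jl are installed; swapsort also accepts NTuples, whose length is part of their type just like an SVector's):

```julia
using SortingNetworks, BenchmarkTools

# t is an NTuple{4,Float64}: the length 4 lives in the type, so the
# right network can be selected at compile time, not by runtime dispatch.
@benchmark swapsort(t) setup=(t = (rand(4)...,)) evals=1
```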


BenchmarkTools is not very well suited to comparing benchmarks, as discussed in Identical functions benchmark differently, but that's another story. Actually, benchmarking with those settings only makes the differences wider: using a dispatching function instead of dispatching by value makes sortn! about 3x faster than the default Base.sort!

using SortingNetworks
n = 4

#206.279 μs ±  10.895 μs
@benchmark for i in 1:10000 sort!(x)  end  setup=(x=rand(n)) evals=1

#64.406 μs ±   5.169 μs
@benchmark for i in 1:10000 sortn!(x) end  setup=(x=rand(n)) evals=1

and for sort vs swapsort

# 565.565 μs ± 142.188 μs
@benchmark for i in 1:10000 sort(x)  end  setup=(x=rand(n)) evals=1

#1.449 ms ± 205.297 μs
@benchmark for i in 1:10000 swapsort(x) end  setup=(x=rand(n)) evals=1

Yeah, I imagine this might work, but it defeats the purpose of using sorting networks for Julia Base short vectors.

This benchmark again doesn’t do what you want it to do. A nice sanity check for this is how long a single iteration of sortn! takes, according to your benchmark:

64.406µs / 10000 = 0.0064406µs = 6.4ns

So one sorting (including loading, comparing and storing!) of the input array is supposed to take ~6ns? I find that rather hard to believe when you're doing 6 comparisons & branches, which (if the CPU doesn't know anything about your data) should take longer than that. 1ns per compare-and-swap is too fast.

The final expanded macro (basically) looks like this:

for i in 1:upper_bound # calculated by heuristic timing of one of the first runs, done separately
# while elapsed_time < time_out # alternative stopping point
    x = rand(n)
    for i in 1:evals # == 1:1 or 1 iteration only due to evals=1
        # user code start
        for i in 1:10000
            sort!(x)
        end
        # user code end
    end
end

So you're repeatedly sorting the same vector over and over again. If you then think about what your code and the code from SortingNetworks.jl are doing, it might become clearer why your code is so much faster. The branch predictor in your CPU is happily learning the pattern of your already-sorted data in x: only the first run is actually doing any sorting work, and all subsequent runs are predicted perfectly (i.e. the branch predictor never misses).

This is of course unrealistic, which is why I'm suggesting not to use the extra loop here, thus forcing a new x for each invocation of sort! and making it much harder for the branch predictor to learn a pattern (it'll collapse to being correct about 50% of the time, which is no better than flipping a coin). SortingNetworks.jl is basically doing the same thing as you are, except that it destructures the arguments ahead of time instead of using getindex and setindex!.
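The effect is easy to reproduce directly (a sketch, assuming BenchmarkTools.jl is installed; Base.sort! stands in for any branchy sort):

```julia
using BenchmarkTools

# Same vector every sample: after the first call it is already sorted,
# so every comparison branch becomes perfectly predictable.
v = rand(4)
@benchmark sort!($v)

# Fresh vector for every sample: no pattern for the branch predictor
# to learn, so this reports a more realistic (higher) time.
@benchmark sort!(w) setup=(w = rand(4)) evals=1
```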

Accounting for that, SortingNetworks.jl will probably still be slower due to the dynamic dispatch, but definitely not by a factor of 4x; more like a flat overhead of 50µs (untested, an educated guess based on intuition about dynamic dispatch times on my machine). Reading the code (and noticing that most of it is from 2019), the intention was probably to make the code SIMD-friendly for autovectorization using known-good constructions. That's why the code mentions "in parallel" in its comments.


Nevertheless, to answer your question: I'd expect branching to be faster than dispatching by value up to a certain cutoff point, which I don't think has been reached at n = 4.


Now what was ~3 times faster/slower is ~2 times faster/slower; the problem remains the same. Not really the point, though, since the differences are so wide that no BenchmarkTools settings are going to change the fact that dispatching by value makes things really slow.

using SortingNetworks
n = 4
@benchmark sort!(x)    setup=(x=rand(n)) evals=1 # 75.774 ns ± 18.150 ns
@benchmark sortn!(x)   setup=(x=rand(n)) evals=1 # 58.701 ns ± 46.172 ns
@benchmark sort(x)     setup=(x=rand(n)) evals=1 # 118.909 ns ± 183.294 ns
@benchmark swapsort(x) setup=(x=rand(n)) evals=1 # 202.094 ns ± 43.011 ns 

BenchmarkTools is not really there when it comes to comparing functions, but I would not want to spend much time discussing BenchmarkTools and would rather focus on why dispatching by value slows things down so much, and whether in general we should avoid it when performance is critical.


that's literally the only purpose it serves… (comparing two versions of a function is the same idea)

there's really a ton of discussion on this, and the answer is: yes, don't use values as types, but also don't be afraid to write branches if you need to; sometimes constant folding gives you a free ride:


I disagree (and have also done so in the other thread you linked). It’s a tool like any other and you have to use it correctly, else you’re going to get wrong results and draw faulty conclusions.

That’s a very open-ended question though. In general, yes, it should absolutely be avoided in performance critical situations (i.e. don’t dynamically dispatch in hot loops). That’s the (old but still useful) definition of type instability, dispatching on something the compiler can’t know ahead of time. That’s why I suggested testing with StaticArrays instead, since there the compiler does know the length of the given vector ahead of time. You’d get basically the same result if you had chosen to use tuples instead of an SVector, because those also have the length encoded in its type (each element of a tuple is a type parameter of it, so the number of type parameters is the length of the tuple, which is known during compilation).
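A minimal sketch of that difference (the names f, g and h are made up for illustration):

```julia
# f dispatches on the type parameter N of Val{N}
f(::Val{N}) where {N} = N

# Here Val is built from a runtime value: the compiler cannot know N
# ahead of time, so the call to f is dispatched dynamically.
g(v) = f(Val(length(v)))

# Here N is a literal known at compile time, so the dispatch is
# resolved statically and typically constant-folds away entirely.
h() = f(Val(4))
```

Running `@code_warntype g(rand(4))` shows the return type inferred as `Any`, the telltale sign of the type instability.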


In my opinion it can do better, but truly that’s a discussion for another topic.

Thank you! I think that answers it. Maybe the SortingNetworks.jl developers were not aware of this at the time.

On the contrary, they did the right thing! Their functions are not defined on Vectors: Tuples have their length information at compile time.

I can run swapsort on vectors just fine; it's the one being so slow.

you CAN, but it just gets the length information first, which is why it won't be as fast as if you start with a Tuple or SVector:
https://github.com/JeffreySarnoff/SortingNetworks.jl/blob/b6bda59bd936915dd88ebb94a28420aead921859/src/swapsort.jl#L25

Well, then we agree it makes no sense :slight_smile:

why can't people have Tuples or SVectors they need to sort?


The source code is available online; it's just that the main interface also accepts a vector, because that's what most people want to pass in.

You’ll notice that there are also functions taking in a Tuple:

https://github.com/JeffreySarnoff/SortingNetworks.jl/blob/b6bda59bd936915dd88ebb94a28420aead921859/src/swapsort.jl#L80-L81

The code path dispatching from that vector version can be found here:

https://github.com/JeffreySarnoff/SortingNetworks.jl/blob/b6bda59bd936915dd88ebb94a28420aead921859/src/swapsort.jl#L20-L26

which can be called like any function:

julia> t = (rand(4)...,)                                                         
(0.44566021448516036, 0.8805452024880177, 0.4031296876560667, 0.9577022378173413)
                                                                                 
julia> swapsort(t)                                                               
(0.4031296876560667, 0.44566021448516036, 0.8805452024880177, 0.9577022378173413)
                                                                                 
julia> typeof(t)                                                                 
NTuple{4, Float64}                                                               

No sense for Julia Base Vectors, that is, where you also gain performance by not dispatching by value.

you might as well argue that sorting <25 elements doesn't matter since it's already fast, but the package targets a niche and it works well for that.

Your general question of:
"if I don't know the length and I want to sort a vector, what can I do?"

The answer is: not much, because without length information the compiler can't do anything special. If you create the length information first and then dispatch again, you might as well just sort it directly.
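A sketch of that point (assuming SortingNetworks.jl is installed):

```julia
using SortingNetworks

v = rand(4)
t = (v...,)      # builds an NTuple{4,Float64}, but the length 4 only
                 # becomes known here, at runtime, not at compile time
s = swapsort(t)  # so this call is still dispatched dynamically, and
                 # you paid for the tuple construction on top of it
```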


Maybe this thread is of interest: Union splitting vs C++

Many insights into dealing with dynamic dispatch are given there.


Thank you for your comments, I truly appreciate them. However, why BenchmarkTools is not good for comparisons deserves another topic; if you're interested in further debate, you can create a new discussion topic and I'll gladly give more details there. Allow me, though, not to comment further on BenchmarkTools in this thread.

It's funny you said that right before @lmiq made a contribution, since not long ago he posted a topic about partial sorting of short arrays relevant to his field of science (5 to 20 elements, if I remember correctly).

The point being that what is niche in your field of science is standard in other fields of science.