In-place sorting for views

Maybe I am benchmarking it wrong, but it seems that sorting part of a vector (through @views) is slower than sorting the original vector.

using BenchmarkTools

v = rand(1000);
v2 = rand(2000);

v2[500:1499] .= v;

and the benchmarks themselves:

julia> @btime sort!(vv) setup = (vv = deepcopy($v)) evals = 1;
  12.722 μs (0 allocations: 0 bytes)

julia> @btime sort!(vv) setup = (vv1 = deepcopy($v2); vv = @view vv1[500:1499]) evals = 1;
  22.854 μs (0 allocations: 0 bytes)

Is there any way to exploit the current algorithms to sort part of an array? It seems wrong to copy-paste the original algorithms and add an offset to all indices.
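For reference, Base does seem to have an internal method that does exactly this; here is a hedged sketch (the 5-argument `sort!(v, lo, hi, alg, order)` is undocumented internal API as of these Julia versions, so it may change or break):

```julia
# Internal Base.Sort method: sorts v[lo:hi] in place, no view needed.
# Undocumented; relying on it is fragile across Julia versions.
v = [5.0, 4.0, 3.0, 2.0, 1.0]
sort!(v, 2, 4, QuickSort, Base.Order.Forward)  # sort only v[2:4]
# v == [5.0, 2.0, 3.0, 4.0, 1.0]
```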

The difference is quite small in my test:

julia> @benchmark sort!(vv) setup = (vv = deepcopy($v))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.850 μs (0.00% GC)
  median time:      9.100 μs (0.00% GC)
  mean time:        9.346 μs (0.00% GC)
  maximum time:     21.575 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> @benchmark sort!(vv) setup = (vv1 = deepcopy($v2); vv = @view vv1[500:1499])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.850 μs (0.00% GC)
  median time:      10.275 μs (0.00% GC)
  mean time:        10.562 μs (0.00% GC)
  maximum time:     20.925 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> versioninfo()
Julia Version 1.6.0-beta1.0
Commit b84990e1ac (2021-01-08 12:42 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, sandybridge)

Even in this example it's more than a 10% difference, but it shouldn't exist at all, since it's literally the same algorithm over the same data, just with a different offset.

I can do it manually, though I don't like this approach much.

import Base: getindex, size, setindex!
struct LinearSubArray{T} <: AbstractVector{T}
    v::Vector{T}
    offset::Int
    len::Int
end
getindex(x::LinearSubArray, i) = @inbounds x.v[i + x.offset - 1]
setindex!(x::LinearSubArray, val, i) = @inbounds x.v[i + x.offset - 1] = val
size(x::LinearSubArray) = (x.len, )

v = rand(1000);
v2 = rand(2000);

v2[500:1499] .= v;
lv = LinearSubArray(v2, 500, 1000);

julia> @btime sort!(vv) setup = (vv = deepcopy($v)) evals = 1;
  13.110 μs (0 allocations: 0 bytes)

julia> @btime sort!(vv) setup = (vv = deepcopy($lv)) evals = 1;
  12.037 μs (0 allocations: 0 bytes)

Weirdly enough, it is even slightly faster (maybe because the length check can be elided?). Still, I would prefer a more canonical way of solving this.
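One possible refinement, which I have not benchmarked: the AbstractArray interface lets a type declare `IndexLinear()` so generic code knows a single integer index is the fast path. A self-contained sketch of the same wrapper with that declaration (I also dropped the `@inbounds`, which silently hides out-of-bounds bugs in the version above):

```julia
# Same wrapper as above, plus an IndexStyle declaration per the
# AbstractArray interface; whether it speeds up sort! here is untested.
struct LinearSubArray{T} <: AbstractVector{T}
    v::Vector{T}
    offset::Int
    len::Int
end
Base.IndexStyle(::Type{<:LinearSubArray}) = IndexLinear()
Base.getindex(x::LinearSubArray, i::Int) = x.v[i + x.offset - 1]
Base.setindex!(x::LinearSubArray, val, i::Int) = (x.v[i + x.offset - 1] = val)
Base.size(x::LinearSubArray) = (x.len,)

lv = LinearSubArray([9.0, 3.0, 1.0, 2.0, 9.0], 2, 3)
sort!(lv)   # sorts only the wrapped range, in place
```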

I just tested it on a different PC with Julia 1.5.3. The new result indicates no significant difference at all.

julia> @benchmark sort!(vv) setup = (vv = deepcopy($v))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.400 μs (0.00% GC)
  median time:      7.860 μs (0.00% GC)
  mean time:        8.059 μs (0.00% GC)
  maximum time:     28.000 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark sort!(vv) setup = (vv1 = deepcopy($v2); vv = @view vv1[500:1499])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.580 μs (0.00% GC)
  median time:      7.900 μs (0.00% GC)
  mean time:        8.096 μs (0.00% GC)
  maximum time:     17.300 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, ivybridge)
Environment:
  JULIA_NUM_THREADS = 4

Ah, you are right; it looks like a regression.

julia> @btime sort!(vv) setup = (vv = deepcopy($v)) evals = 1;
  17.655 μs (0 allocations: 0 bytes)

julia> @btime sort!(vv) setup = (vv1 = deepcopy($v2); vv = @view vv1[500:1499]) evals = 1;
  11.901 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.5.0-rc1.0
Commit 24f033c951 (2020-06-26 20:13 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4

And on 1.7.0-DEV:

julia> @btime sort!(vv) setup = (vv = deepcopy($v)) evals = 1;
  12.733 μs (0 allocations: 0 bytes)

julia> @btime sort!(vv) setup = (vv1 = deepcopy($v2); vv = @view vv1[500:1499]) evals = 1;
  15.665 μs (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.7.0-DEV.538
Commit 85354cf8ef (2021-02-15 11:47 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4

Thank you, I’ll report an issue.

Ah! Here is the important thing: you didn't add evals = 1 at the end, so you divided the time difference by 5 (since your benchmarks run 5 evals per sample). Actually, it shrank by even more than a factor of 5, since sorting an already-sorted array takes less time.

julia> @benchmark sort!(vv) setup = (vv = deepcopy($v)) evals=1
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.200 μs (0.00% GC)
  median time:      13.101 μs (0.00% GC)
  mean time:        13.640 μs (0.00% GC)
  maximum time:     53.300 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark sort!(vv) setup = (vv1 = deepcopy($v2); vv = @view vv1[500:1499]) evals=1
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.300 μs (0.00% GC)
  median time:      13.299 μs (0.00% GC)
  mean time:        13.860 μs (0.00% GC)
  maximum time:     53.300 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
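To spell out the mechanics (as I understand BenchmarkTools): `setup` runs once per sample, while the body runs `evals` times in a row on the same data, so with evals = 5 four of the five sorts see already-sorted input:

```julia
# What one sample with evals = 5 effectively measures:
v = rand(1000)
vv = deepcopy(v)   # setup: runs once per sample, not once per evaluation
sort!(vv)          # evaluation 1: genuinely unsorted input
sort!(vv)          # evaluations 2..5: input is already sorted,
sort!(vv)          # so they measure the cheap case and drag the
sort!(vv)          # per-evaluation average down
sort!(vv)
@assert issorted(vv)
```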

This is interesting. Anyway, here is the issue: https://github.com/JuliaLang/julia/issues/39864, so let's wait for its resolution.

The sort! method uses quicksort by default. Why does it work faster for a sorted array? Is there some specific implementation detail in Julia?
A useful Wiki link

I do not think it is related to the implementation or algorithm. It's just that for a sorted array you are doing no permutations, only comparisons. Fewer operations → faster timing.
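As a toy illustration (my own code, not Base's implementation), an insertion sort instrumented to count element moves performs zero moves on sorted input and the maximal number on reversed input:

```julia
# Insertion sort that counts how many element moves it performs.
function insertion_sort_moves!(v)
    moves = 0
    for i in 2:length(v)
        x = v[i]
        j = i - 1
        while j >= 1 && v[j] > x
            v[j + 1] = v[j]   # shift larger element right
            moves += 1
            j -= 1
        end
        v[j + 1] = x
    end
    return moves
end

insertion_sort_moves!(collect(1:100))     # 0 moves: already sorted
insertion_sort_moves!(collect(100:-1:1))  # 4950 moves: worst case
```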

That's a clue. I recall that in the undergraduate algorithms course we usually say naive quicksort shows its worst-case performance when the input array is already sorted. With the "median-of-three" pivot rule (see the Wiki link above), it becomes faster instead.
In fact, the time difference is huge:

julia> arr = randn(10000);

julia> @btime sort!(a) setup=(a=copy($arr)) evals=1;
  377.399 μs (0 allocations: 0 bytes)

julia> sort!(arr); # make arr sorted now

julia> @btime sort!(a) setup=(a=copy($arr)) evals=1;
  66.399 μs (0 allocations: 0 bytes)
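Here is a toy sketch of the median-of-three rule (my own version, not Base's internal pivot selection): it takes the median of the first, middle, and last elements, so on already-sorted input it picks the exact middle value and the partitions stay balanced:

```julia
# Return the index of the median of v[lo], v[mid], v[hi]
# (assumes distinct values for simplicity).
function median_of_three(v, lo, hi)
    mid = (lo + hi) >>> 1
    a, b, c = v[lo], v[mid], v[hi]
    if (a <= b) == (b <= c)
        return mid   # b is the median
    elseif (b <= a) == (a <= c)
        return lo    # a is the median
    else
        return hi    # c is the median
    end
end

median_of_three(collect(1:101), 1, 101)  # 51: the exact middle on sorted input
```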

These identical timings are suspicious; maybe you accidentally sorted the arrays at some point prior to benchmarking? Can you try the following in a fresh REPL:

using BenchmarkTools
using StableRNGs

rng = StableRNG(2021);
v = rand(rng, 1000);
v2 = rand(rng, 2000);

v2[500:1499] .= v;

@btime sort!(vv) setup = (vv = deepcopy($v)) evals = 1;
@btime sort!(vv) setup = (vv1 = deepcopy($v2); vv = @view vv1[500:1499]) evals = 1;

StableRNGs makes results reproducible across OSes and Julia versions, so it's a fairer comparison.

Just did as you suggested.

julia> @btime sort!(vv) setup = (vv = deepcopy($v)) evals = 1;
  11.600 μs (0 allocations: 0 bytes)

julia> @btime sort!(vv) setup = (vv1 = deepcopy($v2); vv = @view vv1[500:1499]) evals = 1;
  12.300 μs (0 allocations: 0 bytes)

# run a second time

julia> @btime sort!(vv) setup = (vv = deepcopy($v)) evals = 1;
  11.801 μs (0 allocations: 0 bytes)

julia> @btime sort!(vv) setup = (vv1 = deepcopy($v2); vv = @view vv1[500:1499]) evals = 1;
  12.399 μs (0 allocations: 0 bytes)

There is indeed some difference.

PS: this is the first time I've heard of StableRNGs; it is cool.
