# `quantile!` much faster when called twice with a scalar than once with a vector!

``````x = rand(Float32, 10000000);

@time quantile!(x, [0.02, 0.98])
0.196571 seconds (2 allocations: 160 bytes)
2-element Vector{Float64}:
0.02001022338867188
0.9800319683551788

@time quantile!(x, 0.02)
0.022732 seconds (1 allocation: 16 bytes)
0.02001022338867188

@time quantile!(x, 0.98)
0.022851 seconds (1 allocation: 16 bytes)
0.9800319683551788
``````

`quantile!` modifies the input in place, so your second and third calls don’t need to spend any time sorting the input vector — it is already sorted by the first call.

``````julia> v = [3, 2, 1]
3-element Vector{Int64}:
3
2
1

julia> quantile!(v, 0.5)
2.0

julia> v
3-element Vector{Int64}:
1
2
3
``````
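If you need to keep the input intact, use the non-mutating `quantile` (which works on an internal copy), or make the copy explicit; a small sketch:

```julia
using Statistics

v = [3, 2, 1]

# Non-mutating: `quantile` sorts a copy, so `v` is left untouched.
q = quantile(v, 0.5)          # 2.0

# Equivalent, with the copy made explicit:
q2 = quantile!(copy(v), 0.5)  # 2.0

v == [3, 2, 1]                # true
```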

That said, I can confirm that performance is much worse than it ought to be:

``````julia> @btime quantile($x, [0.02, 0.98])
2.145 s (4 allocations: 38.15 MiB)

julia> @btime quantile($x, 0.02)
99.073 ms (2 allocations: 38.15 MiB)
``````

To add to this: I am not very familiar with `quantile!`, but the difference is not caused by the vector argument possibly being allocated on the heap while the individual numbers live on the stack.
Using a static array, which is stack-allocated, does not remove the performance gap:

``````julia> using Statistics,StaticArrays,BenchmarkTools

julia> x = rand(Float32,10000000);StaticInput = SA[0.02,0.98];

julia> @btime quantile!($x, $StaticInput)
490.895 ms (0 allocations: 0 bytes)
2-element SVector{2, Float64} with indices SOneTo(2):
0.0199735164642334
0.9800283956527709

julia> @btime quantile!($x, 0.02)
42.356 ms (0 allocations: 0 bytes)
0.0199735164642334

julia> @btime quantile!($x, 0.98)
28.156 ms (0 allocations: 0 bytes)
0.9800283956527709
``````

Your benchmark is broken: after the first warmup run, even an interpolated `x` will already be sorted, because the same variable is reused throughout the benchmark. You have to use a fresh `x` for each run, so pass `setup=(x=rand(Float32,10000000))` and `evals=1` as additional arguments to make sure each unsorted `x` is only used once.

This doesn’t have anything to do with whether the array is allocated on the heap or not.
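A corrected benchmark along those lines might look like this (a sketch; exact timings will of course vary by machine):

```julia
using Statistics, BenchmarkTools

# setup= gives every sample a fresh unsorted vector; evals=1 makes
# sure each vector is consumed by exactly one call before being rebuilt.
@btime quantile!(x, [0.02, 0.98]) setup=(x = rand(Float32, 10_000_000)) evals=1
@btime quantile!(x, 0.02)         setup=(x = rand(Float32, 10_000_000)) evals=1
@btime quantile!(x, 0.98)         setup=(x = rand(Float32, 10_000_000)) evals=1
```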


Sorry for the initial confusion caused by using the in-place form. The non-mutating form shows the same behavior:

``````x = rand(Float32, 10000000);
xx = deepcopy(x);

@time quantile(x, [0.02, 0.98])
0.944558 seconds (4 allocations: 38.147 MiB)
2-element Vector{Float64}:
0.019991873502731322
0.9799974584579467

all(x .== xx)
true

@time quantile(x, 0.02)
0.082169 seconds (3 allocations: 38.147 MiB)
0.019991873502731322

all(x .== xx)
true

@time quantile(x, 0.98)
0.088702 seconds (3 allocations: 38.147 MiB)
0.9799974584579467
``````

and so does `quantile!`. See:

``````x = rand(Float32, 10000000);

@time quantile!(x, [0.02, 0.98])
0.947875 seconds (2 allocations: 160 bytes)
2-element Vector{Float64}:
0.020006891489028934
0.9800460338592529

@time quantile!(x, 0.02)
0.021764 seconds (1 allocation: 16 bytes)
0.020006891489028934

@time quantile!(x, 0.98)
0.021982 seconds (1 allocation: 16 bytes)
0.9800460338592529
``````

and if we repeat the vector form on the now-sorted input, it is still much slower than the scalar calls:

``````
@time quantile!(x, [0.02, 0.98])
0.194401 seconds (2 allocations: 160 bytes)
2-element Vector{Float64}:
0.020006891489028934
0.9800460338592529
``````

If we broadcast `quantile`, the times look similar to the sum of the individual times:

``````julia> @time quantile.(Ref(x), [0.02, 0.98])
0.194654 seconds (9 allocations: 152.588 MiB)

# or, using BenchmarkTools:
@btime quantile.(Ref($x), [0.02, 0.98])
187.388 ms (6 allocations: 152.59 MiB)
``````

Conclusion. The reason why `quantile!(x, [0.02, 0.98])` is much slower than `[quantile!(x, 0.02), quantile!(x, 0.98)]` is that the former executes `Statistics._quantilesort!(v, sorted, minp, maxp)` with minp << maxp and the latter `Statistics._quantilesort!(v, sorted, p, p)`.
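In other words, `_quantilesort!` only needs to order the index range spanned by the requested quantiles, so the width of `[minp, maxp]` dominates the cost. A rough standalone demonstration of that effect using `Base.partialsort!` (the exact index arithmetic inside Statistics.jl differs slightly):

```julia
x = rand(Float32, 10^7)
n = length(x)
lo, hi = round(Int, 0.02n), round(Int, 0.98n)

# Wide range: ordering positions lo:hi touches ~96% of the vector.
@time partialsort!(copy(x), lo:hi)

# Two single order statistics: each partial sort can discard most
# of the vector early, so both together are far cheaper.
@time partialsort!(copy(x), lo)
@time partialsort!(copy(x), hi)
```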

Demonstration. (Jupyter notebook)

``````using Statistics
X = rand(Float32, 10^7)

x = copy(X); y = quantile!(x, [0.02, 0.98])
x = copy(X); z = [quantile!(x, 0.02), quantile!(x, 0.98)]
@show y == z;
``````
``````y == z = true
``````
``````x = copy(X); @time quantile!(x, [0.02, 0.98])
x = copy(X); @time quantile!(x, [0.02, 0.98])
x = copy(X); @time quantile!(x, [0.02, 0.98]);
``````
``````  1.171875 seconds (2 allocations: 192 bytes)
1.175514 seconds (2 allocations: 192 bytes)
1.174704 seconds (2 allocations: 192 bytes)
``````
``````x = copy(X); @time [quantile!(x, 0.02), quantile!(x, 0.98)]
x = copy(X); @time [quantile!(x, 0.02), quantile!(x, 0.98)]
x = copy(X); @time [quantile!(x, 0.02), quantile!(x, 0.98)];
``````
``````  0.179641 seconds (3 allocations: 128 bytes)
0.196316 seconds (3 allocations: 128 bytes)
0.178824 seconds (3 allocations: 128 bytes)
``````

Yes! The former is much slower than the latter.

Here is a version of `quantile!` from Statistics.jl (JuliaLang/Statistics.jl on GitHub) with `@time` calls added:

``````function myquantile!(v::AbstractVector, p::Union{AbstractArray, Tuple{Vararg{Real}}};
                     sorted::Bool=false, alpha::Real=1., beta::Real=alpha)
    if !isempty(p)
        minp, maxp = extrema(p)
        print("Statistics._quantilesort!(v, sorted, minp, maxp):             ")
        @time Statistics._quantilesort!(v, sorted, minp, maxp)
    end
    print("map(x->Statistics._quantile(v, x, alpha=alpha, beta=beta), p):")
    return @time map(x->Statistics._quantile(v, x, alpha=alpha, beta=beta), p)
end

myquantile!(v::AbstractVector, p::Real; sorted::Bool=false, alpha::Real=1., beta::Real=alpha) = begin
    print("Statistics._quantilesort!(v, sorted, p, p):        ")
    @time Statistics._quantilesort!(v, sorted, p, p)
    print("Statistics._quantile(v, p, alpha=alpha, beta=beta):")
    @time Statistics._quantile(v, p, alpha=alpha, beta=beta)
end

println("myquantile!(x, [0.02, 0.98]):")
x = copy(X); u = myquantile!(x, [0.02, 0.98])
println("\n[myquantile!(x, 0.02), myquantile!(x, 0.98)]:")
x = copy(X); v = [myquantile!(x, 0.02), myquantile!(x, 0.98)]
println(); @show u == v;
``````

Output:

``````myquantile!(x, [0.02, 0.98]):
Statistics._quantilesort!(v, sorted, minp, maxp):               1.208875 seconds
map(x->Statistics._quantile(v, x, alpha=alpha, beta=beta), p):  0.000001 seconds (1 allocation: 96 bytes)

[myquantile!(x, 0.02), myquantile!(x, 0.98)]:
Statistics._quantilesort!(v, sorted, p, p):          0.055211 seconds
Statistics._quantile(v, p, alpha=alpha, beta=beta):  0.000001 seconds
Statistics._quantilesort!(v, sorted, p, p):          0.098211 seconds
Statistics._quantile(v, p, alpha=alpha, beta=beta):  0.000022 seconds

u == v = true
``````

As above, most of the computation time of the former is the execution time of `Statistics._quantilesort!(v, sorted, minp, maxp)` with minp << maxp.

Postscript. (case of [0.499, 0.501])

``````x = copy(X); @time y = quantile!(x, [0.499, 0.501])
x = copy(X); @time y = quantile!(x, [0.499, 0.501])
x = copy(X); @time y = quantile!(x, [0.499, 0.501]);
``````
``````  0.133984 seconds (2 allocations: 192 bytes)
0.138859 seconds (2 allocations: 192 bytes)
0.161751 seconds (2 allocations: 192 bytes)
``````
``````x = copy(X); @time z = [quantile!(x, 0.499), quantile!(x, 0.501)]
x = copy(X); @time z = [quantile!(x, 0.499), quantile!(x, 0.501)]
x = copy(X); @time z = [quantile!(x, 0.499), quantile!(x, 0.501)];
``````
``````  0.235203 seconds (3 allocations: 128 bytes)
0.225585 seconds (3 allocations: 128 bytes)
0.200007 seconds (3 allocations: 128 bytes)
``````

In this case, the former is faster than the latter.
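This is consistent with how `_quantilesort!` works: for `[0.499, 0.501]` the band between the two order statistics is tiny, so one combined partial sort is cheaper than two independent ones, each of which re-partitions a fresh copy. A small sketch of the comparison (approximate index arithmetic):

```julia
x = rand(Float32, 10^7)
n = length(x)
lo, hi = round(Int, 0.499n), round(Int, 0.501n)

# Vector call: one partial sort over a narrow band of indices.
@time partialsort!(copy(x), lo:hi)

# Scalar calls: two separate partial sorts, each starting from scratch.
@time (partialsort!(copy(x), lo); partialsort!(copy(x), hi))
```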


EmpiricalCDFs.jl uses a different approach that may be more efficient, depending on your use case:

``````julia> using EmpiricalCDFs

julia> X = rand(Float32, 10^7);

julia> cdf = EmpiricalCDF(X);

julia> sort!(cdf);

julia> const icdf2 = finv(cdf); # get the quantile function

julia> @btime icdf2.([.02, .98])
42.028 ns (2 allocations: 192 bytes)
2-element Vector{Float32}:
0.02000904
0.9799932
``````