Sorting several Vectors in-place simultaneously

I have several vectors (with different types of elements and lengths) whose entries I wish to sort in-place according to some function, like first or identity (lexicographical sort). For example:

n=10^7;  x=rand(1:n,n);  y=rand(1.0:n,n);  

I’d like to make it as efficient as possible, but still use the built-in sort!. A good solution is:

using StructArrays
@time sort!(StructArray((x,y)), by=first, alg=QuickSort);
  1.827674 seconds (21 allocations: 880 bytes)

If x and y have different lengths, I would pass view(x,...).
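For instance, a hedged sketch of the view approach (the lengths and ranges here are illustrative, not from my actual use case):

```julia
using StructArrays

x = [3, 1, 2, 9];  y = [30.0, 10.0, 20.0]   # different lengths
m = min(length(x), length(y))
# Sort only the overlapping prefix in-place; entries past m are untouched.
sort!(StructArray((view(x, 1:m), view(y, 1:m))), by=first, alg=QuickSort)
# x == [1, 2, 3, 9], y == [10.0, 20.0, 30.0]
```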

Given that I will only be using StructArray on Vectors and only for sorting, I’d like to extract the minimal amount of code from that repo, to avoid another dependency. I was hoping someone with experience there could help me out.

Julia already has zip that iterates to the shortest length:
for (i,j) ∈ zip([1,2,3,4,5],[1.0,2.0,3.0]) print("$i $j ") end
But sort!(zip(x,y)) does not work. Inspired by this, I’d like to create Zip as a lightweight version of StructArray. My attempt:

struct Zip{T<:Tuple} <: AbstractVector{Tuple}
    vec::T 
end;
Base.length(z::Zip) = minimum(length(v) for v ∈ z.vec)  # truncate to shortest vector
Base.size(z::Zip) = (length(z),)  #  for AbstractVector compliance
Base.getindex(z::Zip, i::Int) = map(v -> v[i], z.vec)  # Get tuple (x[i], y[i], z[i], ...)
Base.setindex!(z::Zip, vals::Tuple, i::Int) = foreach((x, y) -> x[i] = y, z.vec, vals)  # assign in-place

@time sort!(Zip((x,y)), by=first, alg=QuickSort);
  8.336718 seconds (265.34 M allocations: 5.931 GiB, 4.23% gc time)

Any idea why my approach is so much slower and causes more allocations and memory usage?


you need to benchmark things properly I think:

julia> @be rand(1:10^6, 10^6) sort!
Benchmark: 15 samples with 1 evaluation
 min    2.350 ms (6 allocs: 7.637 MiB, 1.57% gc time)
 median 2.477 ms (6 allocs: 7.637 MiB, 1.95% gc time)
 mean   3.880 ms (6 allocs: 7.637 MiB, 11.61% gc time)
 max    15.783 ms (6 allocs: 7.637 MiB, 81.08% gc time)

julia> @be rand(1:10^6, 10^6) sort
Benchmark: 8 samples with 1 evaluation
 min    2.917 ms (9 allocs: 15.267 MiB, 2.80% gc time)
 median 8.669 ms (9 allocs: 15.267 MiB, 44.48% gc time)
 mean   9.214 ms (9 allocs: 15.267 MiB, 43.12% gc time)
 max    17.536 ms (9 allocs: 15.267 MiB, 81.37% gc time)

If your vector is large enough, I think allocation time is negligible, because we know the size in advance and only need to allocate O(1) times.

In this case, I suggest using sortperm to obtain an order first:

order = sortperm(x)
new_x = @view x[order]
new_y = @view y[order]

You can even use sortperm!() if you do this a lot
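For example, a minimal sketch of reusing one index buffer across many sorts (`p` and `sort_pair!` are my names here, not from any package; note that `permute!` itself copies the permutation internally, so this is not fully allocation-free):

```julia
x = [3, 1, 2];  y = [30.0, 10.0, 20.0]
p = Vector{Int}(undef, length(x))   # index buffer, allocated once and reused

function sort_pair!(p, x, y)
    sortperm!(p, x)    # fill p with the permutation that sorts x
    permute!(x, p)     # reorder x in-place according to p
    permute!(y, p)     # reorder y with the same permutation
    return nothing
end

sort_pair!(p, x, y)
# x == [1, 2, 3], y == [10.0, 20.0, 30.0]
```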


Umm, I still want my sorting to be in-place, so I will avoid sortperm, because doing p=sortperm(x); x=x[p]; y=y[p]; creates 3 new allocations of large vectors (in your above benchmark, sort uses twice as much memory as sort!, and you haven’t even done x=x[p]; y=y[p];).


I’d assume it’s due to type instability:

julia> @btime sort!(Zip((x, y)), by=first, alg=QuickSort) setup=(x=rand(1:$n,$n);y=rand(1.0:$n,$n))
  5.216 s (265781594 allocations: 5.94 GiB)
10000000-element Zip{Tuple{Vector{Int64}, Vector{Float64}}}:
(...)

julia> @btime sort!(Zip((x, y)), by=first, alg=QuickSort) setup=(x=rand(1.0:$n,$n);y=rand(1.0:$n,$n))
  978.344 ms (0 allocations: 0 bytes)
10000000-element Zip{Tuple{Vector{Float64}, Vector{Float64}}}:
(...)

i.e. if you use the same element type in both Vectors, everything works as intended.


If you use view, it doesn’t create large allocations. You just need one allocation for the order; after that it’s allocation-free, and you can re-use the order vector.

And I just want to reiterate that, unless you’re GC-bound, the size of an allocation isn’t so much of an issue; many, many small-ish allocations are usually the bigger performance killer.

I tried your code and it works, but can you explain where the type instability is?

@eldee Hmm. But my use-case involves vectors of different types. How could I remedy this? I tried with parametrization

struct Zip2{tx,ty} <: AbstractVector{Tuple}
    x::Vector{tx} 
    y::Vector{ty} 
end;
function Base.length(a::Zip2{tx,ty}) ::Int  where {tx,ty}
    return min(length(a.x),length(a.y)) end  # truncate to shortest vector
function Base.size(a::Zip2{tx,ty}) ::Tuple{Int64}  where {tx,ty}
    return (length(a),) end  #  for AbstractVector compliance
function Base.getindex(a::Zip2{tx,ty}, i::Int) ::Tuple{tx,ty}  where {tx,ty}
    return (a.x[i],a.y[i]) end  # get tuple
function Base.setindex!(a::Zip2{tx,ty}, t::Tuple{tx,ty}, i::Int) ::Nothing  where {tx,ty}
    a.x[i] = t[1];  a.y[i] = t[2];   return nothing end  # assign in-place

I get really weird results:

julia> n=10^4; x=rand(-9:9,n); y=rand(-1:0.001:1,n); _x,_y=deepcopy.((x,y));  
julia> @btime sort!(StructArray((_x,_y)));  @btime sort!(Zip2(x,y));  x==_x, y==_y
  257.400 μs (37 allocations: 244.11 KiB)
  37.014 μs (1 allocation: 32 bytes)
(true, true)

julia> n=10^7; x=rand(-9:9,n); y=rand(-1:0.001:1,n); _x,_y=deepcopy.((x,y));  
julia> @time sort!(StructArray((_x,_y)));  @time sort!(Zip2(x,y));  x==_x, y==_y
  3.357601 seconds (37 allocations: 235.144 MiB, 1.16% gc time)
  9.076270 seconds (132.48 M allocations: 4.023 GiB, 14.64% gc time)
(true, true)

Under @btime, Zip2 seems faster and leaner, but on big vectors, the actual @time is 3x slower. What is going on here?

I think @eldee’s solution works well because Julia can handle homogeneous tuples like Tuple{Float64, Float64} more efficiently than heterogeneous ones like Tuple{Int, Float64}. So avoid mixed-type tuples; I think using @jling’s sortperm! would be a good choice.
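The inference gap is easy to see directly (a small illustration; `f` is just a throwaway helper): indexing a homogeneous tuple with a runtime index infers a concrete type, while indexing a heterogeneous one infers a Union, which can box in hot loops.

```julia
f(t, j) = t[j]  # index a tuple with a runtime (non-constant) index

# Homogeneous tuple: concrete inferred element type
Base.return_types(f, (Tuple{Float64, Float64}, Int))  # [Float64]

# Heterogeneous tuple: inferred as a Union of the element types
Base.return_types(f, (Tuple{Int64, Float64}, Int))    # [Union{Float64, Int64}]
```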

No, I will avoid sortperm and sort. If StructArray can do efficient sort! on vectors of different types, so should my Zip.

Try this:

using BenchmarkTools

struct CustomTuple
    x::Int
    y::Float64
end

Base.isless(a::CustomTuple, b::CustomTuple) = isless(a.x, b.x)

@btime sort!(data, by=p -> p.x, alg=QuickSort) setup=(data = [CustomTuple(rand(1:$n), rand(1.0:$n)) for _ in 1:$n])

@mohamed.d180 What you suggest is not in-place (creating data vector). I already have vectors x and y (many of them). I do not want to create any new arrays. I wish to sort x and y in-place as efficiently as possible.

I think adding an index vector id can give some speedup here:

n=10^7;  x=rand(1.0:n,n);  y=rand(1.0:n,n); id = zeros(Int, length(x));

function sort_2!(id, x, y)
    sortperm!(id ,x)
    permute!(x,id)
    permute!(y,id)
end

@btime sort_2!(id, x, y);

sorry, but how do you think StructArrays.jl is doing that under the hood…

and then it calls:

it’s just sortperm() (without ! even)
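In spirit, the specialized method does something like this sketch (illustrative, not the literal StructArrays code; `sort_columns!` is my name):

```julia
# Sketch: sort several "columns" together via one permutation of the first.
function sort_columns!(components::Tuple)
    perm = sortperm(components[1])               # one big allocation for the order
    foreach(v -> permute!(v, perm), components)  # apply it to every column in-place
    return components
end

x = [3, 1, 2];  y = ["c", "a", "b"]
sort_columns!((x, y))
# x == [1, 2, 3], y == ["a", "b", "c"]
```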


Let me just point out that in-place operations need not be more efficient than working on a newly allocated vector. Allocations should really only be avoided in hot loops, e.g. within every sort step (cf. the original Zip version).

There might of course be plenty of other reasons you might want to work in-place (e.g. to reduce memory usage). In this case sort! does also seem to be slightly faster than sort:

julia> @btime sort!(x) setup=(x=rand(10^7));
  202.232 ms (6 allocations: 76.30 MiB)

julia> @btime sort(x) setup=(x=rand(10^7));
  215.259 ms (9 allocations: 152.60 MiB)

So I’d argue avoiding sort is typically not really necessary. There are good reasons to avoid sortperm though: it’s very slow.

julia> @btime sortperm(x) setup=(x=rand(10^7))
  4.565 s (6 allocations: 152.59 MiB)

What’s happening here is that @btime repeats the code block sort!(Zip2(x, y)) multiple times and reports the lowest execution time. But after the first run x and y are already sorted, so subsequent runs will be much faster. In this case you should use setup=..., which will be executed before every run, to set x and y.
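You can see the effect with plain Base timing (timings are illustrative; the first call also includes compilation):

```julia
v = rand(10^6)
t_first  = @elapsed sort!(v)   # sorts genuinely random data
t_second = @elapsed sort!(v)   # v is already sorted, so this run is much cheaper
# t_second ≪ t_first: without setup=, @btime's minimum reflects the cheap runs
```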

Also, the default alg seems to perform really poorly here. (The default by/lt is lexicographical, so essentially the same as by=first.)

julia> @btime sort!(Zip2(xc,yc)) setup=(t=deepcopy.(($x, $y)); xc = t[1]; yc = t[2])  # (or just xc=deepcopy($x); yc=deepcopy($y))
  7.259 s (132638930 allocations: 4.03 GiB)
10000000-element Zip2{Int64, Float64}:
 (1, 6.993507e6)
 (1, 7.686177e6)
 (2, 3.0)
 (2, 2.562542e6)
...

julia> @btime sort!(Zip2(xc,yc), alg=QuickSort) setup=(t=deepcopy.(($x, $y)); xc = t[1]; yc = t[2])  # (or just xc=deepcopy($x); yc=deepcopy($y))
  838.394 ms (0 allocations: 0 bytes)
10000000-element Zip2{Int64, Float64}:
 (1, 6.993507e6)
 (1, 7.686177e6)
 (2, 3.0)
 (2, 2.562542e6)
...

You can check that the problem with Zip lies within setindex!:

julia> z = Zip((x, y));

julia> @btime ($z)[1] = (2, 3.);
  29.719 ns (2 allocations: 48 bytes)

@code_warntype doesn’t show any issues, but I’m pretty sure these are hidden by the foreach. If you use the equivalent

function Base.setindex!(z::Zip, vals::Tuple, i::Int)
    for j in eachindex(vals)
        z.vec[j][i] = vals[j]
    end
end

you get

julia> @code_warntype z[1] = (2, 3.)
MethodInstance for setindex!(::Zip{Tuple{Vector{Int64}, Vector{Float64}}}, ::Tuple{Int64, Float64}, ::Int64)
  from setindex!(z::Zip, vals::Tuple, i::Int64) @ Main REPL[37]:1
Arguments
  #self#::Core.Const(setindex!)
  z::Zip{Tuple{Vector{Int64}, Vector{Float64}}}
  vals::Tuple{Int64, Float64}
  i::Int64
Locals
  @_5::Union{Nothing, Tuple{Int64, Int64}}
  j::Int64
Body::Nothing
1 ─ %1  = Main.eachindex(vals)::Core.Const(Base.OneTo(2))
│         (@_5 = Base.iterate(%1))
│   %3  = @_5::Core.Const((1, 1))
│   %4  = (%3 === nothing)::Core.Const(false)
│   %5  = Base.not_int(%4)::Core.Const(true)
└──       goto #4 if not %5
2 ┄ %7  = @_5::Tuple{Int64, Int64}
│         (j = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│   %10 = j::Int64
│   %11 = Base.getindex(vals, %10)::Union{Float64, Int64}
│   %12 = Base.getproperty(z, :vec)::Tuple{Vector{Int64}, Vector{Float64}}
│   %13 = j::Int64
│   %14 = Base.getindex(%12, %13)::Union{Vector{Float64}, Vector{Int64}}
│         Base.setindex!(%14, %11, i)
│         (@_5 = Base.iterate(%1, %9))
│   %17 = @_5::Union{Nothing, Tuple{Int64, Int64}}
│   %18 = (%17 === nothing)::Bool
│   %19 = Base.not_int(%18)::Bool
└──       goto #4 if not %19
3 ─       goto #2
4 ┄       return nothing

(it’s clearer in the REPL where you get colour coding). The issue is that the compiler does not know the type of vals[j] (it’s either Int64 or Float64, but no idea which one). The same applies for z.vec[j][i]. This is also the reason that the (Float64, Float64) example does work fine.


Simply unrolling the loop gives the compiler enough information: since vals isa Tuple{Int, Float64} and z.vec isa Tuple{Vector{Int64}, Vector{Float64}}, it knows that e.g. vals[1] isa Int, as is z.vec[1][i].

function Base.setindex!(z::Zip, vals::Tuple, i::Int)
    z.vec[1][i] = vals[1]
    z.vec[2][i] = vals[2]
end
@btime sort!(Zip((x, y)), by=first, alg=QuickSort) setup=(x=rand(1:$n,$n);y=rand(1.0:$n,$n))
#   799.739 ms (0 allocations: 0 bytes)
# 10000000-element Zip{Tuple{Vector{Int64}, Vector{Float64}}}:

The downside is that to unroll the loop you have to know length(z.vec) or length(vals) (which feels like it really should be inferable from Tuple{Int, Float64}, but apparently isn’t) in advance. Zip2 solves the issue essentially in this way, by restricting to precisely two vectors x and y.


To be more generic you could add length information in the Zip type.

struct ZipN{N, T<:Tuple} <: AbstractVector{Tuple}
    vec::T
    ZipN{N, T}(vecs::T) where {N, T} = length(vecs) == N ? new(vecs) : error("Number of vectors doesn't match N!")
end
	   
ZipN(vecs) = ZipN{length(vecs), typeof(vecs)}(vecs)

Base.length(z::ZipN{N}) where N = minimum(length(z.vec[i]) for i = 1:N)  # truncate to shortest vector
Base.size(z::ZipN) = (length(z),)  #  for AbstractVector compliance
Base.getindex(z::ZipN, i::Int) = map(v -> v[i], z.vec)  # Get tuple (x[i], y[i], z[i], ...)
# (Could probably exploit knowledge of N here, as we did for length)

To unroll the loop in setindex! you can use a generated function.

@generated function Base.setindex!(z::ZipN{N}, vals::Tuple, i::Int) where N
    ex = :()
    for j = 1:N
        ex = :($ex; z.vec[$j][i] = vals[$j])
    end
    return ex
end

This should then be essentially as fast as Zip2, while allowing for arbitrary many elements in the Tuples:

julia> @btime sort!(ZipN((x, y)), alg=QuickSort) setup=(x=rand(1:$n,$n);y=rand(1.0:$n,$n))
  847.480 ms (0 allocations: 0 bytes)
10000000-element ZipN{2, Tuple{Vector{Int64}, Vector{Float64}}}:
 (1, 9.908178e6)
 (2, 8.058796e6)
 (3, 8.157882e6)
 (4, 8.805712e6)
...

julia> @btime sort!(ZipN((x, y, z)), alg=QuickSort) setup=(x=rand(1:$n,$n);y=rand(1.0:$n,$n);z=rand(1:$n, $n))
  965.327 ms (0 allocations: 0 bytes)
10000000-element ZipN{3, Tuple{Vector{Int64}, Vector{Float64}, Vector{Int64}}}:
 (1, 1.640825e6, 2285048)
 (2, 429818.0, 4893870)
 (4, 4.449581e6, 3214969)
 (5, 1.02763e6, 3245078)
...

sort(x) just calls sort!(copy(x)) by default.


Your answer is very helpful and insightful, much appreciated! :) I will look into @generated and might have some follow-up questions, if the Julia documentation and ChatGPT do not suffice.

One thing still bugging me, though, is:

julia> n=10^6; @btime sort!(ZipN((x,y)), by=first, alg=QuickSort) setup=(x=rand(1:$n,$n); y=rand(1.0:$n,$n));
  99.834 ms (0 allocations: 0 bytes)

julia> n=10^6; @btime sort!(StructArray((x,y)), by=first, alg=QuickSort) setup=(x=rand(1:$n,$n); y=rand(1.0:$n,$n));
  71.858 ms (0 allocations: 0 bytes)

Why is ZipN around 30% slower?

That’s a very good question, especially since we had established that sortperm is way slower and sort!(::StructArray) calls it on the first component. Also note that we somehow don’t get any allocations.

Well, turns out sort!(::StructArray) is never used here, which you can check via @which (and @edit or @less to inspect the actual code). When specifying alg we’re falling back on the generic version in sort.jl:

julia> n=10^6;  x=rand(1:n,n);  y=rand(1.0:n,n);

julia> @which sort!(StructArray((x, y)), alg=QuickSort)
kwcall(::NamedTuple, ::typeof(sort!), v::AbstractVector{T}) where T
     @ Base.Sort sort.jl:1687

julia> @which sort!(StructArray((x, y)))
sort!(c::StructArray{<:Union{Tuple, NamedTuple}})
     @ StructArrays (...)\.julia\packages\StructArrays\9hB1H\src\sort.jl:65

The only major difference between ZipN and StructArray I then see at the moment is that StructArray (safely) uses @inbounds. If we (unsafely) add @inbounds for ZipN

Base.length(z::ZipN{N}) where N = minimum(length(@inbounds(z.vec[i])) for i = 1:N)  # (should be safe)
Base.getindex(z::ZipN{N}, i::Int) where N = map(v -> @inbounds(v[i]), z.vec)  # unsafe

@generated function Base.setindex!(z::ZipN{N}, vals::Tuple, i::Int) where N
    ex = Expr(:meta, :inline)
    for j = 1:N
        ex = :($ex; @inbounds(z.vec[$j][i] = vals[$j]))
    end
    return ex
end  # unsafe

the performance is pretty much the same

julia> @btime sort!(ZipN((xc,yc)), alg=QuickSort) setup=(xc=copy($x); yc=copy($y));  # copy instead of rand for fair comparison
  63.996 ms (0 allocations: 0 bytes)

julia> @btime sort!(StructArray((xc,yc)), alg=QuickSort) setup=(xc=copy($x); yc=copy($y));
  62.854 ms (0 allocations: 0 bytes)

Excellent, now the performances on Vectors match! There’s just one more problem: on views there are some strange allocations:

julia> begin n=10^6;  x=rand(1:n,2n); y=rand(-1:0.001:1,2n);
       @btime sort!(ZipN((view(xc,$(n:2n)),view(yc,$(n:2n)))), alg=QuickSort) setup=(xc=copy($x); yc=copy($y));
       @btime sort!(StructArray((view(xc,$(n:2n)),view(yc,$(n:2n)))), alg=QuickSort) setup=(xc=copy($x); yc=copy($y)); end;
  99.979 ms (4 allocations: 288 bytes)
  88.169 ms (0 allocations: 0 bytes)

It would be really nice if I could learn to detect problems with my code like you did above, to be able to improve it. I tried your approach:

I don’t know how to use this output. :( There are no occurrences of Any in it. Similarly for @code_typed and @code_llvm. ChatGPT and DeepSeek suggest the problem is with view, but if StructArray can avoid allocations, so should ZipN.

getindex and setindex! work fine for ZipN on views, but length (somehow) allocates:

julia> z = ZipN(([1.], [1])); zv = ZipN((view([1.], 1:1), view([1], 1:1)));

julia> @btime $zv[1]; @btime $zv[1] = (1., 1); 
  2.400 ns (0 allocations: 0 bytes)
  2.200 ns (0 allocations: 0 bytes)

julia> @btime length($z); @btime length($zv);
  2.200 ns (0 allocations: 0 bytes)
  25.904 ns (2 allocations: 144 bytes)

I’m really not sure why length(::ZipN) behaves differently for z and zv. StructArrays avoids problems by just using the length of the first component. If we want to use the minimum of the component lengths, we can again use a generated function to get around the issue.

Base.length(z::ZipN{0}) = 0
@generated function Base.length(z::ZipN{N}) where N
    min_args = (:(length(z.vec[$j])) for j = 1:N)
    return :(min($(min_args...)))
end
julia> @btime length($z); @btime length($zv);
  2.200 ns (0 allocations: 0 bytes)
  2.200 ns (0 allocations: 0 bytes)

julia> n=10^6; x=rand(1:n,2n); y=rand(-1:0.001:1,2n);

julia> @btime sort!(StructArray((view(xc,$(n:2n)),view(yc,$(n:2n)))), alg=QuickSort) setup=(xc=copy($x); yc=copy($y));
  67.220 ms (0 allocations: 0 bytes)

julia> @btime sort!(ZipN((view(xc,$(n:2n)),view(yc,$(n:2n)))), alg=QuickSort) setup=(xc=copy($x); yc=copy($y));
  66.451 ms (0 allocations: 0 bytes)

Edit: Base.length(z::ZipN) = minimum(length(v) for v in z.vec) (cf. your original Zip definition) also already works.
