Sort array based on another array?


I am post-processing some particle simulation data and for an example I would get the following two arrays:

125749-element Array{Float32,1}:
125749-element Array{Int32,1}:

Where “Idp_vec” is the vector holdning the index value of each particle starting from 0 to N, while “Rhop_vec” holds the corresponding density for each particle. So for an example for the first element I would have Idp = 0 and Rhop = 1007.7992.

Now comes the tricky part. Sometimes I might only want to look at particles from the range of Idp of 0 to 1500. Since the Idp_vec is clearly unstructured I have two options:

  1. Sort Idp_vec from lowest to highest and then enforce same sorting on Rhop_vec
  2. Find the specific indices directly and correlate with Rhop_vec

I have no clue which one is fastest / most efficient, but I have a hard time figuring out how to for an example implement option 1 efficiently. Currently what I am doing is the same as in the documentation, using sortperm to save an array of the indices after sorting so I can transform the Rhop_vec array correspondingly, but wondered if there is a smarter way.

For the second option, I don’t know if it is viable, but I think it might be more efficient since I do not have to make any temporary arrays?

I am finishing up my own solution, but would like to hear about your past experiences.

Kind regards

There’s certainly a O(N) solution, if the number of particle is known in advance:
create a new vector Rhop_aux = Vector{Float32}(undef, maximum(Idp_vec))
fill it as Rhop_aux[Idp[i]+1] = Rhop[i]
now, Rhop_aux[i] is the Rhop whose corresponding Idp is i-1.

A sorted index with a binary search for the lookup would work, but if you want to do the search often using the same data, sorting the actual values might be faster. Maybe it can be sorted in a tree structure for faster lookup.

Not sure this is faster, but I sometime make a Dict for lookup and use sort’s by.

val_lookup = Dict(Pair.(id,val))
sort(id, by=x->val_lookup[x])

now I think you can also directly find specific indices thru the dictionary.


Thanks for all of your suggestions guys!

My own way was doing it like this:

function readBi4Body(Body,typ)
    start = getfield(Body, :beg)+1    #First idp
    move  = start+getfield(Body, :count)-1  #Number of particles from first idp
    @time idp_vec  = readBi4Array(Idp)
    @time val_vec  = readBi4Array(typ)

    j = similar(val_vec)

    for i = 1:length(pq)
        j[i] = val_vec[i][sortperm(idp_vec[i])][start:move]

    return j

Ignore the start, move fields. Basically I am reading my idp data and my rhop data (in val_vec), creating a new array similar to for an example rhop_vec in j and then looping through to make sure that the values of val_vec move according to the sortperm of idp_vec and I am getting results as I want.

Regarding speed it is taking me 0.65 seconds to load data in (using readBi4Array) and then about 0.1 seconds for the sorting I asked about here as seen in the @time:

@time k = readBi4Body(Bodies[2][1],Points);
  0.146138 seconds (6.30 k allocations: 119.800 MiB, 29.87% gc time)
  0.532087 seconds (8.85 k allocations: 370.412 MiB, 40.51% gc time)
  0.756527 seconds (157.40 k allocations: 509.116 MiB, 34.27% gc time)

I might test Vasily’s solution in the future but for now I think this is the best way for me.

Thank you once again for your suggestions, hope they might help others in the future as well.

Kind regards

I would say that this is the best solution in most cases.

If this is a bottleneck in your code, what is optimal depends on how many times you do lookups for one pair of vectors, ie whether making them faster is worth the cost of sorting.

It takes about 3 seconds in total to go through 1 GB of files (earlier benchmarks were a bit wrong), so for now it is not a performance bottleneck, but testing bigger data sets might show another picture. Thanks for your comment.

Kind regards

In theory it should be possible to wrap your two arrays into a light structure, pass it to sort! and get both Arrays sorted at the same time. You would just need to overload get/setindex an <. I gave it a quick try but I cannot add a method to isnan for some reason. I’m also not sure that solution would be optimal (we might have some copying going on).

struct CoSort{T,N} <: AbstractArray{T,N}

Base.size(x::CoSort) = size(x.x)
Base.getindex(x::CoSort,i) = (x.x[i], x.y[i])

function Base.setindex!(x::CoSort, v, i) 
    x.x[i] = v[1]
    x.y[i] = v[2]

Base.Sort.isnan(o::Base.Order.Ordering, x::Tuple{T,T}) where T = isnan(x[1]) || isnan(x[2])

x = CoSort(rand(3), rand(3))

>ERROR: MethodError: no method matching isnan(::Base.Order.ForwardOrdering, ::Tuple{Float64,Float64})

That also looks a lot like a DataFrame…

Sorting both arrays at the same time, seems like a waste when I can sort one, and provide its indices to the other array, or did I misunderstand your procedure?

My idea is that the second array would just passively get sorted as the first one gets sorted (< would depend only on the first array). But it’s true that if elements gets moved several times it might be slower than using indices, so it’s probably a bad idea for some sorting algorithms.

Thanks for clarifying :slight_smile:

Hey, here’s a simple approach:

aux_vec = []
while length(aux_vec) != length(vec_to_be_sorted) 
    greater_value = Inf
    position = 1
    for v in vec_to_be_sorted
        if greater_value  > based_vec[v] && !(v ∈ aux_vec ) 
            greater_value  = based_vec[v]
            position = v  
    push!(aux_vec, position)
vec_to_be_sorted = aux_vec 

I think you wrote greater_number when you wanted to write greater_value in the if.

1 Like

You’re right. It’s now fixed. Thank you.