Sort array based on another array?

Ahmed_Salih · July 26, 2019, 4:29pm

Hello

I am post-processing some particle simulation data and for an example I would get the following two arrays:

Rhop_vec[50]
125749-element Array{Float32,1}:
 1007.7992
 1014.8219
 1011.84064
 1019.48114
 1011.2265
 1015.946
    ⋮
 1012.1963
 1016.4518
 1006.3834
 1014.0597
 1007.33563
 1021.40027

Idp_vec[50]
125749-element Array{Int32,1}:
      0
      1
      2
    251
    253
    255
      ⋮
 124256
 124505
 125003
 124754
 125749
 125750

Where “Idp_vec” is the vector holdning the index value of each particle starting from 0 to N, while “Rhop_vec” holds the corresponding density for each particle. So for an example for the first element I would have Idp = 0 and Rhop = 1007.7992.

Now comes the tricky part. Sometimes I might only want to look at particles from the range of Idp of 0 to 1500. Since the Idp_vec is clearly unstructured I have two options:

Sort Idp_vec from lowest to highest and then enforce same sorting on Rhop_vec
Find the specific indices directly and correlate with Rhop_vec

I have no clue which one is fastest / most efficient, but I have a hard time figuring out how to for an example implement option 1 efficiently. Currently what I am doing is the same as in the documentation, using sortperm to save an array of the indices after sorting so I can transform the Rhop_vec array correspondingly, but wondered if there is a smarter way.

For the second option, I don’t know if it is viable, but I think it might be more efficient since I do not have to make any temporary arrays?

I am finishing up my own solution, but would like to hear about your past experiences.

Kind regards

Vasily_Pisarev · July 26, 2019, 5:15pm

There’s certainly a O(N) solution, if the number of particle is known in advance:
create a new vector Rhop_aux = Vector{Float32}(undef, maximum(Idp_vec))
fill it as Rhop_aux[Idp[i]+1] = Rhop[i]
now, Rhop_aux[i] is the Rhop whose corresponding Idp is i-1.

feanor12 · July 26, 2019, 5:17pm

A sorted index with a binary search for the lookup would work, but if you want to do the search often using the same data, sorting the actual values might be faster. Maybe it can be sorted in a tree structure for faster lookup.

cchderrick · July 26, 2019, 5:53pm

Not sure this is faster, but I sometime make a Dict for lookup and use sort’s by.

val_lookup = Dict(Pair.(id,val))
sort(id, by=x->val_lookup[x])

now I think you can also directly find specific indices thru the dictionary.

getindex.(Ref(val_lookup),[1,3,5,7,8])

Ahmed_Salih · July 26, 2019, 6:09pm

Thanks for all of your suggestions guys!

My own way was doing it like this:

function readBi4Body(Body,typ)
    start = getfield(Body, :beg)+1    #First idp
    move  = start+getfield(Body, :count)-1  #Number of particles from first idp
    @time idp_vec  = readBi4Array(Idp)
    @time val_vec  = readBi4Array(typ)

    j = similar(val_vec)

    for i = 1:length(pq)
        j[i] = val_vec[i][sortperm(idp_vec[i])][start:move]
    end

    return j
end

Ignore the start, move fields. Basically I am reading my idp data and my rhop data (in val_vec), creating a new array similar to for an example rhop_vec in j and then looping through to make sure that the values of val_vec move according to the sortperm of idp_vec and I am getting results as I want.

Regarding speed it is taking me 0.65 seconds to load data in (using readBi4Array) and then about 0.1 seconds for the sorting I asked about here as seen in the @time:

@time k = readBi4Body(Bodies[2][1],Points);
  0.146138 seconds (6.30 k allocations: 119.800 MiB, 29.87% gc time)
  0.532087 seconds (8.85 k allocations: 370.412 MiB, 40.51% gc time)
  0.756527 seconds (157.40 k allocations: 509.116 MiB, 34.27% gc time)

I might test Vasily’s solution in the future but for now I think this is the best way for me.

Thank you once again for your suggestions, hope they might help others in the future as well.

Kind regards

Tamas_Papp · July 27, 2019, 6:23am

I would say that this is the best solution in most cases.

If this is a bottleneck in your code, what is optimal depends on how many times you do lookups for one pair of vectors, ie whether making them faster is worth the cost of sorting.

Ahmed_Salih · July 27, 2019, 10:38am

It takes about 3 seconds in total to go through 1 GB of files (earlier benchmarks were a bit wrong), so for now it is not a performance bottleneck, but testing bigger data sets might show another picture. Thanks for your comment.

Kind regards

jonathanBieler · July 27, 2019, 1:33pm

In theory it should be possible to wrap your two arrays into a light structure, pass it to sort! and get both Arrays sorted at the same time. You would just need to overload get/setindex an <. I gave it a quick try but I cannot add a method to isnan for some reason. I’m also not sure that solution would be optimal (we might have some copying going on).

struct CoSort{T,N} <: AbstractArray{T,N}
    x::Array{T,N}
    y::Array{T,N}
end

Base.size(x::CoSort) = size(x.x)
Base.getindex(x::CoSort,i) = (x.x[i], x.y[i])

function Base.setindex!(x::CoSort, v, i) 
    x.x[i] = v[1]
    x.y[i] = v[2]
end

Base.Sort.isnan(o::Base.Order.Ordering, x::Tuple{T,T}) where T = isnan(x[1]) || isnan(x[2])

x = CoSort(rand(3), rand(3))
sort!(x)

>ERROR: MethodError: no method matching isnan(::Base.Order.ForwardOrdering, ::Tuple{Float64,Float64})

That also looks a lot like a DataFrame…

Ahmed_Salih · July 27, 2019, 2:02pm

Sorting both arrays at the same time, seems like a waste when I can sort one, and provide its indices to the other array, or did I misunderstand your procedure?

jonathanBieler · July 27, 2019, 2:19pm

My idea is that the second array would just passively get sorted as the first one gets sorted (< would depend only on the first array). But it’s true that if elements gets moved several times it might be slower than using indices, so it’s probably a bad idea for some sorting algorithms.

Ahmed_Salih · July 27, 2019, 3:20pm

Thanks for clarifying

Wesley_Coelho · October 27, 2020, 12:10pm

Hey, here’s a simple approach:

aux_vec = []
while length(aux_vec) != length(vec_to_be_sorted) 
    greater_value = Inf
    position = 1
    for v in vec_to_be_sorted
        if greater_value  > based_vec[v] && !(v ∈ aux_vec ) 
            greater_value  = based_vec[v]
            position = v  
        end
    end
    push!(aux_vec, position)
end
vec_to_be_sorted = aux_vec

Henrique_Becker · October 27, 2020, 1:43pm

I think you wrote greater_number when you wanted to write greater_value in the if.

Wesley_Coelho · October 27, 2020, 2:20pm

You’re right. It’s now fixed. Thank you.

leoflotor · March 16, 2023, 2:42am

I found that this is pretty useful, and easy, if a bunch of arrays are to be sorted depending on another array of reference. It might not be the fastest solution but a very convenient one to have.

Topic		Replies	Views
How to sort an array based on another on GPU (CUDA) efficiently? GPU	3	328	March 17, 2024
Sorting a StructArray variable faster and how to update values? Performance sort , structarrays	6	212	April 7, 2024
Sort an array according other array General Usage question , sort , sortperm , arrays	6	483	September 17, 2021
Is there a more elegant solution than `sortperm(sortperm(...))` Performance question	6	517	February 22, 2020
Faster Sorting With GPU? New to Julia gpu , sort , arrays	25	2311	January 17, 2018

Sort array based on another array?

Related topics