Sort array based on another array?

Hello

I am post-processing some particle simulation data and for an example I would get the following two arrays:

Rhop_vec[50]
125749-element Array{Float32,1}:
 1007.7992
 1014.8219
 1011.84064
 1019.48114
 1011.2265
 1015.946
    ⋮
 1012.1963
 1016.4518
 1006.3834
 1014.0597
 1007.33563
 1021.40027
Idp_vec[50]
125749-element Array{Int32,1}:
      0
      1
      2
    251
    253
    255
      ⋮
 124256
 124505
 125003
 124754
 125749
 125750

Where “Idp_vec” is the vector holdning the index value of each particle starting from 0 to N, while “Rhop_vec” holds the corresponding density for each particle. So for an example for the first element I would have Idp = 0 and Rhop = 1007.7992.

Now comes the tricky part. Sometimes I might only want to look at particles from the range of Idp of 0 to 1500. Since the Idp_vec is clearly unstructured I have two options:

  1. Sort Idp_vec from lowest to highest and then enforce same sorting on Rhop_vec
  2. Find the specific indices directly and correlate with Rhop_vec

I have no clue which one is fastest / most efficient, but I have a hard time figuring out how to for an example implement option 1 efficiently. Currently what I am doing is the same as in the documentation, using sortperm to save an array of the indices after sorting so I can transform the Rhop_vec array correspondingly, but wondered if there is a smarter way.

For the second option, I don’t know if it is viable, but I think it might be more efficient since I do not have to make any temporary arrays?

I am finishing up my own solution, but would like to hear about your past experiences.

Kind regards

There’s certainly a O(N) solution, if the number of particle is known in advance:
create a new vector Rhop_aux = Vector{Float32}(undef, maximum(Idp_vec))
fill it as Rhop_aux[Idp[i]+1] = Rhop[i]
now, Rhop_aux[i] is the Rhop whose corresponding Idp is i-1.

A sorted index with a binary search for the lookup would work, but if you want to do the search often using the same data, sorting the actual values might be faster. Maybe it can be sorted in a tree structure for faster lookup.

Not sure this is faster, but I sometime make a Dict for lookup and use sort’s by.

val_lookup = Dict(Pair.(id,val))
sort(id, by=x->val_lookup[x])

now I think you can also directly find specific indices thru the dictionary.

getindex.(Ref(val_lookup),[1,3,5,7,8])

Thanks for all of your suggestions guys!

My own way was doing it like this:

function readBi4Body(Body,typ)
    start = getfield(Body, :beg)+1    #First idp
    move  = start+getfield(Body, :count)-1  #Number of particles from first idp
    @time idp_vec  = readBi4Array(Idp)
    @time val_vec  = readBi4Array(typ)

    j = similar(val_vec)

    for i = 1:length(pq)
        j[i] = val_vec[i][sortperm(idp_vec[i])][start:move]
    end

    return j
end

Ignore the start, move fields. Basically I am reading my idp data and my rhop data (in val_vec), creating a new array similar to for an example rhop_vec in j and then looping through to make sure that the values of val_vec move according to the sortperm of idp_vec and I am getting results as I want.

Regarding speed it is taking me 0.65 seconds to load data in (using readBi4Array) and then about 0.1 seconds for the sorting I asked about here as seen in the @time:

@time k = readBi4Body(Bodies[2][1],Points);
  0.146138 seconds (6.30 k allocations: 119.800 MiB, 29.87% gc time)
  0.532087 seconds (8.85 k allocations: 370.412 MiB, 40.51% gc time)
  0.756527 seconds (157.40 k allocations: 509.116 MiB, 34.27% gc time)

I might test Vasily’s solution in the future but for now I think this is the best way for me.

Thank you once again for your suggestions, hope they might help others in the future as well.

Kind regards

I would say that this is the best solution in most cases.

If this is a bottleneck in your code, what is optimal depends on how many times you do lookups for one pair of vectors, ie whether making them faster is worth the cost of sorting.

It takes about 3 seconds in total to go through 1 GB of files (earlier benchmarks were a bit wrong), so for now it is not a performance bottleneck, but testing bigger data sets might show another picture. Thanks for your comment.

Kind regards

In theory it should be possible to wrap your two arrays into a light structure, pass it to sort! and get both Arrays sorted at the same time. You would just need to overload get/setindex an <. I gave it a quick try but I cannot add a method to isnan for some reason. I’m also not sure that solution would be optimal (we might have some copying going on).

struct CoSort{T,N} <: AbstractArray{T,N}
    x::Array{T,N}
    y::Array{T,N}
end

Base.size(x::CoSort) = size(x.x)
Base.getindex(x::CoSort,i) = (x.x[i], x.y[i])

function Base.setindex!(x::CoSort, v, i) 
    x.x[i] = v[1]
    x.y[i] = v[2]
end

Base.Sort.isnan(o::Base.Order.Ordering, x::Tuple{T,T}) where T = isnan(x[1]) || isnan(x[2])

x = CoSort(rand(3), rand(3))
sort!(x)

>ERROR: MethodError: no method matching isnan(::Base.Order.ForwardOrdering, ::Tuple{Float64,Float64})

That also looks a lot like a DataFrame…

Sorting both arrays at the same time, seems like a waste when I can sort one, and provide its indices to the other array, or did I misunderstand your procedure?

My idea is that the second array would just passively get sorted as the first one gets sorted (< would depend only on the first array). But it’s true that if elements gets moved several times it might be slower than using indices, so it’s probably a bad idea for some sorting algorithms.

Thanks for clarifying :slight_smile: