Sorting a matrix according to a matrix of column-wise sorted indices

kingoslo · September 17, 2023, 9:53am

I have a matrix that looks like this:

A = rand( 40000000,30000000)

I have already used measurable computational power to find the indices that will sort it column-wise:

idx = mapslices(sortperm, A, dims=1)

Now I want to use these indices in the fastest possible way to get the matrix with the columns sorted in that order. How can I do that?

stevengj · September 17, 2023, 7:34pm

Your matrix is obviously not that big; it would be more helpful to give actual runnable examples. Also, you should quote your code. Please read: make it easier to help you

Probably faster just to sort the columns directly, in-place, instead of computing the permutation indices and then applying them. e.g.

for col in eachcol(A)
    sort!(col)
end

or foreach(sort!, eachcol(A)).

kingoslo · September 17, 2023, 8:34pm

I don’t know how big the matrix will be, but currently I’m using ~1TB of memory. I can’t tell you the exact dimensions: (1) There could be an appetite to increase the size twentyfive- or hundred-folds. (b) I don’t know the data type exactly. But if the exact number of zeros determines the solution you would recommend, you could say it is A = rand(10000000,10000000) to get a physical example.

I had to calculate the sortperm for some other routine, so I thought I could reuse it to save time.

stevengj · September 17, 2023, 8:47pm

No, you couldn’t — that would require 800 TB of memory. To get useful advice it would be better to give realistic information about your application.

Do you actually have that much RAM? Or are you working “out-of-core” with data stored in files or swap memory? (Big difference.)

I had to calculate the sortparm for some other routine, so I thought I could reuse it to save time.

If your dataset is truly enormous, you really need to be careful about how many passes over the array you make and how much extraneous data you allocate. Your thinking needs to fundamentally change when your data gets this big.

If possible you should avoid allocating an entire array (terabytes?) of all your column permutations at once, and then separately make a pass over the array or other data in order to use those permutations. For example, you could loop over your data 1 column at a time and do as much computation as possible before moving onto the next column: compute the permutation and sort the column, do whatever you need with the permutation indices, and then discard those indices when you move onto the next column (so that you make a single pass over the data, and don’t store the permuations for all columns simultaneously).

But it’s hard to give you more useful advice without knowing more context about what you are doing.

kingoslo · September 17, 2023, 8:48pm

I have to do it in this way unfortunately. If someone else wants to suggest something, I’d be happy to give it a try.

Dan · September 17, 2023, 9:03pm

How about this way:

julia> A = rand(5,4)
5×4 Matrix{Float64}:
 0.213777  0.186819  0.233056  0.211086
 0.202914  0.708067  0.636107  0.965155
 0.355595  0.566363  0.585728  0.477823
 0.351841  0.399977  0.322975  0.725874
 0.216301  0.709295  0.552921  0.0545331

julia> idx = mapslices(sortperm, A, dims=1)
5×4 Matrix{Int64}:
 2  1  1  5
 1  4  4  1
 5  3  5  3
 4  2  3  4
 3  5  2  2

julia> B = [A[idx[i],j] for i=1:5,j=1:4]
5×4 Matrix{Float64}:
 0.202914  0.708067  0.636107  0.965155
 0.213777  0.186819  0.233056  0.211086
 0.216301  0.709295  0.552921  0.0545331
 0.351841  0.399977  0.322975  0.725874
 0.355595  0.566363  0.585728  0.477823

B will now be your desired matrix.

To make this transformation in-place, the following can be used:

for (v, p) in zip(eachcol(A), eachcol(idx))
    permute!(v, p)
end

Afterwhich, A will have columns sorted.
This can also be written succintly as:

foreach(permute!, eachcol(A), eachcol(idx))

As a sidenote, please spend some time to make the question easy-to-answer. Any work related issue preventing the disclosure of actual application, can be easily be overcome by making a toy example, with a little more effort. Perhaps the toy-example in this answer is what was needed.

stevengj · September 17, 2023, 9:26pm

If @kingoslo actually has 1TB of data (and wants to go 25–100× larger!) then a small toy example will likely be totally misleading for algorithm choices.

For such large problems you need to put a little more thought into describing the actual problem and some context, and can’t expect much more than general advice (since no one else will casually run or write code for problems that large just to help you).

Topic		Replies	Views
Can't get sortperm to return the sorting order of columns of a matrix General Usage question , sort , sortperm	11	1360	July 16, 2022
Sorting matrix according to the first column General Usage sorting , matrix	12	553	March 28, 2024
Permuting the rows of a matrix based on given a given permutation General Usage	13	3907	May 22, 2025
Ordering Of Matrix Columns General Usage sort , matrices	9	701	May 30, 2022
Sort matrix based on the elements of a specific column General Usage sort	8	15649	October 30, 2023

Sorting a matrix according to a matrix of column-wise sorted indices

Related topics