How to filter columns of a matrix by columns of another matrix

Hi

I have two matrics in Julia, both matrics are coded by 0,1,2. and I want to exclude the columns of matrix B from columns of matrix A.
at the end, I will have new matrix with ( 28960x 41227).
how can I do it ?
thanks :slight_smile:

matrix A 28960×45807 Array{Float64,2}:

matrix B 28960×4580 view(::Array{Float64,2}, :, [5, 11, 12, 29, 64, 80, 94, 127, 136, 141 … 45707, 45710, 45718, 45732, 45741, 45771, 45773, 45788, 45791, 45807]) with eltype Float64:

Could you give a small example of A and B inputs and the desired output?

2 Likes

here is small part of my data.
if A is 15x15 matrix and B is 15x5 matrix, how can I sort 5 columns of B from columns of A?
so my desired matrix will be C (15x10)

15×15 Array{Float64,2}:
0.0 2.0 2.0 2.0 0.0 0.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
1.0 1.0 2.0 2.0 1.0 1.0 2.0 1.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
1.0 1.0 1.0 2.0 2.0 2.0 2.0 1.0 1.0 2.0 1.0 1.0 2.0 1.0 1.0
1.0 1.0 2.0 1.0 2.0 2.0 1.0 1.0 2.0 2.0 2.0 2.0 2.0 0.0 2.0
2.0 0.0 2.0 1.0 2.0 2.0 2.0 2.0 2.0 0.0 1.0 0.0 1.0 2.0 2.0
2.0 0.0 2.0 2.0 2.0 2.0 2.0 2.0 0.0 2.0 0.0 2.0 2.0 0.0 2.0
2.0 0.0 2.0 1.0 2.0 2.0 2.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 2.0
2.0 0.0 2.0 2.0 2.0 2.0 2.0 2.0 1.0 1.0 0.0 1.0 2.0 1.0 2.0
2.0 1.0 1.0 1.0 2.0 2.0 1.0 2.0 2.0 2.0 2.0 1.0 2.0 1.0 2.0
2.0 0.0 1.0 2.0 1.0 1.0 2.0 1.0 1.0 2.0 1.0 1.0 2.0 1.0 2.0
2.0 0.0 2.0 2.0 2.0 2.0 2.0 1.0 1.0 2.0 1.0 2.0 2.0 1.0 2.0
0.0 1.0 2.0 1.0 2.0 2.0 1.0 1.0 2.0 1.0 2.0 1.0 2.0 1.0 2.0
0.0 2.0 2.0 2.0 1.0 1.0 2.0 2.0 2.0 1.0 2.0 2.0 1.0 2.0 1.0
2.0 0.0 2.0 2.0 2.0 2.0 2.0 2.0 1.0 1.0 1.0 2.0 2.0 0.0 1.0
2.0 0.0 2.0 2.0 2.0 2.0 2.0 1.0 2.0 1.0 1.0 2.0 2.0 2.0 2.0

15×5 Array{Float64,2}:
0.0 2.0 2.0 2.0 2.0
1.0 2.0 2.0 1.0 1.0
2.0 1.0 1.0 2.0 0.0
2.0 2.0 2.0 1.0 1.0
2.0 1.0 0.0 1.0 1.0
2.0 0.0 2.0 2.0 0.0
2.0 1.0 1.0 2.0 0.0
2.0 0.0 1.0 1.0 1.0
2.0 2.0 1.0 2.0 2.0
1.0 1.0 1.0 2.0 0.0
2.0 1.0 2.0 1.0 1.0
2.0 2.0 1.0 1.0 2.0
1.0 2.0 2.0 2.0 2.0
2.0 1.0 2.0 2.0 1.0
2.0 1.0 2.0 2.0 2.0

Sorry, but I think we still mean more clarity.

If there are 5 columns in B, and 15 columns in A, how are you sorting one by the other?

1 Like

sorry for unclear message, the matrix B is part of matrix A. so I want to exlude the column of matrix B from A, to get new matrix.

I think you mean to say you want to filter the columns of B from A. There are likely more succinct ways of doing this, but here’s a straight-forward but inefficient low-level solution.

keep=Int64[]
for i=1:size(A,2)
         for j=1:size(B,2)
           if A[:,i]==B[:,j]
               break  # Match, do not keep column i
           end
           push!(keep,i)  # No match, keep column i
         end
end
A = A[:,keep]

I look forward to seeing more concise solutions for Julia experts.

2 Likes

thank you for helping, I tried the command and maybe cause of my data size,
its running quite slow.

You can try @view A[:, i] == B[:, j] and see if that speeds things up

1 Like

This is a bit inefficient for large matrices because it scales as the product of the number of columns of A and B. It might be better to construct a Set of B’s columns so that you can use fast hash-table lookup:

Bcols = Set(eachcol(B))
C = reduce(hcat, collect(c for c in eachcol(A) if c ∉ Bcols))

(I use collect here because of issue #31636: Julia already has an optimized reduce(hcat, x) method but only for the case where x is an array of arrays.) Alternatively, you could do

Bcols = Set(eachcol(B))
C = A[:, @views [i for i in axes(A,2) if A[:,i] ∉ Bcols]]

which might be faster than reduce?

As usual, if you want to benchmark these it is best to first put the code into a function:

function filtercols(A, B)
    Bcols = Set(eachcol(B))
    return A[:, @views [i for i in axes(A,2) if A[:,i] ∉ Bcols]]
end

(In general, you might want to step back and re-think the data structures you are using—matrices might not be the best choice depending on the operations you want to perform. e.g. you might want to store B as a Set of columns to begin with.)

5 Likes

many thank you guys for helping.
both commands works for my large matrices, and both of them are quite fast.

Bcols = Set(eachcol(B))
C = reduce(hcat, collect(c for c in eachcol(A) if c ∉ Bcols))

Bcols = Set(eachcol(B))
C = A[:, @views [i for i in axes(A,2) if A[:,i] ∉ Bcols]]

Realize that this is almost 10GB of memory. If the entries of the matrix are only 0, 1, and 2, you can save a factor of 8 in memory (and also speed things up) simply by using an array of Int8 (8-bit integer) values instead of Float64 values.

But again, the fact that you are working with this much data really suggests that you should put some thought into the data structures you are using. It is impossible to give you more detailed advice without knowing more about the underlying problem you are trying to solve.

2 Likes

thank you for suggestion and would love to hear more suggestion from you guys.

here is some info of what I am doing.
this matrix is construted by number of individuals(rows) and number of snps(columns).
the original file is bed file from plink, which has individuals ID(row) and snp ID(column). then I converted into Matrix(Float64) in Julia. the next step is to fit this matrix(28960x41227) into mixed model equation and calculate snp effects,

One alternative that seems to perform alright:

function A_except_B(A,B)
    a2, b2 = size(A,2), size(B,2)
    [@views  B[:,j] == A[:,i] for i in 1:a2, j in 1:b2] |> ix -> maximum(ix,dims=2) |> iy -> A[:, vec(.!iy)]
end

For large matrices (the OP said A had 45807 cols and B had 4580 cols) this will be slow and will also allocate a large matrix of booleans, because the time and memory of your solution scales like the product of the number of columns in A with the number of columns in B.

@stevengj, please find here below test results for large matrices as per OP, run on regular Win10 laptop and Julia 1.6.0-beta1.0, confirming your speed predictions:

using Random, BenchmarkTools
A = rand(0:2, 28960, 45807);
B = A[:,vec(rand(1:45807,1,4580))];

@btime A_except_B($A,$B)    # 19.0 s (21 allocations: 9.1 GiB)
@btime filtercols1($A,$B)   #  9.3 s (45833 allocations: 9.0 GiB)
@btime filtercols2($A,$B)   # 11.5 s (26 allocations: 9.0 GiB)

The last two functions are your solutions.

PS: input as per @Jeff_Emanuel’s suggestion

3 Likes

To match the OP’s case, I think you want to create B with a random selection of 10% of A’s columns.

1 Like

Hi,

I used the the function

function filtercols(A, B)
Bcols = Set(eachcol(B))
return A[:, @views [i for i in axes(A,2) if A[:,i] ∉ Bcols]]
end

to filter the columns of matrix C from matrix D
size of C = 1104x4580 and size of D = 1104x45807
so the filtered matrix should have size of 1104x41227, but I got filtered matrix with1104x41192
where did 41227 - 41192 = 35 columns go ? and how can I fix it ?

many thanks in advance!

Presumably those 35 columns are columns of C that appeared more than once in D.

thank you for the answer, how can I get that 35 column in the filtered matrix ?

what about to check only the first n (1000 for intance) elements of each column?

function filtercolsr(A, B)
    Bcols = Set(eachcol(@view B[1:1000, :]))
    return A[:, @views [i for i in axes(A,2) if A[1:1000,i] ∉ Bcols]]
end

and then filter on the remaining table?

1 Like