I have two matrices in Julia, both coded with the values 0, 1, and 2, and I want to exclude the columns of matrix B from the columns of matrix A.
In the end, I will have a new matrix of size 28960x41227.
How can I do it?
Thanks.
Here is a small part of my data.
If A is a 15x15 matrix and B is a 15x5 matrix, how can I remove the 5 columns of B from the columns of A?
So my desired matrix will be C (15x10).
I think you mean that you want to filter the columns of B out of A. There are likely more succinct ways of doing this, but here’s a straightforward, if inefficient, low-level solution.
keep = Int[]
for i in axes(A, 2)
    matched = false
    for j in axes(B, 2)
        if A[:, i] == B[:, j]
            matched = true   # Match, do not keep column i
            break
        end
    end
    matched || push!(keep, i)   # No match in any column of B, keep column i
end
A = A[:, keep]
I look forward to seeing more concise solutions from Julia experts.
This is a bit inefficient for large matrices because it scales as the product of the number of columns of A and B. It might be better to construct a Set of B’s columns so that you can use fast hash-table lookup:
Bcols = Set(eachcol(B))
C = reduce(hcat, collect(c for c in eachcol(A) if c ∉ Bcols))
(I use collect here because of issue #31636: Julia already has an optimized reduce(hcat, x) method but only for the case where x is an array of arrays.) Alternatively, you could do
Bcols = Set(eachcol(B))
C = A[:, @views [i for i in axes(A,2) if A[:,i] ∉ Bcols]]
function filtercols(A, B)
    Bcols = Set(eachcol(B))
    return A[:, @views [i for i in axes(A,2) if A[:,i] ∉ Bcols]]
end
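As a quick check against the 15x15 and 15x5 example from the question, here is a minimal sketch. The matrices are hypothetical stand-ins (not the poster's data), built so that all columns of A are distinct, and the function is repeated (with `view` written out explicitly) only so that the snippet runs on its own.

```julia
# The thread's function, repeated so the example is self-contained.
function filtercols(A, B)
    Bcols = Set(eachcol(B))
    return A[:, [i for i in axes(A, 2) if view(A, :, i) ∉ Bcols]]
end

# Hypothetical stand-in data: column j holds the base-3 digits of j,
# so all 15 columns are distinct and every entry is 0, 1, or 2.
A = reduce(hcat, [digits(j, base=3, pad=15) for j in 1:15])
B = A[:, [2, 5, 7, 11, 14]]   # the 5 columns to exclude

C = filtercols(A, B)
size(C)   # (15, 10)
```

Because the columns of this toy A are all different, exactly the 5 chosen columns are removed and C keeps the remaining 10 in their original order.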
(In general, you might want to step back and re-think the data structures you are using—matrices might not be the best choice depending on the operations you want to perform. e.g. you might want to store B as a Set of columns to begin with.)
Realize that this is almost 10GB of memory. If the entries of the matrix are only 0, 1, and 2, you can save a factor of 8 in memory (and also speed things up) simply by using an array of Int8 (8-bit integer) values instead of Float64 values.
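To make the memory arithmetic concrete, here is a small sketch. The tiny matrix is a made-up example, and the full-size figures assume the 28960x41227 dimensions mentioned in this thread.

```julia
# Hypothetical tiny genotype matrix stored as Float64.
A64 = Float64[0 1 2; 2 0 1]

# Element-wise conversion; this is exact because every entry is 0, 1, or 2.
A8 = Int8.(A64)

sizeof(A64)   # 48 bytes (6 entries x 8 bytes)
sizeof(A8)    # 6 bytes  (6 entries x 1 byte)

# Estimated storage for a 28960 x 41227 matrix, in GB (10^9 bytes).
n, m = 28960, 41227
gb_float64 = n * m * sizeof(Float64) / 1e9   # about 9.55
gb_int8    = n * m * sizeof(Int8) / 1e9      # about 1.19
```

If the data are read directly into an `Int8` array, the large `Float64` copy never has to exist in the first place.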
But again, the fact that you are working with this much data really suggests that you should put some thought into the data structures you are using. It is impossible to give you more detailed advice without knowing more about the underlying problem you are trying to solve.
Thank you for the suggestion; I would love to hear more suggestions from you all.
Here is some information about what I am doing.
This matrix is made up of individuals (rows) and SNPs (columns).
The original file is a BED file from PLINK, which has individual IDs (rows) and SNP IDs (columns). I then converted it into a Float64 matrix in Julia. The next step is to fit this matrix (28960x41227) into the mixed model equations and calculate SNP effects.
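function A_except_B(A, B)
    a2, b2 = size(A, 2), size(B, 2)
    # matches[i, j] is true when column i of A equals column j of B
    matches = [view(B, :, j) == view(A, :, i) for i in 1:a2, j in 1:b2]
    inB = vec(any(matches, dims=2))   # column i of A appears somewhere in B
    return A[:, .!inB]
end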
For large matrices (the OP said A had 45807 cols and B had 4580 cols) this will be slow and will also allocate a large matrix of booleans, because the time and memory of your solution scales like the product of the number of columns in A with the number of columns in B.
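The scaling argument can be put into numbers with a short sketch, using the column counts quoted above. These are operation counts, not timings, and the variable names are only illustrative.

```julia
a2, b2 = 45807, 4580   # columns of A and of B, as quoted in the thread

# Pairwise approach: one column comparison per (i, j) pair, and one Bool
# (1 byte in a Julia Array{Bool}) stored per pair.
pairwise_comparisons = a2 * b2            # 209_796_060
bool_matrix_mb = a2 * b2 / 1e6            # about 210 MB

# Set-based approach: hash each column of B once, then look up each column of A once.
hash_operations = a2 + b2                 # 50_387
```

Each individual comparison can stop at the first differing entry, so actual run times need not differ by the same ratio as these counts.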
@stevengj, please find below test results for large matrices, as per the OP, run on a regular Win10 laptop with Julia 1.6.0-beta1.0, confirming your speed predictions:
using Random, BenchmarkTools
A = rand(0:2, 28960, 45807);
B = A[:,vec(rand(1:45807,1,4580))];
@btime A_except_B($A,$B) # 19.0 s (21 allocations: 9.1 GiB)
@btime filtercols1($A,$B) # 9.3 s (45833 allocations: 9.0 GiB)
@btime filtercols2($A,$B) # 11.5 s (26 allocations: 9.0 GiB)
I used
function filtercols(A, B)
    Bcols = Set(eachcol(B))
    return A[:, @views [i for i in axes(A,2) if A[:,i] ∉ Bcols]]
end
to filter the columns of matrix C from matrix D.
The size of C is 1104x4580 and the size of D is 1104x45807,
so the filtered matrix should have size 1104x41227, but I got a filtered matrix of size 1104x41192.
Where did the 41227 - 41192 = 35 columns go, and how can I fix it?
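One possible explanation, not confirmed in this thread, is that `filtercols` compares columns by their values, so any column of D whose entries happen to coincide with a column of C is removed too, even if it is a different SNP. The sketch below uses made-up toy data to show the effect, counts duplicate columns, and filters by column position instead. The names `removed`, `keep_by_value`, and `keep_by_index` are illustrative only.

```julia
# Hypothetical toy data: columns 2 and 4 of D hold identical values.
D = [0 1 2 1 0;
     1 2 0 2 2;
     2 0 1 0 1]
removed = [2]          # positions of the columns we actually want to drop
C = D[:, removed]

# Value-based filtering (what filtercols does): column 4 is dropped as well.
Ccols = Set(eachcol(C))
keep_by_value = [i for i in axes(D, 2) if view(D, :, i) ∉ Ccols]   # [1, 3, 5]

# Number of columns of D that duplicate an earlier column.
ndup = size(D, 2) - length(unique(eachcol(D)))                     # 1

# Position-based filtering keeps every column that was not explicitly listed.
keep_by_index = setdiff(axes(D, 2), removed)                       # [1, 3, 4, 5]
filtered = D[:, keep_by_index]
```

If the columns to drop are known by SNP ID or position rather than by content, filtering on those IDs or positions avoids losing columns that merely share the same values.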