How to filter columns of a matrix by columns of another matrix

@rocco_sprmnt21, this seems to be a very nice idea: splitting the problem into two parts.

I tried the implementation below, which composes your first-pass function with @stevengj’s filtercols(). The performance numbers seem a bit inconsistent for these very large matrices; sometimes a 5-20% speed-up can be observed.

What would be the best way to implement the 2-step idea?

NB: edited filtercolsr() following advice from @stevengj

function filtercols(A,B)   # by @stevengj
    Bcols = Set(eachcol(B))   # hash B's columns once for O(1) membership tests
    A[:, @views [i for i in axes(A,2) if A[:,i] ∉ Bcols]]
end

@views function filtercolsr(A, B)    # edited from @rocco_sprmnt21
    Bcols = Set(eachcol(B[1:100, :]))   # compare only the first 100 rows
    A[:, [i for i in axes(A,2) if A[1:100,i] ∉ Bcols]]
end

g(A,B) = filtercols(filtercolsr(A,B),B)   # cheap prefix pass first, then the full pass

using Random, BenchmarkTools
A = rand(0:2, 28960, 45807);
B = A[:,vec(rand(1:45807,1,4580))];

@btime filtercols($A,$B)  # (single-pass) 8.721 s (26 allocations: 8.94 GiB)

@btime g($A,$B)  # (two passes) 8.317 s (50 allocations: 8.94 GiB)

I would like to make an even stronger suggestion.
If you limited your search to just the first step, how likely would it be to keep unwanted columns?
For example, if the check is limited to only the first 100 items, how many extra columns remain on average compared to the full check?
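
One way to put a number on this empirically (a quick sketch under the uniform random model; the sizes, h, and variable names here are illustrative, not from the thread):

using Random
h = 100
A = rand(0:2, 1000, 5000); B = A[:, rand(1:5000, 500)]
Bcolsh = Set(eachcol(@view B[1:h, :]))
Bcols  = Set(eachcol(B))
# columns that pass the prefix check but would fail the full check
falsepos = count(i -> @view(A[1:h, i]) ∈ Bcolsh && @view(A[:, i]) ∉ Bcols, axes(A, 2))

For uniform random data, falsepos should virtually always be 0, which is the point of the question above.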

It all depends on the type of input data we are looking at and its underlying statistics: whether it is expected to be a completely random sequence, or one where only occasional random differences are expected.

I would like to understand where the following reasoning is misleading, not applicable, or simply wrong:

Suppose, as in the model proposed here, that the arrangements of the triples (0, 1, 2) over, say, 1000 positions are all equally probable.
Consider the columns whose first 100 elements are all 0: of these columns {0, 0, …, 0, e_101, e_102, …, e_1000} there are 3^900, out of 3^1000 in total.
So the probability that a second column shares the same first-100 prefix as a given column is about 1/3^100, and by a union bound the probability of such a collision among m columns is less than m/3^100.
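
For scale, that collision probability is astronomically small:

(1/3)^100   # ≈ 1.94e-48: chance that two uniform columns share the same 100-entry prefix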

There are other scenarios. For example, a manufacturing machine that is supposed to always reproduce the same sequence of 28960 tasks but occasionally misbehaves. In this case all the columns are equal most of the time, but some will differ at the random positions where the machine misbehaved, for example performing some tasks out of tolerance.
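
A minimal sketch of generating data for that scenario (all sizes and fault rates here are illustrative assumptions):

using Random
template = rand(0:2, 28960)                 # the nominal task sequence
A = repeat(template, 1, 45807)              # every run copies the template...
for j in rand(1:45807, 100)                 # ...but ~100 runs misbehave
    A[rand(1:28960, 5), j] .= rand(0:2, 5)  # a few tasks out of tolerance
end

In such data the column prefixes are nearly all identical, so a prefix-only check rules out almost nothing and the full comparison does all the work; this is the case where the cheap first pass is unreliable.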

I ran this experiment, and this happens consistently across several runs …
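
(A_except_B below is the single-pass filter from earlier in the thread; its definition isn't shown in this excerpt. A hypothetical stand-in with equivalent output, not the original code:)

function A_except_B(A, B)   # hypothetical sketch, not the original definition
    Bcols = Set(eachcol(B))
    A[:, [i for i in axes(A, 2) if @view(A[:, i]) ∉ Bcols]]
end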

julia> A = rand(0:2, 20000, 40000);

julia> B = A[:,vec(rand(1:40000,1,4000))];

julia> @btime A_except_B($A,$B)
  57.924 s (21 allocations: 5.54 GiB)
20000×36165 Matrix{Int64}:
 0  1  2  1  1  2  0  1  …
julia> @btime filtercolsr($A,$B)
  55.803 ms (24 allocations: 1.38 MiB)
20000×36165 view(::Matrix{Int64}, :, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  … 39993, 39995, 39999, 40000]) with eltype Int64:
 0  1  2  1  1  2  0  …

julia> A_except_B(A,B) == filtercolsr(A,B)
true

That testing a fraction of the whole data is faster seems obvious, and that for some use cases it might be enough is conceivable, but your random input is not (far from it) the scenario described in my previous post.

### edited ###

Using a sort of pseudocode (here tidied into runnable Julia), this is the idea:

h = 100                                    # prefix length for the cheap first pass

Bcols  = Set(eachcol(B))                   # full columns of B
Bcolsh = Set(eachcol(@view B[1:h, :]))     # h-prefixes of B's columns

# S1a: columns whose prefix is not among B's prefixes, surely kept
S1a = [i for i in axes(A, 2) if @view(A[1:h, i]) ∉ Bcolsh]

# S1b: columns whose prefix matches some B prefix, so they need the full check
S1b = [i for i in axes(A, 2) if @view(A[1:h, i]) ∈ Bcolsh]

# S2: of those candidates, the columns not actually present in B
S2 = [i for i in S1b if @view(A[:, i]) ∉ Bcols]

and this should be the result: S1a ∪ S2, i.e. A[:, sort!(vcat(S1a, S2))]
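
Packaged as one runnable function (a sketch: filtercols2 and h are illustrative names, and the short-circuit || fuses S1a and S2 while preserving column order):

function filtercols2(A, B; h = 100)
    Bcolsh = Set(eachcol(@view B[1:h, :]))     # h-prefixes of B's columns
    Bcols  = Set(eachcol(B))                   # full columns of B
    keepcol(i) = @view(A[1:h, i]) ∉ Bcolsh ||  # S1a: prefix differs, keep
                 @view(A[:, i]) ∉ Bcols        # S2: full check only on prefix collisions
    A[:, [i for i in axes(A, 2) if keepcol(i)]]
end

This way the full comparison is only paid for the (hopefully few) columns whose prefix collides with some column of B.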

@views applies to all subexpressions, so you don’t need to repeat it. You can just write @views function filtercolsr(A, B) ... end and every slicing expression inside the function body will be a view.
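
A tiny illustration (function names made up):

sum2(A)  = sum(@view A[:, 1]) + sum(@view A[:, 2])   # a view per expression
@views sum2b(A) = sum(A[:, 1]) + sum(A[:, 2])        # one @views covers the whole body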
