@rocco_sprmnt21, splitting the problem into two parts seems like a very nice idea.
I tried the implementation below, which composes your first-pass function with @stevengj’s filtercols(). The performance numbers seem a bit inconsistent for these very large matrices; sometimes a 5-20% speed-up can be observed.
What would be the best way to implement the 2-step idea?
NB: edited filtercolsr() following advice from @stevengj
function filtercols(A, B)  # by @stevengj
    Bcols = Set(eachcol(B))
    A[:, @views [i for i in axes(A, 2) if A[:, i] ∉ Bcols]]
end

@views function filtercolsr(A, B)  # edited from @rocco_sprmnt21
    Bcols = Set(eachcol(B[1:100, :]))   # signatures built from the first 100 rows only
    A[:, [i for i in axes(A, 2) if A[1:100, i] ∉ Bcols]]
end

g(A, B) = filtercols(filtercolsr(A, B), B)   # two passes: cheap prefix filter, then full filter
using Random, BenchmarkTools
A = rand(0:2, 28960, 45807);
B = A[:,vec(rand(1:45807,1,4580))];
@btime filtercols($A,$B)   # single pass: 8.721 s (26 allocations: 8.94 GiB)
@btime g($A,$B)            # two passes:  8.317 s (50 allocations: 8.94 GiB)
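One way the 2-step idea could be folded into a single function, as a rough sketch of my own (the name filtercols2 and the nprefix keyword are made up, and it is not benchmarked here): run the cheap 100-row prefix check on every column and fall back to the full comparison only when the prefix matches, so the large intermediate matrix that g() allocates is avoided.

function filtercols2(A, B; nprefix = 100)   # hypothetical sketch, not from this thread
    prefixcols = Set(eachcol(@view B[1:nprefix, :]))   # cheap first-pass signatures
    fullcols   = Set(eachcol(B))                        # full columns, hashed once
    keepcol(i) = @views A[1:nprefix, i] ∉ prefixcols || A[:, i] ∉ fullcols
    A[:, filter(keepcol, axes(A, 2))]                   # materialize only the kept columns
end

On random data like the above, almost every column should exit at the prefix test, so the full comparison only runs for the rare prefix matches; whether this actually beats the single pass here would need benchmarking.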
I would like to make an even stronger suggestion.
If you limited your search to just the first pass, how likely would it be to end up with unwanted columns?
For example, restricting the check to only the first 100 items, how many columns, on average, would differ from the result of the full check?
It will all depend on the type of input data we are looking at and its underlying statistics: whether it is expected to be a completely random sequence, or whether only occasional random differences are expected.
I would like to understand where the following reasoning is misleading, not applicable, or perhaps wrong:
Suppose, as in the model proposed here, that the arrangements of the values (0, 1, 2) over, say, 1000 positions are all equally probable.
Let’s consider the columns with the first 100 elements all 0.
Of these columns {0, 0, …, 0, e_101, e_102, …, e_1000} there are 3^900, one choice out of three for each of the remaining 900 positions.
So the probability that a second column agrees with a given column on its first 100 elements is about 1/3^100.
By the union bound, the probability of at least one such prefix-only match against m columns is less than m/3^100.
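To put a rough number on this, here is a small experiment one could run (my own sketch, with sizes scaled down from the original matrices, and a made-up helper name): count the columns of A whose first 100 rows match the first 100 rows of some column of B while the full column does not appear in B. The bound above says this count should essentially always be 0 for uniform random input.

using Random

@views function count_prefix_collisions(A, B; nprefix = 100)   # hypothetical helper
    prefixcols = Set(eachcol(B[1:nprefix, :]))
    fullcols   = Set(eachcol(B))
    # columns that a prefix-only check would treat as "in B" although they are not
    count(i -> A[1:nprefix, i] ∈ prefixcols && A[:, i] ∉ fullcols, axes(A, 2))
end

A = rand(0:2, 1000, 10_000);
B = A[:, rand(1:10_000, 1_000)];
count_prefix_collisions(A, B)   # 0 with overwhelming probability
big(1_000) / big(3)^100         # ≈ 2e-45, the bound for one column of A against 1000 columns of B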
There are other scenarios. For example, a manufacturing machine that is supposed to always reproduce the same sequence of 28960 tasks but occasionally misbehaves. In this case all the columns are equal most of the time, but some will differ at the random positions where the machine has misbehaved, performing some tasks out of tolerance, for example.
That testing a fraction of the whole data is faster seems obvious, and that for some use cases it might be enough is conceivable, but your random input is not (far from it) the scenario described in my previous post.
@views applies to all subexpressions, so you don’t need to repeat it like this. You can just write @views function filtercolsr(A, B) ... end and every slicing expression inside the function body will be a view.
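A minimal illustration of that point (the function names are made up):

@views function firstrows_a(A, i)
    A[1:100, i]                          # a view: @views covers every slice in the body
end

firstrows_b(A, i) = @views A[1:100, i]   # same result, macro applied to one expression

A = rand(0:2, 200, 5);
firstrows_a(A, 1) == firstrows_b(A, 1)   # true, and neither call copies the data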