It often comes up that one wants to filter/select rows of a Matrix based on the values in some column. I believe the best way of doing this is with a Boolean mask? But these can get a bit weird. For example:
a = ["a" "x"; "b" "y"]
a[in.(a[:, 1], [["a", "c"]]), :]
This is a bit hard to read, and it is probably confusing at first that you need to wrap the collection in another container so that broadcasting treats it as a single value instead of iterating over it.
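For comparison, here are a few other spellings of the same mask that I find somewhat easier to read. This is only a sketch: the names wanted, mask1, mask2, mask3 and rows are made up for illustration, and Ref and the one-argument in(collection) form are the Base functions as I understand them.

```julia
a = ["a" "x"; "b" "y"]
wanted = ["a", "c"]

# Ref(wanted) makes broadcasting treat `wanted` as a single value
mask1 = in.(a[:, 1], Ref(wanted))

# in(wanted) is the one-argument form: it returns a function x -> x in wanted
mask2 = map(in(wanted), a[:, 1])

# a comprehension spells the membership test out explicitly
mask3 = [x in wanted for x in a[:, 1]]

# all three masks select the same rows
rows = a[mask1, :]
```

All three still go through a mask, so they only help with readability, not with the underlying approach.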
In this case I would prefer something like filter!(). For example, with DataFrames I can do it like this (although it is currently not at all efficient; but it could be):
using DataFrames
b = DataFrame(["a" "x"; "b" "y"], :auto)  # :auto generates column names; recent DataFrames versions require it for a Matrix
filter!(row -> row[1] in ["a", "c"], b)
Which got me thinking that perhaps it makes sense to add another version of filter() / filter!() that also has a dimension argument (1 for iterating rows, as in DataFrame)? Or am I missing something and it is already easy to achieve somehow?
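To make the idea concrete, here is a minimal sketch of what I have in mind for the row case. The name filterrows is hypothetical and not part of Base; it assumes eachrow is available (Julia 1.1 or later). It returns a new Matrix, since as far as I know a Matrix cannot be resized in place, so a true filter!() for rows would not be possible anyway.

```julia
# Hypothetical helper, not part of Base:
# keep the rows of A for which the predicate f(row) returns true.
filterrows(f, A::AbstractMatrix) = A[[f(row) for row in eachrow(A)], :]

a = ["a" "x"; "b" "y"]

# each `row` is a view of one row, so row[1] is the value in the first column
kept = filterrows(row -> row[1] in ("a", "c"), a)
```

This still builds a Boolean mask internally, but the call site reads much like the DataFrames version.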
And also, why is filter!() so much slower than using a Boolean mask?
fun1(arr) = filter!(x -> x in ["a", "c"], arr)
fun2(arr) = arr[in.(arr, [["a","c"]]), :]
c = rand(["a","b","c","d"], 10000000);
@time fun1(c); @time fun2(c);
@time fun1(c); @time fun2(c);
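One caveat about the timing above: filter!() shrinks c in place, so every call after the first one sees an already filtered array. A sketch of a fairer comparison, assuming it is acceptable to include the cost of copy in the measurement, and with a smaller array to keep it quick (r1 and r2 are just illustrative names):

```julia
fun1(arr) = filter!(x -> x in ["a", "c"], arr)
fun2(arr) = arr[in.(arr, [["a", "c"]]), :]

c = rand(["a", "b", "c", "d"], 1_000_000)

# filter! mutates its argument, so hand it a fresh copy on every call
r1 = fun1(copy(c))   # Vector of the kept elements
r2 = fun2(c)         # n×1 Matrix, because of the trailing `:` index

# the first calls above also paid for compilation; these are the timed runs
@time fun1(copy(c));  # timing includes the copy
@time fun2(c);
```

It may also matter that the anonymous function in fun1 writes out ["a", "c"] inside the predicate, which could mean a new array is created for every element, whereas the broadcast version builds it once. I have not verified how much of the difference that explains.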