`filter()` and Boolean masks


I often need to select rows of a Matrix based on the values in one of its columns. As far as I know, the usual way to do this is with a Boolean mask, but these can get a bit awkward. For example:

a = ["a" "x"; "b" "y"]
a[in.(a[:, 1], [["a", "c"]]), :]

This is a bit hard to read, and it is probably confusing at first that you need to wrap the collection in another container so that broadcasting treats it as a single value instead of iterating over it.
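For comparison, here is a small sketch of the spellings I am aware of for building the same mask. The variable name `wanted` is just for illustration; `Ref` and the one-argument form of `in` are both in Base.

```julia
a = ["a" "x"; "b" "y"]
wanted = ["a", "c"]

# Spelling from above: the outer one-element vector is broadcast as a unit.
mask1 = in.(a[:, 1], [wanted])

# `Ref` marks `wanted` as a scalar for broadcasting.
mask2 = in.(a[:, 1], Ref(wanted))

# One-argument `in(collection)` returns a predicate that can be broadcast.
mask3 = in(wanted).(a[:, 1])

selected = a[mask3, :]   # keeps only the rows whose first column is in `wanted`
```

All three masks should be identical, so the choice is mostly about readability.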

In this case I would prefer something like filter!(). With DataFrames, for example, I can write the following (it is currently not at all efficient, but it could be made so):

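using DataFrames
# Recent DataFrames.jl versions need `:auto` to generate column names for a Matrix.
b = DataFrame(["a" "x"; "b" "y"], :auto)
# Keep only the rows whose first column is "a" or "c".
filter!(row -> row[1] in ["a", "c"], b)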

This got me thinking: would it make sense to add a method of filter() / filter!() that takes a dimension argument (1 to iterate over rows, as with a DataFrame)? Or am I missing something, and this is already easy to do?
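To make the idea concrete, here is a minimal sketch of what I have in mind for the row case. `filterrows` is a hypothetical helper, not something that exists in Base, and I am assuming `eachrow` is available (Julia 1.1 or later).

```julia
# `filterrows` is a hypothetical helper, not part of Base.
# It keeps the rows of `A` for which `pred(row)` returns true.
function filterrows(pred, A::AbstractMatrix)
    keep = Bool[pred(row) for row in eachrow(A)]   # one Bool per row
    return A[keep, :]                              # logical indexing on rows
end

a = ["a" "x"; "b" "y"]
result = filterrows(row -> row[1] in ("a", "c"), a)
```

This is still a Boolean mask underneath, but the call site reads like filter().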

And also, why is filter!() so much slower than using a Boolean mask?

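fun1(arr) = filter!(x -> x in ["a", "c"], arr)   # mutates `arr`
fun2(arr) = arr[in.(arr, [["a", "c"]]), :]       # returns a new array
c = rand(["a", "b", "c", "d"], 10_000_000);
# filter! shrinks its argument in place, so each timing gets its own copy;
# otherwise fun2 and the second round would run on already-filtered data.
c1 = copy(c); c2 = copy(c);
@time fun1(c1); @time fun2(c);   # first round includes compilation time
@time fun1(c2); @time fun2(c);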
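One possible explanation, which is a guess and not something I have measured: the closure passed to filter!() contains the array literal `["a", "c"]`, so a fresh array may be allocated for every element, whereas the broadcast version builds the collection only once. A minimal sketch of a variant that builds the collection once per call (`fun3` is a made-up name for illustration):

```julia
# `fun3` is a hypothetical variant: the tuple ("a", "c") is created once per
# call, and `in(collection)` turns it into a reusable predicate.
fun3(arr) = filter(in(("a", "c")), arr)   # non-mutating, so `arr` can be reused

small = rand(["a", "b", "c", "d"], 1_000)
kept = fun3(small)
```

Timing `fun3(c)` next to the two functions above would show whether the per-element allocation really is the difference.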