How to use `in` with @where [DataFramesMeta]?

question

#1

As from the title, how to use in with @where (DataFramesMeta)?

e.g. @where(df, :x in ["cat1", "cat2"])

I did already try :x .in ["cat1", "cat2"] or .in(:x, ["cat1", "cat2"])


#2

You actually need your function to take a vector and output a vector of booleans, as @where works with functions taking vectors as inputs. For example:

df = DataFrame(x = 1:3, y = [2, 1, 2])
@where(df, :x .> 1)

One easy solution is to use a version of in where the first argument can be a vector:

@where(df, [x in [1,2] for x in :x])

which of course gets annoying to write so I guess you could define some auxiliary function if you have to do this very often. For example you could define a customized operator \smallin :

as ∊ b = [a in b for a in as]
@where(df, :x ∊ [1,2])

On the other hand I’d be curious to know what is the recommended way to select data in a DataFrame. I’m not sure whether it is the @where macro from DataFramesMeta or if instead the Query.jl package should be preferred (or some other option that I’m not aware of).


#3

FWIW, there’s an issue about it here.


#4

You can use Query for this:

@from i in df begin
    @where i.x in ["cat1", "cat2"]
    @select i
    @collect DataFrame
end

I’m still working on a short-version API for this kind of scenario (the above code really is quite verbose if all you want to do is filter…), but nothing really working at this point.


#5

I use indexin for this kind of query, i.e.

julia> d = DataFrame(x=[1, 2, 3], y="X")
julia> @where(d, indexin(:x, [1, 3]) .> 0)
2×2 DataFrames.DataFrame
│ Row │ x │ y │
├-----┼---┼---┤
│ 1   │ 1 │ X │
│ 2   │ 3 │ X │

As a bonus, it is tolerant of NA values, too :slight_smile:


#6
@where(df, in.(:x, [["cat1", "cat2"]]))

Notice that ["cat1", "cat2"] is wrapped in another Array (the iterable is the only element of another iterable).