How to use `in` with @where [DataFramesMeta]?

As from the title, how to use in with @where (DataFramesMeta)?

e.g. @where(df, :x in ["cat1", "cat2"])

I did already try :x .in ["cat1", "cat2"] or .in(:x, ["cat1", "cat2"])…

1 Like

You actually need your function to take a vector and output a vector of booleans, as @where works with functions taking vectors as inputs. For example:

df = DataFrame(x = 1:3, y = [2, 1, 2])
@where(df, :x .> 1)

One easy solution is to use a version of in where the first argument can be a vector:

@where(df, [x in [1,2] for x in :x])

which of course gets annoying to write so I guess you could define some auxiliary function if you have to do this very often. For example you could define a customized operator \smallin :

as ∊ b = [a in b for a in as]
@where(df, :x ∊ [1,2])

On the other hand I’d be curious to know what is the recommended way to select data in a DataFrame. I’m not sure whether it is the @where macro from DataFramesMeta or if instead the Query.jl package should be preferred (or some other option that I’m not aware of).

1 Like

FWIW, there’s an issue about it here.

2 Likes

You can use Query for this:

@from i in df begin
    @where i.x in ["cat1", "cat2"]
    @select i
    @collect DataFrame
end

I’m still working on a short-version API for this kind of scenario (the above code really is quite verbose if all you want to do is filter…), but nothing really working at this point.

3 Likes

I use indexin for this kind of query, i.e.

julia> d = DataFrame(x=[1, 2, 3], y="X")
julia> @where(d, indexin(:x, [1, 3]) .> 0)
2Γ—2 DataFrames.DataFrame
β”‚ Row β”‚ x β”‚ y β”‚
β”œ-----β”Ό---β”Ό---─
β”‚ 1   β”‚ 1 β”‚ X β”‚
β”‚ 2   β”‚ 3 β”‚ X β”‚

As a bonus, it is tolerant of NA values, too :slight_smile:

@where(df, in.(:x, [["cat1", "cat2"]]))

Notice that ["cat1", "cat2"] is wrapped in another Array (the iterable is the only element of another iterable).

2 Likes

I really like this thank you.
Can also be written

@where(df, :x .∈ [["cat1", "cat2"]])