Dynamically choosing correct way to filter

Hi there,

I have an question regarding best practice.

I have a dataframe that I need to filter on several columns based on the value in each column.
However depending on the situation I may be filtering based on an array of elements or a single element. I’ve tired to show the problem below.


using DataFrames, DataFramesMeta
df = DataFrame(a = ["el1", "el2", "el3", "el4"], b = [1,2,3,4])

subset1 = ["el1", "el2", "el3"]
subset2 = "el1"

I know how to filter each of these cases individually:

@subset(df, in.(:a, Ref(subset1)))
@subset(df, :a.==subset2)

However I can’t find a way to filter regardless of whether the input is an array or a single element without resorting to if statements, something like this:

function df_filter(df, f)
    if typeof(f) <: AbstractArray
        dff = @subset(df, in.(:a, Ref(f)))
        dff = @subset(df, (:a.==f))
    return dff 

I also thought about possibly using multiple dispatch. The issue is that in either case it seems to get very messy as the number of columns that I need to filter on increases. E.g. for 3 columns I’d need 8 different methods.

Any help on how to better write this kind of functionality would be appreciated. Thanks!

If possible, it might be best to sidestep the problem entirely by modifying the input. That is, at some higher level in the call stack, change your subset2 so that it is ["el1"] rather than "el1". This is more conceptually consistent with what you are doing, which is asking “does a match one of these things?”. It’s just that in the case of subset2, the collection you’re matching it against only has one element. In general I find it usually makes more sense to keep functions that do this sort of low-level filtering as simple/generic as possible, and do some transformation at a higher level to keep the input well formed.

But there are other options if this isn’t suitable for you. You could define a custom filter function: that is, rather than wrapping the whole thing in a function like your df_filter, make a new (multiple-dispatched) function that you pass to @subset. Like this:

eq_or_in(x::T, y::T) where {T<:Any} = (x == y)
eq_or_in(x::T, y::AbstractVector{T}) where {T<:Any} = (x in y)
@subset(df, eq_or_in.(:a, Ref(subset1)))
@subset(df, eq_or_in.(:a, Ref(subset2)))

This will throw an error if you try to match against an item or a vector with a different element type, but you could drop the matching T's altogether if you wanted (e.g. to be able to filter a column of numbers [1.0, 1.5, 2.0] based on a vector of a different number type [1,2,3])

I’ll also point out that DataFramesMeta now has @rsubset which operates the subset rowwise, removing the need for broadcasting and Refs in most cases. For example, you could write things like

@rsubset(df, :a in subset1)
@rsubset(df, eq_or_in(:a, subset1))

Also I think DataFramesMeta now exports everything from DataFrames itself, so you can just do using DataFramesMeta :slight_smile:

1 Like

Thanks so much for the detailed response.

You are right, modifying the input at an earlier stage probably makes the most sense in my use case.
Having said that, seeing the example you showed for multiple dispatch is intriguing. Had never considered solving the problem like this.

@rsubset is also a great tip. :+1: