Is it possible to use filter(:col => myFilter, df) with additional constant input variables?

Hi all,
I’m very new to Julia and struggling to understand the filter() syntax. I’d like to use filter() to filter a data frame to contain only rows with values in a specific range (like R’s dplyr::between()). I’ve implemented a simple working version of what I want below. However, I am struggling to move the hard-coded lower, upper, and target variables into the function input without breaking the filter call. Is there any way to have the :Col1 => myFilter accept constant input arguments in myFilter so that I can change the target window with each call to filter? What I’d like is to have a line of code in this direction: filter(:Col1 => myFilter(., lower, upper, target), df). Is something like that possible using filter and =>?

Minimal Example:

using DataFrames
df = DataFrame(Col1 = 1:5, Col2 = ["A", "B", "C", "D", "E"])
function myFilter(x)::Bool
    lower = 0.5
    upper = 2
    target = 1
    in_range = x >= target + lower && x <= target + upper
    return in_range
end
filter(:Col1 => myFilter, df)
# -> returns two rows of df that are in range
# Problem: lower, upper, and target are hard-coded inside function.

PS: I’m trying out different approaches to filtering for benchmarking purposes. I’m aware that df[(df[:,"Col1"] .>= target + lower) .& (df[:,"Col1"] .<= target + upper),:] would be one option to solve this filtering problem. If there is a recommended way to rapidly filter down data frames or similar data structures based on a single numeric column (sorted) I’d also be very interested to learn more about it.

1 Like

You asked many questions in one post so it is hard to filter them out. I understand that your main question is how to define a function that would take parameters for filtering.
You can do it as follows:

myFilter(lower, upper, target) = x -> lower <= x - target <= upper

and now you can write:

julia> filter(:Col1 => myFilter(0.5, 2, 1), df)
2×2 DataFrame        
 Row │ Col1   Col2   
     │ Int64  String 
─────┼───────────────
   1 │     2  B      
   2 │     3  C      
4 Likes

You can define your function as

myFilter(x; lower=0.5, upper=2, target=1) = x >= target + lower && x <= target + upper

and use it like this for example:

julia> filter(:Col1 => (x->myFilter(x, upper=3.1)), df)
3×2 DataFrame
 Row │ Col1   Col2   
     │ Int64  String 
─────┼───────────────
   1 │     2  B
   2 │     3  C
   3 │     4  D

One day this might be much nicer, if https://github.com/JuliaLang/julia/pull/24990 ever gets merged:

filter(:Col1 => myFilter(_, upper=3.1), df)

Another option would be to write a make_filter function that returns the function you want, for example:

make_filter(; lower=0.5, upper=2, target=1) = x -> x >= target + lower && x <= target + upper

julia> filter(:Col1 => make_filter(upper=3.1), df)
3×2 DataFrame
 Row │ Col1   Col2   
     │ Int64  String 
─────┼───────────────
   1 │     2  B
   2 │     3  C
   3 │     4  D
2 Likes

use the searchsortedfirst and searchsortedlast functions if you are sure your column is sorted.

2 Likes

Thanks a lot bkamins & sudete! Both answers solve my problem. I selected bkamins solution because it was first.

I actually have a small follow up question regarding the solution itself. I’m somewhat confused about how Julia knows that :Col1 => corresponds to the x variable inside the myFilter function. Would it be possible to pass multiple columns this way too, e.g. use an additional condition based on Col2 as below?

function myFilter(lower, upper, target)
    cond1 = x -> lower <= x - target <= upper # condition on first column
    cond2 = y -> y == "B"                     # condition on second column
    cond1 && cond2
end
filter([:Col1, :Col2] => myFilter(0.5, 2, 1), df) # does not work

Yes you can do this. In this case myFilter must return a function that accepts two parameters (one for each column value). However you cannot use the && operator to combine two boolean functions into one (but that would be a great feature I think, you could file an issue to propose it). You can for example do this:

function myFilter(lower, upper, target)
    return function (x,y)
        cond1 = lower <= x - target <= upper # condition on first column
        cond2 = y == "B"                     # condition on second column
        cond1 && cond2
    end
end

and use it as you did:

julia> filter([:Col1, :Col2] => myFilter(0.5, 2, 1), df)
1×2 DataFrame
 Row │ Col1   Col2   
     │ Int64  String 
─────┼───────────────
   1 │     2  B
3 Likes

If you want to, you can also do this with one of the macro packages, for example with DataFrameMacros.jl:

using DataFrameMacros
using DataFrames
df = DataFrame(Col1 = 1:5, Col2 = ["A", "B", "C", "D", "E"])
lower = 0.5
upper = 2
target = 1
@subset(df, target + lower <= :Col1 <= target + upper)

The only thing is that this would be slower if lower, upper and target were global variables, because then the closure created by the macro would not be type-stable, as the three variables are not guaranteed to keep their types. But if it’s in a local context like let or behind a function barrier, it’s fine.

2 Likes

Regarding “similar data structures”: there is a simple common way to filter many kinds of tables, e.g. vector-of-namedtuples in Base julia, tables from StructArrays or TypedTables packages, etc. That’s the filter(predicate, table) function:

julia> using TypedTables

julia> t = Table(a=1:3, b=10:10:30)
Table with 2 columns and 3 rows:
     a  b
   ┌──────
 1 │ 1  10
 2 │ 2  20
 3 │ 3  30

julia> filter(x -> 1 < x.a <= 4, t)
Table with 2 columns and 2 rows:
     a  b
   ┌──────
 1 │ 2  20
 2 │ 3  30

You can use multiple columns of course, with intuitive Base julia syntax:

julia> filter(x -> 2 < x.a < 4 || x.b == 10, t)
Table with 2 columns and 2 rows:
     a  b
   ┌──────
 1 │ 1  10
 2 │ 3  30

Unfortunately, DataFrames don’t follow the same approach and don’t support this filter syntax.

1 Like

DataFrames.jl carefully supports all API that is defined in Julia Base (if you find any inconsistency can you please open an issue? Thank you!):

julia> using DataFrames

julia> t = DataFrame(a=1:3, b=10:10:30)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1     10
   2 │     2     20
   3 │     3     30

julia> filter(x -> 1 < x.a <= 4, t)
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     2     20
   2 │     3     30

julia> filter(x -> 2 < x.a < 4 || x.b == 10, t)
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1     10
   2 │     3     30

The point of the src => fun syntax is as follows:

  1. to allow processing very wide tables (like 10,000 columns), as otherwise compilation latency would be killing processing of such tables
  2. fun then does not need to be aware that it is processing columns, so you can write e.g. :some_column => !ismissing
3 Likes

Sorry, of course you are right here!
I just misremembered: everything is fine with filter, it’s the map function that DataFrames don’t support, unlike other common table types.

Btw:

That’s just filter(x -> !ismissing(x.some_column)).

1 Like

Yes - and that is why map errors. We might add map in the future, but together with @nalimilan we decided that it is safer to disallow it than to make a bad design decision.

For now, map is supported with eachrow(df) or eachcol(df) (depending on if the user wants to iterate columns or rows, or alternatively use the select function that is more general.

This approach also works for the two column case, giving a very concise way of doing these filters. Thanks a lot for the tip!

@subset(df, target + lower <= :Col1 <= target + upper && :Col2 == "B")

Another syntax for multiple conditions like that is with begin end, slightly different runtime probably because it’s two function applications, but often clarity matters more than pure speed:

@subset df begin
    target + lower <= :Col1 <= target
    :Col2 == "B"
end

Or the same thing with commas:

@subset(df, target + lower <= :Col1 <= target + upper, :Col2 == "B")
2 Likes