Hi all,
I’m very new to Julia and struggling to understand the filter() syntax. I’d like to use filter() to filter a data frame to contain only rows with values in a specific range (like R’s dplyr::between()). I’ve implemented a simple working version of what I want below. However, I am struggling to move the hard-coded lower, upper, and target variables into the function input without breaking the filter call. Is there any way to have the :Col1 => myFilter accept constant input arguments in myFilter so that I can change the target window with each call to filter? What I’d like is to have a line of code in this direction: filter(:Col1 => myFilter(., lower, upper, target), df). Is something like that possible using filter and =>?
Minimal Example:
using DataFrames
df = DataFrame(Col1 = 1:5, Col2 = ["A", "B", "C", "D", "E"])
function myFilter(x)::Bool
    lower = 0.5
    upper = 2
    target = 1
    in_range = x >= target + lower && x <= target + upper
    return in_range
end
filter(:Col1 => myFilter, df)
# -> returns two rows of df that are in range
# Problem: lower, upper, and target are hard-coded inside function.
PS: I’m trying out different approaches to filtering for benchmarking purposes. I’m aware that df[(df[:,"Col1"] .>= target + lower) .& (df[:,"Col1"] .<= target + upper),:] would be one option to solve this filtering problem. If there is a recommended way to rapidly filter down data frames or similar data structures based on a single numeric column (sorted) I’d also be very interested to learn more about it.
You asked many questions in one post so it is hard to filter them out. I understand that your main question is how to define a function that would take parameters for filtering.
You can do it as follows:
myFilter(lower, upper, target) = x -> lower <= x - target <= upper
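For completeness, here is a minimal sketch of how this one-liner could be used with the df from the original post. The name in_window is only illustrative. The Pair :Col1 => in_window tells filter which column's values to pass, one at a time, to the predicate:

```julia
using DataFrames

df = DataFrame(Col1 = 1:5, Col2 = ["A", "B", "C", "D", "E"])

# Returns a one-argument predicate that remembers lower, upper, and target.
myFilter(lower, upper, target) = x -> lower <= x - target <= upper

# `:Col1 => in_window` is an ordinary Pair; filter calls in_window on each value of Col1.
in_window = myFilter(0.5, 2, 1)   # illustrative name, not from the thread
result = filter(:Col1 => in_window, df)
```

With these values the window is 1.5 to 3, so the rows with Col1 equal to 2 and 3 are kept, and the bounds can be changed on every call.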
Thanks a lot bkamins & sudete! Both answers solve my problem. I selected bkamins’s solution because it was first.
I actually have a small follow-up question regarding the solution itself. I’m somewhat confused about how Julia knows that :Col1 => corresponds to the x variable inside the myFilter function. Would it be possible to pass multiple columns this way too, e.g. use an additional condition based on Col2 as below?
function myFilter(lower, upper, target)
    cond1 = x -> lower <= x - target <= upper # condition on first column
    cond2 = y -> y == "B" # condition on second column
    cond1 && cond2
end
filter([:Col1, :Col2] => myFilter(0.5, 2, 1), df) # does not work
Yes, you can do this. In this case, myFilter must return a function that accepts two arguments (one for each column value). However, you cannot use the && operator to combine two Boolean functions into one (but that would be a great feature I think, you could file an issue to propose it). You can for example do this:
function myFilter(lower, upper, target)
    return function (x, y)
        cond1 = lower <= x - target <= upper # condition on first column
        cond2 = y == "B" # condition on second column
        cond1 && cond2
    end
end
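If you would rather keep the two conditions as separate one-argument predicates, one possible workaround is a small combinator. This is only a sketch: both, in_window, and is_b are hypothetical names made up for illustration and are not part of DataFrames or Base:

```julia
using DataFrames

df = DataFrame(Col1 = 1:5, Col2 = ["A", "B", "C", "D", "E"])

# Two independent one-argument predicates.
in_window(lower, upper, target) = x -> lower <= x - target <= upper
is_b = y -> y == "B"

# Hypothetical helper: turn two one-argument predicates into a single
# two-argument predicate, one argument per column.
both(f, g) = (x, y) -> f(x) && g(y)

result = filter([:Col1, :Col2] => both(in_window(0.5, 2, 1), is_b), df)
```

Here filter passes the Col1 value as x and the Col2 value as y, so only the row with Col1 = 2 and Col2 = "B" remains.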
The only thing is that this would be slower if lower, upper, and target were non-constant global variables, because then the closure created by the macro would not be type-stable, as the three variables are not guaranteed to keep their types. But if it’s in a local context like let or behind a function barrier, it’s fine.
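To illustrate the function-barrier idea in plain Julia (independent of any macro), here is a rough sketch. make_pred and count_in_window are made-up names for this example:

```julia
# Builds a predicate that captures its three bounds.
make_pred(lower, upper, target) = x -> lower <= x - target <= upper

# Function barrier: the bounds arrive as arguments, so their concrete types
# are known when this method is compiled, even if the caller holds them
# in untyped global variables.
function count_in_window(xs, lower, upper, target)
    pred = make_pred(lower, upper, target)
    return count(pred, xs)
end

lower, upper, target = 0.5, 2, 1   # non-const globals at top level
n = count_in_window(1:5, lower, upper, target)
```

The globals are read once at the call site, and all the per-element work happens inside count_in_window, where the types are fixed.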
Regarding “similar data structures”: there is a simple common way to filter many kinds of tables, e.g. a vector of NamedTuples in Base Julia, tables from the StructArrays or TypedTables packages, etc. That’s the filter(predicate, table) function:
julia> using TypedTables
julia> t = Table(a=1:3, b=10:10:30)
Table with 2 columns and 3 rows:
a b
┌──────
1 │ 1 10
2 │ 2 20
3 │ 3 30
julia> filter(x -> 1 < x.a <= 4, t)
Table with 2 columns and 2 rows:
a b
┌──────
1 │ 2 20
2 │ 3 30
You can use multiple columns of course, with intuitive Base julia syntax:
julia> filter(x -> 2 < x.a < 4 || x.b == 10, t)
Table with 2 columns and 2 rows:
a b
┌──────
1 │ 1 10
2 │ 3 30
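The same predicate style also works without any package on a plain vector of NamedTuples. A small sketch in Base Julia, where rows and kept are just illustrative names:

```julia
# A "table" as a plain Vector of NamedTuples; no packages needed.
rows = [(a = 1, b = 10), (a = 2, b = 20), (a = 3, b = 30)]

# Each element is one row, so fields are accessed with dot syntax.
kept = filter(r -> 2 < r.a < 4 || r.b == 10, rows)
```

This keeps the first and third rows, matching the TypedTables result above.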
Unfortunately, DataFrames don’t follow the same approach and don’t support this filter syntax.
Sorry, of course you are right here!
I just misremembered: everything is fine with filter, it’s the map function that DataFrames don’t support, unlike other common table types.
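For reference, a minimal sketch of the row-predicate form of filter on a DataFrame, reusing the df from the first post. As far as I know, the :Col1 => f form is usually the faster of the two, because the predicate then receives plain column values rather than a DataFrameRow:

```julia
using DataFrames

df = DataFrame(Col1 = 1:5, Col2 = ["A", "B", "C", "D", "E"])

# The predicate receives one row at a time; columns are read with dot syntax.
result = filter(row -> 1 < row.Col1 <= 4, df)
```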
Btw:
That’s just filter(x -> !ismissing(x.some_column)).
Yes - and that is why map errors. We might add map in the future, but together with @nalimilan we decided that it is safer to disallow it than to make a bad design decision.
For now, map is supported with eachrow(df) or eachcol(df) (depending on whether the user wants to iterate over rows or columns). Alternatively, use the select function, which is more general.
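A short sketch of those three options, again on the df from the first post. The variable names and the output column name Col1_doubled are arbitrary choices for this example:

```julia
using DataFrames

df = DataFrame(Col1 = 1:5, Col2 = ["A", "B", "C", "D", "E"])

# Row-wise map: the function receives one row at a time.
labels = map(row -> string(row.Col2, row.Col1), eachrow(df))

# Column-wise map: the function receives one whole column at a time.
col_lengths = map(length, eachcol(df))

# select with ByRow applies a function to each value and returns a new data frame.
doubled = select(df, :Col1 => ByRow(x -> 2x) => :Col1_doubled)
```

Note that map returns ordinary collections here, while select returns a DataFrame.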
Another syntax for multiple conditions like that uses a begin ... end block. The runtime is probably slightly different because it’s two function applications, but clarity often matters more than pure speed:
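using DataFramesMeta
# @rsubset applies each condition row by row; with @subset the comparisons would need broadcasting (.<=, .==)
@rsubset df begin
    target + lower <= :Col1 <= target + upper
    :Col2 == "B"
end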