DataFrames: obtaining the subset of rows by a set of values

Don’t use @time; you need @btime from BenchmarkTools to do any actual benchmarking.
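As a rough illustration of the difference (a sketch on a plain vector rather than a DataFrame; `v` and `keep` are made-up names, and BenchmarkTools must be installed):

```julia
using BenchmarkTools

v = rand(10_000)
keep(x) = x > 0.5

# The first @time call also measures compilation of `filter` for these types.
@time filter(keep, v)

# @btime runs the expression many times and reports the minimum.
# The `$` interpolation keeps the non-constant globals out of the measurement.
@btime filter($keep, $v);
```

The timings themselves depend on the machine, so none are shown here.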

It’s possible that filter will be slower in some cases due to type stability issues. Hopefully the situation will improve as more parts of DataFrames are implemented using NamedTuples.

I think Query.jl’s @filter should generally perform much better on a DataFrame than the base filter (because it gets around the type stability issue that @ExpandingMan mentioned).

Even within DataFrames, you can work around this by using the type stable Tables.rows iterator and a function barrier:

using DataFrames, Tables

function _where(f, t)
  [f(i) for i in t]
end

function _filter(f, df)
  mask = _where(f, Tables.rows(df))
  df[mask, :]
end

This could even become the default implementation I guess. It may even bring some extra performance over the Query implementation (I think, have not measured) in that Tables.rows produces a lazy iterator that only materializes the fields that you are using in your function f.

I’m not 100% sure we need the extra function barrier here. I imagine the array comprehension will already be turned into a call to some function, so that in the performance critical part Julia knows the type of Tables.rows(df).
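For readers unfamiliar with the term, here is a minimal sketch of a function barrier in plain Julia. It does not settle whether the comprehension alone is enough; `cols`, `unstable_mask`, `kernel` and `barrier_mask` are hypothetical names, and the `Any[...]` vector only imitates a container whose column types are hidden from the compiler.

```julia
# Columns stored with an abstract element type: the compiler cannot know
# the concrete type of `cols[1]` from the container type alone.
cols = Any[rand(1000), rand(1:10, 1000)]

# No barrier: the loop body works on a value whose type is only known at run time.
function unstable_mask(cols)
    x = cols[1]
    mask = Vector{Bool}(undef, length(x))
    for i in eachindex(x)
        mask[i] = x[i] > 0.5
    end
    return mask
end

# Barrier: `kernel` is compiled separately for the concrete type of `x`.
kernel(f, x) = [f(v) for v in x]

# One dynamic dispatch on `cols[1]`, then the specialised kernel runs.
barrier_mask(f, cols) = kernel(f, cols[1])
```

Both functions return the same mask; only the place where the type becomes known differs.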

Unfortunately that solution doesn’t fly when there are many columns with heterogeneous types:

julia> df = DataFrame(rand(10000, 100));

julia> df.a = 'a'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> @time _filter(x -> x.x1 > 0.5, df);
  0.809720 seconds (3.87 M allocations: 132.545 MiB, 12.98% gc time)

julia> @time filter(x -> x.x1 > 0.5, df);
  0.105625 seconds (2.52 M allocations: 67.293 MiB, 10.89% gc time)

My current thinking is that the ideal interface would be something like filter(x1 -> x1 > 0.5, df), and we would extract the names of the arguments to identify which variables (here x1) are actually used. That would avoid problems with too large numbers of columns and would offer a compact syntax.

This sounds nice. You just introduce a macro

@λ x1, x3 -> ...

that would map to a

SelectingClosure((:x1, :x3), ((x1, x3)) -> ...)

where

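struct SelectingClosure{C, F}
    columns::C  # names of the columns to pass, e.g. (:x1, :x3)
    f::F        # function taking one argument per selected column
end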

and dispatch on this.
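A minimal sketch of what dispatching on such a type could look like. The table here is a NamedTuple of vectors standing in for a data frame, and `rowfilter` is a hypothetical name chosen to avoid touching `Base.filter`.

```julia
struct SelectingClosure{C<:Tuple{Vararg{Symbol}}, F}
    columns::C
    f::F
end

# Toy table: a NamedTuple of equal-length vectors.
table = (x1 = [0.2, 0.7, 0.9], x2 = ["a", "b", "c"], x3 = [1, 2, 3])

# Generic method: `f` receives each full row as a NamedTuple.
function rowfilter(f, t::NamedTuple)
    n = length(first(t))
    mask = [f(map(col -> col[i], t)) for i in 1:n]
    return map(col -> col[mask], t)
end

# Specialised method: only the selected columns are iterated over.
function rowfilter(sc::SelectingClosure, t::NamedTuple)
    cols = map(name -> getfield(t, name), sc.columns)  # tuple of vectors
    mask = map(sc.f, cols...)
    return map(col -> col[mask], t)
end

sc = SelectingClosure((:x1, :x3), (x1, x3) -> x1 > 0.5 && x3 < 3)
selected_rows = rowfilter(sc, table)
all_columns_rows = rowfilter(row -> row.x1 > 0.5, table)
```

The second method never builds a full row, which is the point of carrying the column names alongside the function.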

We don’t even need a macro. We can extract the columns inside filter and pass them to a helper function. The names of the arguments are available from the anonymous function.
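One way to read those names on recent Julia 1.x versions is sketched below. It assumes the unexported `Base.method_argnames`, which is an internal and may change or disappear between releases; `argnames` is a hypothetical helper name.

```julia
# Assumption: relies on the internal, unexported Base.method_argnames.
function argnames(f)
    ms = collect(methods(f))
    length(ms) == 1 ||
        throw(ArgumentError("expected a function with exactly one method"))
    names = Base.method_argnames(ms[1])
    # The first slot refers to the function object itself, so drop it.
    return Tuple(names[2:end])
end

names = argnames((x1, x3) -> x1 > 0.5 && x3 < 3)
```

This only reads the declared argument names; it says nothing about what the function body actually uses.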

That’s the strategy followed by JuliaDBMeta row-wise macros: figure out what columns are needed from the expression fed to the macro and only iterate on those fields.

Could you please spell out in more detail how one does this without a macro? It sounds very interesting, I imagine one would look at how the anonymous function stores the variables it needs, but I’m not very familiar with how that is represented.

Yes, that’s something I realized quite recently. See this comment.

Then DataFramesMeta/JuliaDBMeta could just provide macros to create these anonymous functions for convenience, but the basic support would be in DataFrames/JuliaDB, and it wouldn’t be too inconvenient to use.

I can’t help but ask, how is it possible to obtain these? I didn’t know this was possible, and now that you’ve mentioned it my head is starting to fill with all sorts of wacky, crazy ideas about fun things to do with it :laughing:

Oh, sorry I see, it’s in the comment you linked. Man, that’s not as elegant as I was hoping for. If this became something that DataFrames relied on we’d of course need to ask for it to be part of a public, stable API.

I also tend to agree with @ExpandingMan that it’d be nicer to have a public API for this.

In terms of future direction I would propose the following. Have a:

filter(f, df::DataFrame; select = find_fieldnames(f))

function (like in JuliaDB) where the user can add select = ... by hand for optimization. In case Julia can prove that f only uses some fields then select is set to only specify those fields.

Ideally one would also overload the select methods from say here for generic tables, and then you would define:

function _where(f, t)
  [f(i) for i in t]
end

function _filter(f, df; select = find_fieldnames(f))
  mask = _where(f, Tables.rows(df, select = select)) # Or maybe TableTools.select(Tables.rows(df), select)
  df[mask, :]
end

EDIT: maybe relevant to the discussion, these slides from JuliaCon show how it happens right now for JuliaDB(Meta), where the macros get the symbols to pass to the select argument.

iirc DataFramesMeta parses the entire expression first and creates a Dict-like object of all the symbols used. One could then construct a TypedDataFrame type object from that, similar to what JuliaDBMeta does. So implementing this in DataFramesMeta would not be that hard.

However it’s nice to do all data operations in functions, in which case the scoping rules of DataFramesMeta can make things difficult. It would be nice to have this live in DataFrames but a DataFramesMeta implementation is a good place to start.

Well, it would be really cool if we could just read the argument names as @nalimilan seems to be suggesting and simply do

filter((col1, col2) -> f(col1, col2), df)

What implications this has for type stability I’m really not sure; asking the compiler to know ahead of time what method of f would be used seems to be asking quite a lot. On the other hand, once it runs, it would have the advantage of operating on arguments of fixed types rather than DataFrameRow. The point I was making is that if something like this turned out to be a good idea, we’d want a solid API for determining the argument names rather than making DataFrames reliant on Julia internals that are likely to change.

It should be really simple for the compiler. filter would be type-unstable, but it would pass a named tuple of columns to a helper function which would be type stable.
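A minimal sketch of that split, with a `Dict` of columns standing in for a data frame. `select_mask` and `mask_kernel` are hypothetical names, and the column names are passed explicitly here instead of being read from the function.

```julia
# Stand-in for a data frame: the value type hides each column's concrete type.
table = Dict{Symbol,AbstractVector}(
    :col1 => [1, 5, 3],
    :col2 => [2.0, 1.0, 4.0],
    :col3 => ["a", "b", "c"],
)

# Type-stable helper: the NamedTuple type records every column's type,
# so this method is compiled for those concrete types.
function mask_kernel(f, cols::NamedTuple)
    n = length(first(cols))
    return [f(map(c -> c[i], values(cols))...) for i in 1:n]
end

# Type-unstable outer function: the column types are only known at run time,
# but the cost is a single dynamic dispatch into `mask_kernel`.
function select_mask(f, table, names::Tuple{Vararg{Symbol}})
    cols = NamedTuple{names}(map(n -> table[n], names))
    return mask_kernel(f, cols)
end

mask = select_mask((col1, col2) -> col1 > col2, table, (:col1, :col2))
```

Only the two named columns end up in the named tuple, so the number of other columns in the table does not affect the compiled kernel.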

The drawback of JuliaDB’s select approach is that you specify the column names once via an argument, but then you also need to repeat the name of the argument to the anonymous function, and inside it (i.e. df -> df.col1). Also, with data frames, if you don’t specify select we either have to generate a giant named tuple or fall back to type unstable code. So that’s not ideal, especially for newcomers.

I agree we would need to make sure there’s a relatively stable Julia interface for that, though. But even if they change in a future 1.x release it’s not the end of the world to adapt, as long as it remains accessible.

Cool, I’m getting excited about this! :smiley:

My idea was to use your trick so that select would pick only the relevant columns if Julia is able to deduce what they are just by inspecting the function (select = find_fieldnames(f), where find_fieldnames would be a function that tries to guess what fields are needed by f). I imagine that there are some functions for which this method won’t work, in which case we would still allow the user to specify this manually. Or does your method work on all functions?

My “method” doesn’t detect variables which are used by a function. I don’t think that’s possible, nor that it makes sense in Julia. I merely suggest using the names of the arguments a function takes and require them to match column names.

Ah, I had completely misunderstood! I had in mind something much more complicated (infer from the function body what fields are needed), but I agree that’s much harder if at all possible.

If you need to rely on the names the user is passing, I would tend to agree with @Tamas_Papp that this could be done by a macro using symbols, say @λ :SepalLength > 2*:SepalWidth (or denoted in another way, say @λ $SepalLength > 2*$SepalWidth or @λ &SepalLength > 2 * &SepalWidth) would return a SelectingClosure object. Such a basic macro could live in a very low-dependency package (say a TablesMeta that only requires Tables and maybe MacroTools) and DataFrames could even reexport it. In this case, from my proposal above, one would add the dispatch:

filter(f::SelectingClosure, df::DataFrame) = filter(f.f, df, select = f.columns)

Otherwise, if I’m working on a table with 1000 columns and I filter on x -> x.mycol > 3 but :x is one of the columns, I would get Error: type ... has no field mycol

Yes, that’s the main issue with the idea of using argument names. There are several possible solutions: 1) Use a special name to indicate that the full table should be passed (e.g. _); we could prevent people from creating columns with that name. 2) Do df::DataFrame -> ... to indicate you want the full table (this information can also be extracted). 3) Pass the function via a keyword argument to change its meaning.
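Option 2 could be sketched roughly as follows. This assumes the internal `sig` field of `Method` and `Base.unwrap_unionall`, neither of which is a stable public API; `ToyTable` and `wants_full_table` are hypothetical names, with `ToyTable` standing in for `DataFrame`.

```julia
# Stand-in for DataFrame in this sketch.
struct ToyTable
    columns::Dict{Symbol,Vector}
end

# True when `f` has a single method whose only argument is annotated ::ToyTable.
# Assumption: relies on the internal `sig` field of `Method`.
function wants_full_table(f)
    ms = collect(methods(f))
    length(ms) == 1 || return false
    params = Base.unwrap_unionall(ms[1].sig).parameters
    # params[1] is the type of the function itself; params[2] is the first argument.
    return length(params) == 2 && params[2] === ToyTable
end

full = wants_full_table((t::ToyTable) -> true)
by_name = wants_full_table(x -> x > 1)
```

An unannotated argument shows up as `Any` in the signature, so it falls through to the name-matching behaviour.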

Of course we can do anything we want with macros (and DataFramesMeta/JuliaDBMeta already do that). But it would be nice to avoid requiring a non-standard syntax for simple operations.

Somehow I feel that, at least in the IndexedTables case, given that an IndexedTable is a row iterator, there the filter(f, df) should by default do the normal thing (meaning, apply f to each row), but a keyword argument would definitely make sense and we’d need a different signature for the alternative.

I wonder whether it makes sense to create callable structs:

struct Map{F, N}
  f::F
  select::NTuple{N, Symbol}
end
Map(f::SelectingClosure) = Map(f.f, f.columns)
Map(f) = Map(SelectingClosure(f))

(m::Map)(df) = map(m.f, df, select = m.select)


struct Filter{F, N}
  f::F
  select::NTuple{N, Symbol}
end
Filter(f::SelectingClosure) = Filter(f.f, f.columns)
Filter(f) = Filter(SelectingClosure(f))

(m::Filter)(df) = filter(m.f, df, select = m.select) 

which has the advantage of being non-breaking and working with both IndexedTables and DataFrames and working with piping:

iris |>
  Filter(Species -> Species == "versicolor") |>
  Map((SepalLength, SepalWidth) -> SepalLength / SepalWidth)
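To make the piping idea concrete, here is a self-contained sketch on a NamedTuple of vectors. It passes the column names explicitly as a second argument, since the one-argument constructors above depend on name extraction that is not part of this sketch; `selected` is a hypothetical helper and the `iris` values are made up.

```julia
struct Filter{F, N}
    f::F
    select::NTuple{N, Symbol}
end

struct Map{F, N}
    f::F
    select::NTuple{N, Symbol}
end

# Tuple of the selected column vectors.
selected(t::NamedTuple, names) = map(n -> getfield(t, n), names)

# Calling a Filter keeps the rows for which `f` returns true.
function (flt::Filter)(t::NamedTuple)
    mask = map(flt.f, selected(t, flt.select)...)
    return map(col -> col[mask], t)
end

# Calling a Map applies `f` row-wise to the selected columns.
(m::Map)(t::NamedTuple) = map(m.f, selected(t, m.select)...)

iris = (Species = ["setosa", "versicolor", "versicolor"],
        SepalLength = [5.1, 7.0, 6.4],
        SepalWidth = [3.5, 3.2, 3.2])

ratios = iris |>
    Filter(Species -> Species == "versicolor", (:Species,)) |>
    Map((SepalLength, SepalWidth) -> SepalLength / SepalWidth,
        (:SepalLength, :SepalWidth))
```

Because the structs are ordinary callables, nothing in `Base.filter` or `Base.map` has to change for the pipeline to work.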

A keyword to filter and map (option 3) also makes a lot of sense.

I think that a macro would actually be an advantage here, indicating that something special is going on, but at the same time allow a construct that is meaningful in isolation as a callable. Eg

@λ x1, x3 -> ...

would have information about which fields are used (if necessary), but should also work fine as a callable as a fallback.
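A rough sketch of such a macro, under the assumption that multiple arguments are written in parentheses, as in `@λ (x1, x3) -> ...`, and that they are plain names without type annotations or defaults. This is one possible shape, not an existing package API.

```julia
struct SelectingClosure{C, F}
    columns::C
    f::F
end

# Fallback: the wrapper can be called like the function it wraps.
(sc::SelectingClosure)(args...) = sc.f(args...)

macro λ(ex)
    Meta.isexpr(ex, Symbol("->")) || error("@λ expects an anonymous function")
    args = ex.args[1]
    names = if args isa Symbol
        (args,)                       # single argument: x -> ...
    elseif Meta.isexpr(args, :tuple) && all(a -> a isa Symbol, args.args)
        Tuple(args.args)              # several arguments: (x1, x3) -> ...
    else
        error("@λ supports plain argument names only")
    end
    # Store the names next to the (escaped) anonymous function.
    return :(SelectingClosure($(QuoteNode(names)), $(esc(ex))))
end

sc = @λ (x1, x3) -> x1 > 0.5 && x3 < 3
```

Table code can dispatch on `SelectingClosure` and read `sc.columns`, while any other caller can simply call `sc(0.7, 2)` as an ordinary function.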