Stack overflow in DataFrames group by


#1

I have a million row data frame and I am running groupby on a column of strings:

by(df, :col) do sdf
    DataFrame(n=size(sdf,1))
end

and I get the following StackOverflowError

ERROR: LoadError: StackOverflowError:
Stacktrace:
 [1] promote_type(::Type{T} where T, ::Type{T} where T, ::Type{T} where T, ::Type{T} where T, ::Vararg{Type{T} where T,N} where N) at ./promotion.jl:127 (repeats 16343 times)
 [2] promote_col_type(...) at /home/expandingman/.julia/v0.6/DataFrames/src/abstractdataframe/abstractdataframe.jl:666
 [3] vcat(::DataFrames.DataFrame, ::DataFrames.DataFrame, ::DataFrames.DataFrame, ::Vararg{DataFrames.DataFrame,N} where N) at /home/expandingman/.julia/v0.6/DataFrames/src/abstractdataframe/abstractdataframe.jl:744
 [4] combine(::DataFrames.GroupApplied{DataFrames.DataFrame}) at /home/expandingman/.julia/v0.6/DataFrames/src/groupeddataframe/grouping.jl:196
 [5] (::DataFrames.#kw##by)(::Array{Any,1}, ::DataFrames.#by, ::DataFrames.DataFrame, ::Symbol, ::Function) at ./<missing>:0

Update: It turns out that promote_type just seems to stack overflow if it gets too many arguments. Is this a Julia problem? It’s probably not ok to pass 10^5 arguments to a function, but there doesn’t seem to be a promote_type for arrays. I’ve added the following horribly hackish code to DataFrames as a temporary quick fix

if length(cols) > 10^3
    T = promote_type(map(x -> Nulls.T(eltype(x)), cols[1:10^3])...)
end

Obviously this is a horrible hack, but works quite will in my application. The good news out of this is that I’m finding groupby impressively fast. I can’t imagine that we are very far behind pandas.


#2

Well, you’ve hit line 666, no wonder why it doesn’t work… :wink:

I suppose you have many different groups? That code is passing one argument to promote_type per vector to concatenate, which can be a lot. We could stop doing that, and just use something like this:

T = mapreduce(x -> Nulls.T(eltype(x)), promote_type, cols)

Would you check that it fixes you problem, and if so make a pull request?

Though there’s a deeper problem, which is that groupby calls combine, which relies on vcat, passing it one DataFrame per group. So the function will have to be recompiled for each number of groups, which doesn’t make sense. We should probably implement a special version of vcat in combine to avoid this. @cjprybol may have comments. The same applies to promote_col_type.

Also, calling promote_type once for each group probably has a significant performance cost, and in most cases it won’t be useful if the function is type stable. Maybe we could use inference to detect this.

(BTW, I guess your concrete use case is more complex than that? Else that’s a pretty inefficient version of FreqTables.freqtable. I think we need to find a way to do these simple summary computations that return a single scalar efficiently.)


#3

Ok, this seems to work without a significant performance penalty, I’ll make a PR. Clever, I hadn’t thought of it.

And to clarify, yes I had many different groups. This is a 10^7 row dataset with about 10^5 groups. I’m a little puzzled about why vcat is being recompiled for each number of groups, can you expand on this a bit?

I definitely think we need some way of allowing metadata to determine types in these cases. As you implied, calling promote_type on thousands of arguments that in almost all cases are identical seems horribly wasteful. I have to imagine there are a lot of other problems like this since combine doesn’t seem to assume much about the dataframes being passed to it. Perhaps this would require rethinking how by works?

I also think we need to test on a bigger dataset. This was only a 4 GB dataset, so the types of issues I ran into here should be pretty common.


#4

Because Julia generally compiles a method for each combination argument types, which includes the number of arguments. This is one should avoid splatting collections which can contain more than a handful of entries. Here it was reasonable to splat in the call to promote since the arguments where passed to vcat. What is silly is to pass these as separate arguments to vcat in the first place. Looks like we introduced this in this PR without realizing it would affect combine. We could go back to doing this, since the overhead of allocating a vector is negligible here and knowing the number of arguments does not allow generating more efficient code:

Base.vcat(dts::AbstractDataTable...) = vcat(AbstractDataTable[dts...]) 
function Base.vcat{T<:AbstractDataTable}(dts::Vector{T})
...

Again, PR welcome!

The problem is that by cannot assume much, since the types of columns are not part of the DataFrames type, and because it doesn’t know which columns will be used by the function. For grouping, I was able to enable specialization by splitting the performance-critical parts into separate functions which take a tuple of vectors (see here](https://github.com/JuliaData/DataTables.jl/pull/79)). This works well because we can specialize on the types of the columns used for grouping only, which are typically not that many. But since for groupby we don’t know what columns are accessed by the function, we would need to pass the types of all columns (essentially passing a type-stable DataFrame), and compile a method for each combination of types of all columns, which is also quite slow. The only solution I can see would be to use macros like in Query, so that we can extract the names of columns used by the function.

So that’s not an easy task. aggregate is potentially easier to make more efficient since it’s simpler, see this PR. In most cases, people don’t need the flexibility of by, which comes with a high performance cost (at least currently).

BTW, thanks for stress-testing git master, we really need to sort these issues out before we tag a release. Adding benchmark for these cases would also be useful.


#5

See https://github.com/JuliaData/DataFrames.jl/pull/1248.


#6

I might have found a solution to at least avoid calling vcat on thousands of small DataFrames. We could use inference to detect whether the function returns a DataFrame, since that doesn’t depend on the input argument’s type. If that’s not the case, then we can assume the function returns one result per group, which allows use to allocate a DataFrame or the appropriate size beforehand, and fill it progressively. We could make use of NamedTuple to allow returning a series of values which should each go into a column of the result, as that allows indicating that each group will need a single row (contrary to DataFrame, which can contain multiple rows even if that’s not the typical case).

The problem of this approach is that generally inference should only be used for optimizations, and the fallback should be as general as possible (and possibly very slow too). But in the present case we know we need a slow path for DataFrame results, while all other types should generate a single row in the final combined DataFrame. And inference cannot know that a function returns e.g. an Int, since that depends on the (unknown) type of columns. So it doesn’t sound too bad to require that functions which return a DataFrame by inferrable, as is generally the case anyway with simple functions like this which end with DataFrame(...).


#7

This may go somewhat against the intended design philosophy but I think that we need a few different functions available depending on the specific context. The issue of allocating appropriate types is difficult, but in a huge number of cases the function passed to by will return dataframes all of which have the same types. Then, when the user calls this function, we can use the first output to determine the types of the rest. In that case it would be up to the user to ensure the function always outputs the proper type. We would of course still have by for inferring types. Ideally we shouldn’t sacrifice performance for edge cases, though perhaps I’m wrong about what the most common use cases are.

I also like @nalimilan’s ideas. It also seems to me that Base really ought to have a version of vcat which takes an iterable as an argument.


#8

Yes, maybe, but there are several levels of complexity regarding what the function returns for each group:

  1. a single scalar
  2. several scalars (e.g. as a tuple, NamedTuple or a single-row DataFrame)
  3. a multiple-row DataFrame (possibly with a varying number of rows depending on the group)

Each of these cases can be either type-stable or not, which makes things even more tricky.

FWIW, Pandas provides three different functions:

  • aggregate to return a single value for each column and each group (similar to our aggregate, which is a bit more general)
  • transform to return a DataFrame with the same shape as the original for each group
  • apply for general transformations (similar to our by)

The advantage of transform over apply is that you know in advance the size of the result, so you can avoid allocating a temporary copy if you can predict the output type. Same for aggregate, which can be even more efficient since it can operate by columns (making inference and specialization easier). The fact that we allow aggregate to return either a scalar or a vector makes it harder to optimize, cf. this PR.

See https://github.com/JuliaLang/julia/issues/21672.


#9

It’s looking like pandas did this for well thought out reasons that don’t necessarily have to do with the limitations of Python. I would not be at all opposed to imitating them for this sort of thing; it is definitely not one of the things that I don’t like about pandas.


#10

One of the things I’m always looking for is a by-like function that returns a simple Array of the result of applying something to SubDataFrames (a bit like R’s tapply, though that only applies an operation to a single column),


#11

But returning a standard Array would lose all information regarding the groups. Maybe you can do that with Query if you collect into an Array. It would make more sense to return a NamedArray/AxisArray, with one dimension per grouping variable. AFAICT, that would be similar to what Pandas does with multi-index.

The problem is that there are so many combinations of types of operations, return values and containers to hold them that the API design isn’t simple. I guess we could allow choosing the type of the returned container by passing it as a first argument.


#12

What I like about “tidyverse” in R is that every function always returns another dataframe (tibble). Keeps things simple to reason about unlike base datafame operations which sometimes return an array and sometimes dataframe.

So I think the operations on DataFrames in Julia should also always return another DataFrame in the end for the user. The internals are a different matter of course.


#13

Like I suggested earlier, it would certainly good to have specific functions for which it is assumed this is the case and up to the user to ensure that the internal types are consistent. I don’t agree that this would be a good thing to do in general, we definitely should still have a general form of by.


#14

Can you expand on why that isn’t a great idea? If a user wants to work with an array or vector, that transformation is easy to do on their own.


#15

For one, I don’t think we should be overly confident in our ability to predict every conceivable use case.

That said, since DataFrames now contain arbitrary AbstractVector types, it might make sense to always return DataFrames from by as long as there is also a more general mapreduce like framework for doing groupby. It would be nice to be able to perform an arbitrary function on sub-dataframes and then apply an arbitrary reducer. I’d just want to be really sure that we are not preventing any interesting uses. I remember the limitations of groupby occasionally being a major frustration for me with pandas, but it’s been so long since I’ve used it now that I can’t think of a specific example.


#16

Just in case: I did not argue against having a general by function. My only point was that all functions of DataFrames should output a DataFrame (by default).


#17

To follow up on this, I’ve just filed a PR fixing it: https://github.com/JuliaData/DataFrames.jl/pull/1261