DataFramesMeta.jl and the state of the DataFrames ecosystem

Hi all,

I’m currently working on a small package that’s supposed to make data wrangling easier (i.e. providing macros for commonly-used operations). As part of that, I’ve found DataFramesMeta.jl quite useful (in particular @with), but it seems that it’s no longer actively maintained. Could you give me a short update on which packages are viable alternatives that are actively maintained – is it just Query.jl? Is the plan to integrate more functionality into DataFrames.jl? What’s generally the roadmap for the DataFrames ecosystem?

Thanks a lot for any pointers.

1 Like

I would say more functionality from DataFramesMeta is being integrated into DataFrames. See this PR that just got merged today adding a transform-like function to DataFrames.

Query and DataFramesMeta are both big dependencies for a package, though. Base dataframes is super flexible so I would recommend just using that.

DataFramesMeta is also very stable and definitey still useable. Development will probably pick up again after DataFrames 1.0 comes out. There isn’t harm in using DataFramesMeta.

5 Likes

Query.jl is not really a big dependency, it is a pretty lightweight package that doesn’t depend on many other things.

3 Likes

DataFramesMeta already have very convenient functions for data wrangling tasks. It would be great if the new data wrangling package focuses on performance and memory efficiency.

2 Likes

Thank you to all of you, in particular to those that work on these packages.

@Yifan_Liu, I’m afraid my package won’t focus on performance, at least not initially. In the applications that I work on, performance/memory is not the binding constraint in the data wrangling part, it’s more about cutting development time.

1 Like

Actually our idea is to add to DataFrames functions similar to DataFramesMeta, but without using macros (like select and transform), so that DataFramesMeta can just provide convenience macros around them. Then maybe DataFramesMeta will deserve some cleanup.

What kind of macros would you like to develop?

6 Likes

I’ve found the existing macros/functions somewhat cumbersome for my purposes – or perhaps for my habits, which come from years of using Stata. Two things in particular:

  • The split-apply-combine operations automatically aggregate DataFrames to the group level when the function returns scalars. That does not work well with an approach where you first construct all variables at a more disaggregate level, and then aggregate up. In my experience, this approach is less prone to human errors than the alternative, which is to construct aggregates using by operations, and perhaps having to join some of them together.
  • Making split-apply-combine operations work in the way you would like to when the data contain missings is non-trivial. It usually involves a mixture of skipmissings and completecases, and by the time I’ve solved it, I long for Stata, which uses only cases where all variables are nonmissing. Of course that’s not always what you want, but it’s pretty straightforward to get what you want.

Again, I do not want to put the blame on the design of the split-apply-combine macros in these packages, it’s just that I seem to be very slow when I use them.

So my goal is to implement something like Stata’s syntax for the basic data wrangling operations, to help me solve my two-language problem (data cleaning in Stata, model estimation in Julia). Julia’s metaprogramming makes that relatively easy, and it’s also a good exercise for myself. So far I’ve been mostly using @with from DataFramesMeta (beyond the functions in DataFrames.jl) and I would be happy if this were to be continued to be supported (or possibly implemented within DataFrames).

2 Likes

Regarding your first point, you mean that DataFarmes collapses too much when doing grouped operations? Something like

julia> using DataFrames

julia> df = DataFrame(a = [1, 1, 2, 2], b = [1, 3, 5, 7])
4×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 1     │ 3     │
│ 3   │ 2     │ 5     │
│ 4   │ 2     │ 7     │

julia> df_ideal = DataFrame(a = [1, 1, 2, 2], mean_b = [2, 2, 6, 6])
4×2 DataFrame
│ Row │ a     │ mean_b │
│     │ Int64 │ Int64  │
├─────┼───────┼────────┤
│ 1   │ 1     │ 2      │
│ 2   │ 1     │ 2      │
│ 3   │ 2     │ 6      │
│ 4   │ 2     │ 6      │

Where by operations return df_ideal? This is available doing @transform from DataFramesMeta on a grouped dataframe.

julia> gd = groupby(df, :a)
julia> @transform(gd, mean_b = mean(:b))
4×3 DataFrame
│ Row │ a     │ b     │ mean_b  │
│     │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼─────────┤
│ 1   │ 1     │ 1     │ 2.0     │
│ 2   │ 1     │ 3     │ 2.0     │
│ 3   │ 2     │ 5     │ 6.0     │
│ 4   │ 2     │ 7     │ 6.0     │

I expect this same functionality will be added to base DataFrames soon (if it isn’t on a milestone somewhere now I will definitely work on it).

I definitely feel tis. I think the new select and transform functionality in DataFrames will make it easier to solve this problem in DataFrames. But yeah it’s not the easiest right now.

2 Likes

Yes indeed, that’s the situation. Your solution with @transform is nice; I had solved it by repeating when the mapped function would otherwise return a scalar.
On this point, I think the documentation of @transform could be improved to say that it allows also modification of existing columns (but I understand that DFM is not a priority now).

Thanks so much!

Great, looking forward.

Yes, select and transform in DataFrames will ideally work on GroupedDataFrame to make this kind of thing possible. Help would be welcome to make this happen sooner. :slight_smile:

Could you give an example of what you’re doing, what Stata does, and what you think would be a solution in Julia?

1 Like

I would be happy to help, to the extent possible, but I’m still in the process of understanding the interfaces, let alone the internals of DataFrames.jl. I’m also a few months away from my tenure review, so perhaps not the most reliable contributor…

Suppose you have unbalanced panel on hourly wages of workers (indexed by :worker) employed at firms (:firm) in a particular year (:year):

julia> a = DataFrame(firm = [1,1,1,1,2,2,2,2], worker = [1,1,2,2,3,3,4,4], year = [2000,2001,2000,2001,2000,2001,2000,2001], wage = 20 .+ 20.0 .* rand(Float64,8))
allowmissing!(a,[:wage])
a[[2,3,6,8],:wage] .= missing
8×4 DataFrame
│ Row │ firm  │ worker │ year  │ wage     │
│     │ Int64 │ Int64  │ Int64 │ Float64⍰ │
├─────┼───────┼────────┼───────┼──────────┤
│ 1   │ 1     │ 1      │ 2000  │ 38.3329  │
│ 2   │ 1     │ 1      │ 2001  │ missing  │
│ 3   │ 1     │ 2      │ 2000  │ missing  │
│ 4   │ 1     │ 2      │ 2001  │ 30.5861  │
│ 5   │ 2     │ 3      │ 2000  │ 35.4777  │
│ 6   │ 2     │ 3      │ 2001  │ missing  │
│ 7   │ 2     │ 4      │ 2000  │ 35.0093  │
│ 8   │ 2     │ 4      │ 2001  │ 25.7551  │

You would like to calculate average wages paid by each firm; first for all years, then separately for the years 2000 and 2001.
In Stata, that’s simply

bysort firm: egen averageWage = mean(wage)
bysort firm year: egen averageWageInYear = mean(wage)

In doing so, Stata ignores missing values in the means, which is what you probably want to do.

Using DataFrames.jl and DataFramesMeta.jl in Julia, I naively start with:

julia> gd = groupby(a, [:firm])
@transform(gd, averageWage = mean(:wage))
8×5 DataFrame
│ Row │ firm  │ worker │ year  │ wage     │ averageWage │
│     │ Int64 │ Int64  │ Int64 │ Float64⍰ │ Missing     │
├─────┼───────┼────────┼───────┼──────────┼─────────────┤
│ 1   │ 1     │ 1      │ 2000  │ 38.3329  │ missing     │
│ 2   │ 1     │ 1      │ 2001  │ missing  │ missing     │
│ 3   │ 1     │ 2      │ 2000  │ missing  │ missing     │
│ 4   │ 1     │ 2      │ 2001  │ 30.5861  │ missing     │
│ 5   │ 2     │ 3      │ 2000  │ 35.4777  │ missing     │
│ 6   │ 2     │ 3      │ 2001  │ missing  │ missing     │
│ 7   │ 2     │ 4      │ 2000  │ 35.0093  │ missing     │
│ 8   │ 2     │ 4      │ 2001  │ 25.7551  │ missing     │

Eh. Not what you want. Using skipmissing gives you what you want:

gd = groupby(a, [:firm])
@transform(gd, averageWage = mean(skipmissing(:wage)))
8×5 DataFrame
│ Row │ firm  │ worker │ year  │ wage     │ averageWage │
│     │ Int64 │ Int64  │ Int64 │ Float64⍰ │ Float64     │
├─────┼───────┼────────┼───────┼──────────┼─────────────┤
│ 1   │ 1     │ 1      │ 2000  │ 38.3329  │ 34.4595     │
│ 2   │ 1     │ 1      │ 2001  │ missing  │ 34.4595     │
│ 3   │ 1     │ 2      │ 2000  │ missing  │ 34.4595     │
│ 4   │ 1     │ 2      │ 2001  │ 30.5861  │ 34.4595     │
│ 5   │ 2     │ 3      │ 2000  │ 35.4777  │ 32.0807     │
│ 6   │ 2     │ 3      │ 2001  │ missing  │ 32.0807     │
│ 7   │ 2     │ 4      │ 2000  │ 35.0093  │ 32.0807     │
│ 8   │ 2     │ 4      │ 2001  │ 25.7551  │ 32.0807     │

That works, but compared to the Stata version it requires two lines and a function from DataFramesMeta that is scantly documented. The first function someone would look for is probably by from DataFrames.jl, which even has an option skipmissing, but that unfortunately does something completely different.

Now suppose you want to calculate the increase in hourly wages for each worker. In our data here, that’s only defined for worker 4. In Stata you do

bysort worker (year): gen wageIncrease = wage[_n] - wage[_n-1]

In Julia, the best solution I came up with is:

a[!,:wageIncrease] = missings(Float64,size(a,1))
sort!(a, [:worker, :year])
f = _df -> @with _df begin
    for i = 2:size(_df,1)
        :wageIncrease[i] = :wage[i] - :wage[i-1]
    end
    _df
end
by(a, [:worker], f)
8×6 DataFrame
│ Row │ worker │ firm  │ worker_1 │ year  │ wage     │ wageIncrease │
│     │ Int64  │ Int64 │ Int64    │ Int64 │ Float64⍰ │ Float64⍰     │
├─────┼────────┼───────┼──────────┼───────┼──────────┼──────────────┤
│ 1   │ 1      │ 1     │ 1        │ 2000  │ 25.0556  │ missing      │
│ 2   │ 1      │ 1     │ 1        │ 2001  │ missing  │ missing      │
│ 3   │ 2      │ 1     │ 2        │ 2000  │ missing  │ missing      │
│ 4   │ 2      │ 1     │ 2        │ 2001  │ 32.9265  │ missing      │
│ 5   │ 3      │ 2     │ 3        │ 2000  │ 30.7823  │ missing      │
│ 6   │ 3      │ 2     │ 3        │ 2001  │ missing  │ missing      │
│ 7   │ 4      │ 2     │ 4        │ 2000  │ 31.2793  │ missing      │
│ 8   │ 4      │ 2     │ 4        │ 2001  │ 32.7701  │ 1.49086      │

I’m sure there must be an easier way to achieve the same. The above is nice because it allows for relatively easy way of addition conditions to the assignment operation.

In this case the presence and propagation of missings is not a problem.

Overall, I think my grouped operations tend to fall into one of two categories: either I want to do operations on the whole vector of variables in each group, such as in the example with mean() above, or it’s a row-wise operation. I cannot think of an example in the latter case where the presence of missings would make my code above not do what I want, and propagation of the missings is fine. In the former case, it’s tricky. Suppose you want to calculate correlations of two variables within groups. Then you’d want to limit yourself to rows where both variables are present (completecases). In other cases it’s again different.

Bottom line: the current way of doing it requires skipmissing() and completecases() in the right places on vector-based operations-- which is not the end of the world, but requires (1) awareness of the fact that missings spread faster than the Coronavirus in vector-based operations, (2) more code.

Stata is, like with many other things, somewhat inconsistent: some egen operations (the closest equivalent of split-apply-combine with vector-valued operations) work only on rows where all variables are non-missing (e.g. cor()), others on where at least one variable is nonmissing (rowtotal()). On vector-valued operations that use one vector only (total(), mean(), etc) I think it’s equivalent to enclosing the variable in a skipmissing().

I have not thought carefully enough about this issue to suggest a solution. My intuition (and what I plan to implement in my package) is that a separation of split-apply-combine for row-wise operations on scalars and for vector-valued operations is useful; also because it helps the user to know whether she needs to use broadcasting or not.

2 Likes

Let me preface this by saying I have a lot of experience with Stata and I have a good understanding of it’s benefits and weaknesses. A main motivation for me to contribute to DataFrames is because I want to bring it up to par with Stata’s ease of use.

True, but bysort firm is the grouping step in this command.

@transform(groupby(df, [:firm, :year]), averageWage = mean(skipmissing(:wage)))

is not too bad. I think skipmissing is a pretty elegant solution for being explicit about handling missing values.

Another pain point that your example highlights, without you realizing it, is what happens when a firm-year is missing for all observations. mean(skipmissing(:wage)) will throw an error in this case. We don’t have a good solution for this at the moment.
EDIT: this is not correct. skipmissing will work as long as the vector is Union{T, Mising}[missing, missing]. But it will fail if the vector is Missing[missing, missing], which could be a problem when reading from CSV, I guess.

The first function someone would look for is probably by from DataFrames.jl, which even has an option skipmissing , but that unfortunately does something completely different.

We should definitely improve the documentation for this. In particular skipmissing is probably not the best keyword argument name for groupby.

There are much easier ways to do this. I would recommend the ShiftedArrays package. A simple google search for “lag in Julia” would have led you to that package.

@transform(groupby(a, :worker), wageIncrease = :wage - lag(:wage))

Stata’s defaults for rowtotal are definitely hard to learn, missings are zero in a sum unless all values are missing, in which case it returns missing. But I agree they give the right answer most of the time.

I would really encourage you contribute to DataFrames! Full disclosure, i was in the same position as you are now a year ago and and wrote @transform(groupby(a, :firm)...) and got it merged to alleviate my frustration.We can help review your PRs.

Aside from improvements to documentation, my vision for how to alleviate your problems are thee-fold

  1. Generic missing improvements. skipmissing by expanding it to work with multiple vectors, try to and be nicer for an iterator with all missings. Issue filed here. mean is never going to work the way Stata does, but we can make explicit handling of missings easier.
  2. Make the yet-to-be-released select and transform methods accept grouped data frames. this would, I think, solve your most immediate problem.
  3. Emulate Stata’s if syntax. let’s say you want to find the maximum year for a worker where you aren’t missing their wage. In Stata this would be
bysort worker: egen max_year = max(year) if !missing(wage)
@if(df, ismissing(:wage) .== false) |> # subset
groupby(:worker) |> 
@transform(max_year = maximum(:year))

Every column in no_missing_a is a view into the vectors of a but just with the indices that are non-missing for all elements, or maybe the indices that are non-missing for the columns used in the @transform call. Maybe it’s a bit more verbose but Stata’s lack of verbosity can definitely be bad at times. EDIT: I should add in that all the indices not selected are filled with missings, just like Stata.

3 Likes

Your three propositions sound very reasonable to me, and I agree that Stata’s lack of verbosity can be a problem.

Some more suggestions:

  • On the if point, one thing that I think is lacking in Stata, is verbosity on which set of observations are arguments, and which ones are being assigned. It would be good if the DataFrames.jl implementation could be also clear on that.
  • Not directly related to the above, but still relevant: it would be great if (at least one version of) sort! could throw a warning if the order of rows is not uniquely determined by the sorting operation. I don’t know how many empirical results have become irreproducible because sloppy sorting operations resulted in indeterminate row orders. If people don’t object, I’m going to file an issue.

Sorry, but I disagree. I worry that it would lead to undetected bugs-- people forgetting about the skipmissing all while being blissfully ignorant about it.

Of course, and writing the lag function yourself would be easy as well. The point is that none of these solutions are a one-liner. Just think about how fast you write data cleaning code in Stata. I think we can get to that point here as well, all while correcting the things that drive us nuts in Stata.

All your points above are great. please file issues for them in DataFrames.

Keep in mind that technically speaking, mean([1, 2, missing, 5]) really should, technically return missing, as we don’t know what the 3rd value is. So the default behavior is relative.

Our behavior mimics R’s except instead of na.rm = TRUE we change the input argument via skipmissings rather than a keyword argument. There are good reasons for this.

  1. People want to write functions and not worry about missing values. Enforcing skipmissing means that developers can be much more focused. Something like
function sum_squares(v)
    m = mean(v)
    s = zero(eltype(v))
    for iv in v
        s += (v - m)^2
    end
end

will work on anything. you don’t want to be unsure whether or not the functions people write properly handle missing values. skipmissing does this for you.

  1. Adding keyword arguments to existing functions is hard to do in Julia, but adding new methods to existing functions is easy.

Just think about how fast you write data cleaning code in Stata. I think we can get to that point here as well, all while correcting the things that drive us nuts in Stata.

Your point about Stata being faster is fair. I definitely haven’t done enough major data cleaning projects (20k column surveys) in Julia yet to be a good judge.

My favorite part about stata is the way you can construct commands via meta-programming. It kills me that you can’t do anything close to that in R. I think we are making good progress on that front by avoiding making users use literals.

But when you seed code like

groupby(a, firm) |> 
@transform(mean_wage = mean(skipmissing(:wage))) |> 

I don’t mind it that much since it reads more like a sentense and is explicit.

4 Likes

Could you elaborate on that?

reg `outcome_var` `regressors`, vce(`cluster_var`)
for var in income* {
    replace `var`_s = `var`_m if `var` > .5
}

stuff like that.

Another reason Stata is tough is that no one write functions in Stata because they don’t know how. But it’s easy to do, so Stata is actually better off than people think it is.

I think the point about lag is just a documentation issue. People should do using StatsKit instead of just using DataFrames when they want an equivalent of Stata. Then it doesn’t matter in which package lag is defined as long as it’s available.

Regarding missing values, I agree having to put skipmissing everywhere is inconvenient, but it’s not worse than R’s na.rm argument. What could be done is that another macro called e.g. @skipmissing could be added to DataFrameMeta pipelines to automatically wrap columns in skipmissing, to avoid repeating it and thinking about where you need to put it.

Well technically I guess we would implement this by checking that allunique(col), so you could do this by hand. Maybe we could add a keyword argument, but would that really be better? Again, that’s probably just a matter of documenting this and showing it in tutorials.

Overall, I encourage you to discuss these issues with us here or on DataFrames/DataFramesMeta GitHub issues rather than starting a new package. In the end everyone will benefit from having convenient APIs in a single package. I know @mkborregaard recently complained about missing values handling too.

3 Likes

I “griped” but yes.
skipmissing isn’t a problem when you do single-vector operations, but as soon as you need multiple vectors it becomes harder. We’ve yet to see how a multi-argument skipmissing will resolve this

With this PR which is almost ready, you’ll just write e.g. cor(skipmissings(x, y)...). Maybe not very easy to explain to newcomers, but at least relatively easy to type.

1 Like

Yes exactly - my hope is that this will resolve most of the issues - and we’ll see :smiley: