DataFramesMeta.jl and the state of the DataFrames ecosystem

I think that’s a good idea, though then it would make sense to have an equivalent @completecases too (and the name is probably not ideal: it’s not immediately clear what the macro does).

I agree. The point is not to have a better API to DataFrames. Rather, the point is to have an API that Stata users are very familiar with. I hope to feed some ideas into the design of the interface in DataFrames.jl, but am well aware that data manipulation habits are very persistent, and what is a good design for some may not be a good one for others (at least in the short run).

This may be true from your perspective, but please keep in mind that Julia users come from various backgrounds, which does not necessarily include Stata; also, Stata should not be considered an API to imitate.

Stata’s approach to missing values can be described as trying to guess what the user wants in common cases. The problem is that this involves a lot of implicit assumptions and can lead to silent mistakes in analysis. A prominent example is missing values in logical statements: in simple cases people learn tricks like

gen var2 = (var < 10) if var !=. 

but it is easy to forget about them for more complex code. Even seasoned Stata users make these mistakes all the time (and are usually unaware of it, unless the analysis is replicated with other programs).

Julia’s approach can be described as propagating missing unless the user deals with it explicitly. This may seem inconvenient at first glance, because some people expect that something “obvious” can be done to missing values. This is not true for any nontrivial code, though: discussions reveal that what people consider the “right” approach can be very different. Also, guessing what the user wants does not compose well.

Choosing how missing values are to be handled explicitly leads to much cleaner code. It also meshes well with Julia’s design: in most cases it can be done at low or zero cost using various wrappers and iterators.
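To make the contrast with the Stata line above concrete, here is a minimal sketch using only Base Julia and the Statistics standard library. The vector `var` and its values are made up for illustration; the point is only that the comparison propagates `missing`, and that the user then picks a treatment explicitly.

```julia
using Statistics  # standard library, provides mean

# Hypothetical data with one missing observation
var = [3, 12, missing, 7]

# The comparison propagates missing instead of guessing a value
var2 = var .< 10

# Explicit choice 1: treat a missing comparison result as false
below10 = coalesce.(var .< 10, false)

# Explicit choice 2: skip missing values with a lazy iterator (no copy is made)
m = mean(skipmissing(var))

# Without an explicit choice, the reduction propagates missing as well
m_naive = mean(var)
```

Both `coalesce` and `skipmissing` are the kind of wrappers and iterators mentioned above: the decision about missing values is visible at the call site rather than implied by the language.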

6 Likes

Tamas, I agree with all your points. This is exactly why I am hesitant to suggest changes to DataFrames.jl itself, but see value in a different interface.

1 Like

On the one hand, I can see why having a familiar interface would be useful, and if you want to make a package like StataDataFrames.jl that wraps DataFrames with a familiar API, no one is going to stop you.

On the other hand, that approach could lead to a lamentable fragmentation of the ecosystem. One might imagine someone doing something similar for dplyr syntax, pandas syntax etc. In some ways, the presence of the Query.jl ecosystem already represents such fragmentation, and that ecosystem uses a different paradigm for missing values etc.

In some ways, this is inevitable I suppose, especially as the community grows, and I don’t fault the Query folks for wanting to take a different approach. But, unless there’s something actually wrong with the DataFrames approach (as opposed to just being unfamiliar), it might be better in your case to write some Stata->DataFrames.jl cheat sheets or “getting started with DataFrames for Stata users” blog posts or something.

I really appreciate how thoughtful your posts have been, and I think this is a great discussion to have. I definitely think it’s worth learning from the things other languages do right. But it sounds like, for the most part, the DF1.0 interface is going to be able to do what you want, though perhaps with a bit more verbosity. It would be a shame IMO to write a wrapper around that for the sake of saving a couple lines of code.

2 Likes

I think this is an important issue, and it would be great if we could discuss this. I’m most grateful for all the work on DataFrames.jl and related packages, and I certainly don’t want to have a negative impact on their development (though, frankly, I doubt that I could!). My impression (also based on @floswald's view) is that people who would prefer a Stata-like interface to DataFrames.jl are a small minority among Julia users. Hence, there would be no risk of a split in the interface that the community uses. And, to re-iterate, I’m well aware of the fact that the Stata-type approach has deficiencies as well. I just think that one shouldn’t throw the baby out with the bathwater.

That’s actually how I started thinking of all this. But in the end I’m using all this to write papers, so I thought I might as well turn it into code that I can use.

1 Like

Great thread, thanks all! Actually already learned some new tricks :)
Not much to contribute though over and above reiterating what I told @jmboehm offline: creating that Stata-like interface for DataFrames.jl won’t do any harm.
I think some place with such tips and tricks would be a great resource. The tutorials by @bkamins are awesome for that, I should spend more time with them.

just for completeness and to be fair, you can do stuff like that in R.

  1. You can always paste together a string exactly as you want and eval that
  2. in particular, a formula can be given as just a string, so all regression stuff is easy to construct programmatically
  3. slightly more advanced is tidy and purrr etc like here for example.

Of course, none of this is proper metaprogramming, and it will remain a hack forever.

I’ve filed a github issue here to discuss skipmissing-related improvements with the new select methods.

Great! Would it be good to collect, in one issue, Stata-like statements of that kind that people think are useful?

The new select just got merged yesterday. And transform is being worked on but has not gotten merged yet. People should check out master and play around and file issues as needed. We really want to hear your feedback before 1.0, but it’s important to understand existing features well first.
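For anyone who wants to play around, here is a small sketch of how `select` interacts with missing values. It assumes the `source => function => target` syntax of released DataFrames.jl 1.x versions, which may differ in details from master at the time of this thread; the data and the column name `income_mean` are made up.

```julia
using DataFrames
using Statistics

# Hypothetical data with one missing income
df = DataFrame(id = [1, 1, 2], income = [10.0, missing, 30.0])

# mean propagates missing, so the naive call gives a column of missing values
naive = select(df, :income => mean)

# Handling missing explicitly inside the transformation
explicit = select(df, :income => (x -> mean(skipmissing(x))) => :income_mean)
```

The anonymous function is the verbose part here, which is the sort of thing the skipmissing-related issue is meant to discuss.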

3 Likes

Experimenting with a REPL mode for data cleaning/analysis in Julia would be interesting. I wonder if it has advantages compared to macros etc.

1 Like

You would have column-name autocompletion for data frames.

This could be cool: a more Stata-esque workflow. It would probably be a very hard package to write, but it’s feasible using something like HeaderREPLs.jl.

In some ways that’s what I have in mind and what I’m trying to work on (GitHub - jmboehm/Douglass.jl: Stata-like toolkit for data wrangling on Julia DataFrames), but progress is slow; I doubt that I’m the most qualified person for this. The first step is to write a parser, then implement the commands, then the REPL mode.

I love the REPL idea. If someone wanted to put that on top of Query.jl, I’d be happy to help with advice and guidance.

2 Likes

An update on this: the following just got merged into DataFrames master:

EDIT: This was broken earlier but is fixed now.

julia> by(df, [:id, :val], :, :income => mean)

Notice the : in the function call. This preserves all existing columns. The resulting mean of income by group will be “spread”, so to speak, across each group.

This is exactly equivalent to

bysort id val: egen income_mean = mean(income)

in Stata. And imo pretty much just as pretty.
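A note for readers on released versions: to my knowledge `by` was later deprecated and is not available in DataFrames.jl 1.x, where the same operation is written with `groupby` and `transform`. The sketch below assumes a DataFrames.jl 1.x installation and uses made-up data.

```julia
using DataFrames
using Statistics

# Hypothetical data: two groups defined by (id, val)
df = DataFrame(id = [1, 1, 2, 2],
               val = [1, 1, 1, 1],
               income = [10.0, 20.0, 30.0, 50.0])

# transform keeps all existing columns and all rows;
# the group mean is repeated for every row of its group
out = transform(groupby(df, [:id, :val]), :income => mean)
```

The new column is named `income_mean` automatically, which happens to match the Stata variable name in the `egen` line.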

12 Likes

Just a small fix, the syntax should be

julia> by(df, [:id, :val], :, :income => mean)

See here for an example.

2 Likes