Is there light at the end of the DataFrames tunnel?

question

#1

I keep trying to experiment with a version of the “DataFrames of the future”, as I understand it. So I would like to have a development version of DataFrames 0.11.0 combined with versions of Nulls, CategoricalArrays, CSV, Query, RData and perhaps Feather so that I can see what changes are necessary in my MixedModels package.

I haven’t been able to install a consistent set of the support packages. The https://github.com/JuliaData/DataFrames.jl/issues/1232 issue lists packages that need to be converted but there doesn’t seem to be a lot of activity there. I imagine many package authors are in the same position I am of not knowing how to begin.

Is there a roadmap for the conversion? Are there any milestones? Is there even consensus on what the final version will look like?


#2

I can’t speak for the main devs, but I have recently been using the latest master of DataFrames along with the latest masters of Nulls, CSV, DataStreams and Feather on “real” data, and so far things are actually looking pretty good! Using Nulls so far is proving much easier than using NullableArrays ever was. Performance-wise there is definitely room for improvement, but without doing any rigorous benchmarking my feeling is that things are ok; I can’t imagine it’s all that far behind pandas.

I think that once the latest Nulls versions of everything get tagged, the situation will suddenly seem quite a lot better. My impression is that once this happens, the ecosystem will no longer seem to be in a state of chaos.


#3

Using the master branch of all cited packages should work, but for MixedModels you’ll probably also need the StatsModels PR for the Nulls port. Note that Pkg will likely complain about version conflicts if you have installed packages which depend on DataFrames, so you may want to use a separate library (as you noted on the issue).

I think the new state of DataFrames is quite stable now. My priority is to check that it works fine with DataArrays (once ported to Nulls) so that people who need performance don’t suffer from major regressions. Then we should be able to tag a release, which should help packages get ported progressively. It would be nice to get more testing before that, though, to limit the amount of breakage faced by users.

Overall the porting process shouldn’t be difficult, as the new release will be quite similar to the old one, except that it uses Null rather than NA, and no longer forces using DataArray columns.
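
To give a rough idea of what "no longer forces DataArray columns" means in practice, here is a minimal sketch. It assumes Julia 1.0 or later, where the value is called `missing` and lives in Base (as discussed further down the thread), rather than the `Null` name in use at the time of this post. The variable names are made up for illustration, and a plain `Vector` stands in for a data frame column.

```julia
# Assumption: Julia 1.0+, where `missing` is defined in Base.
# A column is an ordinary Vector whose element type admits `missing`.
col = [1.0, missing, 3.0]

col_type   = eltype(col)              # Union{Missing, Float64}
total      = sum(skipmissing(col))    # sum over the non-missing entries only
filled     = coalesce.(col, 0.0)      # replace each missing with a default value
propagated = sum(col)                 # without skipmissing, missing propagates
```

The point is only that the standard array machinery is enough; no special container type is required to hold missing values.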


#4

How do you manage to install the latest masters of that suite of packages? For me the Pkg.add, Pkg.checkout sequence doesn’t work on some of the later packages because of dependency conflicts with master versions of earlier packages.


#5

One of the problems right now is that we are waiting on Pkg3; the current state of the package manager is not good. I have just been cloning repos. I can’t deal with the package manager anymore, it drives me crazy sometimes.


#6

I’ve had a number of extensive conversations about this and I do think we’re really close now. The Nulls business is well sorted out and gradually percolating through the system. The rest of the problem is just disentangling the “element type lie” from DataArrays and DataFrames and separating the high-level generic stuff from the low-level implementation of an “abstract data frame” – there’s also a lot of progress there, although I have to confess I have less of a good handle on that. The final piece of the puzzle is named tuples, which are slated for 1.0 but have stalled out a bit, but once we get that in, I think we’ll be in good shape and then we just need to spend some time getting on the same page.


#7

Are you guys planning to tag all the Nulls stuff before 0.7 releases? I think that would be a really good idea so people installing 0.7 would start out with the current data ecosystem.

I have been using the masters for quite a while now, and have been quite happy, much happier than I was with Nullable. To be honest the biggest issue right now is that the package manager is hysterical when I have so many masters pulled (I’ve given up on it and just use git).

Also, why was Nulls changed to Missings?


#8

Yes, we are, but there was a debate about the best name to represent missing values. It has just been changed to missing because most people considered that more explicit. Now we just need to sort out a remaining issue with CategoricalArrays and we should be ready.


#9

Of course we have to make sure that every package gets searched and replaced for missing. I have to say that name change baffles me, but whatever, I’m just looking forward to everything being on the same page.


#10

You can read the whole exchange on the #data channel on Slack.


#11

Since Slack’s history is ephemeral, I would suggest recording the decision and/or discussion either on GitHub or here on Discourse.


#12

That’s a very good point. Much of the discussion has now been copied to the PR.


#13

This doesn’t really matter to me, and missing sounds fine. Missings, however, does not. There’s a convention to use the plural form in package names, but only when it makes sense, surely? Wouldn’t MissingValues (though longer) make a lot more sense as a name?

Sorry for bikeshedding in a neighbouring town here, I just found that name so awkward and cryptic.


#14

There’s been some discussion at https://github.com/JuliaLang/METADATA.jl/pull/12007, but basically the Julia convention is to use the plural of the type for package names, “missings” has many occurrences even in serious publications, and anyway the package is supposed to be temporary.


#15

Why is it going to be temporary?


#16

Because we expect missing to be defined in Base directly, which (among other things) will help with some problems with functions like ==, all and any regarding three-valued logic in the presence of missing values.
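
For anyone unfamiliar with three-valued logic, here is a small sketch of how these functions behave. It assumes a Julia version where `missing` is already in Base (1.0 or later); the variable names are only illustrative.

```julia
# Assumption: Julia 1.0+, where `missing` is defined in Base.

eq          = (missing == 1)            # missing: the comparison result is unknown
any_true    = any([true, missing])      # true: one true is enough, whatever the unknown is
any_unknown = any([false, missing])     # missing: depends on the unknown value
all_unknown = all([true, missing])      # missing: depends on the unknown value
all_false   = all([false, missing])     # false: one false is enough
same        = isequal(missing, missing) # true: isequal never returns missing
```

When a definite `Bool` is needed (for example in an `if`), `isequal` or `ismissing` can be used instead of `==`.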


#17

Oh, in that case, forget I said anything. Sounds perfectly fine, then.


#18

As for this, I thought that convention was simply a fallback when no other reasonable name presented itself.


#19

I know it really doesn’t help to bring this up now, but I feel compelled to point out that missing will not necessarily mean “there is a value but it’s missing”, it may just as often mean “there is no value”. What’s the derivative at 0 of $\sum_{n=0}^{\infty}2^{-n}\cos(12^{n}\pi x)$? It’s not missing, it doesn’t exist. What’s the shoe size of a Burmese python? It’s not missing, it doesn’t exist. What is the mass of the electromagnetic glueball? It’s not missing, it doesn’t exist.

There are lots of cases where values simply don’t exist, but, for better or worse, they wind up having fields in a dataframe somewhere (admittedly the examples above are more colorful than realistic). That’s why null is a great name, because it doesn’t specify. It’s also a lot quicker to type.


#20

In that case you should use nothing (or a similar object, we could support several types of “invalid” values), not missing. The three-valued logic semantics of missing do not apply when there is no uncertainty. These two situations are incorrectly conflated when using NA since it’s not obvious whether it means “not applicable” or “not available” (actually, it means the latter). That’s why missing is a great name, because it does specify what it means.
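
A rough sketch of the difference, assuming Julia 1.0 or later with `missing` in Base. The `shoe_sizes` column and its values are made up for illustration.

```julia
# Assumption: Julia 1.0+. Hypothetical column mixing three situations:
# a known value, an unknown value (missing), and no value at all (nothing).
shoe_sizes = Union{Float64, Missing, Nothing}[9.5, missing, nothing]

# Comparing against missing is uncertain, so the result is missing.
# Comparing against nothing involves no uncertainty, so the result is false.
cmp = shoe_sizes .== 9.5                          # [true, missing, false]

n_unknown = count(ismissing, shoe_sizes)          # entries whose value is unknown
n_absent  = count(x -> x === nothing, shoe_sizes) # entries where no value exists
```

Only the `missing` entry propagates through the comparison; the `nothing` entry behaves like any other definite value.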