Is there light at the end of the DataFrames tunnel?

dmbates · October 10, 2017, 7:48pm

I keep trying to experiment with a version of the “DataFrames of the future”, as I understand it. So I would like to have a development version of DataFrames 0.11.0 combined with versions of Nulls, CategoricalArrays, CSV, Query, RData and perhaps Feather so that I can see what changes are necessary in my MixedModels package.

I haven’t been able to install a consistent set of the support packages. The https://github.com/JuliaData/DataFrames.jl/issues/1232 issue lists packages that need to be converted but there doesn’t seem to be a lot of activity there. I imagine many package authors are in the same position I am of not knowing how to begin.

Is there a roadmap for the conversion? Are there any milestones? Is there even consensus on what the final version will look like?

ExpandingMan · October 10, 2017, 7:57pm

I can’t speak for the main devs, but I have recently been using the latest master of DataFrames along with the latest masters of Nulls, CSV, DataStreams and Feather on “real” data, and so far things are actually looking pretty good! Using Nulls so far is proving much easier than using NullableArrays ever was. Performance-wise there is definitely room for improvement, but without doing any rigorous benchmarking my feeling is that things are ok; I can’t imagine it’s all that far behind pandas.

I think that once the latest Nulls versions of everything get tagged, the situation will suddenly seem quite a lot better. My impression is that once this happens, the ecosystem will no longer seem to be in a state of chaos.

nalimilan · October 10, 2017, 8:08pm

Using the master branch of all cited packages should work, but for MixedModels you’ll probably also need the StatsModels PR for the Nulls port. Note that Pkg will likely complain about version conflicts if you have installed packages which depend on DataFrames, so you may want to use a separate library (as you noted on the issue).

I think the new state of DataFrames is quite stabilized now. My priority is to check that it works fine with DataArrays (once ported to Nulls) so that people who need performance don’t suffer from major regressions. Then we should be able to tag a release, which should help packages to progressively be ported. It would be nice to get more testing before that, though, to limit the amount of breakage faced by users.

Overall the porting process shouldn’t be difficult, as the new release will be quite similar to the old one, except that it uses Null rather than NA, and no longer forces using DataArray columns.
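To illustrate the point about columns no longer needing to be DataArrays, here is a minimal sketch. It is not from the thread: it assumes Julia 1.0 or later, where the value is spelled `missing` (the name the thread settles on further down) and is defined in Base. The variable names `col` and `total` are made up for the example.

```julia
# Assumption: Julia 1.0+, where `missing` is available without any package.
# A column with missing entries is an ordinary Vector whose element type
# is a Union, not a special DataArray container.
col = [1.5, missing, 3.0]

@assert col isa Vector{Union{Missing, Float64}}

# Count the missing entries with a plain Base function.
@assert count(ismissing, col) == 1

# Reductions must opt out of missing values explicitly.
total = sum(skipmissing(col))
println(total)
```

Because the column is a plain `Vector`, generic array code keeps working on it; only code that assumed `NA` or a `DataArray` would need porting.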

dmbates · October 10, 2017, 8:09pm

How do you manage to install the latest masters of that suite of packages? For me the Pkg.add, Pkg.checkout sequence doesn’t work on some of the later packages because of dependency conflicts with master versions of earlier packages.

ExpandingMan · October 10, 2017, 8:10pm

One of the problems right now is that we are waiting on Pkg3; the current state of the package manager is not good. I have just been cloning repos. I can’t deal with the package manager anymore; it drives me crazy sometimes.

StefanKarpinski · October 10, 2017, 11:21pm

I’ve had a number of extensive conversations about this and I do think we’re really close now. The Nulls business is well sorted out and gradually percolating through the system. The rest of the problem is just disentangling the “element type lie” from DataArrays and DataFrames and separating the high-level generic stuff from the low-level implementation of an “abstract data frame” – there’s also a lot of progress there, although I have to confess I have less of a good handle on that. The final piece of the puzzle is named tuples, which are slated for 1.0 but have stalled out a bit, but once we get that in, I think we’ll be in good shape and then we just need to spend some time getting on the same page.

ExpandingMan · November 15, 2017, 2:35pm

Are you guys planning to tag all the Nulls stuff before 0.7 releases? I think that would be a really good idea so people installing 0.7 would start out with the current data ecosystem.

I have been using the masters for quite a while now, and have been quite happy, much happier than I was with Nullable. To be honest the biggest issue right now is that the package manager is hysterical when I have so many masters pulled (I’ve given up on it and just use git).

Also, why was Nulls changed to Missings?

nalimilan · November 15, 2017, 3:00pm

Yes, we do, but there was a debate about the best name to represent missing values. It has just been changed to missing because most people considered it more explicit. Now we just need to sort out a remaining issue with CategoricalArrays and we should be ready.

ExpandingMan · November 15, 2017, 3:33pm

Of course we have to make sure that every package gets searched and replaced for missing. I have to say that name change baffles me, but whatever, I’m just looking forward to everything being on the same page.

mkborregaard · November 15, 2017, 4:24pm

You can read the whole exchange on the #data channel on Slack.

vchuravy · November 17, 2017, 1:01am

Since Slack’s history is ephemeral, I would suggest recording the decision and/or discussion either on GitHub or here on Discourse.

mkborregaard · November 17, 2017, 9:06am

That’s a very good point. Much of the discussion has now been copied to the PR:
https://github.com/JuliaData/Missings.jl/pull/51

DNF · November 17, 2017, 9:36am

This doesn’t really matter to me, and missing sounds fine. Missings, however, does not. There’s a convention to use the plural form in package names, but only when it makes sense, surely? Wouldn’t MissingValues (though longer) make a lot more sense as a name?

Sorry for bikeshedding in a neighbouring town here, I just found that name so awkward and cryptic.

nalimilan · November 17, 2017, 10:56am

There’s been some discussion at https://github.com/JuliaLang/METADATA.jl/pull/12007, but basically the Julia convention is to use the plural of the type for package names, “missings” has many occurrences even in serious publications, and anyway the package is supposed to be temporary.

Diego_Javier_Zea · November 17, 2017, 11:16am

Why is it going to be temporary?

nalimilan · November 17, 2017, 11:25am

Because we expect missing to be defined in Base directly, which (among other things) will help with some problems with functions like ==, all and any regarding three-valued logic in the presence of missing values.
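A minimal sketch of what three-valued logic means for these functions. It assumes a Julia version where `missing` is already defined in Base (1.0 or later); behaviour with the Missings package on older versions may differ. The variable names are made up for the example.

```julia
# Assumption: Julia 1.0+, where `missing` lives in Base.

# Comparing with an unknown value gives an unknown answer, not `false`.
cmp = (missing == 1)
@assert ismissing(cmp)

# `any` can answer `true` as soon as one element is known to be true...
any_known = any([true, missing])
# ...but cannot answer when every known element is false.
any_unknown = any([false, missing])

# `all` mirrors this: one known `false` settles it, otherwise it is unknown.
all_known = all([false, missing])
all_unknown = all([true, missing])

println((any_known, ismissing(any_unknown)))
println((all_known, ismissing(all_unknown)))

# When a plain true/false answer is required, `isequal` and `ismissing`
# never return `missing`.
same = isequal(missing, missing)
println(same)
```

The practical consequence is that code such as `if x == y` needs an explicit decision about what to do when the comparison is `missing`.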

DNF · November 17, 2017, 11:43am

Oh, in that case, forget I said anything. Sounds perfectly fine, then.

DNF · November 17, 2017, 11:46am

As for this, I thought that convention was simply a fallback when no other reasonable name presented itself.

ExpandingMan · November 17, 2017, 2:44pm

I know it really doesn’t help to bring this up now, but I feel compelled to point out that missing will not necessarily mean “there is a value but it’s missing”, it may just as often mean “there is no value”. What’s the derivative at 0 of \sum_{n=0}^{\infty}2^{-n}\cos(12^{n}\pi x)? It’s not missing, it doesn’t exist. What’s the shoe size of a Burmese python? It’s not missing, it doesn’t exist. What is the mass of the electromagnetic glueball? It’s not missing, it doesn’t exist.

There are lots of cases where values simply don’t exist, but, for better or worse, they wind up having fields in a dataframe somewhere (admittedly the examples above are more colorful than realistic). That’s why null is a great name, because it doesn’t specify. It’s also a lot quicker to type.

nalimilan · November 17, 2017, 3:20pm

In that case you should use nothing (or a similar object, we could support several types of “invalid” values), not missing. The three-valued logic semantics of missing do not apply when there is no uncertainty. These two situations are incorrectly conflated when using NA since it’s not obvious whether it means “not applicable” or “not available” (actually, it means the latter). That’s why missing is a great name, because it does specify what it means.
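The difference between the two values can be sketched as follows. This assumes Julia 1.0 or later, where both `nothing` and `missing` are in Base; it is an illustration rather than code from the thread, and the variable names are made up.

```julia
# Assumption: Julia 1.0+.

# `missing` stands for "a value exists but is unknown", so it propagates
# through arithmetic and comparisons.
sum_with_missing = 1 + missing
eq_missing = (missing == missing)
@assert ismissing(sum_with_missing)
@assert ismissing(eq_missing)

# `nothing` stands for "there is no value". There is no uncertainty,
# so comparison gives an ordinary Bool.
eq_nothing = (nothing == nothing)
@assert eq_nothing === true

# Arithmetic on `nothing` is simply not defined and raises a MethodError.
threw_method_error = try
    1 + nothing
    false
catch err
    err isa MethodError
end
println(threw_method_error)
```

A `nothing` therefore fails loudly if it reaches numeric code, while a `missing` flows through and makes the result `missing`.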
