DataTables or DataFrames?

After a lot of reading, I’m still not certain about the state of data handling in Julia.

My understanding is that there is a migration to DataTables due to type-stability considerations; however, a number of functions are still not supported?

Will DataFrames eventually be deprecated completely?

3 Likes

There might be a road map and a plan somewhere, but these road maps have changed a fair bit over the last two years, and I wouldn’t be surprised if they change again. I think at this point the long-term story is simply not set in stone.

In the meantime, take a look at https://github.com/davidanthoff/IterableTables.jl. That package might make it easier to live with the current multitude of different packages by providing options to convert between different table types and also enabling a variety of packages that traditionally only worked with DataFrames to now work with any tabular data source. My https://github.com/davidanthoff/Query.jl package also works with any of the available table types. If you want to use IterableTables and Query at the same time, you’ll need to be on master for both. My goal is to tag a new joint release for both packages in about two weeks.
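For example, converting between the two table types is just a constructor call once IterableTables is loaded. A minimal sketch (the column names here are made up):

```julia
using IterableTables, DataFrames, DataTables

# Any iterable table source can be materialized into any supported sink
# simply by calling the sink's constructor on it.
dt = DataTable(a = [1, 2, 3], b = ["x", "y", "z"])
df = DataFrame(dt)   # DataTable -> DataFrame
dt2 = DataTable(df)  # and back again
```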

Best,
David

4 Likes

Hi Natasha,

I’m one of the stats/data developers. In short, I think the answer is DataFrames.

Why DataFrames? At the moment, I think it provides superior usability and more widespread, mature support in the ecosystem, at the cost of slightly poorer performance compared to DataTables.


Now, for a bit of background. Continue at your own risk…

Some of my colleagues may disagree on this, but my take is that at this point, the future of DataFrames and DataTables is somewhat unclear. Originally the plan was to convert DataFrames from a DataArrays backend to a NullableArrays backend, as NullableArrays currently offers somewhat improved performance over DataArrays due to its type stability. NullableArrays uses the Nullable type everywhere, which ensures that the output of a given function has a predictable type. DataArrays effectively uses Union{T, NAtype} for some type T. Since Unions are not (yet!) well optimized by the compiler, the performance lags behind that of NullableArrays.

A while back, the DataFrames master branch made the switch to NullableArrays, and a release was planned. As people began to try out the master branch, it became clear that the change to Nullable was massively breaking, particularly without more widespread package support or infrastructure in place, and that it made DataFrames more frustrating to work with. An example of this comes from an issue that sparked my proposal to separate the two packages. We decided to keep the classic DataFrames alive and well going forward and maintain its Nullable counterpart as DataTables. We considered eventually deprecating DataFrames in favor of DataTables, but I, as well as some of my colleagues, still prefer DataFrames and don’t want that to happen.

DataArrays (with its NA value) and NullableArrays (with the Nullable type) offer different mental models of missing data. A Nullable is a container that holds either zero values or one value, whereas NA, like Union{T, NAtype}, is simply a scalar. Coming from R, SAS, or other statistical software, one might expect to use NA as a scalar that propagates through arithmetic operations. Using Nullables requires thinking a little differently: currently one must use broadcast to obtain what’s called “lifting,” where an operation returns a null value when passed a null value. This way of thinking is preferred by some and disliked by others.
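To make the contrast concrete, here is a minimal sketch of the two models (assuming the DataArrays and NullableArrays APIs of that era):

```julia
using DataArrays, NullableArrays

# DataArrays model: NA is an ordinary scalar that propagates, as in R.
da = DataArray([1.0, 2.0, 3.0])
da[2] = NA
da .+ 1   # 3-element DataArray: [2.0, NA, 4.0]

# NullableArrays model: every element is a Nullable container, so operations
# must be "lifted" elementwise, typically via broadcast (dot syntax).
na = NullableArray([1.0, 2.0, 3.0], [false, true, false])  # 2nd arg: isnull mask
na .+ 1   # 3-element NullableArray: [2.0, #NULL, 4.0]; broadcast lifts +
```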

An enormous amount of thought and care has gone into the discussion for what to do next. This discussion has been spearheaded largely by John Myles White and other JuliaStats developers, with input from the community. We (the developers) are hoping that the next release of Julia will bring optimizations for union types, which will permit optimization of the DataArrays-style approach to missing data, and that the relevant types and operations can be moved into Base.


Now, a word on the state of the ecosystem…

Some packages have (in my opinion, too hastily) adopted DataTables (e.g. CSV and RCall), whereas many are still set up to use DataFrames exclusively (e.g. Gadfly and GLM). We’re hoping that this can eventually be reconciled by providing a tabular data abstraction that enforces a common API that packages can code against. That would allow users to say using MyFavoriteTable, and so long as MyFavoriteTable adheres to the abstract table API, things “just work.” That’s the ultimate goal. There has been some work toward this, but we aren’t there yet.
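To illustrate the idea (this is purely hypothetical; the names AbstractTable, columnnames, and getcolumn below are invented for this sketch, and no such interface exists yet):

```julia
# A hypothetical common table abstraction that packages could code against.
abstract type AbstractTable end

# Each table type would implement a small set of generic functions:
columnnames(t::AbstractTable) = error("not implemented")
getcolumn(t::AbstractTable, name::Symbol) = error("not implemented")

# A consumer (say, a plotting package) then never mentions DataFrames or
# DataTables directly:
xy(t::AbstractTable, x::Symbol, y::Symbol) = (getcolumn(t, x), getcolumn(t, y))
```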

In the meantime, the best choice of tabular data storage in Julia depends somewhat on your needs, but for general purposes, I’d recommend that users and package authors continue to use and support DataFrames.


I realize that’s an incredibly long-winded answer, but I hope it will prove useful to you and to anyone else sharing your concerns or curiosity.

Regards,
Alex

22 Likes

Some packages have (in my opinion, too hastily) adopted DataTables (e.g. CSV and RCall), whereas many are still set up to use DataFrames exclusively (e.g. Gadfly and GLM). We’re hoping that this can eventually be reconciled by providing a tabular data abstraction that enforces a common API that packages can code against. That would allow users to say using MyFavoriteTable, and so long as MyFavoriteTable adheres to the abstract table API, things “just work.” That’s the ultimate goal. There has been some work toward this, but we aren’t there yet.

Note that this actually works with IterableTables.jl (https://github.com/queryverse/IterableTables.jl) today, and for both of the examples that @ararslan mentioned, i.e. Gadfly and GLM (and some others too). The only slightly annoying thing right now is that the package is not registered in METADATA; I’m just waiting for a new release of SimpleTraits.jl (https://github.com/mauro3/SimpleTraits.jl) and then I’ll register it.

And there is one more case: some packages have started to use DataFrames with NullableArrays for their columns. I hope that we can at least move away from that soon, i.e. either use DataFrames with DataArrays, or DataTables with NullableArrays, but not mix and match.

1 Like

I appreciate the long answer. This is very enlightening!

I have also found @quinnj’s DataStreams to be extremely helpful. The idea is that one can provide a “bare minimum” standard interface for each type of tabular data, which can later be “glued in” to other code. This means that, in principle, you only have to write <100 lines of code to make a tabular data structure compatible with every other tabular data structure. A bunch of packages have already implemented this (the DataTables implementation was just recently merged).
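As a rough sketch of the model (using the DataStreams-era names Data.Source, Data.Sink, and Data.stream!; exact constructor signatures varied between releases, so treat this as illustrative):

```julia
using DataStreams, CSV, DataTables

# Any Data.Source can be streamed into any Data.Sink, so implementing the
# interface once makes a table type interoperable with every other one.
source = CSV.Source("data.csv")         # a Source over a CSV file
sink = Data.stream!(source, DataTable)  # push all rows into a DataTable sink
dt = Data.close!(sink)                  # finalize and retrieve the table
```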

I’ve recently written a wrapper for this that takes anything implementing the DataStreams interface and expands its functionality to have most of what you’d expect from a data frame implementation. It’s still in its early stages, and it really isn’t designed for DataTables/DataFrames interoperability, since it’d probably be a bit inefficient for that sort of thing (also, I am currently spitting out “blocks” of data to DataTables).

I think DataStreams is a really good way forward, as it should provide an interface not only between different tabular formats within Julia, but also between different file and database formats. Eventually I’d like to see it make it trivially easy to use different types of backends.

By the way, I was initially quite concerned that the DataTables approach would be problematic, but I’ve been using it for a while now and have found it to be a good approach. There certainly need to be a lot more utility/lifting functions available, but I think once that happens it’ll work quite well.

Pandas.jl provides a fairly frictionless and efficient bridge between Julia and Pandas, so you can take advantage of the mature Pandas DataFrame type until the Julia ecosystem settles down a bit.
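For instance (a minimal sketch; Pandas.jl mirrors the Python method names, so most pandas idioms carry over directly):

```julia
using Pandas

# Construct a pandas DataFrame from Julia data; the object lives in Python
# but is manipulated from Julia with the usual pandas vocabulary.
df = Pandas.DataFrame(Dict("a" => [1, 2, 3], "b" => ["x", "y", "z"]))
head(df)      # first rows, as in Python
describe(df)  # summary statistics
```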

2 Likes

Is that being maintained? I tried it a few weeks ago (and just now again) and gave up because even the first couple of lines from the README didn’t work.

Well, it’s my package and I try to keep it maintained :). I’ll respond to any issues people open.

Thanks all for your helpful and detailed replies :slight_smile: I guess I’ll just keep my eyes on it for now.

I just want to point out that (in my opinion at least) DataFrames and DataTables have one huge advantage over the obviously much-more-mature pandas: they are very, very easy to understand. Part of this is simply a result of Julia being a great language; the other part comes from pandas being bloated and unnecessarily complicated. Pandas has tons of features that are rarely used, and because of the limitations of Python (the fact that it needs to call C code to do anything, and the expression problem), this can make pandas hard to deal with.

In contrast, DataFrames and DataTables are almost ridiculously simple. The only aspects of them that are at all complicated are groupbys and joins. You can very easily pick apart how they work in the REPL, for example:

```julia
julia> data = DataTable(A=rand(5), B=rand(Int,5));

julia> fieldnames(data)
2-element Array{Symbol,1}:
 :columns 
 :colindex

julia> typeof.([getfield(data,f) for f ∈ fieldnames(data)])
2-element Array{DataType,1}:
 Array{Any,1}    
 DataTables.Index

julia> typeof(data.columns[1])
NullableArrays.NullableArray{Float64,1}

julia> fieldnames(data.columns[1])
3-element Array{Symbol,1}:
 :values
 :isnull
 :parent
```

Well, what do you know: a DataTable consists of a Vector{Any} of NullableVectors, each of which is presumably just a pair of vectors (one containing the data, and one specifying whether or not each field is null). You can do all sorts of things like this. Go ahead, try the equivalent of this with pandas using dir and vars, and see how far you get. The fact that Julia data frames are so simple means you can do most of what you can do in pandas without much specialized knowledge of the package, and because it’s Julia, you can do things that would be unthinkably expensive in Python, like iterating over all the fields of large datasets.

I have found even a very immature Julia package like DataTables to be so much easier to use than the venerable Python pandas, and that is definitely something that makes me never want to go back to Python.

Thanks for your response :slight_smile: I guess I was actually hoping to write a ‘cheatsheet’ comparing Stata, pandas, and DataFrames.jl. But at the moment, given that the packages are still being developed and there is no standard, I think I’ll leave it.

I get that pandas seems so big that it’s hard to find exactly what you want to do… but I don’t particularly like the idea of siloing every package to such an extent. I think readtable was deprecated in favour of using CSV.jl? I really like to keep my imports to a minimum so I can track what I’m using more easily… and importing a package to write one line of code seems annoying.

I think if you just pick either DataFrames or DataTables and stick with it for a while you’ll be fine. My only caveat is that if you are dealing with datasets ≳5 GB, you might find the type-stability issue in DataFrames to be an obstacle. I have a few simple tools for doing conversions in such cases in my DataUtils package, which you can just copy over if you don’t want to rely on my package.
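The kind of conversion I mean looks roughly like this (a sketch, not the actual DataUtils code; it assumes NA-free columns can be narrowed to plain Vectors, and the name concretize is made up):

```julia
using DataFrames, DataArrays

# Narrow each NA-free DataArray column to a concrete Vector so that
# downstream loops over the columns are type-stable.
function concretize(df::DataFrame)
    out = DataFrame()
    for name in names(df)
        col = df[name]
        # dropna returns a plain Vector{T}; only safe when no NAs are present
        out[name] = any(isna(col)) ? col : convert(Vector{eltype(col)}, dropna(col))
    end
    out
end
```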

Edit: The one other thing I’ve forgotten about is the lack of something like the very nice DataFramesMeta; however, @davidanthoff seems to have been working away furiously on Query.jl, so I’d imagine that’s quite nice and compatible with lots of things by now. Like I said, the simplicity of DataFrames/DataTables has made it much easier to write simple tools, so I’ve actually found the simple things in my DataUtils package to be quite adequate most of the time.

It’d be nice if DataFrames/DataTables added CSV as a dependency IMHO. I don’t think readtable is deprecated just yet though.

We’ve just deprecated readtable in DataTables, but not in DataFrames. Re-exporting CSV.read under some form is still a possibility, though.
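For reference, CSV.jl’s reader already takes a sink type via DataStreams, along these lines (a sketch; keyword options elided):

```julia
using CSV, DataFrames, DataTables

df = CSV.read("data.csv", DataFrame)  # materialize as a DataFrame
dt = CSV.read("data.csv", DataTable)  # or as a DataTable
```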

Yes, Query.jl works with DataTables. At least on master, I also have an unexported, experimental method interface that might at one point look more like dplyr. You can do things like this with it:

```julia
Query.@where(dt, i->i.a==2)
```

That will return an iterable table, so you can convert that back into any of the supported iterable table sinks, for example back into a DataTable:

```julia
DataTable(Query.@where(dt, i->i.a==2))
```
But be warned: I’m still experimenting with the syntax for this method-based approach, so things might change.
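For comparison, the stable, exported interface is the @from macro, which expresses the same query (assuming a table dt with a column a):

```julia
using Query, DataTables

dt = DataTable(a = [1, 2, 3], b = ["x", "y", "z"])

# LINQ-style query: filter rows where a == 2 and collect into a DataTable.
result = @from i in dt begin
    @where i.a == 2
    @select i
    @collect DataTable
end
```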

We’ve just deprecated readtable in DataTables, but not in DataFrames. Re-exporting CSV.read under some form is still a possibility, though.

Another option would be to have some metapackage like DataVerse at some point that just loads everything one typically needs for data work.

1 Like

Yes, I for one would like a metapackage to make things easier. Obviously not a development target, but if there were just a single package I could point to that solved all of my conversion/file-format woes, that would be great for teaching and REPL usage. Also great for writing scripts. Then some interface like IterableTables.jl would be a good low-dependency developer target.

Honestly, I find the current situation for data formats very confusing, though the Anthoff-verse is pretty nice.

2 Likes

It is definitely easier to understand the internals of DataFrames/DataTables, but I question the assumption that understanding the internals of a tabular data package makes it easier to use. In my opinion, a good tabular data package allows one to abstract away from internals altogether and simply provides a nice API to do things with your data.

I think there are a few things to decouple here. The question is: who are you targeting? There are a few different target audiences for a package:

  • End Users: You want something that end users can use to do anything their heart desires. Number of dependencies really doesn’t matter, as long as it is automatic to install and not excessive. The key is for it to be feature-filled and really well documented.

  • Other developers: You want something that others can easily hack away at to add whatever odd stuff they need. The key is to have concise code so that others can easily dig in. If features are missing, that doesn’t matter because you can assume the user can just read the source and add whatever they need. Documentation isn’t that crucial if you have comments and docstrings.

  • Developer target: Something that you want to offer as a component for other packages. You want this to be as small and stable a dependency as possible. The interfaces should be really well documented so that everything meshes well, but you’re focusing on offering a good, small core of features.

DataStreams and IterableTables are developer targets. I think IterableTables has a good future, since it fills this niche very well. Pandas is clearly in the first category, as it tries to offer you a universe and assumes most people will never read the source. I kind of think DataFrames and DataTables are slightly confused between the first two. There is a push for them to be “end user” packages, but then all of the extra functionality for that is supposed to come either from user code or from external packages which extend DataFrames (like DataFramesMeta, StatPlots for plotting, CSV.jl, etc.).

But if you had to ask “what is the package I can go to, with one comprehensive and coherent documentation, that can do just about anything related to tabular data?”, I don’t think we have a good answer. Instead, when someone has a complicated workflow and is looking for an end-to-end solution, we get pointed to a chain of three packages with a few other options.

I think this is partially because conditional dependencies don’t really have a good answer yet, but also because we have kind of been taking “there exists a solution” to mean “it’s easy to find all of the pieces of a solution and patch them together.” In my experience, it isn’t that easy if you’re not “in the know” about the latest solutions.

6 Likes