DataTables or DataFrames?

#14

I think if you just pick either of DataFrames or DataTables and stick with it for a while you’ll be fine. My only caveat is that if you are dealing with datasets ≳5GB you might find the type stability issue in DataFrames to be an obstacle. I have a few simple tools for doing conversions in such cases in my DataUtils package, which you can just copy over if you don’t want to rely on my package.

Edit: The one other thing I’d forgotten about is the lack of something like the very nice DataFramesMeta; however, @davidanthoff seems to have been working away furiously on Query.jl, so I’d imagine that’s quite nice and compatible with lots of things by now. Like I said, the simplicity of DataFrames/DataTables has made it much easier to write simple tools, so I’ve actually found the simple things in my DataUtils package to be quite adequate most of the time.


#15

It’d be nice if DataFrames/DataTables added CSV as a dependency IMHO. I don’t think readtable is deprecated just yet though.


#16

We’ve just deprecated readtable in DataTables, but not in DataFrames. Re-exporting CSV.read under some form is still a possibility, though.
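For reference, a sketch of going through CSV.jl directly instead of readtable (the file name is made up, and the sink argument follows the DataStreams-era API):

```julia
using CSV, DataTables

# Read a file straight into a DataTable; the sink type picks the container.
dt = CSV.read("data.csv", DataTable)
```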


#17

Yes, Query.jl works with DataTables. At least on master I also have an unexported, experimental method interface that might at one point look more like dplyr. With that, you can do things like this:

```julia
Query.@where(dt, i->i.a==2)
```

That will return an iterable table, so you can convert it back into any of the supported iterable table sinks, for example back into a DataTable:

```julia
DataTable(Query.@where(dt, i->i.a==2))
```

But be warned, I'm still experimenting with the syntax for this method-based approach, so things might change.
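For contrast, a sketch of the stable, exported query syntax on the same (assumed) table:

```julia
using Query, DataTables

dt = DataTable(a = [1, 2, 3], b = ["x", "y", "z"])

# Standard LINQ-style syntax; collects the result back into a DataTable.
result = @from i in dt begin
    @where i.a == 2
    @select i
    @collect DataTable
end
```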

[quote="nalimilan, post:16, topic:3160, full:true"]
We've just deprecated readtable in DataTables, but not in DataFrames. Re-exporting CSV.read under some form is still a possibility, though.
[/quote]

Another option would be to have some metapackage like `DataVerse` at some point that just loads everything one typically needs for data work.
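A sketch of what such a metapackage could look like (`DataVerse` is just the hypothetical name from this post; Reexport.jl does the forwarding):

```julia
module DataVerse

using Reexport

# Load and re-export everything one typically needs for data work.
@reexport using DataFrames
@reexport using CSV
@reexport using Query

end # module
```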

#18

Yes, I for one would like a metapackage to make things easier. Obviously not a development target, but if there were just a single package I could point to that solved all of my conversion/file-format woes, that would be great for teaching and REPL usage, and also great for writing scripts. Then some interface like IterableTables.jl would be a good low-dep developer target.

Honestly, I find the current situation for data formats very confusing, though the Anthoff-verse is pretty nice.


#19

It is definitely easier to understand the internals of DataFrames/DataTables, but I question the assumption that understanding the internals of a tabular data package makes it easier to use. In my opinion, a good tabular data package allows one to abstract away from internals altogether and simply provides a nice API to do things with your data.


#20

I think there are a few things to decouple here. The question is: who are you targeting? There are a few different targets for a package:

  • End Users: You want something that end users can use to do anything their heart desires. Number of dependencies really doesn’t matter, as long as it is automatic to install and not excessive. The key is for it to be feature-filled and really well documented.

  • Other developers: You want something that others can easily hack away at to add whatever odd stuff they need. The key is to have concise code so that others can easily dig in. If features are missing, that doesn’t matter because you can assume the user can just read the source and add whatever they need. Documentation isn’t that crucial if you have comments and docstrings.

  • Developer target: Something that you want to offer as a component for other packages. You want this to be as small and stable a dependency as possible. The interfaces should be really well documented so that everything meshes well, but you’re focusing on offering a good, small core of features.

DataStreams and IterableTables are developer targets. I think IterableTables has a good future since it fills this niche very well. Pandas is clearly in the first category, as it tries to offer you a universe and assumes most people will never read the source. I kind of think DataFrames and DataTables are caught between the first two: there is a push for them to be “end user” packages, but all of the extra functionality for that is supposed to come either from user code or from external packages which extend DataFrames (like DataFramesMeta, StatPlots for plotting, CSV.jl, etc.).
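As a tiny illustration of why IterableTables fills that niche well, a sketch (the column name is made up for the example):

```julia
# Any iterable-table source can be poured into any supported sink through
# one shared interface; here a DataTable becomes a DataFrame.
using IterableTables, DataTables, DataFrames

dt = DataTable(a = [1, 2, 3])
df = DataFrame(dt)  # conversion goes through the iterable tables interface
```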

But if you had to ask “what is the package I can go to, with one comprehensive and coherent set of documentation, that can kind of just do anything related to tabular data?”, I don’t think we have a good answer. Instead, when someone has a complicated workflow and is looking for an end-to-end solution, we get pointed to a chain of 3 packages with a few other options.

I think this is partially because conditional dependencies don’t really have a good answer yet, but also because we have kind of been taking “there exists a solution” to mean “it’s easy to find all of the pieces for solutions and patch them together”. In my experience, it isn’t that easy if you’re not “in the know” with what the latest solutions are.


#21

Agreed. Coming from R and a non-CS background, I find it painful to work with data in Julia. At work I’m a heavy user of R’s dplyr, and though we have DataFramesMeta.jl and Query.jl here, things become complicated when dealing with missing values. In Julia’s Base we have Nullable, and in DataFrames.jl we have NAs; if I’m using functions from Base with no methods for handling NAs, then it’s a problem. In R I’m used to working with a single type that works seamlessly in every function/operation, namely the class NA. I’m hoping that we can have a single type for missing values, either Nullable or NA. I do like Nullable, by the way; it always reminds me of R when I work with NAs.
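To make the split concrete, a sketch of the two representations as they stood (API details recalled from the 0.6-era packages, so treat this as illustrative):

```julia
using DataArrays, NullableArrays

da = @data([1, 2, NA])                               # DataFrames-style NA
na = NullableArray([1, 2, 3], [false, false, true])  # DataTables-style Nullable

mean(da)  # NA — propagates, but only through NA-aware methods
```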

I second that motion for Anthoff-verse, I’m still learning Query.jl. :wink:


#22

Part of the point I was making was that it is much easier to work with DataTables and DataFrames even when some of the core features are under active development than it would be to do something analogous with something like pandas. However, since pandas is pretty mature (despite the fact that they refuse to do a 1.0 release), it isn’t really relevant.

Otherwise I agree with @ChrisRackauckas, it depends on what you are doing. I have found that getting data from tabular formats into “linear algebra” formats amenable to machine learning, optimization, or whatever else, in a consistent, uniform way that doesn’t have to be specially tailored to each individual application, is a surprisingly onerous task, and one that I sometimes find pandas surprisingly ill-suited for, in no small part because the opacity of the package and the level of specialized knowledge required to work with it effectively are obstacles. Granted, most of what I am saying is more a Python vs Julia thing, not a pandas vs DataTables thing. There’s no question that we’ll all be much better off when the Julia “data science ecosystem” is more mature.


#23

For what it’s worth, the released Pandas.jl is now updated to be fully functional on Julia 0.5 and Julia 0.6.


#24

To expand on something I touched on in my long post above, I’ll make note of the current thinking regarding null representations.

Our idea is to have two concepts of null values: the “data scientist’s null,” which is a scalar that behaves like NaN but for any type, and the “software engineer’s null,” which is a container of 0 or 1 elements. The former is akin to R’s NA (and by extension the current DataArrays NA), and the latter is akin to Rust’s Option (and similar to the Base Nullable type as it stands, but hopefully without any arithmetic defined). The latter would be used for things like tryparse and match, while the former would be used for computations.
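A minimal sketch of the distinction, spelled with what these eventually became in released Julia (`missing` for the scalar null and `nothing`/`Some` for the container null; both names were still in flux at the time this was written):

```julia
# "Data scientist's null": a scalar that propagates through computations,
# like NaN but for any type.
1 + missing                         # missing

# "Software engineer's null": a 0-or-1-element result, as returned by
# tryparse and friends.
tryparse(Int, "abc")                # nothing
something(tryparse(Int, "42"), 0)   # 42 — unwrap with a default
```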

The performance penalty due to the type instability introduced by representing a possibly missing value of type T as Union{T, Null} (as in DataArrays) will be lessened in Julia 0.7/1.0, as extensive optimizations for Unions (for precisely this purpose) are planned. I believe that PRs for those improvements could come at any time now that 0.6 has branched.

As I understand it, the idea is then to ditch DataArrays and NullableArrays entirely, as they will be effectively obsolete, and instead have arrays like Vector{Union{T, Null}}. This will provide unity for tabular data representations, since one needn’t worry about whether something is a DataArray or a NullableArray.
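For concreteness, a sketch of what that looks like with today’s spelling (`Missing` rather than `Null`):

```julia
# A plain Vector with a small Union element type — no special array wrapper.
v = Union{Int, Missing}[1, missing, 3]
eltype(v)            # Union{Int, Missing}
sum(skipmissing(v))  # 4
```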

Anyway, just some :banana: for thought.


#25

This sounds like magic, but the result would be so wonderful that I’m giddy with excitement! Being able to use simple Vectors sounds almost too good to be true. :grin:


#26

Thank you – both for this update and your previous one. Where is the best place (or places) to track conversations about dataframes and null representations as they develop over time?

FWIW, I’m very much in the end-user camp, and am looking forward to the fruits of the great work a lot of folks are doing on dataframes and missing value representation. It’s obvious that a great deal of thinking has gone into the best path forward that takes into account the needs of a wide variety of users.

From a purely selfish perspective, I look forward to the day where I can confidently switch over 90% of my data munging into Julia. What’s stopping me (and I imagine others) right now is just the friction around common data analysis tasks, and the longer-term uncertainty.

As important as the internal implementation of these things is, I still believe a clean, clear, and easy-to-use API for dataframes is the most important thing for getting people to switch from pandas or R to Julia. DataFramesMeta, Query, and StructuredQueries are all great steps in the right direction. Again, I’m speaking of just a subset of users (including myself) and understand there are many others with different needs whom the current work is serving. But making working with data dead simple from an end user’s perspective is, in my mind, the best way to catapult the user community.

Again, many thanks for the work and look forward to continuing to follow the discussion.


#27

The best way to follow progress is to subscribe to DataFrames/DataArrays and DataTables/NullableArrays. You can also subscribe to the issues tagged as nullable in Julia.

The framework hasn’t stabilized yet, but we are starting to move to a common representation of missing values which should hopefully be ready for Julia 1.0.


#28

What’s the news on DataTables vs DataFrames and the many packages moving to Missings.jl?


#29

DataTables is officially deprecated. Packages should depend on the new DataFrames 0.11 instead.
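For anyone migrating, a hedged sketch of what the switch looks like (DataFrames 0.11 adopts the Missings.jl representation, so columns are plain vectors with Union element types):

```julia
using DataFrames

# Was: DataTable(a = NullableArray(...)); now just:
df = DataFrame(a = [1, 2, missing])
eltype(df[:a])  # Union{Int, Missing}
```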


#30

I think DataFrames and Query will finally survive as the mainstream Julia data wrangling frameworks. I remember that when I first started using R, there were quite a few packages in this data area, but they eventually boiled down to the tidyverse and data.table. This applies to the evolution of most programming languages that depend on packages, and the final survivor is usually pretty good to use and worth the wait and uncertainty.

I remember when I used CERN ROOT many years ago: there was no uncertainty at all, but at the same time there were no surprises or improvements, either. :sunglasses:


#31

What about Tables.jl?


#32

See Tables.jl: a table interface for everyone and Tables.jl vs TableTraits.jl (was TextParse.jl is fast again). But please don’t ask unrelated questions in old threads, thanks!


#33

What about it? @Juan, you’re going to need to be much more specific in your questions if you expect folks to donate their time to answering you. Going through lots of old threads and asking very short and general questions gives the impression that you’re not invested in learning the answer — even though I know that’s not true. Clearly you’ve been reading and following lots of the development given your recent activity on other threads.

So I’m going to ask you to spend a bit more time formulating your questions — if you show some investment in the topic at hand then people are going to be much more apt to answer your questions. As it is, I think you’re just annoying lots of folks by pinging/emailing everyone who participated in these long threads from a long time ago.