DataTables or DataFrames?

Agreed. Coming from R and a non-CS background, I find it painful to work with data in Julia. At work, I’m a heavy user of R’s dplyr, and though we have DataFramesMeta.jl and Query.jl here, things become complicated when dealing with missing values. In Julia’s Base we have Nullable and in DataFrames.jl we have NA; if I use functions from Base that have no methods for handling NA, that’s a problem. I’m used to working with a single type in R, NA, that works seamlessly with every function and operation. I’m hoping that we can have a single type for missing values, either Nullable or NA. I do like Nullable, by the way; it always reminds me of R when I work with NAs.

I second the motion for an Anthoff-verse, though I’m still learning Query.jl. :wink:


Part of the point I was making was that it is much easier to work with DataTables and DataFrames even when some of the core features are under active development than it would be to do something analogous with something like pandas. However, since pandas is pretty mature (despite the fact that they refuse to do a 1.0 release), it isn’t really relevant.

Otherwise I agree with @ChrisRackauckas: it depends on what you are doing. I have found that getting data from tabular formats into “linear algebra” formats amenable to machine learning, optimization, or whatever else, in a consistent, uniform way that doesn’t have to be tailored to each individual application, is a surprisingly onerous task. I sometimes find pandas surprisingly ill-suited for it, in no small part because the opacity of the package and the level of specialized knowledge required to work with it effectively are obstacles. Granted, most of what I am saying is more a Python vs Julia thing, not a pandas vs DataTables thing. There’s no question that we’ll all be much better off when the Julia “data science ecosystem” is more mature.

For what it’s worth, the released Pandas.jl is now updated to be fully functional on Julia 0.5 and Julia 0.6.


To expand on something I touched on in my long post above, I’ll make note of the current thinking regarding null representations.

Our idea is to have two concepts of null values: the “data scientist’s null,” which is a scalar that behaves like NaN but for any type, and the “software engineer’s null,” which is a container of 0 or 1 elements. The former is akin to R’s NA (and by extension the current DataArrays NA), and the latter is akin to Rust’s Option (and similar to the Base Nullable type as it stands, but hopefully without any arithmetic defined). The latter would be used for things like tryparse and match, while the former would be used for computations.
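To make the distinction concrete, here is a minimal sketch. It assumes Julia 1.0 or later, where the scalar null described above ended up being spelled `missing` (rather than `Null`, the name used in this thread) and where `tryparse` signals failure by returning `nothing` instead of a 0-or-1-element container. It is an illustration of the two behaviours, not the design as it stood when this was posted.

```julia
# Assumption: Julia >= 1.0, where the "data scientist's null" is `missing`
# and failed parses return `nothing`.

# "Data scientist's null": a scalar that propagates through computation,
# much like NaN does for floating point, but for any type.
x = 1 + missing
ismissing(x)                      # true

# "Software engineer's null": absence has to be checked explicitly
# before the value can be used; no arithmetic is defined on it.
parsed = tryparse(Int, "abc")     # parsing fails, so this is `nothing`
value = parsed === nothing ? 0 : parsed
```

The first kind silently flows through a calculation, which is convenient for statistics; the second forces the caller to decide what to do when there is no result.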

The performance penalty due to the type instability introduced by representing a possibly missing value of type T as Union{T, Null} (as in DataArrays) will be lessened in Julia 0.7/1.0, as extensive optimizations for Unions (for precisely this purpose) are planned. I believe that PRs for those improvements could come at any time now that 0.6 has branched.

As I understand it, the idea is then to ditch DataArrays and NullableArrays entirely, as they will be effectively obsolete, and instead have arrays like Vector{<:Union{T, Null}}. This will provide unity for tabular data representations, since one needn’t worry about whether something is a DataArray or a NullableArray.
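As a rough sketch of what “just a Vector” looks like in practice, the snippet below assumes Julia 1.0 or later, where the `Null` type mentioned here is named `Missing`. The variable names are made up for illustration.

```julia
# Assumption: Julia >= 1.0, with `Missing` playing the role of `Null`.

# An ordinary Vector whose element type allows missing values.
v = Union{Int, Missing}[1, missing, 3]

v isa Vector{<:Union{Int, Missing}}   # true: no special array type needed

# Reductions propagate missingness unless it is skipped explicitly.
total_all = sum(v)                    # missing
total     = sum(skipmissing(v))       # 4

# Replace missing entries with a default value.
filled = coalesce.(v, 0)              # [1, 0, 3]
```

Because `v` is a plain `Vector`, any generic array code accepts it; only the element type records that values may be absent.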

Anyway, just some :banana: for thought.


This sounds like magic, but the result would be so wonderful that I’m giddy with excitement! Being able to use simple Vectors sounds almost too good to be true. :grin:


Thank you – both for this update and your previous one. Where is the best place (or places) to track conversations about dataframes and null representations as they develop over time?

FWIW, I’m very much in the end-user camp, and am looking forward to the fruits of the great work a lot of folks are doing on dataframes and missing value representation. It’s obvious that a great deal of thinking has gone into the best path forward that takes into account the needs of a wide variety of users.

From a purely selfish perspective, I look forward to the day when I can confidently switch over 90% of my data munging to Julia. What’s stopping me (and I imagine others) right now is just the friction around common data analysis tasks, and the longer-term uncertainty.

As important as the internal implementation of these things is, I still believe a clean, clear and easy-to-use API for dataframes is the most important thing for getting people to switch from Pandas or R to Julia. DataFramesMeta, Query, and StructuredQueries are all great steps in the right direction. Again, I’m speaking for just a subset of users (including myself) and understand there are many others with different needs that the current work is addressing. But making working with data dead simple from an end-user’s perspective is, in my mind, the best way to catapult the user community.

Again, many thanks for the work. I look forward to continuing to follow the discussion.


The best way to follow progress is to subscribe to DataFrames/DataArrays and DataTables/NullableArrays. You can also subscribe to the issues tagged as nullable in Julia.

The framework hasn’t stabilized yet, but we are starting to move to a common representation of missing values which should hopefully be ready for Julia 1.0.


What is the news on DataTables vs DataFrames and the many packages moving to Missings.jl?

DataTables is officially deprecated. Packages should depend on the new DataFrames 0.11 instead.


I think DataFrames and Query will end up surviving as the mainstream Julia data wrangling framework. I remember that when I first started using R, there were quite a few packages in this area, but eventually they boiled down to the tidyverse and data.table. The same happens in most programming languages that depend on packages, and the final survivor will be pretty good to use and worth the wait and uncertainty.

I remember that when I used CERN ROOT many years ago, there was no uncertainty at all, but there was no surprise or improvement either. :sunglasses:


What about Tables.jl?

See Tables.jl: a table interface for everyone and Tables.jl vs TableTraits.jl (was TextParse.jl is fast again). But please don’t ask unrelated questions in old threads, thanks!


What about it? @Juan, you’re going to need to be much more specific in your questions if you expect folks to donate their time to answering you. Going through lots of old threads and asking very short and general questions gives the impression that you’re not invested in learning the answer — even though I know that’s not true. Clearly you’ve been reading and following lots of the development given your recent activity on other threads.

So I’m going to ask you to spend a bit more time formulating your questions — if you show some investment in the topic at hand then people are going to be much more apt to answer your questions. As it is, I think you’re just annoying lots of folks by pinging/emailing everyone who participated in these long threads from a long time ago.
