Common API for tabular data backends

This is handled in JuliaDB by the keyword flatten = false: meaning if the grouping function returns a table (and not a namedtuple) the result will be a table of tables, unless the user sets flatten = true (or calls flatten on the resulting table). I understand this is changing would be quite breaking for DataFrames as traditionally by in DataFrames encouraged returning a DataFrame and flattened by default, but with NamedTuples in base I think encouraging to return a NamedTuple and requiring an explicit flatten if one returns a DataFrame may be better.

We could use different operations for this. map on a GroupedDataFrame currently returns a GroupedDataFrame, and we have combine to get a DataFrame. This sounds similar to flatten. mutate should probably combine/flatten, so maybe itā€™s enough to have it in the common API. DataFrames can always keep by for convenience.

Another difference is that JuliaDB has the concept of an ID column, which DataFrames doesnā€™t have.

We should probably open up an issue for each of these in TableOperations.jl. One for getindex, one for setindex (or mutate), one for grouping etc.

In terms of simplicity, it is very difficult to beat tidyverse. I really appreciate the efforts of unifying these functions.

On the other hand, I wish to see something like Rā€™s data.table.

To be honest I think Stata beats dplyr in most instances for data manipulation. Whole industries run on stata and people donā€™t want to switch. One reason Stata has succeeded is itā€™s syntax for grouped operations, appending, collapsing, and more. Iā€™ll see if I can write something longer clarifying my views on this.

3 Likes

I agree, but I think itā€™s only able to because the language is organized around one purpose and a ā€œone open data set at a timeā€ model so thereā€™s no syntax wasted specifying the dataset from which the variable is being called, etc. Not sure you can get that in a world with multiple concurrent datasets or in a general purpose language.

1 Like

Could you kindly provide the list of industries that use stata as their main data wrangling tool? To be honest, this is the first time for me to hear Stata be in the discussion of data science.

I used Stata and Gretl like 15 years ago, I am not sure how Stata has changed, but back then, it was far from a pleasant experience to use

1 Like

In terms of data.table, other than the inplace operations, the indices for fast sorting, joining, grouping may be useful.

Stata is almost exclusive to economics (govt and academia). Other than people that have been using it since Stata 6 or something like that, I doubt many consider it a great tool for data wrangling / manipulation.

It is extremely prevalent in economics/econometrics. The grouping syntax meshes rather well with corresponding tools for inference (eg run these regressions on the grouped dataset), and it handles large datasets well.

The only downside to Stata is that it is difficult to do something that does not fit well into the concept of working with tables. But advanced Stata users are very ingenious in this respect ā€” if you have the chance, watch an experienced Stata user work for a week, it is a real education. This aspect of Stata is not something Julia should replicate though IMO, and neither should we copy the tidyverse verbatim.

I think that a ā€œJulianā€ solution will require quite a few iterations, experimentation, and design changes, so I would not sweat too much about minor details here as the interface is likely to be changed a few more times in the medium run.

What I would like to see is a simplification of the API, which allows building up operations from a limited vocabulary. My current playground is FunctionalTables.jl, which is not production read or stable (but I use it anyway for daily work). I found that I can combine ā€œaggregationā€ and ā€œmapping by groupsā€ by making functions return one-line tables. Eg in the example below, each map call does the same thing, and the aggregator function is just the shorthand for the first functional form.

julia> ft = FunctionalTable((a = isodd(i), b = i) for i in 1:10)
FunctionalTable of 10 rows, no ordering
    a = Bool[true, false, true, false, true, ā€¦]
    b = Int8[1, 2, 3, 4, 5, ā€¦]

julia> map(by(ft, :a)) do _, table
       (b = mean(table.b), )
       end
map(by(ft, :a)) do _, table
FunctionalTable of 2 rows, ordering ā†‘a
    a = Bool[false, true]
    b = Float64[6.0, 5.0]

julia> map(by(ft, :a)) do _, table
       map(mean, columns(table))
       end
FunctionalTable of 2 rows, ordering ā†‘a
    a = Bool[false, true]
    b = Float64[6.0, 5.0]

julia> map(aggregator(mean), by(ft, :a))
FunctionalTable of 2 rows, ordering ā†‘a
    a = Bool[false, true]
    b = Float64[6.0, 5.0]

The ignored argument _ is the key. I think one can express all n-table join operations by traversing the grouped tables, and just calling a function for rows of the same key. But this needs coding up and is WIP.

1 Like

I donā€™t get how functionaltable is better than existing packages, is there any performance gain? Is :a and table.a exchangeable? Can the operations be chained? The syntax seems more verbose.

This illustrates another goal I would like to work towards with this new API, and one of the things I like about Stata as opposed to dplyrā€¦ fewer parentheses.

Having everything be a tuple or named tuple is a recipe for lisp-y parentheses confusion.

Standardizing to have Vectors of inputs (type stability doesnā€™t matter if they are just symbols referring to column names), along with do function syntax where possible, along with an API that works well with piping, can go a long way to reducing this.

I am not yet convinced it is better (it is an experiment) ā€” it exploits the fact that tables are immutable for compression and sorting. But I used it only as an illustration for how aggregation and split-apply-combine can be done with the same interface and a small wrapper function.

IMO the tidyverse functionality (in R) relies heavily on some special features of R, mostly non-standard evaluation. Because of this, I donā€™t think it is easy to transplant the tidyverse surface syntax to Julia, so perhaps experimenting with new approaches more suited tio Julia is reasonable (I am not saying that this shoud be done in major packages like DataFrames though). As long as we have support interop packages like Tables.jl, it is easy to combine various approaches.

2 Likes

I agree with you on this point. Julia data wrangling tools should look like Julia. When I work with Julia, I need to translate large amount of MATLAB code, so I code exactly as I did in MATLAB which is array oriented. When it comes to data cleaning, I use dataframesmeta, which looks very similar with r and is data frame oriented. I guess style consistency is a thing for me.

This is a good point, instead of groupby, groupmap and groupreduce as discussed above, one may consider groupby, map(f, groupby(...)) and reduce(op, groupby(...)), even though Iā€™m still not fully happy with the terminology as map is mapping across groups, i.e. apply f on each group, whereas reduce is reducing withing groups, so itā€™s more of a map(t -> reduce(op, t), groupby).

4 Likes

One example of a universal table interface working in the wild is with Rā€™s new sf package. sf tables just tibbles, but with geo-data attached to each observation, and they are wonderful to work with.

I can use all dplyr verbs on them just as I could a data.frame or a tibble object. The verbs themselves modify the attached table and another set of st_ methods modify the spatial data attached. I donā€™t have to learn anything new, I donā€™t have to change the way I write code, it \text{Just Works}^{\text{tm}}

I donā€™t know the particulars of how class inheritance works in R, but itā€™s easy to imagine how a universal table API could make the introduction of new structs holding custom data very easy.

Another note about my experience working with these objects. Piping is a very useful style to adhere to because it forces the user to write methods with the same type of input as output if they want to use their own functions in a line of pipes along with mutate and filter.

Having all the table operations take in and return the same type would be a great goal to have.

Actually, this is already possible, and Iā€™m working on basic geometry types for geology etc that conform the table interface:

Only tests, no documentation for now :wink:

3 Likes

I love piping syntax but would it work like streams rather than generating tons of intermediate data? it would be awesome with some kind of ā€œpipe-fusionā€ feature.

1 Like

To some extent all of these goals are things I tried to address in LightQuery.jl

And because the package is so young, feel free to suggest syntax/API changes and I can incorporate them.

@sdanischā€™s example nicely highlights that the core design of Query.jl is not about tables, but is all about iterators of values. It just happens that it also works well for tables. I think that kind of monadic design, that I simply copied from LINQ (just really want to make clear that this was not my idea), is one of these super simple but powerful ideas that make for a super composable design that extends well beyond the tabular world. There is a ton of academic research work out of Microsoft Research and a bunch of really nice Channel9 interviews on this stuff, for folks that are interested.

Everything is fused into streams in Query.jl, something like source |> @filter(_.a==3) |> @mutate(b=3*_.x) |> DataFrame has no intermediate temporaries.

6 Likes