This is handled in JuliaDB by the keyword flatten = false: if the grouping function returns a table (rather than a NamedTuple), the result is a table of tables, unless the user sets flatten = true (or calls flatten on the resulting table). I understand this change would be quite breaking for DataFrames, since traditionally by in DataFrames encouraged returning a DataFrame and flattened by default, but now that NamedTuples are in Base I think encouraging a NamedTuple return value, and requiring an explicit flatten when one returns a DataFrame, may be better.
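As a rough plain-Julia sketch of the two conventions (the names below are illustrative, not JuliaDB's or DataFrames' API):

using Statistics

# Pretend a grouping step produced this: each key maps to its rows.
groups = Dict(1 => [(x = 1,), (x = 2,)], 2 => [(x = 3,), (x = 4,)])

# Returning a NamedTuple per group gives one row per group; nothing to flatten.
summarized = [(k = k, m = mean(r.x for r in rows)) for (k, rows) in groups]

# Returning a table (here, a vector of rows) per group either keeps the nesting ...
nested = [(k = k, rows = [(x2 = r.x^2,) for r in rows]) for (k, rows) in groups]

# ... or requires an explicit flatten, repeating the key for each produced row.
flattened = [(k = k, x2 = r.x^2) for (k, rows) in groups for r in rows]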
We could use different operations for this. map on a GroupedDataFrame currently returns a GroupedDataFrame, and we have combine to get a DataFrame. This sounds similar to flatten. mutate should probably combine/flatten, so maybe it's enough to have it in the common API. DataFrames can always keep by for convenience.
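To make that division of labor concrete, a hedged DataFrames example (the exact names have moved around across versions; combine is what plays the flatten role here):

using DataFrames, Statistics

df = DataFrame(k = [1, 1, 2, 2], x = [1, 2, 3, 4])
gd = groupby(df, :k)

# combine assembles per-group results back into a single DataFrame,
# whether the function returns a NamedTuple (one row per group) ...
combine(sdf -> (m = mean(sdf.x),), gd)

# ... or the pair syntax is used.
combine(gd, :x => mean => :m)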
Another difference is that JuliaDB has the concept of an ID column, which DataFrames doesn't have.
We should probably open up an issue for each of these in TableOperations.jl: one for getindex, one for setindex (or mutate), one for grouping, etc.
In terms of simplicity, it is very difficult to beat the tidyverse. I really appreciate the effort to unify these functions.
On the other hand, I wish to see something like R's data.table.
To be honest I think Stata beats dplyr in most instances for data manipulation. Whole industries run on Stata and people don't want to switch. One reason Stata has succeeded is its syntax for grouped operations, appending, collapsing, and more. I'll see if I can write something longer clarifying my views on this.
I agree, but I think it's only able to do so because the language is organized around one purpose and a "one open data set at a time" model, so there's no syntax wasted specifying the dataset from which the variable is being called, etc. Not sure you can get that in a world with multiple concurrent datasets or in a general purpose language.
Could you kindly provide the list of industries that use Stata as their main data wrangling tool? To be honest, this is the first time I have heard Stata mentioned in a discussion of data science.
I used Stata and Gretl about 15 years ago. I am not sure how Stata has changed since, but back then it was far from a pleasant experience to use.
In terms of data.table, other than the in-place operations, the indices for fast sorting, joining, and grouping may be useful.
Stata is used almost exclusively in economics (government and academia). Other than people who have been using it since Stata 6 or something like that, I doubt many consider it a great tool for data wrangling / manipulation.
It is extremely prevalent in economics/econometrics. The grouping syntax meshes rather well with corresponding tools for inference (e.g. run these regressions on the grouped dataset), and it handles large datasets well.
The only downside to Stata is that it is difficult to do something that does not fit well into the concept of working with tables. But advanced Stata users are very ingenious in this respect; if you have the chance, watch an experienced Stata user work for a week, it is a real education. This aspect of Stata is not something Julia should replicate though IMO, and neither should we copy the tidyverse verbatim.
I think that a "Julian" solution will require quite a few iterations, experimentation, and design changes, so I would not sweat too much about minor details here, as the interface is likely to be changed a few more times in the medium run.
What I would like to see is a simplification of the API, which allows building up operations from a limited vocabulary. My current playground is FunctionalTables.jl, which is not production ready or stable (but I use it anyway for daily work). I found that I can combine "aggregation" and "mapping by groups" by making functions return one-row tables. E.g. in the example below, each map call does the same thing, and the aggregator function is just shorthand for the first functional form.
julia> ft = FunctionalTable((a = isodd(i), b = i) for i in 1:10)
FunctionalTable of 10 rows, no ordering
a = Bool[true, false, true, false, true, …]
b = Int8[1, 2, 3, 4, 5, …]
julia> map(by(ft, :a)) do _, table
(b = mean(table.b), )
end
FunctionalTable of 2 rows, ordering ↑a
a = Bool[false, true]
b = Float64[6.0, 5.0]
julia> map(by(ft, :a)) do _, table
map(mean, columns(table))
end
FunctionalTable of 2 rows, ordering ↑a
a = Bool[false, true]
b = Float64[6.0, 5.0]
julia> map(aggregator(mean), by(ft, :a))
FunctionalTable of 2 rows, ordering ↑a
a = Bool[false, true]
b = Float64[6.0, 5.0]
The ignored argument _ is the key. I think one can express all n-table join operations by traversing the grouped tables and just calling a function for rows of the same key. But this needs coding up and is WIP.
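As a rough sketch of that idea in plain Julia (joinby is a hypothetical helper, not part of any package): group one side by the key, then call a user function on every pair of rows that share a key.

function joinby(f, left, right, key)
    grouped = Dict{Any, Vector{Any}}()
    for l in left
        push!(get!(grouped, getproperty(l, key), Any[]), l)
    end
    out = Any[]
    for r in right
        for l in get(grouped, getproperty(r, key), Any[])
            push!(out, f(l, r))   # inner join: only keys present on both sides
        end
    end
    return out
end

left  = [(id = 1, a = "x"), (id = 2, a = "y")]
right = [(id = 1, b = 10), (id = 1, b = 20), (id = 3, b = 30)]
joinby(merge, left, right, :id)
# [(id = 1, a = "x", b = 10), (id = 1, a = "x", b = 20)]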
I don't get how FunctionalTables is better than existing packages. Is there any performance gain? Are :a and table.a interchangeable? Can the operations be chained? The syntax seems more verbose.
This illustrates another goal I would like to work towards with this new API, and one of the things I like about Stata as opposed to dplyr: fewer parentheses.
Having everything be a tuple or named tuple is a recipe for lisp-y parentheses confusion.
Standardizing on Vectors of inputs (type stability doesn't matter if they are just symbols referring to column names), together with do-block function syntax where possible and an API that works well with piping, can go a long way towards reducing this.
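For illustration, a hedged sketch of that style with plain symbols, do-blocks, and Base piping (DataFrames is used only as an example backend; the anonymous functions are needed because |> passes a single value):

using DataFrames, Statistics

df = DataFrame(g = [1, 1, 2], x = [1.0, 2.0, 3.0])

df |>
    d -> filter(:x => >(1.0), d) |>
    d -> groupby(d, :g) |>
    gd -> combine(gd, :x => mean => :mean_x)

# and a do-block when the per-group function grows:
combine(groupby(df, :g)) do sdf
    (n = nrow(sdf), mean_x = mean(sdf.x))
end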
I am not yet convinced it is better (it is an experiment); it exploits the fact that tables are immutable for compression and sorting. But I used it only as an illustration of how aggregation and split-apply-combine can be done with the same interface and a small wrapper function.
IMO the tidyverse functionality (in R) relies heavily on some special features of R, mostly non-standard evaluation. Because of this, I don't think it is easy to transplant the tidyverse surface syntax to Julia, so perhaps experimenting with new approaches more suited to Julia is reasonable (I am not saying that this should be done in major packages like DataFrames though). As long as we have supporting interop packages like Tables.jl, it is easy to combine various approaches.
I agree with you on this point. Julia data wrangling tools should look like Julia. When I work with Julia, I need to translate a large amount of MATLAB code, so I code exactly as I did in MATLAB, which is array oriented. When it comes to data cleaning, I use DataFramesMeta, which looks very similar to R and is data-frame oriented. I guess style consistency is a thing for me.
This is a good point. Instead of groupby, groupmap and groupreduce as discussed above, one may consider groupby, map(f, groupby(...)) and reduce(op, groupby(...)), even though I'm still not fully happy with the terminology: map is mapping across groups, i.e. applying f on each group, whereas reduce is reducing within groups, so it's more of a map(t -> reduce(op, t), groupby).
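The distinction in plain Julia, with the grouped data stood in for by a Dict of vectors:

groups = Dict(:a => [1, 2, 3], :b => [4, 5])

# "map" across groups: f sees each group as a whole
Dict(k => length(v) for (k, v) in groups)       # group sizes

# "reduce" within groups, i.e. map(t -> reduce(op, t), groups)
Dict(k => reduce(+, v) for (k, v) in groups)    # group sums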
One example of a universal table interface working in the wild is R's new sf package. sf tables are just tibbles, but with geo-data attached to each observation, and they are wonderful to work with.
I can use all dplyr verbs on them just as I could on a data.frame or a tibble object. The verbs themselves modify the attached table, and another set of st_ methods modify the spatial data attached. I don't have to learn anything new, I don't have to change the way I write code; it Just Works™.
I don't know the particulars of how class inheritance works in R, but it's easy to imagine how a universal table API could make the introduction of new structs holding custom data very easy.
Another note about my experience working with these objects: piping is a very useful style to adhere to, because it forces the user to write methods with the same type of input as output if they want to use their own functions in a line of pipes along with mutate and filter.
Having all the table operations take in and return the same type would be a great goal to have.
Actually, this is already possible, and I'm working on basic geometry types for geology etc. that conform to the table interface:
Only tests, no documentation for now
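As a minimal sketch of what conforming to the table interface can look like, here is a hypothetical wrapper type (GeoTable is made up for illustration, not taken from that package) that forwards the Tables.jl interface to the table it wraps, so generic consumers accept it unchanged:

using Tables

struct GeoTable{T, M}
    table::T   # any Tables.jl-compatible table
    meta::M    # extra data kept alongside, e.g. per-observation geometry
end

# Forward the Tables.jl interface to the wrapped table.
Tables.istable(::Type{<:GeoTable}) = true
Tables.rowaccess(::Type{<:GeoTable}) = true
Tables.rows(g::GeoTable) = Tables.rows(g.table)
Tables.columnaccess(::Type{<:GeoTable}) = true
Tables.columns(g::GeoTable) = Tables.columns(g.table)

gt = GeoTable((a = [1, 2], b = ["x", "y"]), nothing)
Tables.columntable(gt)   # (a = [1, 2], b = ["x", "y"])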
I love piping syntax, but would it work like streams rather than generating tons of intermediate data? It would be awesome with some kind of "pipe-fusion" feature.
To some extent all of these goals are things I tried to address in LightQuery.jl
And because the package is so young, feel free to suggest syntax/API changes and I can incorporate them.
@sdanisch's example nicely highlights that the core design of Query.jl is not about tables, but is all about iterators of values. It just happens that it also works well for tables. I think that kind of monadic design, which I simply copied from LINQ (just really want to make clear that this was not my idea), is one of these super simple but powerful ideas that make for a super composable design that extends well beyond the tabular world. There is a ton of academic research work out of Microsoft Research and a bunch of really nice Channel9 interviews on this stuff, for folks that are interested.
Everything is fused into streams in Query.jl; something like source |> @filter(_.a==3) |> @mutate(b=3*_.x) |> DataFrame has no intermediate temporaries.
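For comparison, Base Julia's lazy iterators behave the same way (a rough analogy only, not how Query.jl is implemented): each stage wraps the previous iterator and rows flow through one at a time, so nothing is materialized until a sink collects the stream.

src = ((a = i, x = 2i) for i in 1:10_000)
filtered = Iterators.filter(r -> r.a % 3 == 0, src)
mutated  = (merge(r, (b = 3 * r.x,)) for r in filtered)
collect(mutated)   # only here is any collection actually built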