Common API for tabular data backends


#1

Following the discussion on Slack I would like to open a topic here to discuss the possibility to provide a common API for different packages/ecosystems providing tabular data handling.

The major rationale is that this kind of common API will make users’ life easier on two levels:

  1. you will not have to qualify method name with package name in case two packages define the same function name (we have this for functions from Base, but some common operations like select or transform/mutate do not have functions defined in base).
  2. on a higher level the functions could have the same API (this is harder as not only the list of functions has to be agreed on but also their API)

Given the discussion we already had there is an additional layer of lazy vs eager evaluation of tabular data transformations - which I think also could be discussed here (both approaches have their uses).

Probably the first step could be to try to pin the list of the common functions that could be provided.


#2

When figuring out which features to implement, it’s helpful to start with an existing package. DataFramesMeta exports the following:

  • @where: filter rows
  • @transform: add a new column (setindex)
  • @orderby: sorting columns
  • @select: selecting columns (kind of getindex)
    • This always returned a dataframe, but theoretically this package should allow for getindex to return: (1) the same type of table (2) an array or (3) a row
  • groupby: group a dataset based on a key
  • @by: perform an operation across different groups of a dataset

This seems like a superset the operations a generic table operations package would want to implement.

Note that I am just using DataFramesMeta as a way to kick-off the conversation. Additionally, DataFrames uses macros for non-standard evaluation of arguments, which is something I would very much like to avoid with a generic table package. Ideally it should be as easy to perform table manipulations inside a function as outside a function.


#3

So my list of basic generic functions is the following (I am omitting ! versions that could possibly be considered):

  • Core
    • subset columns: select
    • add columns: mutate
    • compute new columns, dropping old: transform
    • subset rows: filter, first, last (all in Base)
    • sorting: sort (in Base)
  • Optional
    • remove duplicates: unique (in Base)
    • long2wide: stack
    • wide2long: unstack
    • joins: join
    • bind rows: vcat (in Base)
    • bind columns: hcat (in Base)
    • grouped transformation: by

Functions in Base are less problematic IMO.


#4

I favor this tabular base API approach… From the very best (tidyverse),

  • mutate() adds new variables that are functions of existing variables
  • select() picks variables based on their names.
  • filter() picks cases based on their values.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.
  • gather()/spread() reshape data
  • group_by() split/apply
  • join() joins of data sets
  • cat() for vertical (vcat) and horizontal (hcat) concatenation
  • unique() distinct rows

#6

Do we want all table types to implement, say, Base.vcat in the same way? With the same API? Or do we want a lightweight Table wrapper and implement Base.vcat on the Table object?


#7

Here are my opinions:

The same way - probably it is not. The same API - ideally yes, but it might not be possible to agree on, so the minimum is the same name and the same operation “in general”.

This is problematic unfortunately. For functions defined in Base I think that Table wrapper should not be used at all (as it would have to be a catch-all type approach). For custom functions probably Table wrapper could implement a default functionality that could be overwritten by the packages - e.g. for performance.


#8

I think it is better to stick to an API based on functions, not macros, and build syntactic-sugar-coated macro DSLs on top of that. With a common API like @bkamins describes, various DSL approaches could just use the same common functions as a backend.


#9

While I really don’t have anything specific to add in terms of the API, I just want to say that these kind of efforts to unify and standardize are really welcome. Thank you!


#10

Thanks @bkamins for your effort on API polish and unification!

I completely agree that the operations which are basically Base operations on a vector of rows (sort, sortperm, filter, first, last, unique,…) are probably the easiest part API wise as we should just stick to Base.

Concerning your propose API, I like it a lot but I have a couple of remarks.

  • I find the mutate versus transform difference confusing. To a non-native speaker it may not be clear what is the difference in meaning between the two words and it may be difficult to remember which is which. Do we really need two functions or could we just have one (transform) with the rule that if a column name does not exist, the column gets added, otherwise it gets modified? This is similar to the behavior of setindex! on a dictionary. If we want two separate functions I would prefer more distinct names (like addcols vs changecols or something like that, JuliaDB has pushcol and setcol).

  • by may need a bit of discussion. So far DataFrames uses by and JuliaDB uses groupby (instead of DataFrames by), the DataFrames.groupby(df, col) corresponds to JuliaDB.groupby(identity, df, col). On top of that, JuliaDB distinguishes the case where the summary function takes the whole subtable (i.e. groupby(mean, ...)) or when the summary function “reduces” an iterable of rows using a online statistics or binary function (i.e. groupreduce(+, ...)).

For the latter remark (grouping API), maybe something like the following may be clearer:

  • group (or groupby) simply groups the data without applying any transformation
  • groupapply groups the data and applies a function
  • groupreduce groups the data and reduces it using a binary function or online statistics from OnlineStats

2 is definitely harder but I think a good way forward could be a hybrid way were there is a “cross-packages” minimal API that is granted to always work (and may even have an implemented fallback in a common package) that package authors could use to write code that works with any table (I imagine a TablesMeta package to unify DataFramesMeta and JuliaDBMeta implemented like this). Then each package could have extra methods to make things easier for the user or for backward compatibility.

From what I understand the rule in Base distinguishes filter from Iterators.filter to decide eager versus lazy. Maybe we could do the same and have Tables.select versus Tables.Iterators.select (or even to different packages, say TableOperations for eager and TableQueries for lazy) but the API would be the same.


#11

We should also nail down what df[1] returns for all tables. Ideally this is just a row which is either a named tuple or a view that looks like a named tuple.


#12

Yes - such consensus would help us make a bold move in DataFrames.jl which we debate over for the last 3 months :smile:.

This decision should also cover df[collection_of_indices] and df[collection_of_Bool] cases.


#13

I am not a native speaker, but for me the distinction is (maybe better names should be chosen):

  • mutate: at start retains all columns from the source and adds/modifies columns following the specification
  • transform: starts with an empty table and only adds what was specified

EDIT we could have keep kwarg to differentiate these behavior a la SAS.


#14

Another thing that could help resolve this debate is to have selecting a single column totally separate from getindex. The distinction between df[col] and df[[col]] is pretty subtle. Having a getcol function similar to what @piever diescribed that always returned a vector would be nice.


#15

This is my understanding. df[col] would then be gone, as it would select a row not a column, The same with df[cols]. We would have df.col which would be clear what it does and getcol for cases when col is dynamically generated or a Symbol that is not accepted by the compiler in getproperty syntax. And to get a table we would use select.


#16

Ah, my bad, I had misunderstood the intended behavior. Your mutate would take a function from namedtuples to namedtuples and add the resulting variables as new columns (or modify if columns exist already), say:

# mutate is the same as @transform in JuliaDBMeta (which is what confused me)
mutate(row -> (; Ratio = row.SepalLength / row.SepalWidth), iris) # result has columns of iris + Ratio
transform(row -> (; Ratio = row.SepalLength / row.SepalWidth), iris) # result has one column

If my understanding is correct, should we use map instead of transform then? mutate would simply be hcat(df, map(f, df)) (curious whether we need a mergecols to specify that if the same name reappears the rightmost occurrence of the column prevails).


#17

I understand this is a big change (and probably extremely breaking) but I think it is a nice step towards unifying DataFrames API with StructArrays, TypedTables and IndexedTables.


#18

Good point. map is a natural function here.


#19

Actually I am in favor of this change. The only thing is risking that users’ code will massively break. But I think that if we have a global agreement on this behavior across table-related packages then it would be less painful.


#20

Agreed! Julia’s very young, tabular dataset packages doubly so. Let’s get this right…


#21

I agree we shouldn’t have both mutate and transform. dplyr uses mutate to keep existing columns, and transmute to drop them (transform was already used by base R for a similar behavior to mutate). It would indeed make sense to use map for the function only returns new columns; we could also allow select to do that (like in JuliaDB and similar to SQL).

+1 for the general scheme. But I’d prefer groupby to group and groupmap to groupapply (though map(f, group(...)) isn’t too bad at that point).

Also we have found in DataFrames that two kind of groupapply/groupmap operations can make sense, depending on whether you return a GroupedDataFrame (which can make sense when there are several rows per group) or a plain DataFrame. dplyr supports summarize for this: it is a special case where the function must return a single value/row, but it also drops the grouping information since there is only one row per group. Maybe low priority, but it could eventually be useful.

A subtlety of mutate and transmute in dplyr when applied to a grouped table is that they recycle the values if needed so that the number of rows is the same as the original. That’s logical for mutate since columns that are kept need to be filled with something. But that’s less natural for groupapply/groupmap.

Interesting idea. Let’s hear what @quinnj thinks.

I’m also OK with dropping table[col] from DataFrames if other tables want to use it to access rows (we might do it anyway). But table[row, :] is also available and is more explicit (especially for users coming from R).

It could be useful to have a GitHub repo to file separate issues for each API domain, as the discussion is going to become quite complex very quickly.