[WIP] Announcing Volcanito.jl: a backend-agnostic interface for tabular data

> df <- data.frame(x = c(1, 5, 6))
> mutate(df, deviation = x - mean(x))
  x deviation
1 1        -3
2 5         1
3 6         2

Agreed that it’s useful to have some sugar for doing this beyond the explicit aggregation and joining. That’s obviously what inspired SQL to add WINDOW as a syntactic feature :)
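To make "explicit aggregation and joining" concrete, here is a rough plain-Python sketch of what `mutate(df, deviation = x - mean(x))` abbreviates. Rows are plain dicts, and the constant `key` column plus the helper names `aggregate_mean` and `join_and_subtract` are assumptions made up for this illustration; none of this is Volcanito.jl or dplyr API.

```python
# Rows as dicts; a constant join key stands in for "the whole table is one group".
rows = [{"key": 0, "x": 1}, {"key": 0, "x": 5}, {"key": 0, "x": 6}]

def aggregate_mean(rows):
    """Step 1: aggregate to one value per key, holding mean(x)."""
    totals = {}
    for r in rows:
        s, n = totals.get(r["key"], (0, 0))
        totals[r["key"]] = (s + r["x"], n + 1)
    return {k: s / n for k, (s, n) in totals.items()}

def join_and_subtract(rows, means):
    """Step 2: join the aggregate back on the key, then do a purely row-wise subtraction."""
    return [{"x": r["x"], "deviation": r["x"] - means[r["key"]]} for r in rows]

means = aggregate_mean(rows)
result = join_and_subtract(rows, means)
print(result)
```

Written this way, the second step only ever looks at one row at a time, which is the property the sugar hides.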

What makes me so averse to x - mean(x) is that it makes it impossible to do something like the following with any automatic parallelism:

df <- data.frame(x = c(1, 1, 2, 3))
mutate(df, foo = x + length(unique(x)))

The problem is that a composed function like length(unique(x)) isn’t tractable to automatic parallelization. If the system always has to assume that non-parallelized functions might be called, it can either (a) require users to ensure that every function they use explicitly describes its parallelization strategy (which is how distributed DBs like Presto work when users add new aggregation functions) or (b) never provide automatic parallelization. I think the latter approach is a pretty bad long-term bet for the foreseeable future of computing hardware.
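As a rough illustration of option (a), here is a Python sketch in which each aggregate declares its own strategy as three pieces: a per-chunk partial state, a merge of two states, and a finalizer. The class names, the `partial`/`merge`/`finalize` protocol, and `parallel_aggregate` are all hypothetical and not taken from Presto or Volcanito.jl. The thread pool only stands in for whatever scheduler a real system would use.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

class Mean:
    """mean(x) with an explicit strategy: the partial state is (sum, count)."""
    @staticmethod
    def partial(chunk):
        return (sum(chunk), len(chunk))
    @staticmethod
    def merge(a, b):
        return (a[0] + b[0], a[1] + b[1])
    @staticmethod
    def finalize(state):
        return state[0] / state[1]

class CountDistinct:
    """length(unique(x)) as one declared aggregate: the partial state is a set."""
    @staticmethod
    def partial(chunk):
        return set(chunk)
    @staticmethod
    def merge(a, b):
        return a | b
    @staticmethod
    def finalize(state):
        return len(state)

def parallel_aggregate(agg, x, n_chunks=2):
    """Split x into chunks, compute partial states concurrently, then merge.
    Assumes x is non-empty."""
    size = max(1, -(-len(x) // n_chunks))  # ceiling division
    chunks = [x[i:i + size] for i in range(0, len(x), size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(agg.partial, chunks))
    return agg.finalize(reduce(agg.merge, partials))

x = [1, 1, 2, 3]
n_unique = parallel_aggregate(CountDistinct, x)
foo = [v + n_unique for v in x]  # the row-wise part needs no coordination
print(n_unique, foo)
```

The system can split the work here only because `CountDistinct` is a single declared unit. Handed an opaque composition like `length(unique(x))`, it has no way to discover the set-union merge on its own.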

As a side question, do you think it is possible to define these operations taking into account metadata? Say I have additional information like spatial coordinates, timestamps, etc. for each row of the table, and this information lives outside the table object. How can we apply a @groupby, for example, and retain this metadata in the results? It would be nice if these efforts could handle these use cases.

The simplest approach would be to operate on the indices of the rows and introduce intermediate functions like @groupby_inds that developers could leverage to extract the indices of both the table and the metadata.
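A minimal Python sketch of that index-based idea follows. The function `groupby_inds` here is a hypothetical analogue of the proposed `@groupby_inds`, and the column names and data are invented for illustration. The only assumption is that the table and the external metadata are aligned by row position.

```python
from collections import defaultdict

def groupby_inds(keys):
    """Map each distinct key to the list of row indices where it occurs."""
    groups = defaultdict(list)
    for i, k in enumerate(keys):
        groups[k].append(i)
    return dict(groups)

# Table columns and metadata held in separate objects, aligned by row position.
site = ["a", "b", "a", "b"]
value = [1.0, 2.0, 3.0, 4.0]
coords = [(0, 0), (1, 0), (0, 1), (1, 1)]  # lives outside the table

inds = groupby_inds(site)

# The same indices drive both the aggregation and the metadata lookup.
group_means = {k: sum(value[i] for i in idx) / len(idx) for k, idx in inds.items()}
group_coords = {k: [coords[i] for i in idx] for k, idx in inds.items()}
print(inds, group_means, group_coords)
```

Because the grouping is expressed purely as indices, the metadata never has to be copied into the table to survive the operation.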

There’s Octo.jl (GitHub: wookay/Octo.jl), an SQL query DSL in Julia.

As a side question, do you think it is possible to define these operations taking into account metadata? Say I have additional information like spatial coordinates, timestamps, etc. for each row of the table, and this information lives outside the table object. How can we apply a @groupby, for example, and retain this metadata in the results? It would be nice if these efforts could handle these use cases.

I may not be understanding, but this seems like it either (a) should fall out naturally from normal SQL-style operations or (b) is a bit of a niche use case. Hard to imagine I’ll ever have enough spare time to get to (b).

The amazing thing is that this is not a _str macro!
