[WIP] Announcing Volcanito.jl: a backend-agnostic interface for tabular data

johnmyleswhite · August 30, 2020, 1:21am

> df <- data.frame(x = c(1, 5, 6))
> mutate(df, deviation = x - mean(x))
  x deviation
1 1        -3
2 5         1
3 6         2

Agreed that it’s useful to have some sugar for doing this beyond the explicit aggregation and joining. That’s obviously what inspired SQL to add WINDOW as a syntactic feature.
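For readers who want the "explicit aggregation and joining" spelled out, here is a minimal sketch in plain Python (chosen only for illustration; it is not Volcanito.jl or dplyr code). It reuses the x = 1, 5, 6 data from the example above; the names `rows`, `agg`, and `result` are made up for this sketch.

```python
# Data from the example above: a single column x.
rows = [{"x": 1}, {"x": 5}, {"x": 6}]

# Step 1: explicit aggregation. Reduce the whole table to a one-row summary.
agg = {"mean_x": sum(r["x"] for r in rows) / len(rows)}

# Step 2: join the one-row summary back onto every input row (a cross join),
# then evaluate the purely row-wise expression x - mean_x.
result = [{"x": r["x"], "deviation": r["x"] - agg["mean_x"]} for r in rows]

for r in result:
    print(r["x"], r["deviation"])
```

Written this way, the aggregate and the row-wise arithmetic are separate steps, which is roughly what a window-style shorthand would expand to.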

What makes me so averse to x - mean(x) is that it makes it impossible to do something like the following with any automatic parallelism:

df <- data.frame(x = c(1, 1, 2, 3))
mutate(df, foo = x + length(unique(x)))

The problem is that a composed function like length(unique(x)) isn’t tractable to automatic parallelization. If the system always has to assume that non-parallelized functions might be called, it can either (a) require users to ensure that every function they use explicitly describes its parallelization strategy (which is how distributed databases like Presto work when users add new aggregation functions) or (b) never provide automatic parallelization. I think the latter approach is a pretty bad long-term bet for the foreseeable future of computing hardware.
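To make option (a) concrete, here is a rough sketch in plain Python of what "describing a parallelization strategy" could look like: each aggregate supplies a partial step, a merge step, and a final step. This is an illustration under assumed names (`chunks`, `parallel_aggregate`, and the `*_partial` / `*_merge` / `*_final` helpers are all hypothetical), not how Presto or Volcanito.jl actually define aggregates.

```python
from functools import reduce

def chunks(xs, n):
    """Split xs into at most n contiguous chunks (stand-ins for separate workers)."""
    size = max(1, -(-len(xs) // n))  # ceiling division, at least 1
    return [xs[i:i + size] for i in range(0, len(xs), size)]

# mean: the partial state is (sum, count); merging adds component-wise.
def mean_partial(chunk):
    return (sum(chunk), len(chunk))

def mean_merge(a, b):
    return (a[0] + b[0], a[1] + b[1])

def mean_final(state):
    return state[0] / state[1]

# distinct count: the partial state is the set of values seen; merging is set union.
def distinct_partial(chunk):
    return set(chunk)

def distinct_merge(a, b):
    return a | b

def distinct_final(state):
    return len(state)

def parallel_aggregate(xs, partial, merge, final, n_chunks):
    # Each partial() call only sees its own chunk, so the calls could run on separate workers.
    states = [partial(c) for c in chunks(xs, n_chunks)]
    return final(reduce(merge, states))

x = [1, 1, 2, 3]

# With an explicit strategy, the chunked result matches the sequential one.
n_distinct = parallel_aggregate(x, distinct_partial, distinct_merge, distinct_final, n_chunks=4)
foo = [xi + n_distinct for xi in x]

# Without one, a system that blindly runs length(unique(.)) per chunk and sums
# the per-chunk answers gets the wrong number.
naive = sum(len(set(c)) for c in chunks(x, 4))

mean_x = parallel_aggregate([1, 5, 6], mean_partial, mean_merge, mean_final, n_chunks=2)

print(n_distinct, foo, naive, mean_x)
```

The point of the sketch is that the system cannot derive the set-union merge from the opaque composition length(unique(x)); someone has to state it.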

juliohm · August 30, 2020, 1:31pm

As a side question, do you think it is possible to define these operations taking into account metadata? Say I have additional information like spatial coordinates, timestamps, etc. for each row of the table, and that this information lives outside the table object. How can we apply a @groupby, for example, and retain these metadata in the results? It would be nice if these efforts could handle these use cases.

The simplest approach would be to operate on the indices of the rows and introduce intermediate functions like @groupby_inds that developers could leverage to extract the indices of both the table and the metadata.
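A minimal sketch of the index-based idea, written in plain Python for illustration only: `groupby_indices` is a hypothetical stand-in for something like @groupby_inds (which does not exist as far as this thread shows), and the `table` and `coords` values are made-up example data.

```python
from collections import defaultdict

def groupby_indices(keys):
    """Map each distinct key to the list of row indices where it occurs."""
    groups = defaultdict(list)
    for i, k in enumerate(keys):
        groups[k].append(i)
    return dict(groups)

# Made-up table stored column-wise, plus per-row metadata kept outside the table.
table = {"site": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]}
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

inds = groupby_indices(table["site"])

# The same indices select rows from the table and from the external metadata,
# so the two stay aligned after grouping.
grouped_values = {k: [table["value"][i] for i in ix] for k, ix in inds.items()}
grouped_coords = {k: [coords[i] for i in ix] for k, ix in inds.items()}

print(inds)
print(grouped_values)
print(grouped_coords)
```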

dpsanders · August 30, 2020, 2:52pm

There’s Octo.jl (GitHub: wookay/Octo.jl), an SQL query DSL in Julia.

johnmyleswhite · August 30, 2020, 3:00pm

juliohm wrote: “As a side question, do you think it is possible to define these operations taking into account metadata? Say I have additional information like spatial coordinates, timestamps, etc. for each row of the table, and that this information lives outside the table object. How can we apply a @groupby, for example, and retain these metadata in the results? It would be nice if these efforts could handle these use cases.”

I may not be understanding, but this seems either (a) like it should fall out naturally from normal SQL-style operations or (b) like a bit of a niche use case. Hard to imagine I’ll have enough spare time to ever get to (b).

kristoffer.carlsson · August 30, 2020, 3:13pm

The amazing thing is that this is not a _str macro!
