JuliaData BoF @ JuliaCon2023 discussion

Below is my summary of the Data Ecosystem BoF discussion at JuliaCon 2023 (it is certainly opinionated, so feel free to comment and add your thoughts).
I highlight the points I took away as most important to work on, from this discussion and from related conversations I had during the conference.

Four major areas were discussed:

  1. It would be great to have robust distributed tabular data processing. This is becoming increasingly important, as the ability to work with out-of-core data is a must for many users. DTables.jl is a good candidate, but it needs polishing. Contributions and friendly-user tests are welcome.
  2. A related topic is a flexible and easy way to read/store distributed data (especially from the cloud). @quinnj mentioned CloudStore.jl as a package that fills this gap.
  3. Decoupling the “DataFrames.jl query language” from the DataFrame object would be good. That way, other tabular data formats could use the same interface, and the meta-packages (DataFramesMeta.jl, DataFrameMacros.jl, Tidier.jl) would automatically work with different backends. @krynju has gone through this process for DTables.jl, and it was doable and not very hard (not all operations are supported, but this is expected, as different data storage formats may impose additional constraints compared to the fully flexible DataFrames.jl). In general, it does not have to be exactly this language; some other transformation specification language would also work (one that, in particular, allows lazy query processing). It would also be good to have a two-way translator between such a language and SQL.
  4. It would be good to have Polars.jl bindings and to ensure that the DuckDB.jl bindings are stable and maintained (this was not discussed much during the meeting, but I had such discussions separately). I agree with this opinion: these two technologies will continue to improve, and many potential users switching to Julia from e.g. Python will want to keep using them.
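As a small illustration of point 3: the DataFrames.jl transformation mini-language is itself just data (`source => function => target` pairs), so in principle another backend could accept the same specification. A minimal sketch with plain DataFrames.jl (the column names here are made up for the example):

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)

# The "query language" is plain data: source => function => target pairs,
# which a backend other than DataFrame could interpret as well
spec = [:a => ByRow(x -> x + 1) => :a1,
        [:a, :b] => (+) => :sum]

select(df, spec...)  # 3×2 DataFrame with columns :a1 and :sum
```

Because `spec` is an ordinary vector of pairs, a lazy or distributed backend could record it and execute (or translate) it later instead of applying it eagerly.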

Lazy processing…

Lazy means keeping a chain of functions and applying them to the actual data only when needed. This opens up a few interesting possibilities:

  • it is easy to draw a diagram of the table transformation, from the initial DataFrame() to the point where the transformation is applied, which helps with debugging;
  • transform operators can be wrapped by an external executor/optimizer (a kind of logical execution plan);
  • operators can run in distributed mode on a cluster, with automatic load balancing and collection of results;
  • operators can be optimized before execution;
  • operators can be partially converted to an SQL query, providing seamless integration with an SQL DB. In this case, the initial data request and some operations can be performed by the DB, while the rest are performed locally in Julia;
  • lazy operators with DB-cursor support enable processing of large tables.
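The core lazy idea above can be sketched in a few lines of plain Julia (the `LazyOps` type and the function names are hypothetical, just to show the mechanics):

```julia
# Hypothetical minimal lazy pipeline: record functions, apply only on demand
struct LazyOps
    ops::Vector{Function}
end
LazyOps() = LazyOps(Function[])

# Chaining only records the operation; nothing is computed yet
chain(l::LazyOps, f::Function) = LazyOps(vcat(l.ops, f))

# Materialization applies the recorded chain to the actual data
materialize(l::LazyOps, data) = foldl((d, f) -> f(d), l.ops; init=data)

pipe = chain(chain(LazyOps(), x -> x .+ 1), x -> x .* 2)
materialize(pipe, [1, 2, 3])  # only here the work happens: [4, 6, 8]
```

Because the chain is inspectable data, an executor could draw it as a diagram, reorder or fuse steps, or translate a prefix of it to SQL before materializing.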

Regarding lazy transforms over tables, check TableTransforms.jl:

Here is the display of pipelines (note that PCA() expands to a ZScore() followed by an EigenAnalysis()):

julia> using TableTransforms

julia> pipe = (Select(1:5) → PCA()) ⊔ (Interquartile() → ColTable())
ParallelTableTransform
├─ SequentialTransform
│  ├─ Select([1, 2, 3, 4, 5], nothing)
│  ├─ ZScore(all)
│  └─ EigenAnalysis(:V, nothing, 1.0)
└─ SequentialTransform
   ├─ Scale(all, 0.25, 0.75)
   └─ ColTable()

We currently use Transducers.jl to run some transforms in parallel over the columns of the input table or over the branches of parallel pipelines. It would be nice to consider Dagger.jl for parallelism over rows.
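For context, the Transducers.jl pattern mentioned above looks roughly like this: a toy threaded reduction over per-column results, using `foldxt` (the thread-parallel fold). The data here is made up for the example:

```julia
using Transducers

# Toy example: compute per-column sums and reduce them with a threaded fold
cols = [collect(1:10) for _ in 1:4]
total = foldxt(+, Map(sum), cols)  # 4 columns × sum(1:10) = 4 × 55
```

`foldxt` splits the work across threads, so independent columns (or pipeline branches) are processed in parallel; row-level parallelism across workers is where Dagger.jl could come in.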
