JuliaData BoF @ JuliaCon2023 discussion

Below is my summary of the Data Ecosystem BoF discussion at JuliaCon 2023. It is inevitably opinionated, so feel free to comment and add your thoughts.
I focus on what I took away as the most important things to work on, both from this discussion and from related conversations I had during the conference.

Four major areas were discussed:

  1. It would be great to have robust distributed tabular data processing. This is becoming increasingly important, as the ability to work with out-of-core data is a must for many users. DTables.jl is a good candidate, but it needs polishing. Contributions and tests from friendly users are welcome.
  2. A related topic is a flexible and easy way to read and store distributed data (especially from the cloud). @quinnj mentioned CloudStore.jl as a package to fill this gap.
  3. Decoupling the “DataFrames.jl query language” from the DataFrame object would be good. That way, other tabular data formats could use the same interface. It would also make the meta-packages (DataFramesMeta.jl, DataFrameMacros.jl, Tidier.jl) work automatically with different backends. @krynju has gone through this process for DTables.jl and found it doable and not very hard. Not all operations are supported, but this is expected, as different data storage formats might impose additional constraints on allowed operations compared to the fully flexible DataFrames.jl. In general, it does not have to be strictly this language; it could be some other transformation specification language (one that, in particular, would allow lazy query processing). It would also be good to have a two-way translator between such a language and SQL.
  4. It would be good to have Polars.jl bindings and to ensure that the DuckDB.jl bindings are stable and maintained (this was not discussed much during the meeting, but I had such discussions separately). I agree with this opinion. These two technologies will continue to improve, and many potential users will want to keep using them when switching to Julia from, e.g., Python.
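
To make point 3 a bit more concrete, here is a minimal sketch of what "decoupling" could look like. This is not how DataFrames.jl or DTables.jl implement it: `apply_spec` is a hypothetical function, and the two "backends" are plain Base Julia containers (a NamedTuple of vectors and a vector of NamedTuples). The only thing borrowed from DataFrames.jl is the shape of the `source => function => target` specification.

```julia
# Hypothetical sketch: the same `source => function => target` specification
# interpreted by two different table representations, using only Base Julia.

# Backend 1: column table (NamedTuple of equal-length vectors).
function apply_spec(table::NamedTuple, spec::Pair{Symbol,<:Pair})
    src = first(spec)
    fun, dst = last(spec)
    newcol = fun(getproperty(table, src))          # function sees the whole column
    return merge(table, NamedTuple{(dst,)}((newcol,)))
end

# Backend 2: row table (vector of NamedTuples).
function apply_spec(rows::AbstractVector{<:NamedTuple}, spec::Pair{Symbol,<:Pair})
    src = first(spec)
    fun, dst = last(spec)
    newcol = fun([getproperty(r, src) for r in rows])  # gather the column first
    return [merge(r, NamedTuple{(dst,)}((v,))) for (r, v) in zip(rows, newcol)]
end

# One specification, written once, with no reference to a concrete table type.
spec = :a => (col -> 2 .* col) => :a2

result_cols = apply_spec((a = [1, 2, 3], b = [10.0, 20.0, 30.0]), spec)
result_rows = apply_spec([(a = 1,), (a = 2,), (a = 3,)], spec)
```

The specification is just data (nested `Pair`s), so each backend is free to decide how to execute it, or to reject operations it cannot support.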

Lazy processing…

Lazy means keeping a chain of functions and applying them to the actual data only when needed. This opens up a few interesting possibilities:

  • it is easy to draw a diagram of the table transformation, from the initial DataFrame() to the point where the transformation is applied, which helps with debugging;
  • transform operators can be wrapped by an external executor/optimizer (a kind of logical execution plan);
  • operators can be run in distributed mode on a cluster, with automatic load balancing and collection of results;
  • operators can be optimized before execution;
  • operators can be partially converted to an SQL query, providing seamless integration with an SQL DB: the initial data request and some operations can be performed by the DB, while the rest are performed locally in Julia;
  • lazy operators with DB cursor support make it possible to process large tables.
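
As a rough illustration of "keeping a chain of functions", here is a minimal sketch in Base Julia. All names (`Step`, `LazyTable`, `lazy`, `then`, `plan`, `materialize`) are hypothetical and not taken from any existing package; the table is a plain NamedTuple of vectors.

```julia
# Hypothetical sketch of a lazy pipeline: steps are recorded, not executed.
struct Step
    label::String   # human-readable description, used to print the plan
    f::Function     # table -> table
end

struct LazyTable{T}
    source::T
    steps::Vector{Step}
end

lazy(source) = LazyTable(source, Step[])

# Recording a step returns a new LazyTable; no data is touched here.
then(lt::LazyTable, label, f) = LazyTable(lt.source, vcat(lt.steps, [Step(label, f)]))

# The recorded chain can be inspected (or, in principle, rewritten) before running.
plan(lt::LazyTable) = join(vcat(["source"], [s.label for s in lt.steps]), " -> ")

# Only here are the functions applied to the actual data, in order.
materialize(lt::LazyTable) = foldl((data, step) -> step.f(data), lt.steps; init = lt.source)

function keep_large(t)
    keep = t.x .> 1
    return (id = t.id[keep], x = t.x[keep])
end
add_y(t) = merge(t, (y = 2 .* t.x,))

table = (id = [1, 2, 3, 4], x = [0.5, 1.5, 2.5, 3.5])
pipeline = then(then(lazy(table), "filter x > 1", keep_large), "add y = 2x", add_y)

println(plan(pipeline))
result = materialize(pipeline)
```

Because the steps are stored as data, an optimizer, a distributed executor, or an SQL translator could walk `pipeline.steps` before anything runs; this sketch does none of that and only executes the steps sequentially.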

Regarding lazy transforms over tables, check:

Here is how pipelines are displayed:

julia> using TableTransforms

julia> pipe = (Select(1:5) → PCA()) ⊔ (Interquartile() → ColTable())
├─ SequentialTransform
│  ├─ Select([1, 2, 3, 4, 5], nothing)
│  ├─ ZScore(all)
│  └─ EigenAnalysis(:V, nothing, 1.0)
└─ SequentialTransform
   ├─ Scale(all, 0.25, 0.75)
   └─ ColTable()

We currently use Transducers.jl to run some transforms in parallel over the columns of the input table or over the branches of parallel pipelines. It would be nice to consider Dagger.jl for parallelism over rows.
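
To illustrate the difference between the two kinds of parallelism, here is a small sketch. Note that it uses `Threads.@spawn` from Base rather than Transducers.jl or Dagger.jl, and that `map_columns` and `map_row_chunks` are hypothetical helpers, not functions from TableTransforms.jl.

```julia
using Base.Threads: @spawn

# A transform that needs a whole column (global mean and standard deviation).
function zscore(col::AbstractVector{<:Real})
    m = sum(col) / length(col)
    s = sqrt(sum(abs2, col .- m) / (length(col) - 1))
    return (col .- m) ./ s
end

# Parallel over columns: one task per column of a NamedTuple of vectors.
function map_columns(f, table::NamedTuple)
    tasks = map(col -> @spawn(f(col)), values(table))
    return NamedTuple{keys(table)}(map(fetch, tasks))
end

# Parallel over rows: split rows into chunks, transform each chunk, concatenate.
# Assumes at least one row, and only makes sense for transforms that treat
# rows independently.
function map_row_chunks(f, table::NamedTuple; nchunks::Integer = 2)
    n = length(first(table))
    ranges = Iterators.partition(1:n, cld(n, nchunks))
    tasks = [@spawn(f(map(col -> col[r], table))) for r in ranges]
    parts = fetch.(tasks)
    return map(vcat, parts...)   # column-wise vcat across the chunk results
end

standardized = map_columns(zscore, (a = [1.0, 2.0, 3.0], b = [10.0, 20.0, 30.0]))
doubled = map_row_chunks(t -> merge(t, (y = 2 .* t.x,)), (x = [1.0, 2.0, 3.0, 4.0, 5.0],))
```

A transform like `zscore` cannot simply be run per row chunk, since it depends on statistics of the full column; that is one reason why parallelism over rows needs more machinery (a reduction step, or a scheduler such as Dagger.jl) than parallelism over columns.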
