Below is my summary of the Data Ecosystem BoF discussion at JuliaCon 2023 (it is opinionated, so feel free to comment and add your thoughts).
I cover what I took away as most important to work on, both from this discussion and from related conversations I had during the conference.
Four major areas were discussed:
- It would be great to have robust distributed tabular data processing. This is becoming increasingly important, as the ability to work with out-of-core data is a must for many users. DTables.jl is a good candidate, but it needs polishing. Contributions and testing by friendly users are welcome.
- A related topic is a flexible and easy way to read and store distributed data (especially in the cloud). @quinnj mentioned CloudStore.jl as a package to fill this gap.
- Decoupling the “DataFrames.jl query language” from the `DataFrame` object would be good. In this way, other tabular data formats could use the same interface. This would also make the meta-packages (DataFramesMeta.jl, DataFrameMacros.jl, Tidier.jl) work automatically with different backends. @krynju has gone through this process for DTables.jl and found it doable and not very hard. Not all operations are supported, but this is expected, as different data storage formats might impose additional constraints on allowed operations in comparison to the fully flexible DataFrames.jl. In general, it does not have to be strictly this language; it could be some other transformation specification language (one that, in particular, would allow lazy query processing). Also, it would be good to have a two-way translator between such a language and SQL.
- It would be good to have Polars.jl bindings and to ensure that the DuckDB.jl bindings are stable and maintained (this was not discussed much during the meeting, but I had such discussions separately). I agree with this opinion. These two technologies will continue to improve in the future, and many potential users will want to keep using them when switching to Julia from, e.g., Python.
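
To make the point about decoupling the query language a bit more concrete, here is a minimal sketch. In DataFrames.jl an operation specification such as `:x => sum => :total` is an ordinary Julia `Pair` that is normally passed to functions like `combine`; nothing in the pair itself refers to `DataFrame`. The helper `apply_spec` below is hypothetical (it is not part of DataFrames.jl, DTables.jl, or any other package) and handles only this one form of specification. It assumes a table stored as a plain `NamedTuple` of vectors, just to show that another backend could interpret the same specification.

```julia
# An operation specification in the `source => function => target` form.
# It is a plain Pair and does not depend on DataFrames.jl being loaded.
spec = :x => sum => :total

# Hypothetical helper (an assumption for illustration only): apply a single
# `source => function => target` specification to a column table that is
# represented as a NamedTuple of vectors.
function apply_spec(table::NamedTuple, spec::Pair{Symbol,<:Pair})
    source = first(spec)             # name of the input column
    fun, target = last(spec)         # function to apply and output column name
    result = fun(getproperty(table, source))
    return NamedTuple{(target,)}((result,))
end

# A tiny table that is not a DataFrame.
table = (g = [1, 1, 2], x = [10, 20, 30])

out = apply_spec(table, spec)
```

A real backend would of course need to cover much more (multiple source columns, grouping, row-wise operations), which is where backend-specific constraints like those mentioned for DTables.jl would show up.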