Future directions for DataFrames.jl

Hi,

Currently DataFrames.jl is designed for in-memory processing of data frames on the CPU. While we are still working on improvements in this area (in the coming year), I would like to run a poll on which general design direction we should take in the long term (several years):

  • GPU support
  • handling data larger than available RAM
  • improving current multi-threading support


I have made the poll single-choice to understand the major area that the community would prefer us to concentrate on.

We do not even know yet what is possible in these areas - this poll will give us information about which area we should concentrate on in our investigations.

Thank you for voting!

41 Likes

I went for larger-than-RAM because I think it would be a bit of a killer feature (I don't think there's a framework out there currently that does it seamlessly?), but I wonder if you could comment on what the multi-threading option entails? Is it about multi-threading more internals for performance gains, or better support for users to multithread their own analyses?

6 Likes

Dask

3 Likes

Seamless is difficult to define. There are a few approaches out there. I am experimenting with some approaches in Julia and have built one in R in {disk.frame}.

3 Likes

This is a tantalizing set of choices. I'll ramble a bit. Improving multi-threading, in my opinion, will have the most value for most people. Everyone has more cores and more RAM - from the low-powered laptop to the biggest servers. This will help everyone.

A natural progression would be to think about out-of-core next - for people on a single computer to work on large datasets. However, the real question for out-of-core is: it may be fine if your data is 2-3x RAM, but if your dataset is 10x RAM or even bigger, will the I/O cost make it practically intractable? Do we have learnings from other systems about how effective this is? Many such users seem to have adopted distributed memory largely for the extra I/O bandwidth.

GPUs are interesting because the memory size keeps growing, and you can really enable some very interesting applications with the performance they offer. I wonder how well RAPIDS performs on recent GPUs and how much of a game changer this can be.

One question is whether, using abstractions like KernelAbstractions or Dagger, we can target more hardware configurations with the same codebase. @vchuravy and @jpsamaroo may have some thoughts.
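
To make the "same code, different resources" idea concrete, here is a trivial sketch using Dagger's generic task API (just an illustration of the abstraction, not a concrete proposal for DataFrames.jl):

```julia
using Distributed
addprocs(2)                  # add worker processes; threads work the same way
@everywhere using Dagger

# The same task graph runs on whatever threads/processes are available;
# Dagger's scheduler decides where each task executes.
chunks = [Dagger.@spawn sum(rand(10^6)) for _ in 1:8]
total = Dagger.@spawn +(chunks...)   # thunk arguments are resolved automatically
fetch(total)
```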

-viral

21 Likes

Just to flag that some libraries (Dask) end up taking on both parallelism and out-of-core processing together. That might be too ambitious, but I'm not sure that parallelism (albeit perhaps not multithreading per se) and out-of-core processing have to be mutually exclusive priorities!

6 Likes

Dagger will also be taking on this idea from Dask going forward; out-of-core support (specifically automatic swap-to-disk-and-back) is something that I want to implement over the next few months. With that in place, out-of-core and multithreaded+multiprocess parallelism will be automatic simply by using Dagger.

12 Likes

There has been some interest in a Dagger + Tables interface (https://github.com/JuliaParallel/Dagger.jl/pull/230), so that is probably a good stepping stone.

4 Likes

While I know that the community really wants out-of-core, I don't think this functionality should go into DataFrames, or be DataFrames-specific. There are other Tables.jl-compatible types that are currently fully in-memory but could be stored on disk efficiently (potentially more efficiently than DataFrames due to type stability).

Instead, I want Dagger.jl to naturally extend the processing capabilities of existing Tables.jl-compatible types. I've been working with Krystian Guliński, who just implemented the new Dagger.DTable distributed table type. I think this new distributed table is the ideal place for out-of-core support to live, since it's backed by Dagger's rapidly evolving scheduler, which can already do multithreading and multiprocessing, as well as (basic) GPU computing, automatically. Note that the DTable wraps existing Tables.jl-compatible types, so it's agnostic to the underlying storage format and will inherit their in-memory operational performance and tradeoffs.
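
As a rough sketch of how this looks from the user side (the DTable API is still evolving, so the constructor and chunk size shown here are assumptions based on the current prototype):

```julia
using Dagger
using DataFrames  # any Tables.jl-compatible source should work the same way

df = DataFrame(a = rand(1_000_000), b = rand(1_000_000))

# Partition the source table into chunks that Dagger's scheduler can place
# on threads/workers (and, eventually, spill to disk).
dt = Dagger.DTable(df, 100_000)              # assumed (table, chunksize) constructor

dt2 = map(row -> (c = row.a + row.b,), dt)   # runs chunk-wise through the scheduler
result = fetch(dt2)                          # materialize the result back in memory
```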

22 Likes

Sorry for potentially going off on a tangent here. I wonder how this would work if the on-disk file requires "interpretation", for example Arrow or HDF5, or, more exotically, an in-house format that users know how to write a parser for but want a way to hook into Dagger.

To clarify, I don't think the proposal is to put out-of-core things into DataFrames. Ideally, DataFrames would formalize the assumptions it makes about vectors and the APIs it uses with vectors internally, so that distributed arrays will "just work".
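
To make "just work" concrete, a toy illustration (the Doubled type below is invented for this example; a distributed or out-of-core vector would implement the same AbstractArray interface):

```julia
using DataFrames

# A trivial lazy vector type standing in for a distributed/out-of-core column.
struct Doubled{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(v::Doubled) = size(v.data)
Base.getindex(v::Doubled, i::Int) = 2 * v.data[i]

# copycols=false keeps the custom column type instead of collecting it.
df = DataFrame(x = Doubled([1, 2, 3]), copycols = false)
df.x[2]  # == 4; DataFrames only relies on the AbstractVector API here
```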

9 Likes

We'll try to do the optimal thing and use their native I/O capabilities when available. For things like DataFrames and other pure-Julia data types, we'll probably use some mixture of Serialization and Mmap.
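
Very roughly, the kind of mixture meant here could look like the following (the spill/restore/mmap_column helpers are made up for illustration and are not Dagger's actual implementation):

```julia
using Serialization, Mmap

# Spill an arbitrary Julia object (e.g. a table chunk) to disk and read it back.
spill(chunk, path) = open(io -> serialize(io, chunk), path, "w")
restore(path) = open(deserialize, path)

# For isbits element types, memory-mapping lets the OS page data in lazily
# instead of deserializing everything up front.
mmap_column(path, ::Type{T}, n) where {T} = open(io -> Mmap.mmap(io, Vector{T}, n), path)
```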

Understood and agreed. It's good to be explicit about these things so that onlookers don't get the wrong ideas about what's being proposed (and it helps me understand what the JuliaData contributors have in mind).

1 Like

So to be explicit:

With my question I do not mean that the three areas I am asking about are mutually exclusive, nor that they would have to be hardcoded into DataFrames.jl (they are not exclusive, and they do not have to be fully hardcoded into DataFrames.jl).

I made the poll single-choice, as with multiple choice the obvious answer would be to pick all three :smiley:.

What I wanted to learn is which area of functionality currently has the highest demand in the community. With this knowledge we can best prioritize our efforts. Thank you for participating in the poll. Also - the discussion we are having here is very welcome and I hope it can lead to good design directions for the future.

13 Likes

While we are at it, I want to plug adding metadata to data frames as a desired future step.

I know it has been attempted twice now without success, and that there is a lot of disagreement on the implementation, but it's still something I would like to see eventually!
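
For concreteness, one hypothetical shape such an API could take (names and behavior invented here purely for illustration; this is not the design from the earlier attempts):

```julia
using DataFrames

df = DataFrame(wage = [10.5, 12.0], educ = [12, 16])

# Hypothetical metadata API, for illustration only.
metadata!(df, "source", "2019 household survey")          # table-level note
colmetadata!(df, :wage, "label", "Hourly wage (USD)")     # column-level label
colmetadata!(df, :educ, "label", "Years of education")

colmetadata(df, :wage, "label")  # => "Hourly wage (USD)"
```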

8 Likes

Tough choices (I will answer from a purely personal perspective for my use case, disregarding what may be more useful to a broader audience).

My first instinct was larger-than-RAM processing, but in reality I would move to a server, and if the data still can't fit into RAM, Postgres usually gets the job done for me (also, I am not sure how many people use JuliaDB? Maybe there is not even a need for such a feature in DataFrames because of JuliaDB?).

On improved multi-threading, I was not really sure what this entails. Are joins multithreaded yet? Can they be? If yes, that seems like a great feature.

GPU support might actually be a USP of DataFrames? It could potentially be super useful depending on what kind of operations it can speed up. However, I hope it would also work on AMD, as that is the only GPU access I have :frowning:

In the end I voted for improved multi-threading support, but if it were an option I would have also voted for adding metadata to DataFrames, as @pdeffebach mentioned. I come from a social science background and I think this feature would open up a huge potential user base for DataFrames; for them it is currently not viable, because working without metadata gets very annoying rather quickly in some circumstances.

1 Like

Anecdotally, I had this exact use case and it was unproblematic for Dagger in my case. I used FileTrees, with my custom parser in the load function.
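
Roughly, the pattern looked like this (with read_my_format standing in for the in-house parser; details simplified):

```julia
using FileTrees

t = FileTree("data")                      # lazily represents the directory tree
parsed = FileTrees.load(t; lazy=true) do file
    read_my_format(path(file))            # the user-supplied parser hooks in here
end

# Values are computed through Dagger across threads/workers on exec.
result = exec(reducevalues(vcat, parsed))
```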

It took some tweaking and effort to get the whole thing rolling though, such as FileTrees stack-overflowing on a large, deep directory structure, "too many open files" exceptions if I used too many machines, as well as what I think is 'normal cluster jank', such as workers refusing requests for (to me) unknown reasons and bringing the whole thing down.

None of this seemed to have anything to do with a custom parser, though. I will of course see if I can narrow the issues down to the point where it is meaningful to file issues/PRs.

There may be a unique opportunity for Julia in general here, as the library options for manipulating tables in a generalized distributed computing context are extremely limited. Much to my frustration, I have found that in the corporate "big data" world, even in the context of "data science", many practitioners seem to barely understand the concept of generalized distributed computing, and in many cases they blithely accept the limitations of narrowly focused relational database libraries and SQL-like interfaces. It seemingly does not occur to them that they should not have to do absolutely everything they could possibly want with a single API, which is of course wildly inappropriate for all sorts of tasks other than what it was designed for.

Indeed, I find the options currently available for this kind of work highly unsatisfactory:

  • Spark is very specifically designed for relational database operations, and doing anything else within the framework it provides is extremely painful at best. Deployment and dealing with the JVM are also truly awful.
  • Dask is, in principle, much closer to a generalized distributed computing framework with a distributed table implementation, but in practice it is prohibitively slow.
  • Other database software and SQL interfaces such as Presto, ClickHouse, or any of the older SQL databases are even more narrowly focused than Spark.

I do have a number of thoughts about package management and code organization, however, and in this I will echo some of what @jpsamaroo was saying.

I think it is essential to keep individual Julia packages light and focused so as not to compromise their usefulness as dependencies. We should endeavor to maintain the impressive level of code sharing currently typical in the Julia ecosystem, particularly in JuliaData. To this end, it's not clear to me how much, if any, functionality related to distributed computing or tables larger than RAM should live in DataFrames.jl itself. We definitely don't want to go down any roads that turn the package into a bloated monstrosity.

Ideally, DataFrames.jl would "get out of the way" of any packages that want to use it as a dependency. In particular, if there is anything that can be done to make DataFrames.jl useful for packages that want to implement distributed computing functionality, by all means it should do that. A good example might be the distributed table implementation which just went into Dagger.jl. In my opinion, changes to DataFrames.jl which make it more useful for that implementation would be a good thing, but putting such an implementation in DataFrames.jl itself seems like a bad idea.

While I'm at it: at the moment Dagger.jl is doing exactly the sort of thing I'm warning against by including that table implementation. It really should be a separate package that depends on Dagger.jl and DataFrames.jl; more of my thoughts on the subject here.

7 Likes

Adding metadata is on the agenda. The PR for adding metadata is here.

The key problem with metadata is not implementing it (that is easy) but the API we want to provide (that is very hard). I would encourage anyone interested to comment in https://github.com/JuliaData/DataFrames.jl/pull/1458 on the desired design of metadata.

When we settle on the desired rules for how metadata should be handled (or at least a mental model for it), I am sure it will be quickly implemented by someone (probably https://github.com/JuliaData/DataFrames.jl/pull/1458 should be closed, as it will be easier to write it from scratch once we know what we want to do, but this is a technicality). Thank you!

3 Likes

Or Tullio.jl, or FLoops/Transducers and friends. cc: @tkf, @shashi, and @mcabbott

Shashi had a Dagger + Tullio idea from a while back, IIRC.

Also, there's already a start on a DTable in Dagger: https://github.com/JuliaParallel/Dagger.jl/blob/e6b3cdca6a50b983a51250fd479dfae17f7944a6/src/table/dtable.jl

Another question: is DataFrames even the right abstraction? Should there be something with a higher-level structured loop/op representation, like one of the packages above, that sits on Tables.jl?

And Dagger would be an execution backend or a middle IR.

Edit: Sorry, I was a bit hasty with my reply and missed some of the comments from @jpsamaroo and @vchuravy, who made similar points.

IMO there are two prominent ways we can really improve over Dask with these expression DSLs and Tables.jl. One, as @viralbshah mentioned, is a single, consolidated codebase. A second is not having a limited "blessed" API with a hard abstraction barrier.

The cool thing about DataFrames.jl is that you can just loop over vectors and manipulate things manually. With Dask, the options are the hardcoded DAG + in-memory dataframe chunking, or maybe composing some higher-level ops. If we have an expression frontend that's smart enough to compile parallel/distributed/GPU code for different kinds of tables, users can reach out and touch the data structure instead of being limited to chaining pre-canned operations.
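
To make the contrast concrete, a trivial example of the two styles in plain DataFrames.jl (nothing Dask- or Dagger-specific):

```julia
using DataFrames

df = DataFrame(x = rand(10_000), y = rand(10_000))

# "Reach out and touch the data": plain Julia over the column vectors.
acc = let s = 0.0
    for (x, y) in zip(df.x, df.y)
        x > 0.5 && (s += x * y)
    end
    s
end

# The chained, pre-canned style that a distributed frontend would compile:
combine(subset(df, :x => ByRow(>(0.5))), [:x, :y] => ((x, y) -> sum(x .* y)) => :s)
```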

2 Likes