Future directions for DataFrames.jl

There may be a unique opportunity for Julia in general here as the options for libraries for manipulating tables in a generalized distributed computing context are extremely limited. Much to my frustration, I have found that in corporate “big data world”, even in the context of “data science” many of the practitioners seem to barely even understand the concept of generalized distributed computing and in many cases are blithely accepting the limitations of narrowly focused relational database libraries and SQL-like interfaces, it seemingly not occurring to them that they should not have to do absolutely everything they could possibly want to do with a single API which is of course wildly inappropriate for all sorts of tasks except what it was designed for.

Indeed, I find the options currently available for this kind of work highly unsatisfactory:

  • spark is very specifically designed for relational database operations, and doing anything else within the framework it provides is extremely painful at best. Deployment and dealing with the JVM are also truly awful.
  • dask is much closer to an generalized distributed computing framework with distributed table implementations in principle, but in practice it is prohibitively slow.
  • Other database software and SQL interfaces such as presto, clickhouse or any of the older SQL databases are even more narrowly focused than spark.

I do have a number of thoughts about package management and code organization, however, and in this I will echo some of what @jpsamaroo was saying.

I think it is essential to keep individual Julia packages light and focused so as not to compromise their usefulness as dependencies. We should endeavor to maintain the impressive level of code sharing currently typical in the Julia ecosystem, particularly in JuliaData. To this end, it’s not clear to me how much, if any, functionality related to distributed computing or tables larger than RAM should live in DataFrames.jl itself. We definitely don’t want to go down any roads that turn the package into a bloated monstrosity.

Ideally, DataFrames.jl would “get out of the way” of any packages that want to use it as a dependency. In particular, if there is anything that can be done to make DataFrames.jl useful for packages that want to implement distributed computing functionality, by all means it should do that. A good example might be the distributed table implementation which just went into Dagger.jl. In my opinion changes to DataFrames.jl which can make it more useful for that implementation would be a good thing, but putting such and implementation in DataFrames.jl itself seems like a bad idea.

While I’m at it, at the moment Dagger.jl is doing exactly the sort of thing I’m warning against by including that tables implementation. It really should be a separate package that depends on Dagger.jl and DataFrames.jl, more of my thoughts on the subject here.

7 Likes

Adding metadata is on the agenda. The PR for adding medata is here.

The key problem with metadata is not implementing it (it is easy) but the API we want to provide (this is very hard). I would encourage people interested to comment in https://github.com/JuliaData/DataFrames.jl/pull/1458 on the desired design of medatata.

When we end up with the desired rules how metadata should be handled (or at least mental model for it) I am sure it will be quickly implemented by someone (probably https://github.com/JuliaData/DataFrames.jl/pull/1458 should be closed as it will be easier to write it from scratch when we know what we want to do, but this is a technical nuisance). Thank you!

3 Likes

Or Tullio.jl, or Floops/Transducers and friends cc: @tkf @shashi and @mcabbott

Shashi had a dagger + tullio idea from a while back IIRC

Also there’s already a start on a dtable in dagger: Dagger.jl/dtable.jl at e6b3cdca6a50b983a51250fd479dfae17f7944a6 · JuliaParallel/Dagger.jl · GitHub

Another question, is DF even the right abstraction? Should there be something from a higher level structured loop/op representation like from one of those above packages, that sits on Tables.jl?

And dagger would be an execution backend or a middle IR.

Edit: Sorry, was a bit hasty with my reply and missed some of the comments from @jpsamaroo and @vchuravy who made some similar points.

IMO there are two prominent ways we can really improve over Dask with these expression DSLs and and tables.jl. One as @viralbshah mentioned was a , consolidated codebase. A second is not having a limited “blessed” api with a hard abstraction barrier.

The cool thing about DF.jl is that you can just loop over vectors and manipulate things manually. With Dask, the methods are hardcoded DAG + in memory dataframes chunking, or maybe composing some higher level ops. If we have an expression frontend that’s smart enough to compile parallel/distributed/ GPU code for different kinds of tables, users can reach out and touch the data structure instead of being limited to chaining pre-canned operations.

2 Likes

Jumping in here to strongly second the point about Dask in particular. As simple and elegant as the partitioning model is in theory, in practice I’ve found Dask’s implementation to be an extremely leaky abstraction. Stare at it funny, and the runtime will decide to run everything serially or materialize all of your data anyways. Hopefully DTable does not experience the same pathologies, but I’ll have to claim ignorance there :slight_smile:

2 Likes

This is essentially what I discussed in [ANN] Folds.jl: threaded, distributed, and GPU-based high-level data-parallel interface for Julia and

KernelAbstractions and Dagger are great in that they are very flexible and generic. But having more higher-level structured representation helps for lowering to target-specific programs. As always, adding more constraints can help for encoding structures of your code which then can be exploited by the underlying framework for performance and composability.

14 Likes

I think scaling to about 10x is still doable. Especially with new tech like 3D XPoint drives becoming more common. For example, I have played with the Fannie Mae data (200G in CSV format) and can manipulate it using {disk.frame} in R which works on the principle of splitting up the data into smaller files and using multiple process to process the same file.

Beyond that size, you are probably better off with a cluster based approach, but cluster based should be a last resort. Most companies don’t need a spark but they end up paying the spark tax just so they can say “big data”.

Yeah. I had done a quick study of the possibility of doing this with groupby. There’s a part of the groupby code that materialized an index in memory. This may not fit well with out-of-core datasets. See my research notes.

So for groupby to do that it might need to make that part of the code generic, which can add pressures to readability and maintainability. For now, I think it’s a case of let hundred flowers boom and let people try out multiple approaches. I am keen to try Dagger.DTable soon.

Huge fan of DataFrames.jl—I think it’s much more ergonomic than eg Pandas.

I sometimes see Database-like ops benchmark mentioned when folks compare R and Julia, and it seems like in most cases DF.jl stacks up well, but does OOM more frequently on certain benchmarks. Perhaps focusing on memory footprint would be a useful intermediate step towards supporting larger and larger data sizes?

Also interesting to see how poorly Dask does from a performance standpoint

6 Likes

That PR, though, bounces you over to What metadata should be #2276, which then bounces you over to DataAPI metadata method #22

And this is the issue - the problem is complex on the API level. I have agreed with @pdeffebach to try to come up with some clean proposal this year.

update Julia benchmarks by bkamins · Pull Request #232 · h2oai/db-benchmark · GitHub should fix this at least partially and in the future what this PR does will be done automatically. We just need to wait for benchmark maintainers to merge it and run the benchmarks to see the impact.

2 Likes

Here I have summarized the state of multi-threading support in DataFrames.jl.

Further work on multi-threading will extend the range of operations in which it is used automatically.

Manual multi-threading is already fully supported as internally we only use task-based multi-threading constructs. Therefore users should just use the standard functionality of Julia to perform manual multi-threading.

4 Likes

My priority:

  • Make It Work = handling data larger than fits into RAM
  • Make It Right
  • Make It Fast = improve current support of multi-threading ; GPU support

Make It Work Make It Right Make It Fast

This formulation of this statement has been attributed to KentBeck; it has existed as part of the UnixWay for a long time.See ButlerLampson’s “Hints for Computer System Design” (1983) Butler W. Lampson and Stephen C. Johnson and Brian W. Kernighan’s “The C Language and Models for Systems Programming” in Byte magazine (August 1983) ("the strategy is definitely: first make it work, then make it right, and, finally, make it fast."
)

4 Likes

On March 16, 2021, Micron announced that it would cease development of 3D XPoint in order to develop products based on Compute Express Link (CXL)

seems like a rapidly changing field and idk if “your local cluster” will have these SSDs…

1 Like

dplyr + tidyr + purrr

2 Likes

woah… there hasn’t been wide adoption of 3D XPoint yet. Gonna read up on CXL.

But faster disks just means better out-of-core performance. Big memory is eating big data.

Thanks

But Intel might still be in 3D XPoint though

That’s a good benchmark, current of almost for Julia’s DF, but I found for out-of-core: Benchmarking disk.frame 2: disk.frame beats Dask again! disk.frame beats JuliaDB again! Anyone else wanna challenge? • disk.frame

I didn’t notice a time-stamp, so I’m not sure if it’s current. Note, Dagger.jl is a dependency of JuliaDB, and is that currently the best out-of-core support? Which I do not need, currently…

For under $1000, you can RAID0 4x nvme drives…I have a 4TB array with sustained 7-8GB/s (ie disk filling) read/write using Samsung 970 Pro (note that many more recent drives typically optimize for consumer workflows of high burst but low sustained).

Top of the line interconnects are 200Gb/s (25GB/s), and that’s theoretical max, and usually shared, so I’d be surprised if a cluster approach would be faster until jobs are compute bound.

Routinely read 100GB of data into ram in about 15 seconds. Personally, if I would probably choose a beefy machine over cluster for data up to ~5TB, but depends on specific computation workflow of course.

1 Like

I think there’s still room for improvement in the syntax and ergonomics. It’s quite clunky to do many basic data transformations compared to something like kdb+/q.
It would also be really nice to have type stability so that one can use dataframes anywhere without worrying about a huge drop in performance.