Future directions for DataFrames.jl

Jumping in here to strongly second the point about Dask in particular. As simple and elegant as the partitioning model is in theory, in practice I’ve found Dask’s implementation to be an extremely leaky abstraction. Stare at it funny, and the runtime will decide to run everything serially or materialize all of your data anyways. Hopefully DTable does not experience the same pathologies, but I’ll have to claim ignorance there :slight_smile:

2 Likes

This is essentially what I discussed in [ANN] Folds.jl: threaded, distributed, and GPU-based high-level data-parallel interface for Julia and

KernelAbstractions and Dagger are great in that they are very flexible and generic. But a higher-level, more structured representation helps with lowering to target-specific programs. As always, adding more constraints helps encode the structure of your code, which the underlying framework can then exploit for performance and composability.
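
To make that concrete, here is a minimal sketch (my own illustration, not from the Folds.jl docs) of the idea: the reduction is stated once at a high level, and the executor argument is the extra “structure” that lets the framework lower it to different backends. I’m assuming the executor types exported by Transducers.jl.

```julia
using Folds
using Transducers: SequentialEx, DistributedEx

xs = 1:1_000_000

# Same high-level algorithm, three different lowerings:
Folds.sum(x -> x^2, xs)                   # default executor (threaded)
Folds.sum(x -> x^2, xs, SequentialEx())   # plain serial execution
Folds.sum(x -> x^2, xs, DistributedEx())  # multi-process execution
```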

14 Likes

I think scaling to about 10x RAM is still doable, especially with new tech like 3D XPoint drives becoming more common. For example, I have played with the Fannie Mae data (200GB in CSV format) and can manipulate it using {disk.frame} in R, which works on the principle of splitting the data into smaller files and using multiple processes to work on them in parallel.
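
As a rough illustration of that chunk-and-process idea in Julia (just a sketch; the directory layout and column names below are made up):

```julia
using Distributed
addprocs(4)                                   # spin up a few worker processes
@everywhere using CSV, DataFrames

# Assume the big CSV has already been split into smaller chunk files.
chunk_files = readdir("fannie_mae_chunks"; join = true)

# Each worker aggregates one chunk; only the small per-chunk results
# ever live in the driver's memory.
partials = pmap(chunk_files) do file
    df = CSV.read(file, DataFrame)
    combine(groupby(df, :state), :loan_amount => sum => :total)
end

# Combine the per-chunk aggregates into the final result.
result = combine(groupby(reduce(vcat, partials), :state), :total => sum => :total)
```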

Beyond that size you are probably better off with a cluster-based approach, but that should be a last resort. Most companies don’t need Spark, but they end up paying the Spark tax just so they can say “big data”.

Yeah. I did a quick study of the possibility of doing this with groupby. There’s a part of the groupby code that materializes an index in memory, which may not fit well with out-of-core datasets. See my research notes.

So for groupby to do that, that part of the code might need to be made generic, which can put pressure on readability and maintainability. For now, I think it’s a case of letting a hundred flowers bloom and letting people try out multiple approaches. I am keen to try Dagger.DTable soon.

Huge fan of DataFrames.jl—I think it’s much more ergonomic than eg Pandas.

I sometimes see Database-like ops benchmark mentioned when folks compare R and Julia, and it seems like in most cases DF.jl stacks up well, but does OOM more frequently on certain benchmarks. Perhaps focusing on memory footprint would be a useful intermediate step towards supporting larger and larger data sizes?

Also interesting to see how poorly Dask does from a performance standpoint.

6 Likes

That PR, though, bounces you over to What metadata should be #2276, which then bounces you over to DataAPI metadata method #22

And this is the issue: the problem is complex at the API level. I have agreed with @pdeffebach to try to come up with a clean proposal this year.

https://github.com/h2oai/db-benchmark/pull/232 should fix this at least partially, and in the future what this PR does will be done automatically. We just need to wait for the benchmark maintainers to merge it and run the benchmarks to see the impact.

2 Likes

Here I have summarized the state of multi-threading support in DataFrames.jl.

Further work on multi-threading will extend the range of operations in which it is used automatically.

Manual multi-threading is already fully supported, since internally we only use task-based multi-threading constructs. Therefore users can simply use the standard Julia functionality to perform manual multi-threading.
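
For example, something along these lines already works today (a sketch with made-up column names, using only standard Julia threading constructs):

```julia
using DataFrames
using Base.Threads: @threads

df = DataFrame(group = rand(1:10, 1_000_000), x = rand(1_000_000))
gd = groupby(df, :group)

# Each task handles a disjoint group, so no locking is needed.
sums = Vector{Float64}(undef, length(gd))
@threads for i in 1:length(gd)
    sums[i] = sum(gd[i].x)
end
```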

4 Likes

My priority:

  • Make It Work = handling data larger than what fits into RAM
  • Make It Right
  • Make It Fast = improve current support of multi-threading; GPU support

Make It Work Make It Right Make It Fast

This formulation of this statement has been attributed to Kent Beck; it has existed as part of the Unix Way for a long time. See Butler Lampson’s “Hints for Computer System Design” (1983) http://research.microsoft.com/en-us/um/people/blampson/33-hints/webpage.html and Stephen C. Johnson and Brian W. Kernighan’s “The C Language and Models for Systems Programming” in Byte magazine (August 1983) (“the strategy is definitely: first make it work, then make it right, and, finally, make it fast.”)

4 Likes

On March 16, 2021, Micron announced that it would cease development of 3D XPoint in order to develop products based on Compute Express Link (CXL).

seems like a rapidly changing field and idk if “your local cluster” will have these SSDs…

1 Like

dplyr + tidyr + purrr

2 Likes

woah… there hasn’t been wide adoption of 3D XPoint yet. Gonna read up on CXL.

But faster disks just mean better out-of-core performance. Big memory is eating big data.

Thanks

Intel might still be invested in 3D XPoint, though.

That’s a good benchmark, current at least for Julia’s DF, but for out-of-core I found this: Benchmarking disk.frame 2: disk.frame beats Dask again! disk.frame beats JuliaDB again! Anyone else wanna challenge? • disk.frame

I didn’t notice a time-stamp, so I’m not sure how current it is. Note that Dagger.jl is a dependency of JuliaDB; is that currently the best out-of-core support? Which I do not need, currently…

For under $1000, you can RAID0 4x NVMe drives… I have a 4TB array with sustained 7-8GB/s (i.e. disk-filling) read/write using Samsung 970 Pro drives (note that many more recent drives typically optimize for consumer workflows of high burst but low sustained throughput).

Top-of-the-line interconnects are 200Gb/s (25GB/s), and that’s a theoretical max that is usually shared, so I’d be surprised if a cluster approach would be faster until jobs are compute-bound.

I routinely read 100GB of data into RAM in about 15 seconds. Personally, I would probably choose a beefy machine over a cluster for data up to ~5TB, but it depends on the specific computation workflow, of course.

1 Like

I think there’s still room for improvement in the syntax and ergonomics. It’s quite clunky to do many basic data transformations compared to something like kdb+/q.
It would also be really nice to have type stability so that one can use dataframes anywhere without worrying about a huge drop in performance.
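
On the type-stability point: DataFrame deliberately does not encode column types in its own type, so the usual workaround today is a function barrier. A minimal sketch (my own illustration, not an official DataFrames.jl recommendation):

```julia
using DataFrames

# Inside this function the column's concrete type is known, so the loop
# compiles to fast, specialized code.
function sum_of_squares(x::AbstractVector)
    s = zero(eltype(x))
    for v in x
        s += v^2
    end
    return s
end

df = DataFrame(x = rand(10^6))
sum_of_squares(df.x)   # the call acts as a type-stabilizing barrier
```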

DataFramesMeta.jl provides macros that construct src => fun => dest transformations with a more convenient syntax. It’s maintained by the JuliaData collaborators, so it has a high bus factor.

Take a look at the tutorial here.
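
For a quick flavor of what that looks like (column names made up; the exact macro syntax depends on your DataFramesMeta.jl version):

```julia
using DataFrames, DataFramesMeta

df = DataFrame(a = 1:3, b = [10, 20, 30])

# Plain DataFrames.jl mini-language:
transform(df, [:a, :b] => (+) => :c)

# The same transformation via a DataFramesMeta macro:
@transform(df, :c = :a + :b)
```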

3 Likes

Not having GPU or multi-threading support just makes your computation slower; being unable to handle larger datasets completely prevents you from computing anything.
So it’s much more important to have out-of-core functionality.

Anyway, we don’t just need some sort of SQL database system to perform simple read and write operations; we need to be able to apply any function and library to that data, even if intermediate operations need more memory than the available RAM.

Wow, this is hard… Let’s go one by one…

GPU support

It would be quite nice to have faster calculations when doing DataFrame operations. However, in my case the calculations related to data-frame manipulations are rarely the bottleneck when working with in-RAM datasets; it is usually the predictive models, especially training experiments during the design phase.

Improve current support of multi-threading

Similar story, although I do believe that distributing the operations is an important topic for large datasets.

Handling data larger than what fits into RAM

For me, this would be the first thing to do. Then we will need the above performance improvements because of that…

Also, I believe that Julia has what it takes to become a very good Big Data language. It is faster than Java and Scala, and easier to write. Even compared to Dask & co., it makes tools easier to develop and is more consistently performant since it does not need C/C++ under the hood. If Julia can build a strong ecosystem around big data, it will have serious advantages for users.

6 Likes