Future directions for DataFrames.jl

Jumping in here to strongly second the point about Dask in particular. As simple and elegant as the partitioning model is in theory, in practice I’ve found Dask’s implementation to be an extremely leaky abstraction. Stare at it funny, and the runtime will decide to run everything serially or materialize all of your data anyways. Hopefully DTable does not experience the same pathologies, but I’ll have to claim ignorance there :slight_smile:

2 Likes

This is essentially what I discussed in [ANN] Folds.jl: threaded, distributed, and GPU-based high-level data-parallel interface for Julia and

KernelAbstractions and Dagger are great in that they are very flexible and generic. But a higher-level, more structured representation helps with lowering to target-specific programs. As always, adding more constraints helps encode the structure of your code, which the underlying framework can then exploit for performance and composability.
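
To make that concrete, here is a minimal sketch (my own illustration, not from the Folds.jl docs) of the idea: the reduction is stated once at a high level, and the executor argument is the extra “structure” that lets the framework lower it to different backends. I’m assuming the executor types exported by Transducers.jl.

```julia
using Folds
using Transducers: SequentialEx, DistributedEx

xs = 1:1_000_000

# Same high-level algorithm, three different lowerings:
Folds.sum(x -> x^2, xs)                   # default executor (threaded)
Folds.sum(x -> x^2, xs, SequentialEx())   # plain serial execution
Folds.sum(x -> x^2, xs, DistributedEx())  # multi-process execution
```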

14 Likes

I think scaling to about 10x RAM is still doable, especially with new tech like 3D XPoint drives becoming more common. For example, I have played with the Fannie Mae data (200GB in CSV format) and can manipulate it using {disk.frame} in R, which works on the principle of splitting the data into smaller files and using multiple processes to work on them in parallel.
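
As a rough illustration of that chunk-and-process idea in Julia (just a sketch; the directory layout and column names below are made up):

```julia
using Distributed
addprocs(4)                                   # spin up a few worker processes
@everywhere using CSV, DataFrames

# Assume the big CSV has already been split into smaller chunk files.
chunk_files = readdir("fannie_mae_chunks"; join = true)

# Each worker aggregates one chunk; only the small per-chunk results
# ever live in the driver's memory.
partials = pmap(chunk_files) do file
    df = CSV.read(file, DataFrame)
    combine(groupby(df, :state), :loan_amount => sum => :total)
end

# Combine the per-chunk aggregates into the final result.
result = combine(groupby(reduce(vcat, partials), :state), :total => sum => :total)
```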

Beyond that size you are probably better off with a cluster-based approach, but that should be a last resort. Most companies don’t need Spark, but they end up paying the Spark tax just so they can say “big data”.

Yeah. I did a quick study of the possibility of doing this with groupby. There’s a part of the groupby code that materializes an index in memory, which may not fit well with out-of-core datasets. See my research notes.

So for groupby to do that, that part of the code might need to be made generic, which can put pressure on readability and maintainability. For now, I think it’s a case of letting a hundred flowers bloom and letting people try out multiple approaches. I am keen to try Dagger.DTable soon.

Huge fan of DataFrames.jl—I think it’s much more ergonomic than eg Pandas.

I sometimes see Database-like ops benchmark mentioned when folks compare R and Julia, and it seems like in most cases DF.jl stacks up well, but does OOM more frequently on certain benchmarks. Perhaps focusing on memory footprint would be a useful intermediate step towards supporting larger and larger data sizes?

Also interesting to see how poorly Dask does from a performance standpoint.

6 Likes

That PR, though, bounces you over to What metadata should be #2276, which then bounces you over to DataAPI metadata method #22

And this is the issue: the problem is complex at the API level. I have agreed with @pdeffebach to try to come up with a clean proposal this year.

https://github.com/h2oai/db-benchmark/pull/232 should fix this at least partially, and in the future what this PR does will be done automatically. We just need to wait for the benchmark maintainers to merge it and run the benchmarks to see the impact.

2 Likes

Here I have summarized the state of multi-threading support in DataFrames.jl.

Further work on multi-threading will extend the range of operations in which it is used automatically.

Manual multi-threading is already fully supported, since internally we only use task-based multi-threading constructs. Therefore users can simply use the standard Julia functionality to perform manual multi-threading.
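
For example, something along these lines already works today (a sketch with made-up column names, using only standard Julia threading constructs):

```julia
using DataFrames
using Base.Threads: @threads

df = DataFrame(group = rand(1:10, 1_000_000), x = rand(1_000_000))
gd = groupby(df, :group)

# Each task handles a disjoint group, so no locking is needed.
sums = Vector{Float64}(undef, length(gd))
@threads for i in 1:length(gd)
    sums[i] = sum(gd[i].x)
end
```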

4 Likes

My priority:

  • Make It Work = handling data larger than what fits into RAM
  • Make It Right
  • Make It Fast = improve current support of multi-threading; GPU support

Make It Work Make It Right Make It Fast

This formulation of this statement has been attributed to Kent Beck; it has existed as part of the Unix Way for a long time. See Butler Lampson’s “Hints for Computer System Design” (1983) http://research.microsoft.com/en-us/um/people/blampson/33-hints/webpage.html and Stephen C. Johnson and Brian W. Kernighan’s “The C Language and Models for Systems Programming” in Byte magazine (August 1983) (“the strategy is definitely: first make it work, then make it right, and, finally, make it fast.”)

4 Likes

On March 16, 2021, Micron announced that it would cease development of 3D XPoint in order to develop products based on Compute Express Link (CXL).

seems like a rapidly changing field and idk if “your local cluster” will have these SSDs…

1 Like

dplyr + tidyr + purrr

2 Likes

woah… there hasn’t been wide adoption of 3D XPoint yet. Gonna read up on CXL.

But faster disks just mean better out-of-core performance. Big memory is eating big data.

Thanks

Intel might still be invested in 3D XPoint, though.

That’s a good benchmark, current at least for Julia’s DF, but for out-of-core I found this: Benchmarking disk.frame 2: disk.frame beats Dask again! disk.frame beats JuliaDB again! Anyone else wanna challenge? • disk.frame

I didn’t notice a time-stamp, so I’m not sure how current it is. Note that Dagger.jl is a dependency of JuliaDB; is that currently the best out-of-core support? Which I do not need, currently…

For under $1000, you can RAID0 4x NVMe drives… I have a 4TB array with sustained 7-8GB/s (i.e. disk-filling) read/write using Samsung 970 Pro drives (note that many more recent drives typically optimize for consumer workflows of high burst but low sustained throughput).

Top-of-the-line interconnects are 200Gb/s (25GB/s), and that’s a theoretical max that is usually shared, so I’d be surprised if a cluster approach would be faster until jobs are compute-bound.

I routinely read 100GB of data into RAM in about 15 seconds. Personally, I would probably choose a beefy machine over a cluster for data up to ~5TB, but it depends on the specific computation workflow, of course.

1 Like

I think there’s still room for improvement in the syntax and ergonomics. It’s quite clunky to do many basic data transformations compared to something like kdb+/q.
It would also be really nice to have type stability so that one can use dataframes anywhere without worrying about a huge drop in performance.
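
On the type-stability point: DataFrame deliberately does not encode column types in its own type, so the usual workaround today is a function barrier. A minimal sketch (my own illustration, not an official DataFrames.jl recommendation):

```julia
using DataFrames

# Inside this function the column's concrete type is known, so the loop
# compiles to fast, specialized code.
function sum_of_squares(x::AbstractVector)
    s = zero(eltype(x))
    for v in x
        s += v^2
    end
    return s
end

df = DataFrame(x = rand(10^6))
sum_of_squares(df.x)   # the call acts as a type-stabilizing barrier
```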

DataFramesMeta.jl provides macros that construct src => fun => dest transformations with a more convenient syntax. It’s maintained by the JuliaData collaborators, so it has a high bus factor.

Take a look at the tutorial here.
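
For a quick flavor of what that looks like (column names made up; the exact macro syntax depends on your DataFramesMeta.jl version):

```julia
using DataFrames, DataFramesMeta

df = DataFrame(a = 1:3, b = [10, 20, 30])

# Plain DataFrames.jl mini-language:
transform(df, [:a, :b] => (+) => :c)

# The same transformation via a DataFramesMeta macro:
@transform(df, :c = :a + :b)
```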

3 Likes

Not having GPU or multi-threading support just makes your computation slower; being unable to handle larger datasets completely prevents you from computing anything.
So it’s much more important to have out-of-core functionality.

Anyway, we don’t just need some sort of SQL database system to perform simple read and write operations; we need to be able to apply any function and library to that data, even if intermediate operations need more memory than the available RAM.

Wow, this is hard… Let’s go one by one…

GPU support

It would be quite nice to have faster calculations when doing DataFrame operations. However, in my case the calculations related to data-frame manipulations are rarely the bottleneck when working with in-RAM datasets; it is usually the predictive models, especially training experiments during the design phase.

Improve current support of multi-threading

Similar story, although I do believe that distributing the operations is an important topic for large datasets.

Handling data larger than what fits into RAM

For me, this would be the first thing to do. Then we will need the above performance improvements because of that…

Also, I believe that Julia has what it takes to become a very good Big Data language. It is faster than Java and Scala, and easier to write. Even compared to Dask & co., it makes tools easier to develop and is more consistently performant since it does not need C/C++ under the hood. If Julia can build a strong ecosystem around big data, it will have serious advantages for users.

6 Likes