Dask and Dask-cuDF Julia alternative?

ldsands · July 16, 2019, 10:25pm

Hello everyone,

TLDR: looking for information on Julia alternatives to Dask and Dask-cuDF

I have to analyze a very large dataset for my social science degree. I’m using Python, pandas, and Dask to parallelize everything. I have been keeping an eye on Julia with interest as it has matured. I was wondering whether there is a Dask alternative that lets pandas-like functions (DataFrames.jl) run in parallel and, more importantly, whether there is a similar dataframe package that allows processing on my GPU (RTX 2070). Thank you!

jpsamaroo · July 17, 2019, 1:36am

JuliaDB.jl is almost certainly what you’ll want, although it does not currently have GPU support. Could you give me an example of which operations you’ve found to be faster on the GPU? I’m interested in setting aside some time to add GPU support to JuliaDB for operations that are significantly faster on GPUs.

ldsands · July 17, 2019, 2:29am

Thanks for the quick reply and forgive my (mostly) uninformed thoughts.

I’ll have to check out JuliaDB. As for the GPU, I’m not really sure what runs so much faster there than on the CPU, as I have basically no computer science training. I’ve taught myself R and now Python. In my limited usage, however, a few essential things that would be needed are group-by operations, joining, splitting, means, sums, and other similar simple calculations: pretty much any basic function supported by pandas, or the dplyr-like functions from R. Unfortunately, I’ve already run into bugs with dask-cuDF (it’s still pretty new), so I didn’t get to see just how much of an increase it would have given me, but from the few functions I was able to get working I was seeing 6-20x speed increases over my 16-thread CPU. It also looks like the Dask people are trying to implement a lot of the pandas functions with cuDF, so I would imagine that a lot of it would be faster on the GPU.
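To make the kind of operation I mean concrete: a grouped mean can be computed by splitting the rows into partitions, reducing each partition to per-group sums and counts, and then merging those partial results. The sketch below uses only the Python standard library and made-up data. The function names are hypothetical, and this is only an illustration of the general pattern, not how Dask, cuDF, or JuliaDB actually implement it.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partial_stats(rows):
    """Per-partition step: accumulate [sum, count] for each group key."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return dict(acc)


def combine(partials):
    """Merge step: add up the partial sums and counts, then divide."""
    total = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for key, (s, n) in part.items():
            total[key][0] += s
            total[key][1] += n
    return {key: s / n for key, (s, n) in total.items()}


def grouped_mean(rows, n_partitions=4):
    # Split the rows into n_partitions interleaved chunks.
    chunks = [rows[i::n_partitions] for i in range(n_partitions)]
    # Threads stand in for workers here; a real system would hand the
    # chunks to separate processes, machines, or a GPU.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_stats, chunks))
    return combine(partials)


# Made-up example data: (group, value) pairs.
rows = [("a", 1.0), ("b", 10.0), ("a", 3.0), ("b", 20.0), ("a", 5.0)]
means = grouped_mean(rows)
print(means)
```

Sums and counts are merged rather than per-partition means because a mean of means is only correct when every partition holds the same number of rows per group.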

It looks like some people are already working with Julia on GPUs (JuliaGPU · GitHub). Again, I don’t really know what I’m doing, but maybe you could use some of their code and integrate it into JuliaDB to give GPU access to users who have one.

I do have another question for you: are there any plans to integrate Arrow-based columnar data handling into JuliaDB? Thanks again.

zgornel · July 17, 2019, 6:46am

Dagger.jl should provide data parallelism, while Dispatcher.jl provides asynchronous/parallel task scheduling; caching support for Dispatcher is available as well in DispatcherCache.jl.
AFAIK none have GPU support. On the other hand, GPU data processing seems a bit of a feature creep.

jpsamaroo · July 17, 2019, 6:58pm

Dagger already underlies JuliaDB, but JuliaDB better matches the OP’s question (w.r.t. dataframe-like functionality, which Dagger does not have). Dispatcher also appears to be very similar to Dagger, since both implement task schedulers (which are task-parallel), although Dagger also has its own array interface, which is data-parallel.

EDIT: GPU support is much closer to being feature creep for Dagger, but I feel it is well suited to JuliaDB.

zgornel · July 17, 2019, 9:39pm

To me it feels like feature creep in itself. Scaling out on GPUs is very rare (machines with more than 2 GPUs are uncommon), yet 50+ cores or servers are easier to come by and much more useful.

ldsands · July 18, 2019, 1:32pm

I can see that thinking, and it is true. However, GPU computation is becoming more common in data science, and the differences in speed are truly incredible. I realize that I can’t contribute, so I’m not a good person to discuss the practicalities of doing this, but if it could be done there are certainly major benefits. Below I’ve included some benchmarks from a comparison of an analysis run on CPUs vs. GPUs; the increases in speed might be worth investigating.

Architecture	Time
Single CPU Core	2hr 39min
Forty CPU Cores	11min 30s
One GPU	1min 37s
Eight GPUs	19s
Source: GPU Dask Arrays, first steps
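For anyone who wants the ratios rather than the raw times, here is a small sketch that converts the timings in the table to seconds and computes the speedups. The numbers are taken straight from the table above; the helper name `to_seconds` and the variable names are just for illustration.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert a duration given in hours/minutes/seconds to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Timings copied from the benchmark table above.
timings = {
    "Single CPU Core": to_seconds(hours=2, minutes=39),
    "Forty CPU Cores": to_seconds(minutes=11, seconds=30),
    "One GPU": to_seconds(minutes=1, seconds=37),
    "Eight GPUs": to_seconds(seconds=19),
}

# Speedup of each setup relative to a single CPU core.
baseline = timings["Single CPU Core"]
speedups = {name: baseline / t for name, t in timings.items()}

# One GPU compared directly against forty CPU cores.
gpu_vs_40_cores = timings["Forty CPU Cores"] / timings["One GPU"]

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x vs. single core")
print(f"One GPU vs. forty cores: {gpu_vs_40_cores:.1f}x")
```

Relative to a single core, that works out to roughly 14x for forty cores, 98x for one GPU, and 500x for eight GPUs, with one GPU about 7x faster than forty cores.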

zgornel · July 18, 2019, 2:34pm

That is quite impressive, even though I believe it is more of a contrived example. Rocklin is pretty good at marketing.

In the end, it is a question of costs: unless one needs to produce massive models every few hours, the added complexity, technical debt, and additional human-related costs make GPUs not worth it for most businesses (cloud or not). Julia has actually made huge leaps in reducing GPU cost; it is an extraordinarily deflationary language, imho.

A standardized and open tensor processing unit à la x87 would be interesting, though.

ldsands · July 18, 2019, 3:03pm

All true. Before I ran into errors, I was able to get about a 10x speedup over my 16-thread CPU doing the same tasks.

If I were able to contribute, I might push for this a bit more, but since I can’t, I’ll just keep an eye on JuliaDB. At least for now, dask-cuDF seems to be the only way to go for what I want.
