Hello everyone,
TL;DR: I'm looking for Julia alternatives to Dask and Dask-cuDF.
I have to analyze a very large dataset for my social science degree. I'm using Python, pandas, and Dask to parallelize everything. I have been keeping an interested eye on Julia as it has matured. Is there a Dask alternative that lets pandas-like functions (DataFrames.jl) run in parallel and, more importantly, is there a similar dataframe package that can do the processing on my GPU (RTX 2070)? Thank you!
JuliaDB.jl is almost certainly what you'll want, although it does not currently have GPU support. Could you give me an example of which operations you've found to be faster on the GPU? I'm interested in setting aside some time to add GPU support to JuliaDB for operations that are significantly faster on GPUs.
Thanks for the quick reply, and forgive my (mostly) uninformed thoughts.
I'll have to check out JuliaDB. As for the GPU, I'm not really sure what runs so much faster there than on the CPU, since I have basically no computer science training. I taught myself R and now Python. In my limited usage, the essentials would be groupby operations, joins, splits, means, sums, and other similar simple calculations: pretty much any basic function supported by pandas or by dplyr-like functions in R. Unfortunately, I've already run into bugs with Dask-cuDF (it's still pretty new), so I didn't get to see how much of a speedup it would have given me overall, but for the few functions I was able to get working I was seeing 6-20x speed increases over my 16-thread CPU. It also looks like the Dask people are trying to implement a lot of the pandas functions with cuDF, so I would imagine that a lot of it would be faster on the GPU.
It looks like some people are already working with Julia on GPUs (the JuliaGPU organization on GitHub). Again, I don't really know what I'm doing, but maybe you could use some of their code and integrate it into JuliaDB to give GPU access to users who have one.
I do have another question for you: are there any plans to integrate Arrow-based columnar data handling in JuliaDB? Thanks again.
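For anyone following along, here is a minimal sketch of why operations like a grouped mean parallelize well. It is plain standard-library Python, not how Dask, cuDF, or JuliaDB actually implement it; the toy `partitions` data and the `partial_sums` and `combine` helpers are made up for illustration. Each partition is reduced to per-group sums and counts independently, and only those small partial results are merged at the end.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: (group, value) rows already split into partitions,
# roughly the way a partitioned dataframe is split across workers.
partitions = [
    [("a", 1.0), ("b", 2.0), ("a", 3.0)],
    [("b", 4.0), ("c", 5.0)],
    [("a", 5.0), ("c", 7.0)],
]

def partial_sums(rows):
    """Per-partition step: running sum and count for each group."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return dict(acc)

def combine(partials):
    """Merge step: add up the partial sums and counts, then divide."""
    totals = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for key, (s, n) in part.items():
            totals[key][0] += s
            totals[key][1] += n
    return {key: s / n for key, (s, n) in totals.items()}

# Threads are used only to show the structure; in CPython they will not
# speed up this pure-Python loop. Real libraries use processes, clusters,
# or GPU kernels for the per-partition step.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_sums, partitions))

group_means = combine(partials)
print(group_means)
```

Sums, counts, and means fit this pattern directly, whereas joins and sorts generally need data to move between partitions, which is harder to scale.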
Dagger.jl should provide data parallelism, while Dispatcher.jl provides asynchronous/parallel task scheduling; caching support for Dispatcher is available as well in DispatcherCache.jl.
AFAIK none of them have GPU support. On the other hand, GPU data processing seems like a bit of feature creep.
Dagger already underlies JuliaDB, but JuliaDB is the better match for the OP's question (w.r.t. dataframe-like functionality, which Dagger does not have). Dispatcher also appears to be very similar to Dagger, since both implement task schedulers (which are task-parallel), although Dagger also has its own array interface, which is data-parallel.
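To make the task-parallel versus data-parallel distinction concrete, here is a small sketch in standard-library Python. It is only an analogy and does not use the Dagger or Dispatcher APIs; the `data` list and the chunk size are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(8))  # hypothetical input

with ThreadPoolExecutor() as pool:
    # Task parallelism: different, independent tasks are scheduled
    # and may run at the same time.
    future_total = pool.submit(sum, data)
    future_max = pool.submit(max, data)
    task_results = (future_total.result(), future_max.result())

    # Data parallelism: the same operation is applied to each chunk
    # of one dataset, and the per-chunk results are combined.
    chunks = [data[i:i + 4] for i in range(0, len(data), 4)]
    chunk_sums = list(pool.map(lambda c: sum(x * x for x in c), chunks))

sum_of_squares = sum(chunk_sums)
print(task_results, sum_of_squares)
```

A task scheduler only needs to know which tasks depend on which; a data-parallel array or table interface additionally knows how the data is split, which is what dataframe-style operations build on.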
EDIT: GPU support is much closer to being feature creep for Dagger, but I feel it is well suited to JuliaDB.
To me it feels like feature creep in itself. GPU scalability is very rare (more than 2 GPU machines), whereas 50+ cores or servers are easier to come by and much more useful.
I can see that line of thinking, and it is true. However, GPU computation is becoming more common in data science, and the differences in speed are truly incredible. I realize that I can't contribute, so I'm not a good person to discuss the practicalities of doing this, but if it could be done there would certainly be major benefits. Below I've included some benchmarks comparing the same analysis on CPUs and GPUs; the speed increases might be worth investigating.
That is quite impressive, even though I believe it is more of a contrived example. Rocklin is pretty good at marketing.
In the end, it is a question of cost: unless one needs to turn out massive models every few hours, the added complexity, technical debt, and additional human-related costs make GPUs not worth it for most businesses (cloud or not). Julia has actually made huge leaps in reducing GPU cost; it is an extraordinary deflationary language, IMHO.
A standardized and open tensor processing unit a la x87 would be interesting though.
All true. Before I ran into errors, I was able to get about a 10x speedup over my 16-thread CPU doing the same tasks.
If I were able to contribute, I might push for this a bit more, but since I can't, I'll just keep an eye on JuliaDB. At least for now, Dask-cuDF seems to be the only way to go for what I want.