How to use Julia for implementing machine learning algorithms in parallel

Devi_Sree · August 11, 2017, 8:54am

Can someone demonstrate Julia’s efficiency by explaining how to write code for machine learning algorithm and then run them in parallel? Can it beat the efficiency of Spark with Scala?

ChrisRackauckas · August 11, 2017, 9:03am

For examples, I would check out Knet.jl

Knet.jl is a neural net library which basically takes standard Julia code for NNs and makes it easy to add backprop and parallelize it on a GPU. If you look at its README, you’ll see that it’s mostly just Julia code for a NN that would run in serial without Knet.jl. Then you add their array type and backprop function and you have a full-blown parallelized NN that beats TensorFlow quite easily in benchmarks (this is shown on their repo and I’ve noticed it myself).
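To make the "mostly just Julia code" point concrete, here is a minimal sketch in plain Julia that does not use Knet.jl at all: a linear model trained by gradient descent with a hand-written gradient. It assumes Julia 1.x and only the standard library. The names `predict`, `loss`, `lossgrad` and `train` are illustrative choices, not Knet's API. With Knet, the hand-written gradient would be replaced by its backprop function and the arrays by its GPU array type, as described above.

```julia
using Random, Statistics

# Linear model: w is a 1×d weight row, b a scalar bias, x a d×n batch of inputs.
predict(w, b, x) = w * x .+ b

# Mean squared error over the batch.
loss(w, b, x, y) = mean(abs2, predict(w, b, x) .- y)

# Hand-written gradient of the loss with respect to w and b.
function lossgrad(w, b, x, y)
    r = predict(w, b, x) .- y        # 1×n residuals
    n = size(x, 2)
    gw = (2 / n) .* (r * x')         # 1×d gradient for the weights
    gb = (2 / n) * sum(r)            # scalar gradient for the bias
    return gw, gb
end

# Plain serial gradient descent starting from zero weights.
function train(x, y; lr = 0.1, epochs = 500)
    w = zeros(1, size(x, 1))
    b = 0.0
    for _ in 1:epochs
        gw, gb = lossgrad(w, b, x, y)
        w -= lr .* gw
        b -= lr * gb
    end
    return w, b
end

# Synthetic data generated from known weights (illustrative values).
Random.seed!(0)
x = randn(3, 200)
wtrue = [1.0 -2.0 0.5]
y = wtrue * x .+ 0.3

w, b = train(x, y)
println("final loss: ", loss(w, b, x, y))
```

Nothing here is specific to a framework, which is the point being made: the model and loss are ordinary functions over ordinary arrays.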

As for beating Spark/Scala: well yeah, that one’s easy because Spark/Scala doesn’t seem to get good performance at all. I’ll let @anon94023334 talk about that since he loves Scala but doesn’t like the performance you can get out of it. I think a better benchmark is how well something simple like Knet does against popular ML frameworks like TensorFlow and Torch, and the answer is that it’s on the good side. Of course, being essentially 99% standard Julia code also makes it infinitely more flexible.

Devi_Sree · August 11, 2017, 9:57am

I see a lot of people saying Julia has good performance in parallel mode for ML algorithms, but no demo or explanation of this seems to be available anywhere. My inference is that Spark performs better than Julia in terms of time/speed, whereas Julia has comparatively lower memory consumption. Please correct me if I am wrong. A sample demo would better convey the actual effectiveness of Julia. Any links or sources would also be welcome.

ChrisRackauckas · August 11, 2017, 10:03am

Click on the link above and it has runnable examples you can copy and paste into the REPL.

I’ll let @anon94023334 comment on the horrors of GraphX and Spark performance.

dfdx · August 11, 2017, 11:22am

The comparison isn’t quite correct - Spark and Knet.jl (or similar Julia packages) target pretty different domains. Here are just a couple of differences:

Spark is designed for distributed computing; Knet.jl is a single-machine library
Spark can store data on HDFS and load it to local workers without flooding the network (which quickly becomes a bottleneck on large datasets); as far as I know, there’s currently no native Julia solution for processing distributed on-disk datasets locally
Spark algorithms are designed for distributed processing using a very restricted set of operations; this gives an advantage on big data and multiple machines, but will most likely be slower on in-memory data
moreover, not every algorithm can be implemented for a distributed computing system
as far as I know, there’s no mature project for supporting GPUs on Spark and no easy way to add it

For the same reason I don’t understand what “write code for machine learning algorithm and then run them in parallel” means. What algorithm? On what data? What does “parallel” mean?
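As one concrete reading of “parallel”, here is a minimal sketch of data-parallel gradient descent across local worker processes, using Julia's `Distributed` standard library. It assumes Julia 1.x (in the 0.6 release current at the time of this thread, these functions lived in Base). The helper names `chunkgrad` and `fit`, the toy data, and the learning rate are all hypothetical choices for illustration. Each worker computes the gradient on its own chunk of the data and the master sums the partial results.

```julia
using Distributed
addprocs(2)                      # start two local worker processes

@everywhere begin
    # Gradient of the summed squared error for the model y ≈ w*x on one chunk.
    chunkgrad(w, xs, ys) = sum(2 * (w * x - y) * x for (x, y) in zip(xs, ys))
end

# Toy data with a known slope of 3.0 (illustrative values).
xs = collect(range(-1, 1; length = 1000))
ys = 3.0 .* xs

# Split the data into one (inputs, targets) chunk per worker.
chunksize = cld(length(xs), nworkers())
datachunks = [(xs[idx], ys[idx]) for idx in Iterators.partition(1:length(xs), chunksize)]

function fit(datachunks, n; lr = 0.5, epochs = 100)
    w = 0.0
    for _ in 1:epochs
        # Each chunk's gradient is computed on a worker; results come back as a vector.
        grads = pmap(c -> chunkgrad(w, c[1], c[2]), datachunks)
        w -= lr * sum(grads) / n     # average gradient over all n samples
    end
    return w
end

w = fit(datachunks, length(xs))
println("fitted slope: ", w)

rmprocs(workers())               # shut the workers down again
```

This runs on a single machine with several processes. It says nothing about the on-disk, multi-machine data locality that Spark provides, which is the distinction drawn in the list above. For a problem this small, the communication cost will outweigh any gain from the workers.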

Steven_Sagaert · August 11, 2017, 12:00pm

Check out Intel HPAT/HiFrames.

That’s cutting edge stuff and much faster for machine learning/HPC than Spark (Spark is more suited for OLAP type stuff).

For “pure” Julia there’s also Dagger.jl (modeled on Dask from Python): JuliaParallel/Dagger.jl on GitHub, a framework for out-of-core and parallel execution.

John_Hearns · August 11, 2017, 12:19pm

Steven, the HPAT stuff looks soooo interesting.
The README refers to Julia 0.4 and I note the latest commit is “change Julia require to 0.5”
I really hope this project is actively developed. Big pat on the back to the developers!

Steven_Sagaert · August 11, 2017, 1:10pm

It’s definitely still active (although they are working more on HiFrames now), but due to changes in the Julia internals it is never immediately up to date with the latest bleeding-edge Julia version.

BTW: There’s also a Python version in the making.

John_Hearns · August 11, 2017, 1:43pm

Steven, thank you for taking the trouble to reply. I have also sent the developer an email.

anon94023334 · August 11, 2017, 3:03pm

Not a whole lot to add to @dfdx’s comparison, but my experience with Spark has been with GraphX and GraphFrames (and a little bit with the new MLlib).

For large graphs, Spark is good if you want a guarantee that eventually, at some undetermined point in the future, your data will be processed. It is not efficient, and makes no claims to be as far as I know, but it does “handle” very large data sets (in the sense that “not handling” them leads to abends).

For anything requiring performance, I’d stay away from Spark. It’s too hard to predict performance characteristics (visualvm just won’t cut it, and the built-in job analyzer doesn’t give you what you need in an easily-digestible format), and it’s too hard to reason about design tradeoffs that are intended to improve performance. Also, it looks as if the really active development has stalled a bit, especially with GraphX (GraphFrames may be a different story but the GF Pregel code, at least last I checked, is still calling out to GraphX).

I do love Scala and try to code in it wherever I can, but Spark is a different beast altogether.
