Can someone demonstrate Julia’s efficiency by explaining how to write code for a machine learning algorithm and then run it in parallel? Can it beat the efficiency of Spark/Scala?
For an example, I would check out Knet.jl.
Knet.jl is a neural net library that takes standard Julia code for NNs and makes it easy to add backprop and parallelize it on a GPU. If you look at its README, you’ll see that it’s mostly just Julia code for a NN that would run in serial without Knet.jl. But then you add their array type and backprop function and you have a full-blown parallelized NN that beats TensorFlow quite easily in benchmarks (this is shown both on their repo and I’ve seen it myself).
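To give a flavor of this, here’s a minimal sketch along the lines of the linear-regression example from an older Knet README (newer Knet versions favor a Param/@diff interface, so treat the exact API as version-dependent). The point is that the model and loss are plain Julia functions:

```julia
using Knet  # provides grad() (via AutoGrad) and the KnetArray GPU array type

# The model and loss are ordinary Julia functions:
predict(w, x) = w[1] * x .+ w[2]
loss(w, x, y) = sum(abs2, y .- predict(w, x)) / size(y, 2)

# grad(loss) returns a function that computes dloss/dw by backprop:
lossgradient = grad(loss)

# Toy data and parameters; wrapping these in KnetArray(...) would move
# the data, and hence the computation, onto the GPU unchanged.
x = randn(Float32, 1, 100)
y = 3f0 .* x .+ 2f0
w = Any[0.1f0 * randn(Float32, 1, 1), zeros(Float32, 1, 1)]

# A plain gradient-descent training loop:
for epoch in 1:50
    dw = lossgradient(w, x, y)
    for i in 1:length(w)
        w[i] -= 0.1f0 * dw[i]
    end
end
```

Swapping the `randn(...)` arrays for `KnetArray(randn(...))` is essentially all it takes to run the same code on a GPU.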
Well yeah, that one’s easy because Spark/Scala doesn’t seem to get good performance at all. I’ll let @anon94023334 talk about that since he loves Scala but doesn’t like the performance you can get out of it. I think a better benchmark is how well something simple like Knet does against popular ML frameworks like TensorFlow and Torch, and the answer is that it compares favorably. Of course, the fact that it’s essentially 99% standard Julia code makes it far more flexible.
I see a lot of people saying Julia has good performance in parallel mode for ML algorithms, but no demo or explanation of this is available anywhere. My inference is that Spark performs better than Julia in terms of time/speed, whereas Julia has lower memory consumption. Please correct me if I am wrong. A sample demo would better convey the actual effectiveness of Julia. Any links or sources are also appreciated.
Click on the link above and it has runnable examples you can copy and paste into the REPL.
I’ll let @anon94023334 comment on the horrors of GraphX and Spark performance.
The comparison isn’t quite correct - Spark and Knet.jl (or similar Julia packages) target pretty different domains. Here are just a couple of the differences:
- Spark is designed for distributed computing; Knet.jl is a single-machine library
- Spark can store data on HDFS and load it to local workers without flooding a network (which quickly becomes a bottleneck on large datasets); as far as I know, currently there’s no native Julia solution for processing distributed on-disk datasets locally
- Spark algorithms are designed for distributed processing using a very restricted set of operations; this gives an advantage on big data across multiple machines, but will most likely be slower on in-memory data
- moreover, not all algorithms can be implemented for a distributed computing system
- as far as I know, there’s no mature project for GPU support on Spark and no easy way to add it
For the same reason I don’t understand what “write code for a machine learning algorithm and then run it in parallel” means. What algorithm? On what data? And what does “parallel” mean here?
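For instance, if “parallel” just means fitting several independent models on local CPU cores, plain Julia covers that with no ML framework at all. Here’s a minimal sketch using the built-in multiprocessing tools (written for current Julia, where they live in the Distributed stdlib; fit_one_model is a hypothetical stand-in for real training code):

```julia
using Distributed
addprocs(4)  # start four local worker processes

@everywhere using Random

# Hypothetical stand-in for "training one model": a least-squares
# fit on randomly generated data.
@everywhere function fit_one_model(seed)
    Random.seed!(seed)
    X = randn(1_000, 10)
    y = randn(1_000)
    return X \ y
end

# Run the ten independent fits in parallel across the workers:
models = pmap(fit_one_model, 1:10)
```

But distributing a single training job over many machines, as Spark does, is a very different problem from this kind of embarrassingly parallel loop.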
Check out Intel HPAT/HiFrames.
That’s cutting-edge stuff and much faster for machine learning/HPC than Spark (Spark is more suited to OLAP-type workloads).
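To show the flavor: the HPAT.jl README of that era had examples roughly like the Monte-Carlo Pi estimate below, where a single macro annotation triggers compilation of the function into parallel (MPI-backed) code. I’m recalling the @acc hpat syntax from memory, so treat the exact API as an assumption and check the repo:

```julia
using HPAT

# @acc hpat compiles the annotated function for distributed/parallel
# execution (MPI under the hood) instead of running it serially.
@acc hpat function calc_pi(n)
    x = rand(n) .* 2.0 .- 1.0
    y = rand(n) .* 2.0 .- 1.0
    # fraction of random points that fall inside the unit circle:
    return 4.0 * sum((x .^ 2 .+ y .^ 2) .<= 1.0) / n
end

println("pi ≈ ", calc_pi(10^8))
```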
For “pure” Julia there’s also Dagger (a clone of Python’s Dask), a framework for out-of-core and parallel execution: https://github.com/JuliaParallel/Dagger.jl
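A taste of Dagger, following the delayed-style API from its README at the time (newer versions also offer a Dagger.@spawn macro):

```julia
using Dagger

add1(x) = x + 1
add2(x) = x + 2
combine(a...) = sum(a)

# delayed(f) builds a lazy task-graph node instead of calling f directly;
# nothing runs until the graph is collected.
p = delayed(add1)(4)
q = delayed(add2)(p)
r = delayed(add1)(3)
s = delayed(combine)(p, q, r)

collect(s)  # == 16; schedules the graph, potentially across workers
```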
Steven, the HPAT stuff looks soooo interesting.
The README refers to Julia 0.4, and I note the latest commit is “change Julia require to 0.5”.
I really hope this project is actively developed. Big pat on the back to the developers!
It’s definitely still active (although they are working more on HiFrames now), but due to changes in Julia’s internals it is never quite up to date with the latest bleeding-edge Julia version.
BTW: There’s also a Python version in the making.
Steven, thank you for taking the trouble to reply. I have also sent the developer an email.
Not a whole lot to add to @dfdx’s comparison, but my experience with Spark has been with GraphX and GraphFrames (and a little bit with the new MLlib).
For large graphs, Spark is good if you want a guarantee that eventually, at some undetermined point in the future, your data will be processed. It is not efficient, and makes no claims to be as far as I know, but it does “handle” very large data sets (in the sense that “not handling” them leads to abends).
For anything requiring performance, I’d stay away from Spark. It’s too hard to predict performance characteristics (visualvm just won’t cut it, and the built-in job analyzer doesn’t give you what you need in an easily-digestible format), and it’s too hard to reason about design tradeoffs that are intended to improve performance. Also, it looks as if the really active development has stalled a bit, especially with GraphX (GraphFrames may be a different story but the GF Pregel code, at least last I checked, is still calling out to GraphX).
I do love Scala and try to code in it wherever I can, but Spark is a different beast altogether.