I was just thinking about what Julia needs in order to compete with H2O and Spark and become a serious contender in enterprise ML.
I think JuliaDB.jl can handle the data manipulation aspects, and OnlineStats.jl can handle summary statistics and fit GLMs on streaming/big data. So I think we still need:
- Parallel/Online/out-of-core Trees + Random Forests
- Parallel/Online/out-of-core K-means
- Parallel/Online/out-of-core XGBoost and LightGBM etc. (these are tree-based too, but are more popular recent choices)
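To illustrate why OnlineStats.jl is a good foundation for the streaming/parallel side of this: its statistics are mergeable, so you can fit each data partition independently and combine the partial results. A minimal sketch (the `randn` chunks are stand-ins for real data partitions; this assumes the current `fit!`/`merge!` API):

```julia
using OnlineStats

chunks = [randn(10_000) for _ in 1:4]   # stand-in for out-of-core data partitions

# fit each partition independently -- these fits could run on separate workers
partials = [fit!(Mean(), c) for c in chunks]

# mergeable statistics combine into the full-data result
combined = reduce(merge!, partials)
value(combined)   # ≈ mean of all the data
```

The same fit-then-merge pattern is what makes an algorithm "parallel/online" in the sense above, and it is exactly what's missing for trees, random forests and k-means.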
I'm not sure about SVMs; I very rarely use them. The only time I do is for some benchmarking, and they're usually beaten by neural networks, which should be covered by Mocha.jl and Knet.jl (their websites bill them as deep learning, but I guess I can go shallow, e.g. 2 layers, with them). Flux.jl seems quite capable, but I am not sure how well suited Flux, Mocha and Knet are to online and parallel algorithms, as I am not familiar with them.
Only pure Julia implementations should be included. JuML.jl has made a promising start on XGBoost, which I have tested. It claims to be faster than the C++ implementation in some cases, although I would love to see a side-by-side test for myself.