I was just thinking about what Julia needs in order to compete with H2O and Spark and be a serious contender in enterprise ML.
I think JuliaDB.jl can handle the data manipulation aspects, and OnlineStats.jl can handle summary stats and fitting GLMs on streaming/big data, etc. So I think we still need:
Parallel/Online/out-of-core Trees + Random Forests
Parallel/Online/out-of-core K-means
Parallel/Online/out-of-core XGBoost and LightGBM etc (these are tree-based as well, but more popular recent choices)
Not sure about SVMs; I very rarely use them. The only time I do is for some benchmarking, and they are usually beaten by neural networks, which should be covered by Mocha.jl and Knet.jl (their websites pitch them as deep learning, but I can go shallow, e.g. 2 layers, with them I guess). Flux.jl also seems quite capable, but I am not sure how well suited Flux, Mocha and Knet are to online and parallel algorithms, as I am not familiar with them.
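For what it’s worth, “going shallow” in Flux is just a short Chain. A hedged sketch (the layer sizes here are arbitrary, and the usual training loop is not shown):

using Flux

# a 2-layer (one hidden layer) network over 10 input features
model = Chain(Dense(10, 32, relu), Dense(32, 1))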
Only pure Julia implementations should be included. JuML.jl has made a promising start on XGBoost, which I have tested. It claims to be faster than the C++ implementation in some cases, although I would love to see a side-by-side test for myself.
Stochastic gradient descent and the like are a form of online algorithm, and it becomes out-of-core simply by pulling batches from the dataset as you need them. With Flux and Knet you write the code for iterating over the dataset yourself, so making it out-of-core is on the user; a helper function could probably cover this (or maybe JuliaDB makes it happen naturally).
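A minimal sketch of that batching idea, using OnlineStats’ SGD-based StatLearn as a stand-in model (the file names, batch size and fixed-size binary layout are assumptions made purely for illustration):

using OnlineStats, LossFunctions, PenaltyFunctions

p, batchsize = 10, 10_000
o = StatLearn(p, .5 * L2DistLoss(), L2Penalty(), fill(.1, p), SGD())
s = Series(o)

xbuf = zeros(batchsize, p)      # reusable batch buffers
ybuf = zeros(batchsize)
open("X.bin") do fx
    open("y.bin") do fy
        while !eof(fx)          # assumes the files hold a whole number of batches
            read!(fx, xbuf)     # pull just one batch into memory (column-major layout assumed)
            read!(fy, ybuf)
            fit!(s, xbuf, ybuf) # online update; nothing else is retained
        end
    end
end
coef(o)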
JuML.jl looks like some great work, but someone needs to take the time to help the author get into the community and turn this into reproducible software. Some things to start with: it needs unit tests, it needs to get registered, and it should make use of standard tooling. I noticed that it recreates other parts of the ecosystem, like dataframes:
If anyone has the time to help them get this fixed up, that’s a pretty easy way to help Julia (since this repo is already written and just needs some tweaks!).
OnlineStats seems to be on its way to handle most, if not all, of what you are asking for. Implementing trees is in progress, K-means is already there, as is linear SVM. Both ensembles and gradient boosting should be easy to add once tree fitting itself is finalized. And OnlineStats is also integrated with JuliaDB, so things are looking quite good.
Hopefully the other side of the tooling, for things such as feature creation and selection, hyperparameter tuning, model selection, etc., will also start to develop more rapidly once Julia 0.7/1.0 lands. I’m talking about things like featuretools or TPOT in Python, or mlr in R. Currently even some of the relatively basic things for model evaluation seem to be missing in Julia (or rather, I’m sure many things exist in one form or another, but they are not well organized/integrated).
It’s a bit of work, but it looks like Julia could be the way to go. It is much easier than Python or R for creating advanced customized models, e.g. a new tree-splitting algorithm and a specialist loss function.
The other thing I would mention is simply the API in OnlineStats.jl for simple regression. You need:
using OnlineStats, LossFunctions, PenaltyFunctions

x = randn(100_000, 10)
y = x * linspace(-1, 1, 10) + randn(100_000)

# loss, penalty, per-coefficient rates, and the SGD algorithm are all explicit
o = StatLearn(10, .5 * L2DistLoss(), L1Penalty(), fill(.1, 10), SGD())
s = Series(o)
fit!(s, x, y)
coef(o)
predict(o, x)
to fit and predict a linear regression, so wrapping that in an osLm or osGLM would be quite cool.
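Something like this hypothetical wrapper (the name osLm is taken from the suggestion above; the defaults are invented here just to illustrate the shape):

using OnlineStats, LossFunctions, PenaltyFunctions

# hypothetical convenience wrapper; keyword defaults are arbitrary illustrations
function osLm(x::AbstractMatrix, y::AbstractVector;
              penalty = L2Penalty(), rates = fill(.1, size(x, 2)))
    o = StatLearn(size(x, 2), .5 * L2DistLoss(), penalty, rates, SGD())
    fit!(Series(o), x, y)
    return o                    # coef(o) / predict(o, newx) as above
end

model = osLm(x, y)
coef(model)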
I’m not sure what kind of enterprise ML you refer to, but in my experience most big companies want just SQL and some basic algorithms like logistic regression. Anyway, I can identify at least 4 kinds of tasks, all requiring different approaches.
1. Light to intermediate computation, an endless (but slowly arriving) data stream. You seem to be referring to this kind of task. For these you normally need one multicore node and a set of online data structures / algorithms.
2. Intensive computation, little data. CPU becomes the bottleneck: a single node doesn’t have enough CPUs to handle the load, so you have to parallelize the algorithm across multiple nodes. This complicates things a lot, because now you have multiple processes with their own memory plus network latency. Things like downpour SGD and asynchronous updates become mandatory to accommodate differences in performance between machines. Not sure, but I think MPI systems provide tools for this.
3. Little computation, lots of data. In this case the network becomes the bottleneck as a single shared resource. The solution is not to move the data to the computation nodes, but to move the computation to the data, which is the core of the MapReduce approach. E.g. to count the number of lines in 5 TB of text files stored on 10 nodes, you don’t copy the data over the network to one processor; instead you run the count locally on each node and then combine the results on a single node (a toy sketch of this follows the list). Hadoop / Spark target this niche.
4. Intensive computation, lots of data. Say, training a Google Translate model for 2 weeks on 100 PB of data. And we all know what Google came up with to solve that.
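The toy sketch promised under point 3, showing the “move the computation to the data” pattern with nothing but Julia’s standard distributed tools (paths and worker setup are hypothetical, and a real system would also pin each task to the node that actually stores its shard):

using Distributed             # part of Base before Julia 0.7
addprocs(10)                  # pretend these are workers on the 10 data nodes

@everywhere count_lines(path) = countlines(path)

shards = ["/data/logs/part-$i.txt" for i in 1:10]   # one shard per node
total  = sum(pmap(count_lines, shards))             # only the tiny per-node counts cross the network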
I come from credit risk modeling, so what I have in mind is something like this. We have 10-20 years of data. Each row of data is the end-of-month snapshot of a mortgage loan’s status, e.g. balance, interest rate, remaining term, delinquency status, etc. So if a loan was with the bank for one year, we would have 12 rows of data for that one account. A large-ish bank would have 100-200 million rows of data, where each row can contain 200-2,000 columns.
You need to do some data prep on the data, e.g. add a forward-looking default flag: if the loan goes into default in December 2016, then on its record in January 2016 the default_in_next_12m column will be set to true. You need to do this for all 200 million rows, so it’s a group-by for each loan followed by some windowing function.
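A minimal sketch of that group-by/window step, assuming the snapshots are already sorted by loan and month and held as plain vectors (the column names and the 12-month window object are illustrative):

using Dates

# flag[i] is true if the same loan is observed in default within the next 12 months
function add_default_in_next_12m(loan_id::Vector, month::Vector{Date}, in_default::Vector{Bool})
    n = length(loan_id)
    flag = falses(n)
    for i in 1:n
        j = i + 1
        # scan forward within the same loan, up to 12 months ahead
        while j <= n && loan_id[j] == loan_id[i] && month[j] <= month[i] + Month(12)
            if in_default[j]
                flag[i] = true
                break
            end
            j += 1
        end
    end
    return flag
end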
Once you have prepared the data you need to check for data quality issues, so lots of mean, max and min computations for various groups, plus designing some filtering rules to exclude unwanted data, etc.
Then you can fit models! Traditionally you would sample it down, e.g. create a balanced sample between goods and bads just to cut down on data size. But you may still run into memory issues, so it would be good to have online algorithms for fitting models with 100 million rows and 2,000 columns. Or cut it up into 12.5-million-row datasets, use all 8 cores to fit 8 models, and then create an ensemble out of the results.
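A rough sketch of the cut-up-and-combine idea using OnlineStats’ mergeable LinReg (load_chunk is a hypothetical loader for the i-th 12.5-million-row block, and I’m assuming a version of OnlineStats where Series supports merge!; for SGD-style learners you would average or ensemble the coefficients instead):

using OnlineStats

p = 2_000
models = [LinReg(p) for _ in 1:8]
series = [Series(models[i]) for i in 1:8]

for i in 1:8
    xb, yb = load_chunk(i)     # hypothetical: returns the i-th block as (Matrix, Vector)
    fit!(series[i], xb, yb)    # in practice each block would be fit on its own core
end

reduce(merge!, series)         # pool everything into series[1] / models[1]
coef(models[1])                # regression over all rows without holding them in memory at once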
PB is not the sort of size I am thinking about; it’s more like 20-50 gigabytes. It feels closer to 4.
We do have plans for great integration between Flux and JuliaDB. Indeed, the feature extraction features in JuliaDB were heavily geared towards Flux models originally. Stay tuned!
Not related to out-of-core, so a bit off topic, but the discussion of trees reminded me that DecisionTree.jl could really use some attention. For many, decision trees are as fundamental as linear models, so I think it is very important to have them well supported. As it currently stands, the library does not integrate well with others (e.g., no support for DataFrames), the performance could be better, and there seem to be some issues in the gradient boosting calculation.
So if someone has the time and (unlike me) knows what they’re doing then I think it would be a good place to devote some time to. People coming from more mature ecosystems will likely consider things like this relatively basic and expect them to “just work”. (Also, I think it should live alongside GLM under JuliaStats; provided that @bensadeghi and JuliaStats people agree, of course.)
So if all cells are Float64, it’s about 3 TB in total (200 million rows × 2,000 columns × 8 bytes ≈ 3.2 TB)? I wouldn’t target Spark users then; at my last company we received 3 TB of events every day, continuously recomputing a 20-25 TB latest snapshot. And this is pretty typical in the enterprise world.
I’m not sure online algorithms for 20-50 GB and 8 cores are a good fit for companies either: an EC2 instance with 16 cores and 64 GB of RAM (m5.4xlarge) costs only $0.768 per hour, and then you can use whatever algorithms you want.
I like the idea of online algorithms for enterprise in Julia, but I’m not sure I understand the target audience.
The place I am consulting for makes users run 50 GB datasets on laptops, so they use SAS. JuliaDB would be awesome once it’s ready.
People still use pandas and data.table, but when things can’t fit into memory they use SAS. And SAS can’t handle 50 GB of data well; it can, but it’s slow. JuliaDB fills this medium-data void: it’s not quite big data, but it’s big enough not to be small, so medium data.
I am working with large datasets (30–150 GB, depending on which), using mmap in Julia with a thin wrapper, on a laptop with 16 GB RAM and a fast SSD. Nevertheless, my bottleneck is very clearly IO, to the point that I am becoming somewhat lazy about optimizing my Julia code, since it does not matter (IO from the SSD shows up as CPU utilization on my machine, but profiling tells me that actual Julia code is 5–15% of the whole).
To be fair, I am doing basic tabulations and summaries with OnlineStats.jl; something more computational would probably highlight the advantages of Julia better.
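For reference, the thin wrapper can be little more than this (the file name, element type and shape are assumptions, and the file must of course have been written in the same column-major layout):

using Mmap, OnlineStats

n_rows, n_cols = 375_000_000, 50        # made-up shape, roughly 150 GB of Float64
io = open("snapshots.bin")
X  = Mmap.mmap(io, Matrix{Float64}, (n_rows, n_cols))   # nothing is read until pages are touched

# summarize one column without pulling the rest of the file through RAM
s = Series(Mean(), Variance(), Extrema())
fit!(s, view(X, :, 1))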
A small correction: For simple regression in OnlineStats:
Series((x,y), LinReg(n_predictors))
StatLearn is a super flexible type that lets you plug in any LossFunctions.jl loss, PenaltyFunctions.jl penalty, and a variety of stochastic approximation algorithms. Flexibility comes at the cost of syntax.
LinReg (and LinRegBuilder) give you exact linear regression estimates and the syntax is nicer.
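So, reusing the x and y from the earlier snippet, the simple-regression version becomes (a small sketch; keeping a handle on the LinReg object to read the coefficients back out):

using OnlineStats

o = LinReg(10)
s = Series((x, y), o)   # fitted on construction; fit!(s, xnew, ynew) keeps updating it
coef(o)                 # exact linear regression coefficients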
Yeah. Please understand that SAS is row-oriented, so if I want a frequency count of a single variable it still needs to read all the columns (actually it’s slightly better than that, but not by much). If I have random access to the columns (e.g. the JuliaDB.jl binary format allows for that), then I can perform that much, much faster than SAS. Sure, IO will still be the bottleneck, but the performance will already be an order of magnitude better than SAS. SAS now has a file format that allows lazy loading into memory (SASHbdat), but it’s still row-oriented.
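To make the contrast concrete, here is a toy column-store sketch (a made-up layout with one flat binary file per column; the file name and element type are assumptions). A frequency count only has to touch that one file:

using Mmap, StatsBase

# map a single column (say, delinquency status stored as Int8) straight from disk;
# ~200 MB is touched instead of the whole multi-TB row-oriented table
col  = Mmap.mmap(open("delinquency_status.bin"), Vector{Int8}, (200_000_000,))
freq = countmap(col)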