Hi,
I am working on a new Julia machine learning package. What I have now is a pure Julia implementation of XGBoost in < 500 lines of code: https://github.com/Statfactory/JuML.jl
It seems to be as fast as the original XGBoost but easier to use and with a smaller memory footprint. But don’t take my word for it :) See the README for example usage with the 10M-observation airline dataset.
I am looking for feedback, testers, contributors, etc.
Yeah, I may do a sequel; I also want to add GLMnet and Lasso.
I like XGBoost; it is crazy effective.
A pure Julia implementation can only be a good thing.
I was messing around the other day with giving it a custom loss function in Python, and my reaction was "Now I have made it slow." (Further, in Python it actually changes the results even for the same loss function, presumably because of rounding differences.)
Nice. I like GLMnet with LASSO or Elasticnet too. Three suggestions beyond @ChrisRackauckas’s directions for how to make this a registered Julia package:
One of the things I liked about GLMnet in R was cv.glmnet, which did a little cross-validation to find a lambda. For XGBoost, and all machine learning packages really, I feel that hyperparameter CV is useful (a generic sketch of such a loop follows these suggestions).
I’m interested in how individual models like this fit into machine learning frameworks like MLBase.jl, or systems like Caret in R, because any model is only really useful after proper cross-validation.
In my prior job we used Cox proportional hazards (survival) models a lot, and they didn’t always get included in an algorithm’s implementation. GLMnet did (thank you, Hastie), and I encourage you to do the same! It looks like this is under consideration now for the XGBoost C++ package.
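On the cross-validation suggestion, here is a minimal, generic sketch of a k-fold search over lambda; kfold_cv, fit_model, and score are hypothetical placeholders of mine, not part of JuML.jl, GLMnet, or any existing package:

```julia
# Hypothetical k-fold cross-validation over a grid of lambdas.
# `fit_model(Xtrain, ytrain, λ)` and `score(model, Xtest, ytest)` stand in for
# whatever model and metric you actually use; X is a feature matrix, y a vector.
using Random, Statistics

function kfold_cv(fit_model, score, X, y, lambdas; k = 5)
    n = length(y)
    idx = shuffle(1:n)
    folds = [idx[i:k:n] for i in 1:k]            # k roughly equal folds
    meanscores = map(lambdas) do λ
        foldscores = map(folds) do testidx
            trainidx = setdiff(idx, testidx)
            model = fit_model(X[trainidx, :], y[trainidx], λ)
            score(model, X[testidx, :], y[testidx])
        end
        mean(foldscores)
    end
    lambdas[argmin(meanscores)]                  # λ with the best mean CV score
end
```

cv.glmnet itself is smarter (it warm-starts along the whole lambda path), but the shape of the loop is the same.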
I was lucky to get a chance to chat with @statfactory while he was in town in Sydney. I tried the library and it worked flawlessly.
This makes a strong case for native Julia implementations of ML algorithms instead of relying on C++. I had always thought the implementation would be so hard that we would need to just use the C++, but now I can see first hand what a skilled person can make of it in a short period of time! It’s really impressive!
There is definitely a commercial company in there. Incorporation into JuliaDB would be awesome, but an independent H2O-style system in pure Julia is also a possibility!
I have managed to improve performance, and it is now 2x faster than C++ XGBoost with a much smaller memory footprint. I am hoping to release it soon, so stay tuned.
Julia allows customized functions and an easy-to-modify codebase. For researchers, improvements or alternative implementations can be quickly trialled.
Julia’s lazy programming and functional programming facilities allow low-cost creation of new columns (a toy sketch follows these points).
It is faster than the C++ implementation not because C++ is slow, but because the functional style of programming surfaces many optimizations that are otherwise hard to spot in large codebases worked on by many people.
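To illustrate the lazy-columns point, here is a toy sketch of the idea; this LazyColumn type is my own illustration, not the JuML.jl API:

```julia
# Toy illustration of a lazily computed column: the new column is just a
# function of an existing one and is only evaluated when indexed or iterated.
struct LazyColumn{F}
    f::F        # i -> value of the i-th element
    len::Int
end

Base.length(c::LazyColumn) = c.len
Base.getindex(c::LazyColumn, i::Integer) = c.f(i)
Base.iterate(c::LazyColumn, i = 1) = i > c.len ? nothing : (c.f(i), i + 1)

distance = rand(100_000) .* 2000.0                                 # existing (toy) column
logdist  = LazyColumn(i -> log1p(distance[i]), length(distance))   # no new array allocated

sum(logdist)    # values are produced on the fly during iteration
```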
I’m not sure if it helps, but a pure Julia implementation of Cox regression is available here. It’s reasonably optimized (at least compared to MATLAB’s), but it’s single-threaded. I’m not sure what it would take to transform it into an XGBoost-style implementation, but I figured having it as a reference could be better than starting from scratch.
Gradient boosting is a really nice and hassle-free ML algorithm, and beating XGBoost in performance is really impressive. Something like this could be a killer app for Julia machine learning. It would be really cool to see every Kaggle contest being won by solutions that either use or call Julia code.
Here is some critical feedback:
One of the main points of gradient boosting (and extensions like XGBoost) is its generalization to any loss, not just linear or logistic regression. This generalization is very easily achieved via multiple dispatch in Julia. In that context, it was surprising to see that the package works only for logistic regression; additionally, the code seems to be hard-coded around the logit loss.
It would be great to abstract the XGBoost algorithm away from a direct dependence on the loss. Instead, the algorithm can get the gradient (and Hessian) from the loss type (this is where multiple dispatch is so handy), which makes adding new losses very easy.
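To make the suggestion concrete, here is a hedged sketch; the Loss types and gradhess below are my own names, not anything in the package:

```julia
# Hypothetical sketch: the boosting loop only ever asks a Loss for its gradient
# and Hessian, so adding a new loss is just a new type plus two methods.
abstract type Loss end

struct LogitLoss   <: Loss end
struct SquaredLoss <: Loss end

sigmoid(x) = 1 / (1 + exp(-x))

# first and second derivatives of the loss w.r.t. the raw score yhat
gradient(::LogitLoss,   y, yhat) = sigmoid(yhat) - y
hessian(::LogitLoss,    y, yhat) = sigmoid(yhat) * (1 - sigmoid(yhat))
gradient(::SquaredLoss, y, yhat) = yhat - y
hessian(::SquaredLoss,  y, yhat) = one(yhat)

# the tree-building step never needs to know which loss it is working with
function gradhess(loss::Loss, y::AbstractVector, yhat::AbstractVector)
    gradient.(Ref(loss), y, yhat), hessian.(Ref(loss), y, yhat)
end

g, h = gradhess(LogitLoss(), [0.0, 1.0, 1.0], [0.1, -0.2, 0.3])
```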
Thanks.
@rakeshvar Good points. I was wondering similar things. To help me better understand what goes on, I am trying to implement a version that works.
I see a two-pronged approach as appropriate. For well-known loss functions (for which we know the derivatives), it is better to have them hard-coded, with convenient wrappers that pass in the loss and its derivatives. For novel loss functions, the code then obtains the derivatives using automatic differentiation libraries.
E.g. two signatures xgboost(....., loss = :logloss) vs xgboost(...., loss = some_novel_func)
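A rough sketch of how those two entry points could coexist; loss_gradients and KNOWN_GRADS are made-up names, and ForwardDiff is just one possible autodiff backend:

```julia
# Hypothetical two-pronged loss handling: symbols select hard-coded derivatives,
# arbitrary functions fall back to automatic differentiation.
using ForwardDiff

const KNOWN_GRADS = Dict(
    :logloss => (y, yhat) -> 1 / (1 + exp(-yhat)) - y,
    :l2      => (y, yhat) -> yhat - y,
)

# Julia does not dispatch on keyword arguments, so branch on the type of `loss`
function loss_gradients(y, yhat; loss = :logloss)
    if loss isa Symbol
        return KNOWN_GRADS[loss].(y, yhat)               # fast, hand-coded path
    else
        # `loss` is a function L(y, yhat); differentiate it w.r.t. yhat
        return [ForwardDiff.derivative(z -> loss(yi, z), yh) for (yi, yh) in zip(y, yhat)]
    end
end

y, yhat = [0.0, 1.0], [0.2, -0.1]
loss_gradients(y, yhat; loss = :logloss)                    # well-known loss
loss_gradients(y, yhat; loss = (yi, z) -> (yi - z)^2 / 2)   # novel loss via autodiff
```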
I am the author of Julia XGBoost. Unfortunately, my focus has now shifted to probabilistic machine learning (Bayesian networks), so I will not be able to develop this package any further.
Curious what’s become of this project? I see there are a few forks but no work. Anyone know if this has been moved or adapted to a Julia working group or organization?