New machine learning package, Julia implementation of XGBoost


#1

Hi,
I am working on a new Julia machine learning package. What I have now is a pure Julia implementation of XGBoost in < 500 lines of code:

It seems to be as fast as the original XGBoost but easier to use and with a smaller memory footprint. But don’t take my word for it :) See the README for example usage with the 10M-observation airline dataset.

I am looking for feedback, testers, contributors, etc.

Cheers
Adam


#2

@oxinabox you should give it a spin in that blog post


#3

Looking at the repo: you should add .jl to the name, add unit tests, set up CI, and get it registered.


#4

Yeah, I may do a sequel; I also want to add GLMnet and Lasso.

I like XGBoost; it is crazy effective.
A pure Julia implementation can only be a good thing.
I was messing around the other day with giving it a custom loss function in Python, and I was like "Now I have made it slow." (Furthermore, in Python it actually changes the results even for the same loss function, presumably because of rounding differences.)


#5

Nice. I like GLMnet with LASSO or Elasticnet too. Three suggestions, beyond @ChrisRackauckas's directions for making this a registered Julia package:

  1. One of the things I liked about GLMnet in R was cv.glmnet, which did a little cross-validation to find a lambda. For XGBoost, and machine learning packages in general, I feel that hyperparameter CV is useful.

  2. I’m interested in how individual models like this fit into machine learning frameworks like MLBase.jl, or systems like caret in R, because any model is only useful after proper cross-validation.

  3. In my prior job we used Cox proportional hazards (survival) models a lot, and they didn’t always get included in an algorithm. GLMnet did (thank you, Hastie), and I encourage you to do the same! It looks like this is under consideration now for the C++ XGBoost package.
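The cross-validation idea in point 1 can be sketched in a few lines of pure Julia. This is only an illustration, not part of any of the packages discussed: `fit_ridge` is a hypothetical stand-in for any model with a lambda hyperparameter, and `cv_lambda` mimics what cv.glmnet does for GLMnet.

```julia
using LinearAlgebra, Random, Statistics

# Hypothetical model: ridge regression via regularized normal equations.
# Any model with a lambda-style hyperparameter could be slotted in here.
fit_ridge(X, y, lambda) = (X' * X + lambda * I) \ (X' * y)

# k-fold cross-validation over a lambda grid, in the spirit of cv.glmnet.
# Returns the lambda with the lowest mean validation MSE.
function cv_lambda(X, y; lambdas = [0.01, 0.1, 1.0, 10.0], k = 5,
                   rng = MersenneTwister(42))
    n = size(X, 1)
    folds = mod1.(shuffle(rng, collect(1:n)), k)   # random fold labels 1..k
    function cv_mse(lambda)
        errs = map(1:k) do f
            tr = folds .!= f                        # training rows
            va = folds .== f                        # validation rows
            beta = fit_ridge(X[tr, :], y[tr], lambda)
            mean(abs2, X[va, :] * beta .- y[va])    # validation MSE
        end
        mean(errs)
    end
    return lambdas[argmin(cv_mse.(lambdas))]
end
```

The same grid-search loop generalizes to boosting hyperparameters (depth, eta, number of rounds) by swapping out the fit and scoring functions.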


#6

I was lucky to get a chance to chat with @statfactory while he was in town in Sydney. I tried the library and it worked flawlessly.

This calls for native Julia implementations of ML algorithms instead of relying on C++. I had always thought that the implementation was so hard that we needed to just use the C++, but I can now see first-hand what a skilled person can make of it in a short period of time! It’s really impressive!

There is definitely a commercial company in there. Incorporation into JuliaDB would be awesome, but an independent H2O-style company, in pure Julia, is a possibility!


#7

I have managed to improve performance, and it is now 2x faster than C++ XGBoost, with a much smaller memory footprint. I am hoping to release it soon, so stay tuned.

Adam


#8

Just writing this down in case I forget:

  1. Julia allows customized functions and an easy-to-modify codebase. For researchers, improvements or alternative implementations can be quickly trialled.
  2. Julia’s lazy programming and functional programming facilities allow low-cost creation of new columns.
  3. It is faster than the C++ implementation not because C++ is slow, but because the functional style of programming surfaces many optimizations that are otherwise hard to spot in large codebases worked on by many people.

#9

I’m not sure if it helps, but a pure Julia implementation of Cox regression is available here. It’s reasonably optimized (at least compared to MATLAB’s) but it’s single-threaded. I’m not sure what it would take to transform it into an XGBoost-style implementation, but I figured having it as a reference could be better than starting from scratch.


#10

The current XGBoost doesn’t seem to work with either Julia 0.7 or 1.0; I am not able to compile it.


#11

Gradient boosting is a really nice, hassle-free ML algorithm, and beating XGBoost in performance is really impressive. Something like this could be a killer app for Julia machine learning. It would be really cool to see every Kaggle contest being won by solutions that either use or call Julia code.


#12

Thanks. Just in time: I wanted to compare my algorithm to XGBoost, and having a pure Julia implementation is great.


#13

Here is some critical feedback:
One of the main points of gradient boosting (and extensions like XGBoost) is its generalization to any loss (not just linear regression or logistic regression). This generalization is very easily achieved via multiple dispatch in Julia. In that context, seeing that the package works only for logistic loss was surprising. Additionally, the code seems to be hard-coded around the logit loss.
It would be great to abstract the XGBoost algorithm away from its direct dependence on the loss; instead, the algorithm can get the gradient (and Hessian) from the loss type (this is where multiple dispatch is so handy). This makes adding new losses very easy.
Thanks.
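To make the suggestion concrete, here is one way the loss could be abstracted behind a type via multiple dispatch. This is only a sketch, not the package's actual API: the names (`Loss`, `LogitLoss`, `grad_hess`, etc.) are made up for illustration.

```julia
# Sketch: abstract the loss behind a type, so the boosting loop only ever
# asks for gradients and Hessians via dispatch on the loss type.
abstract type Loss end

struct LogitLoss   <: Loss end
struct SquaredLoss <: Loss end

sigmoid(x) = 1 / (1 + exp(-x))

# First and second derivatives of each loss w.r.t. the raw prediction yhat.
gradient(::LogitLoss, y, yhat)   = sigmoid(yhat) - y
hessian(::LogitLoss, y, yhat)    = (p = sigmoid(yhat); p * (1 - p))
gradient(::SquaredLoss, y, yhat) = yhat - y
hessian(::SquaredLoss, y, yhat)  = one(yhat)

# The tree-building code stays loss-agnostic: adding a new loss is just two
# new methods, with no change to the core algorithm.
function grad_hess(loss::Loss, y::AbstractVector, yhat::AbstractVector)
    g = gradient.(Ref(loss), y, yhat)
    h = hessian.(Ref(loss), y, yhat)
    return g, h
end
```

The core algorithm then only depends on `grad_hess`, which is exactly the decoupling described above.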


#14

@rakeshvar Good points. I was wondering similar things. To help me better understand what goes on, I am trying to implement a version that works.

I see a two-pronged approach as appropriate. For well-known loss functions (for which we know the derivatives), it is better to have them hard-coded, with convenient wrappers that pass in the loss and its derivatives. For novel loss functions, the code then passes in derivatives obtained via automatic differentiation libraries.

E.g. two signatures: xgboost(....., loss = :logloss) vs xgboost(...., loss = some_novel_func)
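A rough sketch of what those two signatures could dispatch to. Everything here is hypothetical (`grad_hess` is a made-up helper, and the finite-difference fallback stands in for a real autodiff library such as ForwardDiff.jl):

```julia
sigmoid(x) = 1 / (1 + exp(-x))

# Prong 1: well-known losses named by Symbol, with hand-coded derivatives.
function grad_hess(loss::Symbol, y, yhat)
    if loss === :logloss
        p = sigmoid(yhat)
        return p - y, p * (1 - p)   # analytic gradient and Hessian
    end
    error("unknown built-in loss: $loss")
end

# Prong 2: novel losses passed as a function. Derivatives come from central
# finite differences here purely for illustration; a real implementation
# would use an autodiff library (e.g. ForwardDiff.jl) instead.
function grad_hess(loss::Function, y, yhat; dx = 1e-5)
    g = (loss(y, yhat + dx) - loss(y, yhat - dx)) / (2dx)
    h = (loss(y, yhat + dx) - 2loss(y, yhat) + loss(y, yhat - dx)) / dx^2
    return g, h
end
```

With this, `grad_hess(:logloss, y, yhat)` and `grad_hess((y, ŷ) -> (ŷ - y)^2 / 2, y, yhat)` both return the (gradient, Hessian) pair the boosting loop needs, so the two user-facing signatures share one code path.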


#15

I am the author of Julia XGBoost. Unfortunately my focus has now shifted to probabilistic machine learning (Bayesian networks) so I will not be able to develop this package any further.

Adam


#16

Is this a request for contributors? :slight_smile:


#17

You can develop it further on a fork if you want.