New machine learning package, Julia implementation of XGBoost


#1

Hi,
I am working on a new Julia machine learning package. What I have now is a pure Julia implementation of XGBoost in < 500 lines of code:

It seems to be as fast as the original XGBoost but easier to use and with a smaller memory footprint. But don’t take my word for it :) See the README for example usage on the 10M-observation airline dataset.
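
For a flavour of what usage looks like without opening the README, here is a rough sketch; the package, function, and column names below are placeholders rather than the actual API:

```julia
# Hypothetical usage sketch only -- function and column names are illustrative
# placeholders, not the package's real API (see the README for the actual example).
using CSV, DataFrames

df = CSV.read("airline_10M.csv", DataFrame)       # the 10M-row airline dataset from the README

# Fit a boosted-tree logistic model: 100 trees, depth 6, learning rate 0.1
model = fitboost(df, :dep_delayed_15min;          # placeholder fitting function
                 nrounds = 100, max_depth = 6, eta = 0.1)

preds = predict(model, df)                        # placeholder predict method
```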

I am looking for feedback, testers, contributors, etc.

Cheers
Adam


#2

@oxinabox you should give it a spin in that blog post


#3

Looking at the repo, you should add .jl to the name, add unit tests, set up CI, and get it registered.
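
For reference, the test half of that checklist is tiny to set up; something like the skeleton below, where the package and function names are placeholders:

```julia
# test/runtests.jl -- minimal test skeleton to pair with CI.
# "NewBoost" and the fit/predict functions are placeholders for the package's real names.
using Test
using NewBoost

@testset "smoke test" begin
    X = rand(100, 4)
    y = Float64.(rand(100) .> 0.5)
    model = fitboost(X, y; nrounds = 10)           # placeholder fit
    @test all(isfinite, predictboost(model, X))    # placeholder predict
end
```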


#4

Yeah, I may do a sequel; I also want to add GLMnet and Lasso.

I like XGBoost; it is crazy effective.
A pure Julia implementation can only be a good thing.
I was messing around the other day with giving it a custom loss function in Python, and I thought, “Now I have made it slow.” (Further, in Python it actually changes the results even for the same loss function, presumably because of rounding differences.)
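
For context, a custom objective in the XGBoost sense is just a function returning the per-row gradient and Hessian of the loss; in a pure Julia implementation such a function could plausibly be passed in and compiled to native code, avoiding the Python callback overhead. A sketch for log-loss (the `objective` hook is hypothetical):

```julia
# Sketch of a custom objective, gradient/Hessian style as in XGBoost.
# The `fitboost` call and its `objective` keyword are hypothetical, not this package's API.
sigmoid(x) = 1 / (1 + exp(-x))

function logistic_objective(preds::Vector{Float64}, labels::Vector{Float64})
    p = sigmoid.(preds)
    grad = p .- labels        # first derivative of log-loss w.r.t. the raw score
    hess = p .* (1 .- p)      # second derivative
    return grad, hess
end

# model = fitboost(X, y; objective = logistic_objective)   # hypothetical hook
```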


#5

Nice. I like GLMnet with LASSO or elastic net too. Three suggestions beyond @ChrisRackauckas’s directions for how to make this a registered Julia package:

  1. One of the things I liked about GLMnet in R was cv.glmnet, which does a little cross-validation to find a lambda. For XGBoost and machine learning packages in general, I feel that hyperparameter CV is useful (a rough sketch of such a loop follows this list).

  2. I’m interested in how individual models like this fit into machine learning frameworks like MLBase.jl, or systems like caret in R, because any model is only really useful after repeated cross-validation.

  3. In my prior job we used Cox proportional hazards (survival) models a lot, and they didn’t always come built into a given algorithm. GLMnet did include them (thank you, Hastie), and I encourage you to do the same! It looks like this is under consideration now for the XGBoost C++ package.
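
On point 1, a k-fold CV loop over, say, the learning rate might look roughly like this (the fitting, prediction, and loss functions here are placeholders, not this package’s API):

```julia
# Rough sketch of a k-fold CV loop over the learning rate, in the spirit of cv.glmnet.
# `fitboost`, `predictboost`, and `logloss` are placeholders, not an existing API.
using Random, Statistics

function cv_eta(X, y; etas = [0.05, 0.1, 0.3], k = 5)
    n = length(y)
    perm = shuffle(1:n)
    best_eta, best_loss = first(etas), Inf
    for eta in etas
        fold_losses = Float64[]
        for i in 1:k
            test_idx  = perm[i:k:n]                        # every k-th index forms one fold
            train_idx = setdiff(perm, test_idx)
            model = fitboost(X[train_idx, :], y[train_idx]; eta = eta)
            push!(fold_losses, logloss(y[test_idx], predictboost(model, X[test_idx, :])))
        end
        if mean(fold_losses) < best_loss
            best_eta, best_loss = eta, mean(fold_losses)
        end
    end
    return best_eta, best_loss
end
```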


#6

I was lucky to get a chance to chat with @statfactory while he was in town in Sydney. I tried the library and it worked flawlessly.

This calls for native Julia implementations of ML algorithms instead of relying on C++. I had always thought the implementation was so hard that we had to just wrap the C++, but now I can see first-hand what a skilled person can produce in a short period of time! It’s really impressive!

There is definitely a commercial company in there. Incorporation into JuliaDB would be awesome, but an independent H2O in pure Julia is also a possibility!


#7

I have managed to improve performance, and it is now 2x faster than C++ XGBoost with a much smaller memory footprint. I am hoping to release it soon, so stay tuned.

Adam


#8

Just writing this down in case I forget:

  1. Julia allows customized functions and an easy-to-modify codebase. For researchers, improvements or alternative implementations can be quickly trialled.
  2. Julia’s lazy programming and functional programming facilities allow low-cost creation of new columns (a generic illustration follows this list).
  3. It is faster than the C++ implementation not because C++ is slow, but because the functional style of programming surfaces many optimizations that are otherwise hard to spot in large codebases worked on by many people.
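
On point 2, here is a generic Julia illustration of a lazily derived column (ordinary base-Julia features, not this package’s internals):

```julia
# Generic Julia illustration: a derived column expressed lazily, so no full copy
# is materialised until something actually iterates over it.
raw_distance = rand(Float32, 10_000_000)                 # pretend this is an existing column

log_distance = (log1p(x) for x in raw_distance)          # lazy generator, computed on demand
capped       = Iterators.map(x -> min(x, 5.0f0), raw_distance)  # lazy mapped iterator

# Only when the tree-building loop scans the column is anything computed:
total = sum(log_distance)
```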

#9

I’m not sure if it helps, but a pure Julia implementation of Cox regression is available here. It’s reasonably optimized (at least compared to MATLAB’s), but it’s single-threaded. I’m not sure what it would take to transform it into an XGBoost-style implementation, but I figured having it as a reference could be better than starting from scratch.
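
For reference, the piece an XGBoost-style boosting loop would need from Cox regression is the gradient (and Hessian) of the negative partial log-likelihood, used as a custom objective. A naive, unoptimized sketch of the gradient (not the linked package’s code) might look like:

```julia
# Naive sketch: gradient of the negative Cox partial log-likelihood (Breslow handling
# of ties), i.e. what an XGBoost-style custom objective would return per row.
# O(n^2) for clarity, not speed.
function cox_gradient(times::Vector{Float64}, events::Vector{Float64}, f::Vector{Float64})
    n = length(f)
    expf = exp.(f)
    grad = -copy(events)                                  # the -delta_k term for event rows
    for i in 1:n
        events[i] == 1.0 || continue                      # only event times contribute
        risk = [j for j in 1:n if times[j] >= times[i]]   # risk set at the i-th event time
        denom = sum(expf[risk])
        for k in risk
            grad[k] += expf[k] / denom
        end
    end
    return grad
end
```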