No variability in xgboost outputs? (XGBoost.jl)

Regarding the variability, it can be controlled with the rng parameter (which was overlooked in the docs, thanks for pointing that out):

pars = EvoTreeRegressor(
    loss = :linear,
    nrounds = 1,
    η = 1,
    rowsample = 0.5,   # each tree is fit on a random half of the rows
    colsample = 1,
    nbins = 16,
    max_depth = 10,
    min_weight = 5.0,
    rng = 123);        # seed for the random sampling
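
To illustrate why the seed matters when rowsample < 1, here is a minimal sketch in plain Julia. It is not EvoTrees' internal code: sample_rows is a hypothetical helper that only mimics the idea of drawing a fraction of the rows from a seeded generator.

```julia
using Random

# Hypothetical helper (not part of EvoTrees): pick a fraction of the row
# indices using a generator seeded with `seed`.
function sample_rows(seed::Integer, nobs::Integer, rowsample::Real)
    rng = MersenneTwister(seed)
    nkeep = floor(Int, rowsample * nobs)
    return sort(randperm(rng, nobs)[1:nkeep])
end

rows_a = sample_rows(123, 10, 0.5)
rows_b = sample_rows(123, 10, 0.5)  # same seed: identical subsample
rows_c = sample_rows(124, 10, 0.5)  # different seed: usually a different subsample

println(rows_a == rows_b)  # prints true
```

With the same seed the subsample (and hence the fitted tree) is reproducible; changing the seed is what brings back run-to-run variability.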

Note that nbins has also been specified in the parameters above. EvoTrees always uses a histogram-based approach, which requires heavier preprocessing than the exact method but speeds up tree building in gradient boosting regression, where typically dozens to several hundred trees are built. It likely won’t be as efficient for a single regression tree, or when the number of observations is low.
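
As a rough illustration of what the histogram preprocessing involves, the sketch below bins one feature on quantile-based cut points. This is an assumption about the general technique, not EvoTrees' actual implementation; bin_edges and bin_index are hypothetical names.

```julia
using Statistics

# Hypothetical helper: up to nbins - 1 quantile-based cut points for one feature.
function bin_edges(x::AbstractVector{<:Real}, nbins::Integer)
    probs = range(0, 1; length = nbins + 1)[2:end-1]
    return unique(quantile(x, probs))
end

# Map each value to a bin index in 1:(length(edges) + 1).
bin_index(x, edges) = [searchsortedfirst(edges, v) for v in x]

x = collect(1.0:100.0)
edges = bin_edges(x, 4)     # three cut points for four bins
bins = bin_index(x, edges)  # integer bin id for every observation
```

This binning has to be done once per feature before any tree is grown, which is why the up-front cost is noticeable with many features, while split search afterwards only has to scan nbins candidates per feature instead of every distinct value.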

This becomes visible when the preprocessing and the tree-building steps are timed separately:

julia> @time model, cache = EvoTrees.init_evotree(pars, X, y);
  1.078003 seconds (2.18 M allocations: 1.107 GiB, 34.28% gc time)

julia> @time EvoTrees.grow_evotree!(model, cache);
  0.073624 seconds (101.64 k allocations: 27.129 MiB)

Tree building itself takes under 0.1 s, similar to what was observed with ranger; the initialization step accounts for most of the total time.

Are those data dimensions (10K observations, 700 features) representative of your dataset size?
