No variability in xgboost outputs? (XGBoost.jl)

Dear all

I am using the function “xgboost” from the R package “xgboost” and the
Julia package “XGBoost.jl”, which both interface the XGBoost library.
Maybe I missed something,
but I observed surprising outputs from XGBoost.jl.
When I fit one single tree with xgboost (Julia), there is
no variability when I re-run the fitting on the same data,
even though I use only fractions of the data rows and columns for
each run (subsample, colsample_bytree), and even for each node of
the tree (colsample_bynode).

Since these fraction selections are random,
I expected to observe different results between runs
(note: I do observe this variability, as expected, with xgboost under R).
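To clarify what I mean: with Julia’s Random standard library (illustration only, not XGBoost code), unseeded draws differ between calls, whereas identically seeded RNGs reproduce the same draw; I expected the first behaviour from XGBoost.jl:

```julia
using Random

n = 100
# Two unseeded subsample draws: different between calls (with overwhelming probability)
rows1 = randperm(n)[1:60]
rows2 = randperm(n)[1:60]

# Two draws from identically seeded RNGs: always identical
rows3 = randperm(MersenneTwister(42), n)[1:60]
rows4 = randperm(MersenneTwister(42), n)[1:60]
# rows3 == rows4 always holds; rows1 == rows2 essentially never does
```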

Below is a reproducible example:

n, p, m = 100, 100, 20
X = rand(n, p) ; y = rand(n) ;
Xnew = rand(m, p) ; ynew = rand(m) ; 

Fit of one single tree:

fm = xgboost(X, 1; label = y,
    booster = :gbtree,
    tree_method = :auto,
    num_parallel_tree = 1,
    subsample = .6,
    colsample_bytree = .8,
    colsample_bynode = 1/3,
    max_depth = 6,
    min_child_weight = 1,
    eta = 1,
    verbosity = 0)

Output:

pred = XGBoost.predict(fm, Xnew) ;
sum((ynew - pred).^2) / length(ynew) # MSEP

[1]     train-rmse:0.237486
0.6847792450571403
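(For reference, the MSEP used throughout this thread is the mean of squared residuals; a small plain-Julia helper, with a hypothetical name, would be:)

```julia
# Mean squared error of prediction (MSEP): mean of the squared residuals.
# Hypothetical helper name; equivalent to sum((y .- yhat).^2) / length(y).
msep(y, yhat) = sum(abs2, y .- yhat) / length(y)
```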

When I re-run the fitting, I get the same result:

fm = xgboost(X, 1; label = y, ...)  # same arguments as before
pred = XGBoost.predict(fm, Xnew) ;
sum((ynew - pred).^2) / length(ynew) # MSEP

[1]     train-rmse:0.237486
0.6847792450571403

I played with the argument “seed_per_iteration”, but this did not change anything:
still no variability between runs. Did I miss something?

Actually, I have the same problem when I fit random forests or XGBoost models
with the Julia “xgboost”: I don’t observe variability between runs on the same
data, whereas I do observe it with R’s xgboost (I used the
same XGBoost parameterization under R and Julia).

fm = xgboost(X, 10; label = y,
    booster = :gbtree,
    tree_method = :auto,
    num_parallel_tree = 1,
    subsample = .8,
    colsample_bytree = .8,
    colsample_bynode = 1/3,
    max_depth = 6,
    min_child_weight = 1,
    eta = .3,
    verbosity = 0)

pred = XGBoost.predict(fm, Xnew) ;
sum((ynew - pred).^2) / length(ynew) # MSEP

[1]     train-rmse:0.231707
[2]     train-rmse:0.184608
[3]     train-rmse:0.156864
[4]     train-rmse:0.134460
[5]     train-rmse:0.115025
[6]     train-rmse:0.094434
[7]     train-rmse:0.078340
[8]     train-rmse:0.069557
[9]     train-rmse:0.060034
[10]    train-rmse:0.050421

0.425254884669798

When I re-run the code above, no variability in the results is observed.

[Another problem I have is that “verbosity = 0” does not
suppress the printed information (from the doc, it should,
if I understood correctly). As shown in the example above,
the per-round information is printed anyway (but this is also the case under R…).
Does somebody know how to get a silent run?]

Thanks for any help


This looks like a genuine bug to me. I suggest you open an issue at XGBoost.jl.

BTW, if you’re interested in a pure-Julia gradient tree booster implementation, I suggest EvoTrees.jl, which is actively maintained. Like XGBoost.jl and LightGBM.jl, it has an MLJ interface.


Thanks, I will open an issue at XGBoost.jl in a few days.
I know EvoTrees.jl )

Actually, I observe the same type of behaviour when I build a
single tree with EvoTrees. It seems that the function always returns
the same tree, even if rowsample and colsample are set to < 1.


n, p, m = 100, 100, 20
X = rand(n, p) ; y = rand(n) ;
Xnew = rand(m, p) ; ynew = rand(m) ;

When I run several times:

pars = EvoTreeRegressor(loss = :linear,
           nrounds = 1, η = 1,
           rowsample = .5, colsample = .5,
           max_depth = 10,
           min_weight = 5.0) ;
fm = fit_evotree(pars, X, y) ;
pred = EvoTrees.predict(fm, Xnew)
sum((ynew - pred).^2) / length(ynew) # MSEP

I always get the same result. It looks like the same rows and columns are
selected at each run of the code above.
I observed the same behaviour when nrounds > 1 (boosting): there is variability between
the successive rounds, but not between runs of the whole code (which suggests
that the same rows and columns are selected across runs).

Is there a particular seed that is fixed inside the code of fit_evotree?
(I did not see specific arguments in the docs)

Or did I miss something?

By the way, I am also surprised that EvoTrees (which is faster
than DecisionTree.jl) seems much slower at building trees than some R packages, such as
“ranger” (CRAN - Package ranger).

Below is an example with X(1000, 700) and y(1000)

Building a single regression tree with ranger in R:


library(ranger)

n <- 1000 ; p <- 700
X <- matrix(runif(n * p), ncol = p)
y <- runif(n)

dat <- data.frame(y, X)
system.time(
    fm <- ranger(y ~ ., data = dat,
             num.trees = 1,
             sample.fraction = 1,
             mtry = 1,
             max.depth = 10,
             min.node.size = 5,
             replace = FALSE)
    )

   user  system elapsed
   0.07    0.00    0.08

(0.08 sec for one tree)

Same but with EvoTrees in Julia (I ran the code several times beforehand to exclude
the initial compilation time):


n, p = 1000, 700
X = rand(n, p) ; y = rand(n) ;

pars = EvoTreeRegressor(loss = :linear,
           nrounds = 1, η = 1,
           rowsample = 1, colsample = 1,
           max_depth = 10,
           min_weight = 5.0) ;
@time fm = fit_evotree(pars, X, y) ;

2.656099 seconds (2.28 M allocations: 3.712 GiB, 42.75% gc time)

That is roughly 33 times slower (2.66 sec vs 0.08 sec).

When I set max_depth = 20, this became even worse: 0.12 sec for ranger, while
my EvoTrees run crashed.

My surprise is that, usually, Julia APIs are faster than R APIs. Again, maybe
I missed something.

(my config is: Intel(R) Core™ i9-10885H CPU @ 2.40GHz, 32 GB RAM)

Regarding the variability, it can be controlled with the rng parameter (which was overlooked in the docs; thanks for pointing that out):

pars = EvoTreeRegressor(loss = :linear,
           nrounds = 1, 
           η = 1,
           rowsample = 0.5, 
           colsample = 1,
           nbins = 16,
           max_depth = 10,
           min_weight = 5.0,
           rng=123);

Note that an nbins has also been specified here. EvoTrees always uses a histogram-building approach, which results in heavier preprocessing than the exact method, but with a speed benefit in the context of gradient-boosting regression, where typically dozens to several hundred trees will be built. It is likely not as optimal in a single-regression-tree context, or when the number of observations is low.
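As a rough illustration of the histogram idea (a simplified sketch only, not the actual EvoTrees code): each feature is discretized once into nbins bins, so that split search runs over at most nbins - 1 candidate boundaries per feature instead of n - 1 raw values:

```julia
# Sketch: discretize one feature into `nbins` bins with quantile-style edges.
# Split candidates then reduce from n - 1 raw values to nbins - 1 boundaries.
function bin_feature(x::AbstractVector, nbins::Int)
    n = length(x)
    xs = sort(x)
    # nbins - 1 interior quantile-style edges
    edges = [xs[clamp(round(Int, q * n), 1, n)]
             for q in range(0, 1; length = nbins + 1)[2:end-1]]
    return [searchsortedlast(edges, v) + 1 for v in x]  # bin index in 1:nbins
end
```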

This can be highlighted by splitting the preprocessing and the tree-building parts as follows:

julia> @time model, cache = EvoTrees.init_evotree(pars, X, y);
  1.078003 seconds (2.18 M allocations: 1.107 GiB, 34.28% gc time)

julia> @time EvoTrees.grow_evotree!(model, cache);
  0.073624 seconds (101.64 k allocations: 27.129 MiB)

It can be noticed that tree building per se is < 0.1 sec, similar to what was observed with ranger.

Are those data dimensions (1,000 observations, 700 features) representative of your dataset size?


Thanks for your answer, Jeremie, and the “rng” argument is well noted.
BTW, I have one question: does the “colsample” argument sample by tree or by node?

Yes, I frequently work with such data sizes, and even larger (e.g. 1,000 columns, or even >10,000; n can also be > or >> 1,000).

The origin of my post is that I was looking for a fast Random Forest (RF) tool in Julia, and I did not find one (DecisionTree.jl, for instance, is very slow for trees and RF, except on small data sizes). Therefore, I started to build my own bagging function and was looking for a fast tree builder.

There are now extremely fast RF tools in R, such as ranger or Rborist (CRAN - Package Rborist). For instance, building a forest of 100 trees of depth 20 over X(1000, 700) takes ~0.3 sec with ranger:

n <- 1000 ; p <- 700
X <- matrix(runif(n * p), ncol = p)
y <- runif(n)

dat <- data.frame(y, X)
tic()
fm <- ranger(y ~ ., data = dat, 
             num.trees = 100, 
             sample.fraction = 1,  # ==> bootstrap sampling
             mtry = 1/3, 
             max.depth = 20, 
             min.node.size = 5,
             replace = TRUE)
toc()
0.28 sec elapsed

I expected to find such tools in Julia. I am sure they will come one day ).

It can be noticed that tree building per se is < 0.1 sec, similar to what was observed with ranger

Yes, but if I understood your code correctly, repeating the EvoTrees.init_evotree step is necessary anyway when we want to build a forest.
I also observed that increasing max_depth from 10 to 20 in EvoTreeRegressor crashed Julia (even with nbins = 16).
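For what it’s worth, the bagging pattern I have in mind is essentially the following (a plain-Julia sketch with a trivial stand-in “learner”, just to show the loop; in practice each tree would be fitted with something like fit_evotree(EvoTreeRegressor(...; rng = seeds[t]), Xb, yb)):

```julia
using Random, Statistics

# Sketch of a bagging loop with per-tree seeds (=> run-to-run variability).
# The "learner" here is a trivial stand-in that predicts the training mean;
# a real version would fit a tree on (X[rows, :], y[rows]) instead.
function bagging(X, y, Xnew; ntrees = 50, rowsample = 0.7)
    n = size(X, 1)
    seeds = rand(UInt32, ntrees)          # fresh seeds at each call
    preds = zeros(size(Xnew, 1))
    for t in 1:ntrees
        rng = MersenneTwister(seeds[t])
        rows = randperm(rng, n)[1:round(Int, rowsample * n)]  # row subsampling
        preds .+= mean(y[rows])           # stand-in "tree" prediction
    end
    return preds ./ ntrees                # average over trees
end
```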

Sampling is performed by tree.

I haven’t taken the time yet to further benchmark cases where the number of features is quite large (like ~1,000). Development of EvoTrees was originally driven by my own kind of use cases, with numbers of observations in the 100K to 10M range and numbers of features in the 50-250 range. As can be seen in the benchmark in the README, performance on such use cases is roughly competitive with XGBoost, which is a quite optimized implementation.

Absolutely. My intention was to highlight that the reason your EvoTrees experience was slow lies more in the data prep than in the tree building. That being said, in a Random Forest context, just as for boosting, if the number of trees is large, this initialization step becomes less significant.
Also, this initialization has not been as carefully optimized as the tree-building step, given its marginal impact in large boosted models. Therefore, there is some low-hanging fruit on that side.

I’ll keep in mind the option of adding an RF mode to EvoTrees, as it should be fairly simple to integrate within the current framework (likely not before a couple of weeks though!).

About Variability in XGBoost.jl:

I finally found the argument “seed”, which can be passed in the call to the function xgboost and varied between runs (to get variability). It plays the same role as “rng” in EvoTrees.jl.

Thanks for the info

I did look at the benchmark before using EvoTrees, but maybe it would be useful to emphasize this size specificity in the docs.

About Random Forests

Actually, I was wrong: XGBoost already allows fitting RF models (Random Forests(TM) in XGBoost — xgboost 2.0.0-dev documentation),
for a very similar type of forest as, for instance, ranger.
(A main difference is that ranger uses bootstrap sampling of the observations, while XGBoost samples the observations without replacement, as is usual in stochastic gradient boosting.)
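The difference between the two sampling schemes can be sketched in plain Julia (Random standard library only; illustrative, not package code):

```julia
using Random

n = 100
# Bootstrap (ranger-style): draw n row indices WITH replacement -> duplicates expected
boot_rows = rand(1:n, n)

# Subsampling (XGBoost subsample-style): draw a fraction WITHOUT replacement -> all unique
sub_rows = randperm(n)[1:round(Int, 0.7n)]
```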

An example of RF syntax with XGBoost.jl:

num_round = 1 ;
fm = xgboost(Xtrain, num_round; label = Float64.(vec(ytrain)),
           seed = Int64(round(rand(1)[1] * 10000)),
           booster = :gbtree,
           tree_method = :exact,
           num_parallel_tree = 50,
           subsample = .7,
           colsample_bytree = 1,
           colsample_bylevel = 1,
           colsample_bynode = 1/3,
           max_depth = 20,
           min_child_weight = 5,
           lambda = 0,
           eta = 1,
           verbosity = 0)

pred = XGBoost.predict(fm, Xtest)

On my data, I observed very similar prediction error rates between ranger and the RF models built with the R and Julia XGBoost packages, and similar computation times: e.g. the three cited APIs all took ~0.50 sec to build a forest of 50 trees of max_depth = 20 over X(1100, 700), which is great and very encouraging ).
