Computations are now almost entirely done through KernelAbstractions.jl. The objective is to eventually have full support for AMD / ROCm devices in addition to the currently supported NVIDIA / CUDA ones.
Significant performance increase, notably for larger max depth. Training time now increases close to linearly with depth.
Breaking change: improved reproducibility
Training returns exactly the same fitted model for a given learner (e.g. EvoTreeRegressor).
Reproducibility is respected for both CPU and GPU. However, results may differ between CPU and GPU; i.e., reproducibility is guaranteed only within the same device type.
The learner / model constructor (e.g. EvoTreeRegressor) now has a seed::Int argument to set the random seed. The legacy rng kwarg is now ignored.
The internal random number generator is now Xoshiro (previously MersenneTwister, seeded via rng::Int).
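A minimal sketch of the new seeding behavior, assuming the standard fit_evotree entry point and a regression setup; the data here is synthetic and purely illustrative:

```julia
using EvoTrees

# Construct a learner with an explicit seed (the legacy `rng` kwarg is ignored).
config = EvoTreeRegressor(nrounds=100, max_depth=5, seed=123)

# Synthetic data for illustration only.
x_train = randn(1_000, 10)
y_train = randn(1_000)

# Two runs of the same learner are expected to return identical fitted models
# (within the same device type).
m1 = fit_evotree(config; x_train, y_train)
m2 = fit_evotree(config; x_train, y_train)
```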
Added node weight information in fitted trees
The train weight reaching each split/leaf node is now stored in the fitted trees. It is accessible via model.trees[i].w for the i-th tree of the fitted model. This is notably intended to support SHAP value computations.
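For example, continuing from the fitted model in the sketch above, the per-node weights of a given tree can be read directly:

```julia
# Train weight reaching each split/leaf node of the first tree,
# via the `model.trees[i].w` accessor described above.
w1 = m1.trees[1].w
```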
Very cool! It looks like EvoTrees now consistently beats XGBoost. Do you know how it compares to CatBoost for speed/OOTB accuracy? My impression is that these days CatBoost is the gold standard for boosted decision trees.
I’m hoping I’ll soon be able to actually follow through on my project that could benefit from EvoTrees. The improved reproducibility will be a help there, and TreeSHAP would be very nice to have.
I’ve maintained some basic tabular benchmarks here: Evovest/MLBenchmarks.jl (ML model benchmarks on public datasets).
While I’m aware of the praise CatBoost gets, I haven’t seen it outperform on my problems of interest. It can also depend on the extent to which the hyper-params were properly tuned. XGBoost, LightGBM, and CatBoost can all be of interest, though they remain very similar algorithms.
Note that oblivious trees are supported, but I’ve only seen them underperform compared to the default binary mode; a sketch of the option follows below.
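For reference, opting into oblivious trees would look something like this; treat the exact keyword name as an assumption on my part:

```julia
# Sketch: oblivious trees instead of the default binary splits.
# The `tree_type` keyword name is assumed here.
config_obl = EvoTreeRegressor(nrounds=100, tree_type="oblivious")
```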
The timing for TreeSHAP remains to be seen. An external contributor has been looking at it; we may push to complete that feature in case he isn’t able to finish it.
Interesting. The thing I find most striking about https://arxiv.org/pdf/2506.16791 isn’t actually the CatBoost performance (which is reported favorably), but how close the untuned and tuned performance is:
This does line up with the impression I have of CatBoost doing a particularly good job with defaults.
Interesting. It does seem rather problem-dependent. I notice that in your benchmarks they’re broadly equivalent, with the exception of Boston, where the MSE seems markedly better for CatBoost.
Interesting, this seems similar to the question I just asked over on GitHub (about ordered boosting).
I would not put too much emphasis on untuned performance, given that it can depend heavily on somewhat arbitrary default choices.
For instance, XGBoost’s defaults use a limited number of iterations (100) along with a high learning rate (0.3), whereas CatBoost uses a large number of trees (1,000) along with a lower learning rate.
For EvoTrees, I initially had a very trivial 10 iterations with a high learning rate. It would not provide a great fit by default, but the intent then was not to have a good default but a minimal one: it was assumed that usage of such models would involve hyper-param tuning.
In light of actual usage, and of how some papers compare algorithms using default arguments as evidence of performance, I may reconsider and opt for stronger defaults: larger nrounds, lower eta, and some rowsample.
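To make that concrete, something along these lines; the specific values here are illustrative, not settled defaults:

```julia
# Hypothetical "stronger default" configuration: more rounds, lower
# learning rate, and row subsampling. Values are illustrative only.
config = EvoTreeRegressor(
    nrounds   = 1_000,  # larger number of boosting iterations
    eta       = 0.05,   # lower learning rate
    rowsample = 0.8,    # subsample a fraction of rows per iteration
)
```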
I think it very much highlights how tricky benchmarking can be: even performing an honest hyper-parameter tuning is non-trivial, and some knowledge of an algorithm’s hyper-param behavior is useful in setting up an efficient search.
As a user, I do look at untuned performance when available, because if it’s good across a variety of datasets, or at least on a dataset close to my domain, that’s reassuring that I’ll probably get good results without too much work. I’m not trying to scientifically evaluate which system is best; I’m just trying to find something that gets the job done easily.