EvoTrees.jl just went through a significant refurbishment for v0.15.0:
It’s now possible to train directly from Tables-like structures, most notably DataFrames and named tuples, which are natural ways in which tabular data presents itself:
```julia
using EvoTrees, DataFrames

# synthetic data (same shapes as the Matrix/Vector example below)
x_train, y_train = rand(1_000, 10), rand(1_000)

config = EvoTreeRegressor()
dtrain = DataFrame(x_train, :auto)
dtrain.y .= y_train
m = fit_evotree(config, dtrain; target_name="y")
pred = m(dtrain)
```
Support for the original Matrix/Vector-based data remains:
```julia
x_train, y_train = rand(1_000, 10), rand(1_000)
m = fit_evotree(config; x_train, y_train)
pred = m(x_train)
```
When using a Tables-compatible data input, features with `Categorical` element types are automatically recognized as input features. Alternatively, the `fnames` kwarg can be used to explicitly specify the feature variables:
```julia
m = fit_evotree(config, dtrain; target_name="y", fnames=["x1", "x3"]);
```
Categorical features are treated accordingly by the algorithm: ordered variables are treated as numerical features, using a `≤` split rule, while unordered variables use an `==` split rule. Support is currently limited to a maximum of 255 levels. `Bool` variables are treated as unordered, 2-level categorical variables.
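As an illustrative sketch (the column names and data here are made up), a DataFrame mixing numerical, ordered, unordered, and Bool features can be passed directly, with CategoricalArrays.jl marking the categorical columns:

```julia
using EvoTrees, DataFrames, CategoricalArrays

df = DataFrame(
    x1 = rand(1_000),                                                    # numerical: ≤ splits
    x2 = categorical(rand(["low", "mid", "high"], 1_000); ordered=true), # ordered cat: ≤ splits
    x3 = categorical(rand(["red", "green", "blue"], 1_000)),             # unordered cat: == splits
    x4 = rand(Bool, 1_000),                                              # treated as 2-level unordered cat
)
df.y = rand(1_000)

config = EvoTreeRegressor()
m = fit_evotree(config, df; target_name="y")
```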
GPU memory footprint has been significantly reduced thanks to a single histogram being kept in GPU RAM, instead of three for every node of a tree.
Training on “cpu” or “gpu” is now controlled through the `device` kwarg passed to `fit_evotree` (it is no longer part of the model constructor, as in `EvoTreeRegressor(device="gpu")`).
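A minimal sketch of the new call pattern, assuming `device` takes "cpu" (the default) or "gpu":

```julia
m_cpu = fit_evotree(config, dtrain; target_name="y")                # defaults to "cpu"
m_gpu = fit_evotree(config, dtrain; target_name="y", device="gpu")  # train on GPU
```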
All GPU-specific structs have been removed; common CPU-based structs are now used for both CPU and GPU training (GPU-specific objects are kept in the cache).
EvoTree model constructors used to support the `T` kwarg to specify either `Float32` or `Float64` as the basis for computation, e.g. `EvoTreeRegressor(T=Float64)`. This has been dropped in v0.15: calculations at the observation level are now handled as `Float32`, while accumulations are done with `Float64`. This provides the best of both worlds: it solves some numerical instabilities observed with `Float32` on some larger datasets, while keeping performance similar to full `Float32` computation.
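In other words (a sketch of the API change, with the old kwarg shown for contrast):

```julia
# v0.14 and earlier: precision was chosen at model construction
# config = EvoTreeRegressor(T=Float64)

# v0.15: no T kwarg; Float32 is used internally for observation-level
# calculations and Float64 for accumulations
config = EvoTreeRegressor()
```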