EvoTrees.jl just went through a significant refurbishment for v0.15.0:
It’s now possible to train directly from Tables-like structures, most notably DataFrames and named tuples, which are natural ways in which tabular data presents itself:
```julia
using EvoTrees, DataFrames

# synthetic data (same shapes as the Matrix/Vector example below)
x_train, y_train = rand(1_000, 10), rand(1_000)

config = EvoTreeRegressor()
dtrain = DataFrame(x_train, :auto)
dtrain.y .= y_train
m = fit_evotree(config, dtrain; target_name="y")
pred = m(dtrain)
```
Support for the original Matrix/Vector-based data remains:
```julia
x_train, y_train = rand(1_000, 10), rand(1_000)
m = fit_evotree(config; x_train, y_train)
pred = m(x_train)
```
When using a Tables-compatible data input, features with `Categorical` element types are automatically recognized as input features. Alternatively, the `fnames` kwarg can be used to explicitly specify the feature variables:
```julia
m = fit_evotree(config, dtrain; target_name="y", fnames=["x1", "x3"]);
```
Categorical features are treated accordingly by the algorithm: ordered variables are treated as numerical features, using a `≤` split rule, while unordered variables use an `==` split rule. Support is currently limited to a maximum of 255 levels. `Bool` variables are treated as unordered, 2-level categorical variables.
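As an illustrative sketch (the column names and data here are made up), a DataFrame mixing numerical, ordered, unordered, and Bool features can be passed directly, with CategoricalArrays.jl marking the categorical columns:

```julia
using EvoTrees, DataFrames, CategoricalArrays

df = DataFrame(
    x1 = rand(1_000),                                                    # numerical: ≤ splits
    x2 = categorical(rand(["low", "mid", "high"], 1_000); ordered=true), # ordered cat: ≤ splits
    x3 = categorical(rand(["red", "green", "blue"], 1_000)),             # unordered cat: == splits
    x4 = rand(Bool, 1_000),                                              # treated as 2-level unordered cat
)
df.y = rand(1_000)

config = EvoTreeRegressor()
m = fit_evotree(config, df; target_name="y")
```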
GPU memory footprint has been significantly reduced thanks to a single histogram being kept in GPU RAM, instead of three for every node of a tree.
Training on “cpu” or “gpu” is now controlled through the `device` kwarg passed to `fit_evotree` (it is no longer part of the model constructor, as in `EvoTreeRegressor(device="gpu")`).
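A minimal sketch of the new call pattern, assuming `device` takes "cpu" (the default) or "gpu":

```julia
m_cpu = fit_evotree(config, dtrain; target_name="y")                # defaults to "cpu"
m_gpu = fit_evotree(config, dtrain; target_name="y", device="gpu")  # train on GPU
```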
All GPU-specific structs have been removed; common CPU-based structs are now used for both CPU and GPU training (GPU-specific objects are kept in the cache).
EvoTree model constructors used to support the `T` kwarg to specify either `Float32` or `Float64` as the basis for computation, e.g. `EvoTreeRegressor(T=Float64)`. This has been dropped in v0.15: calculations at the observation level are now handled as `Float32`, while accumulations are done with `Float64`. This provides the best of both worlds: it solves some numerical instabilities observed with `Float32` on some larger datasets, while keeping performance similar to full `Float32` computation.
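In other words (a sketch of the API change, with the old kwarg shown for contrast):

```julia
# v0.14 and earlier: precision was chosen at model construction
# config = EvoTreeRegressor(T=Float64)

# v0.15: no T kwarg; Float32 is used internally for observation-level
# calculations and Float64 for accumulations
config = EvoTreeRegressor()
```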