Memory usage problem with MLJ + LightGBM and grid search

Hi all,

We are building an ML application using MLJ and LightGBM to produce forecasts on complex time series (for a fairly critical business need, running in production with daily use).

We have a blocking issue at the training stage of the model tuning (about one hundred models, each based on roughly 50k observations and 10 features, tuned over a relatively small hyperparameter grid of 768 combinations).

The associated memory usage is far too high and brings down the servers (virtual machines).

It is already high for a single tuning, and it also keeps growing rapidly over successive batches of tunings (for the different models).

We took care to disable caching in MLJ (cache=false both when constructing machines and in the TunedModel definition).

To help with the diagnosis and to ask for your help :slightly_smiling_face:, below is a minimal working example with two tests that show the memory usage problem.

  • FIRST TEST: the script performs a single hyperparameter search. RAM usage jumps by 9 GB.

  • SECOND TEST: the script performs multiple batches of tunings. RAM usage jumps by 35 GB.

What explains such memory usage, both for a single tuning and for the sequence of tunings? Shouldn't most of the memory be released after each individual train and after each tuning? What am I missing here?

# LOAD DEPENDENCIES ############################################################
using Pkg
Pkg.activate(".")
using MLJ
using Random
using DataFrames
using LightGBM

# My Project.toml file
# Status `~/Project.toml`
#   [a93c6f00] DataFrames v1.6.1
#   [7acf609c] LightGBM v0.6.2
#   [add582a8] MLJ v0.20.3
#   [03970b2e] MLJTuning v0.8.4
#   [9a3f8284] Random
# With Julia 1.10.2 on Ubuntu 23.10

# SOME REPRESENTATIVE TEST DATA ################################################
# A ten-column dataframe of regressors (continuous vars)
df = DataFrame(
    map(x -> rand(50_000), 1:10),
    ["x_$i" for i in 1:10]
)
# A target var that is the sum of the columns plus Gaussian noise
DataFrames.transform!(df,  AsTable(1:10) => (x -> sum(x) + randn(50_000)) => :y)

# PREPARATION OF THE MODEL, PIPELINE AND BASIC HYPER PARAMS TUNING STRATEGY ####
# We can pick a smaller or more granular grid via the grid resolution param
function prepare_gb_tuned_model(grid_resolution::Int64)
    # Get LightGBM.MLJInterface.LGBMRegressor
    Tree = LightGBM.MLJInterface.LGBMRegressor
    # Instantiate the model
    ml_model = Tree()

    # Pipeline definition
    pipe = OneHotEncoder(ordered_factor=false) |> ml_model     # In my project I would have preprocessing in the pipeline (for categorical variables)

    # Ranges for the grid
    # LGBMRegressor
    ml_ranges = [
        range(pipe, :(lgbm_regressor.num_iterations), lower=50, upper=300),
        range(pipe, :(lgbm_regressor.max_depth), lower=5, upper=20),
        range(pipe, :(lgbm_regressor.feature_fraction), values=[0.7, 0.85, 1.0]),
        range(pipe, :(lgbm_regressor.learning_rate), lower=0.02, upper=0.2, scale=:log),
        range(pipe, :(lgbm_regressor.min_data_in_leaf), lower=10, upper=30, scale=:log),
    ]

    # Tuning strategy
    tuned_model = TunedModel(
        model=pipe,
        tuning=Grid(resolution=grid_resolution),
        ranges=ml_ranges,
        measure=mae,
        train_best=true,
        cache=false     # to make sure I reduce the memory footprint
    )

    return tuned_model
end

mach = machine(
    prepare_gb_tuned_model(4),      # a grid of 768 hyperparameter sets
    select(df, Not(:y)),            # features: the ten x_i columns
    df.y,                           # target
    cache=false     # to make sure I reduce the memory footprint
)
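# Grid size check (my understanding of how Grid expands these ranges): with
# resolution=4 each of the four numeric ranges contributes 4 values, and
# feature_fraction its 3 explicit values, i.e. 4 * 4 * 3 * 4 * 4 = 768 candidates.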

# RUN THE TRAIN AND SEARCH FOR HYPER PARAMS ####################################

# FIRST TEST ===================================================================
# MEMORY STATUS
# free -h
#               total        used        free      shared  buff/cache   available
# Mem:            62Gi       8.0Gi        53Gi       2.0Gi       3.9Gi        54Gi

MLJ.evaluate!(
    mach,
    resampling=CV(nfolds=3),
    measure=mae,
    acceleration=CPU1(),
    verbosity=2,
    # Do not record all obs-pred comparison
    per_observation=false       # it may reduce the memory footprint
)

# MEMORY STATUS
#                total        used        free      shared  buff/cache   available
# Mem:            62Gi        17Gi        43Gi       2.0Gi       4.0Gi        45Gi

# => Memory footprint of ~9 GB


# SECOND TEST ==================================================================

# MEMORY STATUS
# free -h
#                total        used        free      shared  buff/cache   available
# Mem:            62Gi        17Gi        43Gi       2.0Gi       4.0Gi        45Gi

# => Repeat it 5 times with a map
outputs = map(
    x -> MLJ.evaluate!(
        mach,
        resampling=CV(nfolds=3),
        measure=mae,
        acceleration=CPU1(),
        verbosity=2,
        # Do not record all obs-pred comparison
        per_observation=false       # it may reduce the memory footprint
    ),
    1:5         # do it 5 times      
)

# MEMORY STATUS
# free -h
#                total        used        free      shared  buff/cache   available
# Mem:            62Gi        52Gi       8.3Gi       2.0Gi       4.1Gi       9.6Gi

# => Memory footprint of ~35 GB
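
As a rough diagnostic for anyone reproducing this, the lines below check how much of that memory is actually retained by the machine itself. Note that Base.summarysize presumably gives only a lower bound, since it cannot see buffers owned by the LightGBM C library.

# ROUGH DIAGNOSTIC SKETCH ######################################################
GC.gc()                                                 # force a full collection
println("retained by mach:   ", Base.summarysize(mach) / 2^30, " GiB")
println("system free memory: ", Sys.free_memory() / 2^30, " GiB")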

In MLJTuning 0.8.3 full evaluation objects were added to the history. Does using MLJTuning 0.8.2 mitigate your issue?

Thanks for the info. I will try the 0.8.2 version.
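For the record, pinning the older version in the project environment should be something like this (assuming MLJ's compat bounds allow the downgrade):

using Pkg
Pkg.status("MLJTuning")                        # check the currently resolved version
Pkg.add(name="MLJTuning", version="0.8.2")     # downgrade
Pkg.pin("MLJTuning")                           # stop the resolver from upgrading it again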

@julien_goo Did using the old version help?

Hi @ablaom
I ran the test with version 0.8.2. No memory problem anymore.
What is the plan for the next release regarding how the history is saved?

I ran further tests, with 20 hours of training at once: no issue with the 0.8.2 version.

That’s fantastic news, although a little surprising after some offline discussions.

A proposal for a remedy is:

Current evaluation objects (recently added to TunedModel histories) are too big · Issue #1105 · alan-turing-institute/MLJ.jl · GitHub
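
Applied to the MWE above, a rough check of how much the history itself retains could look like the sketch below (assuming the machine is left fitted and the report exposes a history field, as on recent MLJTuning versions):

h = report(mach).history                       # tuning history held by the machine
println(length(h), " entries, ≈ ", Base.summarysize(h) / 2^20, " MiB in total")
println("fields per entry: ", propertynames(first(h)))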

Interesting solution!
To roughly evaluate the memory impact of saving predictions and observed values in my context:

  • 3 folds for my custom CV
  • 50,000 rows for the train set
  • a search grid of 1,000 hyperparameter sets
  • 100 TunedModels to fit

That gives 15 billion floats, possibly multiplied by 2 (predictions and observations); a quick check of the arithmetic is sketched below. This order of magnitude appears quite in line with the memory bumps observed. Does that help to diagnose the issue?
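
In Julia, with the figures listed above and assuming Float64 values (8 bytes each):

nfolds, nrows, ngrid, nmodels = 3, 50_000, 1_000, 100
nfloats = nfolds * nrows * ngrid * nmodels      # 15 billion floats
nfloats * 8 / 2^30                              # ≈ 112 GiB, or ≈ 224 GiB with the factor of 2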