XGBoostClassifier on a bigger-than-RAM dataset

Hello,
I am trying to run XGBoostClassifier on a dataset that doesn’t fit into RAM, so I am getting OutOfMemoryError(). I’m not very proficient at writing code, but I think solutions could range from mini-batches and/or warm starts; however, I couldn’t find tips on how to implement them (many queries to ChatGPT and similar, also JuliaHub Ask AI, but nothing really helpful, only the suggestion to convert the data to Arrow to reduce the file’s memory footprint). It also seems JuliaDB could work, but community discussion about it is not encouraging (it seems it is no longer maintained, and I couldn’t figure out code to run it).

I’m open to trying whatever seems likely to help.
Can anyone help with code that might handle the task?

My code is as follows (for a smaller dataset, it works properly):

using CSV, DataFrames, MLJ, Arrow

# Load Features/Target Table as DataFrame
filename = "bigdata.arrow"
path    = joinpath("C:\\Users\\User\\Desktop\\BigData", filename)
df = DataFrame(Arrow.Table(path))

# Load XGBoostClassifier and One Hot Encoder
XGBC = @load XGBoostClassifier
xgb = XGBC()
ohe = OneHotEncoder()

# Pipeline OneHotEncoder > XGBoost
xgb_pipe = ohe |> xgb

# Setting Target and Features tables:
y, X = unpack(df, ==(:y_label), col->true)

train, test = partition(1:length(y), 0.5, shuffle=true)

xgbm = machine(xgb_pipe, X, y, cache=false)
fit!(xgbm, rows=train, verbosity=0)

Thanks a lot!

Do the XGBoost libraries in any other language support larger-than-memory training? One idea to explore would be using mmap-ed input data (e.g. via Arrow). You may need to dig into lower-level interfaces for this, however, as a lot of MLJ functionality likes to eagerly materialize data in memory.
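
For what it’s worth, here is a minimal sketch of the memory-mapping idea with the Arrow file from your post (same path and file name). The assumption is that Arrow.Table memory-maps the file and that passing copycols=false to the DataFrame constructor wraps the read-only Arrow columns instead of copying them into ordinary vectors:

using Arrow, DataFrames

path = joinpath("C:\\Users\\User\\Desktop\\BigData", "bigdata.arrow")

# Arrow.Table memory-maps the file, so opening it does not pull the data into RAM.
tbl = Arrow.Table(path)

# copycols=false keeps the memory-mapped Arrow columns rather than copying them
# into ordinary Vectors, so constructing the DataFrame itself stays cheap.
df = DataFrame(tbl; copycols=false)

The catch is the point above: the one-hot encoder and xgboost’s own internal data structures will still materialize things downstream, so this only keeps the loading step out of RAM.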

I’m not well versed in programming, but from what I have read in the documentation, there are some clues about possibilities, such as:

  • Command Line Parameters: model_in: "Path to input model, needed for test, eval, dump tasks. If it is specified in training, XGBoost will continue training from the input model" >> It seems to me this allows warm starts, which could be used to train the model in chunks, but I can’t find this parameter in the Julia package (see the first sketch after this list).

  • subsample parameter: An alternative might be to set the subsampling factor to a low number (e.g. 0.1), which seems to result in lower memory usage. But the documentation says that to use a low value you need to set sampling_method to gradient_based, and that this method only works with CUDA devices. I don’t know how to use it, and I don’t even know whether that restriction applies here (see the second sketch after this list).
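
On the warm-start / chunked-training idea: I haven’t verified this on a real larger-than-RAM file, but here is a rough sketch using the lower-level XGBoost.jl package directly instead of the MLJ wrapper. It assumes the Booster/update! API of XGBoost.jl 2.x (Booster accepting hyperparameters as keyword arguments, and update! adding boosting rounds to an existing booster); the chunks iterator, the hyperparameter values, and num_round per chunk are all placeholders:

using XGBoost   # the lower-level package, not the MLJ wrapper

# `chunks` is assumed to be an iterator of (X, y) pairs, where each X is a
# feature matrix and y a 0/1 label vector small enough to fit in RAM
# (e.g. built from row ranges of a memory-mapped Arrow table).
# Producing the chunks is not shown here.
function train_in_chunks(chunks)
    booster = nothing
    for (X, y) in chunks
        dtrain = XGBoost.DMatrix(X, y)
        if booster === nothing
            # First chunk: create the booster with the usual hyperparameters.
            booster = XGBoost.Booster(dtrain;
                                      objective = "binary:logistic",
                                      max_depth = 6,
                                      eta = 0.1)
        end
        # Each call adds more boosting rounds on top of the existing trees,
        # i.e. it continues ("warm starts") from the model fitted so far.
        XGBoost.update!(booster, dtrain; num_round = 10)
    end
    return booster
end

Note this is not equivalent to training on the whole dataset at once: the trees added for each chunk only ever see that chunk’s gradients, so the result will differ from a single fit on all the data.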
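
And on the subsample idea, via the MLJ wrapper: as far as I understand, the wrapper exposes xgboost’s hyperparameters as model fields, so something like the following should plug into your existing pipeline. Whether sampling_method = "gradient_based" really requires a CUDA device I can’t confirm; the values and the tree_method choice below are placeholders to experiment with, and which fields are available may depend on your MLJXGBoostInterface version:

using MLJ

XGBC = @load XGBoostClassifier pkg=XGBoost verbosity=0

# Placeholder settings. Per the xgboost documentation, sampling_method =
# "gradient_based" is only supported on CUDA devices (tree_method = "gpu_hist"
# in older xgboost versions; newer ones use tree_method = "hist" together with
# a CUDA device setting).
xgb = XGBC(subsample = 0.1,
           sampling_method = "gradient_based",
           tree_method = "gpu_hist")

xgb_pipe = OneHotEncoder() |> xgb

A plain subsample = 0.1 with the default uniform sampling should also run on CPU, though if I read the docs correctly they suggest keeping subsample at 0.5 or higher in that case for good results.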

I would love to try running any proposed code and report back here if it works…