XGBoostClassifier on a bigger-than-RAM dataset

Hello,
I am trying to run XGBoostClassifier on a dataset that doesn’t fit into RAM, so I am getting OutOfMemoryError(). I’m not very proficient at writing code, but I think solutions could range from mini-batches and/or warm starts; however, I couldn’t find tips on how to implement them (many queries to ChatGPT and similar, also JuliaHub Ask AI, but nothing really helpful, only the suggestion to convert the data to Arrow to reduce the file’s memory footprint). It also seems JuliaDB could work, but community discussion about it is not encouraging (it seems it is no longer maintained, and I couldn’t figure out code to run it).

I’m open to trying whatever seems likely to help.
Can anyone help with code that might handle the task?

My code is as follows (for a smaller dataset, it works properly):

using CSV, DataFrames, MLJ, Arrow

# Load Features/Target Table as DataFrame
filename = "bigdata.arrow"
path    = joinpath("C:\\Users\\User\\Desktop\\BigData", filename)
df = DataFrame(Arrow.Table(path))

# Load XGBoostClassifier and One Hot Encoder
XGBC = @load XGBoostClassifier
xgb = XGBC()
ohe = OneHotEncoder()

# Pipeline OneHotEncoder > XGBoost
xgb_pipe = ohe |> xgb

# Setting Target and Features tables:
y, X = unpack(df, ==(:y_label), col->true)

train, test = partition(1:length(y), 0.5, shuffle=true)

xgbm = machine(xgb_pipe, X, y, cache=false)
fit!(xgbm, rows=train, verbosity=0)

Thanks a lot!

Do the XGBoost libraries in any other language support larger-than-memory training? One idea to explore would be using mmap-ed input data (e.g. via Arrow). You may need to dig into lower-level interfaces for this, however, as a lot of MLJ functionality likes to eagerly materialize data in memory.
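
For what it’s worth, here is a minimal sketch of the memory-mapping idea with the Arrow file from your post (same path and file name). The assumption is that Arrow.Table memory-maps the file and that passing copycols=false to the DataFrame constructor wraps the read-only Arrow columns instead of copying them into ordinary vectors:

using Arrow, DataFrames

path = joinpath("C:\\Users\\User\\Desktop\\BigData", "bigdata.arrow")

# Arrow.Table memory-maps the file, so opening it does not pull the data into RAM.
tbl = Arrow.Table(path)

# copycols=false keeps the memory-mapped Arrow columns rather than copying them
# into ordinary Vectors, so constructing the DataFrame itself stays cheap.
df = DataFrame(tbl; copycols=false)

The catch is the point above: the one-hot encoder and xgboost’s own internal data structures will still materialize things downstream, so this only keeps the loading step out of RAM.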

I’m not well versed in programming, but from what I have read in the documentation, there are some clues about possibilities, such as:

  • Command Line Parameters: model_in: "Path to input model, needed for test, eval, dump tasks. If it is specified in training, XGBoost will continue training from the input model" >> It seems to me this allows warm starts, which could be used to train the model in chunks, but I can’t find this parameter in the Julia package (see the first sketch after this list).

  • subsample parameter: An alternative might be to set the subsampling factor to a low number (e.g. 0.1), which seems to result in lower memory usage. But the documentation says that to use a low value you need to set sampling_method to gradient_based, and that this method only works with CUDA devices. I don’t know how to use it, and I don’t even know whether that restriction applies here (see the second sketch after this list).
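
On the warm-start / chunked-training idea: I haven’t verified this on a real larger-than-RAM file, but here is a rough sketch using the lower-level XGBoost.jl package directly instead of the MLJ wrapper. It assumes the Booster/update! API of XGBoost.jl 2.x (Booster accepting hyperparameters as keyword arguments, and update! adding boosting rounds to an existing booster); the chunks iterator, the hyperparameter values, and num_round per chunk are all placeholders:

using XGBoost   # the lower-level package, not the MLJ wrapper

# `chunks` is assumed to be an iterator of (X, y) pairs, where each X is a
# feature matrix and y a 0/1 label vector small enough to fit in RAM
# (e.g. built from row ranges of a memory-mapped Arrow table).
# Producing the chunks is not shown here.
function train_in_chunks(chunks)
    booster = nothing
    for (X, y) in chunks
        dtrain = XGBoost.DMatrix(X, y)
        if booster === nothing
            # First chunk: create the booster with the usual hyperparameters.
            booster = XGBoost.Booster(dtrain;
                                      objective = "binary:logistic",
                                      max_depth = 6,
                                      eta = 0.1)
        end
        # Each call adds more boosting rounds on top of the existing trees,
        # i.e. it continues ("warm starts") from the model fitted so far.
        XGBoost.update!(booster, dtrain; num_round = 10)
    end
    return booster
end

Note this is not equivalent to training on the whole dataset at once: the trees added for each chunk only ever see that chunk’s gradients, so the result will differ from a single fit on all the data.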
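
And on the subsample idea, via the MLJ wrapper: as far as I understand, the wrapper exposes xgboost’s hyperparameters as model fields, so something like the following should plug into your existing pipeline. Whether sampling_method = "gradient_based" really requires a CUDA device I can’t confirm; the values and the tree_method choice below are placeholders to experiment with, and which fields are available may depend on your MLJXGBoostInterface version:

using MLJ

XGBC = @load XGBoostClassifier pkg=XGBoost verbosity=0

# Placeholder settings. Per the xgboost documentation, sampling_method =
# "gradient_based" is only supported on CUDA devices (tree_method = "gpu_hist"
# in older xgboost versions; newer ones use tree_method = "hist" together with
# a CUDA device setting).
xgb = XGBC(subsample = 0.1,
           sampling_method = "gradient_based",
           tree_method = "gpu_hist")

xgb_pipe = OneHotEncoder() |> xgb

A plain subsample = 0.1 with the default uniform sampling should also run on CPU, though if I read the docs correctly they suggest keeping subsample at 0.5 or higher in that case for good results.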

I would love to try running any proposed code and report back here if it works…