Hello,
I am trying to run XGBoostClassifier on a dataset that doesn't fit into RAM, so I am getting an OutOfMemoryError(). I'm not very proficient at writing code, but I think possible solutions could range from mini-batches to warm starts; I just couldn't find tips on how to implement them (I tried many queries with ChatGPT and similar tools, as well as JuliaHub Ask AI, but nothing was really helpful, only the suggestion to convert the data to Arrow to reduce the file's memory footprint). It also seems JuliaDB could work, but community discussions are not encouraging (it appears to be unmaintained, and I couldn't figure out working code for it).
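To make the mini-batch / warm-start idea concrete, here is roughly the kind of loop I have in mind, sketched with the lower-level XGBoost.jl API on random placeholder data. I am only assuming that update! keeps boosting the same Booster across batches, so please correct me if that is not how it actually works:
using XGBoost
# placeholder data just to illustrate the loop I imagine
X = randn(10_000, 20)
y = rand(0:1, 10_000)
# split the row indices into mini-batches
batches = collect(Iterators.partition(1:size(X, 1), 1_000))
# first batch: build the initial booster
rows1 = batches[1]
booster = xgboost((X[rows1, :], y[rows1]);
                  num_round=10, objective="binary:logistic")
# remaining batches: keep boosting the same model ("warm start"?)
for rows in batches[2:end]
    update!(booster, (X[rows, :], y[rows]); num_round=10)
end
If something like this is viable, I suppose each batch would be read from the Arrow file instead of generated randomly.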
I'm open to trying whatever seems promising.
Can anyone help with code that could handle the task?
My current code is below (it works fine on a smaller dataset):
using CSV, DataFrames, MLJ, Arrow
# Load Features/Target Table as DataFrame
filename = "bigdata.arrow"
path = joinpath("C:\\Users\\User\\Desktop\\BigData", filename)
df = DataFrame(Arrow.Table(path))
# Load XGBoostClassifier and One Hot Encoder
XGBC = @load XGBoostClassifier
xgb = XGBC()
ohe = OneHotEncoder()
# Pipeline OneHotEncoder > XGBoost
xgb_pipe = ohe |> xgb
# Setting Target and Features tables:
y, X = unpack(df, ==(:y_label), col->true)
# Train/test split over row indices
train, test = partition(1:length(y), 0.5, shuffle=true)
# Bind the pipeline to the data and train on the training rows only
xgbm = machine(xgb_pipe, X, y, cache=false)
fit!(xgbm, rows=train, verbosity=0)
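A side question about the Arrow suggestion: since Arrow.Table memory-maps the file, would building the DataFrame without copying the columns at least help with the loading step? Something like:
# keep the DataFrame backed by the memory-mapped Arrow columns instead of copying them
df = DataFrame(Arrow.Table(path); copycols=false)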
Thanks a lot!