Training an MLJ model on a large dataset

Hello, I am somewhat new to Julia and the MLJ library. I am currently using a preexisting model and would like to train it on a large dataset. However, I run out of memory when calling fit with all the data. Each data entry is a paragraph, with an embedding for each sentence (each embedding has up to 1,836 numbers). With paragraphs averaging about 6 sentences, each data entry is roughly 11,000 floats, which adds up quickly across 100,000 processed paragraphs. Does anyone know an effective way to train on increments of the data so I can use all of it?


What kind of model are you fitting? It’s really up to the implementation whether it can be trained in a streaming fashion (for example, anything trained with a variant of SGD).

MLJ currently requires that all data fit into memory. Some copying of the data happens by default, which exists to speed up hyper-parameter optimization. However, this can be switched off by specifying cache=false, as in machine(model, X, y, cache=false), and in some other places, such as TunedModel.
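For illustration, here is a minimal sketch of turning off that caching; the particular model (a decision-tree classifier, assuming MLJDecisionTreeInterface is installed) and the iris data are placeholders, not from the original post:

```julia
using MLJ

# Placeholder data and model; the point is only the cache keyword.
X, y = @load_iris
Tree = @load DecisionTreeClassifier pkg=DecisionTree

mach = machine(Tree(), X, y, cache=false)   # skip MLJ's internal data caching
fit!(mach)
```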

The “deep learning” package Flux.jl implements a large class of gradient-descent models which can be trained incrementally. You can find some basic text-analysis examples in the model-zoo. Large datasets are typically handled using DataLoaders.jl (for standard text corpora, see also CorpusLoaders.jl). Flux does not provide a lot of tooling beyond building and training models (as you get with MLJ); for that, you may want to look at FastAI.jl. A sketch of mini-batch training with Flux is shown below.
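If you go the Flux route, incremental (mini-batch) training with Flux's built-in DataLoader might look roughly like this. The layer sizes, loss, and random data are illustrative assumptions only, and the sketch assumes a recent Flux version with the explicit-gradient API:

```julia
using Flux

# Illustrative dimensions: 1836-float sentence embeddings, 2 classes.
model = Chain(Dense(1836 => 64, relu), Dense(64 => 2))

# Fake data standing in for the real corpus: features are columns.
X = rand(Float32, 1836, 1_000)
y = Flux.onehotbatch(rand(1:2, 1_000), 1:2)

# DataLoader yields mini-batches, so only one batch is processed at a time.
loader = Flux.DataLoader((X, y), batchsize=64, shuffle=true)

opt_state = Flux.setup(Adam(), model)
for epoch in 1:10
    for (xb, yb) in loader
        grads = Flux.gradient(m -> Flux.logitcrossentropy(m(xb), yb), model)
        Flux.update!(opt_state, model, grads[1])
    end
end
```

In practice the batches could also be read from disk inside the loop, so the full 100,000-paragraph dataset never needs to sit in memory at once.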

It may be that after certain reductions (e.g., a TF-IDF transformation, provided by TextAnalysis.jl) your data is small enough to fit into memory, which would open up the possibility of using other models, such as those provided by MLJ.
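As a rough sketch of that TF-IDF route with TextAnalysis.jl (the two toy documents are placeholders; in practice each document would be one of your paragraphs):

```julia
using TextAnalysis

# Toy corpus standing in for the real paragraphs.
docs = [StringDocument("the quick brown fox"),
        StringDocument("jumped over the lazy dog")]
crps = Corpus(docs)

update_lexicon!(crps)            # build the vocabulary
dtm = DocumentTermMatrix(crps)   # sparse document-term matrix
X = tf_idf(dtm)                  # sparse TF-IDF-weighted matrix
```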
