Training an MLJ model on a large dataset

Hello, I am somewhat new to Julia and the MLJ library. I am currently using a preexisting model and would like to train it on a large dataset. However, I run out of memory when calling fit with all the data. Each data entry is a paragraph, with an embedding for each sentence (each embedding has up to 1,836 numbers). With paragraphs averaging about 6 sentences, each data entry is roughly 11,000 floats, which adds up quickly across 100,000 processed paragraphs. Does anyone know an effective way to train on increments of the data so I can use all of it?


What kind of model are you fitting? It’s really up to the implementation whether it can be trained in a streaming fashion (for example, anything trained with a variant of SGD).

MLJ currently requires that all data fit into memory. Some copying of the data happens by default, which exists to speed up hyper-parameter optimization. However, this can be switched off by specifying cache=false, as in machine(model, X, y, cache=false), and in some other places, such as TunedModel.
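For illustration, here is a minimal sketch of turning off that caching; the particular model (a decision-tree classifier, assuming MLJDecisionTreeInterface is installed) and the iris data are placeholders, not from the original post:

```julia
using MLJ

# Placeholder data and model; the point is only the cache keyword.
X, y = @load_iris
Tree = @load DecisionTreeClassifier pkg=DecisionTree

mach = machine(Tree(), X, y, cache=false)   # skip MLJ's internal data caching
fit!(mach)
```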

The “deep learning” package Flux.jl implements a large class of gradient-descent models which can be trained incrementally. You can find some basic text-analysis examples in the model-zoo. Large datasets are typically handled using DataLoaders.jl (for standard text corpora, see also CorpusLoaders.jl). Flux does not provide a lot of tooling beyond building and training models (as you get with MLJ); for that, you may want to look at FastAI.jl. A sketch of mini-batch training with Flux is shown below.
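If you go the Flux route, incremental (mini-batch) training with Flux's built-in DataLoader might look roughly like this. The layer sizes, loss, and random data are illustrative assumptions only, and the sketch assumes a recent Flux version with the explicit-gradient API:

```julia
using Flux

# Illustrative dimensions: 1836-float sentence embeddings, 2 classes.
model = Chain(Dense(1836 => 64, relu), Dense(64 => 2))

# Fake data standing in for the real corpus: features are columns.
X = rand(Float32, 1836, 1_000)
y = Flux.onehotbatch(rand(1:2, 1_000), 1:2)

# DataLoader yields mini-batches, so only one batch is processed at a time.
loader = Flux.DataLoader((X, y), batchsize=64, shuffle=true)

opt_state = Flux.setup(Adam(), model)
for epoch in 1:10
    for (xb, yb) in loader
        grads = Flux.gradient(m -> Flux.logitcrossentropy(m(xb), yb), model)
        Flux.update!(opt_state, model, grads[1])
    end
end
```

In practice the batches could also be read from disk inside the loop, so the full 100,000-paragraph dataset never needs to sit in memory at once.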

It may be that after certain reductions (e.g., a TF-IDF transformation, provided by TextAnalysis.jl) your data is small enough to fit into memory, which would open up the possibility of using other models, such as those provided by MLJ.
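As a rough sketch of that TF-IDF route with TextAnalysis.jl (the two toy documents are placeholders; in practice each document would be one of your paragraphs):

```julia
using TextAnalysis

# Toy corpus standing in for the real paragraphs.
docs = [StringDocument("the quick brown fox"),
        StringDocument("jumped over the lazy dog")]
crps = Corpus(docs)

update_lexicon!(crps)            # build the vocabulary
dtm = DocumentTermMatrix(crps)   # sparse document-term matrix
X = tf_idf(dtm)                  # sparse TF-IDF-weighted matrix
```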
