DataFrames and MLJ

Hi all,
I am experiencing some weird behavior from Pluto. I have a dataset that contains more than 1M observations, although the size is about 75M. I’ve been able to clean up the dataset and address the missing values. However, when I use the ContinuousEncoder to (1) one-hot encode the categorical values and (2) ensure the continuous values are enforced, it bloats my dataset to the point where a PCA for dimensionality reduction won’t work. I get a message in Pluto that the process has exited; it seems that the underlying Malt worker has crashed.
What’s the right way to execute this using the Distributed package? What’s the correct workflow for processing data frames in MLJ with distributed or multi-threaded processes? Feel free to share some pointers or examples.
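Roughly, the code looks like this (a sketch with illustrative names and settings, not my exact notebook):

```julia
using MLJ  # ContinuousEncoder ships with MLJ; PCA comes from MultivariateStats

PCA = @load PCA pkg=MultivariateStats

# df is the cleaned DataFrame: one-hot encode the Multiclass columns,
# coerce everything else to Continuous, then reduce dimension
pipe = ContinuousEncoder() |> PCA(maxoutdim = 10)

mach = machine(pipe, df)
fit!(mach)                  # <- this is where the worker dies
Xreduced = transform(mach, df)
```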
Thanks

That doesn’t sound like a Pluto issue, but simply OOM? I’m not sure what you mean by “1M observations, although the size is about 75M”, but if you’ve got 75M rows and you one-hot encode some column, generating a bunch of additional columns, it’s maybe not surprising that you run out of memory?
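For a rough sense of scale (numbers purely illustrative): one-hot encoding a single 100-level categorical column into Float64 columns costs 8 bytes × 100 columns per row, i.e. about 800 MB over 1M rows and about 60 GB over 75M rows, per such column.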

In any event, it’s probably useful to just run your code in the REPL to try to get an error message that isn’t obscured by Malt.

Sorry! The way I put it was misleading. I actually meant to indicate that I was running the code from Pluto.
The actual issue is with the one-hot encoding, which bloats the dataset; I suspect the memory allocation then fails, and so the process exits…
What I really wanted from the post was to find out what workflow you all use, especially a distributed one, when handling a large dataset with DataFrames and MLJ.
Regards

One-hot encoding bloat is exacerbated by any high-cardinality features you have. You could try entity embedding: instead of a fixed high-dimensional representation of a feature class, you learn a lower-dimensional representation by training a supervised neural network (NN) that includes embedding layers. The NN may not give the best predictive performance, but once you have the embeddings, you can use them instead of one-hot encoding with whatever supervised model you like. (EvoTreesClassifier and EvoTreesRegressor, the Julia-native gradient tree boosters, are pretty good first choices for structured data.)

The NN models provided by MLJFlux now provide entity embedding; an example and a citation of the entity embedding paper are here. Ideally, you should think about the dimension you need for each multi-class variable and specify these explicitly, as the defaults are just crude caps on the dimension.

@EssamWisam is working on a version of these models that can be used in a pipeline, but for now you will need to train the NN and apply transform separately, before manually passing the result on to your supervised model of choice.
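An untested sketch of that two-stage workflow (the feature names :city and :product, the embedding dimensions, and the hyperparameters are all made up; check the MLJFlux docs for the exact `embedding_dims` semantics):

```julia
using MLJ

NNClassifier = @load NeuralNetworkClassifier pkg=MLJFlux
EvoTreesClassifier = @load EvoTreesClassifier pkg=EvoTrees

# Train a NN whose Multiclass inputs pass through embedding layers.
# `embedding_dims` maps each categorical feature to its embedding
# dimension (values here are illustrative -- base them on cardinality):
nn = NNClassifier(embedding_dims = Dict(:city => 8, :product => 4),
                  epochs = 20)
mach_nn = machine(nn, X, y)
fit!(mach_nn)

# Extract the learned continuous representation of X:
Xembedded = transform(mach_nn, X)

# Pass the embedded features on to the booster:
mach_tree = machine(EvoTreesClassifier(), Xembedded, y)
fit!(mach_tree)
```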

To iron out any issues, I suggest you try this out first with a much smaller version of your dataset on a single process.
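For example, something like (assuming your data lives in a DataFrame `df`):

```julia
using DataFrames, Random

# Prototype on a random ~1% sample before scaling up:
sample_rows = randsubseq(MersenneTwister(0), 1:nrow(df), 0.01)
df_small = df[sample_rows, :]
```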