DataFrames and MLJ

Hi all,
I am experiencing some weird behavior from Pluto. I have a dataset that contains more than 1M observations, although the size is about 75M. I’ve been able to clean up the dataset and address the missing values. However, when I use the ContinuousEncoder to (1) one-hot encode the categorical values and (2) ensure the continuous values are enforced, it bloats my dataset to the point where a PCA for dimensionality reduction won’t work. I get a message in Pluto that the process has exited; it seems that the underlying Malt worker has crashed.
What’s the right way to execute this using the Distributed package? What’s the correct workflow for processing data frames in MLJ with distributed or multi-threaded processes? Feel free to share some pointers or examples.
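Roughly, the code looks like this (a sketch with illustrative names and settings, not my exact notebook):

```julia
using MLJ  # ContinuousEncoder ships with MLJ; PCA comes from MultivariateStats

PCA = @load PCA pkg=MultivariateStats

# df is the cleaned DataFrame: one-hot encode the Multiclass columns,
# coerce everything else to Continuous, then reduce dimension
pipe = ContinuousEncoder() |> PCA(maxoutdim = 10)

mach = machine(pipe, df)
fit!(mach)                  # <- this is where the worker dies
Xreduced = transform(mach, df)
```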
Thanks

That doesn’t sound like a Pluto issue, but simply OOM? I’m not sure what you mean by “1M observations, although the size is about 75M”, but if you’ve got 75M rows and you one-hot encode some column, generating a bunch of additional columns, it’s maybe not surprising that you run out of memory?
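For a rough sense of scale (numbers purely illustrative): one-hot encoding a single 100-level categorical column into Float64 columns costs 8 bytes × 100 columns per row, i.e. about 800 MB over 1M rows and about 60 GB over 75M rows, per such column.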

In any event, it’s probably useful to just run your code in the REPL to try to get an error message that isn’t obscured by Malt.

Sorry! The way I put it was misleading. I actually meant to indicate that I was running the code from Pluto.
The actual issue is with the one-hot encoding, which bloats the dataset; I suspect the memory allocation then fails, and so the process exits…
What I really wanted from the post was to find out what workflow you all use, especially a distributed one, when handling a large dataset with DataFrames and MLJ.
Regards

One-hot encoding bloat is exacerbated by any high-cardinality features you have. You could try entity embedding: instead of a fixed high-dimensional representation of a feature class, you learn a lower-dimensional representation by training a supervised neural network (NN) that includes embedding layers. The NN may not give the best predictive performance, but once you have the embeddings, you can use them instead of one-hot encoding with whatever supervised model you like. (EvoTreesClassifier and EvoTreesRegressor, the Julia-native gradient tree boosters, are pretty good first choices for structured data.)

The NN models provided by MLJFlux now provide entity embedding; an example and a citation of the entity embedding paper are here. Ideally, you should think about the dimension you need for each multi-class variable and specify these explicitly, as the defaults are just crude caps on the dimension.

@EssamWisam is working on a version of these models that can be used in a pipeline, but for now you will need to train the NN and apply transform separately, before manually passing the result on to your supervised model of choice.
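An untested sketch of that two-stage workflow (the feature names :city and :product, the embedding dimensions, and the hyperparameters are all made up; check the MLJFlux docs for the exact `embedding_dims` semantics):

```julia
using MLJ

NNClassifier = @load NeuralNetworkClassifier pkg=MLJFlux
EvoTreesClassifier = @load EvoTreesClassifier pkg=EvoTrees

# Train a NN whose Multiclass inputs pass through embedding layers.
# `embedding_dims` maps each categorical feature to its embedding
# dimension (values here are illustrative -- base them on cardinality):
nn = NNClassifier(embedding_dims = Dict(:city => 8, :product => 4),
                  epochs = 20)
mach_nn = machine(nn, X, y)
fit!(mach_nn)

# Extract the learned continuous representation of X:
Xembedded = transform(mach_nn, X)

# Pass the embedded features on to the booster:
mach_tree = machine(EvoTreesClassifier(), Xembedded, y)
fit!(mach_tree)
```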

To iron out any issues, I suggest you try this out first with a much smaller version of your dataset on a single process.
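For example, something like (assuming your data lives in a DataFrame `df`):

```julia
using DataFrames, Random

# Prototype on a random ~1% sample before scaling up:
sample_rows = randsubseq(MersenneTwister(0), 1:nrow(df), 0.01)
df_small = df[sample_rows, :]
```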