Hi all,
I am experiencing some weird behavior in Pluto. I have a dataset with more than 1M observations; the file is about 75 MB. I've been able to clean up the dataset and address the missing values. However, when I use the ContinuousEncoder to one-hot encode the categorical features and coerce the remaining features to Continuous, it bloats my dataset to the point where PCA for dimensionality reduction won't run. I get a message in Pluto that the process has exited; it seems that the underlying Malt worker has crashed.
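Roughly, my workflow looks like this (a simplified toy sketch, not my actual code; the dataset and `maxoutdim` value are made up, only ContinuousEncoder and PCA are the real MLJ models I'm using):

```julia
using MLJ, DataFrames, CategoricalArrays

PCA = @load PCA pkg=MultivariateStats verbosity=0

# toy stand-in for the real 1M-row dataset
X = DataFrame(
    a = rand(1_000),
    b = categorical(rand(["x", "y", "z"], 1_000)),
)

# ContinuousEncoder one-hot encodes `b`, adding one Float64 column per
# level; with high-cardinality categoricals this is where the table blows up
pipe = ContinuousEncoder() |> PCA(maxoutdim = 2)
mach = machine(pipe, X)
fit!(mach)
```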
What's the right way to do this with the Distributed package? And what's the correct workflow for processing data frames in MLJ using distributed or multi-threaded processes? Feel free to share some pointers or examples.
Thanks
That doesn't sound like a Pluto issue, but simply OOM? Not sure what you mean by "1M observations, although the size is about 75 MB", but if you've got on the order of a million rows and you one-hot encode some vector, generating a bunch of additional columns, it's maybe not surprising to run out of memory?
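Quick back-of-envelope (numbers made up; the level count in particular is hypothetical since you haven't said how many categories you have):

```julia
nrows  = 1_000_000            # observations
levels = 1_000                # levels in one categorical column
bytes  = nrows * levels * 8   # one Float64 indicator column per level
println(bytes / 2^30, " GiB") # ≈ 7.45 GiB from a single encoded column
```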
In any event, it's probably useful to just run your code in the REPL to try and get an error message that isn't obscured by Malt.