Training a million models at the same time

For my research, I have written my own code for Monte-Carlo simulations with SGD and a one-parameter model. Specifically, I draw 10^6 data sets (x, y) from a known distribution, with x and y 1-dimensional, and train 10^6 instances of the model on the corresponding data sets by doing SGD directly on a 10^6-element vector of parameters.
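For concreteness, here is a minimal sketch of this kind of vectorized SGD in plain Julia (the linear model y ≈ θ·x, the squared loss, and the Gaussian data-generating process are just illustrative stand-ins):

```julia
n = 10^6
θ = zeros(n)          # one parameter per model
η = 0.01              # learning rate

for step in 1:1_000
    # one fresh 1-dimensional observation per model (hypothetical distribution)
    x = randn(n)
    y = 2 .* x .+ 0.1 .* randn(n)
    # d/dθ (θ*x - y)^2 = 2*(θ*x - y)*x, applied elementwise across all models
    θ .-= η .* 2 .* (θ .* x .- y) .* x
end
```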

Can I do this in Flux somehow too?

What I DON’T want is training each model separately. That would take me approximately 86 hours for a process that right now takes 10-30 seconds.

I also wanted to try training one model with a million parameters by using a custom loss function and treating the 10^6 as a dimension in the data. This doesn’t work either: it runs out of memory as soon as I try to create the model with Dense(10^6 => 10^6, bias = false). The same happens for 10^5, though it takes maybe a minute before it tells me that. Maybe you could use chunks of 10^4, but that would still be much less efficient than my own code, which has no memory issues like this.

Any ideas whether there is a clever way of doing this already or do I need to still use my own code in the end?

A dense layer connects each of the 10^6 inputs to each of the 10^6 outputs, which requires 10^12 weights (about 4 TB in Float32), so it’s no surprise that you run out of memory. Presumably your own code is doing something different from a dense layer.


Ah! Yes, of course. Sorry for the confusion. I need to force the weight matrix to be diagonal somehow - or I suppose I can probably just not use a Dense layer and do it directly.

I’m not really sure I understand the training objective (sounds like a good candidate for using libraries from one of Julia’s PPL ecosystems), but if what you need is literally Dense with a diagonal weight matrix, we have Flux.Scale.
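For reference, a small sketch of Flux.Scale, which stores one weight per element and applies it elementwise, so it behaves like a Dense layer with a diagonal weight matrix (the sizes here are just for demonstration):

```julia
using Flux

m = Flux.Scale(4; bias = false)   # elementwise weights, initialized to ones
m([1f0, 2f0, 3f0, 4f0])           # computes m.scale .* input, here [1, 2, 3, 4]
```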


Perfect! That’s exactly what I need (because you can consider them as 10^6 independent models).
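For anyone finding this later, a minimal end-to-end sketch of training all the models at once with Flux.Scale (the linear model, squared loss, learning rate, and step count are again illustrative stand-ins, matching the plain-Julia sketch above):

```julia
using Flux

n = 10^6
model = Flux.Scale(n; bias = false)            # one weight per independent model
opt_state = Flux.setup(Descent(0.01), model)

for step in 1:1_000
    # one fresh observation per model (hypothetical data-generating process)
    x = randn(Float32, n)
    y = 2f0 .* x .+ 0.1f0 .* randn(Float32, n)
    # agg = sum keeps the per-model losses independent; the default agg = mean
    # would scale every gradient by 1/n
    grads = Flux.gradient(m -> Flux.mse(m(x), y; agg = sum), model)
    Flux.update!(opt_state, model, grads[1])
end
```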