I am running three million regressions of the form below and I feel like I am leaving a lot of performance on the table.
Can I get some tips to help speed things up?
```julia
using InvertedIndices  # provides Not (also re-exported by DataFrames)

function test_1(data_0)
    EignVectors = rand(100, 12)
    Factors = zeros(size(data_0, 1), 12)
    for i in 1:size(Factors, 1)
        Indexer = Not(ismissing.(data_0[i, :]))  # real data contains some missing values
        Factors[i, :] = EignVectors[Indexer, :] \ data_0[i, Indexer]  # all I need are the slope coefficients
    end
    return Factors
end

data_1 = rand(3000000, 100)
@time test_1(data_1);
```
Did you read the performance tips in the Julia manual?
Try using views to avoid making copies with slices. Replace the `Not` with a broadcasted `!` (e.g. `.!ismissing.(data_0[i, :])`), since you are allocating an array anyway with `ismissing.`, and perhaps pre-allocate the `Indexer` array. Try changing the order of (transposing) your data array so that you access the data in memory order, since Julia arrays are column-major.
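Putting those suggestions together might look roughly like the sketch below. This is only an illustration, not code from the thread: `test_2` and `data_t` are made-up names, and with real data of element type `Union{Missing, Float64}` the compacted slice may need converting to `Float64` before the `\`.

```julia
function test_2(data_t)                    # data_t: transposed data, 100 x N
    EignVectors = rand(100, 12)
    n = size(data_t, 2)
    Factors = zeros(n, 12)
    Indexer = Vector{Bool}(undef, size(data_t, 1))   # pre-allocated mask, reused every iteration
    for i in 1:n
        col = view(data_t, :, i)                     # contiguous column, no copy
        Indexer .= .!ismissing.(col)                 # true where the value is observed
        Factors[i, :] = EignVectors[Indexer, :] \ col[Indexer]
    end
    return Factors
end

data_t = permutedims(rand(3_000_000, 100))           # 100 x 3_000_000, so each regression reads one column
@time test_2(data_t);
```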
Unfortunately, having irregular missing values makes things a lot worse: if it weren’t for that, you could (after transposing the data array) replace the entire loop with a single `Factors = EignVectors \ data_0` call, which would probably be much faster. Maybe consider sorting your data into chunks that have identical `ismissing` patterns, so that you can do the `\` in chunks.
In particular, this is about 50x faster on my machine.
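As a rough illustration of the chunking idea, the hypothetical sketch below (the name `solve_by_pattern` and the transposed 100 x N layout are assumptions, not code from the thread) groups observations by their `ismissing` pattern and does one least-squares solve per group:

```julia
function solve_by_pattern(EignVectors, data_t)        # data_t: 100 x N, possibly containing missing
    n = size(data_t, 2)
    Factors = zeros(n, 12)
    # group observation (column) indices by their missingness pattern
    groups = Dict{BitVector, Vector{Int}}()
    for i in 1:n
        pattern = ismissing.(view(data_t, :, i))
        push!(get!(groups, pattern, Int[]), i)
    end
    # one backslash call per pattern instead of one per observation
    for (pattern, cols) in groups
        keep = .!pattern
        rhs = Matrix{Float64}(data_t[keep, cols])     # missing rows are dropped, so the conversion is safe
        Factors[cols, :] = (EignVectors[keep, :] \ rhs)'
    end
    return Factors
end
```

With no missing values at all there is only one group, and the loop is effectively the single `Factors = (EignVectors \ data_t)'` call mentioned above.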
Thanks for all the comments.
I managed to implement the suggestions and saw a significant increase in speed.