I am running three million regressions of the form below and I feel like I am leaving a lot of performance on the table.
Can I get some tips to help speed things up?
using InvertedIndices  # provides Not (also re-exported by DataFrames)

function test_1(data_0)
    EignVectors = rand(100, 12)
    Factors = zeros(size(data_0, 1), 12)
    for i in 1:size(Factors, 1)
        Indexer = Not(ismissing.(data_0[i, :])) # real data contains some missing values
        Factors[i, :] = EignVectors[Indexer, :] \ data_0[i, Indexer] # all I need are the slope coefficients
    end
    return Factors
end

data_1 = rand(3000000, 100)
@time test_1(data_1);
Did you read the performance tips?
Try using views to avoid making copies with slices. Replace the Not with (!). (broadcasted negation), since you are allocating an array anyway with the broadcasted ismissing. call, and perhaps pre-allocate the Indexer array. Try changing the order of (transposing) your data array so that you access the data in memory order.
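A minimal sketch of those suggestions, not code from this thread. It assumes the data has already been transposed so that each regression is one column (100 × N), and that EignVectors is passed in rather than generated inside the function. The name factors_by_column is hypothetical.

```julia
using LinearAlgebra

# Hypothetical rewrite of test_1. Assumes data_t is (features × observations),
# i.e. the transpose of the original data_0, so each regression reads one
# contiguous column.
function factors_by_column(data_t::AbstractMatrix, EignVectors::AbstractMatrix)
    nfeat, nobs = size(data_t)
    Factors = zeros(size(EignVectors, 2), nobs)  # one column of slopes per regression
    Indexer = Vector{Bool}(undef, nfeat)         # allocated once, reused every iteration
    for j in 1:nobs
        col = view(data_t, :, j)                 # view: no copy of the column
        Indexer .= .!ismissing.(col)             # fused in-place broadcast, no Not
        A = EignVectors[Indexer, :]              # rows with observed values
        b = Vector{Float64}(view(col, Indexer))  # observed values as plain Float64
        Factors[:, j] = A \ b                    # least-squares slope coefficients
    end
    return Factors
end
```

With this layout the result is 12 × N rather than N × 12, so downstream code would index Factors[:, j] for regression j.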
Unfortunately, having irregular missing values makes things a lot worse. If it weren't for that, you could (after transposing the data array) replace the entire loop with a single Factors = EignVectors \ data_0 call, which would probably be much faster. Maybe consider sorting your data into chunks that have identical ismissing patterns, so that you can do the \ in chunks.
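The chunking idea could look roughly like the sketch below. It is an illustration under assumptions, not code from this thread: the data is again taken to be transposed (100 × N), EignVectors is passed in, the name factors_by_pattern is hypothetical, and columns with no observed values at all are not handled.

```julia
using LinearAlgebra

# Hypothetical sketch: group columns that share the same missingness pattern
# and solve each group with a single multi-right-hand-side least-squares call.
function factors_by_pattern(data_t::AbstractMatrix, EignVectors::AbstractMatrix)
    nobs = size(data_t, 2)
    Factors = zeros(size(EignVectors, 2), nobs)

    # Map each pattern of observed rows to the columns that have it.
    groups = Dict{BitVector,Vector{Int}}()
    for j in 1:nobs
        pattern = .!ismissing.(view(data_t, :, j))  # true where a value is present
        push!(get!(() -> Int[], groups, pattern), j)
    end

    for (pattern, cols) in groups
        A = EignVectors[pattern, :]                        # rows observed in this group
        B = Matrix{Float64}(view(data_t, pattern, cols))   # all columns of the group
        Factors[:, cols] = A \ B                           # one solve for the whole group
    end
    return Factors
end
```

How much this helps depends on how often patterns repeat: if nearly every column has its own pattern, it degenerates to one solve per column plus the cost of building the dictionary.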
In particular, a version with changes along these lines is about 50x faster on my machine.
Thanks for all the comments.
I managed to implement the suggestions and saw a significant increase in speed.