Need help with performance in a very large loop

Thanks so much! Yes, regarding points (1) and (3), the code seems absurd because I abstracted away some other intricacies of the code to build the MWE. Regarding point (2), I agree with you!

Thanks!

So what is the actual difference between tempind1 and tempindWC? If they differ only by an additive constant, then you don’t need both, even if that constant is different for each t.

By “MWE”, do you mean the code posted at the top of this thread? It could be made much closer to minimal, to the point where it’s easy for people to run it. I wanted to try running the code, but a lot of modules are imported, and I suspect that not all of them are necessary. I don’t want to install them all; e.g., the Excel file reader requires that I install other binaries. But I can’t tell which ones are necessary, because you are importing all exported symbols. So, to make it easier to help, I’d suggest something like the following, which imports only the modules that you actually use.

using Distributed
using BitOperations: BitOperations
using LinearAlgebra: LinearAlgebra

Then write BitOperations.bget and LinearAlgebra.mul!.
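Here is a minimal sketch of what the qualified style looks like in practice. It uses only LinearAlgebra, since that ships with Julia; the matrices A, B, and C are made up for illustration and are not from the original code.

```julia
using LinearAlgebra: LinearAlgebra   # brings only the module name into scope

# Hypothetical data, just to have something to multiply.
A = [1.0 2.0; 3.0 4.0]
B = [1.0 0.0; 0.0 1.0]
C = similar(A)

# mul! was not imported by name, so it is called through the module.
LinearAlgebra.mul!(C, A, B)   # in-place C = A * B
```

With this style, anyone reading the code can see which package each function comes from, and an unused package shows up as a module name that never appears in the body.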

Also, you probably don’t want @btime but rather @time, because you are timing something that takes a long, approximately constant, amount of time.
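A rough sketch of timing a long-running call with Base tools only. The function slow_work is a hypothetical stand-in for the real loop, and the problem sizes are arbitrary.

```julia
# Hypothetical stand-in for the long-running computation.
function slow_work(n)
    s = 0.0
    for i in 1:n
        s += sqrt(i)
    end
    return s
end

slow_work(10)                  # warm-up call so compilation is not part of the timing
@time slow_work(10^8)          # one timed run; prints elapsed time and allocations
t = @elapsed slow_work(10^8)   # same kind of measurement, returned in seconds
```

A single run is usually enough here because the run time is long relative to timer noise; repeated sampling, which @btime does, mostly pays off for very short calls.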

The actual difference between temp1WC and temp1 is a vector of size 2^K that depends on the rows of ChoiceSet. I am writing an improvement on the current MWE to clarify these comments.

I just wrote a cleaner MWE including your comments.

I actually just found a big problem. You switched to using Float32 but best is still initialized with a Float64, which gives you a type instability in a hot loop. Replace best = -Inf with best = -Inf32, and do the same for bestWC etc.
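To illustrate the issue, here is a small sketch, assuming the hot loop tracks a running maximum over Float32 values. The functions best_unstable and best_stable and the vector v are hypothetical and not taken from the posted code; only the initializers -Inf and -Inf32 come from it.

```julia
# Unstable version: best starts as Float64 and may later be assigned a Float32,
# so the compiler has to treat it as either type inside the loop.
function best_unstable(v::Vector{Float32})
    best = -Inf              # Float64 literal
    for x in v
        if x > best
            best = x         # x is Float32, so the type of best changes here
        end
    end
    return best
end

# Stable version: best is Float32 from the start and stays Float32.
function best_stable(v::Vector{Float32})
    best = -Inf32            # Float32 negative infinity
    for x in v
        if x > best
            best = x
        end
    end
    return best
end

v = rand(Float32, 1000)

# Inspect what the compiler infers as the return type of each version.
Base.return_types(best_unstable, (Vector{Float32},))
Base.return_types(best_stable, (Vector{Float32},))
```

In the REPL, @code_warntype best_unstable(v) should highlight the same problem on the variable best.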

Fixing this decreases the run time for k=7 by 65%.

Thanks!!! Nice catch. This improves performance a lot!

I don’t know if it’s the case or not in your real workload, but in this code you only use one of the values in Indices, but you calculate an index for every combination of i, iii, and t. You also calculate the index of bestWC, but don’t actually use it for anything. There is probably a speedup to be had by only calculating the values you actually need.
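As a small sketch of the second point, assuming the worst-case value is only ever needed as a value: the vector payoffs and the name ind are hypothetical, and whether this matters in the real workload would need to be measured.

```julia
# Hypothetical values standing in for what one iteration compares.
payoffs = Float32[0.2, 1.5, -0.3, 0.9]

# findmax returns both the maximum and its index.
best, ind = findmax(payoffs)

# If the index is never used afterwards, maximum alone avoids tracking it.
bestWC = maximum(payoffs)
```

The same idea applies to Indices: if only one entry is read later, it may be enough to compute that entry rather than one for every combination of i, iii, and t.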
