Ayo ayo, time for everyone’s favorite Julia game: MakeItFast!

I’m calculating a discounted value given a controlled Markov process, and for that I construct a matrix betaFP to be used in the context (I-betaFP)\u. There are J choices, nX states, conditional transition matrices F[j] nX-by-nX and conditional choice probabilities P[j] nX-by-1. betaFP is calculated according to the formula: beta(P_1[]F_1+P_2[]F_2+…+P_J[]F_J), where [*] is the hadamard product (element by element, such that P[j][x] is multiplied across all columns in row x of F[j]. For sparse F[j] (large problems) mapreduce seems to be the best bet, but for dense F[j] it isn’t.

Below are two possibilities. I multiply beta directly to P[j] as these are only J*nX multiplications where it would be nX^2 if I multiplied it to the matrix in the end.

Note, my life does not depend on the efficiency of this call (as you saw above, I’m calling \ right after this!), but I just got curious, as I got a vast improvement by doing f1! over `sum(beta*P[j].*F[j] for j in eachindex(P, F))`

(would love to implement a `sum!(betaFP, beta*P[j].*F[j] for j in eachindex(P, F))`

but I’m not sure if that is wanted in Base, and I havn’t looked at that part of the code base.

```
function f1!(βFP, β, P, F)
J = length(P)
βFP .= β*P[1].*F[1]
@inbounds for a = 2:J
βFP .+= β*P[a].*F[a]
end
end
function f2!(βFP, β, P, F)
J = length(P)
nX = length(P[1])
fill!(βFP, 0.0)
@inbounds for j = 1:J
bpj = β*P[j]
Fj = F[j]
for x′ = 1:nX
for x = 1:nX
βFP[x, x′] += bpj[x]*Fj[x, x′]
end
end
end
end
J=2
nX=400
B = rand(nX,nX)
b=0.99
P=[rand(nX) for i=1:J]
F=[rand(nX,nX) for i=1:J]
f1!(B, b, P, F)
```

The unrolled loop a bit faster, sometimes! I’m currently using a heuristic like `if nX < 700 || J > 10`

call `f2!`

else call `f1!`

.

Is there anything that is obviously stupid in your eyes? I would love to use sum and a generator but that is too slow (100-200% longer runtimes in some cases). @simd does nothing, I guess it is automatically applied here?

I guess there’s simply a limit to how fast a computer can compute this, I was just curious if anyone had any insights. Thanks