Sum of hadamard products


#1

Ayo ayo, time for everyone’s favorite Julia game: MakeItFast!

I’m calculating a discounted value given a controlled Markov process, and for that I construct a matrix betaFP to be used in the context (I-betaFP)\u. There are J choices, nX states, conditional transition matrices F[j] nX-by-nX and conditional choice probabilities P[j] nX-by-1. betaFP is calculated according to the formula: beta(P_1[]F_1+P_2[]F_2+…+P_J[]F_J), where [*] is the hadamard product (element by element, such that P[j][x] is multiplied across all columns in row x of F[j]. For sparse F[j] (large problems) mapreduce seems to be the best bet, but for dense F[j] it isn’t.

Below are two possibilities. I multiply beta directly to P[j] as these are only J*nX multiplications where it would be nX^2 if I multiplied it to the matrix in the end.

Note, my life does not depend on the efficiency of this call (as you saw above, I’m calling \ right after this!), but I just got curious, as I got a vast improvement by doing f1! over sum(beta*P[j].*F[j] for j in eachindex(P, F)) (would love to implement a sum!(betaFP, beta*P[j].*F[j] for j in eachindex(P, F)) but I’m not sure if that is wanted in Base, and I havn’t looked at that part of the code base.

function f1!(βFP, β, P, F)
    J = length(P)
    βFP .= β*P[1].*F[1]
    @inbounds for a = 2:J
        βFP .+= β*P[a].*F[a]
    end
end
function f2!(βFP, β, P, F)
    J = length(P)
    nX = length(P[1])
    fill!(βFP, 0.0)
    @inbounds for j = 1:J
        bpj = β*P[j]
        Fj = F[j]
        for x′ = 1:nX
            for x = 1:nX
                βFP[x, x′] += bpj[x]*Fj[x, x′]
            end
        end
    end
end
J=2
nX=400
B = rand(nX,nX)
b=0.99
P=[rand(nX) for i=1:J]
F=[rand(nX,nX) for i=1:J]
f1!(B, b, P, F)

The unrolled loop a bit faster, sometimes! I’m currently using a heuristic like if nX < 700 || J > 10 call f2! else call f1!.

Is there anything that is obviously stupid in your eyes? I would love to use sum and a generator but that is too slow (100-200% longer runtimes in some cases). @simd does nothing, I guess it is automatically applied here?

I guess there’s simply a limit to how fast a computer can compute this, I was just curious if anyone had any insights. Thanks


#2

It’s annoying when you see functions but you are not given anything to call them with :stuck_out_tongue: I want to be able to just copy paste the code, have it run and then go about start tinkering with it. Having to figure out good input arguments creates a too large barrier to get through before getting to the fun part (making it faster).


#3

Ah ofc, my bad. I’ll add details when I’m at a pc.


#4

I updated with arrays… I should maybe say, that I’m using v0.6 for this.


#5

Yes, there was something obvious: you were missing a couple of dots :wink: Every single operation must use dot-call syntax, in order to unleash its power (macro @. is useful in order to not forget any dot). This function doesn’t allocate a bit:

function f1!(βFP, β, P, F)
    J = length(P)
    βFP .= β .* P[1] .* F[1]
    @inbounds for a = 2:J
        βFP .+= β .* P[a] .* F[a]
    end
end

Uhm, but this seems to be slower than before…


#6

First, you should note the .* etcetera are only fusing, and .+= is only in-place, with Julia 0.6. With Julia 0.5, this won’t be particularly fast.

Also, note that β .* P[a] .* F[a] might well be slower than (β * P[a]) .* F[a] even in 0.6, because the latter caches the 1d array β * P[a] rather than multiplying by β in each innermost loop iteration. This is a classic space-time tradeoff.

Maybe:

function f1!(βFP, β, P, F)
    J = length(P)
    βFP .= P[1] .* F[1]
    for a = 2:J
        βFP .+= P[a] .* F[a]
    end
    scale!(βFP, β)
end

so that you only multiply by β once?

A basic problem above is that the loop bodies above are so simple, so you see a lot of loop and cache overhead. One option would be to use BLAS:

function f1!(βFP, β, P, F)
    J = length(P)
    fill!(βFP, zero(eltype(βFP)))
    for a = 1:J
        BLAS.axpy!(P[a], F[a], βFP)
    end
    scale!(βFP, β)
end

This has optimized inner loops, but still has a lot of simple loops that loop over βFP multiple times. Alternatively, you could try to fuse more of the loops:

function f1!(βFP, β, P, F)
    J = length(P)
    fill!(βFP, zero(eltype(βFP)))
    a = 1
    while a+2 ≤ P
        @. βFP += P[a] * F[a] + P[a+1] * F[a+1] + P[a+2] * F[a+2]
        a += 3
    end
    while a ≤ P
        @. βFP += P[a] * F[a]
        a += 1
    end
    scale!(βFP, β)
end

(I didn’t bother with @inbounds since these are not inner loops.)


#7

I take it those isless’es should be isless J and not P. The only problem here is that you’re multiplying beta to nXnX values instead of nXJ values the way I did it (which is why I avoided beta dot * P). It seems consistently slower than the two alternatives I posted. The blas-version doesn’t seem to work.

BLAS.axpy!(P[1], F[1], BFP)
ERROR: MethodError: Cannot `convert` an object of type Array{Float64,1} to an object of type Float64
This may have arisen from a call to the constructor Float64(...),
since type constructors fall back to convert methods.
Stacktrace:
 [1] axpy!(::Array{Float64,1}, ::Array{Float64,2}, ::Array{Float64,2}) at ./linalg/generic.jl:1070

And the first option also seems to be slower (again, because of the nX^2 multiplications from scaling betaFP instead of scaling each P before the .*)


#8

Right, I forgot that P[k] was a vector, not a scalar; it’s not a BLAS operation.


#9

Yeah…

Anyway, I think it’s one of those cases where I’m just observing the time it takes to calculate all the necessary calculations. No tricks or anything to apply.

There is something I want to understand a bit better. If nX<500 (circa) then the loop is faster. If J is large (where large depends on nX, for large nX, the requirement for J to be large is higher) then the loop is also faster. However for nX>1000 the .* method is usually faster.

So there must be some overhead that makes .* slower for small matrices and large J (it is repeated many times). I just don’t get why my loop is much slower if for example J==2 and nX==1600, I must be doing something suboptimal.


#10

BLAS.gemm!
?

Edit: No, wait. I guess this is supposed to be the Hadamard prodcut with broadcasting.


#11

Do P and F have to be arrays of arrays?

I think if you had N-D arrays instead, you could do this by concatenating matrix-vector products instead of summing Hadamard products. That would give a much more effective memory access pattern.


#12

I’m not married to the idea, but I would have to refactor a lot of code to make it work. However, if it makes this faster, it may also make something else faster, so I’d be willing to try. I could always just provide vectors of views to the end user to avoid changing the current user facing api.


#13

I think it will make other things faster as well.

The current layout requires a memory read to find the address of F[j] before the address of F[j][x,x′] can be computed. With a 3-D array, the address of F[j,x,x′] can be computed directly. This is always faster. The only disadvantage that I can think of is that it’s harder to change the size parameter J dynamically.

The main speedup would then come from being able to order loops in the way that is the most cache-efficient. For example, the above would probably run fast with the loop over j innermost (or maybe x′ innermost, and j second) so that the cache is not filled up by βFP. (Matrix-vector products are probably still the best solution in this particular case.)


#14

It doesn’t need to, so that’s no obstacle.

I get the looking for F[j] part, and I tried ND arrays, but I ordered it as (x, x', j) and looped over fixed j, x’, then next x’, then next j. It was not faster. You’re betting that j or x’ first is faster? Surprising to me, but I must say I havn’t really thought about ordering it this way before, so I probably just have to see it work :slight_smile:

When you say Matrix-vector products, what do you mean then? I guess you mean after reshufling the storage format?


#15

Yeah, by matrix-vector products, I mean that the if I get this re-shuffling for free:

P2 = zeros(Float64, (nX,J))
F2 = zeros(Float64, (nX,nX,J))
for j=1:J
  P2[:,j] = P[j]
  F2[:,:,j] = F[j]
end

then the function

function fblas!(βFP, β, P2, F2)
    J = size(P2,2)
    nX = size(P2,1)
    for x = 1:nX
      #βFP[x, :] .= β*F2[x,:,:]*P2[x,:]
      BLAS.gemm!('N', 'N', β, view(F2, x, :, :), view(P2, x, :), 0.0, view(βFP, x, :))
    end
end

will take advantage of the highly optimized memory access patterns in BLAS.

However, there seems to be something still allocating memory in my function, so it’s only faster for values of J larger than 10 or so.

Regarding the ordering of loops. I was probably wrong, I had not tried it. My point about having the option to re-order loops when it is beneficial is still valid though.

Edit: The code above seems to give the wrong result, even after I fixed the obvious bug. (The commented-out line gives the right result, but is slow.)


#16

Interesting, that seems to be where my f2! is faster than f1! as well.


#17

It turns out my code gave the wrong result, so the timing measurements were meaningless.