Sum of hadamard products

pkofod · May 5, 2017, 11:02am

Ayo ayo, time for everyone’s favorite Julia game: MakeItFast!

I’m calculating a discounted value given a controlled Markov process, and for that I construct a matrix betaFP to be used in the context (I-betaFP)\u. There are J choices, nX states, conditional transition matrices F[j] nX-by-nX and conditional choice probabilities P[j] nX-by-1. betaFP is calculated according to the formula: beta(P_1F_1+P_2F_2+…+P_JF_J), where [*] is the hadamard product (element by element, such that P[j] is multiplied across all columns in row x of F[j]. For sparse F[j] (large problems) mapreduce seems to be the best bet, but for dense F[j] it isn’t.

Below are two possibilities. I multiply beta directly to P[j] as these are only J*nX multiplications where it would be nX^2 if I multiplied it to the matrix in the end.

Note, my life does not depend on the efficiency of this call (as you saw above, I’m calling \ right after this!), but I just got curious, as I got a vast improvement by doing f1! over sum(beta*P[j].*F[j] for j in eachindex(P, F)) (would love to implement a sum!(betaFP, beta*P[j].*F[j] for j in eachindex(P, F)) but I’m not sure if that is wanted in Base, and I havn’t looked at that part of the code base.

function f1!(βFP, β, P, F)
    J = length(P)
    βFP .= β*P[1].*F[1]
    @inbounds for a = 2:J
        βFP .+= β*P[a].*F[a]
    end
end
function f2!(βFP, β, P, F)
    J = length(P)
    nX = length(P[1])
    fill!(βFP, 0.0)
    @inbounds for j = 1:J
        bpj = β*P[j]
        Fj = F[j]
        for x′ = 1:nX
            for x = 1:nX
                βFP[x, x′] += bpj[x]*Fj[x, x′]
            end
        end
    end
end
J=2
nX=400
B = rand(nX,nX)
b=0.99
P=[rand(nX) for i=1:J]
F=[rand(nX,nX) for i=1:J]
f1!(B, b, P, F)

The unrolled loop a bit faster, sometimes! I’m currently using a heuristic like if nX < 700 || J > 10 call f2! else call f1!.

Is there anything that is obviously stupid in your eyes? I would love to use sum and a generator but that is too slow (100-200% longer runtimes in some cases). @simd does nothing, I guess it is automatically applied here?

I guess there’s simply a limit to how fast a computer can compute this, I was just curious if anyone had any insights. Thanks

kristoffer.carlsson · May 5, 2017, 1:07pm

It’s annoying when you see functions but you are not given anything to call them with I want to be able to just copy paste the code, have it run and then go about start tinkering with it. Having to figure out good input arguments creates a too large barrier to get through before getting to the fun part (making it faster).

pkofod · May 5, 2017, 1:29pm

Ah ofc, my bad. I’ll add details when I’m at a pc.

pkofod · May 5, 2017, 6:29pm

I updated with arrays… I should maybe say, that I’m using v0.6 for this.

giordano · May 5, 2017, 6:58pm

Yes, there was something obvious: you were missing a couple of dots Every single operation must use dot-call syntax, in order to unleash its power (macro @. is useful in order to not forget any dot). This function doesn’t allocate a bit:

function f1!(βFP, β, P, F)
    J = length(P)
    βFP .= β .* P[1] .* F[1]
    @inbounds for a = 2:J
        βFP .+= β .* P[a] .* F[a]
    end
end

Uhm, but this seems to be slower than before…

stevengj · May 5, 2017, 8:32pm

First, you should note the .* etcetera are only fusing, and .+= is only in-place, with Julia 0.6. With Julia 0.5, this won’t be particularly fast.

Also, note that β .* P[a] .* F[a] might well be slower than (β * P[a]) .* F[a] even in 0.6, because the latter caches the 1d array β * P[a] rather than multiplying by β in each innermost loop iteration. This is a classic space-time tradeoff.

Maybe:

function f1!(βFP, β, P, F)
    J = length(P)
    βFP .= P[1] .* F[1]
    for a = 2:J
        βFP .+= P[a] .* F[a]
    end
    scale!(βFP, β)
end

so that you only multiply by β once?

A basic problem above is that the loop bodies above are so simple, so you see a lot of loop and cache overhead. One option would be to use BLAS:

function f1!(βFP, β, P, F)
    J = length(P)
    fill!(βFP, zero(eltype(βFP)))
    for a = 1:J
        BLAS.axpy!(P[a], F[a], βFP)
    end
    scale!(βFP, β)
end

This has optimized inner loops, but still has a lot of simple loops that loop over βFP multiple times. Alternatively, you could try to fuse more of the loops:

function f1!(βFP, β, P, F)
    J = length(P)
    fill!(βFP, zero(eltype(βFP)))
    a = 1
    while a+2 ≤ P
        @. βFP += P[a] * F[a] + P[a+1] * F[a+1] + P[a+2] * F[a+2]
        a += 3
    end
    while a ≤ P
        @. βFP += P[a] * F[a]
        a += 1
    end
    scale!(βFP, β)
end

(I didn’t bother with @inbounds since these are not inner loops.)

pkofod · May 5, 2017, 9:51pm

I take it those isless’es should be isless J and not P. The only problem here is that you’re multiplying beta to nXnX values instead of nXJ values the way I did it (which is why I avoided beta dot * P). It seems consistently slower than the two alternatives I posted. The blas-version doesn’t seem to work.

BLAS.axpy!(P[1], F[1], BFP)
ERROR: MethodError: Cannot `convert` an object of type Array{Float64,1} to an object of type Float64
This may have arisen from a call to the constructor Float64(...),
since type constructors fall back to convert methods.
Stacktrace:
 [1] axpy!(::Array{Float64,1}, ::Array{Float64,2}, ::Array{Float64,2}) at ./linalg/generic.jl:1070

And the first option also seems to be slower (again, because of the nX^2 multiplications from scaling betaFP instead of scaling each P before the .*)

stevengj · May 5, 2017, 10:32pm

Right, I forgot that P[k] was a vector, not a scalar; it’s not a BLAS operation.

pkofod · May 6, 2017, 2:25pm

Yeah…

Anyway, I think it’s one of those cases where I’m just observing the time it takes to calculate all the necessary calculations. No tricks or anything to apply.

There is something I want to understand a bit better. If nX<500 (circa) then the loop is faster. If J is large (where large depends on nX, for large nX, the requirement for J to be large is higher) then the loop is also faster. However for nX>1000 the .* method is usually faster.

So there must be some overhead that makes .* slower for small matrices and large J (it is repeated many times). I just don’t get why my loop is much slower if for example J==2 and nX==1600, I must be doing something suboptimal.

Per · May 7, 2017, 6:16am

BLAS.gemm!
?

Edit: No, wait. I guess this is supposed to be the Hadamard prodcut with broadcasting.

Per · May 9, 2017, 5:33am

Do P and F have to be arrays of arrays?

I think if you had N-D arrays instead, you could do this by concatenating matrix-vector products instead of summing Hadamard products. That would give a much more effective memory access pattern.

pkofod · May 9, 2017, 7:44am

I’m not married to the idea, but I would have to refactor a lot of code to make it work. However, if it makes this faster, it may also make something else faster, so I’d be willing to try. I could always just provide vectors of views to the end user to avoid changing the current user facing api.

Per · May 9, 2017, 10:00am

I think it will make other things faster as well.

The current layout requires a memory read to find the address of F[j] before the address of F[j][x,x′] can be computed. With a 3-D array, the address of F[j,x,x′] can be computed directly. This is always faster. The only disadvantage that I can think of is that it’s harder to change the size parameter J dynamically.

The main speedup would then come from being able to order loops in the way that is the most cache-efficient. For example, the above would probably run fast with the loop over j innermost (or maybe x′ innermost, and j second) so that the cache is not filled up by βFP. (Matrix-vector products are probably still the best solution in this particular case.)

pkofod · May 9, 2017, 10:17am

It doesn’t need to, so that’s no obstacle.

I get the looking for F[j] part, and I tried ND arrays, but I ordered it as (x, x', j) and looped over fixed j, x’, then next x’, then next j. It was not faster. You’re betting that j or x’ first is faster? Surprising to me, but I must say I havn’t really thought about ordering it this way before, so I probably just have to see it work

When you say Matrix-vector products, what do you mean then? I guess you mean after reshufling the storage format?

Per · May 9, 2017, 11:55am

Yeah, by matrix-vector products, I mean that the if I get this re-shuffling for free:

P2 = zeros(Float64, (nX,J))
F2 = zeros(Float64, (nX,nX,J))
for j=1:J
  P2[:,j] = P[j]
  F2[:,:,j] = F[j]
end

then the function

function fblas!(βFP, β, P2, F2)
    J = size(P2,2)
    nX = size(P2,1)
    for x = 1:nX
      #βFP[x, :] .= β*F2[x,:,:]*P2[x,:]
      BLAS.gemm!('N', 'N', β, view(F2, x, :, :), view(P2, x, :), 0.0, view(βFP, x, :))
    end
end

will take advantage of the highly optimized memory access patterns in BLAS.

However, there seems to be something still allocating memory in my function, so it’s only faster for values of J larger than 10 or so.

Regarding the ordering of loops. I was probably wrong, I had not tried it. My point about having the option to re-order loops when it is beneficial is still valid though.

Edit: The code above seems to give the wrong result, even after I fixed the obvious bug. (The commented-out line gives the right result, but is slow.)

pkofod · May 9, 2017, 12:10pm

Interesting, that seems to be where my f2! is faster than f1! as well.

Per · May 9, 2017, 1:58pm

It turns out my code gave the wrong result, so the timing measurements were meaningless.

Topic		Replies	Views
Sum operations between arrays Performance	21	5686	April 7, 2020
Speeding up Matrix multiplication involving dot and hadamard product New to Julia question , performance , vector , matrices , loopvectorization	11	1693	February 9, 2022
Kron vs scalar product speed difference. python code faster? New to Julia question	41	4176	January 14, 2017
Fast sum of two sparse matrices and the Hadamard product of sparse and dense matrices Performance matrices	0	647	July 14, 2020
Optimize combination of element-wise products and matrix multiplication Performance	10	1036	July 15, 2020

Sum of hadamard products

Related topics