# Matrix computation

Hello all,

I have the following function that I would need to optimize. This is because this function is called multiple times (at least 500). So, it is crucial to optimize this function. The computing time with my computer is: 1.940828 seconds (163 allocations: 2.642 GiB, 15.24% gc time). So, I am wondering whether anyone could help me to improve the performance of this function further. Thank you very much in advance!

``````function myfun(v::Int64, r::Int64, c::Int64,
Nr::Array{Float64,2},
Nc::Array{Float64,2},
valSwaA::Array{CartesianIndex{2},1},
valSwaB::Array{CartesianIndex{2},1},
tA::Array{Int64,1}, tB::Array{Int64,1})
q = size(valSwaA,1)
efAVec = getindex.(valSwaA,[1 2])
efBVec = getindex.(valSwaB,[1 2])
eA = efAVec[:,1]
eB = efBVec[:,1]
fA = efAVec[:,2]
fB = efBVec[:,2]

# identifying two vectors br and bc
idxR = eA .!= eB
idxC = fA .!= fB
br = zeros(Int64,q)
bc = zeros(Int64,q)
br[idxR] .= 2
bc[idxC] .= 2
Wr = similar(Nr[:,eB])
Wc = similar(Nc[:,fB])
BLAS.blascopy!(v*q, Nr[:,eB], 1, Wr, 1)
BLAS.blascopy!(v*q, Nc[:,fB], 1, Wc, 1)
BLAS.axpby!(-1, Nr[:,eA], 1, Wr);
BLAS.axpby!(-1, Nc[:,fA], 1, Wc);
#Wr1,Wc1 = - Nr[:,eA]  + Nr[:,eB],- Nc[:,fA]  + Nc[:,fB];
# TODO: improve codes here
tAIdx = [CartesianIndex(i,j) for (i,j) in zip(tA,1:q)];
tBIdx = [CartesianIndex(i,j) for (i,j) in zip(tB,1:q)];
wra = Wr[tAIdx]
wrb = Wr[tBIdx]
wca = Wc[tAIdx]
wcb = Wc[tBIdx]
Wr[tAIdx].= 0
Wr[tBIdx].= 0
Wc[tAIdx].= 0
Wc[tBIdx].= 0

# TODO: use Low-level matrix operations
Δ1 = (1/c)*(wra - wrb + br) + (1/r)*(wca - wcb + bc)
Δ2 = (1/c)*(wra + wrb) + (1/r)*(wca + wcb)
Δt = (1/c)*Wr + (1/r)*Wc
χ = sqrt.(Δ1.*Δ1 + Δ2.*Δ2 + 2*sum(Δt.*Δt,dims=1)[:])
λ1 = Δ1 + χ
λ2 = Δ1 - χ
lmul!(2.0,Δt)
# normalize vectors here
nl1 = sqrt.(λ1 .* (λ1-Δ1))
nl2 = sqrt.(λ2 .* (λ2-Δ1))
lmul!(2.0,nl1)
lmul!(2.0,nl2)
P1 = similar(Δt)
P2 = similar(Δt)
BLAS.blascopy!(v*q, Δt, 1, P1, 1)
BLAS.blascopy!(v*q, Δt, 1, P2, 1)

P1a =  Δ2 + λ1
P1b =  Δ2 - λ1
P1[tAIdx] = P1a
P1[tBIdx] = P1b

P2a =  Δ2 + λ2
P2b =  Δ2 - λ2
P2[tAIdx] = P2a
P2[tBIdx] = P2b
return λ1, λ2, P1, P2, Δt, nl1, nl2, P1a, P1b, P2a, P2b
end
``````

Here is the data set that can be used to run the function data file.

I do not have an answer for you, but I noted that the current code will not run, as the signature is too restrictive. `Nr` and `Nc` in your data are `Array{Int64,2}` but your current function requires Float Arrays.
I guess you can just get rid of the type annotation in the signature and things will run.

If you call the function multiple times, is there a chance that you can reuse certain allocated arrays? such as `br = zeros(Int64,q)` ? This might bring down allocations.

EDIT: Turns out that blascopy does not work with Integer arguments, but expects
` ::Union{Ptr{Float64}, AbstractArray{Float64,N} where N}`
Therefore the function call needs to be

`tpl = myfun(v,r,c,float.(Nr),float.(Nc),valSwaA,valSwaB,tA,tB);`

1 Like

Can you replace this

``````Wr = similar(Nr[:,eB])
BLAS.blascopy!(v*q, Nr[:,eB], 1, Wr, 1)
``````

with this?

``````Wr = Nr[:,eB]
``````

the version below is a bit faster for me (maybe 15%).

``````function myfun2(v::Int64, r::Int64, c::Int64,
Nr,Nc,
valSwaA::Array{CartesianIndex{2},1},
valSwaB::Array{CartesianIndex{2},1},
tA::Array{Int64,1}, tB::Array{Int64,1})
q = size(valSwaA,1)
efAVec = getindex.(valSwaA,[1 2])
efBVec = getindex.(valSwaB,[1 2])
eA = efAVec[:,1]
eB = efBVec[:,1]
fA = efAVec[:,2]
fB = efBVec[:,2]

# identifying two vectors br and bc
idxR = eA .!= eB
idxC = fA .!= fB
br = zeros(Int64,q)
bc = zeros(Int64,q)
br[idxR] .= 2
bc[idxC] .= 2
Wr = Nr[:,eB] - Nr[:,eA]
Wc = Nc[:,fB] - Nc[:,fA]
#Wr1,Wc1 = - Nr[:,eA]  + Nr[:,eB],- Nc[:,fA]  + Nc[:,fB];
# TODO: improve codes here
tAIdx = [CartesianIndex(i,j) for (i,j) in zip(tA,1:q)];
tBIdx = [CartesianIndex(i,j) for (i,j) in zip(tB,1:q)];
wra = Wr[tAIdx]
wrb = Wr[tBIdx]
wca = Wc[tAIdx]
wcb = Wc[tBIdx]
Wr[tAIdx].= 0
Wr[tBIdx].= 0
Wc[tAIdx].= 0
Wc[tBIdx].= 0

# TODO: use Low-level matrix operations
Δ1 = (1/c)*(wra - wrb + br) + (1/r)*(wca - wcb + bc)
Δ2 = (1/c)*(wra + wrb) + (1/r)*(wca + wcb)
Δt = (1/c)*Wr + (1/r)*Wc
χ = sqrt.(Δ1.*Δ1 + Δ2.*Δ2 + 2*sum(Δt.*Δt,dims=1)[:])
λ1 = Δ1 + χ
λ2 = Δ1 - χ
lmul!(2.0,Δt)
# normalize vectors here
nl1 = sqrt.(λ1 .* (λ1-Δ1))
nl2 = sqrt.(λ2 .* (λ2-Δ1))
lmul!(2.0,nl1)
lmul!(2.0,nl2)
P1 = copy(Δt)
P2 = copy(Δt)

P1a =  Δ2 + λ1
P1b =  Δ2 - λ1
P1[tAIdx] = P1a
P1[tBIdx] = P1b

P2a =  Δ2 + λ2
P2b =  Δ2 - λ2
P2[tAIdx] = P2a
P2[tBIdx] = P2b
return λ1, λ2, P1, P2, Δt, nl1, nl2, P1a, P1b, P2a, P2b
end

``````

Thanks very much for your response. Indeed, I used your suggestions. Plus, I found a way to speed up the code further using “Sparse Matrix” since the two matrices `Nr` and `Nr` have a lot of zeros. Below is a new code for that idea.

``````function myfun3(nq::Array{Int64,1},v::Int64, r::Int64, c::Int64,
Nr::SparseMatrixCSC{Float64,Int64},
Nc::SparseMatrixCSC{Float64,Int64},
valSwaA::Array{CartesianIndex{2},1},
valSwaB::Array{CartesianIndex{2},1},
tA::Array{Int64,1}, tB::Array{Int64,1},
Wr::SparseMatrixCSC{Float64,Int64},
Wc::SparseMatrixCSC{Float64,Int64},
br::Array{Float64,1},
bc::Array{Float64,1})
q = length(nq)
efAVec = getindex.(valSwaA,[1 2])
efBVec = getindex.(valSwaB,[1 2])
eA = efAVec[:,1]
eB = efBVec[:,1]
fA = efAVec[:,2]
fB = efBVec[:,2]
# identifying two vectors br and bc
#idxR = eA .!= eB
#idxC = fA .!= fB
idxR = eA .== eB
idxC = fA .== fB
br[idxR] .= 0;
bc[idxC] .= 0;
Wr .= Nr[:,eB] .- Nr[:,eA];
Wc .= Nc[:,fB] .- Nc[:,fA];

#tAIdx = [CartesianIndex(i,j) for (i,j) in zip(tA,1:q)];
#tBIdx = [CartesianIndex(i,j) for (i,j) in zip(tB,1:q)];
# linear indices
tAIdx = tA .+ v.*(nq.-1);
tBIdx = tB .+ v.*(nq.-1);

wra = Array(Wr[tAIdx])
wrb = Array(Wr[tBIdx])
wca = Array(Wc[tAIdx])
wcb = Array(Wc[tBIdx])
Wr[tAIdx].= 0;
Wr[tBIdx].= 0;
Wc[tAIdx].= 0;
Wc[tBIdx].= 0;

# TODO: can this be improved futher?
Δ1 = wra .- wrb .+ br + wca .- wcb .+ bc;
Δ2 = wra .+ wrb .+ wca .+ wcb;
Δt = Wr .+ Wc;
# TODO: can this be improved futher since Δt is a sparse matrix?
χ = sqrt.(Δ1.*Δ1 + Δ2.*Δ2 + 2*sum(Δt.*Δt,dims=1)[:]);
λ1 = Δ1 .+ χ
λ2 = Δ1 .- χ
lmul!(2.0,Δt)
# normalize vectors here
nl1 = sqrt.(λ1 .* χ)
nl2 = sqrt.(.-(λ2 .*χ))
lmul!(2.0,nl1)
lmul!(2.0,nl2)
P1 = copy(Δt)
P2 = copy(Δt)

P1a =  Δ2 .+ λ1
P1b =  Δ2 .- λ1
P2a =  Δ2 .+ λ2
P2b =  Δ2 .- λ2

P1[tAIdx] = P1a
P1[tBIdx] = P1b
P2[tAIdx] = P2a
P2[tBIdx] = P2b
return λ1, λ2, P1, P2, Δt, nl1, nl2, P1a, P1b, P2a, P2b
end

``````

So, we can use `myfun3` as follows:

``````sNr = (1/c).*sparse(Nr)
sNc = (1/r).*sparse(Nc)
Wr = spzeros(Float64,v,q);
Wc = spzeros(Float64,v,q);
br = fill((1/c)*2,q)
bc = fill((1/r)*2,q)
nq = collect(1:q);
@time res = myfun3(nq, v, r, c, sNr, sNc,
valSwaA, valSwaB,
tA, tB,
Wr, Wc,
br, bc);
``````

Below is the computing time of that function with my computer. Perhaps, we could improve the code further. So, please let me know if you have any ideas. Thanks in advance!

`````` 0.530621 seconds (347.54 k allocations: 780.198 MiB)
``````

Hi

It is very difficult to help you without being able to easily run the code.

Some thoughts:

1. Allocations still seem very high. I’m hoping your not timing the first time compile (rather use BenchmarkTools). Maybe this happens with sparse matrixes, I’m not sure, having never used them myself. I’m not seeing loops where you sometimes accidentally get an allocation every time going through the loop. My guess is to get so many allocations, it seems as if something is being allocated every time it is being used for every element. I have seen something like that when I accidentally mixed data types, like applying an Int or Float32 to a Float64 matrix and there is type casting happening.

2. Further reusing variables could reduce allocations, for example Δ1 and Δ2 could be reused, if sizes don’t change between calls. In the extreme your function becomes myfun3! returns nothing and all output variables are passed in already and modified in place. I find it a pain though to pass in a bunch of temporary variables, since the calling syntax becomes clunky. One work around is to create a mutable struct to contain those temporaries, but can prevent a 14 parameter function call being a 35 parameter call with 11 return variables and 10 temporary variables.

3. I see you using lmul!, but wouldn’t it be faster to put that into the earlier “fused operation loop” resulting in touching the elements one less time, for example:

``````nl1 = sqrt.(λ1 .* χ)
nl2 = sqrt.(.-(λ2 .*χ))
lmul!(2.0,nl1)
lmul!(2.0,nl2)
``````

becomes

``````nl1 = 2.0 .* sqrt.(λ1 .* χ)
nl2 = 2.0 .* sqrt.(.-(λ2 .*χ))
``````

or better yet

``````nl1 .= 2.0 .* sqrt.(λ1 .* χ)
nl2 .= 2.0 .* sqrt.(.-(λ2 .*χ))
``````
1. I’m not sure exactly what this line does, but it looks like it will create a bunch of temporaries:
``````χ = sqrt.(Δ1.*Δ1 + Δ2.*Δ2 + 2*sum(Δt.*Δt,dims=1)[:]);
``````

Since your not using `.+`, `Δ1.*Δ1` will be computed first into a temp. Similarly `Δ2.*Δ2` into another temp. I’m not sure how many temp variables the sum will create. The multiply by 2 will create another temp (I’m wondering if that 2 as an int is causing a bunch of separate type promotions).

I can just refer back to the suggestion

Study the .mem files and try to understand the allocations you see. Many times I have been surprised by what I see. Also make sure you have read:

https://docs.julialang.org/en/v1/manual/performance-tips/

Hi,

Thank you very much for looking into my code. Below is my response for your comments.

Yes, indeed. So, below is a simiulated data that you could use to test my code. Please let me know whether it works for you.

``````# test data
using LinearAlgebra, SparseArrays,  LazyArrays
v = 580;
q = 43416;
r = 18;
c = 36;
Nr = (1/c).*sprandn(v, r, 0.20);
Nc = (1/r).*sprandn(v, c, 0.20);
valSwaA = [rand(1:r,q) rand(1:c,q)];
valSwaB = [rand(1:r,q) rand(1:c,q)];
valSwaA = [CartesianIndex(valSwaA[ii,1],valSwaA[ii,2]) for ii in 1:q];
valSwaB = [CartesianIndex(valSwaB[ii,1],valSwaB[ii,2]) for ii in 1:q];
Wr = spzeros(Float64,v,q);
Wc = spzeros(Float64,v,q);
br = fill((1/c)*2,q)
bc = fill((1/r)*2,q)
nq = collect(1:q);
tA = rand(1:v,q)
tB = rand(1:v,q)
``````

Then, you could run myfun3 as follows:

``````myfun3(nq, v, r, c, Nr, Nc,
valSwaA, valSwaB,
tA, tB,
Wr, Wc,
br, bc);
``````

Based on your thoughts, I implemented a new version of my function as follows

``````# I used your suggestion 2, where three mutable structures are defined
# (i.e., input data, processed data in a function, and output data)
# input: temporary input data
mutable struct temIntStr
nq::Array{Int64,1}
v::Int64
r::Int64
c::Int64
Nr::SparseMatrixCSC{Float64,Int64}
Nc::SparseMatrixCSC{Float64,Int64}
valSwaA::Array{Int64,2}
valSwaB::Array{Int64,2}
tA::Array{Int64,1}
tB::Array{Int64,1}
Wr::SparseMatrixCSC{Float64,Int64}
Wc::SparseMatrixCSC{Float64,Int64}
br::Array{Float64,1}
bc::Array{Float64,1}
temIntStr(nq,v,r,c,Nr,Nc,valSwaA,valSwaB,tA,tB,Wr,Wc,br,bc) = new(nq,v,r,c,Nr,Nc,valSwaA,valSwaB,tA,tB,Wr,Wc,br,bc)
end
# processed data: temporary structure
mutable struct temPro
Δ1::Array{Float64,1}
Δ2::Array{Float64,1}
χ::Array{Float64,1}
wra::Array{Float64,1}
wrb::Array{Float64,1}
wca::Array{Float64,1}
wcb::Array{Float64,1}
temPro(Δ1,Δ2,χ,wra,wrb,wca,wcb) = new(Δ1,Δ2,χ,wra,wrb,wca,wcb)
end
# output: eigenvalues and eigvectors information
mutable struct eigInf
q::Int64
λ1::Array{Float64,1}
λ2::Array{Float64,1}
P1::SparseMatrixCSC{Float64,Int64}
P2::SparseMatrixCSC{Float64,Int64}
Δt::SparseMatrixCSC{Float64,Int64}
nl1::Array{Float64,1}
nl2::Array{Float64,1}
P1a::Array{Float64,1}
P1b::Array{Float64,1}
P2a::Array{Float64,1}
P2b::Array{Float64,1}
eigInf(q,λ1,λ2,P1,P2,Δt,nl1,nl2,P1a,P1b,P2a,P2b) = new(q,λ1,λ2,P1,P2,Δt,nl1,nl2,P1a,P1b,P2a,P2b)
end
``````

Then, I propose a new function as follows

``````function myfun4(iObj::temIntStr,
pObj::temPro,
rObj::eigInf)
eA = iObj.valSwaA[:,1]
eB = iObj.valSwaB[:,1]
fA = iObj.valSwaA[:,2]
fB = iObj.valSwaB[:,2]
# identifying two vectors br and bc
idxR = view(eA,:).==view(eB,:)
idxC = view(fA,:).==view(fB,:)
iObj.br[idxR] .= 0
iObj.bc[idxC] .= 0
# TODO: these following operations seem to be quite expensive.
# This might be something related to your suggestion 1. I am not quite sure whether this be
# improved futher since I also don't have much experience with SparseMatrices. Perhaps, we could
# write a loop to perfom these commands to avoid reallocations here?
iObj.Wr .= iObj.Nr[:,eB] .- iObj.Nr[:,eA]
iObj.Wc .= iObj.Nc[:,fB] .- iObj.Nc[:,fA]
# linear indices
tAIdx = iObj.tA .+ iObj.v .* (iObj.nq .-1)
tBIdx = iObj.tB .+ iObj.v .* (iObj.nq .-1)
tABIdx = ApplyArray(vcat,tAIdx,tBIdx)

pObj.wra .= view(iObj.Wr,tAIdx)
pObj.wrb .= view(iObj.Wr,tBIdx)
pObj.wca .= view(iObj.Wc,tAIdx)
pObj.wcb .= view(iObj.Wc,tBIdx)

iObj.Wr[tABIdx].= 0
iObj.Wc[tABIdx].= 0
# TODO: can this be improved futher?
pObj.Δ1 .= pObj.wra .- pObj.wrb .+ iObj.br .+ pObj.wca .- pObj.wcb .+ iObj.bc
pObj.Δ2 .= pObj.wra .+ pObj.wrb .+ pObj.wca .+ pObj.wcb
rObj.Δt .= iObj.Wr .+ iObj.Wc
# Here I used your suggestion 4 to avoid allocations.
mulSparse(pObj.χ,rObj.Δt,pObj.Δ1,pObj.Δ2)
rObj.λ1 .= pObj.Δ1 .+ pObj.χ
rObj.λ2 .= pObj.Δ1 .- pObj.χ
lmul!(2.0,rObj.Δt)
# normalize vectors here
# Here I used your suggestion 3 to avoid allocations.
mulVec(rObj.nl1,rObj.λ1,pObj.χ)
mulVec(rObj.nl2,rObj.λ2,pObj.χ,true)
rObj.P1 .= rObj.Δt
rObj.P2 .= rObj.Δt
rObj.P1a .=  pObj.Δ2 .+ rObj.λ1
rObj.P1b .=  pObj.Δ2 .- rObj.λ1
rObj.P2a .=  pObj.Δ2 .+ rObj.λ2
rObj.P2b .=  pObj.Δ2 .- rObj.λ2

rObj.P1[tABIdx] = ApplyArray(vcat,rObj.P1a,rObj.P1b)
rObj.P2[tABIdx] = ApplyArray(vcat,rObj.P2a,rObj.P2b)
return nothing
end

function mulSparse(y::Array{Float64,1},
A::SparseMatrixCSC{Float64,Int64},
b1::Array{Float64,1},
b2::Array{Float64,1})
#rowval = A.rowval
nzval = A.nzval
fill!(y,0.0)
@inbounds for i ∈ eachindex(y)
for k=A.colptr[i]:(A.colptr[i+1]-1)
#ki=rowval[k] # index
#kv=nzval[k] # values
y[i] += 2*abs2(nzval[k])
end
y[i] += abs2(b1[i]) + abs2(b2[i])
y[i] = sqrt(y[i])
end
end
function mulVec(y::Array{Float64,1},
a1::Array{Float64,1},
a2::Array{Float64,1},sign=false)
if ~sign
@inbounds @simd for i ∈ eachindex(y)
y[i] = 2*sqrt(a1[i]*a2[i])
end
else
@inbounds @simd for i ∈ eachindex(y)
y[i] = 2*sqrt(-a1[i]*a2[i])
end
end
end
``````

In order to run myfun4, you could perform the following commands:

``````v = 580;
q = 43416;
r = 18;
c = 36;
Nr = (1/c).*sprandn(v, r, 0.20);
Nc = (1/r).*sprandn(v, c, 0.20);
valSwaA = [rand(1:r,q) rand(1:c,q)];
valSwaB = [rand(1:r,q) rand(1:c,q)];
Wr = spzeros(Float64,v,q);
Wc = spzeros(Float64,v,q);
br = fill((1/c)*2,q);
bc = fill((1/r)*2,q);
nq = collect(1:q);
tA = rand(1:v,q);
tB = rand(1:v,q);
const iObj = temIntStr(nq, v, r, c,
Nr, Nc,
valSwaA, valSwaB,
tA, tB,
Wr, Wc,
br, bc);
# proccessed object:
const pObj = temPro(zeros(Float64,q),
zeros(Float64,q),
zeros(Float64,q),
zeros(Float64,q),
zeros(Float64,q),
zeros(Float64,q),
zeros(Float64,q));
# output:
const rObj = eigInf(q,
zeros(Float64,q),
zeros(Float64,q),
zeros(Float64,v,q),
zeros(Float64,v,q),
zeros(Float64,v,q),
zeros(Float64,q),
zeros(Float64,q),
zeros(Float64,q),
zeros(Float64,q),
zeros(Float64,q),
zeros(Float64,q));
@btime myfun4(\$iObj, \$pObj, \$rObj);
``````

Now, I used BenchmarkTools to obtain computational times for myfun3 and myfun4. Here they are:

``````myfun3: 2.187 s (347540 allocations: 2.69 GiB)
myfun4: 945.036 ms (173775 allocations: 764.86 MiB)
``````

As you can see, the performance of myfun4 is much better than the performance of myfun3. But it seems there are still a lot of allocations with myfun4. Do you think it is possible to improve this further? Thank you very much in advance! Also, thank you for showing the two reference links. Indeed, I have checked those links!

Commenting out these two lines

``````    rObj.P1 .= rObj.Δt
rObj.P2 .= rObj.Δt
``````

cuts the time in half: 768.564 ms (173731 allocations: 764.87 MiB) --> 343.465 ms (55 allocations: 314.48 MiB). And they appear to be dealing with sparse objects which aren’t sparse at all, 57% full:

``````julia> rObj.Δt
580×43416 SparseMatrixCSC{Float64,Int64} with 0 stored entries:

julia> myfun4(iObj, pObj, rObj);

julia> rObj.Δt
580×43416 SparseMatrixCSC{Float64,Int64} with 14325913 stored entries:
⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛

julia> 14360504 / (580 * 43416)
0.5702849100601717
``````

It’s not so easy to figure out what else is going on. I suspect that a version which breaks things into more functions would be easier to profile. A version which replaces every stuct with a NamedTuple, deletes almost every type annotation, would make it easier to experiment with dense/sparse.