Matrix computation

Hello all,

I have the following function that I would need to optimize. This is because this function is called multiple times (at least 500). So, it is crucial to optimize this function. The computing time with my computer is: 1.940828 seconds (163 allocations: 2.642 GiB, 15.24% gc time). So, I am wondering whether anyone could help me to improve the performance of this function further. Thank you very much in advance!

function myfun(v::Int64, r::Int64, c::Int64, 
                    Nr::Array{Float64,2}, 
                    Nc::Array{Float64,2}, 
                    valSwaA::Array{CartesianIndex{2},1}, 
                    valSwaB::Array{CartesianIndex{2},1}, 
                    tA::Array{Int64,1}, tB::Array{Int64,1})
    q = size(valSwaA,1) 
    efAVec = getindex.(valSwaA,[1 2])
    efBVec = getindex.(valSwaB,[1 2])
    eA = efAVec[:,1] 
    eB = efBVec[:,1]
    fA = efAVec[:,2] 
    fB = efBVec[:,2]
    
    # identifying two vectors br and bc
    idxR = eA .!= eB
    idxC = fA .!= fB
    br = zeros(Int64,q)
    bc = zeros(Int64,q)
    br[idxR] .= 2
    bc[idxC] .= 2
    Wr = similar(Nr[:,eB])
    Wc = similar(Nc[:,fB])
    BLAS.blascopy!(v*q, Nr[:,eB], 1, Wr, 1)
    BLAS.blascopy!(v*q, Nc[:,fB], 1, Wc, 1)
    BLAS.axpby!(-1, Nr[:,eA], 1, Wr);
    BLAS.axpby!(-1, Nc[:,fA], 1, Wc);
    #Wr1,Wc1 = - Nr[:,eA]  + Nr[:,eB],- Nc[:,fA]  + Nc[:,fB];
    # TODO: improve codes here
    tAIdx = [CartesianIndex(i,j) for (i,j) in zip(tA,1:q)];
    tBIdx = [CartesianIndex(i,j) for (i,j) in zip(tB,1:q)];
    wra = Wr[tAIdx]
    wrb = Wr[tBIdx]
    wca = Wc[tAIdx]
    wcb = Wc[tBIdx]
    Wr[tAIdx].= 0
    Wr[tBIdx].= 0
    Wc[tAIdx].= 0
    Wc[tBIdx].= 0

# TODO: use Low-level matrix operations
    Δ1 = (1/c)*(wra - wrb + br) + (1/r)*(wca - wcb + bc)
    Δ2 = (1/c)*(wra + wrb) + (1/r)*(wca + wcb)
    Δt = (1/c)*Wr + (1/r)*Wc
    χ = sqrt.(Δ1.*Δ1 + Δ2.*Δ2 + 2*sum(Δt.*Δt,dims=1)[:])
    λ1 = Δ1 + χ
    λ2 = Δ1 - χ  
    lmul!(2.0,Δt)
    # normalize vectors here    
    nl1 = sqrt.(λ1 .* (λ1-Δ1))
    nl2 = sqrt.(λ2 .* (λ2-Δ1))
    lmul!(2.0,nl1)
    lmul!(2.0,nl2)    
    P1 = similar(Δt)
    P2 = similar(Δt)
    BLAS.blascopy!(v*q, Δt, 1, P1, 1)
    BLAS.blascopy!(v*q, Δt, 1, P2, 1)   
    
    P1a =  Δ2 + λ1
    P1b =  Δ2 - λ1
    P1[tAIdx] = P1a
    P1[tBIdx] = P1b

    P2a =  Δ2 + λ2
    P2b =  Δ2 - λ2
    P2[tAIdx] = P2a
    P2[tBIdx] = P2b   
    return λ1, λ2, P1, P2, Δt, nl1, nl2, P1a, P1b, P2a, P2b 
end

Here is the data set that can be used to run the function data file.

I do not have an answer for you, but I noted that the current code will not run, as the signature is too restrictive. Nr and Nc in your data are Array{Int64,2} but your current function requires Float Arrays.
I guess you can just get rid of the type annotation in the signature and things will run.

If you call the function multiple times, is there a chance that you can reuse certain allocated arrays? such as br = zeros(Int64,q) ? This might bring down allocations.

did you look at https://github.com/JuliaPerf/PProf.jl , https://github.com/timholy/ProfileView.jl and https://docs.julialang.org/en/v1/manual/profile/#Memory-allocation-analysis-1 ?

EDIT: Turns out that blascopy does not work with Integer arguments, but expects
::Union{Ptr{Float64}, AbstractArray{Float64,N} where N}
Therefore the function call needs to be

tpl = myfun(v,r,c,float.(Nr),float.(Nc),valSwaA,valSwaB,tA,tB);

1 Like

Can you replace this

Wr = similar(Nr[:,eB])
BLAS.blascopy!(v*q, Nr[:,eB], 1, Wr, 1)

with this?

Wr = Nr[:,eB]

the version below is a bit faster for me (maybe 15%).

function myfun2(v::Int64, r::Int64, c::Int64, 
    Nr,Nc,    
    valSwaA::Array{CartesianIndex{2},1}, 
    valSwaB::Array{CartesianIndex{2},1}, 
    tA::Array{Int64,1}, tB::Array{Int64,1})
q = size(valSwaA,1) 
efAVec = getindex.(valSwaA,[1 2])
efBVec = getindex.(valSwaB,[1 2])
eA = efAVec[:,1] 
eB = efBVec[:,1]
fA = efAVec[:,2] 
fB = efBVec[:,2]

# identifying two vectors br and bc
idxR = eA .!= eB
idxC = fA .!= fB
br = zeros(Int64,q)
bc = zeros(Int64,q)
br[idxR] .= 2
bc[idxC] .= 2
Wr = Nr[:,eB] - Nr[:,eA]
Wc = Nc[:,fB] - Nc[:,fA]
#Wr1,Wc1 = - Nr[:,eA]  + Nr[:,eB],- Nc[:,fA]  + Nc[:,fB];
# TODO: improve codes here
tAIdx = [CartesianIndex(i,j) for (i,j) in zip(tA,1:q)];
tBIdx = [CartesianIndex(i,j) for (i,j) in zip(tB,1:q)];
wra = Wr[tAIdx]
wrb = Wr[tBIdx]
wca = Wc[tAIdx]
wcb = Wc[tBIdx]
Wr[tAIdx].= 0
Wr[tBIdx].= 0
Wc[tAIdx].= 0
Wc[tBIdx].= 0

# TODO: use Low-level matrix operations
Δ1 = (1/c)*(wra - wrb + br) + (1/r)*(wca - wcb + bc)
Δ2 = (1/c)*(wra + wrb) + (1/r)*(wca + wcb)
Δt = (1/c)*Wr + (1/r)*Wc
χ = sqrt.(Δ1.*Δ1 + Δ2.*Δ2 + 2*sum(Δt.*Δt,dims=1)[:])
λ1 = Δ1 + χ
λ2 = Δ1 - χ  
lmul!(2.0,Δt)
# normalize vectors here    
nl1 = sqrt.(λ1 .* (λ1-Δ1))
nl2 = sqrt.(λ2 .* (λ2-Δ1))
lmul!(2.0,nl1)
lmul!(2.0,nl2)    
P1 = copy(Δt)
P2 = copy(Δt)

P1a =  Δ2 + λ1
P1b =  Δ2 - λ1
P1[tAIdx] = P1a
P1[tBIdx] = P1b

P2a =  Δ2 + λ2
P2b =  Δ2 - λ2
P2[tAIdx] = P2a
P2[tBIdx] = P2b   
return λ1, λ2, P1, P2, Δt, nl1, nl2, P1a, P1b, P2a, P2b 
end


Thanks very much for your response. Indeed, I used your suggestions. Plus, I found a way to speed up the code further using “Sparse Matrix” since the two matrices Nr and Nr have a lot of zeros. Below is a new code for that idea.

function myfun3(nq::Array{Int64,1},v::Int64, r::Int64, c::Int64, 
    Nr::SparseMatrixCSC{Float64,Int64},
    Nc::SparseMatrixCSC{Float64,Int64},    
    valSwaA::Array{CartesianIndex{2},1},
    valSwaB::Array{CartesianIndex{2},1}, 
    tA::Array{Int64,1}, tB::Array{Int64,1},
    Wr::SparseMatrixCSC{Float64,Int64},
    Wc::SparseMatrixCSC{Float64,Int64},
    br::Array{Float64,1},
    bc::Array{Float64,1})
q = length(nq)
efAVec = getindex.(valSwaA,[1 2])
efBVec = getindex.(valSwaB,[1 2])
eA = efAVec[:,1]
eB = efBVec[:,1]
fA = efAVec[:,2] 
fB = efBVec[:,2]
# identifying two vectors br and bc
#idxR = eA .!= eB
#idxC = fA .!= fB
idxR = eA .== eB
idxC = fA .== fB
br[idxR] .= 0;
bc[idxC] .= 0;
Wr .= Nr[:,eB] .- Nr[:,eA];
Wc .= Nc[:,fB] .- Nc[:,fA];

#tAIdx = [CartesianIndex(i,j) for (i,j) in zip(tA,1:q)];
#tBIdx = [CartesianIndex(i,j) for (i,j) in zip(tB,1:q)];
# linear indices
tAIdx = tA .+ v.*(nq.-1);
tBIdx = tB .+ v.*(nq.-1);

wra = Array(Wr[tAIdx])
wrb = Array(Wr[tBIdx])
wca = Array(Wc[tAIdx])
wcb = Array(Wc[tBIdx])
Wr[tAIdx].= 0;
Wr[tBIdx].= 0;
Wc[tAIdx].= 0;
Wc[tBIdx].= 0;

# TODO: can this be improved futher?
Δ1 = wra .- wrb .+ br + wca .- wcb .+ bc;
Δ2 = wra .+ wrb .+ wca .+ wcb;
Δt = Wr .+ Wc;
# TODO: can this be improved futher since Δt is a sparse matrix?
χ = sqrt.(Δ1.*Δ1 + Δ2.*Δ2 + 2*sum(Δt.*Δt,dims=1)[:]);
λ1 = Δ1 .+ χ
λ2 = Δ1 .- χ  
lmul!(2.0,Δt)
# normalize vectors here    
nl1 = sqrt.(λ1 .* χ)
nl2 = sqrt.(.-(λ2 .*χ))
lmul!(2.0,nl1)
lmul!(2.0,nl2)    
P1 = copy(Δt)
P2 = copy(Δt)

P1a =  Δ2 .+ λ1
P1b =  Δ2 .- λ1
P2a =  Δ2 .+ λ2
P2b =  Δ2 .- λ2

P1[tAIdx] = P1a
P1[tBIdx] = P1b
P2[tAIdx] = P2a
P2[tBIdx] = P2b
return λ1, λ2, P1, P2, Δt, nl1, nl2, P1a, P1b, P2a, P2b
end

So, we can use myfun3 as follows:

sNr = (1/c).*sparse(Nr)
sNc = (1/r).*sparse(Nc)
Wr = spzeros(Float64,v,q);
Wc = spzeros(Float64,v,q);
br = fill((1/c)*2,q)
bc = fill((1/r)*2,q)
nq = collect(1:q);
@time res = myfun3(nq, v, r, c, sNr, sNc,
                           valSwaA, valSwaB,
                           tA, tB,
                           Wr, Wc,
                           br, bc);

Below is the computing time of that function with my computer. Perhaps, we could improve the code further. So, please let me know if you have any ideas. Thanks in advance!

 0.530621 seconds (347.54 k allocations: 780.198 MiB)

Hi

It is very difficult to help you without being able to easily run the code.

Some thoughts:

  1. Allocations still seem very high. I’m hoping your not timing the first time compile (rather use BenchmarkTools). Maybe this happens with sparse matrixes, I’m not sure, having never used them myself. I’m not seeing loops where you sometimes accidentally get an allocation every time going through the loop. My guess is to get so many allocations, it seems as if something is being allocated every time it is being used for every element. I have seen something like that when I accidentally mixed data types, like applying an Int or Float32 to a Float64 matrix and there is type casting happening.

  2. Further reusing variables could reduce allocations, for example Δ1 and Δ2 could be reused, if sizes don’t change between calls. In the extreme your function becomes myfun3! returns nothing and all output variables are passed in already and modified in place. I find it a pain though to pass in a bunch of temporary variables, since the calling syntax becomes clunky. One work around is to create a mutable struct to contain those temporaries, but can prevent a 14 parameter function call being a 35 parameter call with 11 return variables and 10 temporary variables.

  3. I see you using lmul!, but wouldn’t it be faster to put that into the earlier “fused operation loop” resulting in touching the elements one less time, for example:

nl1 = sqrt.(λ1 .* χ)
nl2 = sqrt.(.-(λ2 .*χ))
lmul!(2.0,nl1)
lmul!(2.0,nl2) 

becomes

nl1 = 2.0 .* sqrt.(λ1 .* χ)
nl2 = 2.0 .* sqrt.(.-(λ2 .*χ))

or better yet

nl1 .= 2.0 .* sqrt.(λ1 .* χ)
nl2 .= 2.0 .* sqrt.(.-(λ2 .*χ))
  1. I’m not sure exactly what this line does, but it looks like it will create a bunch of temporaries:
χ = sqrt.(Δ1.*Δ1 + Δ2.*Δ2 + 2*sum(Δt.*Δt,dims=1)[:]);

Since your not using .+, Δ1.*Δ1 will be computed first into a temp. Similarly Δ2.*Δ2 into another temp. I’m not sure how many temp variables the sum will create. The multiply by 2 will create another temp (I’m wondering if that 2 as an int is causing a bunch of separate type promotions).

I can just refer back to the suggestion

Study the .mem files and try to understand the allocations you see. Many times I have been surprised by what I see. Also make sure you have read:

https://docs.julialang.org/en/v1/manual/performance-tips/

Hi,

Thank you very much for looking into my code. Below is my response for your comments.

Yes, indeed. So, below is a simiulated data that you could use to test my code. Please let me know whether it works for you.

# test data
using LinearAlgebra, SparseArrays,  LazyArrays
v = 580;
q = 43416;
r = 18;
c = 36;
Nr = (1/c).*sprandn(v, r, 0.20);
Nc = (1/r).*sprandn(v, c, 0.20);
valSwaA = [rand(1:r,q) rand(1:c,q)];
valSwaB = [rand(1:r,q) rand(1:c,q)];
valSwaA = [CartesianIndex(valSwaA[ii,1],valSwaA[ii,2]) for ii in 1:q];
valSwaB = [CartesianIndex(valSwaB[ii,1],valSwaB[ii,2]) for ii in 1:q];
Wr = spzeros(Float64,v,q);
Wc = spzeros(Float64,v,q);
br = fill((1/c)*2,q)
bc = fill((1/r)*2,q)
nq = collect(1:q);
tA = rand(1:v,q)
tB = rand(1:v,q)

Then, you could run myfun3 as follows:

myfun3(nq, v, r, c, Nr, Nc,
                           valSwaA, valSwaB,
                           tA, tB,
                           Wr, Wc,
                           br, bc);

Based on your thoughts, I implemented a new version of my function as follows

# I used your suggestion 2, where three mutable structures are defined 
# (i.e., input data, processed data in a function, and output data)
# input: temporary input data
mutable struct temIntStr
    nq::Array{Int64,1}
    v::Int64
    r::Int64
    c::Int64
    Nr::SparseMatrixCSC{Float64,Int64}
    Nc::SparseMatrixCSC{Float64,Int64}
    valSwaA::Array{Int64,2}
    valSwaB::Array{Int64,2}
    tA::Array{Int64,1}
    tB::Array{Int64,1}
    Wr::SparseMatrixCSC{Float64,Int64}
    Wc::SparseMatrixCSC{Float64,Int64}
    br::Array{Float64,1}
    bc::Array{Float64,1}    
    temIntStr(nq,v,r,c,Nr,Nc,valSwaA,valSwaB,tA,tB,Wr,Wc,br,bc) = new(nq,v,r,c,Nr,Nc,valSwaA,valSwaB,tA,tB,Wr,Wc,br,bc)
end
# processed data: temporary structure
mutable struct temPro
    Δ1::Array{Float64,1}   
    Δ2::Array{Float64,1}
    χ::Array{Float64,1}
    wra::Array{Float64,1}
    wrb::Array{Float64,1}
    wca::Array{Float64,1}
    wcb::Array{Float64,1}
    temPro(Δ1,Δ2,χ,wra,wrb,wca,wcb) = new(Δ1,Δ2,χ,wra,wrb,wca,wcb)
end
# output: eigenvalues and eigvectors information
mutable struct eigInf
    q::Int64
    λ1::Array{Float64,1}
    λ2::Array{Float64,1}
    P1::SparseMatrixCSC{Float64,Int64}
    P2::SparseMatrixCSC{Float64,Int64}
    Δt::SparseMatrixCSC{Float64,Int64}
    nl1::Array{Float64,1}
    nl2::Array{Float64,1}
    P1a::Array{Float64,1}
    P1b::Array{Float64,1}
    P2a::Array{Float64,1}
    P2b::Array{Float64,1}
    eigInf(q,λ1,λ2,P1,P2,Δt,nl1,nl2,P1a,P1b,P2a,P2b) = new(q,λ1,λ2,P1,P2,Δt,nl1,nl2,P1a,P1b,P2a,P2b)
end

Then, I propose a new function as follows

function myfun4(iObj::temIntStr,
                pObj::temPro,
                rObj::eigInf)
    eA = iObj.valSwaA[:,1]
    eB = iObj.valSwaB[:,1]
    fA = iObj.valSwaA[:,2]
    fB = iObj.valSwaB[:,2]
    # identifying two vectors br and bc
    idxR = view(eA,:).==view(eB,:)
    idxC = view(fA,:).==view(fB,:)
    iObj.br[idxR] .= 0
    iObj.bc[idxC] .= 0  
    # TODO: these following operations seem to be quite expensive. 
    # This might be something related to your suggestion 1. I am not quite sure whether this be 
    # improved futher since I also don't have much experience with SparseMatrices. Perhaps, we could 
    # write a loop to perfom these commands to avoid reallocations here? 
    iObj.Wr .= iObj.Nr[:,eB] .- iObj.Nr[:,eA]
    iObj.Wc .= iObj.Nc[:,fB] .- iObj.Nc[:,fA]
    # linear indices
    tAIdx = iObj.tA .+ iObj.v .* (iObj.nq .-1)
    tBIdx = iObj.tB .+ iObj.v .* (iObj.nq .-1)
    tABIdx = ApplyArray(vcat,tAIdx,tBIdx)

    pObj.wra .= view(iObj.Wr,tAIdx)   
    pObj.wrb .= view(iObj.Wr,tBIdx)
    pObj.wca .= view(iObj.Wc,tAIdx)
    pObj.wcb .= view(iObj.Wc,tBIdx)
    
    iObj.Wr[tABIdx].= 0
    iObj.Wc[tABIdx].= 0
    # TODO: can this be improved futher?
    pObj.Δ1 .= pObj.wra .- pObj.wrb .+ iObj.br .+ pObj.wca .- pObj.wcb .+ iObj.bc
    pObj.Δ2 .= pObj.wra .+ pObj.wrb .+ pObj.wca .+ pObj.wcb    
    rObj.Δt .= iObj.Wr .+ iObj.Wc
    # Here I used your suggestion 4 to avoid allocations.
    mulSparse(pObj.χ,rObj.Δt,pObj.Δ1,pObj.Δ2)   
    rObj.λ1 .= pObj.Δ1 .+ pObj.χ
    rObj.λ2 .= pObj.Δ1 .- pObj.χ    
    lmul!(2.0,rObj.Δt)
    # normalize vectors here
    # Here I used your suggestion 3 to avoid allocations.
    mulVec(rObj.nl1,rObj.λ1,pObj.χ) 
    mulVec(rObj.nl2,rObj.λ2,pObj.χ,true)     
    rObj.P1 .= rObj.Δt
    rObj.P2 .= rObj.Δt
    rObj.P1a .=  pObj.Δ2 .+ rObj.λ1
    rObj.P1b .=  pObj.Δ2 .- rObj.λ1
    rObj.P2a .=  pObj.Δ2 .+ rObj.λ2
    rObj.P2b .=  pObj.Δ2 .- rObj.λ2

    rObj.P1[tABIdx] = ApplyArray(vcat,rObj.P1a,rObj.P1b)
    rObj.P2[tABIdx] = ApplyArray(vcat,rObj.P2a,rObj.P2b)
    return nothing
end

function mulSparse(y::Array{Float64,1},                  
                 A::SparseMatrixCSC{Float64,Int64},
                 b1::Array{Float64,1},
                 b2::Array{Float64,1})
    #rowval = A.rowval
    nzval = A.nzval
    fill!(y,0.0)
    @inbounds for i ∈ eachindex(y)
                 for k=A.colptr[i]:(A.colptr[i+1]-1)
                    #ki=rowval[k] # index
                    #kv=nzval[k] # values        
                    y[i] += 2*abs2(nzval[k])
                end
                y[i] += abs2(b1[i]) + abs2(b2[i])
                y[i] = sqrt(y[i])
              end
end
function mulVec(y::Array{Float64,1},                  
                 a1::Array{Float64,1},
                 a2::Array{Float64,1},sign=false)
    if ~sign             
        @inbounds @simd for i ∈ eachindex(y)
                    y[i] = 2*sqrt(a1[i]*a2[i])
        end
    else
        @inbounds @simd for i ∈ eachindex(y)
                    y[i] = 2*sqrt(-a1[i]*a2[i])
        end
    end
end

In order to run myfun4, you could perform the following commands:

v = 580;
q = 43416;
r = 18;
c = 36;
Nr = (1/c).*sprandn(v, r, 0.20);
Nc = (1/r).*sprandn(v, c, 0.20);
valSwaA = [rand(1:r,q) rand(1:c,q)];
valSwaB = [rand(1:r,q) rand(1:c,q)];
Wr = spzeros(Float64,v,q);
Wc = spzeros(Float64,v,q);
br = fill((1/c)*2,q);
bc = fill((1/r)*2,q);
nq = collect(1:q);
tA = rand(1:v,q);
tB = rand(1:v,q);
const iObj = temIntStr(nq, v, r, c, 
                     Nr, Nc,
                     valSwaA, valSwaB,
                     tA, tB,
                     Wr, Wc,
                     br, bc);
# proccessed object:
const pObj = temPro(zeros(Float64,q),
                     zeros(Float64,q),
                     zeros(Float64,q),
                     zeros(Float64,q),
                     zeros(Float64,q),
                     zeros(Float64,q),
                     zeros(Float64,q));
# output:                     
const rObj = eigInf(q,
              zeros(Float64,q),
              zeros(Float64,q),
              zeros(Float64,v,q),
              zeros(Float64,v,q),
              zeros(Float64,v,q),
              zeros(Float64,q),
              zeros(Float64,q),
              zeros(Float64,q),
              zeros(Float64,q),
              zeros(Float64,q),
              zeros(Float64,q));
@btime myfun4($iObj, $pObj, $rObj);

Now, I used BenchmarkTools to obtain computational times for myfun3 and myfun4. Here they are:

myfun3: 2.187 s (347540 allocations: 2.69 GiB)
myfun4: 945.036 ms (173775 allocations: 764.86 MiB)

As you can see, the performance of myfun4 is much better than the performance of myfun3. But it seems there are still a lot of allocations with myfun4. Do you think it is possible to improve this further? Thank you very much in advance! Also, thank you for showing the two reference links. Indeed, I have checked those links!

Commenting out these two lines

    rObj.P1 .= rObj.Δt
    rObj.P2 .= rObj.Δt

cuts the time in half: 768.564 ms (173731 allocations: 764.87 MiB) --> 343.465 ms (55 allocations: 314.48 MiB). And they appear to be dealing with sparse objects which aren’t sparse at all, 57% full:

julia> rObj.Δt
580×43416 SparseMatrixCSC{Float64,Int64} with 0 stored entries:

julia> myfun4(iObj, pObj, rObj);

julia> rObj.Δt
580×43416 SparseMatrixCSC{Float64,Int64} with 14325913 stored entries:
⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛

julia> 14360504 / (580 * 43416)
0.5702849100601717

It’s not so easy to figure out what else is going on. I suspect that a version which breaks things into more functions would be easier to profile. A version which replaces every stuct with a NamedTuple, deletes almost every type annotation, would make it easier to experiment with dense/sparse.