Proper way to access large matrix in a matrix list

I have the following code snippet where eigvecs_k is a global variable, of type Vector{Matrix{ComplexF64}} and consists two matrix of size 4300*4300.

basisN = 4300
function constructPsi!(coeffs_k::Vector{Vector{ComplexF64}}, ψ::Vector{ComplexF64})::Nothing
    fill!(ψ, 0.0)
    vec = Vector{ComplexF64}(undef, basisN)
    for k in [1, 2]
        coeffs = coeffs_k[k]
        eigvecs= eigvecs_k[k]
        for n in 1:basisN
            @. vec[:] +=  coeffs_k[k][n] * (@view eigvecs[:, n])
        end
        @. ψ += vec
    end
    @. ψ /= norm(ψ)
    return nothing
end

I try hard to eliminate unnecessary memory allocation and improve the speed. However, I’m not sure whether eigvecs= eigvecs_k[k] is the proper way to access the matrix element in eigvecs_k[k]. Bechnmarks show large memory allocation ~500M (the size of eigvecs_k) in calling the function.

BTW, is there room/way to improve the core computation @. vec[:] += coeffs_k[k][n] * (@view eigvecs[:, n]) ?

You should never use global variables (unless they are constants const). Start passing eigvecs_k as a parameter to the function as well. The same for basisN.

edited: Y see now that you have a vector of matrices. Then yes, you have the corresponding matrix. You do not need to use a view, because you are referencing the complete matrix (just giving a new label to it). The problem is that they are global, probably.

That specific line I don’t think so. But you do not need the @view there (if I understand correctly you are mutiplying a scalar, coeffs... by each element of eigvecs and summing to each element of vec.
(what you can do is to expand that into a loop and try LoopVectorization, for example).

Some other details: probably for k in 1:2 is what you want in the loop (not [1,2]) and you do not seem to actually require vec. Can’t you sum directly over \psi?

Ps: If you add a running example (with some data already there), it is much easier to suggest directly modifications that will speedup the code.

1 Like

Aside from the generally poor practice of using global variables, the caveat LM has here refers to Julia performance. Since he didn’t include the link, here it is: Performance Tips · The Julia Language.

2 Likes

This is one alternative:

using LoopVectorization

function constructPsi_2!(coeffs_k::Vector{Vector{T}},
                         eigvecs::Vector{Matrix{T}},
                         ψ::Vector{T}) where T
    fill!(ψ, zero(T))
    vec = zeros(T,length(ψ))
    for k in 1:2
        for (i,c) in pairs(coeffs_k[k])
            eigs = eigvecs[k] # avx does not recognize eigvecs[k][i,j]
            @avx for j in eachindex(vec)
                vec[j] += c * eigs[j,i]
            end
        end
        @. ψ += vec
    end
    @. ψ /= norm(ψ)
    nothing
end

Resulting:

julia> include("./yao.jl")
  166.378 ms (117343 allocations: 569.14 MiB) # original
  71.807 ms (2 allocations: 67.27 KiB) # new

code
using BenchmarkTools, Test
using LinearAlgebra 

function constructPsi!(coeffs_k::Vector{Vector{ComplexF64}}, ψ::Vector{ComplexF64})::Nothing
    fill!(ψ, 0.0)
    vec = Vector{ComplexF64}(undef, basisN)
    fill!(vec,0.0)
    for k in [1, 2]
        coeffs = coeffs_k[k]
        eigvecs= eigvecs_k[k]
        for n in 1:basisN
            @. vec[:] +=  coeffs_k[k][n] * (@view eigvecs[:, n])
        end
        @. ψ += vec
    end
    @. ψ /= norm(ψ)
    return nothing
end

function constructPsi_2!(coeffs_k::Vector{Vector{T}}, 
                         eigvecs::Vector{Matrix{T}},
                         ψ::Vector{T}) where T
    fill!(ψ, zero(T))
    vec = zeros(T,length(ψ))
    for k in 1:2
        for (i,c) in pairs(coeffs_k[k])
            eigs = eigvecs[k]
            @avx for j in eachindex(vec)
                vec[j] += c * eigs[j,i]
            end
        end
        @. ψ += vec
    end
    @. ψ /= norm(ψ) 
    nothing
end

basisN = 4300

T = ComplexF64

eigvecs_k = [ rand(T,basisN,basisN) for i in 1:2 ]
coeffs_k = [ rand(T,basisN) for i in 1:2 ]

ψ1 = zeros(T,basisN)
constructPsi!(coeffs_k, ψ1)

ψ2 = zeros(T,basisN)
constructPsi_2!(coeffs_k, eigvecs_k, ψ2)

@test ψ1 ≈ ψ2

ψ0 = zeros(T,basisN)
@btime constructPsi!($coeffs_k, ψ) setup=(ψ=copy(ψ0)) evals=1
@btime constructPsi_2!($coeffs_k, $eigvecs_k, ψ) setup=(ψ=copy(ψ0)) evals=1




add on:

With this inner loop:

       for (i,c) in pairs(coeffs_k[k])
            for j in eachindex(vec)
                @inbounds vec[j] += c * eigvecs[k][j,i]
            end
        end

You get almost the same speed as using @avx.

1 Like

That’s pretty pretty COOL!

One thing surprises me is that, in my code, I am actually using OffsetArrays to offset the index of coeffs_k and eigvecs_k to be kList to make the program readable to easy to maintain,

function constructPsi_2!(coeffs_k::OffsetArray{Vector{T},1, Vector{Vector{T}}}, eigvecs_k:: OffsetArray{Matrix{T},1, Vector{Matrix{T}}}, ψ::Vector{T})::Nothing where T
    fill!(ψ, zero(T))
    vec = zeros(T,length(ψ))
    for k in kList
        coeffs = coeffs_k[k]
        eigs = eigvecs_k[k]
        for (i,c) in pairs(coeffs)
            @avx for j in eachindex(vec)
                vec[j] += c * eigs[j,i]
            end
        end
        @. ψ += vec
    end
    ψ[:] /= norm(ψ)
    nothing
end

but then the performance significant drops, changing from your version

 julia> 223.757 ms (117344 allocations: 569.28 MiB)
       39.092 ms (2 allocations: 67.27 KiB)

to OffsetArrays version

 julia> 223.757 ms (117344 allocations: 569.28 MiB)
       3.290 s (176202344 allocations: 4.28 GiB)

Is there easy fix for this problem? If not, I will just remove OffsetArrays and use the normal vector list.

Yes, you get everything correct. The global variable basisN, eigvecs_k etc are all const, so I guess without passing them as function parameters does not hurt.

The real code is more complex, ψ and vec are actually of different length, I have a additional function that translates vec to phi of size ψ, and then ψ .+= phi.

1 Like

Mind making a reproducible example that I can copy & paste to run and compare?
The fact you got this many allocations along with bad performance:

 3.290 s (176202344 allocations: 4.28 GiB)

suggests there was a type instability somewhere.

The following is the simplified demo code that illustrates the problem

using OffsetArrays
using LinearAlgebra
using SparseArrays
using LoopVectorization

const L = 24
const mass = 1.0
const basisN = 4341
kList = [0, 12]
psi = zeros(ComplexF64, basisN)

#=============================================================================#
# constructHam
#=============================================================================#
function constructHam(k::Int64, mass::Float64)
    Ham = spzeros(ComplexF64, basisN, basisN)
    for _ in 1: L*basisN
        i = trunc(Int, rand()*basisN) + 1
        j = trunc(Int, rand()*basisN) + 1
        Ham[i, j] += mass*rand()
        Ham[j, i] = Ham[i, j]
    end
    return Ham
end

#=============================================================================#
# get Hamiltonian and eigensystem in all momentum sectors
#=============================================================================#
function getHam_k(mass::Float64)
    Ham_k = SparseMatrixCSC{ComplexF64,Int64}[]
    for k in 0:L-1
        if k in kList
            Ham = constructHam(k, mass)
        else
            Ham = spzeros(ComplexF64, 0, 0)
        end
        push!(Ham_k, Ham)
    end
    return OffsetVector(Ham_k, 0:L-1)
end
const Ham_k = getHam_k(mass)

#------------------------------------------------------------------------------
# get eigvecs and eigvals
#------------------------------------------------------------------------------
function getEigenSystem()
    eigvals_k = Vector{Float64}[]
    eigvecs_k = Matrix{ComplexF64}[]
    for k in 0:L-1
        if k in kList
            println("Diagonalizing momentum k = $k")
            # Hermitian necessary to make sure eigvecs all orthonormal
            eigvals, eigvecs = eigen(Hermitian(Matrix(Ham_k[k])))
        else
            eigvals = Vector{Float64}(undef, 0)
            eigvecs = Matrix{ComplexF64}(undef, 0, 0)
        end
        push!(eigvals_k, eigvals)
        push!(eigvecs_k, eigvecs)
    end
    return OffsetVector(eigvals_k, 0:L-1), OffsetVector(eigvecs_k, 0:L-1)
end
const eigvals_k, eigvecs_k = getEigenSystem()

function get_coeffs_k()
    coeffs_k = Vector{ComplexF64}[]
    for k in 0:L-1
        if k in kList
            # coeffs = rand(ComplexF64, basisN)
            eigvecs = eigvecs_k[k]
            coeffs = [conj(eigvecs[trunc(Int, rand()*basisN)+1, i]) for i in 1:basisN]
        else
            coeffs = ComplexF64[]
        end
        push!(coeffs_k, coeffs)
    end
    return OffsetVector(coeffs_k, 0:L-1)
end
const coeffs_k = get_coeffs_k()

#------------------------------------------------------------------------------
#
#------------------------------------------------------------------------------
const T = ComplexF64
function constructPsi_0!(coeffs_k::OffsetArray{Vector{T},1, Vector{Vector{T}}}, eigvecs_k:: OffsetArray{Matrix{T},1, Vector{Matrix{T}}}, ψ::Vector{T})::Nothing where T
    fill!(ψ, 0.0)
    phi = Vector{ComplexF64}(undef, basisN)
    for k in kList
        coeffs = coeffs_k[k]
        eigvecs= eigvecs_k[k]
        vec = zeros(ComplexF64, basisN)
        for n in 1:basisN
            @. vec +=  coeffs_k[k][n] * (@view eigvecs[:, n])
        end
        @. ψ += vec
    end
    ψ[:] /= norm(ψ)
    return nothing
end

function constructPsi_1!(coeffs_k::OffsetArray{Vector{T},1, Vector{Vector{T}}}, eigvecs_k:: OffsetArray{Matrix{T},1, Vector{Matrix{T}}}, ψ::Vector{T})::Nothing where T
    fill!(ψ, zero(T))
    for k in kList
        coeffs = coeffs_k[k]
        eigvecs = eigvecs_k[k]
        vec = zeros(T, basisN)
        for (i, c) in pairs(coeffs)
            @avx for j in eachindex(vec)
                vec[j] += c * eigvecs[j,i]
            end
        end
        @. ψ += vec
    end
    ψ[:] /= norm(ψ)
    return nothing
end

The number of the computation in constructPsi! is ~4341*4341*2 , with GHz computer, the optimal running time should ~ 4341*4341*2/1E9 ~ 40ms. However,

julia> @time constructPsi_1!(coeffs_k, eigvecs_k, psi)
  3.660657 seconds (179.70 M allocations: 4.365 GiB, 18.60% gc time)
kList

Is global and not constant there

1 Like

const kList = [0, 12] solves the problem!!! I’m VERY VERY surprised at this point. Coming from Fortran, I’m used to use module/global variables mindlessly and did not pay much attention to performance tips in Julia Documentation. Now is time for me to read it carefully. But why such small difference could have such a huge impact on performance?

Another thing, in my typical program, I have many parameters and variables that need to be used in various functions, passing them as functions parameters will make the program looks ugly and cumbersome to work with. Is it fine to use const global variables then?

It is, although that is not very flexible (they are constants afterall). Also, your code becomes hard to understand, as the result of a function depends on one or many things which are defined somewhere else.

Another alternative is to wrap all these parameters in a struct to pass only one parameter “down”. The Parameters package allows for a convenient constructor:

using Parameters
struct MyPars
   L::Int = 24
   mass::Float64 = 1.0
   basisN::Int = 4341
   kList::Vector{Int} = [0, 12]
end

newpars = MyPars() # defaults to those values
newpars = MyPars(L=48) # change something

function constructPsi_2!(coeffs_k, eigvecs_k, ψ; pars = MyPars()) # default 
    @unpack L, mass = pars  # "unpack" L and mass
    whatever = L * mass
    ...
    return ...
end

# call with:

constructPsi_2!(coeffs_k, eigvecs_k, ψ)

#or

constructPsi_2!(coeffs_k, eigvecs_k, ψ, pars = newpars)


#or


constructPsi_2!(coeffs_k, eigvecs_k, ψ, pars = MyPars(L=48))

Another thing: those type annotations in the function:

function constructPsi_2!(coeffs_k::OffsetArray{Vector{T},1, Vector{Vector{T}}}, eigvecs_k:: OffsetArray{Matrix{T},1, Vector{Matrix{T}}}, ψ::Vector{T})::Nothing where T

Don’t do anything for performance. They are useful for debugging and understanding the code better only. But specializing to an OffsetArray is certainly too much. You probably should at most use:

function constructPsi_2!(coeffs_k::AbstractVector{Vector{T}}, 
                         eigvecs_k:: AbstractVector{Matrix{T}}, 
                         ψ::AbstractVector{T}) where T

The Abstract there allows you to input a greater variety of vector types (such as offsets, slices, views).

2 Likes

Now this is is at the very core of how Julia works and allows both dynamic typing and performance. This is one of many texts about that:

1 Like

Thanks you so much! That is very informative and helpful :+1: :+1: :+1:

1 Like