If you are simply diagonalizing matrices, then iterative techniques should normally be of not much help. It is a must in planewave-DFT because the matrix can be as large as 1m * 1m, while you only need 1k eigenvectors.

And, in most iterative techniques, you will make a reduced space, which is still larger than the number you need (most of the time it will be twice as large), and then perform direct diagonalization in the reduced space. Because of this, it may be a waste of time to calculate all eigenvectors with `zhegv`

. Instead, with `zhegvx`

you can set the number of eigenvectors you need. `zhegvd`

will also calculate all eigenvectors, but it uses a divide-and-conquer algorithm and is generally faster than `zhegv`

.

Iām using the `gv`

variants because we need that in DFT, but it should be the same for the `ev`

variants. That being said, you may want to have a try to `zheevd`

, and even `zheevx`

if you need only some of the eigenvectors. However, if your matrix is small, there should be little difference.

If the assignment of off-diagonal parts are the most time-consuming, it may be helpful to rearrange arr_1 so that it can be achieved as arr_i[1, j, i]. But since itās only approximately 10 * 10 * 3, it should have little impact.

And Iāve tried the following:

```
M = 10; N = 1000; y = zeros(ComplexF64, M*N, M*N); a = zeros(Int, M, M); x = zeros(ComplexF64, N, N); rand!(x);
for i = 1 : M * M
a[i] = Int(round(rand()*5))
end
function set_1!(y, a, x, M, N)
for i = 1 : M, j = 1 : i
y[1+N*(i-1):N*i, 1+N*(j-1):N*j] = a[i, j] .* x
end
end
function set_2!(y, a, x, M, N)
for i = 1 : M, j = 1 : i
y[1+N*(i-1):N*i, 1+N*(j-1):N*j] .= a[i, j] .* x
end
end
```

and get the following results:

```
julia> @btime set_1!(y, a, x, M, N)
338.428 ms (110 allocations: 839.24 MiB)
julia> @btime set_2!(y, a, x, M, N)
90.999 ms (0 allocations: 0 bytes)
```

I then tried directly calling BLAS:

```
function set_4!(y, a, x, M, N)
for p = 1 : N
ptr_x = pointer(x, 1+N*(p-1))
for i = 1 : M, j = 1 : i
coeff = a[i, j]
ptr_y = pointer(y, 1+N*M*(N*(j-1)+p-1)+N*(i-1))
ccall((BLAS.@blasfunc("zaxpby_"), BLAS.libblas), Cvoid, (Ref{BlasInt}, Ref{ComplexF64}, Ptr{ComplexF64}, Ref{BlasInt}, Ref{ComplexF64}, Ptr{ComplexF64}, Ref{BlasInt}), N, coeff, ptr_x, 1, zero(ComplexF64), ptr_y, 1)
end
end
end
```

and the result is

```
julia> @btime set_4!(y, a, x, M, N)
69.349 ms (0 allocations: 0 bytes)
```

why 4? because `set_3!`

is just like it, but `p = 1 : N`

is the innermost loop. It is slightly slower then `set_2!`

, so I donāt put it here.

One minor note that `Int(round(rand()*5))`

could just be `rand(0:5)`

It appears to me that `mat_h_plus!`

and `mat_h_minus!`

construct spin raising and lowering operators for spins of size `M`

(the factor of `im`

there seems a bit non-standard to me and `sqrt(ind)`

is not quite right, but I think you simplified for the purpose of posting the code). Anyways the point is `mat_plus`

and `mat_minus`

are rather sparse with (M-1)/M^2 \propto \mathcal{O}(1/M) entries. So `mat_x`

and `mat_y`

are similarly sparse. The following code

```
for i = 1:M, j=1:i
res[1+N*(i-1):N*i, 1+N*(j-1):N*j] = mat_x .* arr_1[i,j,1] .+ mat_y .* arr_1[i,j,2] .+ coeff_b .* arr_1[i,j,3]
end
```

essentially performs a few Kronecker products and should be equivalent to

```
res = @views kron(arr_1[:,:,1], mat_x) + kron(arr_1[:,:,2], mat_y) + kron(arr_1[:,:,3], coeff_b * I(M))
```

The second loop can be simplified similarly and should be equivalent to:

```
res += kron(Diagonal(mat_1), I(M)) + I(M*N)*coeff_a
```

So for better performance, I would use sparse matrices. Here is an attempt:

```
using SparseArrays
# HELPER FUNCTIONS
function mat_h_plus(size, coeff::Real)
I = 1:size-1
J = 2:size
V = im*coeff*sqrt.(I)
return sparse(I,J,V, size,size) # constructs a sparse size x size matrix
# size_o = size(original, 1)
# for ind in 1:size_o-1
# original[ind,ind+1] = im * coeff * sqrt(ind) # procedure like this
# end
end
function mat_h_minus() # similar to mat_h_plus
end
# MAIN FUNCTION
function make_matrix(
N::Integer, # large integer (~1000)
M::Integer, # small integer (~10)
mat_1::Vector{ComplexF64}, # size M
arr_1::Array{ComplexF64, 3}, # MxMx3 array
coeff::Real # coefficients for mat_hs
)
res = zeros(ComplexF64, M*N, M*N)
mat_plus = mat_h_plus(M, coeff)
mat_minus = mat_h_minus(M, coeff) # actually shouldn't that just be the Hermitian conjugate of mat_plus?
# mat_minus = mat_plus' # hermitian conjugate
mat_x = (mat_plus - mat_minus)/sqrt(2) # something like this
mat_y = # similar to calculation above
# diag = coeff_a * I(M) # diagonal matrix
# assign values for off-diagonal parts
res .+= @views kron(arr_1[i,j,1], mat_x)
res .+= @views kron(arr_1[i,j,2], mat_y)
res .+= @views kron(arr_1[i,j,3], coeff_b*I(M))
# for i = 1:M, j=1:i
# (
# res[1+N*(i-1):N*i, 1+N*(j-1):N*j]
# = mat_x .* arr_1[i,j,1] .+ mat_y .* arr_1[i,j,2] .+ coeff_b .* arr_1[i,j,3]
# )
# end
# diagonal parts
res .+= kron(Diagonal(mat_1), I(M))
res .+= coeff_a*I(M*N)
# for i=1:M
# (
# res[1+N*(i-1):N*i, 1+N*(i-1):N*i]
# .+= diag .+ mat_1[i] * I(M)
# )
# end
return Hermitian(res)
end
```

The next low-hanging fruit here is to move the allocation of `res`

outside of `make_matrix`

to only allocate it once per thread as I wrote earlier.

Another improvement would be use a better in-place addition by replacing the `.+=`

with a custom addition function like:

```
function myadd!(dense::Matrix, sparse::SparseMatrixCSC)
for col in 1:size(sparse, 2)
for r in nzrange(sparse, col)
dense[rowvals(sparse)[r], col] += nonzeros(sparse)[r]
end
end
return dense
end
```

Thank you.

For our final result, we should obtain all the eigenvalues of given matrices. So it might not be useful to use iterative techinques or subroutines like `zhegvx`

.

Actually, the `arr_1[:,:,i]`

is passed by a function in a form of matrices. Rearranging them might increase numbers of assignment.

I will try your last idea to call blas subroutines directry. Thank you.

Thank you for the reply.

I tried three versions from your advices.

- original code. it takes 13 minutes, and make 38 GB allocations.
- use kronecker product. 10 minutes, 69 GB
- kronecker product + sparse arrays. 9 minutes, 69 GB
- only sparse arrays. 9 minutes, 22 GB

Considering those experiments, I decided to use sparse arrrays, and not to use kronecker product methods. It reduces roughly 30 % of time and 40 % of memory allocations.

When you say āwithout kronecker productsā you mean you build the sparse array directly from a loop? That would of course be more efficient but also a bit more difficult, so I didnāt propose it originally. If you post the current code again, Iāll have another look and see whether we can make it even faster

Edit: Does M stay constant the whole time? If so it would make sense to construct the matrices `mat_plus`

,ā¦ only once and not on every call to `make_matrix`

.

I wanted to say that I donāt compute elements of `res_mat`

with kronecker product.

Certainly, M is fixed for whole the iterations, but make_matrix has other argument called `x`

, which influences the mat_plus through `coeff`

argument. So I should make mat_plus for every iterations.

Below is my current code for make_matrix:

```
using SparseArrays
# HELPER FUNCTIONS
function mat_h_plus(size, coeff::Real)
I = 1:size-1
J = 2:size
V = coeff*sqrt.(I)
return sparse(I,J,V, size,size) # constructs a sparse size x size matrix
end
# MAIN FUNCTION
function make_matrix(
N::Integer, # large integer (~1000)
M::Integer, # small integer (~10)
mat_1::Vector{ComplexF64}, # size M
arr_1::Array{ComplexF64, 3}, # MxMx3 array
coeff::Real # coefficients for mat_hs
)
res = zeros(ComplexF64, M*N, M*N)
mat_plus = mat_h_plus(M, coeff)
mat_minus = adjoint(mat_plus) # transpose conjugate of mat_plus
mat_x = (mat_plus - mat_minus)/sqrt(2) # something like this
mat_y = # similar to calculation above
# assign values for off-diagonal parts
for i = 1:M, j=1:i
(
res[1+N*(i-1):N*i, 1+N*(j-1):N*j]
= mat_x .* arr_1[i,j,1] .+ mat_y .* arr_1[i,j,2] .+ coeff_b .* arr_1[i,j,3]
)
end
# diagonal parts
for i=1:M
(
res[1+N*(i-1):N*i, 1+N*(i-1):N*i]
.+= coef.*I(M) .+ mat_1[i] * I(M)
)
end
return Hermitian(res)
end
```

And this is main part:

```
# calculate needed values for x
function calc_function(M, N, x)
kvalue_range = range(0, 0.5, 201)
result_vector = zeros(201)
mat_1, arr_1 = # diagonalization of size M(~10)
coeff = coeff_func(x) # some arithmetics
Threads.@threads for k_ind in 1:201
kvalue = kvalue_range[k_ind]
energy = eigvals!( make_matrix(N, M, mat_1, arr_1, mat_1, coeff)
# we need different matrices for each kvalues
result_vector[k_ind] = sub_calculation(energy)
end
result = integrate(result_vector)
# integrate uses the trapezoidal rule: just adding values
retun result
end
const dx = 0.01 # for differentiation
container = zeros(201)
for ind in eachindex(container)
x = 0.1 * (ind-1)
Res_minus = calc_func(x-dx)
Res_plus = calc_func(x+dx)
container[ind] = (Res_plus - Res_minus) / (2*dx)
end
# some IOs
```

Note though that in this code. `mat_plus`

, `mat_minus`

and thus also `mat_x`

and `mat_y`

are directly proportional to `coeff`

. So you could still build them once and then reuse. Also you really should reuse the space for `res`

in `make_matrix`

like I suggested earlier.

I have the strong feeling that the code you posted is not your true code (spelling mistakes, syntax errors,ā¦), so I wonāt make the effort of restructuring it, because I donāt know whether these transformations would apply to your true code. If you post what you real code does here, then Iāll gladly show you how to restructure it to gain more performance.

Iām sorry, but it is hard to provide a more detailed code example since the real code involves information we have not published yet.

Thank you so much for all your advice. It was helpful for more efficient calculation by methods other than parallelization, which was my first purpose.