Help with this code (deallocation/pointers on sparse matrices)

In a few hours I will provide the full program (dinner time). I don’t know if you want all of it or just a simple reproducible example.

Besides sparse arrays, Julia has many types that can save time or space depending on the use case. Take for instance kronecker (from Kronecker.jl), which simply wraps the original matrices and then forwards indexing to the appropriate factor, i.e., it is even sparser than sparse:

julia> using LinearAlgebra, SparseArrays, Kronecker

julia> S = sparse(2.3 * I(100));

# Check how much memory that matrix is using up
julia> Base.summarysize(S)
2568
# Should be approximately the amount of memory needed to store each entry plus two index vectors
julia> 100 * (sizeof(2.3) + 2 * sizeof(1))
2400
# Now for kron
julia> Base.summarysize(kron(S, S))
240168
# Still a sparse matrix, but approx. 100 times as large
julia> Base.summarysize(kron(S, S)) / Base.summarysize(S)
93.5233644859813
# Now kronecker
julia> Base.summarysize(kronecker(S, S))
2688
# Just 5% more, i.e., super sparse
julia> Base.summarysize(kronecker(S, S)) / Base.summarysize(S)
1.0467289719626167
# But, indexing is now more complicated and takes a bit more time
julia> using BenchmarkTools

julia> @btime $(S)[37, 53];
  6.301 ns (0 allocations: 0 bytes)

julia> @btime $(kron(S, S))[370, 5300];
  6.719 ns (0 allocations: 0 bytes)

julia> @btime $(kronecker(S, S))[370, 5300];
  26.824 ns (0 allocations: 0 bytes)

In the end, it’s all about trade-offs. Julia with its multiple dispatch is very flexible and usually good at picking a suitable implementation for different types. Just let it do its magic and don’t worry about low-level details … unless you have solid evidence, e.g., from profiling, that a sub-optimal or even bad path was taken.

2 Likes

So, instead of giving the full code (which still, to my surprise, doesn’t work), I want to analyze the individual pieces, so as not to confuse things with a wall of text.

First: I need to multiply rectangular dense matrices with sparse square matrices.
I thought of going with mul!, but it seems I have some issues.
If I write:

using LinearAlgebra, Kronecker

function foo()
    m=2
    # Block_Sz1 = Array{Float64}(undef, 4,2)
    Sz = [[0.5, 0] [0,-0.5]]
    A = [[0,1,0,0] [0,0,3,1]]
    Identity = [[1,0] [0,1]] 
    mul!(Block_Sz1, kronecker(Identity, Sz), A)
end
foo()

It says that Block_Sz1 is not defined.
I can define it: I tried removing the comment and then it all works.
Then I tried benchmarking, but it seems that @btime doesn’t know of the existence of Block_Sz1, while @time succeeds.
If I try to allocate Block_Sz1 for the sake of @btime (but still run without using it), mul! throws a big error:

LoadError: InexactError: Int64(-0.5) (...)

I don’t know why, because the documentation of mul! says:
Calculates the matrix-matrix or matrix-vector product AB and stores the result in Y, overwriting the existing value of Y. Note that Y must not be aliased with either A or B.
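For reference, a minimal sketch of that documented contract (toy matrices, made-up values): Y must already exist, with the right size and element type, before the call.

using LinearAlgebra

A = [0.5 0.0; 0.0 -0.5]
B = [1.0 2.0; 3.0 4.0]
Y = similar(B)    # preallocate the output: same size and element type as A*B
mul!(Y, A, B)     # fills Y with A*B in place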

However, we can proceed further because I have sparse matrices.
Is mul! the generic one, or a sparse-specific one too?
I found this on the internet: Mul! vs SparseMatrix

It says that mul! is generic, while spmatmul! is the sparse-specific one; however, in the SparseArrays documentation I can’t find anything related to the keyword “mul”.
We can try mul! anyway.

From the previous example I understood that I first need to preallocate the array as undefined, but I can’t create an undefined sparse array, so I will go with a dense one instead.
It ends up spending 5 times the memory and 4 times the time:

using SparseArrays, LinearAlgebra, Kronecker

function foo()
    m=2
    Block_Sz1 = Array{Float64}(undef, 4,2)
    Sz_sparse = sparse([[0.5, 0] [0,-0.5]])
    A_sparse = sparse([[0,1,0,0] [0,0,3,1]])
    @time mul!(Block_Sz1, kronecker(sparse(I, m, m), Sz_sparse), A_sparse)
end

foo()

Even if I manage to solve these issues, there are still uncertainties:

  1. Reading on the internet, I see I need to preallocate to help the compiler, and it’s obviously required. I will experiment on this subject after having a program that works within Julia’s semantics.
  2. In the end I will compute a triple product like O’ A O, where ’ is transposition, A is square and sparse, and O is rectangular and dense. I’ve seen there are packages like Octavian.jl and Tullio.jl that could help optimize the performance (see the sketch after this list).
  3. I still wonder whether it’s possible, after all these steps and related to point 1, to preallocate these sparse matrices before the loop and view slices of them. But I think that will come at the end of all this.
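To make point 2 concrete, here is a minimal sketch of that triple product with a preallocated intermediate (sizes, density, and variable names are made-up placeholders):

using SparseArrays, LinearAlgebra

n, k = 100, 10
A = sprandn(n, n, 0.05)              # square sparse matrix (placeholder)
O = randn(n, k)                      # rectangular dense matrix (placeholder)

tmp = Matrix{Float64}(undef, n, k)   # holds A*O
out = Matrix{Float64}(undef, k, k)   # holds O'*(A*O)
mul!(tmp, A, O)                      # sparse-dense product
mul!(out, O', tmp)                   # out == O' * A * O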

Ok, thanks for writing that up. That makes it a lot easier to help you. Next time please also include the code snippets you tried that didn’t work; then I can just point at them and explain what went wrong. Writing a bit too much is generally better than writing too little :slight_smile:

So let’s start addressing things one-by-one.

Note the result of dense*sparse is another dense matrix.

Yes, of course. If you want to use mul!, the whole point is that you need to preallocate the output. Again, Julia does not have Matlab’s or Fortran’s semantics, so mul!(Y, A, B) does not define some space Y for you. Variables are just names for things, not things themselves.

You didn’t provide code, so I don’t know exactly what you did, but presumably you just swapped @time for @btime. In that case you need to interpolate variables into the expression. The reasons are a bit too technical for right now, but interpolation reduces the overhead of accessing global variables and is mostly relevant when you call @btime at the top level. The correct call to measure only the performance of the mul! (without the kronecker) is:

@btime mul!($(Block_Sz1), $(kronecker(sparse(I, m, m), Sz_sparse)), $(A_sparse))

$ performs interpolation, and in this case it means that the stuff inside $(...) is run first and does not count towards the runtime.
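A self-contained illustration of the difference (made-up toy benchmark):

using BenchmarkTools

x = rand(1000)
@btime sum(x)     # x is an untyped global: the timing includes access overhead
@btime sum($x)    # x is interpolated: only sum itself is measured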

The next part, about using mul! together with sparse matrices, is a bit confused I think. mul! is generic and will generally use an optimized implementation (just like * or any other Julia function, really). However, you use Kronecker.jl and I have no idea how that will work with mul! (by that I mean it will likely work, but it could be very slow). SparseArrays.jl generally respects mul!, so if you use kron instead of kronecker it would be fine.
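For instance, a minimal sketch with a materialized kron (toy sizes, made-up values):

using SparseArrays, LinearAlgebra

S = sparse(1.0I, 2, 2)
K = kron(S, S)                     # an ordinary SparseMatrixCSC
X = randn(4, 2)
Y = Matrix{Float64}(undef, 4, 2)
mul!(Y, K, X)                      # dispatches to the sparse-dense kernel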

Preallocation of a sparse matrix works by making an empty one with the right dimensions and then using sizehint! to reserve capacity for the expected number of stored entries:

Block_Sz1 = spzeros(2,2)
sizehint!(Block_Sz1, number_of_expected_entries)

However: all of that preallocation business does not make any sense in this example! There is no need for you to preallocate anything here. You perform a single multiplication, and whether you allocate the output manually or let Julia do it does not matter at all. In fact it’s probably slower when you do it manually. As such, I don’t see any reason to measure performance right now. Your code does not remotely do what it has to do in the end, so measuring anything here is just pointless! You need to tune the real-world code for somewhat realistic parameters. Everything else is just a mirage.

This is not a useful insight. Of course for small matrices a dense matrix multiplication will always beat sparse arrays. That’s why benchmarking things here is rather meaningless.
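To see this concretely, here is a toy comparison (assumed tiny sizes; exact timings will vary by machine):

using BenchmarkTools, LinearAlgebra, SparseArrays

Ad = randn(4, 4)                # tiny dense matrix
As = sparse(Ad)                 # same values in sparse storage
B = randn(4, 2)
@btime $Ad * $B                 # dense kernel: fast at this size
@btime $As * $B                 # sparse kernel: indexing overhead dominates here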

Generally, you are IMO way too focused on preallocating stuff for performance’s sake, but it is just too early to think about this. Implement a small functional part of your algorithm first and then try to tune it. There is no point in jumping through hoops if 1) you don’t know whether it’s worth it (because it just isn’t the bottleneck) and 2) you have nothing to compare against. See, not all code needs to be fast. In most numerical programs, execution spends roughly 90% of the time in a couple of lines of code. So the rest of the program can be written as simply as possible without hurting the performance of the full thing. Having a simple implementation also gives you a reference to check against, in case your optimizations introduce additional bugs.

Well, no. You don’t need to preallocate to help the compiler; you need to preallocate if you use functions that work with preallocated space, such as mul!. You can perfectly well just use normal multiplication * and not preallocate anything.

Stop worrying about performance right now and focus on writing correct code. Things like this can be optimized later much, much more effectively.

This will very likely not work due to the data structure that a sparse matrix is. I don’t think views are effective on sparse matrices (at least on SparseMatrixCSC). But we can discuss this once you have working code that you want to tune.

2 Likes

I’ll add one thing to Adrian’s excellent answer: you didn’t say exactly how you got the error

LoadError: InexactError: Int64(-0.5) (...)

but I note you are defining

    Sz = [[0.5, 0] [0,-0.5]]
    A = [[0,1,0,0] [0,0,3,1]]
    Identity = [[1,0] [0,1]] 

which creates Sz as a Matrix{Float64} but will make A and Identity Matrix{Int64}: the array constructor [] will automatically try to infer the element type of the array from its elements, and if all elements are integers you’ll get an integer array, which can’t store floats. See:

julia> using LinearAlgebra

julia> A = [1.5 2.0; 3.0 4.0]; B = [1.0 1.0; 1.0 1.0]; Y = [0 0; 0 0];

julia> typeof.((A, B, Y))
(Matrix{Float64}, Matrix{Float64}, Matrix{Int64})

julia> mul!(Y, A, B)
ERROR: InexactError: Int64(3.5)

This is fairly basic stuff, so as Adrian says, it seems to me you’re trying to take the third step (complicated micro-optimisations) before the first (writing very basic correct Julia code). You’ve said you’re new to the language, so I think it’d really pay off to spend a bit of time learning the basics before worrying about shaving microseconds off a specific matrix product.

3 Likes

Sorry for the delay, but I had to track down a ghost issue I couldn’t find, which led to really unstable results.
Instead, I simplified the program and, even if the results are not really reliable, I don’t expect much from a dummy example where I even collect the matrix for a full diagonalization, just for the sake of simplicity.
The results are close enough under certain conditions to make me think I didn’t introduce any mistake. Under other conditions they are really off the mark, but I don’t really mind here.
This is the program; however, it’s really slow.

using SparseArrays
using LinearAlgebra
using Kronecker
using BenchmarkTools
using MKL
using MKLSparse

function NRG(Delta::Real, max_states::Integer, NIter::Integer, Energy::Vector, Energy_2::Vector, Energy_bond::Vector)

    Sz = sparse(Float64[[0.5, 0] [0, -0.5]])
    Sp = sparse(Float64[[0, 0] [1.0, 0]])
    Sm = sparse(Float64[[0, 1.0] [0, 0]])
    Ham = sparse(zeros(2,2))

    local Block_Sp = Sp
    local Block_Sm = Sm
    local Block_Sz = Sz
    local Block_Ham = Ham
    local Block_Sp1
    local Block_Sm1
    local Block_Sz1

    # threshold::Int32 = floor(log(max_states)/log(2)) #2 == initial matrix size #true if constraint is on lattice syze
    # threshold::Int32 = floor(sqrt(max_states)) -1 #(x+1)^2 <= max_states where x is iteration index. Iteration 1 starts with syze=2 and creates syze=4
    local m = 2 #hilbert size
    local lattice_size

    for i = 1:NIter
        lattice_size = 2^i #spatial row_size

        #Hamiltonian constructor
        H_super = kroneckersum(Block_Ham, Block_Ham) .- Delta*kronecker(Block_Sz, Block_Sz) .+ 0.5*(kronecker(Block_Sp, Block_Sm) .+ kronecker(Block_Sm, Block_Sp))
        H_dense_super = collect(H_super) #I make a dense matrix as this is a dummy example with full diagonalization.

        #Diagonalization part
        Energy[i] = eigvals(H_dense_super)[1] #ascending order
        Energy_2[i] = Energy[i]/lattice_size
        if i> 1
            Energy_bond[i] = (Energy[i] - Energy[i-1]) / (lattice_size/2)
        end
        m_next = m*m # hilbert space row_size after kronecker products (mul!)
        trunc_index = min(m_next,max_states)
        trunc_matrx = eigvecs(H_dense_super)[:,1:trunc_index]

        #Preallocation for mul!
        Block_Ham1 = Array{Float64}(undef, m_next, trunc_index)
        Block_Sz1 = Array{Float64}(undef, m_next, trunc_index)
        Block_Sp1 = Array{Float64}(undef, m_next, trunc_index)
        Block_Sm1 = Array{Float64}(undef, m_next, trunc_index)

        #mul!
        mul!(Block_Ham1, H_super, trunc_matrx)
        mul!(Block_Sz1, kronecker(sparse(I, m, m), Block_Sz), trunc_matrx)
        mul!(Block_Sp1, kronecker(sparse(I, m, m), Block_Sp), trunc_matrx)
        mul!(Block_Sm1, kronecker(sparse(I, m, m), Block_Sm), trunc_matrx)

        #Preallocation for mul!
        Block_Ham = Array{Float64}(undef, trunc_index, trunc_index)
        Block_Sz = Array{Float64}(undef, trunc_index, trunc_index)
        Block_Sp = Array{Float64}(undef, trunc_index, trunc_index)
        Block_Sm = Array{Float64}(undef, trunc_index, trunc_index)

        #mul!
        mul!(Block_Ham, trunc_matrx', Block_Ham1)
        mul!(Block_Sz, trunc_matrx', Block_Sz1)
        mul!(Block_Sp, trunc_matrx', Block_Sp1)
        mul!(Block_Sm, trunc_matrx', Block_Sm1)

        m = trunc_index # hilbert space row_size before next kronecker products
    end        
end

#Definition of starting parameters
Delta::Float64 = 0.5
max_states::Int16 = 5
NIter::Int32 = 5 #Final lattice size: 2*NIter + 2

#output vectors
Energy = Vector{Float64}(undef, NIter)
Energy_2 = Vector{Float64}(undef, NIter)
Energy_bond = Vector{Float64}(undef, NIter)
Energy_bond[1] = 0 #no bond

NRG(Delta, max_states, NIter, Energy, Energy_2, Energy_bond)
# display(Energy)
# display(Energy_2)
# display(Energy_bond)

I have a Matlab example to use as a benchmark. They are purposely written similarly, to benchmark one against the other:

clear all

Delta = 0.5;   % ZZ anisotropy
m=10;          % Number of states kept m
NIter = 10;    % Number of iterations.  Final lattice size is 2*Niter + 2


%               Initialize local operators
I= eye(2);
Sz = [1/2 0 ; 0 -1/2];
Sp = [0 0 ; 1 0];
Sm = [0 1 ; 0 0];

% Initial blocks  (we assume reflection symmetry)
BlockSz = Sz;
BlockSp = Sp;
BlockSm = Sm;
BlockI  = I;
BlockH  = zeros(2);
Energy = 0.;

%%  Numerical Renormalization Group (NRG)

for l = 1:NIter
    
    SystSize = 2^l;
    
%                   HAMILTONIAN MATRIX for superblock
    H_super = kron(BlockH, BlockI) + kron(BlockI, BlockH) ...
        - Delta * kron(BlockSz, BlockSz) ...
        + 0.5 * ( kron(BlockSp, BlockSm) + kron(BlockSm, BlockSp) );

    H_super = 0.5 * (H_super + H_super');  % ensure H is symmetric

    LastEnergy = Energy;
    
%                   Diagonalizing the Hamiltonian
    [Evect Spectrum] = eig(H_super);
    [Spectrum Index] = sort(diag(Spectrum), 'ascend');  % sort ascending
    Evect = Evect(:,Index);
    
    Energy = Spectrum(1);
    Ener2  = Energy / SystSize;
    EnergyPerBond = (Energy - LastEnergy) / (SystSize/2);
    
%                   Construct the truncation operator
    NKeep = min(size(Spectrum, 1), m);
    Omatr = Evect(:, 1:NKeep);

    fprintf('%d\t%f\t%f\t%f\n',SystSize, Energy, EnergyPerBond, Ener2);

    BlockSz = kron(BlockI,BlockSz);
    BlockSp = kron(BlockI,BlockSp);
    BlockSm = kron(BlockI,BlockSm);
    BlockI = kron(BlockI,BlockI);
    
%  Transform the block operators into the truncated basis
    BlockH  = Omatr' * H_super * Omatr;
    BlockSz = Omatr' * BlockSz * Omatr;
    BlockSp = Omatr' * BlockSp * Omatr;
    BlockSm = Omatr' * BlockSm * Omatr;
    BlockI =  Omatr' * BlockI  * Omatr;

end

I think it’s now time to delve into optimization, as the program spends an unexpected amount of time.
The Julia program runs in 7 seconds, measured with @time.
I run the Matlab program on Matlab Mobile (I hate Windows, so I use it through the phone) and it takes less than 1 second.

Noteworthy remarks:

  1. H_super is symmetric, and that could be something Matlab detected.
  2. Matlab isn’t necessarily doing a full diagonalization of a dense matrix.
  3. The bottleneck could be collecting the sparse matrix into a dense matrix.

If you wonder where the part is where I switch between odd and even numbers: it’s in the working part of the (non-working) full program mentioned at the beginning. I will fix it another time. To sum it up: instead of multiplying by trunc_matrx each time, I need to do it only when m_next > max_states.

I don’t have much time right now, but I noticed one pretty big difference: the Julia code performs the diagonalization 2 times, once when you call eigvals and once when you use eigvecs. Better to use eigen (or, in this case, it’s safe to use eigen! to save some allocations), which gives you both eigenvectors and eigenvalues.

evalues, evectors = eigen!(H_dense_super)
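To see that nothing is lost, here is a self-contained toy check (made-up 2×2 symmetric matrix):

using LinearAlgebra

H = [2.0 1.0; 1.0 2.0]                   # toy symmetric matrix
vals = eigvals(H); vecs = eigvecs(H)     # two separate factorizations
vals2, vecs2 = eigen(H)                  # one factorization gives both
@assert vals ≈ vals2 && vecs ≈ vecs2     # same results, half the work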

Another good time-save would be to reuse your preallocated matrices better. Right now you preallocate 4 matrices of the same size, but you could just reorder the operations and get away with a single preallocated matrix.

        # Preallocation for mul!
        temp = Array{Float64}(undef, m_next, trunc_index)

        #mul!
        mul!(temp, H_super, trunc_matrx)
        Block_Ham = trunc_matrx' * temp

        mul!(temp, kronecker(sparse(I, m, m), Block_Sz), trunc_matrx)
        Block_Sz = trunc_matrx' * temp

        mul!(temp, kronecker(sparse(I, m, m), Block_Sp), trunc_matrx)
        Block_Sp = trunc_matrx' * temp

        mul!(temp, kronecker(sparse(I, m, m), Block_Sm), trunc_matrx)
        Block_Sm = trunc_matrx' * temp

No point in preallocating the Block_* just to multiply into them once.

You could also try not materializing the identity inside the kronecker. Maybe Kronecker.jl has an optimization when it knows that one factor is the identity. So try replacing

kronecker(sparse(I, m, m), Block_Sm)

with

kronecker(I(m), Block_Sm)

Another comment: For a full diagonalization you need a dense matrix. So there is no way of keeping it sparse if you do a full diagonalization.

2 Likes

Thanks a lot. I needed the sparse syntax because, after enjoying Julia a bit, I will try a simple DMRG, where I would use Lanczos algorithms.

When I run your last posted Julia version using @time in front of the NRG call, I get about 6 seconds, similar to you. However, if you look at the output of @time more closely, you’ll see that the vast majority of the elapsed time is due to JIT compilation:

6.039726 seconds (11.50 M allocations: 767.095 MiB, 4.07% gc time, 99.90% compilation time)

Note that 99.9% of the time is spent in compilation. If I call it once more using @time I get

julia> @time NRG(Delta, max_states, NIter, Energy, Energy_2, Energy_bond)
  0.001204 seconds (470 allocations: 492.078 KiB)

About a millisecond. This is the actual execution time of the algorithm.
Most of the time when benchmarking it doesn’t make sense to include compilation time, especially when comparing different algorithms against one another. I’ve seen you use @btime from BenchmarkTools previously–it always excludes compilation time. It’s also more reliable for testing fast-running functions since it performs many repetitions of the call and returns the minimum time.

2 Likes

Just another question: why did you use mul! only half the time?

This question is backwards :slight_smile: The question is not “why not use mul!” but rather “why use mul!”. mul! is just a performance optimization of * for the case where you can reuse some memory. If you can’t reuse memory, then mul! has no advantage whatsoever. A normal multiplication just allocates an appropriate output and then uses mul! to fill it. There is no inherent performance benefit to mul! over *.

So you preallocated 4 temporary matrices, filled each once, then preallocated 4 output matrices, filled them, and threw the 4 temporary matrices away. I preallocated 1 temporary matrix and used it 4 times to store the intermediate results. I did not explicitly preallocate anything for the output because * does that for me.
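A minimal sketch of that relationship (toy matrices; the payoff of mul! only appears when the output is reused):

using LinearAlgebra

A = randn(4, 4); B = randn(4, 4)

C = A * B                          # allocates a fresh output every call

C = Matrix{Float64}(undef, 4, 4)   # allocate once ...
for _ in 1:10
    mul!(C, A, B)                  # ... and reuse it: no allocation per iteration
end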

Another comment: a more sophisticated code could split the loop into two parts. The first part loops until m equals max_states, and then the second loop performs the rest of the iterations. This has the advantage that you can preallocate everything for the second loop, because the matrix sizes don’t change anymore.
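A rough structural sketch of that idea (assumed names; the real mul! calls are elided):

function nrg_two_phase(max_states::Int, NIter::Int)
    m = 2
    i = 1
    while i <= NIter && m * m <= max_states   # phase 1: blocks still growing
        m = m * m                             # no truncation happens yet
        i += 1
    end
    # phase 2: from here on every iteration truncates back to max_states,
    # so the buffer sizes are fixed and can be allocated once
    temp = Matrix{Float64}(undef, max_states^2, max_states)  # hypothetical fixed-size work buffer
    for _ in i:NIter
        m = max_states
        # fixed-size mul! calls writing into `temp` would go here
    end
    return m
end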

2 Likes

Yes, for the last comment you are right, even if there is a grey zone of 1 iteration between max_states and trunc_index, which has yet another size. However, I ran into some issues implementing that and, given that I was trying to optimize my understanding of Julia (and I think I succeeded, even if I didn’t use Tullio.jl or Octavian.jl) rather than the algorithm, I just left it there. It would have required the even/odd distinction and, depending on max_states, the last iteration could have stuff_1 or just stuff_. Thus I would need 2 loops like these: one with the multiplication, one if trunc_index is odd and another if it is even. That would have increased the reading complexity, but I would not have gained any more insight from it.

I didn’t understand the difference between mul! and the normal * operation well, so thanks a lot!