Speeding up array multiplication

Dear Community members:
I need advice for speeding up the following operations.

I have three arrays,

Fockmatrix, fcmatrix, densitymatrix

densitymatrix and fcmatrix are already given, and they are passed as arguments to the function. None of these arrays are sparse. Within this function, I need to use densitymatrix and fcmatrix to construct the Fockmatrix. Fockmatrix is a 3-D array, with size (a,a,b), fcmatrix is an array of size (a,a,b,b), densitymatrix is an array of size (a,a,b).

Here, a is typically 50-100 and b is typically a few thousand.

The relation between these three arrays is that for any ja and jb

Fockmatrix[ja,jb,:]=fcmatrix[ja,jb,:,:]*densitymatrix[ja,jb,:]

There are special properties for these arrays: fcmatrix is a real Float64 array, and it is symmetric under exchange of its first two indices or of its last two indices. Also,

densitymatrix[:,:,i] is a Hermitian matrix,

so it is guaranteed that
Fockmatrix[:,:,i] is a Hermitian matrix.
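
(Sketch of why this follows: for each k, conj(Fockmatrix[jb,ja,k]) = sum_l fcmatrix[jb,ja,k,l]*conj(densitymatrix[jb,ja,l]) = sum_l fcmatrix[ja,jb,k,l]*densitymatrix[ja,jb,l] = Fockmatrix[ja,jb,k], using that fcmatrix is real and symmetric in its first two indices and that each densitymatrix[:,:,l] slice is Hermitian.)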

Below is how I am currently doing this construction.


function test_func(density_matrix::Array{ComplexF64}, fcmatrix::Array{Float64})
    a = 50
    b = 1000
    Fock_matrix = zeros(ComplexF64, a, a, b)

    # strict lower triangle in the first two indices
    Threads.@threads for ja in 1:a
        for jb in 1:ja-1
            Fock_matrix[ja, jb, :] += fcmatrix[ja, jb, :, :] * density_matrix[ja, jb, :]
        end
    end

    # fill the upper triangle using the Hermitian symmetry
    Threads.@threads for ja in 1:b
        Fock_matrix[:, :, ja] += Fock_matrix[:, :, ja]'
    end

    # diagonal terms
    Threads.@threads for ja in 1:a
        Fock_matrix[ja, ja, :] += fcmatrix[ja, ja, :, :] * density_matrix[ja, ja, :]
    end

    return Fock_matrix
end

I am running this code on a high-performance server, so multiple cores are available, which is why I am using Threads.@threads here. I am looking for advice on how to make this construction faster. Any improvements are highly appreciated!

Thanks

Hi!
To get the full benefits of multithreading, the first thing you’ll need to do is optimize your sequential code, in particular with respect to memory allocations. For instance, every time you do x[i, :], it allocates a copy of the row or column in question. In general, you’ll want to perform as many operations as possible “in-place”. Here’s an example of how that might look, using built-in LinearAlgebra functions:

using LinearAlgebra

function test_func(density_matrix::Array{ComplexF64}, fcmatrix::Array{Float64})
    a, _, b = size(density_matrix)
    Fock_matrix = zeros(ComplexF64, a, a, b)

    for ja in 1:a
        for jb in 1:(ja - 1)
            Fock_matrix[ja, jb, :] += fcmatrix[ja, jb, :, :] * density_matrix[ja, jb, :]
        end
    end
    for ja in 1:b
        Fock_matrix[:, :, ja] += Fock_matrix[:, :, ja]'
    end

    for ja in 1:a
        Fock_matrix[ja, ja, :] += fcmatrix[ja, ja, :, :] * density_matrix[ja, ja, :]
    end
end

function test_func_fast!(
    Fock_matrix::Array{ComplexF64,3},
    density_matrix::Array{ComplexF64,3},
    fcmatrix::Array{Float64,4},
)
    a, _, b = size(density_matrix)

    for ja in 1:a
        for jb in 1:(ja - 1)
            mul!(
                @view(Fock_matrix[ja, jb, :]),
                @view(fcmatrix[ja, jb, :, :]),
                @view(density_matrix[ja, jb, :]),
                1,
                1,
            )
        end
    end
    for ja in 1:b
        axpy!(1, @view(Fock_matrix[:, :, ja])', @view(Fock_matrix[:, :, ja]))
    end

    for ja in 1:a
        mul!(
            @view(Fock_matrix[ja, ja, :]),
            @view(fcmatrix[ja, ja, :, :]),
            @view(density_matrix[ja, ja, :]),
            1,
            1,
        )
    end
end

Allocations are much smaller and the overall function is a bit faster, but it doesn’t seem to speed up the multithreaded version in my tests.

julia> using Chairmarks

julia> a, b = 20, 200;

julia> fcmatrix = rand(a, a, b, b);

julia> density_matrix = rand(ComplexF64, a, a, b);

julia> Fock_matrix = zeros(ComplexF64, a, a, b);

julia> @be test_func($density_matrix, $fcmatrix) seconds = 1
Benchmark: 25 samples with 1 evaluation
 min    37.738 ms (5034 allocs: 71.654 MiB, 4.30% gc time)
 median 39.248 ms (5034 allocs: 71.654 MiB, 6.77% gc time)
 mean   40.699 ms (5040.40 allocs: 71.655 MiB, 8.50% gc time)
 max    80.524 ms (5194 allocs: 71.664 MiB, 53.13% gc time)

julia> @be test_func_fast!($Fock_matrix, $density_matrix, $fcmatrix) seconds = 1
Benchmark: 31 samples with 1 evaluation
 min    32.304 ms (894 allocs: 51.438 KiB)
 median 32.766 ms (894 allocs: 51.438 KiB)
 mean   32.760 ms (894 allocs: 51.438 KiB)
 max    33.170 ms (894 allocs: 51.438 KiB)

My next angle of attack would be rethinking the array storage. Julia arrays are stored in a column-major format, which roughly means that when you increment only the first index, you get coefficients that are next to each other in memory, but when you increment the last index, you get coefficients that are far apart. If you know that you’ll perform operations on slices like fcmatrix[ja, jb, :, :], it may be worth reordering the axes so that they look like fcmatrix[:, :, ja, jb] instead.
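
For instance, a minimal sketch of that reordering, done once up front (names are just illustrative; for the full-size problem this is itself a large copy, so ideally you would build the arrays in this layout from the start):

fcmatrix_perm = permutedims(fcmatrix, (3, 4, 1, 2))           # now size (b, b, a, a)
density_matrix_perm = permutedims(density_matrix, (3, 1, 2))  # now size (b, a, a)
# fcmatrix_perm[:, :, ja, jb] and density_matrix_perm[:, ja, jb] are now contiguous in memory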

Finally, you may want to look into libraries like Tullio.jl to express your computations in a simpler, more parallelizable way?
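
As a minimal (untested) sketch, the core contraction could look something like this with Tullio.jl, which can generate a threaded loop nest for you (note it does not exploit the Hermitian symmetry):

using Tullio

@tullio Fock_matrix[ja, jb, k] := fcmatrix[ja, jb, k, l] * density_matrix[ja, jb, l]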

7 Likes

This code is calling into a linear algebra ("BLAS") library, which is an external library written in a different language and which is itself heavily multithreaded. When you try to use Julia's own multithreading on top of that, it doesn't work particularly well.
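
If you do want to combine Threads.@threads with many small BLAS calls (mul!, matrix *), one common workaround is to pin BLAS to a single thread so the two schedulers don't compete, for example:

using LinearAlgebra

BLAS.set_num_threads(1)  # let Threads.@threads supply the parallelism instead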

3 Likes

Each entry of fcmatrix is used exactly once, i.e. the arithmetic density (number of floating point ops divided by number of memory loads) is O(1).

This means that you’re memory bandwidth bound, and multithreading won’t significantly help on most architectures (on most systems, 2 cores are enough to saturate the main memory bandwidth).

There are two approaches:

  1. Optimize for memory bandwidth / access patterns, use symmetry, etc. This will only ever give you a small constant factor, though, and leave your CPU mostly idle (while the memory bus runs as hot as it can)
  2. Recognize that this tells you that you’re asking a stupid question. Fuse with other operations you’re doing anyway!

For example, consider code like

c .+= a
c .+= b

This code with its bad arithmetic density tells you that “add one array to another” is just a bad abstraction / idea. Don’t do it.

Instead do c .+= a .+ b, and then fuse with further operations. Matlab-style “vectorization” is actively harmful in non-interpreted languages.

PS. The operations you need to fuse with are other operations that read/write fcmatrix. So e.g. the code that initially creates fcmatrix should be able to do your thing at close to zero extra cost. Or maybe you want to do this op with the same fcmatrix and different densitymatrix? Then do them together, for close to zero extra cost.
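
As a hypothetical sketch of the second idea (the helper name is made up): if you have several density matrices to process against the same fcmatrix, collect their (ja, jb) slices as columns and do one matrix-matrix product per block, so each b×b block of fcmatrix is only read from memory once:

using LinearAlgebra

# F_block:       b × n_dm output (one column per density matrix)
# fc_block:      the b × b slice fcmatrix[ja, jb, :, :]
# density_block: b × n_dm, column i is density matrix i sliced at [ja, jb, :]
function fock_block!(F_block::AbstractMatrix, fc_block::AbstractMatrix, density_block::AbstractMatrix)
    return mul!(F_block, fc_block, density_block)
end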

4 Likes

Note that even at the lower edge of the sizes you mentioned, fcmatrix (a 50x50x1000x1000 array of Float64) is 20 GB, so you're already pushing the limits of most systems just to store it in RAM.
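
(Quick check in the REPL:)

julia> 50 * 50 * 1000 * 1000 * sizeof(Float64) / 1e9  # fcmatrix at a = 50, b = 1000, in GB
20.0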

As suggested above, you should avoid allocations with view and mul!, reorder your data to be more cache-friendly, and combine your three separate loops into one. Doing this, I get:

using LinearAlgebra

function test_func_reorder(density_matrix::Array{ComplexF64},fcmatrix::Array{Float64})

    b, _, a = size(density_matrix)
    Fock_matrix = Array{ComplexF64}(undef, b, a, a)
    for ja in 1:a
        for jb in 1:ja-1
            # normally, best practice would insist on looping ja on the inside because it's the "earlier" dimension
            # but here we access both sides so it makes no difference
            @views mul!(Fock_matrix[:, ja, jb], fcmatrix[:, :, ja, jb], density_matrix[:, ja, jb])
            @views Fock_matrix[:, jb, ja] .= conj.(Fock_matrix[:, ja, jb])
        end
        @views mul!(Fock_matrix[:, ja, ja], fcmatrix[:, :, ja, ja], density_matrix[:, ja, ja])
    end
    return Fock_matrix
end

a = 30; b = 200; # smaller test problem
dm = randn(ComplexF64, a, a, b);
fc = randn(Float64, a, a, b, b);
dm_reorder = permutedims(dm, (3, 1, 2)); # if you reorganized dm
fc_reorder = permutedims(fc, (3, 4, 1, 2)); # if you reorganized fc
julia> using BenchmarkTools

julia> fm = @btime test_func($dm, $fc);
  41.692 ms (8874 allocations: 158.77 MiB)

julia> fm_reorder = @btime test_func_reorder($dm_reorder, $fc_reorder);
  8.186 ms (3 allocations: 2.75 MiB)

julia> fm == permutedims(fm_reorder, (2, 3, 1)) # unshuffle and compare
true

I didn’t use threading yet in test_func_reorder, so you might get a little more benefit out of that. But as others have said, this is a memory-constrained calculation, so you're unlikely to see massive speedups from threading (and also */mul! are already threaded, so you'd want to disable that by using LinearAlgebra.BLAS.set_num_threads(1)). You could expect a roughly 2x speedup (and decrease in memory) by using Float32/ComplexF32 instead of the 64-bit versions, if you don't need the extra precision.
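
A minimal sketch of that conversion, reusing the reordered test arrays from above (note the ::Array{ComplexF64}/::Array{Float64} annotations in the functions above would need to be relaxed, e.g. to Array{<:Complex}/Array{<:Real}, for these to dispatch):

dm32 = ComplexF32.(dm_reorder)
fc32 = Float32.(fc_reorder)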

It would be more “Julian” if you thought about how you could further reorganize your data. For example, it seems that fcmatrix is actually an array of matrices and density_matrix is an array of vectors. It might be cleaner if you stored them that way (instead of the NumPy/MATLAB/etc way of using high-dimensional arrays), although the performance would be similar.
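
For example, a sketch of that reorganization using the reordered test arrays from above (names are illustrative):

fc_nested = [fc_reorder[:, :, ja, jb] for ja in 1:a, jb in 1:a]  # a×a matrix of b×b blocks
dm_nested = [dm_reorder[:, ja, jb] for ja in 1:a, jb in 1:a]     # a×a matrix of length-b vectors
# the core update for ja >= jb then reads (fock_nested being a hypothetical a×a matrix of length-b output vectors):
# mul!(fock_nested[ja, jb], fc_nested[ja, jb], dm_nested[ja, jb])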

1 Like

Dear Community
Thanks for all the wonderful suggestions. In the end I did two things: one is to adopt in-place multiplication with mul!. The other is to reduce the size of fcmatrix. As you noted, for fcmatrix[a,b,c,d] I only use the part with a>=b, so I changed fcmatrix to a matrix of matrices, fcmatrix[a,b][c,d], and when I construct it I leave all the entries with a<b undefined.

These two modifications sped up my code by roughly 4x, so I am guessing that the main bottleneck is indeed memory.

I would be grateful for any further insights on this.

3 Likes

Can you post the updated version of your code?

Hi! Here is the updated version, which is the code I actually use. My code calls this function Construct_projector:

function Construct_projector(k_set::Vector{Vector{Float64}}, ϵr::Float64,
    density_matrix::Array{ComplexF64}, single_matrix::Array{ComplexF64},
    energy_input::Float64, BG_density_matrix::Array{ComplexF64}, fcmatrix::Array{Matrix{Float64}}, Area::Float64, dimension::Int,
    target_density::Float64, temp::Float64, itcount::Int)

    Fock_matrix = zeros(ComplexF64, dimension, dimension, length(k_set))

    # strict lower triangle in the first two indices
    Threads.@threads for ja in 1:dimension
        for jb in 1:(ja - 1)
            mul!(
                @view(Fock_matrix[ja, jb, :]),
                fcmatrix[ja, jb],
                @view(density_matrix[ja, jb, :]),
                1,
                1,
            )
        end
    end

    # fill the upper triangle using the Hermitian symmetry
    Threads.@threads for ja in eachindex(k_set)
        axpy!(1, @view(Fock_matrix[:, :, ja])', @view(Fock_matrix[:, :, ja]))
    end

    # diagonal terms
    Threads.@threads for ja in 1:dimension
        mul!(
            @view(Fock_matrix[ja, ja, :]),
            fcmatrix[ja, ja],
            @view(density_matrix[ja, ja, :]),
            1,
            1,
        )
    end

    # SOME MORE OPERATIONS...
end

As you can see, this is a rather large function that takes a lot of arguments. I am only showing the part that takes most of the time in this function; the remaining operations take at most 2 to 3 seconds.

I recently ran this code on computing nodes requesting 8 CPUs, with dimension exactly 80 and k_set of length 961, and each call of this function takes around 17-20 seconds.

fcmatrix is now a matrix of matrices instead of a 4-D array. It is only defined at fcmatrix[ja,jb] for ja>=jb; all other entries are undef. It is constructed once and never changed. However, the density_matrix changes iteratively (it is actually produced as an output of this Construct_projector function), so I need to call Construct_projector many times.
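
For reference, the storage pattern looks roughly like this (illustration only; the actual block values come from the problem-specific construction, which is not shown):

fcmatrix = Matrix{Matrix{Float64}}(undef, dimension, dimension)
for ja in 1:dimension, jb in 1:ja
    fcmatrix[ja, jb] = Matrix{Float64}(undef, length(k_set), length(k_set))
    # ... fill fcmatrix[ja, jb] ...   (entries with ja < jb stay undefined)
end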

The overall structure of my code is:


function main(initial_density_matrix, ...)  # plus many other arguments

    fcmatrix = Matrix{Matrix{Float64}}(undef, dimension, dimension)

    # CODE THAT CONSTRUCTS fcmatrix

    input_density_matrix = initial_density_matrix
    while condition
        output_density_matrix = Construct_projector(input_density_matrix, ...)  # plus other arguments
        input_density_matrix = output_density_matrix
    end

    return input_density_matrix
end