Hi all,
I have a large sparse matrix (dimension of order 10^5) which I have to multiply with a matrix of vectors many times. Such a matrix of vectors consists of, say, 500 vectors, each with dimension equal to that of the sparse matrix mentioned above. I am looking for the best way to parallelize such an operation.
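For concreteness, a minimal setup along the lines described above could look as follows (the sparsity and the random Hermitian fill are placeholders I chose for illustration; the variable names match the snippets below):

using LinearAlgebra, SparseArrays

n = 10^5        # dimension of the sparse matrix (order 10^5)
nvec = 500      # number of vectors in the block
density = 1e-4  # assumed sparsity, purely for illustration

# a random Hermitian sparse matrix standing in for the actual Hamiltonian
hamiltonian_matrix = sprand(ComplexF64, n, n, density)
hamiltonian_matrix = (hamiltonian_matrix + hamiltonian_matrix') / 2

search_vectors_list = randn(ComplexF64, n, nvec)
u_vectors = similar(search_vectors_list)
w_vectors = similar(search_vectors_list)
provisional_vector = similar(search_vectors_list)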
I tried a couple of strategies, but none of them looks very efficient or scalable. I will now illustrate them with a heavily simplified example that still shows the problems I am encountering:
- using mul!() on matrix-matrix multiplication.
Here is a code example:
function filtering_step!(search_vectors_list, u_vectors, w_vectors, provisional_vector, hamiltonian_matrix)
    # u = H * s
    mul!(u_vectors, hamiltonian_matrix, search_vectors_list)
    # preload the scratch matrix with s, then overwrite it in place with 2*H*u - s
    provisional_vector .= copy(search_vectors_list)
    mul!(provisional_vector, hamiltonian_matrix, u_vectors, 2.0, -1.0)
    # w = 2*H*u - s
    w_vectors .= copy(provisional_vector)
end
In this code I multiply the matrix of vectors search_vectors_list on the left by the large sparse matrix hamiltonian_matrix and store the result in the matrix of vectors u_vectors. After that, I want to compute the operation w_vectors = 2 * hamiltonian_matrix * u_vectors - search_vectors_list, and I do that by means of mul!() and the provisional_vector matrix, which has the same dimensions as w_vectors and u_vectors.
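For reference, the five-argument form mul!(C, A, B, α, β) computes C .= α*A*B .+ β*C in place, so with α = 2, β = -1, and C preloaded with search_vectors_list it gives exactly the expression above. A tiny sanity check with toy matrices of my own, just to illustrate the identity:

using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
B = [5.0 6.0; 7.0 8.0]
C = [1.0 1.0; 1.0 1.0]
C0 = copy(C)

mul!(C, A, B, 2.0, -1.0)       # C .= 2*A*B .- C
C == 2 .* (A * B) .- C0        # true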
I have to say that the code works well: just 4 memory allocations, and it is fast. However, no parallelism is used. To parallelize the code, I am trying multithreading as follows.
- using mul!() on matrix-vector multiplication and multithreading
To this end, I rewrote the previous code as follows:
provisional_vector = zeros(ComplexF64, size(hamiltonian_matrix, 2), Threads.nthreads())

function filtering_step_new!(search_vectors_list, u_vectors, w_vectors, provisional_vector, hamiltonian_matrix)
    # :static scheduling pins each iteration to one thread, so indexing the scratch
    # buffer by threadid() is safe; @views avoids allocating column copies
    # (with a plain slice, mul! would write into a temporary and the result would be lost).
    @inbounds Threads.@threads :static for k in 1:size(search_vectors_list, 2)
        @views begin
            tmp = provisional_vector[:, Threads.threadid()]
            mul!(tmp, hamiltonian_matrix, search_vectors_list[:, k])     # tmp = H * s_k
            u_vectors[:, k] .= tmp
            tmp .= search_vectors_list[:, k]
            mul!(tmp, hamiltonian_matrix, u_vectors[:, k], 2.0, -1.0)    # tmp = 2*H*u_k - s_k
            w_vectors[:, k] .= tmp
        end
    end
end
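For reproducibility, timings like the ones below can be collected with BenchmarkTools; the call here is my sketch (the $-interpolation avoids benchmarking global-variable access), not necessarily the exact command that produced the numbers:

using BenchmarkTools

@btime filtering_step_new!($search_vectors_list, $u_vectors, $w_vectors,
                           $provisional_vector, $hamiltonian_matrix)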
However, in this case the performance does not scale well. Here are the timings I measured:
1 thread 211.195 ms
2 threads 133.692 ms
3 threads 105.833 ms
4 threads 90.927 ms
5 threads 76.597 ms
6 threads 72.710 ms
As you can see, the speed-up is not great (about 2.9x with 6 threads). Considering that I have to move this code to a server (not a cluster) with around 60 cores, I am wondering how I should proceed to improve the performance.
Thanks.