Hi all,

I have a large sparse matrix (dimension of order 10^5) which I have to multiply with a matrix of vectors many times. such a matrix of vectors is made out of, say, 500 vectors each of dimension equal to the dimension of the sparse matrix mentioned above. I am looking for the best way to parallelize suzh an operation.

I tried a couple of strategies but none of them looks very efficient and scalable. I go now to illustrate them in a super simplified example which still shows the problems I am encountering:

- using mul!() on matrix-matrix multiplication.

here is a code example:

```
function filtering_step!(search_vectors_list, u_vectors, w_vectors, provisional_vector, hamiltonian_matrix)
mul!(u_vectors, hamiltonian_matrix, search_vectors_list)
provisional_vector .= copy(search_vectors_list)
mul!(provisional_vector, hamiltonian_matrix, u_vectors, 2.0, -1.0)
w_vectors .= copy(provisional_vector)
end
```

in this code I multiply the matrix of vectors `search_vectors_list`

on the left with the large sparse matrix `hamiltonian_matrix`

and I store the result on the matrix of vectors called `u_vectors`

. after that, I would like to compute the following operation

```
w_vectors = 2 * hamiltonian_matrix * u_vectors - search_vectors_list
```

and i do to that by means of mul!() and by using the `provisional_vector`

matrix which has the same dimension of `w_vectors`

and `u_vectors`

. I have to say that the code works well: just 4 memory allocations and fast. However, no parallelism is used. To parallelize the code I am trying with multithreading as follows

- using mul!() on matrix-vector multiplication and multithreading

to this purpose, I rewrite the previous code as follows:

```
provisional_vector = zeros(ComplexF64, size(hamiltonian_matrix, 2), Threads.nthreads())
function filtering_step_new!(search_vectors_list, u_vectors, w_vectors, provisional_vector, hamiltonian_matrix)
@inbounds Threads.@threads for k in 1:size(search_vectors_list, 2)
mul!(provisional_vector[:, Threads.threadid()], hamiltonian_matrix, search_vectors_list[:, k])
u_vectors[:, k] .= provisional_vector[:, Threads.threadid()]
provisional_vector[:, Threads.threadid()] .= search_vectors_list[:, k]
mul!(provisional_vector[:, Threads.threadid()], hamiltonian_matrix, u_vectors[:, k], 2.0, -1.0)
w_vectors[:, k] .= provisional_vector[:, Threads.threadid()]
end
end
```

However, in this case the performance does not scale well. Here are my performances

1 thread 211.195 ms

2 threads 133.692 ms

3 threads 105.833 ms

4 threads 90.927 ms

5 threads 76.597 ms

6 threads 72.710 ms

from which the advantage is not very nice. Considering that I have to move this code to a server (not a cluster) with around 60 cores, I am wondering how I should proceed to increase the performance.

Thanks.