# Best way to parallelize

Hi all,

I have a large sparse matrix (dimension of order 10^5) which I have to multiply many times with a matrix of vectors. Such a matrix of vectors is made of, say, 500 columns, each with dimension equal to that of the sparse matrix mentioned above. I am looking for the best way to parallelize such an operation.

I tried a couple of strategies, but none of them looks very efficient or scalable. I will illustrate them with a greatly simplified example that still shows the problems I am encountering:

1. Using `mul!()` for matrix-matrix multiplication

Here is a code example:

```
function filtering_step!(search_vectors_list, u_vectors, w_vectors, provisional_vector, hamiltonian_matrix)
    mul!(u_vectors, hamiltonian_matrix, search_vectors_list)
    provisional_vector .= copy(search_vectors_list)
    mul!(provisional_vector, hamiltonian_matrix, u_vectors, 2.0, -1.0)
    w_vectors .= copy(provisional_vector)
end
```

In this code I multiply the matrix of vectors `search_vectors_list` on the left by the large sparse matrix `hamiltonian_matrix` and store the result in the matrix of vectors `u_vectors`. After that, I would like to compute the following operation:

```
w_vectors = 2 * hamiltonian_matrix * u_vectors - search_vectors_list
```

and I do that by means of `mul!()`, using the matrix `provisional_vector`, which has the same dimensions as `w_vectors` and `u_vectors`. I have to say that the code works well: just 4 memory allocations, and fast. However, no parallelism is used. To parallelize the code I am trying multithreading, as follows.
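For clarity, the five-argument `mul!(C, A, B, α, β)` computes `C .= α * A * B .+ β * C` in place, so preloading `provisional_vector` with the search vectors and calling it with `α = 2.0, β = -1.0` gives the desired combination without temporaries. A minimal sketch with made-up sizes:

```julia
using LinearAlgebra, SparseArrays

H = sprand(ComplexF64, 10, 10, 0.3)  # stand-in for the sparse Hamiltonian
S = rand(ComplexF64, 10, 5)          # stand-in for the search vectors
U = H * S
P = copy(S)
mul!(P, H, U, 2.0, -1.0)             # in place: P now holds 2*H*U - S
```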

2. Using `mul!()` for matrix-vector multiplication, with multithreading

For this purpose, I rewrite the previous code as follows:

```
provisional_vector = zeros(ComplexF64, size(hamiltonian_matrix, 2), Threads.nthreads())

function filtering_step_new!(search_vectors_list, u_vectors, w_vectors, provisional_vector, hamiltonian_matrix)
    @inbounds Threads.@threads for k in 1:size(search_vectors_list, 2)
        id = Threads.threadid()
        mul!(provisional_vector[:, id], hamiltonian_matrix, search_vectors_list[:, k])
        u_vectors[:, k] .= provisional_vector[:, id]
        provisional_vector[:, id] .= search_vectors_list[:, k]
        mul!(provisional_vector[:, id], hamiltonian_matrix, u_vectors[:, k], 2.0, -1.0)
        w_vectors[:, k] .= provisional_vector[:, id]
    end
end
```

However, in this case the performance does not scale well. Here are my timings:

from which you can see that the advantage is not great. Considering that I have to move this code to a server (not a cluster) with around 60 cores, I am wondering how I should proceed to increase the performance.

Thanks.

For option 1, are you sure that no parallelism is used under the hood by BLAS? What happens when you launch Julia with more than 1 thread?

Thanks for your reply. I suspect it is not using any parallelism under the hood because the matrix is sparse. I tried to launch the code with 6 threads and, beyond not seeing any gain in performance, I also checked the number of Julia threads actively working with the `top` command, and just one thread was at work. Do you think I should do some other tests?
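For reference, this is how I check the two thread settings (as far as I understand, sparse-times-dense products in SparseArrays are plain Julia code and never call BLAS, so the BLAS thread count is irrelevant here):

```julia
using LinearAlgebra

# BLAS threads only affect dense matrix products; the SparseArrays
# sparse-dense multiply is single-threaded Julia code regardless.
println("BLAS threads:  ", BLAS.get_num_threads())
println("Julia threads: ", Threads.nthreads())
```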

This is a bit beside the point, but your MWE is really hard to parse with such long variable names. I know long names are popular for various reasons, but for a reader who does not know what the names mean anyway, they just obscure the code logic.

Just renaming the variables like this makes it much easier to see what is happening:

```
function filtering_step!(S, U, W, P, H)
    mul!(U, H, S)
    P .= copy(S)
    mul!(P, H, U, 2.0, -1.0)
    W .= copy(P)
end

function filtering_step_new!(S, U, W, P, H)
    @inbounds Threads.@threads for k in 1:size(S, 2)
        id = Threads.threadid()
        mul!(P[:, id], H, S[:, k])
        U[:, k] .= P[:, id]
        P[:, id] .= S[:, k]
        mul!(P[:, id], H, U[:, k], 2.0, -1.0)
        W[:, k] .= P[:, id]
    end
end
```

You may want to limit the number of batches to the number of threads and increase the workload of each batch. This can be done by hand, but I have the impression that FLoops.jl can handle that case nicely.
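A hand-rolled version of that batching, using only `Iterators.partition` from Base (FLoops.jl automates this kind of thing); the column count and per-column work here are placeholders:

```julia
using Base.Threads

n = 500                                          # hypothetical number of columns
batches = collect(Iterators.partition(1:n, cld(n, nthreads())))
results = zeros(Int, n)
@threads for batch in batches                    # one task per contiguous batch
    for k in batch
        results[k] = k                           # placeholder for the per-column work
    end
end
```

Each task then processes a whole contiguous batch, so the scheduling overhead is paid once per batch instead of once per column.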

This is a bit of a red flag, BTW, and has been much discussed lately. Putting `@inbounds` in front of a `k in 1:size` loop is quite risky, and can cause really bad bugs. Instead, use `for k in axes(S, 2)`.

> This is a bit beside the point, but your MWE is really hard to parse with such long variable names. I know long names are popular for various reasons, but for a reader who does not know what the names mean anyway, they just obscure the code logic.

Thanks! I will keep it in mind for my next questions. Since I work in VS Code, which has auto-completion, I like long names because they remind me of the meanings, but I take your point.

> This is a bit of a red flag, BTW, and has been much discussed lately. Putting `@inbounds` in front of a `k in 1:size` loop is quite risky, and can cause really bad bugs. Instead, use `for k in axes(S, 2)`.

Yes, I am aware of this. I can definitely change it at a later stage, but right now my most pressing point is understanding the performance issue.

> You may want to limit the number of batches to the number of threads and increase the workload on each batch. This can be done by hand, but I have the impression that FLoops.jl can handle that case nicely.

This sounds like an interesting suggestion. I will take a look at it.

When multithreading, it is important to limit unnecessary allocations.

This line, `P .= copy(S)`, creates a copy of `S` and then copies the contents of that copy into `P`. In other words, it copies twice. You can just write `P .= S` to copy once.
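The extra temporary is easy to see with `@allocated` (a small sketch; the exact byte count depends on the array size):

```julia
function copy_demo()
    S = rand(1000)
    P = similar(S)
    P .= copy(S); P .= S            # warm up the compiler first
    a1 = @allocated (P .= copy(S))  # allocates a temporary array
    a2 = @allocated (P .= S)        # copies element-wise, no allocation
    return a1, a2
end
a1, a2 = copy_demo()
```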

In the `filtering_step_new!` function, you probably want `@views` on the whole loop.

The right-hand side of each broadcast assignment, e.g. `P[:, id]` in `U[:, k] .= P[:, id]`, also creates a copy of the slice, which is unnecessary.

This one is worse: `mul!(P[:, id], H, S[:, k])`. It creates a copy for `S`, but it also creates a copy for `P`, which means you copy a column out of `P`, pass it to `mul!`, which modifies that copy, and then throw it away(!). The operation is not applied to `P` at all.
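A tiny example of that pitfall: slicing on the destination side silently hands `mul!` a copy, while `@views` makes it write through to the parent array:

```julia
using LinearAlgebra

H = [1.0 0.0; 0.0 1.0]
P = zeros(2, 2)
S = ones(2, 2)

mul!(P[:, 1], H, S[:, 1])         # writes into a copy of column 1; P is unchanged
@assert P[:, 1] == [0.0, 0.0]

@views mul!(P[:, 1], H, S[:, 1])  # writes through a view; P is actually updated
@assert P[:, 1] == [1.0, 1.0]
```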


Good point. To spot such performance pitfalls, you can use the `@profview` macro in VSCode. To solve them, just adding `@views` at the beginning of the loop should already go a long way.

This is another way to try it:

```
@views function filtering_step_new!(search_vectors_list, u_vectors, w_vectors, provisional_vector, hamiltonian_matrix)
    nchuncks = size(provisional_vector, 2)  # number of chunks = number of buffer columns
    Threads.@threads for ichunck in 1:nchuncks
        for k in ichunck:nchuncks:size(search_vectors_list, 2) # simple splitter
            mul!(provisional_vector[:, ichunck], hamiltonian_matrix, search_vectors_list[:, k])
            u_vectors[:, k] .= provisional_vector[:, ichunck]
            provisional_vector[:, ichunck] .= search_vectors_list[:, k]
            mul!(provisional_vector[:, ichunck], hamiltonian_matrix, u_vectors[:, k], 2.0, -1.0)
            w_vectors[:, k] .= provisional_vector[:, ichunck]
        end
    end
end
```

Note that `nchuncks` is now independent of `Threads.nthreads()`. If the tasks are inhomogeneous, it may be good to make `nchuncks >> nthreads()`.

(Of course, you have to initialize the per-chunck auxiliary arrays with `nchuncks` columns instead of `nthreads()`; thus, in your case, `nchuncks` can simply be deduced from the size of those input arrays.)
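For instance (sizes here are hypothetical), the per-chunk buffer is allocated with `nchuncks` columns, and the function can then recover `nchuncks` from it:

```julia
using Base.Threads

n = 1000                            # hypothetical dimension of the sparse matrix
nchuncks = 4 * nthreads()           # more chunks than threads, for load balancing
provisional_vector = zeros(ComplexF64, n, nchuncks)
nchuncks_deduced = size(provisional_vector, 2)  # what the function would deduce
```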

Thank you all very much for the answers and the suggestions! I will try to implement all of them in the next few days and let you know the results!

Just one more question for the moment: for a problem of this form and with these dimensions, should I stay with multithreading, or is Distributed more appropriate?

> It creates a copy for `S`, but it also creates a copy for `P`, which means you copy a column out of `P`, pass it to `mul!` which modifies that copy, and then throws it away(!) The operation is not applied to `P` at all.

You are right! Actually, my original code was not written that way. In my original code, `P` is not a matrix but a vector of vectors, so I do not have the problem you noticed. In other words, my original code is like this:

```
mul!(P[id], H, S[:, k])
```

and in this way it works fine (in the sense that it produces the right results; I am not talking about performance).
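If I understand correctly, this works because indexing a `Vector` of `Vector`s returns the stored vector itself (a reference), not a copy, so `mul!` mutates the real buffer. A minimal check:

```julia
using LinearAlgebra

H = [2.0 0.0; 0.0 2.0]
S = ones(2, 2)
P = [zeros(2) for _ in 1:2]  # per-thread buffers as a vector of vectors
mul!(P[1], H, S[:, 1])       # P[1] is the buffer itself, so it is updated in place
@assert P[1] == [2.0, 2.0]   # (S[:, 1] on the right still makes a copy, though)
```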

I think your answer was the one that helped most! In the end I found it convenient to make as many chunks as there are threads and put an equal number of operations in each chunk.

As I was saying, I have just one final point that I would like to discuss: when working on a single machine, is multithreading always better than Distributed, or should I try a mixture?

I would not mix anything unless there is a very clear reason to do so, but with that philosophy I myself have never used Distributed for anything.

Concerning the number of chunks, it is always good to be able to set it independently of the number of threads. In particular, if the tasks are not homogeneous, it may be better to use more chunks than threads, so that the workload gets better distributed.