# Best way to parallelize

Hi all,

I have a large sparse matrix (dimension of order 10^5) which I have to multiply many times with a matrix of vectors. Such a matrix of vectors is made of, say, 500 columns, each with dimension equal to that of the sparse matrix mentioned above. I am looking for the best way to parallelize such an operation.

I tried a couple of strategies, but none of them looks very efficient or scalable. I will illustrate them with a greatly simplified example that still shows the problems I am encountering:

1. Using `mul!()` for matrix-matrix multiplication

Here is a code example:

```
function filtering_step!(search_vectors_list, u_vectors, w_vectors, provisional_vector, hamiltonian_matrix)
    mul!(u_vectors, hamiltonian_matrix, search_vectors_list)
    provisional_vector .= copy(search_vectors_list)
    mul!(provisional_vector, hamiltonian_matrix, u_vectors, 2.0, -1.0)
    w_vectors .= copy(provisional_vector)
end
```

In this code I multiply the matrix of vectors `search_vectors_list` on the left by the large sparse matrix `hamiltonian_matrix` and store the result in the matrix of vectors `u_vectors`. After that, I would like to compute the following operation:

```
w_vectors = 2 * hamiltonian_matrix * u_vectors - search_vectors_list
```

and I do that by means of `mul!()`, using the matrix `provisional_vector`, which has the same dimensions as `w_vectors` and `u_vectors`. I have to say that the code works well: just 4 memory allocations, and fast. However, no parallelism is used. To parallelize the code I am trying multithreading, as follows.
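For clarity, the five-argument `mul!(C, A, B, α, β)` computes `C .= α * A * B .+ β * C` in place, so preloading `provisional_vector` with the search vectors and calling it with `α = 2.0, β = -1.0` gives the desired combination without temporaries. A minimal sketch with made-up sizes:

```julia
using LinearAlgebra, SparseArrays

H = sprand(ComplexF64, 10, 10, 0.3)  # stand-in for the sparse Hamiltonian
S = rand(ComplexF64, 10, 5)          # stand-in for the search vectors
U = H * S
P = copy(S)
mul!(P, H, U, 2.0, -1.0)             # in place: P now holds 2*H*U - S
```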

2. Using `mul!()` for matrix-vector multiplication, with multithreading

For this purpose, I rewrite the previous code as follows:

```
provisional_vector = zeros(ComplexF64, size(hamiltonian_matrix, 2), Threads.nthreads())

function filtering_step_new!(search_vectors_list, u_vectors, w_vectors, provisional_vector, hamiltonian_matrix)
    @inbounds Threads.@threads for k in 1:size(search_vectors_list, 2)
        id = Threads.threadid()
        mul!(provisional_vector[:, id], hamiltonian_matrix, search_vectors_list[:, k])
        u_vectors[:, k] .= provisional_vector[:, id]
        provisional_vector[:, id] .= search_vectors_list[:, k]
        mul!(provisional_vector[:, id], hamiltonian_matrix, u_vectors[:, k], 2.0, -1.0)
        w_vectors[:, k] .= provisional_vector[:, id]
    end
end
```

However, in this case the performance does not scale well. Here are my timings:

from which you can see that the advantage is not great. Considering that I have to move this code to a server (not a cluster) with around 60 cores, I am wondering how I should proceed to increase the performance.

Thanks.

For option 1, are you sure that no parallelism is used under the hood by BLAS? What happens when you launch Julia with more than 1 thread?

Thanks for your reply. I suspect it is not using any parallelism under the hood because the matrix is sparse. I tried to launch the code with 6 threads and, beyond not seeing any gain in performance, I also checked the number of Julia threads actively working with the `top` command, and just one thread was at work. Do you think I should do some other tests?
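For reference, this is how I check the two thread settings (as far as I understand, sparse-times-dense products in SparseArrays are plain Julia code and never call BLAS, so the BLAS thread count is irrelevant here):

```julia
using LinearAlgebra

# BLAS threads only affect dense matrix products; the SparseArrays
# sparse-dense multiply is single-threaded Julia code regardless.
println("BLAS threads:  ", BLAS.get_num_threads())
println("Julia threads: ", Threads.nthreads())
```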

This is a bit beside the point, but your MWE is really hard to parse with such long variable names. I know long names are popular for various reasons, but for a reader who does not know what the names mean anyway, they just obscure the code logic.

Just renaming the variables like this makes it much easier to see what is happening:

```
function filtering_step!(S, U, W, P, H)
    mul!(U, H, S)
    P .= copy(S)
    mul!(P, H, U, 2.0, -1.0)
    W .= copy(P)
end

function filtering_step_new!(S, U, W, P, H)
    @inbounds Threads.@threads for k in 1:size(S, 2)
        id = Threads.threadid()
        mul!(P[:, id], H, S[:, k])
        U[:, k] .= P[:, id]
        P[:, id] .= S[:, k]
        mul!(P[:, id], H, U[:, k], 2.0, -1.0)
        W[:, k] .= P[:, id]
    end
end
```

You may want to limit the number of batches to the number of threads and increase the workload of each batch. This can be done by hand, but I have the impression that FLoops.jl can handle that case nicely.
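A hand-rolled version of that batching, using only `Iterators.partition` from Base (FLoops.jl automates this kind of thing); the column count and per-column work here are placeholders:

```julia
using Base.Threads

n = 500                                          # hypothetical number of columns
batches = collect(Iterators.partition(1:n, cld(n, nthreads())))
results = zeros(Int, n)
@threads for batch in batches                    # one task per contiguous batch
    for k in batch
        results[k] = k                           # placeholder for the per-column work
    end
end
```

Each task then processes a whole contiguous batch, so the scheduling overhead is paid once per batch instead of once per column.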

This is a bit of a red flag, BTW, and has been much discussed lately. Putting `@inbounds` in front of a `k in 1:size` loop is quite risky, and can cause really bad bugs. Instead, use `for k in axes(S, 2)`.

> This is a bit beside the point, but your MWE is really hard to parse with such long variable names. I know long names are popular for various reasons, but for a reader who does not know what the names mean anyway, they just obscure the code logic.

Thanks! I will keep it in mind for my next questions. Since I work in VS Code, which has auto-completion, I like long names because they remind me of the meanings, but I take your point.

> This is a bit of a red flag, BTW, and has been much discussed lately. Putting `@inbounds` in front of a `k in 1:size` loop is quite risky, and can cause really bad bugs. Instead, use `for k in axes(S, 2)`.

Yes, I am aware of this. I can definitely change it at a later stage, but right now my most pressing point is understanding the performance issue.

> You may want to limit the number of batches to the number of threads and increase the workload on each batch. This can be done by hand, but I have the impression that FLoops.jl can handle that case nicely.

This sounds like an interesting suggestion. I will take a look at it.

When multithreading, it is important to limit unnecessary allocations.

This line, `P .= copy(S)`, creates a copy of `S` and then copies the contents of that copy into `P`. In other words, it copies twice. You can just write `P .= S` to copy once.
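The extra temporary is easy to see with `@allocated` (a small sketch; the exact byte count depends on the array size):

```julia
function copy_demo()
    S = rand(1000)
    P = similar(S)
    P .= copy(S); P .= S            # warm up the compiler first
    a1 = @allocated (P .= copy(S))  # allocates a temporary array
    a2 = @allocated (P .= S)        # copies element-wise, no allocation
    return a1, a2
end
a1, a2 = copy_demo()
```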

In the `filtering_step_new!` function, you probably want `@views` on the whole loop.

The right-hand side of each broadcast assignment, e.g. `P[:, id]` in `U[:, k] .= P[:, id]`, also creates a copy of the slice, which is unnecessary.

This one is worse: `mul!(P[:, id], H, S[:, k])`. It creates a copy for `S`, but it also creates a copy for `P`, which means you copy a column out of `P`, pass it to `mul!`, which modifies that copy, and then throw it away(!). The operation is not applied to `P` at all.
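A tiny example of that pitfall: slicing on the destination side silently hands `mul!` a copy, while `@views` makes it write through to the parent array:

```julia
using LinearAlgebra

H = [1.0 0.0; 0.0 1.0]
P = zeros(2, 2)
S = ones(2, 2)

mul!(P[:, 1], H, S[:, 1])         # writes into a copy of column 1; P is unchanged
@assert P[:, 1] == [0.0, 0.0]

@views mul!(P[:, 1], H, S[:, 1])  # writes through a view; P is actually updated
@assert P[:, 1] == [1.0, 1.0]
```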


Good point. To spot such performance pitfalls, you can use the `@profview` macro in VSCode. To solve them, just adding `@views` at the beginning of the loop should already go a long way.

This is another way to try it:

```
@views function filtering_step_new!(search_vectors_list, u_vectors, w_vectors, provisional_vector, hamiltonian_matrix)
    nchuncks = size(provisional_vector, 2)  # number of chunks = number of buffer columns
    Threads.@threads for ichunck in 1:nchuncks
        for k in ichunck:nchuncks:size(search_vectors_list, 2) # simple splitter
            mul!(provisional_vector[:, ichunck], hamiltonian_matrix, search_vectors_list[:, k])
            u_vectors[:, k] .= provisional_vector[:, ichunck]
            provisional_vector[:, ichunck] .= search_vectors_list[:, k]
            mul!(provisional_vector[:, ichunck], hamiltonian_matrix, u_vectors[:, k], 2.0, -1.0)
            w_vectors[:, k] .= provisional_vector[:, ichunck]
        end
    end
end
```

Note that `nchuncks` is now independent of `Threads.nthreads()`. If the tasks are inhomogeneous, it may be good to make `nchuncks >> nthreads()`.

(Of course, you have to initialize the per-chunck auxiliary arrays with `nchuncks` columns instead of `nthreads()`; thus, in your case, `nchuncks` can simply be deduced from the size of those input arrays.)
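For instance (sizes here are hypothetical), the per-chunk buffer is allocated with `nchuncks` columns, and the function can then recover `nchuncks` from it:

```julia
using Base.Threads

n = 1000                            # hypothetical dimension of the sparse matrix
nchuncks = 4 * nthreads()           # more chunks than threads, for load balancing
provisional_vector = zeros(ComplexF64, n, nchuncks)
nchuncks_deduced = size(provisional_vector, 2)  # what the function would deduce
```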

Thank you all very much for the answers and the suggestions! I will try to implement all of them in the next few days and let you know the results!

Just one more question for the moment: for a problem of this form and with these dimensions, should I stay with multithreading, or is Distributed more appropriate?

> It creates a copy for `S`, but it also creates a copy for `P`, which means you copy a column out of `P`, pass it to `mul!` which modifies that copy, and then throws it away(!) The operation is not applied to `P` at all.

You are right! Actually, my original code was not written that way. In my original code, `P` is not a matrix but a vector of vectors, so I do not have the problem you noticed. In other words, my original code is like this:

```
mul!(P[id], H, S[:, k])
```

and in this way it works fine (in the sense that it produces the right results; I am not talking about performance).
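If I understand correctly, this works because indexing a `Vector` of `Vector`s returns the stored vector itself (a reference), not a copy, so `mul!` mutates the real buffer. A minimal check:

```julia
using LinearAlgebra

H = [2.0 0.0; 0.0 2.0]
S = ones(2, 2)
P = [zeros(2) for _ in 1:2]  # per-thread buffers as a vector of vectors
mul!(P[1], H, S[:, 1])       # P[1] is the buffer itself, so it is updated in place
@assert P[1] == [2.0, 2.0]   # (S[:, 1] on the right still makes a copy, though)
```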

I think your answer was the one that helped most! In the end I found it convenient to make as many chunks as there are threads and put an equal number of operations in each chunk.

As I was saying, I have just one final point that I would like to discuss: when working on a single machine, is multithreading always better than Distributed, or should I try a mixture?

I would not mix anything unless there is a very clear reason to do so, but with that philosophy I myself have never used Distributed for anything.

Concerning the number of chunks, it is always good to be able to set it independently of the number of threads. In particular, if the tasks are not homogeneous, it may be better to use more chunks than threads, so that the workload gets better distributed.