How to code faster parallel for loop

parallel

#1

Hi,

I am trying to multiply every element of a sparse matrix A (aij) with x[i]*x[j] from vector X.

aij *= (x[i]*x[j]).

In other words, Diagonal(X)*A*Diagonal(X). It will be simple to just compute as that but it takes 2X time since it iterates A twice by 2 multiplication. Then I made a function as following to update the value of the matrix by single round of iteration.

function _norm!(A, x)
    m, n = size(A)
    rows = rowvals(A)
    vals = nonzeros(A)
    @inbounds @simd for i = 1:n
        for j in nzrange(A, i)
          vals[j] *= x[rows[j]]*x[i]
       end
    end
    nothing
end
#testing
using SparseArrays, SharedArrays
using Distributed

n = 10^6
A = sprand(n,n,0.0001)
x = randn(Float64, n)

_norm!(A, x)
@time _norm!(A,x)
#0.591912 seconds (4 allocations: 160 bytes)

I try to parallel process the columns. But the function I coded is slower than single processor.

function _pnorm!(shared, A, x)
    m, n = size(A)
    rows = rowvals(A)
    @sync @distributed for i = 1:n
        @inbounds @simd for j in nzrange(A, i)
          shared[j] *= x[rows[j]]*x[i]
       end
    end
    nothing
end
#testing
y = SharedArray(A.nzval)
addprocs(3)
@everywhere using SparseArrays
_pnorm!(y, A, x)
@time _pnorm!(y, A, x)
#2.424550 seconds (626 allocations: 34.734 KiB)

This is the first time I try the parallel feature. I am very grateful for your help


#2

Multithreading can be simple:

function _norm!(A, x)
    m, n = size(A)
    rows = rowvals(A)
    vals = nonzeros(A)
    Threads.@threads for i = 1:n
        @simd for j in nzrange(A, i)
          vals[j] *= x[rows[j]]*x[i]
       end
    end
    nothing
end

using SparseArrays

n = 10^6;
A = sprand(n,n,0.0001);
x = randn(Float64, n);

_norm!(A, x)
@time _norm!(A,x)
  # 0.688902 seconds (5 allocations: 224 bytes) # without multithreading
  # 0.189398 seconds (5 allocations: 224 bytes) # with multithreading

#3

See my response and sample code for a somewhat similar problem here.

Also note that IMO this is the wrong end to approach multithreading. For several reasons, I’d rather recommend multithreading on as coarse of a level as possible, not in tight loops and individual functions like this.

(With some exceptions, for example if your problem solely consists of multiplying a single large matrix with a vector.)


#4

Thank you for your solution. It works very nicely I monitored the system while I ran those functions. I notice that when I used parallel in 4 core it has 4 lines of Julia usage of cpu. But when I use multithreading it only have one single Julia usage with 400% CPU usage. Does that means multithreading won’t have the problem of creating copies of variables for each thread? How is it different from parallel and when should I use parallel?

Thank you very much


#5

Thank you for the suggestion. Could you explain it a little bit more? I saw the Julia document it looks like all the example are using @thread for the for loop. Does Julia thread can offer more controls and do you know where I can find a document for the full usage of Julia multithreading?


#6

Note that @simd macro does not work with the gather pattern like x[rows[j]]. However, you can do it with SIMD.jl now. Depending on array size, I can get up to 2x speedup (which presumably also depend on CPU). I put sparse matrix/vector related SIMD’ed code in: https://github.com/tkf/SparseXX.jl. Maybe you can steal some code. It’s not thread-based parallelism but if you want parallelism it’s nice to get it at all levels including per-CPU.


#7

It’s not related to Julia really, just in general – if you parallelize the most fine-grained parts of your application (like this method), it’ll be harder to parallelize the application as a whole. It’s additional development work, leads to more complicated code (which increases the maintenance burden and risk of bugs), likely some parts of your application don’t lend themselves to be parallelized, which results in less parallelization overall. Users of your code might also prefer to handle the parallelization themselves.

There are exceptions; for example if the vast majority of your application is spent executing this particular method, and/or if the problem you’re solving can’t be parallelized on a more coarse level. Still, parallelizing individual methods like this sounds like a very last resort.

On top of that, when I run @Seif_Shebl’s sample code with and without @threads, I get a 2.5x speedup with the multithreaded version (~260 vs 670 ms). Yet I have an 8 core machine, and the multi-threaded version uses 760% CPU. This is a really bad usage of multi-threading. If your code uses 760% CPU, you want to aim for a speedup close to 7.6x. For comparison, in the example in the thread I linked to, I achieved a 7.4x speedup with 780% CPU usage.


#8

It makes sense. Thank you for the explanation.

Also I try to mimic your code of your other post

function _norm!(A, x; T=Threads.nthreads())
   m, n = size(A)
   rows = rowvals(A)
   vals = nonzeros(A)
   Threads.@threads for t in 1:T
       for i = t:T:n, j in nzrange(A, i)
          vals[j] *= x[rows[j]]*x[i]
      end
   end
   nothing
end

But I could not get ~thread_numX speedup.


#9

That approach is not very useful here, in fact it will have a negative impact since it ruins cache locality. The reason it worked in the other post was very problem-specific (the size of the inner loop grew linearly with the size of the outer loop). For your problem, that’s not the case, so you’d need to do something a bit more elaborate. Not sure how you’d parallelize it efficiently, perhaps someone has any ideas, but until then I’d recommend removing that @threads annotation, and look for other ways to speed up your code. Start at the highest level by trying to reduce the amount of work done (by choosing efficient algorithms, the right data structures, caching, etc), then work your way down. Did you take a look at @tkf’s link above on how to use SIMD?