How to code faster parallel for loop

Donut_Meepo · January 17, 2019, 8:27pm

Hi,

I am trying to multiply every element of a sparse matrix A (aij) with x[i]*x[j] from vector X.

aij *= (x[i]*x[j]).

In other words, Diagonal(X)*A*Diagonal(X). It will be simple to just compute as that but it takes 2X time since it iterates A twice by 2 multiplication. Then I made a function as following to update the value of the matrix by single round of iteration.

function _norm!(A, x)
    m, n = size(A)
    rows = rowvals(A)
    vals = nonzeros(A)
    @inbounds @simd for i = 1:n
        for j in nzrange(A, i)
          vals[j] *= x[rows[j]]*x[i]
       end
    end
    nothing
end

#testing
using SparseArrays, SharedArrays
using Distributed

n = 10^6
A = sprand(n,n,0.0001)
x = randn(Float64, n)

_norm!(A, x)
@time _norm!(A,x)
#0.591912 seconds (4 allocations: 160 bytes)

I try to parallel process the columns. But the function I coded is slower than single processor.

function _pnorm!(shared, A, x)
    m, n = size(A)
    rows = rowvals(A)
    @sync @distributed for i = 1:n
        @inbounds @simd for j in nzrange(A, i)
          shared[j] *= x[rows[j]]*x[i]
       end
    end
    nothing
end

#testing
y = SharedArray(A.nzval)
addprocs(3)
@everywhere using SparseArrays
_pnorm!(y, A, x)
@time _pnorm!(y, A, x)
#2.424550 seconds (626 allocations: 34.734 KiB)

This is the first time I try the parallel feature. I am very grateful for your help

Seif_Shebl · January 17, 2019, 9:27pm

Multithreading can be simple:

function _norm!(A, x)
    m, n = size(A)
    rows = rowvals(A)
    vals = nonzeros(A)
    Threads.@threads for i = 1:n
        @simd for j in nzrange(A, i)
          vals[j] *= x[rows[j]]*x[i]
       end
    end
    nothing
end

using SparseArrays

n = 10^6;
A = sprand(n,n,0.0001);
x = randn(Float64, n);

_norm!(A, x)
@time _norm!(A,x)
  # 0.688902 seconds (5 allocations: 224 bytes) # without multithreading
  # 0.189398 seconds (5 allocations: 224 bytes) # with multithreading

bennedich · January 17, 2019, 9:33pm

See my response and sample code for a somewhat similar problem here.

Also note that IMO this is the wrong end to approach multithreading. For several reasons, I’d rather recommend multithreading on as coarse of a level as possible, not in tight loops and individual functions like this.

(With some exceptions, for example if your problem solely consists of multiplying a single large matrix with a vector.)

Donut_Meepo · January 18, 2019, 12:05am

Thank you for your solution. It works very nicely I monitored the system while I ran those functions. I notice that when I used parallel in 4 core it has 4 lines of Julia usage of cpu. But when I use multithreading it only have one single Julia usage with 400% CPU usage. Does that means multithreading won’t have the problem of creating copies of variables for each thread? How is it different from parallel and when should I use parallel?

Thank you very much

Donut_Meepo · January 18, 2019, 12:08am

Thank you for the suggestion. Could you explain it a little bit more? I saw the Julia document it looks like all the example are using @thread for the for loop. Does Julia thread can offer more controls and do you know where I can find a document for the full usage of Julia multithreading?

tkf · January 18, 2019, 12:16am

Note that @simd macro does not work with the gather pattern like x[rows[j]]. However, you can do it with SIMD.jl now. Depending on array size, I can get up to 2x speedup (which presumably also depend on CPU). I put sparse matrix/vector related SIMD’ed code in: GitHub - tkf/SparseXX.jl: Sparse arrays with eXperimental eXtensions. Maybe you can steal some code. It’s not thread-based parallelism but if you want parallelism it’s nice to get it at all levels including per-CPU.

bennedich · January 18, 2019, 1:29am

It’s not related to Julia really, just in general – if you parallelize the most fine-grained parts of your application (like this method), it’ll be harder to parallelize the application as a whole. It’s additional development work, leads to more complicated code (which increases the maintenance burden and risk of bugs), likely some parts of your application don’t lend themselves to be parallelized, which results in less parallelization overall. Users of your code might also prefer to handle the parallelization themselves.

There are exceptions; for example if the vast majority of your application is spent executing this particular method, and/or if the problem you’re solving can’t be parallelized on a more coarse level. Still, parallelizing individual methods like this sounds like a very last resort.

On top of that, when I run @Seif_Shebl’s sample code with and without @threads, I get a 2.5x speedup with the multithreaded version (~260 vs 670 ms). Yet I have an 8 core machine, and the multi-threaded version uses 760% CPU. This is a really bad usage of multi-threading. If your code uses 760% CPU, you want to aim for a speedup close to 7.6x. For comparison, in the example in the thread I linked to, I achieved a 7.4x speedup with 780% CPU usage.

Donut_Meepo · January 18, 2019, 2:57am

It makes sense. Thank you for the explanation.

Also I try to mimic your code of your other post

function _norm!(A, x; T=Threads.nthreads())
   m, n = size(A)
   rows = rowvals(A)
   vals = nonzeros(A)
   Threads.@threads for t in 1:T
       for i = t:T:n, j in nzrange(A, i)
          vals[j] *= x[rows[j]]*x[i]
      end
   end
   nothing
end

But I could not get ~thread_numX speedup.

bennedich · January 18, 2019, 8:20pm

That approach is not very useful here, in fact it will have a negative impact since it ruins cache locality. The reason it worked in the other post was very problem-specific (the size of the inner loop grew linearly with the size of the outer loop). For your problem, that’s not the case, so you’d need to do something a bit more elaborate. Not sure how you’d parallelize it efficiently, perhaps someone has any ideas, but until then I’d recommend removing that @threads annotation, and look for other ways to speed up your code. Start at the highest level by trying to reduce the amount of work done (by choosing efficient algorithms, the right data structures, caching, etc), then work your way down. Did you take a look at @tkf’s link above on how to use SIMD?

Topic		Replies	Views
Best way to parallelize Julia at Scale parallel , linearalgebra , sparse	12	1341	August 25, 2022
Is there an easy way to parallelise matrix multiplication? Performance	8	4718	April 19, 2019
Trying to code a parallel for loop with shared sparse arrays New to Julia	18	1605	June 15, 2020
How to write multithreading matrix multiplication New to Julia parallel , multithreading	4	244	October 29, 2024
Multithreaded MatVec Numerics multithreading , matrices	10	1964	February 4, 2022

How to code faster parallel for loop

Related topics