Parallel implementation is not very effective

Hi, I’m new to Julia and parallel programming and I’m having trouble with a parallel implementation. I need to assemble a matrix by calculating 10 individual blocks, each one taking approximately 18 seconds of computing time in a single thread session. I wish to parallelize this process calculating each block in an available separate thread so I implemented the following functions

function construct_U_block(sol,sigma)
    A=zeros(length(sol),length(sol))
    @inbounds for i in 1:length(sol)
        @inbounds for j in 1:i
            @inbounds A[j,i]=k(sol[i],sol[j],sigma)
        end
    end
    return Symmetric(A)
end

function K_matrix(sol1,sol2,sigma)
    Kcom=zeros(length(sol1),length(sol2))
    @inbounds for i in 1:length(sol2)
         @inbounds for j in 1:length(sol1)
            @inbounds Kcom[j,i]=k(sol1[j],sol2[i],sigma)
        end
    end
    return Kcom
end

function fill_K_completely(soln,sigma,iOO,iOF,iFO,iFF)
    D1=@spawn @views construct_U_block(soln[iOO],sigma)
    D2=@spawn @views construct_U_block(soln[iOF],sigma)
    D3=@spawn @views construct_U_block(soln[iFO],sigma)
    D4=@spawn @views construct_U_block(soln[iFF],sigma)
    rD1=fetch(D1)
    rD2=fetch(D2)
    rD3=fetch(D3)
    rD4=fetch(D4)
    D5=@spawn @views K_matrix(soln[iOO],soln[iOF],sigma)
    D6=@spawn @views K_matrix(soln[iOF],soln[iFO],sigma)
    D7=@spawn @views K_matrix(soln[iFO],soln[iFF],sigma)
    D8=@spawn @views K_matrix(soln[iOO],soln[iFO],sigma)
    rD5=fetch(D5)
    rD6=fetch(D6)
    rD7=fetch(D7)
    rD8=fetch(D8)
    D9=@spawn @viewsK_matrix(soln[iOF],soln[iFF],sigma)
    D10=@spawn @views K_matrix(soln[iOO],soln[iFF],sigma)
    rD9=fetch(D9)
    rD10=fetch(D10)
    
    t(A)=transpose(A)
    return Symmetric([rD1 rD5 rD8 rD10;
                    t(rD5) rD2 rD6 rD9;
                    t(rD8) t(rD6) rD3 rD7;
                    t(rD10) t(rD9) t(rD7) rD4
                    ]),blockdiag(sparse(rD1),sparse(rD2),sparse(rD3),sparse(rD4))
end

where soln is an array containing 10000 50x50 float64 matrices and iOO,iOF,iFO,iFF are indices of certain elements of the soln array. However, when I run this parallel version with 4 threads I only got an improvement of 20 seconds i.e now I wait 160 seconds, not 180 as in the single thread version. I expected the time saved with parallelization would be greater, as the block matrices take a bit to be calculated, that’s why I wonder if my parallel implementation is correct and efficient or there is a pitfall in my code.

Note: I tested this implementation with a function returning a random matrix and sleep(1) simulating the time to compute it and it worked,computing the full matrix took 4 seconds, I don’t know where is my blottleneck in the above example

RAM bandwith

1 Like

Is it related to the @views macro(removing the macro didn’t work)? How can I pass a copy of the slices of solnt o be accessed independently by each thread ?

I would suggest changing the construct_U_block and K_matrix so that they work in-place with pre-allocated memory. That way you can:

  1. Allocate one large array for your final K matrix in the fill_K_completely function
  2. Pass the appropriate views of the global matrix to the other methods that are responsible for filling it in

Assuming your k function is allocation-free, I think this would should significantly reduce the number of allocations, and improve both serial and parallel performance.

Can you post a complete (runnable) example perhaps?

Yes, it is something like this

function k(X,x,sigma)
    return exp(-sum((X.-x).^2)/(sigma^2))    
end

function construct_U_block(sol,sigma)
    A=zeros(length(sol),length(sol))
    @inbounds for i in 1:length(sol)
        @inbounds for j in 1:i
            @inbounds A[j,i]=k(sol[i],sol[j],sigma)
        end
    end
    return Symmetric(A)
end

function K_matrix(sol1,sol2,sigma)
    Kcom=zeros(length(sol1),length(sol2))
    @inbounds for i in 1:length(sol2)
         @inbounds for j in 1:length(sol1)
            @inbounds Kcom[j,i]=k(sol1[j],sol2[i],sigma)
        end
    end
    return Kcom
end

function fill_K_completely(soln,sigma,iOO,iOF,iFO,iFF)
    D1=@spawn @views construct_U_block(soln[iOO],sigma)
    D2=@spawn @views construct_U_block(soln[iOF],sigma)
    D3=@spawn @views construct_U_block(soln[iFO],sigma)
    D4=@spawn @views construct_U_block(soln[iFF],sigma)
    rD1=fetch(D1)
    rD2=fetch(D2)
    rD3=fetch(D3)
    rD4=fetch(D4)
    D5=@spawn @views K_matrix(soln[iOO],soln[iOF],sigma)
    D6=@spawn @views K_matrix(soln[iOF],soln[iFO],sigma)
    D7=@spawn @views K_matrix(soln[iFO],soln[iFF],sigma)
    D8=@spawn @views K_matrix(soln[iOO],soln[iFO],sigma)
    rD5=fetch(D5)
    rD6=fetch(D6)
    rD7=fetch(D7)
    rD8=fetch(D8)
    D9=@spawn @viewsK_matrix(soln[iOF],soln[iFF],sigma)
    D10=@spawn @views K_matrix(soln[iOO],soln[iFF],sigma)
    rD9=fetch(D9)
    rD10=fetch(D10)
    
    t(A)=transpose(A)
    return Symmetric([rD1 rD5 rD8 rD10;
                    t(rD5) rD2 rD6 rD9;
                    t(rD8) t(rD6) rD3 rD7;
                    t(rD10) t(rD9) t(rD7) rD4
                    ]),blockdiag(sparse(rD1),sparse(rD2),sparse(rD3),sparse(rD4))
end

soln=[]
for i in 1:10000
    append!(soln,rand(50,50))
end
iOO=1:2550
iOF=2551:5026
iFO=5027:7530
iFF=7531:10000
K,K0=fill_K_completely(soln,1.0,iOO,iOF,iFO,iFF)

I also think I have a problem with the number of allocations in my k function, by I don’t see other intuitive way to code it.

this allocates intermediate arrays, use:

julia> k(X,x,sigma) = exp(sum(vals -> -(vals[1] - vals[2])^2/sigma^2, zip(x,X)))

julia> @btime k($X,$x,$sigma)
  12.907 ns (0 allocations: 0 bytes)
0.32743207166924876

or expand that into a loop:

julia> function k(X,x,sigma)
           s = zero(eltype(x))
           for i in eachindex(x,X)
               s += (X[i] - x[i])^2
           end
           return exp(-s/sigma^2)
       end
k (generic function with 1 method)

julia> @btime k($X,$x,$sigma)
  11.140 ns (0 allocations: 0 bytes)
0.32743207166924876

The one you like most.

With the loop version you can use LoopVectorization, if that specific operation is critical for you:

julia> X = rand(100); x = rand(100); sigma = 1.0;

julia> function k(X,x,sigma)
           s = zero(eltype(x))
           for i in eachindex(x,X)
               s += (X[i] - x[i])^2
           end
           return exp(-s/sigma^2)
       end
k (generic function with 1 method)

julia> @btime k($X,$x,$sigma)
  76.622 ns (0 allocations: 0 bytes)
4.157610819433723e-9

julia> using LoopVectorization

julia> function k(X,x,sigma)
           s = zero(eltype(x))
           @turbo for i in eachindex(x,X)
               s += (X[i] - x[i])^2
           end
           return exp(-s/sigma^2)
       end
k (generic function with 1 method)

julia> @btime k($X,$x,$sigma)
  14.912 ns (0 allocations: 0 bytes)
4.1576108194336795e-9

or @simd, which is almost as fast:

julia> function k(X,x,sigma)
           s = zero(eltype(x))
           @simd for i in eachindex(x,X)
               s += (X[i] - x[i])^2
           end
           return exp(-s/sigma^2)
       end
k (generic function with 1 method)

julia> @btime k($X,$x,$sigma)
  18.214 ns (0 allocations: 0 bytes)
4.157610819433694e-9

As a coding style, remove all those @inbounds, they are hardly ever needed except for a very very final performance tuning, and may hide important errors.

Good tip, do you know why with allocations here is slower?
image

with the @simd flag it is effectively faster, if I use it can I still parallelize my matrix block construction?

Interpolate the variables in @btime (i. e. @btime k($X,$x,$sigma)), the loop version does not allocate.

It is probably slower in that case because in the one-liner it is doing simd automatically, while in the loop version it is not. And yes, you can use @simd there, no problem.

1 Like

Is soln supposed be a vector of 10_000 matrices of size 50x50? Your code produces a vector of 25000000 scalars. Also, soln as such is a Vector{Any}, and that is usually bad for performance because the compiler can’t infer the type of the elements based on the type of soln.

Yes, you are right, it should be something like

soln=Matrix{Float64}[]
for i in 1:10000
    append!(soln,[rand(50,50)])
end

bad example construction

So long your recomendations have reduced my waiting time from 180 to 42 seconds. I have implemented the following

  • k function without allocations (largest performance improvement, saved 100 second )
  • specifying type of elements when constructing soln(second largest improvement, saved 60 second)
  • creating a fixed sized array and passing the views when constructing blocks(reduced allocations, but not much time, saved 5 seconds)

Thank you all @lmiq @maltezfaria @jling !

With this recommendations my non-parallel time is 66 second, and my current parallel version time is 42 seconds(4 threads), Is it the best I can accomplish with parallelism? With 4 threads I would expect it to run in aproximately 20 seconds, Do you see something I can improve in my current code? Here is my current version

function k(X,x,sigma)
    s = zero(eltype(x))
    @simd for i in eachindex(x,X)
            s += (X[i] - x[i])^2
        end
    return exp(-s/sigma^2)
end
function U_block_in_place(partA,partSol,sigma)
     for i in 1:length(partSol)
         for j in 1:i
             partA[j,i]=k(partSol[i],partSol[j],sigma)
            end
        end
end

function full_block_in_place(partA,partSol1,partSol2,sigma)
     for i in 1:length(partSol2)
          for j in 1:length(partSol1)
              partA[j,i]=k(partSol1[j],partSol2[i],sigma)
        end
    end
end

function K_com_inplace(solucionesn,sigma,iOO,iOF,iFO,iFF)
    A=zeros(length(solucionesn),length(solucionesn))
    D1=@spawn @views U_block_in_place(A[iOO,iOO],solucionesn[iOO],sigma)
    D2=@spawn @views U_block_in_place(A[iOF,iOF],solucionesn[iOF],sigma)
    D3=@spawn @views U_block_in_place(A[iFO,iFO],solucionesn[iFO],sigma)
    D4=@spawn @views U_block_in_place(A[iFF,iFF],solucionesn[iFF],sigma)
    D5=@spawn @views full_block_in_place(A[iOO,iOF],solucionesn[iOO],solucionesn[iOF],sigma)
    D6=@spawn @views full_block_in_place(A[iOF,iFO],solucionesn[iOF],solucionesn[iFO],sigma)
    D7=@spawn @views full_block_in_place(A[iFO,iFF],solucionesn[iFO],solucionesn[iFF],sigma)
    D8=@spawn @views full_block_in_place(A[iOO,iFO],solucionesn[iOO],solucionesn[iFO],sigma)
    D9=@spawn @views full_block_in_place(A[iOF,iFF],solucionesn[iOF],solucionesn[iFF],sigma)
    D10=@spawn @views full_block_in_place(A[iOO,iFF],solucionesn[iOO],solucionesn[iFF],sigma)
    
    rD1=fetch(D1)
    rD2=fetch(D2)
    rD3=fetch(D3)
    rD4=fetch(D4)
    rD5=fetch(D5)
    rD6=fetch(D6)
    rD7=fetch(D7)
    rD8=fetch(D8)
    rD9=fetch(D9)
    rD10=fetch(D10)
    return Symmetric(A)
end

soln=Matrix{Float64}[]
for i in 1:10000
    append!(soln,[rand(50,50)])
end
iOO=1:2550
iOF=2551:5026
iFO=5027:7530
iFF=7531:10000
K,K0= K_com_inplace(soln,1.0,iOO,iOF,iFO,iFF)

I tried running the program under LinuxPerf:

@pstats K,K0= K_com_inplace(soln,1.0,iOO,iOF,iFO,iFF)

I get between 70 and 80 % backend stalled cycles with more than 1 thread, i.e. the cpu waits for resources. With 1 thread it has 30% stalls. Probably it’s waiting for memory, I haven’t looked deeper. I.e. there is so much read/store in memory that it can’t keep up when you run in parallel. On my cpu (A Ryzen 7) there is very little benefit from more than two threads.

I think U_block_in_place is still far from optimal. You are computing a lot of the same information a lot of times. How does the following do?

using LoopVectorization
function U_block_in_place(partA,partSol,sigma)
    @turbo for i in 1:length(partSol)
        x = partSol[i]
        for j in 1:i
            X = partSol[i]
        s = zero(eltype(partSol))
        for k in eachindex(x, X)
            s += (X[k] - x[k])^2
        end
        partA[j,i] = exp(-s/sigma^2)
        end
end

@Oscar_Smith I don’t see why, each element of the matric is computed only one time. Also, I tried your code fixing the ends as I thought it would work
image
but it didn’t and I don’t know why

@sgaure Do you know if it is slow also that different thread access different pieces of memory? I mean, I don’t think they are trying to access the same arrays stored in memory at the same time

Update:
If I pass a copy of my arrays as this

function K_com_inplace(solucionesn,sigma,iOO,iOF,iFO,iFF)
    A=zeros(length(solucionesn),length(solucionesn))
    D1=@spawn U_block_in_place(@view(A[iOO,iOO]),deepcopy(solucionesn[iOO]),sigma)
    D2=@spawn U_block_in_place(@view(A[iOF,iOF]),deepcopy(solucionesn[iOF]),sigma)
    D3=@spawn U_block_in_place(@view(A[iFO,iFO]),deepcopy(solucionesn[iFO]),sigma)
    D4=@spawn U_block_in_place(@view(A[iFF,iFF]),deepcopy(solucionesn[iFF]),sigma)
    D5=@spawn full_block_in_place(@view(A[iOO,iOF]),deepcopy(solucionesn[iOO]),deepcopy(solucionesn[iOF]),sigma)
    D6=@spawn full_block_in_place(@view(A[iOF,iFO]),deepcopy(solucionesn[iOF]),deepcopy(solucionesn[iFO]),sigma)
    D7=@spawn full_block_in_place(@view(A[iFO,iFF]),deepcopy(solucionesn[iFO]),deepcopy(solucionesn[iFF]),sigma)
    D8=@spawn full_block_in_place(@view(A[iOO,iFO]),deepcopy(solucionesn[iOO]),deepcopy(solucionesn[iFO]),sigma)
    D9=@spawn full_block_in_place(@view(A[iOF,iFF]),deepcopy(solucionesn[iOF]),deepcopy(solucionesn[iFF]),sigma)
    D10=@spawn full_block_in_place(@view(A[iOO,iFF]),deepcopy(solucionesn[iOO]),deepcopy(solucionesn[iFF]),sigma)
    
    rD1=fetch(D1)
    rD2=fetch(D2)
    rD3=fetch(D3)
    rD4=fetch(D4)
    rD5=fetch(D5)
    rD6=fetch(D6)
    rD7=fetch(D7)
    rD8=fetch(D8)
    rD9=fetch(D9)
    rD10=fetch(D10)
    return Symmetric(A)
end

time reduces to 34 seconds due to not accessing the same arrays at the same times. Thanks @sgaure

Depending on the CPU-, cache-, and memory-architecture, it may be beneficial to read the same memory at the same time from several threads, but this isn’t easy to control. One more thing. At least on my box the total memory bandwidth seems to be the limit. Doing eight threads with Float32 almost doubles the speed due to less memory use compared to Float64, but again this depends on the memory architecture and stuff. And, of course, it may not be acceptable with the lower precision.

You are calculating the Gaussian of the squared Euclidean distance between two sets of points. If the dimension of the points is not too small, it might be beneficial to use a matrix multiplication approach that will benefit from the available high performance matrix-matrix multiplications.

The complete mxn squared distance matrix D2 between mxd points A and nxd points B is

D2 = sum(A .^2, dims=2) .+ sum(B.^2, dims=2)' - 2 .* A * B'

The above should be faster, even for small d.

And even if you do use @inbounds, you only need one:

No need to put it in three places. Just keep one of those, it doesn’t matter which.