Parallel implementation is not very effective

CarlosContrerasQ12 · October 6, 2022, 1:14am

Hi, I’m new to Julia and parallel programming and I’m having trouble with a parallel implementation. I need to assemble a matrix by calculating 10 individual blocks, each one taking approximately 18 seconds of computing time in a single thread session. I wish to parallelize this process calculating each block in an available separate thread so I implemented the following functions

function construct_U_block(sol,sigma)
    A=zeros(length(sol),length(sol))
    @inbounds for i in 1:length(sol)
        @inbounds for j in 1:i
            @inbounds A[j,i]=k(sol[i],sol[j],sigma)
        end
    end
    return Symmetric(A)
end

function K_matrix(sol1,sol2,sigma)
    Kcom=zeros(length(sol1),length(sol2))
    @inbounds for i in 1:length(sol2)
         @inbounds for j in 1:length(sol1)
            @inbounds Kcom[j,i]=k(sol1[j],sol2[i],sigma)
        end
    end
    return Kcom
end

function fill_K_completely(soln,sigma,iOO,iOF,iFO,iFF)
    D1=@spawn @views construct_U_block(soln[iOO],sigma)
    D2=@spawn @views construct_U_block(soln[iOF],sigma)
    D3=@spawn @views construct_U_block(soln[iFO],sigma)
    D4=@spawn @views construct_U_block(soln[iFF],sigma)
    rD1=fetch(D1)
    rD2=fetch(D2)
    rD3=fetch(D3)
    rD4=fetch(D4)
    D5=@spawn @views K_matrix(soln[iOO],soln[iOF],sigma)
    D6=@spawn @views K_matrix(soln[iOF],soln[iFO],sigma)
    D7=@spawn @views K_matrix(soln[iFO],soln[iFF],sigma)
    D8=@spawn @views K_matrix(soln[iOO],soln[iFO],sigma)
    rD5=fetch(D5)
    rD6=fetch(D6)
    rD7=fetch(D7)
    rD8=fetch(D8)
    D9=@spawn @viewsK_matrix(soln[iOF],soln[iFF],sigma)
    D10=@spawn @views K_matrix(soln[iOO],soln[iFF],sigma)
    rD9=fetch(D9)
    rD10=fetch(D10)
    
    t(A)=transpose(A)
    return Symmetric([rD1 rD5 rD8 rD10;
                    t(rD5) rD2 rD6 rD9;
                    t(rD8) t(rD6) rD3 rD7;
                    t(rD10) t(rD9) t(rD7) rD4
                    ]),blockdiag(sparse(rD1),sparse(rD2),sparse(rD3),sparse(rD4))
end

where soln is an array containing 10000 50x50 float64 matrices and iOO,iOF,iFO,iFF are indices of certain elements of the soln array. However, when I run this parallel version with 4 threads I only got an improvement of 20 seconds i.e now I wait 160 seconds, not 180 as in the single thread version. I expected the time saved with parallelization would be greater, as the block matrices take a bit to be calculated, that’s why I wonder if my parallel implementation is correct and efficient or there is a pitfall in my code.

Note: I tested this implementation with a function returning a random matrix and sleep(1) simulating the time to compute it and it worked,computing the full matrix took 4 seconds, I don’t know where is my blottleneck in the above example

jling · October 6, 2022, 1:16am

RAM bandwith

CarlosContrerasQ12 · October 6, 2022, 1:45am

Is it related to the @views macro(removing the macro didn’t work)? How can I pass a copy of the slices of solnt o be accessed independently by each thread ?

maltezfaria · October 6, 2022, 8:34am

I would suggest changing the construct_U_block and K_matrix so that they work in-place with pre-allocated memory. That way you can:

Allocate one large array for your final K matrix in the fill_K_completely function
Pass the appropriate views of the global matrix to the other methods that are responsible for filling it in

Assuming your k function is allocation-free, I think this would should significantly reduce the number of allocations, and improve both serial and parallel performance.

Can you post a complete (runnable) example perhaps?

CarlosContrerasQ12 · October 6, 2022, 1:30pm

Yes, it is something like this

function k(X,x,sigma)
    return exp(-sum((X.-x).^2)/(sigma^2))    
end

function construct_U_block(sol,sigma)
    A=zeros(length(sol),length(sol))
    @inbounds for i in 1:length(sol)
        @inbounds for j in 1:i
            @inbounds A[j,i]=k(sol[i],sol[j],sigma)
        end
    end
    return Symmetric(A)
end

function K_matrix(sol1,sol2,sigma)
    Kcom=zeros(length(sol1),length(sol2))
    @inbounds for i in 1:length(sol2)
         @inbounds for j in 1:length(sol1)
            @inbounds Kcom[j,i]=k(sol1[j],sol2[i],sigma)
        end
    end
    return Kcom
end

function fill_K_completely(soln,sigma,iOO,iOF,iFO,iFF)
    D1=@spawn @views construct_U_block(soln[iOO],sigma)
    D2=@spawn @views construct_U_block(soln[iOF],sigma)
    D3=@spawn @views construct_U_block(soln[iFO],sigma)
    D4=@spawn @views construct_U_block(soln[iFF],sigma)
    rD1=fetch(D1)
    rD2=fetch(D2)
    rD3=fetch(D3)
    rD4=fetch(D4)
    D5=@spawn @views K_matrix(soln[iOO],soln[iOF],sigma)
    D6=@spawn @views K_matrix(soln[iOF],soln[iFO],sigma)
    D7=@spawn @views K_matrix(soln[iFO],soln[iFF],sigma)
    D8=@spawn @views K_matrix(soln[iOO],soln[iFO],sigma)
    rD5=fetch(D5)
    rD6=fetch(D6)
    rD7=fetch(D7)
    rD8=fetch(D8)
    D9=@spawn @viewsK_matrix(soln[iOF],soln[iFF],sigma)
    D10=@spawn @views K_matrix(soln[iOO],soln[iFF],sigma)
    rD9=fetch(D9)
    rD10=fetch(D10)
    
    t(A)=transpose(A)
    return Symmetric([rD1 rD5 rD8 rD10;
                    t(rD5) rD2 rD6 rD9;
                    t(rD8) t(rD6) rD3 rD7;
                    t(rD10) t(rD9) t(rD7) rD4
                    ]),blockdiag(sparse(rD1),sparse(rD2),sparse(rD3),sparse(rD4))
end

soln=[]
for i in 1:10000
    append!(soln,rand(50,50))
end
iOO=1:2550
iOF=2551:5026
iFO=5027:7530
iFF=7531:10000
K,K0=fill_K_completely(soln,1.0,iOO,iOF,iFO,iFF)

I also think I have a problem with the number of allocations in my k function, by I don’t see other intuitive way to code it.

lmiq · October 6, 2022, 1:39pm

this allocates intermediate arrays, use:

julia> k(X,x,sigma) = exp(sum(vals -> -(vals[1] - vals[2])^2/sigma^2, zip(x,X)))

julia> @btime k($X,$x,$sigma)
  12.907 ns (0 allocations: 0 bytes)
0.32743207166924876

or expand that into a loop:

julia> function k(X,x,sigma)
           s = zero(eltype(x))
           for i in eachindex(x,X)
               s += (X[i] - x[i])^2
           end
           return exp(-s/sigma^2)
       end
k (generic function with 1 method)

julia> @btime k($X,$x,$sigma)
  11.140 ns (0 allocations: 0 bytes)
0.32743207166924876

The one you like most.

lmiq · October 6, 2022, 1:45pm

With the loop version you can use LoopVectorization, if that specific operation is critical for you:

julia> X = rand(100); x = rand(100); sigma = 1.0;

julia> function k(X,x,sigma)
           s = zero(eltype(x))
           for i in eachindex(x,X)
               s += (X[i] - x[i])^2
           end
           return exp(-s/sigma^2)
       end
k (generic function with 1 method)

julia> @btime k($X,$x,$sigma)
  76.622 ns (0 allocations: 0 bytes)
4.157610819433723e-9

julia> using LoopVectorization

julia> function k(X,x,sigma)
           s = zero(eltype(x))
           @turbo for i in eachindex(x,X)
               s += (X[i] - x[i])^2
           end
           return exp(-s/sigma^2)
       end
k (generic function with 1 method)

julia> @btime k($X,$x,$sigma)
  14.912 ns (0 allocations: 0 bytes)
4.1576108194336795e-9

or @simd, which is almost as fast:

julia> function k(X,x,sigma)
           s = zero(eltype(x))
           @simd for i in eachindex(x,X)
               s += (X[i] - x[i])^2
           end
           return exp(-s/sigma^2)
       end
k (generic function with 1 method)

julia> @btime k($X,$x,$sigma)
  18.214 ns (0 allocations: 0 bytes)
4.157610819433694e-9

As a coding style, remove all those @inbounds, they are hardly ever needed except for a very very final performance tuning, and may hide important errors.

CarlosContrerasQ12 · October 6, 2022, 2:04pm

Good tip, do you know why with allocations here is slower?

with the @simd flag it is effectively faster, if I use it can I still parallelize my matrix block construction?

lmiq · October 6, 2022, 2:08pm

Interpolate the variables in @btime (i. e. @btime k($X,$x,$sigma)), the loop version does not allocate.

It is probably slower in that case because in the one-liner it is doing simd automatically, while in the loop version it is not. And yes, you can use @simd there, no problem.

maltezfaria · October 7, 2022, 7:38am

Is soln supposed be a vector of 10_000 matrices of size 50x50? Your code produces a vector of 25000000 scalars. Also, soln as such is a Vector{Any}, and that is usually bad for performance because the compiler can’t infer the type of the elements based on the type of soln.

CarlosContrerasQ12 · October 7, 2022, 2:32pm

Yes, you are right, it should be something like

soln=Matrix{Float64}[]
for i in 1:10000
    append!(soln,[rand(50,50)])
end

bad example construction

CarlosContrerasQ12 · October 7, 2022, 3:52pm

So long your recomendations have reduced my waiting time from 180 to 42 seconds. I have implemented the following

k function without allocations (largest performance improvement, saved 100 second )
specifying type of elements when constructing soln(second largest improvement, saved 60 second)
creating a fixed sized array and passing the views when constructing blocks(reduced allocations, but not much time, saved 5 seconds)

Thank you all @lmiq @maltezfaria @jling !

With this recommendations my non-parallel time is 66 second, and my current parallel version time is 42 seconds(4 threads), Is it the best I can accomplish with parallelism? With 4 threads I would expect it to run in aproximately 20 seconds, Do you see something I can improve in my current code? Here is my current version

function k(X,x,sigma)
    s = zero(eltype(x))
    @simd for i in eachindex(x,X)
            s += (X[i] - x[i])^2
        end
    return exp(-s/sigma^2)
end
function U_block_in_place(partA,partSol,sigma)
     for i in 1:length(partSol)
         for j in 1:i
             partA[j,i]=k(partSol[i],partSol[j],sigma)
            end
        end
end

function full_block_in_place(partA,partSol1,partSol2,sigma)
     for i in 1:length(partSol2)
          for j in 1:length(partSol1)
              partA[j,i]=k(partSol1[j],partSol2[i],sigma)
        end
    end
end

function K_com_inplace(solucionesn,sigma,iOO,iOF,iFO,iFF)
    A=zeros(length(solucionesn),length(solucionesn))
    D1=@spawn @views U_block_in_place(A[iOO,iOO],solucionesn[iOO],sigma)
    D2=@spawn @views U_block_in_place(A[iOF,iOF],solucionesn[iOF],sigma)
    D3=@spawn @views U_block_in_place(A[iFO,iFO],solucionesn[iFO],sigma)
    D4=@spawn @views U_block_in_place(A[iFF,iFF],solucionesn[iFF],sigma)
    D5=@spawn @views full_block_in_place(A[iOO,iOF],solucionesn[iOO],solucionesn[iOF],sigma)
    D6=@spawn @views full_block_in_place(A[iOF,iFO],solucionesn[iOF],solucionesn[iFO],sigma)
    D7=@spawn @views full_block_in_place(A[iFO,iFF],solucionesn[iFO],solucionesn[iFF],sigma)
    D8=@spawn @views full_block_in_place(A[iOO,iFO],solucionesn[iOO],solucionesn[iFO],sigma)
    D9=@spawn @views full_block_in_place(A[iOF,iFF],solucionesn[iOF],solucionesn[iFF],sigma)
    D10=@spawn @views full_block_in_place(A[iOO,iFF],solucionesn[iOO],solucionesn[iFF],sigma)
    
    rD1=fetch(D1)
    rD2=fetch(D2)
    rD3=fetch(D3)
    rD4=fetch(D4)
    rD5=fetch(D5)
    rD6=fetch(D6)
    rD7=fetch(D7)
    rD8=fetch(D8)
    rD9=fetch(D9)
    rD10=fetch(D10)
    return Symmetric(A)
end

soln=Matrix{Float64}[]
for i in 1:10000
    append!(soln,[rand(50,50)])
end
iOO=1:2550
iOF=2551:5026
iFO=5027:7530
iFF=7531:10000
K,K0= K_com_inplace(soln,1.0,iOO,iOF,iFO,iFF)

sgaure · October 7, 2022, 4:35pm

I tried running the program under LinuxPerf:

@pstats K,K0= K_com_inplace(soln,1.0,iOO,iOF,iFO,iFF)

I get between 70 and 80 % backend stalled cycles with more than 1 thread, i.e. the cpu waits for resources. With 1 thread it has 30% stalls. Probably it’s waiting for memory, I haven’t looked deeper. I.e. there is so much read/store in memory that it can’t keep up when you run in parallel. On my cpu (A Ryzen 7) there is very little benefit from more than two threads.

Oscar_Smith · October 7, 2022, 4:45pm

I think U_block_in_place is still far from optimal. You are computing a lot of the same information a lot of times. How does the following do?

using LoopVectorization
function U_block_in_place(partA,partSol,sigma)
    @turbo for i in 1:length(partSol)
        x = partSol[i]
        for j in 1:i
            X = partSol[i]
        s = zero(eltype(partSol))
        for k in eachindex(x, X)
            s += (X[k] - x[k])^2
        end
        partA[j,i] = exp(-s/sigma^2)
        end
end

CarlosContrerasQ12 · October 7, 2022, 5:12pm

@Oscar_Smith I don’t see why, each element of the matric is computed only one time. Also, I tried your code fixing the ends as I thought it would work

but it didn’t and I don’t know why

CarlosContrerasQ12 · October 7, 2022, 5:30pm

@sgaure Do you know if it is slow also that different thread access different pieces of memory? I mean, I don’t think they are trying to access the same arrays stored in memory at the same time

CarlosContrerasQ12 · October 7, 2022, 5:36pm

Update:
If I pass a copy of my arrays as this

function K_com_inplace(solucionesn,sigma,iOO,iOF,iFO,iFF)
    A=zeros(length(solucionesn),length(solucionesn))
    D1=@spawn U_block_in_place(@view(A[iOO,iOO]),deepcopy(solucionesn[iOO]),sigma)
    D2=@spawn U_block_in_place(@view(A[iOF,iOF]),deepcopy(solucionesn[iOF]),sigma)
    D3=@spawn U_block_in_place(@view(A[iFO,iFO]),deepcopy(solucionesn[iFO]),sigma)
    D4=@spawn U_block_in_place(@view(A[iFF,iFF]),deepcopy(solucionesn[iFF]),sigma)
    D5=@spawn full_block_in_place(@view(A[iOO,iOF]),deepcopy(solucionesn[iOO]),deepcopy(solucionesn[iOF]),sigma)
    D6=@spawn full_block_in_place(@view(A[iOF,iFO]),deepcopy(solucionesn[iOF]),deepcopy(solucionesn[iFO]),sigma)
    D7=@spawn full_block_in_place(@view(A[iFO,iFF]),deepcopy(solucionesn[iFO]),deepcopy(solucionesn[iFF]),sigma)
    D8=@spawn full_block_in_place(@view(A[iOO,iFO]),deepcopy(solucionesn[iOO]),deepcopy(solucionesn[iFO]),sigma)
    D9=@spawn full_block_in_place(@view(A[iOF,iFF]),deepcopy(solucionesn[iOF]),deepcopy(solucionesn[iFF]),sigma)
    D10=@spawn full_block_in_place(@view(A[iOO,iFF]),deepcopy(solucionesn[iOO]),deepcopy(solucionesn[iFF]),sigma)
    
    rD1=fetch(D1)
    rD2=fetch(D2)
    rD3=fetch(D3)
    rD4=fetch(D4)
    rD5=fetch(D5)
    rD6=fetch(D6)
    rD7=fetch(D7)
    rD8=fetch(D8)
    rD9=fetch(D9)
    rD10=fetch(D10)
    return Symmetric(A)
end

time reduces to 34 seconds due to not accessing the same arrays at the same times. Thanks @sgaure

sgaure · October 7, 2022, 5:51pm

Depending on the CPU-, cache-, and memory-architecture, it may be beneficial to read the same memory at the same time from several threads, but this isn’t easy to control. One more thing. At least on my box the total memory bandwidth seems to be the limit. Doing eight threads with Float32 almost doubles the speed due to less memory use compared to Float64, but again this depends on the memory architecture and stuff. And, of course, it may not be acceptable with the lower precision.

pitsianis · October 8, 2022, 8:14am

You are calculating the Gaussian of the squared Euclidean distance between two sets of points. If the dimension of the points is not too small, it might be beneficial to use a matrix multiplication approach that will benefit from the available high performance matrix-matrix multiplications.

The complete mxn squared distance matrix D2 between mxd points A and nxd points B is

D2 = sum(A .^2, dims=2) .+ sum(B.^2, dims=2)' - 2 .* A * B'

The above should be faster, even for small d.

DNF · October 8, 2022, 8:28am

And even if you do use @inbounds, you only need one:

No need to put it in three places. Just keep one of those, it doesn’t matter which.

Topic		Replies	Views
Parallelization on the CPU isn't effective General Usage	19	524	November 19, 2021
A question about parallel performance in multithreading Performance question , performance , multithreading	10	647	May 12, 2022
Multi-threading on a 2 CPU system New to Julia multithreading	15	1006	February 2, 2023
Parallel computing of matrix with @threads General Usage parallel , multithreading , threads	3	543	August 4, 2022
Native (oversimplified) Julia gemm implementation Performance proposal	6	1160	August 15, 2019

Parallel implementation is not very effective

Related topics