Parallelization in Julia vs C++

Do you also tend to have problems parallelizing thousands of fast computations over a large matrix? I'll show a Julia version and then a C++ version of the same code snippet. For some reason I get much larger performance benefits from parallelization with #pragma omp parallel for in C++ than I can get in Julia…

    P1a::Matrix{Float64} = zeros(4, mesh.nt)
    P1b::Matrix{Float64} = zeros(4, mesh.nt)
    P1c::Matrix{Float64} = zeros(4, mesh.nt)
    P1d::Matrix{Float64} = zeros(4, mesh.nt)

    # Singlethread
    @time P1!(mesh, P1a, P1b, P1c, P1d, 1:mesh.nt)
#Outputs: 0.781466 seconds (18.03 M allocations: 736.519 MiB, 9.84% gc time, 6.23% compilation time)

    # Multithreaded
    @time begin
        chunks = Iterators.partition(1:mesh.nt, cld(mesh.nt, Threads.nthreads()))
        @sync for chunk in chunks
            Threads.@spawn P1!(mesh, P1a, P1b, P1c, P1d, chunk)
        end
    end
#Outputs: 1.051446 seconds (18.00 M allocations: 735.137 MiB, 8.05% gc time, 15.01% compilation time)

The mesh has ‘nt’ elements (sometimes around 1M), and each column of P1a, P1b, P1c, P1d is independent, so I batch the elements in chunks, send each chunk to a thread, and @sync. It’s always somehow slower than the single-threaded version.
Here is the P1! function.

function P1!(mesh::MESH, 
            P1a::Matrix{Float64},
            P1b::Matrix{Float64},
            P1c::Matrix{Float64},
            P1d::Matrix{Float64},
            chunk)
    
    for k in chunk
        nds = @view mesh.t[:, k]
        for i in 1:4
            P1a[i, k], P1b[i, k], 
            P1c[i, k], P1d[i, k] = abcd(mesh.p, nds, nds[i])
        end
    end

end

The same code in C++, parallelized with #pragma omp parallel for, gives me 10x speedups.

#pragma omp parallel for // Each column is independent 
for(int k = 0; k<mesh.nt; k++){ // For each element
	for(int i = 0; i<4; i++){ // For each node of current element
		Eigen::Vector4d abcd = linearBasis3D(mesh.p, mesh.t.col(k), mesh.t(i, k));
			
		P1a(i, k) = abcd(0);
		P1b(i, k) = abcd(1);
		P1c(i, k) = abcd(2);
		P1d(i, k) = abcd(3);
	}
}

Any help with this would be amazing.

As far as I can see, the function P1! itself should not allocate anything, so you should probably track down those allocations in the single-threaded call first.

What is the definition of your MESH type? What is abcd?

MESH is a mutable struct with typed fields.

mutable struct MESH

    # Order | 1 - Linear, 2 - Quadratic
    order::Int

    # Node coordinates
    p::Matrix{Float64}

    # Connectivity list
    t::Matrix{Int32}
    
    # Surface triangles
    surfaceT::Matrix{Int32}

    # Edges (updated when doing second order meshes)
    edges::Vector{Int32} # Gmsh label of each edge midpoint (2nd order mesh node)
    
    # Map the global edge label to a local edge from 1 to 'ne'
    edge2localMap::Vector{Int32} 

    # Map mesh volume elements to the parent Cell/Volume ID
    elementID::Vector{Int32}

    # Elements inside material
    InsideElements::Vector{Int32}
    
    # Nodes inside material
    InsideNodes::Vector{Int32}
    
    # Volume of each mesh element
    VE::Vector{Float64}
    
    # Area of each surface mesh element
    AE::Vector{Float64} 

    # Normal of each surface triangle
    normal::Matrix{Float64}
    
    nv::Int32           # Number of nodes
    nt::Int32           # Number of elements
    ns::Int32           # Number of surface elements
    ne::Int32           # Number of edges

    nInside::Int32      # Number of elements for the inside cells
    nInsideNodes::Int32 # Number of nodes for the inside cells

    shell_id # Id of the surface boundaries of the bounding shell

    # Constructor
    MESH() = new()
end

And abcd is

# 3D linear basis function
function abcd(p::Matrix{Float64},nodes::AbstractVector,nd::Int32)
    n1,n2,n3 = nodes[nodes .!= nd]
    x = @view p[1, [nd, n1, n2, n3]]
    y = @view p[2, [nd, n1, n2, n3]]
    z = @view p[3, [nd, n1, n2, n3]]

    r::Vector{Float64} =[1.0 x[1] y[1] z[1];
                         1.0 x[2] y[2] z[2];
                         1.0 x[3] y[3] z[3];
                         1.0 x[4] y[4] z[4]]\[1;0;0;0]

    return r[1],r[2],r[3],r[4] # a b c d
end # Basis function coef.

It’s the abcd() function call that likely allocates memory. I don’t think that should be the cause, though.

Probably it is. Try writing a non-allocating version of that; I’m sure your serial and parallel versions will be much, much faster.


For 4x4 matrices, you definitely want to be using StaticArrays.jl

Yeah, every line on the ABCD function allocates. That’s your major bottleneck.

What you’re seeing in Julia is pretty common when the per-task work is small and the threaded version is paying more in overhead than it gets back.

A few things stand out:

Your Julia code is not really doing the same thing as #pragma omp parallel for. With OpenMP, the loop is threaded directly with very low overhead. In your Julia version, you are creating tasks with Threads.@spawn, partitioning chunks, and syncing them. That is usually heavier, especially for lots of fast computations.

The big clue is here:

18.03 M allocations: 736 MiB

That is a huge amount of allocation for something you want to scale well. If the threaded version still allocates that much, threads may just be fighting over memory bandwidth and GC overhead instead of speeding up compute.

A more Julia-like first step would be to thread the outer loop directly:

Threads.@threads for k in 1:mesh.nt
    nds = @view mesh.t[:, k]
    for i in 1:4
        P1a[i, k], P1b[i, k], P1c[i, k], P1d[i, k] = abcd(mesh.p, nds, nds[i])
    end
end

That is much closer to what OpenMP is doing.

A few other things worth checking:

  • Make sure abcd is type-stable.
  • Try to reduce allocations inside abcd and inside the loop.
  • @view mesh.t[:, k] may still not be ideal if it causes overhead in a very hot loop.
  • If each iteration is very fast, threading may not help unless you increase work per iteration.
  • Since you write by columns and Julia arrays are column-major, your memory access pattern is at least reasonable.

Honestly, the main issue does not look like “Julia threading is bad,” but more that this version is allocation-heavy and uses task spawning where loop threading would fit better. I’d look at allocations first, then compare Threads.@threads against your current @spawn approach.
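One quick way to chase the allocations down is Base’s `@allocated`, measured inside a function after a warm-up call so compilation and global-scope effects don’t skew the count. A minimal sketch with made-up stand-in functions (not the real `abcd`):

```julia
# Hypothetical stand-ins for a hot inner function: `work_alloc` builds a
# temporary array on each call, `work_noalloc` does not.
work_alloc(x) = sum(x .* 2.0)     # x .* 2.0 allocates a temporary vector
work_noalloc(x) = 2.0 * sum(x)    # scalar math only, no temporaries

# Measure inside a function so the count reflects only the call itself.
function measured_bytes(f, x)
    f(x)                          # warm up (compile) first
    return @allocated f(x)
end

x = rand(1000)
println("allocating:     ", measured_bytes(work_alloc, x), " bytes/call")
println("non-allocating: ", measured_bytes(work_noalloc, x), " bytes/call")
```

Applying the same measurement to one call of the real inner function shows immediately whether it is the source of the 18 M allocations.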


Note that the “gc time” is the time spent in actual gc work. For multithreaded programs that’s misleading because of how gc is implemented. When a thread triggers gc it has to wait for the other threads entering a “gc safe point”. This waiting time is not counted as “gc time”. That is, the garbage collection overhead in your threaded loop is probably much higher than the reported 8%. Generally, heavy allocations in parallel loops are devastating for performance in julia.
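To see the effect in isolation, compare a threaded loop that allocates every iteration with one that doesn’t. A toy column-sum stands in for the real kernel here:

```julia
# Toy example: per-iteration allocation vs. none inside a threaded loop.
# A[:, j] copies the column (one allocation per iteration);
# @view A[:, j] does not. Both produce identical results, but under many
# threads the allocating version also pays the hidden gc-safepoint
# synchronization cost described above.
function colsums_alloc!(out, A)
    Threads.@threads for j in axes(A, 2)
        out[j] = sum(A[:, j])          # allocates a fresh vector each time
    end
end

function colsums_noalloc!(out, A)
    Threads.@threads for j in axes(A, 2)
        out[j] = sum(@view A[:, j])    # view: no allocation
    end
end

A = rand(100, 10_000)
out1, out2 = zeros(10_000), zeros(10_000)
colsums_alloc!(out1, A)
colsums_noalloc!(out2, A)
out1 ≈ out2   # same answer, ~10_000 fewer allocations in the second
```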

As noted by many, it’s likely the abcd method. E.g. the nodes .!= nd allocates a Vector{Bool}, and in your @view p[1, [nd, n1, n2, n3]], the inner [nd, n1, n2, n3] is allocated even though the @view may not be. Similarly the r = [...]\[1;0;0;0] allocates a 4x4 matrix and a 4-vector, as well as the resulting 4-vector.

These allocations can be avoided by using StaticArrays.jl. Also, one may avoid the x, y, z by writing e.g. p[2, n2] instead of y[3], though it’s slightly less readable.
I.e. something like:

using StaticArrays
...
    r = @SMatrix[1.0 p[1, nd] p[2, nd] p[3, nd];
                 1.0 p[1, n1] p[2, n1] p[3, n1];
                 1.0 p[1, n2] p[2, n2] p[3, n2];
                 1.0 p[1, n3] p[2, n3] p[3, n3]]\@SVector[1.0, 0.0, 0.0, 0.0]
    return (r...,)

Here’s my crack at abcd. I didn’t see a MWE, so I made up some inputs that seemed to make sense with how the values were used:

using StaticArrays

function abcd2(p::AbstractMatrix, nodes::AbstractVector, nd::Integer)
    snodes = SVector{4}(nodes)
    ninds = vcat(nd, deleteat(snodes, findfirst(snodes .== nd))) # == [nd; nodes[nodes .!= nd]]
    v = SVector{4, eltype(p)}(1, 0, 0, 0)
    M = hcat(one.(v), p[1, ninds], p[2, ninds], p[3, ninds])
    r = M \ v
    return Tuple(r)
end

function abcd3(p::AbstractMatrix, nd::Integer)
    M = vcat(@SMatrix(ones(eltype(p), 1, 4)), SMatrix{3,4}(p)) # [ones(1,4); p]
    v = setindex(zeros(SVector{4, eltype(p)}), 1, nd) # 0 vector except 1 at index nd
    r = M' \ v
    return Tuple(r)
end

p = [i + j + sin(i*j) for i in 1:3, j in 1:4]
nodes = Int32[1, 3, 4, 2]
nd = Int32(3)
julia> using BenchmarkTools

julia> @btime abcd($p, $nodes, $nd)
  329.104 ns (33 allocations: 1.28 KiB)
(-0.3841176455833004, -0.8527952233036965, -0.42684118258960735, 1.080857269683743)

julia> @btime abcd2($p, $nodes, $nd)
  60.825 ns (0 allocations: 0 bytes)
(-0.38411764558330086, -0.8527952233036968, -0.42684118258960735, 1.0808572696837433)

julia> @btime abcd3($p, $nd)
  52.921 ns (0 allocations: 0 bytes)
(-0.38411764558330086, -0.8527952233036965, -0.4268411825896073, 1.080857269683743)

abcd2 is a literal translation while abcd3 recognizes that the nodes input and ensuing permutations are actually redundant (if my assumptions about the inputs are correct – verify it on your actual problem).

Note that it isn’t necessary to convert the return type to Tuple (AbstractVector can be destructured the same way), but I did it here for the sake of matching.

So these are 5-6x faster and save 33 allocations from the GC.


Well, implementing StaticArrays by itself improved the speed of filling the P1 matrices by almost 10x, great! Then I changed the node logic in abcd() so it no longer allocates a Bool vector via n1,n2,n3 = nodes[nodes .!= nd], replacing that with a simple size-4 loop. That helped a bit.

I tested your abcd2() and it was about 2x faster, but it unfortunately gives wrong output.

For those interested, here is the abcd() I’m running currently

# 3D linear basis function
function abcd(p::Matrix{Float64}, nds::AbstractVector, nd::Int32)
    # OLD: n1,n2,n3 = nodes[nodes .!= nd]
    
    # NEW
    n1::Int32 = 0
    n2::Int32 = 0
    n3::Int32 = 0
    for i in nds
        if i != nd 
            if n1 == 0
                n1 = i
            elseif n2 == 0
                n2 = i
            elseif n3 == 0
                n3 = i
            end
        end
    end

    M = @SMatrix [1.0 p[1, nd] p[2, nd] p[3, nd];
                  1.0 p[1, n1] p[2, n1] p[3, n1];
                  1.0 p[1, n2] p[2, n2] p[3, n2];
                  1.0 p[1, n3] p[2, n3] p[3, n3]]

    r::Vector{Float64} = M \ @SArray[1;0;0;0]

    return r[1],r[2],r[3],r[4] # a b c d
end # Basis function coef.

Still, parallelizing the P1 matrix assembly is proving quite difficult in Julia, while a simple #pragma omp parallel for automatically gives me a significant performance boost in C++.

I really like Julia so getting really good at parallel computing would be essential for me. Oh well.

Thank you everyone who contributed so far. I might not have parallelized the computation, but we* did significantly boost the performance.

Maybe I missed it but what’s the definition of linearBasis3D in C++?

I did not upload it yet. Here it is

// FEM basis function
Eigen::Vector4d linearBasis3D(Eigen::Ref<Eigen::MatrixXd> p, Eigen::Vector4i nodes,int nd){
	
	int nds[3] = {0,0,0};		// All other nodes of the element
	{ // Get nodes different than nd
	int n = 0;
	for(int i = 0; i<4; i++){
		if(nodes(i)!=nd){
			nds[n] = nodes(i);
			n++;
		}
	}
	} // Get nodes different than nd

	// Build the matrix equation for the a b c d
	Eigen::Matrix4d M = Eigen::Matrix4d::Zero();
	M.row(0) << 1,p(0,nd),p(1,nd),p(2,nd);
	M.row(1) << 1,p(0,nds[0]),p(1,nds[0]),p(2,nds[0]);
	M.row(2) << 1,p(0,nds[1]),p(1,nds[1]),p(2,nds[1]);
	M.row(3) << 1,p(0,nds[2]),p(1,nds[2]),p(2,nds[2]);

	Eigen::Vector4d b(1.0,0,0,0);
	Eigen::Vector4d r = M.colPivHouseholderQr().solve(b);

	return r;
} // FEM basis function

If you don’t provide a MWE of abcd we can’t help you optimize it. I did my best by making up some inputs, but clearly I did not correctly guess the structure of the inputs.

At the very least, your line r::Vector{Float64} = M \ @SArray[1;0;0;0] should absolutely not use ::Vector{Float64} because that’s counterproductive (it’s causing an allocation unless escape analysis manages to remove it, which isn’t very likely).

What does Threads.nthreads() return? If 1, you only have one thread to work with so multithreading can only slow things down. Start julia with the option -t8 to use 8 threads, for example.

Here is a minimum working example. A “mesh” with 2 elements, 5 nodes. To pass the “test”, the println(f) should give 1.0 when nd == nds[1] and 0.0 on the other 3 nodes.
Again, thanks for the help so far.

using LinearAlgebra, StaticArrays

function abcd(p::Matrix{Float64}, nds::AbstractVector, nd::Int32)
    # n1,n2,n3 = nodes[nodes .!= nd]

    n1::Int32 = 0
    n2::Int32 = 0
    n3::Int32 = 0
    for i in nds
        if i != nd 
            if n1 == 0
                n1 = i
            elseif n2 == 0
                n2 = i
            elseif n3 == 0
                n3 = i
            end
        end
    end

    M = @SMatrix [1.0 p[1, nd] p[2, nd] p[3, nd];
                  1.0 p[1, n1] p[2, n1] p[3, n1];
                  1.0 p[1, n2] p[2, n2] p[3, n2];
                  1.0 p[1, n3] p[2, n3] p[3, n3]]

    r = M \ @SArray[1;0;0;0]

    return r[1],r[2],r[3],r[4] # a b c d
end # Basis function coef.

function main()
	
	p::Matrix{Float64} = [0 1 0 0 -1;
						  0 0 1 0  0;
						  0 0 0 1  0]
	
	t::Matrix{Int32} = [1 1;
					    2 2;
						3 4;
						4 5]

	k = 1
	nds = @view t[:, k]
	a, b, c, d = abcd(p, nds, nds[1])
    for i in 1:4
    	nd = nds[i]
        f = a + b*p[1, nd] + c*p[2, nds[i]] + d*p[3, nds[i]]
        println(f)
    end

end

main()

And yes, I’m calling the julia code from the terminal with --threads=auto and Threads.nthreads() outputs 16.

I think you might benefit from taking a look at OhMyThreads.jl.
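For reference, OhMyThreads exports helpers such as `tforeach`, `tmap`, and the `@tasks` macro that package up the chunk-and-spawn pattern so you don’t hand-roll it. A simplified sketch of the underlying idea in base Julia (my own stand-in, not OhMyThreads’ actual implementation):

```julia
# Rough base-Julia sketch of what a helper like OhMyThreads.tforeach does:
# split the index range into one chunk per task and spawn a task per chunk.
function tforeach_sketch(f, itr; ntasks = Threads.nthreads())
    chunks = Iterators.partition(itr, cld(length(itr), ntasks))
    @sync for chunk in chunks
        Threads.@spawn foreach(f, chunk)
    end
    return nothing
end

out = zeros(1_000)
tforeach_sketch(i -> out[i] = float(i)^2, 1:1_000)
out[10]  # → 100.0
```

The package versions add load-balancing schedulers and nicer ergonomics on top of this.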

My abcd2 appears to give the same results as your abcd (but abcd3 had some incorrect assumptions). But your updated abcd is now allocation-free and appears to be slightly faster (probably because your construction of M is slightly more efficient) so might as well use that. I don’t see any obvious changes that would further improve it, so looks good to me.

I’m not the best person to help with the threading issues, so I’ll leave that to others.

Just a short slightly off-topic comment on this. Indexing of arrays in julia is done with Int, which is Int64 on most architectures. Any other Integer (like Int32) will be passed through Base.to_indices which converts to Int. This is inlined and optimized and typically results in just a single additional instruction.

The same goes for type declarations n1::Int32 = ... etc. Their only effect is that assignments to the variable are converted and a type assertion is added, i.e. n1 = i will be rewritten to n1 = convert(Int32, i)::Int32. I don’t think this contributes to any significant slowdown in your program, though in some cases it may.
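A quick illustration of that conversion:

```julia
# Int32 indices are accepted; Base.to_indices converts them to Int internally.
a = [10.0, 20.0, 30.0]
i32 = Int32(2)
a[i32] == a[2]                         # true
Base.to_indices(a, (i32,)) === (2,)    # true: converted to a plain Int
```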


That is nonsense. Threads.@threads and Threads.@spawn with sync are basically the same thing.

All that being said, julia Tasks and the julia scheduler are somewhat heavier than whatever OpenMP runtime you are probably using (libomp, libgomp, BOLT, …). This is visible as a larger per-task overhead, necessitating fewer, larger tasks, as well as a higher latency for switching between single-threaded and multi-threaded regions.
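A toy illustration of “fewer, larger tasks”: spawn one task per tiny iteration versus one task per chunk. Both give the same result; timing them shows the per-task overhead difference.

```julia
# One task per (tiny) iteration: pays scheduler overhead 10_000 times.
function per_element!(out)
    @sync for k in eachindex(out)
        Threads.@spawn out[k] = sqrt(k)
    end
end

# One task per chunk: the same work, amortized over a few tasks.
function chunked!(out)
    chunks = Iterators.partition(eachindex(out), cld(length(out), Threads.nthreads()))
    @sync for c in chunks
        Threads.@spawn for k in c
            out[k] = sqrt(k)
        end
    end
end

a, b = zeros(10_000), zeros(10_000)
per_element!(a)
chunked!(b)
a == b   # identical results; time the two calls to see the overhead gap
```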

@Rkiefe In order for us to help you, you should post a (ideally simplified) fully self contained piece of code that exhibits the behavior you dislike.

As in:

here is my code file. If I do julia -t auto -e 'include("./foo.jl"); benchmark_single_threaded();' then this prints “… 1.3s …”. If I do julia -t auto -e 'include("./foo.jl"); benchmark_multi_threaded();' then this prints “… using 16 threads. … 1.6s …”

Ideally you cut out unnecessary details from your code. A detail is “definitely not unnecessary” if the relevant phenomenon (parallelization doesn’t help) fails to reproduce in the smaller example.

You will get much better feedback on this forum if we can copy-paste some block of code that reproduces the problematic behavior, and if, ideally, this code is simplified.
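A hypothetical skeleton of such a file (the names and the `work!` body are made up; the real file would call P1! on an actual mesh):

```julia
# foo.jl — hypothetical self-contained benchmark skeleton.
# `work!` stands in for the real P1! kernel.
function work!(out, A, range)
    for k in range
        s = 0.0
        for i in axes(A, 1)
            s += A[i, k]
        end
        out[k] = s
    end
    return nothing
end

function benchmark_single_threaded(; n = 100_000)
    A, out = rand(8, n), zeros(n)
    work!(out, A, 1:n)                        # warm up (compile) first
    @time work!(out, A, 1:n)
    return out
end

function benchmark_multi_threaded(; n = 100_000)
    A, out = rand(8, n), zeros(n)
    println("using $(Threads.nthreads()) threads")
    chunks = collect(Iterators.partition(1:n, cld(n, Threads.nthreads())))
    function go()
        @sync for c in chunks
            Threads.@spawn work!(out, A, c)
        end
    end
    go()                                      # warm up first
    @time go()
    return out
end
```

With something like this, anyone on the forum can run both benchmarks under `julia -t auto` and see the same numbers you do.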