Does anyone else have trouble parallelizing thousands of fast computations over a large matrix? I'm going to show a Julia version and then a C++ version of the same snippet. For some reason I get much larger speedups from #pragma omp parallel for in C++ than from anything I've tried in Julia…
P1a::Matrix{Float64} = zeros(4, mesh.nt)
P1b::Matrix{Float64} = zeros(4, mesh.nt)
P1c::Matrix{Float64} = zeros(4, mesh.nt)
P1d::Matrix{Float64} = zeros(4, mesh.nt)
# Single-threaded
@time P1!(mesh, P1a, P1b, P1c, P1d, 1:mesh.nt)
#Outputs: 0.781466 seconds (18.03 M allocations: 736.519 MiB, 9.84% gc time, 6.23% compilation time)
# Multithreaded
@time begin
    chunks = Iterators.partition(1:mesh.nt, cld(mesh.nt, Threads.nthreads()))
    @sync for chunk in chunks
        Threads.@spawn P1!(mesh, P1a, P1b, P1c, P1d, chunk)
    end
end
#Outputs: 1.051446 seconds (18.00 M allocations: 735.137 MiB, 8.05% gc time, 15.01% compilation time)
The mesh has `nt` elements (sometimes up to 1M), and each column of P1a/P1b/P1c/P1d is independent, so I split the elements into chunks, send each chunk to its own task with Threads.@spawn, and @sync. It's always somehow slower than the single-threaded version.
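One thing worth noting: both @time outputs above include 6–15% compilation time, so neither number is a clean steady-state measurement. A fairer comparison is to wrap the chunked pattern in a function and time a second call. A minimal self-contained sketch of that, using a toy column-filling kernel in place of P1! (the real mesh and abcd are not reproduced here):

```julia
using Base.Threads

# Toy stand-in for P1!: fills the given columns of A. The real kernel
# calls abcd per node; this just writes a cheap value.
function fill_cols!(A::Matrix{Float64}, chunk)
    for k in chunk, i in 1:size(A, 1)
        A[i, k] = i + k
    end
end

function run_chunked!(A::Matrix{Float64})
    n = size(A, 2)
    chunks = Iterators.partition(1:n, cld(n, nthreads()))
    @sync for chunk in chunks
        @spawn fill_cols!(A, chunk)
    end
end

A = zeros(4, 1_000_000)
run_chunked!(A)        # first call pays compilation
@time run_chunked!(A)  # second call measures steady-state only
```

If the second call is still slower than the serial loop, compilation isn't the culprit and the problem is elsewhere (e.g. allocations in the kernel).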
Here is the P1! function.
function P1!(mesh::MESH,
             P1a::Matrix{Float64},
             P1b::Matrix{Float64},
             P1c::Matrix{Float64},
             P1d::Matrix{Float64},
             chunk)
    for k in chunk
        nds = @view mesh.t[:, k]
        for i in 1:4
            P1a[i, k], P1b[i, k], P1c[i, k], P1d[i, k] = abcd(mesh.p, nds, nds[i])
        end
    end
end
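A detail that may matter here: both timings report ~18M allocations, and allocation-heavy inner loops scale poorly under threads because every task contends on the allocator and GC. Since the body of abcd isn't shown, a quick way to check is @allocated on a single call of the kernel. A hypothetical sketch contrasting an allocating return (a small Vector) with a non-allocating one (a tuple) — abcd_toy and abcd_tuple below are stand-ins, not the actual abcd:

```julia
# Hypothetical stand-in for abcd that heap-allocates a Vector per call.
abcd_toy(p, nds, n) = [1.0, 2.0, 3.0, 4.0]

# Non-allocating alternative: return a tuple (or an SVector from StaticArrays).
abcd_tuple(p, nds, n) = (1.0, 2.0, 3.0, 4.0)

f() = @allocated abcd_toy(nothing, nothing, 1)
g() = @allocated abcd_tuple(nothing, nothing, 1)
f(); g()  # warm up so compilation is not counted
println("Vector version allocates ", f(), " bytes per call")
println("tuple  version allocates ", g(), " bytes per call")
```

If the real abcd allocates per call, making it return a tuple or a StaticArrays SVector should drop the 18M allocations to near zero, which typically helps the threaded version far more than the serial one.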
The same code in C++, parallelized with #pragma omp parallel for, gives me a ~10x speedup.
#pragma omp parallel for                // Each column is independent
for (int k = 0; k < mesh.nt; k++) {     // For each element
    for (int i = 0; i < 4; i++) {       // For each node of the current element
        Eigen::Vector4d abcd = linearBasis3D(mesh.p, mesh.t.col(k), mesh.t(i, k));
        P1a(i, k) = abcd(0);
        P1b(i, k) = abcd(1);
        P1c(i, k) = abcd(2);
        P1d(i, k) = abcd(3);
    }
}
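For what it's worth, the closest Julia analogue of that OpenMP loop is Threads.@threads directly over the columns, which handles the chunking itself. A minimal self-contained sketch with a toy connectivity matrix and kernel (the real mesh and abcd are not reproduced here):

```julia
using Base.Threads

nt = 100_000
t  = rand(1:50, 4, nt)        # toy connectivity, stands in for mesh.t
P1a = zeros(4, nt)
P1b = zeros(4, nt)

# @threads splits 1:nt across threads, like `#pragma omp parallel for`.
@threads for k in 1:nt
    nds = @view t[:, k]
    for i in 1:4
        P1a[i, k] = nds[i]           # toy kernel in place of abcd
        P1b[i, k] = 2.0 * nds[i]
    end
end
```

If even this pattern shows no speedup on the real kernel, the bottleneck is almost certainly inside abcd (allocations or memory bandwidth) rather than in the task-spawning overhead.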
Any help with this would be amazing.