Hi @francispoulin, thanks for reaching out!
> Do you know how high of efficiency we can get with multi-threading?
Depends how you define efficiency. In the JuliaCon2021 workshop you refer to, we defined efficiency by the effective memory throughput metric `T_eff`. Looking at memory-bound configurations (mostly the case when solving PDEs on modern, many-core hardware): 1) GPUs have a much larger peak memory throughput `T_peak` than CPUs, and 2) computations utilising over 90% of the peak memory bandwidth are achievable on the GPU without restrictive and non-portable optimisations.
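For reference, a minimal sketch of how such a `T_eff` measurement can be computed (the array counts and names here are illustrative assumptions — one array read and one written per iteration — not the workshop's exact accounting):

```julia
# Effective memory throughput T_eff = A_eff / t_it, where A_eff is the memory
# that must be transferred per iteration in the ideal case (assumed here:
# one array read and one array written, in double precision).
nx, ny = 4096, 4096
t_it   = 0.01                                 # measured time per iteration [s]
A_eff  = 2 * nx * ny * sizeof(Float64) / 1e9  # effective memory access [GB]
T_eff  = A_eff / t_it                         # effective throughput [GB/s]
println("T_eff = ", round(T_eff, digits=1), " GB/s")
```

Comparing `T_eff` against the hardware's `T_peak` then gives the fraction of peak discussed below.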
On the CPU, achieving `T_eff` rates close to `T_peak` seems more challenging than on the GPU. Besides `Base.Threads.@threads`, I am aware of `KernelAbstractions.jl` (amongst certainly others) for exposing multi-threading support. Multi-threaded (and beyond, e.g. AVX) CPU applications show implementation- and problem-specific variability in efficiency (in terms of `T_eff`) that I have not yet further investigated and understood.
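To make the CPU side concrete, here is a minimal multi-threaded 2D linear-diffusion step using `Base.Threads.@threads` (a hypothetical sketch, not the workshop code — the workshop solves nonlinear diffusion with ParallelStencil.jl); timing such a kernel is what yields `T_eff` numbers like the ones quoted below:

```julia
using Base.Threads

# One explicit time step of 2D linear diffusion, parallelised over rows.
function diffusion_step!(T2, T, D, dt, dx, dy)
    nx, ny = size(T)
    @threads for iy in 2:ny-1
        for ix in 2:nx-1
            @inbounds T2[ix,iy] = T[ix,iy] + dt*D*(
                (T[ix+1,iy] - 2T[ix,iy] + T[ix-1,iy])/dx^2 +
                (T[ix,iy+1] - 2T[ix,iy] + T[ix,iy-1])/dy^2 )
        end
    end
    return
end

nx = ny = 512
T  = rand(nx, ny); T2 = copy(T)
D, dx, dy = 1.0, 1.0, 1.0
dt = min(dx, dy)^2 / D / 4.1            # explicit stability criterion
t  = @elapsed diffusion_step!(T2, T, D, dt, dx, dy)
T_eff = 2 * nx*ny * sizeof(Float64) / 1e9 / t   # GB/s (1 read + 1 write array)
```

Run with e.g. `julia -t 4` to use 4 threads; note that a single timing of a small kernel like this is noisy, so in practice one times many iterations.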
So, given the higher `T_peak` of GPUs compared to CPUs and the fact that getting `T_eff` values close to `T_peak` is possible there, we have for now prioritised the GPU backend in ParallelStencil and in the workshop.
Workshop recall: the presented 2D “stencil-compiler” implementations solving nonlinear diffusion with ParallelStencil.jl deliver about 9 GB/s on the `Base.Threads.@threads` backend using 4 threads (one thread per core) on a 4-core Intel i5 (peak memory bandwidth 25 GB/s, thus ~36% of `T_peak`), and 780 GB/s on an Nvidia Tesla V100 16GB PCIe GPU (peak memory throughput 840 GB/s, thus ~92% of `T_peak`).
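As a quick sanity check, the quoted fractions follow directly from the measured and peak throughputs:

```julia
# Fractions of peak memory throughput reached in the workshop measurements.
cpu_frac = 9 / 25     # Intel i5: measured 9 GB/s of 25 GB/s peak  -> ~36%
gpu_frac = 780 / 840  # Tesla V100: measured 780 GB/s of 840 GB/s  -> ~92%
```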