Thanks for investigating this issue further, Prof. Krysl. Sorry, I have not found time to come back to the issue linked by Kristoffer yet (but I am still following the thread). From a performance perspective, assembly on a single thread is more or less compute bound. Parallelizing assembly with a small number of threads should not be a big issue (given sufficient memory bandwidth and cache). However, with more threads you increase the pressure on all memory lanes. Here I still think it is a mixture of cache/bandwidth issues (i.e. bad or even conflicting cache access patterns, plus a memory bus that cannot keep up with the CPU's read/write accesses) and frequency boosting (i.e. at lower total load each core runs at a higher frequency). For some discussion I highly recommend the WorkStream paper (dx.doi.org/10.1145/2851488), because we basically reproduce Figure 4 from that paper. However, take this with a grain of salt, as I still have to confirm everything in more detailed benchmarks.
Also, I think I solved the problem with the tasks: I need to start (N+1) threads in order to run N tasks. Apparently, one thread needs to remain the main one to carry the spine of the computation. If I do this, the tasks finish in more or less the same time (there is still some variation, on the order of 10%, but nothing like the huge differences between fast and slow tasks before).
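To make the setup concrete, here is a minimal sketch of what I mean (assemble_chunk! and buffers are hypothetical placeholders, not my actual code): start Julia with N+1 threads, e.g. julia --threads=5 for N = 4 assembly tasks, and then

using Base.Threads
nelements = 8_000_000            # hypothetical problem size
N = nthreads() - 1               # reserve one thread for the main task
chunks = collect(Iterators.partition(1:nelements, cld(nelements, N)))
# assemble_chunk! and buffers stand in for the per-task assembly work:
tasks = map(1:N) do i
    @spawn assemble_chunk!(buffers[i], chunks[i])
end
foreach(wait, tasks)             # the main task only coordinates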
This is the speed-up on Firenze (Windows 10) with tasks.
Alas, I spoke too soon: the problem of the slow tasks is still not solved. On Horntail, even when starting N+1 (or even N+8) threads for N tasks, some of the tasks are still delayed by a significant factor (100-200%) relative to others.
I see similar irregularities in the time consumed per task (a random subset of the tasks takes much longer than the others, often double the time) on an Apple Mac M2 Ultra. Here is the versioninfo() output:
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
CPU: 24 × Apple M2 Ultra
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 1 on 16 virtual cores
I don’t know what it is, but I suspect task scheduling. An issue has been filed.
Just to be clear here, and to avoid potential confusion: your @threads-based implementation also creates tasks under the hood. Roughly speaking, @threads is the same as dividing the iteration interval into nthreads() parts and then @spawning a task per chunk (i.e. one thread runs one task corresponding to one chunk).
AFAICS, you’re trying to do the same in your “task variant”, i.e. chunking with chunks and @spawning the tasks manually.
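In other words, a rough sketch of the equivalence (simplified; the actual @threads :dynamic implementation differs in the details):

using Base.Threads
# Split the range into nthreads() chunks and spawn one task per chunk,
# which is approximately what @threads does under the hood:
function threads_like(f, range)
    chunksize = cld(length(range), nthreads())
    tasks = map(Iterators.partition(range, chunksize)) do chunk
        @spawn foreach(f, chunk)
    end
    foreach(wait, tasks)
end
# threads_like(i -> work(i), 1:n) then corresponds roughly to
# @threads for i in 1:n; work(i); end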
Am I understanding correctly that the “thread variant” behaves correctly but the “task variant” does not? Given the above, if you see (vastly) different behavior between the two variants, you should check what you’re doing differently in the two cases. For example, you seem to be running (i.e. allocating)
_a = _task_local_assembler(a, b)
on each task in the “task variant” but outside of the @threads loop in the “threads variant”.
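Schematically, the difference looks like this (only the _task_local_assembler call is from your code; chunks, a, b, and assemble! are placeholders):

# “threads variant”: assembler allocated once, outside the loop
_a = _task_local_assembler(a, b)
Threads.@threads for chunk in chunks
    assemble!(_a, chunk)
end

# “task variant”: each spawned task allocates its own assembler
tasks = map(chunks) do chunk
    Threads.@spawn begin
        _a = _task_local_assembler(a, b)   # per-task allocation
        assemble!(_a, chunk)
    end
end
foreach(wait, tasks)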
At the risk of coming off as nitpicky (but to be more precise for beginners who might be reading this): Both implementations are creating tasks which then run on multiple threads. Hence, they are both “task-based” and “threads-based”.
What is (likely) meant here is that one implementation uses @spawn whereas the other uses @threads.
Thank you for honing my description (nitpicky is good!). It is as you describe.
In the “spawn” case I create the tasks explicitly, whereas the “threads” variant does whatever it does under the hood to process its chunks.
The parallel speedups of the assembly on the Mac M2 Ultra with 2, 4, 8, and 16 assembly tasks, for a mesh of 8 million finite elements:
2: 1.9, 4: 3.49, 8: 6.53, 16: 11.08.
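Dividing the speedups by the task counts, the parallel efficiency drops from roughly 95% at 2 tasks to about 69% at 16:

ntasks     = [2, 4, 8, 16]
speedup    = [1.9, 3.49, 6.53, 11.08]
efficiency = speedup ./ ntasks     # ≈ [0.95, 0.87, 0.82, 0.69]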