Best practice: threaded sparse matrix assembly

What is the current state of the art in threaded sparse matrix assembly?

We’re currently having an issue where sparse matrix assembly is taking more time than the linear solver (at low resolutions). One way to improve that is to use multithreading.
In C, I used to compute the IJV triplets independently per thread and then merge everything into a single triplet. This was not trivial and rather error-prone.
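
For reference, the per-thread IJV idea could be sketched in Julia roughly as below, using only `SparseArrays` and `Base.Threads`. This is a minimal sketch, not a recommended implementation: the function name `assemble_threaded`, the `ntasks` keyword, and the toy 1D two-node element with local matrix `[1 -1; -1 1]` are all made up for illustration. Buffers are indexed by chunk rather than by `threadid()`, so the result does not depend on which thread runs which task.

```julia
using SparseArrays
using Base.Threads

# Hypothetical toy problem: n nodes on a line, n - 1 two-node elements,
# each contributing the local matrix [1 -1; -1 1]. Requires n >= 2.
function assemble_threaded(n::Int; ntasks::Int = nthreads())
    nel = n - 1
    # Split the element range into at most `ntasks` contiguous chunks.
    chunks = collect(Iterators.partition(1:nel, cld(nel, ntasks)))
    # One private (I, J, V) buffer per chunk, so tasks never share a buffer.
    Is = [Int[] for _ in chunks]
    Js = [Int[] for _ in chunks]
    Vs = [Float64[] for _ in chunks]
    @sync for (c, elems) in enumerate(chunks)
        Threads.@spawn begin
            I, J, V = Is[c], Js[c], Vs[c]
            Ke = ((1.0, -1.0), (-1.0, 1.0))   # local element matrix
            for e in elems
                nodes = (e, e + 1)            # global node indices of element e
                for a in 1:2, b in 1:2
                    push!(I, nodes[a])
                    push!(J, nodes[b])
                    push!(V, Ke[a][b])
                end
            end
        end
    end
    # Merge the per-chunk triplets; `sparse` sums duplicate (i, j) entries.
    return sparse(reduce(vcat, Is), reduce(vcat, Js), reduce(vcat, Vs), n, n)
end

A = assemble_threaded(5)
```

The final `sparse` call is serial here, so for large problems the merge step itself may become the bottleneck.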

I could imagine that there are better ways of dealing with this with Julia tooling, in particular with tools like ExtendableSparse?
If anyone has hints, or links to an MWE of threaded sparse assembly, they would be very welcome.

Cheers!

This may be specific to FinEtools, but perhaps you will find it useful: PetrKryslUCSD/FinEtoolsMultithreading.jl (parallel finite element library) on GitHub.
A paper was published as well: https://www.sciencedirect.com/science/article/pii/S0045782524003323