Why is this taking so much time on multiple threads?
Running Pkg.test("MyModule") without coverage takes approximately the same time for every number of threads (around 2 minutes).
When using all 16 threads of my CPU, the load on every thread is at 100% pretty much the whole time.
Running with code coverage inserts a lot of loads and stores to increment the coverage counters. From my understanding, all threads load from and store to the same locations (so it’s racy). When one core writes to a cache line that another core also holds in its cache, that copy is invalidated and the other core has to re-fetch the line (which can be expensive) to maintain cache coherence. Maybe that is what you are seeing. Note that this is just a theory.
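To get a feel for the effect, here is a rough sketch in which every task increments one shared counter. It is not how the coverage instrumentation is implemented: `count_shared` is a made-up name, and it uses an atomic so the result is well defined, whereas the coverage counters (as far as I understand) are plain loads and stores. The cache line still bounces between cores in the same way.

```julia
using Base.Threads

# Hypothetical stand-in for a coverage counter: every task bumps the same
# memory location, so the cache line holding it moves between cores.
function count_shared(iterations_per_task::Int, ntasks::Int)
    counter = Atomic{Int}(0)
    @sync for _ in 1:ntasks
        Threads.@spawn for _ in 1:iterations_per_task
            atomic_add!(counter, 1)   # contended read-modify-write
        end
    end
    return counter[]
end

count_shared(1_000, nthreads())   # warm up so compilation is not timed
seconds = @elapsed count_shared(1_000_000, nthreads())
println("threads = ", nthreads(), ", seconds = ", seconds)
```

Starting Julia with different `--threads` values and comparing the timings should show whether more threads help or hurt here on your machine.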
Is there a way to make this faster in the future? For example, every thread could collect its own coverage data, and then all the data could be merged once at the end.
This would make testing much faster for programs that scale very well with the number of threads.
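Just to illustrate what I mean, here is a minimal sketch of the idea, not a proposal for the actual implementation. The names `count_per_task`, `nlines` and `hits_per_task` are made up for this example: each task counts hits in its own private vector, and the vectors are summed once after all tasks have finished.

```julia
using Base.Threads

# Each task keeps private per-line hit counts, so no memory is shared
# while counting. The only combined step is the final sum.
function count_per_task(nlines::Int, hits_per_task::Int, ntasks::Int)
    tasks = map(1:ntasks) do _
        Threads.@spawn begin
            local_counts = zeros(Int, nlines)   # private to this task
            for i in 1:hits_per_task
                local_counts[mod1(i, nlines)] += 1
            end
            local_counts
        end
    end
    return reduce(+, fetch.(tasks))   # merge once at the end
end

println(count_per_task(4, 8, 3))
```

The merge costs one pass over the counters per task, which should be negligible compared to the test run itself.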