Benchmarking with @time, @btime and subsequent runs return shorter execution time

In my limited experience, 4% of time spent in GC is not necessarily a sign of a problem, but it may be. I had a similar situation, and in my case I eventually found where those allocations were occurring and fixed them, which made the threaded version much better. Ideally, the performance-critical parts of the code should not allocate anything.

I would try to track down those allocations and make sure they are strictly necessary.
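As a rough sketch of what I mean by tracking allocations, you can compare the bytes allocated by two versions of a kernel with `@allocated` from Base. The functions `sum_products_alloc`, `sum_products_noalloc` and `measure` below are hypothetical names I made up for illustration; they are not from your code, and the exact byte counts will depend on your Julia version.

```julia
# Hypothetical kernels, used only for illustration.

# Allocates a temporary array (x .* y) on every call.
sum_products_alloc(x, y) = sum(x .* y)

# Same result without the temporary: accumulate in a loop.
function sum_products_noalloc(x, y)
    s = zero(eltype(x))
    @inbounds for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    return s
end

# Measure inside a function so that global-scope effects are not counted.
function measure(f, x, y)
    f(x, y)                      # warm-up call, so compilation is excluded
    return @allocated f(x, y)    # bytes allocated by the second call
end

x = rand(1000)
y = rand(1000)

bytes_alloc = measure(sum_products_alloc, x, y)
bytes_noalloc = measure(sum_products_noalloc, x, y)

println("allocating version:     ", bytes_alloc, " bytes")
println("non-allocating version: ", bytes_noalloc, " bytes")
```

If the second number is much smaller than the first, the temporary array was the source of the allocations. For a larger code base, starting Julia with `--track-allocation=user` gives a per-line report, which is usually the quicker way to find the lines responsible.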

Take a look at this thread: Track memory usage