To iterate on Benny’s comment about Amdahl’s law. In your benchmarks for test_gen_keys2 the runtime with 1 thread is roughly 30 seconds with a 15% gc time. Assuming the gc is not parallelized at all and the rest is perfectly parallelized, then 4 threads would give a theoretical time of 30 * 0.15 + 30 * 0.85 / 4 = 10.875 seconds, which is not too far of from the 13 seconds you get. Similarly 16 threads would give a theoretical time of 30 * 0.15 + 30 * 0.85 / 16 = 6.09375, which again is not too far of from your 9.5 seconds.
One alternative to BitIntegers.jl could be to use BigInt together MutableArithmetic.jl to reduce the number allocations, that would however likely require rather intrusive changes to the code.
The fmpz type in the FLINT library works like this. I do however not think this can be fully leveraged in Julia. it would likely require some sort of finalizer to check if an allocation was made or not, and finalizers can only be attached to mutable types and those are tracked by the gc (in principle one could possibly insert the finalizers inline with some escape analysis, I think Julia can do something like that but I don’t know how flexible it is).