Precompilation of complex code does not make time to first model step faster

No, see the explanation in Invalidations findings (from a GMT case) - #32 by tim.holy. If you use ascend (see the SnoopCompile docs), you can see how methods used in your workload ultimately trace back to num_threads. That invalidation is very expensive: it accounts for more than 7s of compilation time. (The other one is negligible.)
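For anyone who wants to reproduce that kind of trace, here is a rough sketch of the workflow as I understand it from the SnoopCompile docs. `MyPackage` is a hypothetical stand-in for whatever package load triggers the invalidations, the macro name depends on your SnoopCompile version, and the choice of which tree and backedge to inspect is only an assumption for illustration.

```julia
# Sketch only: `MyPackage` is a hypothetical placeholder, not a real package.
using SnoopCompileCore

# Record invalidations triggered by loading the package.
# `@snoopr` is the SnoopCompile 2.x name; newer releases call it `@snoop_invalidations`.
invalidations = @snoopr using MyPackage

# Load the analysis code only after recording, so it does not pollute the measurement.
using SnoopCompile
trees = invalidation_trees(invalidations)

# `ascend` requires Cthulhu to be loaded.
using Cthulhu

# Assumption: the last tree is the one of interest and it has entries in `backedges`.
# Some trees only have `mt_backedges`, in which case this indexing will fail.
tree = trees[end]
root = tree.backedges[end]

# Opens an interactive menu for walking up the callers of the invalidated method.
ascend(root)
```

This needs an interactive REPL session, since `ascend` presents a menu that you navigate by hand.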

I think I saw @Elrod say somewhere that the CPUSummary.num_threads method might be eliminated?