Here are my numbers with BLAS.set_num_threads(1):

@btime test1(1)    # 1.139 s (3566 allocations: 569.10 MiB)
@btime test1(2)    # 614.184 ms (2029 allocations: 565.13 MiB)
@btime test1(5)    # 308.655 ms (1080 allocations: 553.58 MiB)
@btime test1(10)   # 302.716 ms (738 allocations: 534.46 MiB)
@btime test1(20)   # 279.247 ms (526 allocations: 496.25 MiB)
@btime test2()     # 1.141 s (502 allocations: 572.62 MiB)
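Scales way better! test1(1) to test1(5) drops from 1.139 s to about 309 ms (roughly 3.7x), after which it levels off at around 280 to 300 ms for test1(10) and test1(20). test2() takes about the same time as test1(1).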