I get 1.95 sec, something like a 24% speedup for the matmul, if I run in the REPL (and add @simd). Partial timing here:
@time a = matgen(n)
0.005693 seconds (2 allocations: 17.166 MiB)
@time c = matmul(n, a, b);
1.932954 seconds (2 allocations: 17.166 MiB)
So it’s worth AOT-compiling at least that one. Also, the for loops need to be split in two, i.e. a combined two-variable for loop (for i in 1:n, j in 1:n) doesn’t take @simd, but maybe it should?
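
To illustrate what I mean by splitting the loops, here is a minimal sketch. It is not the exact benchmark code: the name matmul_simd, the signature and the loop order are my own assumptions. The point is only that @simd goes on a plain single-variable innermost for loop, with the outer loops written separately.

```julia
# Hypothetical sketch, not the original benchmark's matmul.
# @simd is applied only to the innermost, single-variable loop.
function matmul_simd(a::Matrix{Float64}, b::Matrix{Float64})
    m, k = size(a)
    k2, n = size(b)
    k == k2 || throw(DimensionMismatch("inner dimensions must agree"))
    c = zeros(Float64, m, n)
    for j in 1:n              # outer loops are ordinary nested loops
        for l in 1:k
            @inbounds blj = b[l, j]
            # column-major friendly: i walks down a column of a and c
            @simd for i in 1:m
                @inbounds c[i, j] += a[i, l] * blj
            end
        end
    end
    return c
end
```

Calling it as c = matmul_simd(a, b) should agree with a * b up to floating-point rounding. Whether @simd actually helps will depend on the machine and the Julia version, so it is worth timing both variants.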
And I’m unclear why I get 2 allocations each, not one?