Thanks! In my first attempt with Tullio, I did not import LoopVectorization. In fact, if we use the @avx macro for this micro example, we immediately get a 4x speedup without any threading. However, that is not what I wanted to test initially (and it is probably not a fair comparison?). If I interpret it correctly, avx looks like the simd clause in OpenMP.
Now with 2 threads and no avx, using Tullio gives me a 1.5x speedup over the baseline, but @threads on the outer loop is slightly faster than that:
Number of threads = 2
base line:
158.373 ms (2 allocations: 61.04 MiB)
@threads on the outer loop:
99.353 ms (14 allocations: 61.04 MiB)
@tullio on the nested loops:
109.440 ms (18 allocations: 61.04 MiB)
@tullio avx on the nested loops:
34.556 ms (17 allocations: 61.04 MiB)
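
For anyone who wants to reproduce the comparison without extra packages, here is a minimal sketch of the loop variants in plain Julia. The kernel (an elementwise sum), the array sizes, and the function names are assumptions for illustration, since the actual benchmark body is not shown in this post. The last variant uses Base's @simd on the inner loop, which is roughly what I had in mind with the OpenMP simd comparison; it is not the same thing as LoopVectorization's macro.

```julia
using Base.Threads: @threads

# Hypothetical kernel: an elementwise sum stands in for the real loop body.

# Baseline: plain nested loops, column-major friendly (inner loop over rows).
function baseline!(C, A, B)
    for j in axes(C, 2)
        for i in axes(C, 1)
            @inbounds C[i, j] = A[i, j] + B[i, j]
        end
    end
    return C
end

# @threads on the outer loop: columns are split across the available threads.
function threaded!(C, A, B)
    @threads for j in axes(C, 2)
        for i in axes(C, 1)
            @inbounds C[i, j] = A[i, j] + B[i, j]
        end
    end
    return C
end

# @threads on the outer loop plus Base's @simd hint on the inner loop.
function threaded_simd!(C, A, B)
    @threads for j in axes(C, 2)
        @inbounds @simd for i in axes(C, 1)
            C[i, j] = A[i, j] + B[i, j]
        end
    end
    return C
end

# Small illustrative inputs (sizes are arbitrary, not the ones I benchmarked).
A = rand(200, 300)
B = rand(200, 300)

C1 = baseline!(similar(A), A, B)
C2 = threaded!(similar(A), A, B)
C3 = threaded_simd!(similar(A), A, B)
```

To run this with two threads, start Julia with `julia -t 2` (or set `JULIA_NUM_THREADS=2`). All three variants write the same result, so any timing difference comes only from how the loops are scheduled and vectorized.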