Thanks! In my first attempt with Tullio, I did not import LoopVectorization. In fact, applying the @avx macro to this micro example gives an immediate 4x speedup without any threading. However, that is not what I wanted to test initially (and it is probably not a fair comparison?). If I interpret it correctly, @avx plays a role similar to the simd clause in OpenMP.
Now with 2 threads and no @avx, Tullio gives me about a 1.45x speedup over the baseline, but @threads on the outer loop is slightly faster at about 1.6x:
Number of threads = 2
base line:
  158.373 ms (2 allocations: 61.04 MiB)
@threads on the outer loop:
  99.353 ms (14 allocations: 61.04 MiB)
@tullio on the nested loops:
  109.440 ms (18 allocations: 61.04 MiB)
@tullio avx on the nested loops:
  34.556 ms (17 allocations: 61.04 MiB)
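
For anyone who wants to reproduce the shape of this comparison without extra packages, here is a minimal sketch using only Base. The kernel body, array sizes, and function names are hypothetical, since the actual loop is not shown in this post. Base's @simd stands in for @avx here as the closest analogue to an OpenMP simd clause; it is not the same transformation that LoopVectorization applies, so timings will differ.

```julia
using Base.Threads

# Hypothetical elementwise kernel; the real loop body is not shown in the post.
function kernel_serial!(out, a, b)
    for j in axes(out, 2)
        for i in axes(out, 1)
            out[i, j] = 2 * a[i, j] + b[i, j]
        end
    end
    return out
end

# Same kernel with @threads on the outer loop: columns are split across threads.
function kernel_threads!(out, a, b)
    @threads for j in axes(out, 2)
        for i in axes(out, 1)
            out[i, j] = 2 * a[i, j] + b[i, j]
        end
    end
    return out
end

# Same kernel with @simd on the inner loop (single-threaded vectorization hint).
function kernel_simd!(out, a, b)
    for j in axes(out, 2)
        @inbounds @simd for i in axes(out, 1)
            out[i, j] = 2 * a[i, j] + b[i, j]
        end
    end
    return out
end

# Small assumed sizes, just to check that the variants agree.
a = rand(200, 100)
b = rand(200, 100)
ref = kernel_serial!(similar(a), a, b)
thr = kernel_threads!(similar(a), a, b)
vec = kernel_simd!(similar(a), a, b)
```

Start Julia with `julia -t 2` (or set JULIA_NUM_THREADS=2) to match the 2-thread setting above, and time each function with BenchmarkTools' @btime to get numbers comparable in form to the ones listed.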