I have sample code here, where `blockedLoop_5_2_16` vectorizes but `hilbertLoop_5_2_16` doesn't (using Julia 0.4; neither vectorizes on 0.6, issue filed here). Any advice on how to get it to vectorize? What I have tried so far:
- running julia with `--check-bounds=no --math-mode=fast` (needed to get `blockedLoop` to vectorize)
- sprinkling `@simd` on all the loops (I think it only makes a difference on the innermost loop?) (needed to get `blockedLoop` to vectorize)
- updating the kernel to `+=` into the output array (in case the problem is that the compiler can't prove each index is visited only once, hence can't reorder assignments) (no effect)
- lifting the `16*(di - 1)` into the outermost loop and combining with `dioffset` (no effect)
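For anyone who wants to reproduce the check, here is a minimal sketch of how I would test whether a loop vectorized, by looking for vector types in the optimized LLVM IR. It is not my actual kernel: `scaled_add!` and `looks_vectorized` are hypothetical names made up for illustration, and the snippet assumes a recent Julia (0.7 or later, where `code_llvm` lives in `InteractiveUtils` and `occursin` exists) rather than 0.4.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL (Julia 0.7+)

# Hypothetical stand-in kernel, NOT the blockedLoop/hilbertLoop code.
# @inbounds drops bounds checks and @simd allows reordering in the innermost loop.
function scaled_add!(out::Vector{Float64}, a::Vector{Float64},
                     b::Vector{Float64}, s::Float64)
    @inbounds @simd for i in 1:length(out)
        out[i] = a[i] + s * b[i]
    end
    return out
end

# Hypothetical helper: true if the optimized IR mentions a vector type
# such as "<4 x double>", which is a rough sign that the loop vectorized.
function looks_vectorized(f, argtypes)
    ir = sprint(io -> code_llvm(io, f, argtypes))
    return occursin(r"<\d+ x (double|float)>", ir)
end

a = collect(1.0:8.0)
b = ones(8)
out = zeros(8)
scaled_add!(out, a, b, 2.0)

argtypes = (Vector{Float64}, Vector{Float64}, Vector{Float64}, Float64)
println(looks_vectorized(scaled_add!, argtypes))
```

Whether this prints `true` depends on the CPU and the Julia/LLVM version, so treat it as a quick heuristic and read the IR by hand when in doubt.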
For a performance comparison:
blockLoop timing:
0.111497 seconds (1.70 k allocations: 93.875 KB)
0.095680 seconds
hilbertLoop timing:
0.218142 seconds (1.43 k allocations: 81.031 KB)
0.191973 seconds
I ran this on an Ivy Bridge i7 processor (supports AVX instructions).