Vectorization Advice

I have some sample code here, where blockedLoop_5_2_16 vectorizes but hilbertLoop_5_2_16 doesn’t (using julia 0.4; neither vectorizes on 0.6, issue filed here). Any advice on how to get it to vectorize? What I have tried so far:

  • running julia with --check-bounds=no --math-mode=fast (needed to get blockedLoop to vectorize)
  • sprinkling @simd on all the loops (I think it only makes a difference on the innermost loop?) (needed to get blockedLoop to vectorize)
  • updating the kernel to += into the output array (in case the problem is that the compiler can’t prove each index is visited only once, hence can’t reorder assignments) (no effect)
  • lifting the 16*(di - 1) into the outermost loop and combining with dioffset (no effect)
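
For anyone who wants to reproduce the "does it vectorize?" check, here is a minimal sketch of how I would test it. It assumes a recent Julia (1.x), not the 0.4 I used above. The axpy! kernel and the has_vector_ops helper are hypothetical stand-ins for illustration, not my actual blockedLoop/hilbertLoop code, and the regex check is only a rough heuristic.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL (Julia 0.7 and later)

# Hypothetical stand-in kernel (not blockedLoop/hilbertLoop): a simple
# innermost loop annotated with @inbounds and @simd.
function axpy!(y::Vector{Float64}, a::Float64, x::Vector{Float64})
    @inbounds @simd for i in eachindex(x, y)
        y[i] += a * x[i]
    end
    return y
end

# Hypothetical helper: returns true if the optimized LLVM IR for f mentions
# a vector-of-doubles type such as <4 x double>. This is a heuristic text
# search, not a guarantee about the final machine code.
function has_vector_ops(f, argtypes)
    io = IOBuffer()
    code_llvm(io, f, argtypes)
    ir = String(take!(io))
    return occursin(r"<\d+ x double>", ir)
end

# The result depends on the CPU and the Julia/LLVM version.
println(has_vector_ops(axpy!, (Vector{Float64}, Float64, Vector{Float64})))
```

Running the same helper on the two real loops should show which one ends up with vector types in its IR, assuming the check behaves the same way on the more complicated kernels.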

For a performance comparison:

blockLoop timing:
  0.111497 seconds (1.70 k allocations: 93.875 KB)
  0.095680 seconds
hilbertLoop timing:
  0.218142 seconds (1.43 k allocations: 81.031 KB)
  0.191973 seconds

I ran this on an Ivy Bridge i7 processor (which supports AVX instructions).