@inbounds code slower than one without

Alright, for posterity, and since the documentation for valloc seems quite sparse, here’s the version I ended up with. After switching to a pointer, @simd no longer makes a difference for me. I also optimized the iterator according to our recent discussion. Finally, with valloc, there’s no need to handle the leftover elements separately when n isn’t a multiple of the vector width: the allocation has extra room at the end, so the last iteration can simply write one more full vector.

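using SIMD

# Fill a[k] = k for k = 1..n using non-temporal (streaming) 4-wide stores.
# Assumes `a` comes from valloc, so it is aligned and the last store may run past n.
function test_nt!(a::SubArray, n)
    v = Vec{4,Float64}((0, 1, 2, 3))
    for i = 0:(n-1)>>2
        # lanes hold 4i+1, 4i+2, 4i+3, 4i+4
        vstorent(4i + 1 + v, pointer(a, 4i + 1))
    end
end

a = valloc(Float64, 8, 199)  # 199 Float64s, aligned for 8-element vectors
test_nt!(a, 199)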
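
For anyone who wants to check the loop bounds, here’s a small sketch in plain Julia (no SIMD.jl needed) of how many elements the loop `for i = 0:(n-1)>>2` touches with 4-wide stores. The helper name `elements_written` is made up for illustration and is not part of any library.

```julia
# Hypothetical helper: the loop runs for i = 0, 1, ..., (n-1)>>2,
# and each iteration stores one 4-element vector.
elements_written(n) = 4 * (((n - 1) >> 2) + 1)

@assert elements_written(199) == 200   # one element past n, lands in the padding
@assert elements_written(200) == 200   # exact multiple of 4, no overshoot
@assert elements_written(201) == 204   # three elements past n
```

So the overshoot is at most 3 elements, which is why the extra room at the end of a valloc'd array is what makes the loop safe without a scalar remainder.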