Julia (via LLVM) unrolls loops too, but sometimes by too much. For example, take a simple dot product:
function dot_product(x::AbstractArray{T}, y::AbstractArray{T}) where T
    @boundscheck length(x) == length(y) || throw(BoundsError())
    out = zero(T)
    @fastmath @inbounds for i ∈ eachindex(x)
        out += x[i] * y[i]
    end
    out
end
It unrolls by a factor of 16 on a computer with AVX2 (it’d be 32 with AVX-512):
x = randn(1);
y = randn(1);
@code_native dot_product(x,y)
Gives output including:
; Function getindex; {
; Location: array.jl:731
L160:
vmovupd (%rdx,%rcx,8), %ymm4
vmovupd 32(%rdx,%rcx,8), %ymm5
vmovupd 64(%rdx,%rcx,8), %ymm6
vmovupd 96(%rdx,%rcx,8), %ymm7
;}
; Function add_fast; {
; Location: fastmath.jl:161
vfmadd231pd (%rax,%rcx,8), %ymm4, %ymm0
vfmadd231pd 32(%rax,%rcx,8), %ymm5, %ymm1
vfmadd231pd 64(%rax,%rcx,8), %ymm6, %ymm2
vfmadd231pd 96(%rax,%rcx,8), %ymm7, %ymm3
addq $16, %rcx
cmpq %rcx, %r8
jne L160
vaddpd %ymm0, %ymm1, %ymm0
cmpq %r8, %r9
vaddpd %ymm0, %ymm2, %ymm0
vaddpd %ymm0, %ymm3, %ymm0
vextractf128 $1, %ymm0, %xmm1
vaddpd %ymm1, %ymm0, %ymm0
vhaddpd %ymm0, %ymm0, %ymm0
;}
This is the for loop. The block following L160: moves data from memory (e.g., (%rdx,%rcx,8)) into registers (e.g., %ymm4). Registers starting with “y” are 256 bits wide, meaning 4 doubles each. It does this 4 times, for 16 doubles total. With AVX-512, the “y”s would be “z”s, and that’d be 32 numbers total.
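The width arithmetic can be checked directly in Julia (a trivial sketch; the variable names are mine, not from the assembly):

```julia
# A 256-bit ymm register holds 256 ÷ 64 = 4 Float64s;
# four such registers per iteration give a 16-wide unroll.
doubles_per_ymm = 256 ÷ (8 * sizeof(Float64))  # register bits ÷ bits per double = 4
unroll_width   = 4 * doubles_per_ymm           # 4 registers × 4 doubles = 16
doubles_per_zmm = 512 ÷ (8 * sizeof(Float64))  # AVX-512 zmm: 8 doubles per register
avx512_width   = 4 * doubles_per_zmm           # 32 numbers total
```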
Next up are four vfmadd231pd instructions: vector fused multiply-adds that multiply the values just loaded into ymm4 through ymm7 by values read straight from memory, accumulating the results into registers ymm0 through ymm3. Then jne L160 conditionally jumps back to L160 to repeat the process.
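In Julia terms, that unrolled loop body is roughly equivalent to keeping four independent accumulators, one per ymm register (a hand-written sketch for illustration; dot16 is my name, and it only handles lengths that are multiples of 16):

```julia
# Hand-unrolled analogue of the AVX2 loop: four accumulators
# standing in for ymm0–ymm3, consuming 16 elements per pass.
function dot16(x::Vector{Float64}, y::Vector{Float64})
    n = length(x)
    @assert n == length(y) && n % 16 == 0
    a0 = a1 = a2 = a3 = 0.0
    i = 1
    @inbounds while i <= n
        for j in 0:3  # each accumulator sums one 4-wide chunk
            a0 += x[i+j]    * y[i+j]
            a1 += x[i+4+j]  * y[i+4+j]
            a2 += x[i+8+j]  * y[i+8+j]
            a3 += x[i+12+j] * y[i+12+j]
        end
        i += 16       # like `addq $16, %rcx`
    end
    (a0 + a1) + (a2 + a3)  # like the vaddpd reduction at the end
end
```

The four accumulators matter: they break the dependency chain on a single sum, which is what lets the FMAs overlap in the pipeline.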
The problem is that if the loop has fewer than 16 iterations, it doesn’t get unrolled at all.
julia> using BenchmarkTools
julia> x15 = randn(15); y15 = randn(15);
julia> x16 = randn(16); y16 = randn(16);
julia> @benchmark dot_product($x15, $y15)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 9.567 ns (0.00% GC)
median time: 9.628 ns (0.00% GC)
mean time: 9.648 ns (0.00% GC)
maximum time: 23.377 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999
julia> @benchmark dot_product($x16, $y16)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 5.540 ns (0.00% GC)
median time: 5.580 ns (0.00% GC)
mean time: 5.636 ns (0.00% GC)
maximum time: 20.368 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> x8 = randn(8); y8 = randn(8);
julia> @benchmark dot_product($x8, $y8)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 5.510 ns (0.00% GC)
median time: 5.580 ns (0.00% GC)
mean time: 5.626 ns (0.00% GC)
maximum time: 19.136 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
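The slow 15-element case suggests a workaround: zero-pad to a multiple of 16, which leaves the dot product unchanged (a hypothetical sketch; padded is my helper, and whether the copy pays for itself would need benchmarking):

```julia
# Zero-pad a vector up to the next multiple of `m`; the extra
# zeros contribute nothing to a dot product.
function padded(v::Vector{Float64}, m::Int = 16)
    n = length(v)
    vcat(v, zeros(cld(n, m) * m - n))  # cld = ceiling division
end

px = padded(collect(1.0:15.0))  # length 16
py = padded(fill(1.0, 15))
```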
Basically, it unrolls to optimize for long loops. If you know the loops are short, it helps to pass along that information.
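One way to pass that information along is to make the length part of the type, which is what StaticArrays does; a minimal sketch with a plain NTuple (dot_tuple is a hypothetical name):

```julia
# The length N is a type parameter, so the compiler specializes
# on each length and can fully unroll the loop.
function dot_tuple(x::NTuple{N,Float64}, y::NTuple{N,Float64}) where N
    out = 0.0
    @inbounds for i in 1:N
        out += x[i] * y[i]
    end
    out
end
```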
With SVectors (which are dramatically faster), length 8 does speed up relative to length 16:
julia> sx16 = @SVector randn(16);
julia> sy16 = @SVector randn(16);
julia> @benchmark dot_product($sx16, $sy16)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.616 ns (0.00% GC)
median time: 3.647 ns (0.00% GC)
mean time: 3.865 ns (0.00% GC)
maximum time: 22.252 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> sx8 = @SVector randn(8);
julia> sy8 = @SVector randn(8);
julia> @benchmark dot_product($sx8, $sy8)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.915 ns (0.00% GC)
median time: 2.955 ns (0.00% GC)
mean time: 3.105 ns (0.00% GC)
maximum time: 24.025 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
Not that much faster, though, which I think is why LLVM normally unrolls by 4 (instructions are pipelined, not executed one at a time).
Length 15 is slower, I guess because it doesn’t divide neatly into chunks of 4:
julia> sx15 = @SVector randn(15);
julia> sy15 = @SVector randn(15);
julia> @benchmark dot_product($sx15, $sy15)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.548 ns (0.00% GC)
median time: 4.589 ns (0.00% GC)
mean time: 4.830 ns (0.00% GC)
maximum time: 22.592 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000