Julia uses LLVM to generate code, which is the same backend that e.g. clang uses. So, modulo the exact optimization passes used, essentially the same code will be generated. The relevant question is: for which inputs does the generated code for the function have to be valid? As an example, take a small sum function:
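using StaticArrays  # `using` (not `import`) brings SVector into scope

# Note: this defines a new `sum` in the current module, shadowing Base.sum.
function sum(x::AbstractVector{Float64})
    s = 0.0
    @inbounds @simd for i in 1:length(x)
        s += x[i]
    end
    return s
end

@code_llvm sum(rand(8))
@code_llvm sum(rand(SVector{8}))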
Looking at some of the code in the first example:
vector.ph: ; preds = %min.iters.checked
%18 = insertelement <2 x double> <double undef, double 0.000000e+00>, double %s.0.ph, i32 0
br label %vector.body
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <2 x double> [ %18, %vector.ph ], [ %23, %vector.body ]
%vec.phi122 = phi <2 x double> [ zeroinitializer, %vector.ph ], [ %24, %vector.body ]
%19 = getelementptr double, double* %14, i64 %index
%20 = bitcast double* %19 to <2 x double>*
%wide.load = load <2 x double>, <2 x double>* %20, align 8
%21 = getelementptr double, double* %19, i64 2
%22 = bitcast double* %21 to <2 x double>*
%wide.load124 = load <2 x double>, <2 x double>* %22, align 8
%23 = fadd fast <2 x double> %vec.phi, %wide.load
%24 = fadd fast <2 x double> %vec.phi122, %wide.load124
%index.next = add i64 %index, 4
%25 = icmp eq i64 %index.next, %n.vec
br i1 %25, label %middle.block, label %vector.body
middle.block: ; preds = %vector.body
%bin.rdx = fadd fast <2 x double> %24, %23
%rdx.shuf = shufflevector <2 x double> %bin.rdx, <2 x double> undef, <2 x i32> <i32 1, i32 undef>
%bin.rdx127 = fadd fast <2 x double> %bin.rdx, %rdx.shuf
%26 = extractelement <2 x double> %bin.rdx127, i32 0
%cmp.n = icmp eq i64 %5, %n.vec
br i1 %cmp.n, label %L11.outer.L11.outer.split_crit_edge.loopexit, label %scalar.ph
scalar.ph: ; preds = %middle.block, %min.iters.checked, %if12.lr.ph
%bc.resume.val = phi i64 [ %n.vec, %middle.block ], [ 0, %if12.lr.ph ], [ 0, %min.iters.checked ]
%bc.merge.rdx = phi double [ %26, %middle.block ], [ %s.0.ph, %if12.lr.ph ], [ %s.0.ph, %min.iters.checked ]
br label %if12
Since this code has to be valid no matter the length of the array, LLVM vectorizes the loop with two <2 x double> accumulators, so each iteration of vector.body processes 4 elements, and it adds a scalar fallback (scalar.ph) for the elements left over when the length is not divisible by 4. This is very nice and fast if the array is big, but it carries some overhead if the array is small.
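To make the shape of that IR easier to follow, here is a hand-written sketch of the same structure in plain Julia. It is only an illustration: `chunked_sum` is a hypothetical name, the four scalar variables stand in for the two <2 x double> registers, and the real code generated by LLVM will differ in detail.

```julia
# Hypothetical helper: a hand-written analogue of the vectorized loop above.
function chunked_sum(x::Vector{Float64})
    n = length(x)
    nvec = n - n % 4          # largest multiple of 4 that is <= n
    a0 = a1 = b0 = b1 = 0.0   # stand-ins for the two <2 x double> accumulators
    i = 1
    @inbounds while i <= nvec # like vector.body: 4 elements per iteration
        a0 += x[i];     a1 += x[i + 1]
        b0 += x[i + 2]; b1 += x[i + 3]
        i += 4
    end
    s = (a0 + b0) + (a1 + b1) # like middle.block: combine the partial sums
    @inbounds while i <= n    # scalar fallback for the 0 to 3 leftover elements
        s += x[i]
        i += 1
    end
    return s
end

chunked_sum(collect(1.0:10.0))  # 55.0
```

Note that the partial sums change the order of the floating point additions, which is what `@simd` permits in the original function, so for general inputs the result can differ in the last bits from a strict left-to-right sum.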
Now let’s look at the code for the static array:
julia> @code_llvm sum(rand(SVector{8}))
define double @julia_sum_33701({ [8 x double] } addrspace(11)* nocapture nonnull readonly dereferenceable(64)) {
top:
%1 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 0
%2 = load double, double addrspace(11)* %1, align 8
%3 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 1
%4 = load double, double addrspace(11)* %3, align 8
%5 = fadd fast double %2, %4
%6 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 2
%7 = load double, double addrspace(11)* %6, align 8
%8 = fadd fast double %5, %7
%9 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 3
%10 = load double, double addrspace(11)* %9, align 8
%11 = fadd fast double %8, %10
%12 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 4
%13 = load double, double addrspace(11)* %12, align 8
%14 = fadd fast double %11, %13
%15 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 5
%16 = load double, double addrspace(11)* %15, align 8
%17 = fadd fast double %14, %16
%18 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 6
%19 = load double, double addrspace(11)* %18, align 8
%20 = fadd fast double %17, %19
%21 = getelementptr inbounds { [8 x double] }, { [8 x double] } addrspace(11)* %0, i64 0, i32 0, i64 7
%22 = load double, double addrspace(11)* %21, align 8
%23 = fadd fast double %20, %22
ret double %23
}
Here, LLVM knows that there is no need to generate all the general code, because this function only needs to be valid for input arrays of length 8, and it knows that the complicated machinery above is not worth it. Because the exact size is known, the whole loop can be unrolled.
julia> using BenchmarkTools
julia> @btime sum($(rand(8)))
6.402 ns (0 allocations: 0 bytes)
2.405228701641119
julia> @btime sum($(rand(SVector{8})))
3.278 ns (0 allocations: 0 bytes)
4.368139012221248
In C++, for example, you would typically do this with templates, encoding the size of the array as a template parameter.
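The Julia counterpart of such a template parameter is a type parameter: an SVector carries its length in its type, which is why the compiler can specialize on it. A minimal sketch of the same idea without any packages, using a plain NTuple (the name `tuple_sum` is hypothetical and not from StaticArrays):

```julia
# Hypothetical example: the length N is part of the argument's type, so the
# compiler specializes this method separately for each N it is called with.
function tuple_sum(x::NTuple{N,Float64}) where {N}
    s = 0.0
    for i in 1:N   # N is known at compile time inside this specialization
        s += x[i]
    end
    return s
end

tuple_sum((1.0, 2.0, 3.0, 4.0))  # 10.0
```

Whether the loop is actually fully unrolled depends on the Julia and LLVM versions, so it is worth checking with `@code_llvm tuple_sum((1.0, 2.0, 3.0, 4.0))` rather than assuming it.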