In the case of NumPy, vectorization offloads heavy computation to code in compiled libraries, which improves speed. Julia compiles down to machine code by default, so why is this implementation of the reduce operation faster than a naive implementation?
    function mapreduce_impl(f, op::Union{typeof(max), typeof(min)},
                            A::AbstractArray, first::Int, last::Int)
        a1 = @inbounds A[first]
        v1 = mapreduce_first(f, op, a1)
        v2 = v3 = v4 = v1
        chunk_len = 256
        start = first + 1
        simdstop = start + chunk_len - 4
        while simdstop <= last - 3
            # short circuit in case of NaN
            v1 == v1 || return v1
            v2 == v2 || return v2
            v3 == v3 || return v3
            v4 == v4 || return v4
            @inbounds for i in start:4:simdstop
                v1 = _fast(op, v1, f(A[i+0]))
                v2 = _fast(op, v2, f(A[i+1]))
                v3 = _fast(op, v3, f(A[i+2]))
                v4 = _fast(op, v4, f(A[i+3]))
            end
            checkbounds(A, simdstop+3)
            start += chunk_len
            simdstop += chunk_len
        end
        v = op(op(v1,v2),op(v3,v4))
        ...
        return v
    end
The variable names suggest that it has something to do with SIMD. How does using four variables instead of one allow SIMD optimizations to take place?
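
To make the comparison concrete, here is a minimal sketch of what I mean by the two styles. The names `naive_max` and `unrolled_max` are hypothetical and not part of Base. The sketch is restricted to `max` over a `Vector{Float64}` and leaves out the NaN short circuit, the chunking, and `_fast` from the code above.

```julia
# Hypothetical helper names; neither function is part of Base.

# Naive reduction: one accumulator, so every max call depends on
# the result of the previous iteration.
function naive_max(A::Vector{Float64})
    v = A[1]
    for i in 2:length(A)
        v = max(v, A[i])
    end
    return v
end

# Simplified four-accumulator version: the four max calls in the
# loop body are independent of each other.
function unrolled_max(A::Vector{Float64})
    n = length(A)
    v1 = v2 = v3 = v4 = A[1]
    i = 2
    while i + 3 <= n
        v1 = max(v1, A[i])
        v2 = max(v2, A[i+1])
        v3 = max(v3, A[i+2])
        v4 = max(v4, A[i+3])
        i += 4
    end
    # Combine the four partial results.
    v = max(max(v1, v2), max(v3, v4))
    # Fold in the leftover elements that did not fill a group of four.
    while i <= n
        v = max(v, A[i])
        i += 1
    end
    return v
end

A = [3.0, 9.5, -1.0, 7.25, 2.0, 8.0, 0.5]
println(naive_max(A) == unrolled_max(A))
```

Both functions return the same value for this input. My question is about why the second shape would be faster.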