In the case of NumPy, vectorization offloads heavy computation to compiled library code, which improves speed. Julia, however, compiles to machine code by default, so why is this implementation of the `reduce` operation faster than the naive one:

```julia
function mapreduce_impl(f, op::Union{typeof(max), typeof(min)},
                        A::AbstractArray, first::Int, last::Int)
    a1 = @inbounds A[first]
    v1 = mapreduce_first(f, op, a1)
    v2 = v3 = v4 = v1
    chunk_len = 256
    start = first + 1
    simdstop = start + chunk_len - 4
    while simdstop <= last - 3
        # short circuit in case of NaN
        v1 == v1 || return v1
        v2 == v2 || return v2
        v3 == v3 || return v3
        v4 == v4 || return v4
        @inbounds for i in start:4:simdstop
            v1 = _fast(op, v1, f(A[i+0]))
            v2 = _fast(op, v2, f(A[i+1]))
            v3 = _fast(op, v3, f(A[i+2]))
            v4 = _fast(op, v4, f(A[i+3]))
        end
        checkbounds(A, simdstop+3)
        start += chunk_len
        simdstop += chunk_len
    end
    v = op(op(v1,v2), op(v3,v4))
    ...
    return v
```

The variable names suggest that it has something to do with SIMD. How does using four accumulator variables instead of one allow SIMD optimizations to take place?
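For reference, here is a minimal sketch of the two patterns I am comparing (my own simplification, not the Base code; `naive_max` and `unrolled_max` are hypothetical names). The naive loop carries a single dependency chain through `v`, while the unrolled version keeps four independent chains:

```julia
# Naive single-accumulator max reduction: each iteration depends on
# the result of the previous one through `v`.
function naive_max(A::AbstractVector)
    v = A[1]
    for i in 2:length(A)
        v = max(v, A[i])
    end
    return v
end

# Four-accumulator variant in the spirit of the Base code above:
# the four chains are independent within each group of four elements.
function unrolled_max(A::AbstractVector)
    n = length(A)
    v1 = v2 = v3 = v4 = A[1]
    i = 1
    @inbounds while i + 3 <= n
        v1 = max(v1, A[i])
        v2 = max(v2, A[i+1])
        v3 = max(v3, A[i+2])
        v4 = max(v4, A[i+3])
        i += 4
    end
    @inbounds while i <= n   # remainder elements
        v1 = max(v1, A[i])
        i += 1
    end
    return max(max(v1, v2), max(v3, v4))
end
```

Both return the same result; the question is why the second form is so much faster.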