Are Julia’s loops as fast as code written in vectorized form?
Yes, usually.
One exception is that vectorized code can sometimes be optimised better because it has different semantics. For example, suppose you want to add 1 to a vector. You could write it in two ways:
# Using broadcasting
add_a(a) = a .+ one(eltype(a))

# Using a normal for loop
function add_b(a)
    y = similar(a)
    for i in eachindex(y, a)
        y[i] = a[i] + one(eltype(a))
    end
    y
end
And it’ll be pretty much equally fast. However, the broadcasted one (add_a) is more generic and can therefore specialize in some circumstances. For example, look what happens if we pass 1:5 into the two functions:
julia> add_a(1:5)
2:6

julia> add_b(1:5)
5-element Vector{Int64}:
 2
 3
 4
 5
 6
The broadcasted implementation is faster here since it dispatches to a specialized method that can have different semantics: for a range it returns a new range (2:6) rather than allocating a Vector. Broadcasted code also tends to run better on GPUs (I hear).
“Vectorized” has a couple of different meanings: in contrast to loops, it refers to array operations at the user level; elsewhere it refers to data-level parallelism like SIMD. I mention this first only because “vector” has a dozen meanings in different contexts that are easily conflated; it’s not the main point.
Comparing the speed of loops and array operations rests on a somewhat faulty premise: they don’t have the same semantics. The topic comes up most in dynamic languages whose semantics don’t allow for optimizing compilation; there, the interpreter overheads add up to a lot in loops, so array operations implemented in AOT-compiled languages are heavily encouraged instead. That performance difference across a language barrier forces practical limitations on coding style, and avoiding it is one of the big reasons for Julia’s existence. This doesn’t imply Julia can simply replace those languages or has all of their strengths; in fact, Julia is still gaining or improving features that emulate other languages’. But at least the general gulf between loops and array operations is solved; what matters more is the semantics and the situational implementation.
A loop, at a high level, iterates over a collection and does something at each iteration, and it can traverse that collection in whatever order you specify. In the most general case, iteration must be done one element at a time, in order, because each iteration can depend on a previous one. However, the compiler can exploit details for optimizations, like hoisting loop-invariant code out of the loop or detecting when iterations can be reordered to allow things like SIMD. Macros have been designed to enable such optimizations by effectively letting the user assert assumptions about the loop.
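For example, a minimal sketch using Base’s @inbounds and @simd macros (sum_fast is an illustrative name, not an established function):

function sum_fast(a)
    s = zero(eltype(a))
    # @inbounds asserts every access is in bounds (skips bounds checks);
    # @simd asserts the iterations may be reordered, enabling SIMD.
    @inbounds @simd for i in eachindex(a)
        s += a[i]
    end
    return s
end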
Array operations perform calculations on a collection’s elements, and how this happens can vary a lot more because element order is reasonably ambiguous or unimportant (refactor to a loop for an explicit order). It’s on par with a loop in the cases where it’s done by iteration, though comparisons may vary over the aforementioned loop optimizations. Writing simpler loops can avoid the small overhead of setting up a general array operation, but that is easily dwarfed by other performance variance with enough elements. And an array operation doesn’t always need to iterate, e.g. the UnitRange 1:5 in the previous comment: that is a 5-element sequence that only stores its 2 endpoints, and it’s obviously more efficient to operate on the endpoints instead of iterating the elements, when possible.
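A minimal sketch of the idea (shift_range is a hypothetical helper, not Base’s actual implementation, which lives in the broadcasting machinery):

# Adding an integer to a UnitRange only touches the two endpoints,
# no matter how many elements the range represents.
shift_range(r::UnitRange, x::Integer) = (first(r) + x):(last(r) + x)

shift_range(1:5, 1)              # 2:6
shift_range(1:1_000_000_000, 1)  # same constant-time cost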
True for CUDA.jl at least. As said before, loops in general don’t make assumptions that allow parallelization, and there’s the extra overhead of transferring elements to the CPU where scalar operations are possible. There’s no assumption-laden macro for transforming a typical loop, but CUDA.jl compiles elementwise code called kernels, much like the underlying CUDA C++ API, for when array operations (which compile implicit kernels) are not flexible enough.
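A sketch of both styles with CUDA.jl, assuming a CUDA-capable GPU (add_one_kernel! is an illustrative name):

using CUDA

a = CUDA.ones(Float32, 1024)

# Array operation: broadcasting compiles an implicit kernel.
b = a .+ 1f0

# Explicit kernel for the same elementwise operation,
# for cases where array operations aren’t flexible enough.
function add_one_kernel!(y, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds y[i] = x[i] + 1f0
    end
    return nothing
end

y = similar(a)
@cuda threads=256 blocks=cld(length(a), 256) add_one_kernel!(y, a)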
Yes, and in fact the “vectorized” routines in Julia (with the exception of some linear algebra BLAS calls) are typically implemented as loops written in Julia itself.
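To illustrate what that means, a deliberately simplified sketch (not Base’s actual code):

# A “vectorized” map-like routine is itself just a plain Julia loop.
function my_map(f, a::AbstractArray)
    out = similar(a, typeof(f(first(a))))  # assumes a is nonempty
    for i in eachindex(a, out)
        out[i] = f(a[i])
    end
    return out
end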
See also the discussion in this blog post. As I wrote there:
There is a tension between two general principles in computing: on the one hand, re-using highly optimized code is good for performance; on the other hand, optimized code that is specialized for your problem can usually beat general-purpose functions.
On the one hand, Julia’s vectorized routines were implemented by experts, so if you are a novice it may well be that your own loops are slower because you don’t know all the performance tricks. On the other hand, if you write your own loops you can optimize specifically for your problem in a way that can sometimes be hard to beat by gluing together a sequence of “vectorized” library calls.
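As an illustration (a sketch; the names are mine, not from the post), chaining library calls allocates a temporary at each step, while a problem-specific loop fuses everything into a single pass:

# Glued-together library calls: x .^ 2 allocates a temporary array.
norm2_vectorized(x) = sqrt(sum(x .^ 2))

# Hand-written loop: one pass over x, no allocations.
function norm2_loop(x)
    s = zero(eltype(x))
    for xi in x
        s += xi * xi
    end
    return sqrt(s)
end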