Hi everyone.
I am optimizing some older code of mine, and I find that sometimes vectorizing is faster than looping (which did not happen before). Could anyone explain when it is better to use one or the other in Julia?
Thanks
In most cases the advantage of a vectorized operation comes from the guarantee that the operation is in bounds (the broadcast fails if the dimensions do not match), something that you might have to tell the compiler explicitly when writing a loop.
It can also help when things are type-unstable, by providing a function barrier.
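For instance, here is a minimal sketch of the bounds-checking point (the function names and the axpy example are mine, not from this thread):

# Broadcast: shapes are validated once up front, then the kernel runs
# without per-element bounds checks.
axpy_bcast!(y, x, a) = (y .= a .* x .+ y)

# Equivalent loop: every y[i] / x[i] access is bounds-checked unless we opt out.
# eachindex(y, x) also validates the shapes once, so @inbounds is safe here.
function axpy_loop!(y, x, a)
    @inbounds for i in eachindex(y, x)
        y[i] = a * x[i] + y[i]
    end
    return y
end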
Chris, could you please decode your important statement for the noobs with a simple example?
In the code sample below, looping via a comprehension performs better than vectorization (if I understand the term correctly) but I do not see a function barrier:
A = ["abcd", [1,2,10], π]
length.(A) # 661 ns (9 allocs: 416 bytes)
[length(a) for a in A] # 242 ns (3 allocs: 112 bytes)
I'm new to Julia. Can you please explain when vectorizing would be SLOWER than looping? Intuition says vectorizing (when possible) is always faster.
I'm sure you're aware that array comprehensions compile to a Generator (with an anonymous function), which is immediately collected into an array (see the bottom six lines):
julia> Meta.@lower [length(a) for a in A]
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ $(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─      global var"#13#14"
│        const var"#13#14"
│   %3 = Core._structtype(Main, Symbol("#13#14"), Core.svec(), Core.svec(), Core.svec(), false, 0)
│        Core._setsuper!(%3, Core.Function)
│        var"#13#14" = %3
│        Core._typebody!(%3, Core.svec())
└──      return nothing
)))
│   %2 = Core.svec(var"#13#14", Core.Any)
│   %3 = Core.svec()
│   %4 = Core.svec(%2, %3, $(QuoteNode(:(#= none:0 =#))))
│   $(Expr(:method, false, :(%4), CodeInfo(
    @ none within `none`
1 ─ %1 = length(a)
└──      return %1
)))
│        #13 = %new(var"#13#14")
│   %7 = #13
│   %8 = Base.Generator(%7, A)
│   %9 = Base.collect(%8)
└──      return %9
))))
Vectorizing an operation still executes a loop under the hood; in theory, a loop and a vectorized operation do the same thing.
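Roughly, a sketch (my example):

x = rand(100)

# Broadcast version...
y1 = sin.(x)

# ...and the loop it is morally equivalent to:
y2 = similar(x)
for i in eachindex(x)
    y2[i] = sin(x[i])
end

y1 == y2  # true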
Thank you, and no, I am not. I only know Julia from walks in the park, not from seeing her on the operating table.
Ah, then you might find this interesting
Vectorized expressions sometimes create intermediate temporary arrays which are unnecessary. In many cases you can avoid this, but not always.
With a loop, you can sometimes get much better performance:
# vectorized
foo(A) = maximum(abs2.(A))
# loop
function bar(A)
    val = zero(eltype(A))
    for i in eachindex(A)
        @inbounds val = max(val, abs2(A[i]))
    end
    return val
end
Benchmarks:
julia> using BenchmarkTools
julia> A = rand(-100:100, 10_000);
julia> @btime foo($A);
7.600 μs (2 allocations: 78.17 KiB)
julia> @btime bar($A);
2.322 μs (0 allocations: 0 bytes)
As you see, the loop is much faster. But it's not always easy to predict. If I supply a vector of Floats instead of a vector of Ints, the benchmarks are very different.
(Sorry, made an error in the first draft.)
You may also want to check the "performance tips" from the official docs:
Performance Tips · The Julia Language
Maybe you'll find some good tips.
Maybe relevant for you: broadcast is also useful to build generic code running on both GPU and CPU.
For example, considering an n_s\times n_s mesh, a basic finite difference scheme can be written as
function update_force!(fx::AbstractArray{Float64,2}, xc::AbstractArray{Float64,2}, ks)
    ns = size(xc, 1)
    r0 = 1:ns-2
    r1 = 2:ns-1
    r2 = 3:ns
    @views @. fx[r1,r1] = -ks*(4xc[r1,r1]
                              -xc[r1,r0]  # Down
                              -xc[r0,r1]  # Left
                              -xc[r2,r1]  # Right
                              -xc[r1,r2]) # Up
end
which corresponds to the following loop version, which will not directly run on GPUs:
function update_force!(fx::Array{Float64,2}, xc::Array{Float64,2}, ks)
    nsx, nsy = size(xc)
    @inbounds for j ∈ 2:nsy-1
        for i ∈ 2:nsx-1
            fx[i,j] = -ks*(4xc[i,j]
                          -xc[i,j-1]
                          -xc[i-1,j]
                          -xc[i+1,j]
                          -xc[i,j+1])
        end
    end
end
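As an illustration, here is a hypothetical usage sketch for the broadcast version on a GPU (assuming CUDA.jl and a working GPU; the size and ks value are arbitrary):

using CUDA

ns = 256
xc = CUDA.rand(Float64, ns, ns)   # positions, stored on the GPU
fx = CUDA.zeros(Float64, ns, ns)  # forces, stored on the GPU
update_force!(fx, xc, 1.0)        # the fused broadcast runs as a single GPU kernel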
Even the version that does not artificially create an intermediate,
foo(A) = maximum(abs2, A)
turns out to be slower than the loop:
julia> @btime foo($A);
1.695 μs (0 allocations: 0 bytes)
julia> @btime bar($A);
860.459 ns (0 allocations: 0 bytes)
(Why in this case, I don't know, but it seems the loop version is able to SIMD better.)
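If you want to check the SIMD question yourself, one way (a sketch) is to inspect the generated code and look for vector register types:

using InteractiveUtils  # already loaded in the REPL

@code_llvm debuginfo=:none bar(A)  # SIMD shows up as <4 x i64> / <4 x double> operations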
A function barrier helps if you're doing a lot of work behind the barrier, but length is not a lot of work.
Here is a small real-world example that may come up frequently:
using DataFrames
df = DataFrame(x = rand(1024), y = rand(1024));
function loop!(df)
    (; x, y) = df
    z = similar(x)
    @inbounds for i in eachindex(x, y, z)
        z[i] = x[i] + y[i]
    end
    df.z = z
    nothing
end
function broadcast!(df)
    df.z = df.x .+ df.y
    nothing
end
@btime loop!($df)
@btime broadcast!($df)
I get
julia> @btime loop!($df)
104.254 μs (5124 allocations: 104.16 KiB)
julia> @btime broadcast!($df)
1.188 μs (4 allocations: 8.20 KiB)
My point is that the broadcast itself is a function barrier; a loop is not.
Thus, if the broadcast is type stable, it will be fast even if the inputs are not: there is a single slow dynamic dispatch into the fast ::Vector{Float64} .+ ::Vector{Float64} method.
A loop, however, doesn't have this barrier, so it suffers a dynamic dispatch on every iteration.
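A minimal sketch of the same effect outside DataFrames (the names are mine): non-const globals are type-unstable, so gx and gy below play the role of the untyped columns.

gx = rand(1024); gy = rand(1024)  # non-const globals: types unknown at compile time

# One dynamic dispatch into the fast Vector{Float64} .+ Vector{Float64} kernel:
add_broadcast() = gx .+ gy

# Dynamic dispatch on +, getindex, and setindex! in every iteration:
function add_loop()
    z = similar(gx)
    for i in eachindex(gx, gy)
        z[i] = gx[i] + gy[i]
    end
    return z
end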
"Vectorizing" in terms of broadcasting will often fail to "vectorize" in the low-level sense even when loops would, because broadcasting is fundamentally dynamic.
This is why FastBroadcast.jl has the option broadcast, which defaults to false, meaning it won't actually broadcast.
This gives access to the nice syntax of broadcasting and the ability to still support GPUs, but without the occasionally performance-killing dynamism of broadcast=true.
Even if you set broadcast=true, FastBroadcast.jl will still special-case the common situation of not broadcasting, generating specialized code (loops) for only that case and falling back to a slower implementation if necessary.
Anyway, my point is that vectorized syntax is, by default, bad for vectorization. You can use FastBroadcast.jl to remedy this, which the SciML ecosystem does.
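For reference, a usage sketch (assuming FastBroadcast.jl, which exports the @.. macro; x and y are my placeholders):

using FastBroadcast

x = rand(1000)
y = similar(x)

@.. y = 2x + sin(x)                 # loop-style code; assumes no actual broadcasting is needed
@.. broadcast=true y = 2x + sin(x)  # opts back in to full broadcasting semantics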
Just a meta comment: the use of the word "vectorization" in programming is pretty overloaded. To some people, it means something along the lines of "functions acting on an entire array", and to others it means SIMD.
In Julia circles, people typically mean SIMD when they say "vectorization", but this is different from, say, Python, Matlab, and R, where typically the only way to access SIMD is through their "vectorized" functions, whereas in Julia it's pretty easy to get a loop to use SIMD instructions (it'll often just happen automatically).
This can sometimes cause people to get confused on both sides, so it's something to be aware of.
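To make the loop point concrete, a small sketch: this plain loop uses SIMD instructions with no "vectorized" syntax at all (the @simd annotation merely allows reassociating the floating-point additions):

function mysum(x::Vector{Float64})
    s = 0.0
    @simd for i in eachindex(x)
        s += x[i]  # compiles to SIMD adds; verify with @code_llvm
    end
    return s
end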
@Elrod, thanks for the very educational example using DataFrames, which are type-unstable, making your point very clear. Much appreciated.
As a side note, if we create a function barrier with the loop then it would be even faster than the vectorized version:
function sumxy!(z, x, y)
    @inbounds for i in eachindex(x, y, z)
        z[i] = x[i] + y[i]
    end
    return z
end

function loop!(df)
    (; x, y) = df
    z = similar(x)
    df.z = sumxy!(z, x, y)  # function barrier
    nothing
end
Can anybody explain why the type instability? I'd naively think that, since df is passed as a function argument, the eltype of the df columns (Float64) would be known.
@Eben60, see @bkamins' multiple posts and courses on this matter, including here, and his latest book, which covers this issue nicely.
Basically, the columns of a DataFrame are AbstractVectors (for maximal user flexibility), so the compiler does not know their actual memory layout. Inside the function barrier, however, everything becomes type stable for the compiler, and faster code can be generated.
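A small sketch of what the compiler sees (the function names are mine; the exact inferred types can vary across DataFrames versions):

using DataFrames

colsum(df) = sum(df.x)          # df.x is only inferred as an abstract vector here

_sum(x) = sum(x)                # behind the barrier, x is a concrete Vector{Float64}
colsum_barrier(df) = _sum(df.x)

# julia> @code_warntype colsum(DataFrame(x = rand(3)))
# highlights the abstract column type at the getproperty call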