Getting non-vectorized code to the speed of vectorized ones

I am not new to Julia, but I am not really a regular user either. I am trying to make a code that is non vectorized to achieve the performance of a vectorized one, but I cannot do it. Here is the problem in general lines:

I have a function f(x,u,d) representing a discrete time dynamical system, that takes 3 arguments, where x,u and d are (column) vectors, and it returns an output. I am currently working with the particular function

function f(x,u,d)

which represents a discrete time Dubins’ vehicle with a single input, but this should not matter too much.

I want to calculate the output of this function for several x,u and d, in this case, organized in tensors of dimensions 3xTxK for x_tens and d_tens and a 1xTxK tensor for u_tens. I cannot use broadcast because my function needs to access particular elements of x. The functions map and mapslices do not work because I have more than one input. I have tried to implement this in three different ways:

  1. By using eachcol and multiple dispatch:
function dyn_eachcol(x::Matrix{Type},u::Matrix{Type},d::Matrix{Type}) where {Type} 
    reduce(hcat,f.(eachcol(x),eachcol(u),eachcol(d))) ::Matrix{Type}
function dyn_eachcol(x::Array{Type,3},u::Array{Type,3},d::Array{Type,3}) where {Type} 
    reshape(dyn_eachcol(x,u,d),:,T,K) ::Array{Type,3}

(instead of using reshape, one could use eachslice, but the result is a little bit faster this way)

  1. Using a for loop:
function dyn_for(x::Array{Type,3},u::Array{Type,3},d::Array{Type,3}) where {Type} 
    for k=1:K
        for t=1:T
            @views xout[:,t,k]=f(x[:,t,k],u[:,t,k],d[:,t,k])
    return xout ::Array{Type,3}
  1. Vectorizing:
function dyn_vectorized(x::Array{Type,3},u::Array{Type,3},d::Array{Type,3}) where {Type} 

When I benchmark the speed of each one of this approaches I get a something like

@benchmark dyn_eachcol(x_tens,u_tens,d_tens) # median = 32ms
@benchmark dyn_for(x_tens,u_tens,d_tens) # median = 24ms
@benchmark dyn_vectorized(x_tens,u_tens,d_tens) # median = 4ms

So, the vectorized code is a little bit faster, but not that much. But what I really want to do is to take the gradient of the result, using Zygote. Here are some benchmarks:

@benchmark gradient((x,u)->sum(dyn_eachcol(x,u,d_tens)),x_tens,u_tens) # median = 823 ms
@benchmark gradient((x,u)->dyn_for_sum(x,u,d_tens),x_tens,u_tens) # median = 90277 ms (this is not a typo) 
@benchmark gradient((x,u)->sum(dyn_vectorized(x,u,d_tens)),x_tens,u_tens) #median = 12 ms

Note: for the second example Zygote does not support mutation arrays, so the line xout[:,t,k]=f(x[:,t,k],u[:,t,k],d[:,t,k]) is substituted by out+=sum(f(x[:,t,k],u[:,t,k],d[:,t,k]))

This is a dramatic difference. I don’t think the vectorized code is calling any BLAS routine, this should be pure Julia. Yet, the results are so different. What am I missing? How can I write a non vectorized code which achieves the speed of the vectorized one?

1 Like

Swap in Enzyme for the differentiation of pieces that need mutation and scalar operations.

Thanks @ChrisRackauckas ! did not know about Enzyme.

However, this would speed up the differentiation, but the solution with the function evaluation using the for loop is still much slower than the vectorized. Is there no solution, or you meant that there is another way to use Enzyme to speed up things?

Your loops are not fully non-allocating. You should make it fuse the array creation into the write.

1 Like

This makes many small arrays, so one approach is to use StaticArrays:

julia> using StaticArrays

julia> function f(x::StaticArray, u, d)
           s,c = sincos(x[3])
           x .+ SA[c, s, u] .+ 0.5 .* SA[-s*u, c*u, 0] .+ d
f (generic function with 2 methods)

julia> function dyn_static(x,u,d)
           xs = reinterpret(reshape, SVector{3,eltype(x)}, x)
           ds = reinterpret(reshape, SVector{3,eltype(d)}, d)
           um = reshape(u,T,K)
           zs = f.(xs, um, ds)
           reinterpret(reshape, eltype(eltype(zs)), zs)
dyn_static (generic function with 1 method)

julia> x_tens,d_tens = (rand(3,10,100) for _ in 1:3); u_tens = rand(1,10,100);  # smaller than yours

julia> dyn_static(x_tens,u_tens,d_tens) ≈ dyn_eachcol(x_tens,u_tens,d_tens)

julia> @btime dyn_eachcol(x_tens,u_tens,d_tens);
  min 5.551 ms, mean 5.979 ms (75017 allocations, 2.82 MiB. GC mean 5.67%)

julia> @btime dyn_vectorized(x_tens,u_tens,d_tens);
  min 42.333 μs, mean 65.566 μs (81 allocations, 183.50 KiB. GC mean 18.78%)

julia> @btime dyn_static(x_tens,u_tens,d_tens);  # was mean 21μs without sincos
  min 7.354 μs, mean 10.488 μs (5 allocations, 23.61 KiB. GC mean 13.27%)

julia> gradient((x,u)->sum(dyn_static(x,u,d_tens)), x_tens, u_tens);  # sadly this doesn't work
ERROR: Need an adjoint for constructor Base.ReinterpretArray{Float64, 3, SVector{3, Float64}, Matrix{SVector{3, Float64}}, true}. Gradient is of type FillArrays.Fill{Float64, 3, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, Base.OneTo{Int64}}}

julia> @btime ForwardDiff.gradient(x->sum(dyn_static(x,u_tens,d_tens)), x_tens);  # add these two! 30ms
  min 24.067 ms, mean 26.520 ms (3709 allocations, 74.86 MiB. GC mean 12.99%)
julia> @btime ForwardDiff.gradient(u->sum(dyn_static(x_tens,u,d_tens)), u_tens);
  min 5.442 ms, mean 7.277 ms (1218 allocations, 25.15 MiB. GC mean 18.78%)

julia> @btime gradient((x,u)->sum(dyn_vectorized(x,u,d_tens)), x_tens, u_tens);
  min 101.167 μs, mean 165.642 μs (175 allocations, 463.67 KiB. GC mean 21.65%)

It probably wouldn’t be hard to make these reinterpret(...) calls Zygote-differentiable, maybe that would be better. But ForwardDiff doesn’t mind. Nor would Enzyme, I believe.


Thanks both @ChrisRackauckas and @mcabbott . I learned a lot from this post and this all will help me a lot in my future research!!!.

At the same time, I have to admit I am kind of sad that there is no magic command like broadcast that would make the performances equivalent (Static Array and allocating back are almost magical enough, but not quite).

A lot of people in the Julia community say that there is no need to vectorize code in Julia, and while this is true in the sense that I am not seeing the kind of performance degradation I would have in Python or Matlab, the vectorized code is still the simpler solution for me in this case.

1 Like

This is beyond my level but would like to ask if the “vectorization” presented with all the reshape() functions, is not cheating a bit? Just a doubt.

What do you mean by cheating?

Zygote is an AD designed for vectorized code, because most ML code is calling vectorized routines in some external library. To be clear, this is a trade-off: most people are going to be porting over code from TensorFlow, PyTorch, etc. which mandate vectorized routines (because much of modern ML literature is built on them, it’s a self-reinforcing cycle). I don’t think anyone would be caught dead writing a loop nest on a critical path in these frameworks, because it’s been ingrained in them that doing so will be incredibly inefficient.

However, all this also means that Zygote is absolutely not representative of the language in general. That’s why a bunch of alternative ADs that are loop/scalar iteration friendly were recommended. The magic you’re looking for is exactly ForwardDiff or Enzyme: both are just as idiomatic and will probably take less time to compile as well :wink:


No offense, just a doubt as mentioned. By vectorization of f(x) in Julia I thought about the application of f elementwise to x with the syntax f.(x)

None taken :wink:

Yes indeed, but this is not conceptually different (when one talks about vectorization) from the application of f “columwise” to x. Solutions such as eachcol or eachslice are good, but not on par with what I was getting with the vectorized code (albeit, I have not tested the performance using them with either ForwarDiff, ReverseDiff or Enzyme).

1 Like

Could you ellaborate on this?

FWIW, tried to code below with 2 for loops and it seems to be faster on my Win10 laptop than OP’s vectorized version.

(NB: code further improved with input below from @DNF)

function dyn_for2(x::Array{Tp,3},u::Array{Tp,3},d::Array{Tp,3}) where {Tp} 
    _, T, K =size(d)
    xout = copy(x)
    @inbounds for k in 1:K, t in 1:T
        xtk, utk = x[3,t,k], u[1,t,k]   # >10% speedup
        # stk, ctk = sincos(xtk)   # ~40% speedup to compute only once
        xout[1,t,k] += cos(xtk) - 0.5*sin(xtk)*utk + d[1,t,k]
        xout[2,t,k] += sin(xtk) + 0.5*cos(xtk)*utk + d[2,t,k]
        xout[3,t,k] += utk + d[3,t,k]
    return xout

This looks pretty efficient. Some comments:

You’re not creating any slices, only using scalar indices, so @views has no purpose here.

Type already has a meaning Types · The Julia Language so I don’t think you should use it as a type parameter.

You’re accessing the same indices over and over. I’m not sure if the compiler hoists the access out, but you can try to store them in temporary variables, and see if it makes a difference.

Don’t add type assertions like that, it’s just visual noise with no purpose.

Oh, and you’re calculating sin and cos twice for the same inputs here:


@DNF, thanks for the learnings.
The @views was a leftover from previous tests, I have removed it. As for the rest it’s a direct copy of OP’s code, but I admit it’s beyond my skills. Thanks again for your advice.

So checked the repeated indices but got no improvement by using cartesian:

ix1, ix2, ix3 = CartesianIndex.(((1,t,k), (2,t,k), (3,t,k)))

However, computing cos and sin only once, delivers about 40% speedup. Did not update code to stay in tune with OP.

1 Like

I was actually referring to accessing x[3,t,k] four times etc, instead of doing for example

x_temp = x[3,t,k]
u_temp = u[1,t,k]

and reusing those. I’m not sure if there will be an effect, but maybe.

1 Like

My mistake, should not have done that.

I have tried it in my computer, did not observe any significant change (when talking about the forward pass, I believe you are not taking gradients). I would expect the compiler to realize it can reuse the values. I tried to use @code_typed and @code_llvm but I am not good enough in Julia to see any difference.

PS: Wondering whether me not seeing any change has to do with having compiler optimization set to 3

In general, Julia isn’t good at communicating to LLVM that functions like sin or cos always produce the same results, so refactoring to only call them once is often smart. You can also use sincos which can compute sin and cos at the same time a little bit faster since the reduction math is shared.

1 Like

The macro @benchmark showed some ~10% improvement after doing this and it is surprising to me.

I’m totally sure about this, but I think that since arrays are mutable, and could potentially have been mutated concurrently from somewhere else, you cannot be completely certain that these values won’t have to be reloaded from the array each time, which takes a bit of time.

1 Like