I am not new to Julia, but I am not really a regular user either. I am trying to make a code that is non vectorized to achieve the performance of a vectorized one, but I cannot do it. Here is the problem in general lines:

I have a function `f(x,u,d)`

representing a discrete time dynamical system, that takes 3 arguments, where `x`

,`u`

and `d`

are (column) vectors, and it returns an output. I am currently working with the particular function

```
function f(x,u,d)
x+[cos(x[3]);sin(x[3]);u]+0.5*[-sin(x[3])*u;cos(x[3])*u;0.]+d;
end
```

which represents a discrete time Dubinsâ vehicle with a single input, but this should not matter too much.

I want to calculate the output of this function for several `x`

,`u`

and `d`

, in this case, organized in tensors of dimensions 3xTxK for `x_tens`

and `d_tens`

and a 1xTxK tensor for `u_tens`

. I cannot use broadcast because my function needs to access particular elements of `x`

. The functions map and mapslices do not work because I have more than one input. I have tried to implement this in three different ways:

- By using
`eachcol`

and multiple dispatch:

```
function dyn_eachcol(x::Matrix{Type},u::Matrix{Type},d::Matrix{Type}) where {Type}
reduce(hcat,f.(eachcol(x),eachcol(u),eachcol(d))) ::Matrix{Type}
end
function dyn_eachcol(x::Array{Type,3},u::Array{Type,3},d::Array{Type,3}) where {Type}
T=size(d)[2]
K=size(d)[3]
x=reshape(x,:,T*K)
d=reshape(d,:,T*K)
u=reshape(u,:,T*K)
reshape(dyn_eachcol(x,u,d),:,T,K) ::Array{Type,3}
end
```

(instead of using reshape, one could use eachslice, but the result is a little bit faster this way)

- Using a for loop:

```
function dyn_for(x::Array{Type,3},u::Array{Type,3},d::Array{Type,3}) where {Type}
T=size(d)[2]
K=size(d)[3]
xout=similar(x)
for k=1:K
for t=1:T
@views xout[:,t,k]=f(x[:,t,k],u[:,t,k],d[:,t,k])
end
end
return xout ::Array{Type,3}
end
```

- Vectorizing:

```
function dyn_vectorized(x::Array{Type,3},u::Array{Type,3},d::Array{Type,3}) where {Type}
T=size(d)[2]
K=size(d)[3]
x+[cos.(reshape(x[3,:,:],1,:,K));sin.(reshape(x[3,:,:],1,:,K));u]+0.5*[-sin.(reshape(x[3,:,:],1,:,K)).*u;cos.(reshape(x[3,:,:],1,:,K)).*u;zeros(1,T,K)]+d
end
```

When I benchmark the speed of each one of this approaches I get a something like

```
@benchmark dyn_eachcol(x_tens,u_tens,d_tens) # median = 32ms
@benchmark dyn_for(x_tens,u_tens,d_tens) # median = 24ms
@benchmark dyn_vectorized(x_tens,u_tens,d_tens) # median = 4ms
```

So, the vectorized code is a little bit faster, but not that much. But what I really want to do is to take the gradient of the result, using Zygote. Here are some benchmarks:

```
@benchmark gradient((x,u)->sum(dyn_eachcol(x,u,d_tens)),x_tens,u_tens) # median = 823 ms
@benchmark gradient((x,u)->dyn_for_sum(x,u,d_tens),x_tens,u_tens) # median = 90277 ms (this is not a typo)
@benchmark gradient((x,u)->sum(dyn_vectorized(x,u,d_tens)),x_tens,u_tens) #median = 12 ms
```

Note: for the second example Zygote does not support mutation arrays, so the line xout[:,t,k]=f(x[:,t,k],u[:,t,k],d[:,t,k]) is substituted by out+=sum(f(x[:,t,k],u[:,t,k],d[:,t,k]))

This is a dramatic difference. I donât think the vectorized code is calling any BLAS routine, this should be pure Julia. Yet, the results are so different. What am I missing? How can I write a non vectorized code which achieves the speed of the vectorized one?