I am not new to Julia, but I am not really a regular user either. I am trying to get non-vectorized code to match the performance of vectorized code, but I cannot manage it. Here is the problem in broad strokes:
I have a function `f(x,u,d)` representing a discrete-time dynamical system. It takes three arguments, where `x`, `u`, and `d` are (column) vectors, and it returns an output. I am currently working with this particular function:
```julia
function f(x, u, d)
    x + [cos(x[3]); sin(x[3]); u] + 0.5*[-sin(x[3])*u; cos(x[3])*u; 0.0] + d
end
```
which represents a discrete-time Dubins vehicle with a single input, but this should not matter too much.
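For concreteness, a single evaluation looks like this (toy values of my own choosing; note that `u` is a length-1 vector so the vertical concatenations line up, and `f` is repeated so the snippet runs standalone):

```julia
function f(x, u, d)
    x + [cos(x[3]); sin(x[3]); u] + 0.5*[-sin(x[3])*u; cos(x[3])*u; 0.0] + d
end

# x = (position_x, position_y, heading), u = (turn input,), d = disturbance
x = [0.0, 0.0, 0.0]
u = [0.1]
d = zeros(3)
println(f(x, u, d))  # → [1.0, 0.05, 0.1]
```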
I want to calculate the output of this function for several `x`, `u`, and `d`, organized in this case as tensors: dimensions 3×T×K for `x_tens` and `d_tens`, and 1×T×K for `u_tens`. I cannot use `broadcast` because my function needs to access particular elements of `x`, and `map` and `mapslices` do not work because I have more than one input. I have tried to implement this in three different ways:
- By using `eachcol` and multiple dispatch:

```julia
function dyn_eachcol(x::Matrix{Type}, u::Matrix{Type}, d::Matrix{Type}) where {Type}
    reduce(hcat, f.(eachcol(x), eachcol(u), eachcol(d)))::Matrix{Type}
end

function dyn_eachcol(x::Array{Type,3}, u::Array{Type,3}, d::Array{Type,3}) where {Type}
    T = size(d, 2)
    K = size(d, 3)
    x = reshape(x, :, T * K)
    d = reshape(d, :, T * K)
    u = reshape(u, :, T * K)
    reshape(dyn_eachcol(x, u, d), :, T, K)::Array{Type,3}
end
```

(Instead of `reshape`, one could use `eachslice`, but the result is a little faster this way.)
- Using a `for` loop:

```julia
function dyn_for(x::Array{Type,3}, u::Array{Type,3}, d::Array{Type,3}) where {Type}
    T = size(d, 2)
    K = size(d, 3)
    xout = similar(x)
    for k = 1:K
        for t = 1:T
            @views xout[:, t, k] = f(x[:, t, k], u[:, t, k], d[:, t, k])
        end
    end
    return xout::Array{Type,3}
end
```
- Vectorizing:

```julia
function dyn_vectorized(x::Array{Type,3}, u::Array{Type,3}, d::Array{Type,3}) where {Type}
    T = size(d, 2)
    K = size(d, 3)
    x3 = reshape(x[3, :, :], 1, T, K)
    x + [cos.(x3); sin.(x3); u] + 0.5*[-sin.(x3) .* u; cos.(x3) .* u; zeros(1, T, K)] + d
end
```
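As a sanity check (my own toy harness, not part of the benchmarks below), the vectorized version agrees with evaluating `f` slice by slice; the definitions are repeated here so the snippet runs standalone:

```julia
function f(x, u, d)
    x + [cos(x[3]); sin(x[3]); u] + 0.5*[-sin(x[3])*u; cos(x[3])*u; 0.0] + d
end

function dyn_vectorized(x, u, d)
    T = size(d, 2)
    K = size(d, 3)
    x3 = reshape(x[3, :, :], 1, T, K)
    x + [cos.(x3); sin.(x3); u] + 0.5*[-sin.(x3) .* u; cos.(x3) .* u; zeros(1, T, K)] + d
end

T, K = 4, 3
x = randn(3, T, K); u = randn(1, T, K); d = randn(3, T, K)
y = dyn_vectorized(x, u, d)
# Compare every 3×1 output slice against a direct call to f
ok = all(y[:, t, k] ≈ f(x[:, t, k], u[:, t, k], d[:, t, k]) for t in 1:T, k in 1:K)
println(ok)  # → true
```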
When I benchmark the speed of each of these approaches I get something like

```julia
@benchmark dyn_eachcol(x_tens, u_tens, d_tens)    # median = 32 ms
@benchmark dyn_for(x_tens, u_tens, d_tens)        # median = 24 ms
@benchmark dyn_vectorized(x_tens, u_tens, d_tens) # median = 4 ms
```
So the vectorized code is faster, but not dramatically so. What I really want, though, is to take the gradient of the result using Zygote. Here are some benchmarks:
```julia
@benchmark gradient((x, u) -> sum(dyn_eachcol(x, u, d_tens)), x_tens, u_tens)    # median = 823 ms
@benchmark gradient((x, u) -> dyn_for_sum(x, u, d_tens), x_tens, u_tens)         # median = 90277 ms (not a typo)
@benchmark gradient((x, u) -> sum(dyn_vectorized(x, u, d_tens)), x_tens, u_tens) # median = 12 ms
```
Note: for the second case, Zygote does not support array mutation, so in the variant `dyn_for_sum` the line `xout[:,t,k] = f(x[:,t,k], u[:,t,k], d[:,t,k])` is replaced by `out += sum(f(x[:,t,k], u[:,t,k], d[:,t,k]))`.
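Concretely, the mutation-free loop variant looks like this (a reconstruction following the substitution above; `f` is repeated so the snippet runs standalone):

```julia
function f(x, u, d)
    x + [cos(x[3]); sin(x[3]); u] + 0.5*[-sin(x[3])*u; cos(x[3])*u; 0.0] + d
end

# Zygote does not support array mutation, so instead of writing into
# xout[:, t, k] we accumulate the scalar sum of every output slice.
function dyn_for_sum(x, u, d)
    T = size(d, 2)
    K = size(d, 3)
    out = zero(eltype(x))
    for k = 1:K, t = 1:T
        @views out += sum(f(x[:, t, k], u[:, t, k], d[:, t, k]))
    end
    return out
end

dyn_for_sum(zeros(3, 2, 2), zeros(1, 2, 2), zeros(3, 2, 2))  # → 4.0
```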
This is a dramatic difference. I don't think the vectorized code is calling any BLAS routines; this should be pure Julia. Yet the results are so different. What am I missing? How can I write non-vectorized code that achieves the speed of the vectorized one?