Hey all,

I've worked with TensorFlow 1.x/2.x, PyTorch, and JAX for years, and I've now found the Julia language, where I see a lot more opportunity.

I'd like to ask everyone, so we can learn from each other: how would you optimise code like this? What is the maximum speed you can reach with it?

```julia
using Flux
using Statistics  # `mean` lives here

batchsize, timesteps = 2000, 200
data = (randn(Float32, timesteps, batchsize, 2),
        randn(Float32, timesteps, batchsize, 1))

init_norm() = shape -> randn(Float32, shape)
mse_fn(Y_pred, Y_true) = mean((Y_pred .- Y_true) .^ 2)

function predict_fn_const(prms, inputs, state)
    # SOME random multiplications standing in for the real cell body...
    i1, i2 = inputs[:, 1:1], inputs[:, 2:2]
    v1 = prms[1]
    o14 = i1 + i2
    o104 = o14 .* v1
    r206 = state[1]
    out, next_states = o104 + r206, [o104]
    return next_states, out
end

function model(prms, data)
    X, Y = data
    # Split the time axis into a vector of (batchsize, 2) matrices.
    X_t = [X[t, :, :] for t in 1:timesteps]::Vector{Matrix{Float32}}
    initial_state = ones(Float32, batchsize, 1)
    state = [initial_state]
    predict_cell(state, inputs) = predict_fn_const(prms, inputs, state)
    cell = Flux.Recur(predict_cell, state)
    ylist = cell.(X_t)::Vector{Matrix{Float32}}
    # Concatenate the per-timestep outputs back into (timesteps, batchsize, 1).
    predict = cat(dims=2, ylist...)'
    predict = reshape(predict, size(predict)..., 1)
    mse_fn(predict, Y)
end

function init(data)
    prms = [init_norm()((batchsize, 1))]
    loss = model(prms, data)
    opt = ADAM(0.06, (0.3, 0.7))
    loss, prms, opt
end

function gradient_pro(f, args...)
    @time y, back = Flux.pullback(f, args...)  # forward-pass time
    return y, back(one(y))
end

function step(data, prms, opt)
    loss, grads = gradient_pro(p -> model(p, data), prms)
    # Apply the gradient; grads[1] matches the layout of prms.
    for (p, g) in zip(prms, grads[1])
        Flux.Optimise.update!(opt, p, g)
    end
    return loss, prms, opt
end

function test()
    loss, prms, opt = init(data)
    for i in 1:10
        @time loss, prms, opt = step(data, prms, opt)  # forward + backward time
    end
    loss
end

@time test()
```

My run results:

```
0.838855 seconds (1.48 M allocations: 89.775 MiB)
1.736759 seconds (2.47 M allocations: 1.347 GiB, 2.03% gc time)
0.012596 seconds (36.88 k allocations: 16.770 MiB)
0.531656 seconds (53.78 k allocations: 1.228 GiB, 25.55% gc time)
0.011113 seconds (36.88 k allocations: 16.770 MiB)
0.430338 seconds (53.78 k allocations: 1.228 GiB, 52.53% gc time)
0.008225 seconds (36.88 k allocations: 16.770 MiB)
0.353133 seconds (53.78 k allocations: 1.228 GiB, 2.43% gc time)
0.011098 seconds (36.88 k allocations: 16.770 MiB)
0.282289 seconds (53.78 k allocations: 1.228 GiB, 1.17% gc time)
0.007328 seconds (36.88 k allocations: 16.770 MiB)
0.223698 seconds (53.78 k allocations: 1.228 GiB, 15.76% gc time)
0.006849 seconds (36.88 k allocations: 16.770 MiB)
0.406923 seconds (53.78 k allocations: 1.228 GiB)
0.007938 seconds (36.88 k allocations: 16.770 MiB)
0.410816 seconds (53.78 k allocations: 1.228 GiB, 37.27% gc time)
0.006129 seconds (36.88 k allocations: 16.770 MiB)
0.218867 seconds (53.78 k allocations: 1.228 GiB)
0.007709 seconds (36.88 k allocations: 16.770 MiB)
0.397147 seconds (53.78 k allocations: 1.228 GiB, 11.62% gc time)
5.268901 seconds (2.95 M allocations: 12.403 GiB, 17.46% gc time)
5.495894 seconds (3.26 M allocations: 12.430 GiB, 16.74% gc time)
```

I'm fairly confident that a 10x speed-up should be possible.
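
For what it's worth, one change I'm considering (a sketch only, I haven't benchmarked it): `reduce(hcat, ylist)` instead of the splatted `cat(dims=2, ylist...)`. As far as I can tell, Zygote/ChainRules has a dedicated adjoint for `reduce(hcat, ...)`, while splatting 200 matrices into `cat` gives the pullback 200 separate arguments to handle:

```julia
batchsize, timesteps = 4, 3  # tiny sizes just to show the shapes
ylist = [randn(Float32, batchsize, 1) for _ in 1:timesteps]

# Same result as `cat(dims=2, ylist...)'` in the model above, but
# `reduce(hcat, ...)` should hit a specialized rule in the backward pass.
predict = reduce(hcat, ylist)'                   # (timesteps, batchsize)
predict = reshape(predict, size(predict)..., 1)  # (timesteps, batchsize, 1)
```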

Also, the roughly 50x difference between the backward pass and the forward pass seems rather large. Can you help me?
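
My main suspect is the broadcast over `Flux.Recur` producing 200 small matrices that the pullback then has to stitch back together. Here's a variant I sketched (unverified for speed, with the cell body inlined) that writes each timestep's output straight into a `Zygote.Buffer` instead:

```julia
using Flux, Zygote
using Statistics

# Sketch: the same toy cell as `predict_fn_const` above, but the
# per-timestep outputs go into a Zygote.Buffer rather than a vector
# of matrices that gets splatted into `cat` afterwards.
function model_buffered(prms, data)
    X, Y = data
    T, B = size(X, 1), size(X, 2)
    state = ones(Float32, B, 1)
    ybuf = Zygote.Buffer(Y)  # differentiable buffer, shape (T, B, 1)
    for t in 1:T
        o = (X[t, :, 1:1] + X[t, :, 2:2]) .* prms[1]  # the cell's multiplications
        ybuf[t, :, 1] = vec(o + state)                # out = o + previous state
        state = o                                     # next state
    end
    mean((copy(ybuf) .- Y) .^ 2)
end

# Tiny smoke test of the sketch:
X = randn(Float32, 3, 4, 2); Y = randn(Float32, 3, 4, 1)
prms = [randn(Float32, 4, 1)]
grads = Zygote.gradient(p -> model_buffered(p, (X, Y)), prms)
```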

P.S. I also made a ReverseDiff.jl solution, which is about 5x faster, but I couldn't see whether GPU is viable there, so I have to stick with Flux + Zygote.
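
For completeness, this is my (untested) understanding of how the Flux version would move to GPU, which is why I'm staying with it:

```julia
using Flux, CUDA  # assumes a working CUDA.jl installation

# Sketch: move data and parameters over with Flux's `gpu`. The model is
# only slicing and broadcasts, so it should be GPU-generic, apart from the
# hardcoded CPU `ones(Float32, batchsize, 1)` initial state inside `model`,
# which would also need wrapping in `gpu(...)`.
data_gpu = map(gpu, data)
prms_gpu = map(gpu, prms)
loss = model(prms_gpu, data_gpu)
```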