Hey,

I’m using Zygote for a minimization problem. Writing the *loss* function with explicit for-loops is about 2x faster than the vectorized version. I then need its gradient.

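Zygote performs very poorly on the for-loop function: each gradient call takes more than a second and allocates about 7 GiB. The vectorized version is OK.

But even for the vectorized version, the Zygote gradient is roughly 5-10x slower than evaluating the function itself.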

I was wondering whether I’m doing this correctly, or whether there are tricks to improve the speed. I couldn’t find any good online resources on this. I tried removing `@inbounds` and `@views`, but the timings stay roughly the same.

```
using Zygote

# Vectorized loss: sum of the squared 5-point Laplacian over the interior points.
function laplace_vec(rec)
    @views a = (rec[1:end-2, 2:end-1] .- 4 .* rec[2:end-1, 2:end-1]
                .+ rec[3:end, 2:end-1])
    @views b = (rec[2:end-1, 1:end-2]
                .+ rec[2:end-1, 3:end])
    return sum((a .+ b) .^ 2)
end

# Same loss with explicit loops (inner loop over rows, matching column-major storage).
function laplace_for(rec)
    res = zero(eltype(rec))
    for j = 2:size(rec, 2) - 1
        for i = 2:size(rec, 1) - 1
            @inbounds res += (rec[i - 1, j] + rec[i + 1, j]
                              + rec[i, j + 1] + rec[i, j - 1] - 4 * rec[i, j])^2
        end
    end
    return res
end

x = rand(100, 100)

# Each call is timed twice; the first call can include compilation time.
@time a = laplace_vec(x)
@time a = laplace_vec(x)
@time b = laplace_for(x)
@time b = laplace_for(x)
println("Zygote")
@time Zygote.gradient(laplace_vec, x)
@time Zygote.gradient(laplace_vec, x)
@time Zygote.gradient(laplace_for, x)
@time Zygote.gradient(laplace_for, x)
```

The performance results are roughly:

```
0.000059 seconds (12 allocations: 225.750 KiB)
0.000051 seconds (12 allocations: 225.750 KiB)
0.000021 seconds (1 allocation: 16 bytes)
0.000015 seconds (1 allocation: 16 bytes)
Zygote
0.000233 seconds (174 allocations: 1.424 MiB)
0.000315 seconds (174 allocations: 1.424 MiB)
1.151743 seconds (836.62 k allocations: 7.183 GiB, 19.52% gc time)
1.144881 seconds (836.62 k allocations: 7.183 GiB, 18.95% gc time)
```

Can anybody help me?

Thanks,

Felix