This is a known issue. Every indexing operation makes zero(rec) to accumulate the gradient, which is why the scalar indexing in laplace_for is so much slower. Here’s one issue about this:
https://github.com/FluxML/Zygote.jl/issues/644
I have a package solution, although not yet registered:
julia> using Pkg; pkg"add https://github.com/mcabbott/Tullio.jl"
julia> using Tullio
julia> laplace_tul(rec) = @tullio res = (rec[i - 1, j] + rec[i+1, j] +
rec[i, j+1] + rec[i, j-1] - 4 * rec[i,j])^2
julia> @btime Zygote.gradient(laplace_vec, $x); # 18.673 μs forwards
269.441 μs (167 allocations: 1.42 MiB)
julia> @btime Zygote.gradient(laplace_for, $x); # 10.977 μs forwards
4.892 s (836613 allocations: 7.18 GiB)
julia> @btime Zygote.gradient(laplace_tul, $x); # 3.811 μs forwards
44.920 μs (41 allocations: 79.34 KiB)