On simple reduction tasks Zygote seems to perform exceptionally poorly. I would be grateful to hear why, and what alternative patterns I should use to avoid these problems. I’ve started implementing custom adjoints via `rrule`, but if I do this for every little piece of code then I might as well implement all gradients myself and forget about Zygote.
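
For concreteness, a custom adjoint of this kind could look roughly like the sketch below. It assumes the ChainRulesCore package is available (Zygote picks up rules defined through it), and `sumsq` is a hypothetical name used only for illustration.

```julia
using ChainRulesCore

# Hypothetical example function: sum of squares written as a mapreduce.
sumsq(x) = mapreduce(xi -> xi^2, +, x)

# Hand-written reverse rule: the gradient of sum(xi^2) with respect to x is 2x.
function ChainRulesCore.rrule(::typeof(sumsq), x::AbstractVector{<:Real})
    y = sumsq(x)
    # ȳ is the incoming scalar cotangent; the function itself has no tangent.
    sumsq_pullback(ȳ) = (NoTangent(), 2 .* unthunk(ȳ) .* x)
    return y, sumsq_pullback
end
```

With such a rule in place, `Zygote.gradient(sumsq, x)` should call the pullback instead of differentiating through `mapreduce`, which is exactly the per-function manual work I would like to avoid.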

```
using BenchmarkTools, Zygote

f(x) = mapreduce(xi -> xi^2, +, x)
x = rand(100)
f(x)                             # warm-up call before timing
Zygote.gradient(f, x)            # warm-up call before timing
@btime f($x)                     # 11.128 ns (0 allocations: 0 bytes)
@btime Zygote.gradient($f, $x)   # 480.792 μs (3369 allocations: 191.02 KiB)
```

We can do a little better like this:

```
using BenchmarkTools, Zygote

_sq(xi) = xi^2
f(x) = sum(_sq, x)
x = rand(100)
f(x)                             # warm-up call before timing
Zygote.gradient(f, x)            # warm-up call before timing
@btime f($x)                     # 14.658 ns (0 allocations: 0 bytes)
@btime Zygote.gradient($f, $x)   # 18.477 μs (392 allocations: 12.30 KiB)
```

But why is this so much faster than the `mapreduce` version?

And even better like this:

```
using BenchmarkTools, Zygote

f(x) = sum(x.^2)
x = rand(100)
f(x)                             # warm-up call before timing
Zygote.gradient(f, x)            # warm-up call before timing
@btime f($x)                     # 117.833 ns (1 allocation: 896 bytes)
@btime Zygote.gradient($f, $x)   # 239.328 ns (3 allocations: 1.77 KiB)
```

But even that third example doesn’t come anywhere near the 20-40 ns that I would expect here.
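
As a point of comparison, a hand-written gradient for this function could be sketched as below. It is only a baseline to benchmark against, under the assumption that the output buffer is preallocated; `sumsq_grad!` is a hypothetical name, not part of Zygote or any other package.

```julia
# Hypothetical baseline: write the gradient of sum(xi^2), which is 2x,
# into a preallocated buffer g without allocating.
function sumsq_grad!(g::AbstractVector, x::AbstractVector)
    @. g = 2 * x
    return g
end

x = rand(100)
g = similar(x)      # allocate the output buffer once, outside the hot path
sumsq_grad!(g, x)
```

Timing this with `@btime sumsq_grad!($g, $x)` would give a number to compare the Zygote results against.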