Suggestions to improve Zygote performance for simple vector map/broadcast/comprehension?

I’ve got some code that builds some moderate-sized vectors, which I’d like to differentiate through. Here’s an example building these vectors with a comprehension (the issue is the same if I use map or broadcast):

using Zygote, BenchmarkTools
build_vector(x) = [i<500 ? x : 0 for i=1:1000]
@btime gradient(x -> sum(build_vector(x)), 1) # ~2ms

This is in a performance-critical inner loop, and it turns out to be ~1000 times slower than if I write the adjoint by hand:

using Zygote: @adjoint
build_vector_with_adjoint(x) = build_vector(x)
@adjoint function build_vector_with_adjoint(x)
    y = build_vector(x)
    function back(Δ)
        b = [i<500 ? 1 : 0 for i=1:1000]
        (b'Δ,)
    end
    y, back
end
@btime gradient(x->sum(build_vector_with_adjoint(x)), 1) # ~2μs
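For reference, the pullback in the hand-written adjoint can be checked with a short derivation. The symbols $b$, $y$ and $\bar{x}$ are notation introduced here for illustration: $b$ is the 0/1 mask built inside back, $y$ is the output vector, and $\bar{x}$ is the gradient passed back for x.

```latex
y_i = b_i\,x, \qquad
b_i = \begin{cases} 1 & i < 500 \\ 0 & \text{otherwise} \end{cases}
\qquad\Longrightarrow\qquad
\bar{x} = \sum_{i=1}^{1000} \frac{\partial y_i}{\partial x}\,\Delta_i
        = \sum_{i=1}^{1000} b_i\,\Delta_i
        = b^{\top}\Delta .
```

When the outer function is sum, every $\Delta_i$ is 1, so $\bar{x} = \sum_i b_i = 499$.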

I don’t think I’m cheating too badly with this custom adjoint; it seems like this is basically what Zygote should be writing for me. Profiling does show some dynamic dispatch deep in the Zygote call tree, but I’m not familiar enough with the internals to make sense of it. The Zygote broadcast.jl source code has some comments alluding to performance hits and generic fallbacks, so maybe I’m inadvertently hitting something here? Any other suggestions for gaining some performance without writing custom adjoints (which in my real non-MWE would, I think, be far more painful than here)? Thanks.

Here are a few ideas, depending on how close this is to your non-MWE:

julia> build_vector(x) = [i<500 ? x : zero(x) for i=1:1000];

julia> @btime Zygote.gradient(x -> sum(build_vector(x)), 1)
  2.490 ms (15573 allocations: 644.61 KiB)
(499,)

julia> @btime ForwardDiff.derivative(x -> sum(build_vector(x)), 1)
  1.470 μs (1 allocation: 15.75 KiB)
499

julia> @btime Zygote.gradient(x -> sum(Zygote.forwarddiff(build_vector,x)), 1)
  6.590 μs (28 allocations: 40.34 KiB)
(499,)

Thanks, that’s helpful to see. In my non-MWE, x is ~10-dimensional, so I wanted to use reverse mode, but maybe this still wins out; I can try it.
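For the ~10-dimensional case, the same Zygote.forwarddiff pattern might look roughly like the sketch below. The function build_vector_nd and the way it cycles through the entries of x are made up for illustration, since the real non-MWE isn't shown; only the wrapping of the vector-building step in Zygote.forwarddiff is carried over from the benchmark above. Whether this beats plain reverse mode on the real code would need to be benchmarked.

```julia
using Zygote

# Hypothetical stand-in for the real code: x is a 10-element parameter
# vector, and the first 499 entries of the output cycle through x.
build_vector_nd(x) = [i < 500 ? x[mod1(i, length(x))] : zero(eltype(x)) for i in 1:1000]

x = collect(1.0:10.0)

# Reverse mode on the outside (sum), forward mode for the vector-building step.
g, = Zygote.gradient(x -> sum(Zygote.forwarddiff(build_vector_nd, x)), x)

println(g)
```

Each entry of g counts how many times the corresponding x[k] appears among the first 499 outputs, so the entries should add up to 499.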