I’ve got some code that builds some moderately sized vectors, which I’d like to differentiate through. Here’s an example building these vectors with a comprehension (the issue is the same if I use
    using Zygote, BenchmarkTools

    build_vector(x) = [i<500 ? x : 0 for i=1:1000]
    @btime gradient(x -> sum(build_vector(x)), 1) # ~2ms
This is in a performance-critical inner loop, and it turns out to be ~1000 times slower than if I write the adjoint by hand:
    using Zygote: @adjoint

    build_vector_with_adjoint(x) = build_vector(x)

    @adjoint function build_vector_with_adjoint(x)
        y = build_vector(x)
        function back(Δ)
            # derivative of each element with respect to x
            b = [i<500 ? 1 : 0 for i=1:1000]
            (b'Δ,)
        end
        y, back
    end

    @btime gradient(x->sum(build_vector_with_adjoint(x)), 1) # ~2μs
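As a quick sanity check that the hand-written pullback is at least computing the right number, here is a minimal sketch in plain Julia (no Zygote involved) that compares it against a central finite difference. The helper names `pullback_by_hand` and `finite_difference` are made up for this illustration, and the step size `h` is an arbitrary choice.

```julia
using LinearAlgebra  # stdlib, only needed for `dot`

# Same function as above.
build_vector(x) = [i < 500 ? x : 0 for i = 1:1000]

# Hypothetical helper: the body of the hand-written `back`, applied to a cotangent Δ.
function pullback_by_hand(Δ)
    b = [i < 500 ? 1 : 0 for i = 1:1000]  # derivative of each element w.r.t. x
    return dot(b, Δ)
end

# Hypothetical helper: central finite difference of f at x.
finite_difference(f, x; h = 1e-3) = (f(x + h) - f(x - h)) / (2h)

f(x) = sum(build_vector(x))

g_hand = pullback_by_hand(ones(1000))  # the cotangent coming out of `sum` is all ones
g_fd = finite_difference(f, 1.0)

# Both should agree: indices 1 through 499 each contribute 1.
@assert g_hand ≈ g_fd
```

This only checks the value of the gradient, not the timing, so it says nothing about where the slowdown in the generic path comes from.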
I don’t think I’m cheating too badly with this custom adjoint; it seems like this is basically what Zygote should be writing for me. Profiling does show some dynamic dispatch deep in the Zygote call tree, but I’m not familiar enough with the internals to make sense of it. The Zygote broadcast.jl source code has some comments alluding to performance hits and generic fallbacks, so maybe I’m inadvertently hitting something there? Any other suggestions for gaining some performance without writing custom adjoints (which in my real, non-MWE code would, I think, be far more painful than here)? Thanks.