I somehow managed to solve this, though I don’t understand exactly why this solution works. Basically, it has to do with the chunking in ForwardDiff.jl.
(This reply, How to make this function compatible with ForwardDiff? - #9 by ChrisRackauckas pointed me to the solution).
Staying with the code I posted above, the following does the trick:
aux(x) = logl!(C,x)
cfg1 = ForwardDiff.GradientConfig(aux, W, ForwardDiff.Chunk{1}())
# The call below is now fast and allocates far less
ForwardDiff.gradient(aux, W, cfg1)
However: even if the above works, I am not sure why it does. Hence, to anyone that has the same problem and is reading this, I am not sure whether this is a good solution. Perhaps more knowledgeable people can offer a comment.
Also: letting ForwardDiff.jl decide on its own about the chunking, works even better:
cfg1 = ForwardDiff.GradientConfig(aux, W, ForwardDiff.Chunk{1}());
@btime ForwardDiff.gradient(aux, W, cfg1)
10.771 μs (161 allocations: 39.89 KiB)
# VS
cfg = ForwardDiff.GradientConfig(aux, W);
@btime ForwardDiff.gradient(aux, W, cfg)
2.956 μs (21 allocations: 18.17 KiB)
Obviously, it doesn’t beat the Enzyme solution above.