Not using PreallocationTools.jl correctly

I somehow managed to solve this, though I don’t understand exactly why this solution works. Basically, it has to do with the chunking in ForwardDiff.jl.

(This reply, How to make this function compatible with ForwardDiff? - #9 by ChrisRackauckas pointed me to the solution).

Staying with the code I posted above, the following does the trick:

aux(x) = logl!(C,x)
cfg1 = ForwardDiff.GradientConfig(aux, W, ForwardDiff.Chunk{1}())
# The call below is now fast and allocates far less
ForwardDiff.gradient(aux, W, cfg1) 

However: even if the above works, I am not sure why it does. Hence, to anyone that has the same problem and is reading this, I am not sure whether this is a good solution. Perhaps more knowledgeable people can offer a comment.

Also: letting ForwardDiff.jl decide on its own about the chunking, works even better:

cfg1 = ForwardDiff.GradientConfig(aux, W, ForwardDiff.Chunk{1}());
@btime ForwardDiff.gradient(aux, W, cfg1)
  10.771 μs (161 allocations: 39.89 KiB)

# VS

cfg = ForwardDiff.GradientConfig(aux, W);
@btime ForwardDiff.gradient(aux, W, cfg)
  2.956 μs (21 allocations: 18.17 KiB)

Obviously, it doesn’t beat the Enzyme solution above.