I’ve tried my best to distill my problem into a minimal working example that illustrates exactly what’s wrong, instead of dumping my entire 2k+ line project here; was that the wrong approach? I feel pretty confident that I’ll be able to write my program to be efficient, but it’s hard when I don’t understand why Julia generates very slow code for operations I know can be written to run very fast.
(Likewise the Zygote issue, gradient(identity, 1) works pretty well!)
I see you’ve replied on the issue as well, but I’ll save other readers a click by saying: no, my problem wasn’t that I wanted to differentiate the identity function and got stuck.
I wasn’t suggesting that this isn’t a bottleneck for you, simply that focusing on the allocations per se (and keeping other things unchanged) may not be the best approach.
This is the quote I was referencing:
if this is really a bottleneck in your code then
Looking at it again, I might have been too rash, but at the time it looked like you were saying “If this is really the bottleneck in your code, then [you should do something completely different]”. My bad if this wasn’t what you meant.
E.g. if the xs are small, you can write very clean, idiomatic code using StaticArrays (as suggested by others), which should be rather fast, and Zygote will just work fine with it.
xs is indeed very small, and I tried the StaticArrays route, but see the issue linked by @mcabbott.
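For readers who haven’t used it, the StaticArrays route for the primal looks roughly like this. It’s only a sketch: build_matrix is a hypothetical helper, not code from my project, and it says nothing about the Zygote issue above.

```julia
using StaticArrays

# Hypothetical helper (not from the original project): packs nine scalars
# into a 3×3 SMatrix. SMatrix is an immutable, fixed-size type, so building
# one inside a function does not need a heap-allocated array.
function build_matrix(a, b, c, d, e, f, g, h, i)
    # @SMatrix accepts an ordinary matrix literal, including variables.
    return @SMatrix [a b c; d e f; g h i]
end

m = build_matrix(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0)

# Arithmetic between SMatrix values also returns an SMatrix.
m2 = m * m
```

The size is part of the type here, which is what lets the compiler unroll operations and keep everything off the heap.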
The Julia ecosystem has evolved some very efficient techniques for dealing with the issues you are facing, but it is very hard to help without context.
My issue was that the simplest way I could think of to define a single matrix of 9 elements ended up making 13 allocations, almost 150% of the number of elements. This was very surprising to me, and having spent some time in my life looking at compiler output, I do have a feeling for when the compiler is doing the Wrong Thing, and in this case it 100% was.
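For anyone wanting to reproduce this kind of count, one lightweight option is Base’s @allocations macro, which returns the number of allocations rather than bytes (it needs Julia 1.9 or later; BenchmarkTools’ @btime also prints an allocation count). A minimal sketch, where make_matrix is a hypothetical stand-in since my original constructor isn’t shown in this post:

```julia
# Hypothetical stand-in for a small 3×3 constructor written as a plain
# matrix literal (not the code from the original post).
function make_matrix(a, b, c)
    return [a b c; b c a; c a b]
end

make_matrix(1.0, 2.0, 3.0)  # warm-up call: the first call includes compilation

# Count heap allocations for a single, already-compiled call.
n = Base.@allocations make_matrix(1.0, 2.0, 3.0)
println("heap allocations: ", n)
```

The exact number depends on the Julia version, so I’m not quoting one here.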
Now, while using the broadcasting version suggested by @DNF did fix it, I would like to better understand why Julia generated sub-par code before, so that I don’t have to turn over every stone in my program or run benchmarks every time I write code just to avoid bad code being generated. Furthermore, since I’ve already run into these problems with the optimizer for the primal, and I’m differentiating this with AD, I’d like to make the compiler’s job as simple as possible so that it can do what it does best. This includes not allocating unless you really need to, even if the allocation doesn’t immediately show up as a hot spot in a profile.
I’m not sure if this is a fruitful approach, but if mutation is the only way to get the performance you need, then perhaps the optimal strategy is to use something like FiniteDiff.jl for calculating the pullbacks of the mutating code and then using Zygote’s custom adjoint machinery to embed that, so that the rest of your code can be handled by Zygote.
Thanks! I’ll definitely look into it.
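For other readers, here is a rough sketch of what that suggestion could look like. It assumes the mutating part can be wrapped in an out-of-place function; kernel!, kernel and loss are hypothetical names, not code from my project.

```julia
using Zygote, FiniteDiff

# Hypothetical mutating kernel: fills `out` in place. Zygote can't
# differentiate through this mutation directly.
function kernel!(out, x)
    for i in eachindex(out, x)
        out[i] = x[i]^2
    end
    return out
end

# Out-of-place wrapper that the rest of the Zygote-differentiated code calls.
kernel(x) = kernel!(similar(x), x)

# Custom adjoint: the forward pass runs kernel as plain Julia, and the
# pullback uses a finite-difference Jacobian instead of tracing the mutation.
Zygote.@adjoint function kernel(x)
    y = kernel(x)
    J = FiniteDiff.finite_difference_jacobian(kernel, x)
    # collect() materializes Δ in case it arrives as a lazy array type.
    return y, Δ -> (J' * collect(Δ),)
end

loss(x) = sum(kernel(x))
g, = Zygote.gradient(loss, [1.0, 2.0, 3.0])
```

Building a dense Jacobian is only reasonable because the inputs here are tiny; for larger inputs a hand-written pullback would be the better choice.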
Oh dear, I realize just now that I’m completely butchering the reply system on Discourse.