Consider the following function, which records the model's output after every layer:
```julia
function track(model, input)
    track = Vector{Vector{Float64}}()  # one entry per layer output
    p = input
    for layer in model
        p = layer(p)
        track = vcat(track, [p])  # allocates a new outer vector every iteration
    end
    return track
end
```
which is unfortunately quite inefficient. Is there a way to store the outputs more efficiently, in preallocated memory? Ideally in a multi-dimensional array?
Sure, if you know the size(s) beforehand (and things are rectangular, i.e. the output of each layer has the same size). E.g. for 5 layers, each with an output of length 64, you can initialize a multi- (i.e. 2-)dimensional array like this:

```julia
track = Matrix{Float64}(undef, 64, 5)
```
And then loop over the layers like this:

```julia
p = input
for (i, layer) in enumerate(model)
    p = layer(p)
    track[:, i] = p  # copy the layer output into column i
end
```
Note that this still allocates a vector `p` and then copies it into your matrix. To avoid allocations completely, you would want to rewrite your layer function to operate in-place on a pre-allocated vector (and then pass in e.g. `@view track[:, i]` to write the results directly into the matrix column).
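For instance, a minimal sketch of such an in-place layer (the `layer!` and `track_inplace!` names, `tanh` activation, and shared `W`, `b` parameters are made-up placeholders here, not Flux API):

```julia
using LinearAlgebra: mul!

# Computes out .= tanh.(W * p .+ b) without allocating a fresh output vector.
function layer!(out, W, b, p)
    mul!(out, W, p)         # out = W * p, written in place
    out .= tanh.(out .+ b)  # fused in-place broadcast for bias + activation
    return out
end

# Write each layer's result directly into a column of the preallocated matrix.
function track_inplace!(track, W, b, input)
    p = input
    for i in 1:size(track, 2)
        layer!(view(track, :, i), W, b, p)
        p = view(track, :, i)  # the next layer reads from the stored column
    end
    return track
end

track = track_inplace!(Matrix{Float64}(undef, 64, 5), randn(64, 64), randn(64), randn(64))
```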
Unfortunately this approach is not supported by Flux/Zygote:

```
Mutating arrays is not supported -- called setindex!(::Matrix{Float64}, _...)
```

But I guess there must be some workaround for this, since the array is not mutated in a mathematical sense. In contrast to in-place operations that overwrite data needed for the gradient calculation, this just stores results into memory.
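For reference, the error can be reproduced with a standalone snippet like this (a minimal sketch, not the actual model):

```julia
using Zygote

Zygote.gradient([1.0, 2.0]) do x
    m = Matrix{Float64}(undef, 2, 1)
    m[:, 1] = x  # setindex! on a plain array inside differentiated code
    sum(m)
end
# ERROR: Mutating arrays is not supported -- called setindex!(::Matrix{Float64}, _...)
```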
I don’t think it was clear up front that this was in a gradient context, because that changes the solution space quite dramatically. There are two ways you can go about this in Zygote:

1. Use `Zygote.Buffer` instead of a plain array (see the sketch right after this list).
2. Put your layers in a `Chain`, call `Flux.activations` to get a set of outputs, and then `reduce(hcat, outputs)` to allocate the array once (sketched further below).
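A minimal sketch of option 1, assuming (as above) equal-length layer outputs; `Zygote.Buffer` allows `setindex!` inside differentiated code, and `copy` turns it back into an ordinary array:

```julia
using Zygote

function track_buffer(model, input)
    # Buffer is mutable inside Zygote-differentiated code, unlike a plain Matrix
    buf = Zygote.Buffer(input, length(input), length(model))
    p = input
    for (i, layer) in enumerate(model)
        p = layer(p)
        buf[:, i] = p
    end
    return copy(buf)  # copy back to a plain (differentiable) array
end
```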
#1 is your best shot (short of writing AD rules) for the in-place option @stevengj described. #2 may be faster if you can live with the `p = layer(p)` allocation for each layer, but I would try both just to be sure.
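For completeness, a sketch of option 2 (the `Chain` layout and sizes here are made-up placeholders):

```julia
using Flux

model = Chain(Dense(64 => 64, relu), Dense(64 => 64, relu), Dense(64 => 64))
input = randn(Float32, 64)

outputs = Flux.activations(model, input)  # tuple containing each layer's output
track = reduce(hcat, outputs)             # concatenate into one matrix
```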
Thanks a lot!
In case I stick with `p = layer(p)`, `Flux.activations` outperforms `Zygote.Buffer` significantly:

```
Flux.activations:  60.174614 seconds (79.67 M allocations: 16.958 GiB, 3.24% gc time)
Zygote.Buffer:    210.037932 seconds (691.79 M allocations: 49.948 GiB, 4.98% gc time)
```