And that’s not always a good assumption! I am working on a masked dense layer, which looks something like this:
struct MaskedDense{F, M <: AbstractMatrix, B <: AbstractVector}
    W::M
    b::B
    M  # mask (field type assumed; left untyped here)
    σ::F
end
(d::MaskedDense)(x::AbstractArray) = d.σ.(d.b .+ (d.W .* d.M) * x)
@Flux.functor MaskedDense (W, b)
If we try to move it to the GPU, only the fields listed in the functor declaration (here the trainable parameters W and b) are moved:
m = MaskedDense(randn(2, 2), randn(2), [true false; false true], Flux.sigmoid)
typeof(gpu(m)) = Main.MaskedDense{typeof(NNlib.σ), CUDA.CuArray{Float32, 2}, CUDA.CuArray{Float32, 1}}
but this does not work with the mask! And if we include the mask in the functor definition, @Flux.functor MaskedDense (W, b, M), it is indeed moved, but then the mask gets updated during training, which is not desirable. It should of course be moved to the GPU, but it should not be trainable.
I tried to hijack the gpu function like so:
Flux.gpu(d::MaskedDense) = MaskedDense(cu(d.W), cu(d.b), cu(d.M), d.σ)
but that causes my Julia session to explode in errors like [4] _setindex! at ./abstractarray.jl:1290 when training, so I assume that’s not the way to go.
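One direction that might be cleaner than overriding gpu, sketched under some assumptions: let the functor cover every field (so the mask is moved along with W and b), and restrict what the optimiser sees by overloading Flux.trainable. Whether this behaves identically across Flux versions is an assumption on my part, and the fourth type parameter K for the mask is added here purely for illustration.

```julia
using Flux

# Sketch only: K is an extra type parameter (an assumption) so the Bool mask
# can have a different array type than the Float weights.
struct MaskedDense{F, M <: AbstractMatrix, B <: AbstractVector, K <: AbstractMatrix}
    W::M
    b::B
    M::K
    σ::F
end

# Forward pass: zero out the masked weights before the matrix product.
(d::MaskedDense)(x::AbstractArray) = d.σ.(d.b .+ (d.W .* d.M) * x)

# Traverse all fields, so gpu/cpu also move the mask.
Flux.@functor MaskedDense

# Only W and b are reported as trainable; the mask is excluded.
Flux.trainable(d::MaskedDense) = (W = d.W, b = d.b)

m = MaskedDense(randn(Float32, 2, 2), randn(Float32, 2), [true false; false true], Flux.sigmoid)
y = m(randn(Float32, 2, 3))  # forward pass on the CPU
# m_gpu = gpu(m)             # with a working CUDA setup, this should move W, b and M
```

With this layout the mask takes part in the forward pass and should follow the layer to the device, but it is never handed to the optimiser, so it should stay fixed during training.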
Has anyone here implemented something similar, or does anyone know how to tackle this? Thanks!