Sparse Feed Forward NN

Hi! I have a question about (hypothetical) sparse layers in feedforward neural networks:

How would I go about using sparse arrays in place of W (weights) and b (biases) in Flux's Dense layer? What I imagine is that I have sparse inputs and sparse layers, and that I could obtain sparse gradients (either including gradients for zero weights or not) via backpropagation in Julia. I've tried the following (Julia 1.3.1, Flux 0.10.0):

using Flux
using Flux: Chain, crossentropy, gradient
using SparseArrays
struct SprAffine{F,S,T}
    W::S
    b::T
    σ::F
end
SprAffine(in::Integer, out::Integer, σ::Function) =
    SprAffine(sprandn(out, in, 0.5), sprandn(out, 0.5), σ)
(m::SprAffine)(x) = m.σ.(m.W * x .+ m.b)
Flux.@functor SprAffine
loss(x, y) = crossentropy(m(x), y)
m = Chain(SprAffine(4, 3, σ), SprAffine(3, 2, σ), softmax);
Now loss(sparse(1:4), sparse(1:2)) computes fine (and returns a Float64), but
gradient(params(m)) do
    loss(sparse(1:4), sparse(1:2))
end
gives me an error: MethodError: no method matching zero(::Type{Tuple{Float64,Zygote.var"#916#back#380"{Zygote.var"#378#379"{Float64}}}}).
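For what it's worth, here is a stripped-down check of whether Zygote handles a sparse parameter at all, independent of Flux's params machinery. The names W and x here are just placeholders of my own, and whether the gradient comes back sparse, dense, or errors out presumably depends on the Zygote version:

```julia
using Zygote
using SparseArrays

W = sprandn(3, 4, 0.5)   # sparse weight matrix, like in SprAffine
x = randn(4)             # dense input, for simplicity

# Differentiate a scalar function of the sparse matrix directly,
# with no Flux layers or implicit parameters involved.
g, = Zygote.gradient(W -> sum(W * x), W)
typeof(g)  # SparseMatrixCSC, a dense Matrix, or an error?
```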

Having downgraded to Flux v0.9.0, which uses Tracker, I do get technically sparse gradients. They are filled with explicit zeros once tracking starts, but I can call dropzeros! by hand on all the gradients, and they seem to remain structurally sparse (i.e. without explicit zeros) after calling train!. Does Tracker really behave differently from Zygote when handling sparse weights, or am I just making some mistake?
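To illustrate the dropzeros! step I mean, with plain SparseArrays and no Flux involved: nnz counts stored entries, including explicitly stored zeros, and dropzeros! removes exactly those stored zeros:

```julia
using SparseArrays

A = sprandn(3, 3, 0.5)
nonzeros(A) .= 0.0  # overwrite the stored entries with explicit zeros
n_before = nnz(A)   # stored entries are still counted, even though they equal zero
dropzeros!(A)       # remove the explicitly stored zeros in place
n_after = nnz(A)    # now 0: the matrix is structurally empty again
```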

Just to add some concreteness to what this is supposed to be for: I was wondering how hard it would be to implement in Julia something similar (if not necessarily identical) to the experiments in arXiv:1711.05136v5 or arXiv:1901.09181. The first paper does not compute gradients for the zero weights, whereas the second one does, as far as I understand.

Any explanations or directions would be appreciated! :slightly_smiling_face: