Sparse Feed Forward NN

Hi! I have a question about (hypothetical) sparse layers in feedforward neural networks:

How would I go about using sparse arrays in place of W(eights) and b(iases) in Flux's Dense layer? What I imagine is that I have sparse inputs and sparse layers, and that I could obtain sparse gradients (either including gradients for zero weights or not) via backpropagation in Julia. I've tried the following (Julia 1.3.1, Flux 0.10.0):

using Flux
using Flux: Chain, crossentropy, gradient
using SparseArrays
struct SprAffine{F,S,T}
    W::S
    b::T
    σ::F
end
SprAffine(in::Integer, out::Integer, σ::Function) = SprAffine(sprandn(out, in, 0.5), sprandn(out, 0.5), σ)
(m::SprAffine)(x) = m.σ.(m.W * x .+ m.b)
Flux.@functor SprAffine
loss(x, y) = crossentropy(m(x), y)
m = Chain(SprAffine(4, 3, σ), SprAffine(3, 2, σ), softmax);
With this, loss(sparse(1:4), sparse(1:2)) evaluates (and returns a Float64), but
gradient(params(m)) do
    loss(sparse(1:4), sparse(1:2))
end

gives me an error: MethodError: no method matching zero(::Type{Tuple{Float64,Zygote.var"#916#back#380"{Zygote.var"#378#379"{Float64}}}}).

Having downgraded to Flux v0.9.0 and using Tracker, I do get technically sparse gradients. They are filled with explicit zeros once tracking starts, but I can call dropzeros! by hand on all gradients, and they seem to remain sparse (i.e. without explicit zeros) after calling train!. Does Tracker really behave differently from Zygote when handling sparse weights, or am I just making a mistake?
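To make the "explicit zeros" point concrete, here is a minimal sketch using only the SparseArrays standard library. The matrix `G` is a made-up stand-in for a gradient, not something produced by Tracker; it only shows the difference between a stored zero and an entry that is absent from the sparsity pattern.

```julia
using SparseArrays

# Hypothetical stand-in for a gradient: the entry at (1, 1) is stored,
# but its numerical value is zero (an "explicit zero").
G = sparse([1, 2], [1, 2], [0.0, 2.0], 2, 2)

stored_before = nnz(G)   # counts stored entries, including the explicit zero

dropzeros!(G)            # removes stored entries whose value is zero, in place

stored_after = nnz(G)    # only the numerically nonzero entry remains stored
```

Here `stored_before` is 2 and `stored_after` is 1, so `nnz` reports the size of the stored pattern rather than the number of numerically nonzero values.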

To make it more concrete what this is for: I was wondering how hard it would be to implement in Julia something similar (if not necessarily identical) to the experiments done in 1711.05136v5 or 1901.09181. As far as I understand, the first paper does not compute gradients for the zero entries of the sparse weights, whereas the second one does.

Any explanations or directions would be appreciated! :slightly_smiling_face:

Just a small correction. The following works without the quoted error message:

using SparseArrays
using Zygote
using Flux: Chain, σ, crossentropy, softmax

struct SprAffine{F,S,T}
    W::S
    b::T
    σ::F
end
SprAffine(in::Integer, out::Integer, σ::Function) = SprAffine(sprandn(out, in, 0.5), sprandn(out, 0.5), σ)
(m::SprAffine)(x) = m.σ.(m.W * x .+ m.b)

loss(model, x, y) = crossentropy(Array(model(x)), y)
model = Chain(SprAffine(4, 3, σ), SprAffine(3, 2, σ), softmax);

x_sparse = sparse(1:4)
y = [1, 2]   # a vector, so it matches the 2-element model output ([1 2] would be a 1×2 matrix)
grads = gradient(model) do m
    loss(m, x_sparse, y)
end[1]

typeof(grads[1][1].W)   # SparseMatrixCSC{Float64,Int64}

So the returned gradients can be sparse, which is awesome. But, unsurprisingly, they are the full correct gradients that one would obtain using dense arrays, only in the sparse format.

I would also love to learn how to tell the gradient function to ignore the zeros in the sparse matrix and not compute gradients for them. My guess is that I should redefine the adjoint for sparse-matrix-times-vector multiplication.
As before, any explanations or directions would be appreciated! :slight_smile:
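
For reference, here is a rough sketch of what such a pattern-restricted adjoint might compute. It uses only the SparseArrays standard library; `spmul` and `spmul_pullback` are hypothetical names introduced for illustration, not part of Flux or Zygote. The idea is that the full gradient of y = W * x with respect to W is the dense outer product Δ * x', and the sketch evaluates it only at the entries that are stored in W. The registration with Zygote is left as a comment because it is an untested assumption about how it could be wired in.

```julia
using SparseArrays

# Hypothetical helper: the forward pass to be given a custom adjoint.
spmul(W::SparseMatrixCSC, x::AbstractVector) = W * x

# Hypothetical helper: gradients of y = W * x for an output cotangent Δ.
# The gradient w.r.t. W is computed only at the stored entries of W,
# so the result has the same sparsity pattern as W.
function spmul_pullback(W::SparseMatrixCSC, x::AbstractVector, Δ::AbstractVector)
    xd = Vector(x)                        # densify, so sparse inputs also work
    Δd = Vector(Δ)
    rows, cols, _ = findnz(W)             # coordinates of the stored entries
    vals = Δd[rows] .* xd[cols]           # (Δ * x')[i, j] at those coordinates only
    dW = sparse(rows, cols, vals, size(W)...)
    dx = W' * Δd                          # gradient w.r.t. the input
    return dW, dx
end

# Assumed (untested) registration with Zygote; the layer would then call
# spmul(m.W, x) instead of m.W * x:
# Zygote.@adjoint spmul(W, x) = spmul(W, x), Δ -> spmul_pullback(W, x, Δ)

# Small example with made-up numbers.
W = sparse([1, 2, 2], [1, 1, 3], [1.0, 2.0, 3.0], 2, 3)
x = [1.0, 2.0, 3.0]
Δ = [10.0, 20.0]
dW, dx = spmul_pullback(W, x, Δ)
```

In this example `dW` has three stored entries, the same as `W`, and `dW[1, 2]` stays zero even though the dense gradient Δ * x' would have 20.0 there. Whether this matches what the papers do is a separate question; it only shows one way to skip the zero weights.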