Differentiable argmin? Trying VQ-VAE in Flux.jl

I’m trying to follow this VQ-VAE tutorial in Julia, particularly the VectorQuantizer module:

        # Calculate distances
        distances = (torch.sum(flat_input**2, dim=1, keepdim=True) 
                    + torch.sum(self._embedding.weight**2, dim=1)
                    - 2 * torch.matmul(flat_input, self._embedding.weight.t()))
            
        # Encoding
        encoding_indices = torch.argmin(distances, dim=1).unsqueeze(1)
        encodings = torch.zeros(encoding_indices.shape[0], self._num_embeddings, device=inputs.device)
        encodings.scatter_(1, encoding_indices, 1)
        
        # Quantize and unflatten
        quantized = torch.matmul(encodings, self._embedding.weight).view(input_shape)
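
(For reference, the distances line is the all-pairs expansion of the squared Euclidean distance, ‖x - e‖² = ‖x‖² + ‖e‖² - 2⟨x, e⟩, between every input vector x and every codebook vector e; note the minus sign on the cross term.)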

It looks like PyTorch is able to differentiate through torch.argmin, while Zygote can't in my Julia implementation:

emb = Flux.Embedding(args[:emb_dim], args[:num_embeddings]; init=Flux.glorot_uniform) |> gpu
ps = Flux.params(emb)

loss, grad = withgradient(ps) do
    distances = sum(flat_input .^ 2, dims=2)' .+ sum(emb.weight .^ 2, dims=2) .-
                2.0f0 * emb.weight * flat_input'

    encoding_indices = argmin(distances, dims=1)

    encodings = NNlib.scatter(+, zeros(size(flat_input, 1), num_embeddings)' |> gpu, encoding_indices;
                              dstsize=(num_embeddings, size(flat_input, 1)))

    sum(encodings)
end

where the gradients w.r.t. ps come back as nothing.

I understand that argmin isn’t AD-friendly. Is there a way I could implement this vector quantization step differentiably?
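
For reference, here is a stripped-down CPU version of what I'm seeing, with made-up sizes and no encoder. I hide the discrete index computation from AD explicitly, since it isn't differentiable anyway:

using Flux, NNlib, Zygote

D, K, N = 4, 8, 16                  # embedding dim, codebook size, batch size (made up)
emb = Flux.Embedding(D, K)          # emb.weight is K×D: one codebook vector per row
flat_input = randn(Float32, N, D)   # stand-in for the flattened encoder output
ps = Flux.params(emb)

loss, grad = withgradient(ps) do
    distances = sum(flat_input .^ 2, dims=2)' .+ sum(emb.weight .^ 2, dims=2) .-
                2.0f0 .* (emb.weight * flat_input')
    # (codebook row, sample index) pairs of the nearest code, hidden from AD
    idx = Zygote.@ignore map(c -> (c[1], c[2]), vec(argmin(distances, dims=1)))
    # one-hot encodings, K×N
    encodings = NNlib.scatter(+, ones(Float32, N), idx; dstsize=(K, N))
    sum(encodings)                  # the loss never touches emb.weight again
end

grad[emb.weight]                    # nothing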

argmin is not the problem here. scatter is non-differentiable w.r.t. encoding_indices (see scatter.jl at commit c0b4b8b6e969422ff4af18b473d02192b27c9cf4 in FluxML/NNlib.jl on GitHub): the gradient below comes back for src but is nothing for idx.

julia> using Zygote, NNlib

julia> gradient((src, idx) -> sum(NNlib.scatter(+, src, idx)), [10, 100], [1, 3])
([1.0, 1.0], nothing)

Isn’t the issue rather that the PyTorch example does a further matmul with self._embedding.weight (which propagates a gradient signal back to the codebook), whereas the Julia one does not? I wouldn’t think PyTorch’s scatter is differentiable w.r.t. indices either.
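
As a quick check, here is a sketch reusing the stripped-down example from the question, routing the loss through that extra matmul (the tutorial's codebook/commitment terms are reduced to a plain squared error for illustration):

loss, grad = withgradient(ps) do
    distances = sum(flat_input .^ 2, dims=2)' .+ sum(emb.weight .^ 2, dims=2) .-
                2.0f0 .* (emb.weight * flat_input')
    idx = Zygote.@ignore map(c -> (c[1], c[2]), vec(argmin(distances, dims=1)))
    encodings = NNlib.scatter(+, ones(Float32, N), idx; dstsize=(K, N))
    # the step the Julia version was missing: multiply with the codebook again,
    # like torch.matmul(encodings, self._embedding.weight) in the tutorial
    quantized = encodings' * emb.weight     # N×D
    sum(abs2, quantized .- flat_input)      # stand-in for the codebook loss term
end

grad[emb.weight]                            # now a K×D array instead of nothing

This only trains the codebook, of course. To also push gradients back into an encoder, the tutorial uses the straight-through line quantized = inputs + (quantized - inputs).detach(), which in Zygote would be something like quantized = flat_input + Zygote.@ignore(quantized - flat_input).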

Yes, I think so too. I did not mean to say the scatter implementation is incorrect, just that argmin was not causing the problem.

Thank you, I missed that!