Group Convolution Implementation - Work-around mutating arrays in gradient calcs

I’m attempting to implement group convolutions which applies group transformations to incoming filters and then apply convolution.

To do so, I create a set of transformation indices and then create an array with each of these indices mapped to the original filters to get a group set of filters. The below mapping when calculating the gradient is where the issue comes from →

function apply_transformation_group_to_group(m_transformation_set, channels_in, filters, in_group_size, filter_size)
    no_filters,  in_group_size, transform_group_size = size(m_transformation_set)
    # This will hold the set of filters after group transformations are applied
    filter_result = Array{Any}(undef, ( filter_size..., in_group_size, channels_in,  size(filters)[1], transform_group_size))
    for filter in 1:size(filters)[1]
        for group=1:in_group_size # for each of the group transformations applied to the group filters
            for channel=1:channels_in
             # permute which transformations are applied to which group filter
            for g_filter in 1:transform_group_size
                perm = circshift(1:in_group_size, g_filter)
                m_transformation_set_for_filter_set = m_transformation_set[filter,group, g_filter][2] # get the set of transformation indicies for this filter, group, group_action
                filter_result[: ,:,  perm[group] , channel,  filter, g_filter] = filters[filter, group, :, :][m_transformation_set_for_filter_set] # apply the transformation through array referencing
        end
    end
    end
end

I’m running into this issue that Mutating Arrays is not supported in taking the gradient. I’m a Julia noob so am not sure how to reframe the approach to overcome this, all the values in the multi-dim array are just reindexed from the filters stored in the P4GroupConv.w shown below.

struct P4GroupConv{N,M,F,A,V,S, R}
    w :: A
    bias::V
    σ :: F
    stride::NTuple{N, Int}
    pad::NTuple{M, Int}
    dilation::NTuple{N,Int}
    channels_in :: Int
    transformation_set::Array{Array{Any,S}, R}
end

Is this kind of re-indexing and summing the gradient contributions from the original indexed filters possible in Juila?

Could you provide some sample data? Something like this, except that it should run…

m_transformation_set = [[nothing, rand(1:10, 7)] for f in 1:2, g in 1:3, t in 1:4]
filters = rand(6, 3, 10, 1)
apply_transformation_group_to_group(m_transformation_set, 5, filters, 3, 10)

Sure,

So m_transformation_set is an array of Cartesian indexes which refers to transformation indexing like so:

4×3×3 Array{CartesianIndex{2},3}:
[:, :, 1] =
CartesianIndex(1, 3) CartesianIndex(1, 2) CartesianIndex(1, 1)
CartesianIndex(3, 3) CartesianIndex(2, 3) CartesianIndex(1, 3)
CartesianIndex(3, 1) CartesianIndex(3, 2) CartesianIndex(3, 3)
CartesianIndex(1, 1) CartesianIndex(2, 1) CartesianIndex(3, 1)

[:, :, 2] =
CartesianIndex(2, 3) CartesianIndex(2, 2) CartesianIndex(2, 1)
CartesianIndex(3, 2) CartesianIndex(2, 2) CartesianIndex(1, 2)
CartesianIndex(2, 1) CartesianIndex(2, 2) CartesianIndex(2, 3)
CartesianIndex(1, 2) CartesianIndex(2, 2) CartesianIndex(3, 2)

[:, :, 3] =
CartesianIndex(3, 3) CartesianIndex(3, 2) CartesianIndex(3, 1)
CartesianIndex(3, 1) CartesianIndex(2, 1) CartesianIndex(1, 1)
CartesianIndex(1, 1) CartesianIndex(1, 2) CartesianIndex(1, 3)
CartesianIndex(1, 3) CartesianIndex(2, 3) CartesianIndex(3, 3)

This is 4 transformations applied to a 3x3 filter passed in as shape:

number_of_filters * transformational_group_size_in * 3 * 3

an example being:

filter[1,1,:,:]
3×3 Array{Float64,2}:
-1.0 -1.0 -2.0
0.0 0.0 0.0
-1.0 1.0 0.0

transformational_group_size_in refers to the group input, so in a standard first layer this will just be 1 which will increase depending on which transformation group is applied (so p4m would result in an output of 4 groups).

I was going to suggest this as a way to calculate gradients for such “gather” operations:

julia> using Tullio, Zygote

julia> Zygote.gradient(rand(4,5), rand(1:4,7,3)) do data, ind
         @tullio res[i] := data[ind[i,k],j]
         sum(res)
       end
([1.0 1.0 … 1.0 1.0; 5.0 5.0 … 5.0 5.0; 8.0 8.0 … 8.0 8.0; 7.0 7.0 … 7.0 7.0], nothing)

But your example is a bit more complicated, and involves CaresianIndex although it looks like you only keep one element? Here’s how far I got with simplifying, I think this matches yours except for .=:

transforms = cat([CartesianIndices((1:3, 1:3)) for _=1:3]..., dims=3)
filters = rand(3, 3, 3, 3);

function apply_tr(transforms, cs::Int, filters, gs::Int, f_size::Tuple)
    _,  gs, zs = size(transforms)
    @assert size(filters)[3:end] == f_size
    @assert size(transforms,1) == size(filters,1)

    result = similar(filters, (f_size..., gs, cs, size(filters,1), zs))
    for f in 1:size(filters,1)
        for g=1:gs, c=1:cs, z in 1:zs # nothing depends on c

        	perm = circshift(1:gs, z)
            pg = mod(g-z, 1:gs)
            @assert pg == perm[g]

            i = transforms[f, g, z][2] # always picks 2nd element of CaresianIndex
            @assert i isa Int

            rhs = filters[f, g, :, :][i] # linear indexing into a matrix
            @assert rhs isa Number

            result[: ,:,  pg, c,  f, z] .= rhs # same value for all 1,2,4 directions
        end
    end
    result
end

apply_tr(transforms, 3, filters, 3, (3,3)) # this runs!
1 Like

Great thank you!

Although this isn’t exactly what I’m looking for (my explanation wasn’t the best sorry about that) in general I can make the small alterations to get the output required.

Just to check, this would then allow the gradient to propagate back to the original filter values?

And if so, is the main adjustment to allow this the use of similar? or the .= rather than = ?

So I’ve altered some of the code to match your suggestion

function apply_transformation_group_to_group(m_transformation_set, channels_in::Int, filters, in_group_size::Int, filter_size::Tuple)
    no_filters,  in_group_size, transform_group_size = size(m_transformation_set)
    # This will hold the set of filters after group transformations are applied
    filter_result = similar(filters, ( filter_size..., in_group_size, channels_in,  size(filters)[1], transform_group_size))
    for filter in 1:size(filters,1)
        for group=1:in_group_size, channel=1:channels_in, g_filter=1:transform_group_size # for each of the group transformations applied to the group filters
                perm = circshift(1:in_group_size, g_filter)
                i_set = m_transformation_set[filter,group, g_filter][2]  # get the set of transformation indicies for this filter, group, group_action
                rhs = filters[filter, group, :, :][i_set]
                filter_result[: ,:,  perm[group] , channel,  filter, g_filter] .= rhs # apply the transformation through array referencing
        end
    end
    return filter_result 
end

But this still has the issue stating that mutating arrays is not supported? The main diff between the code you suggest and this is that ‘i’ refers not to an int but a 3x3 set of Cartesian Indicies

So to give an example, say we have a filter from filters[f, g, :, :]
defined as →

[1 1 0; 0 2 3; 0 0 9]

The i coming from transforms[f,g,z][2] (using your notation) is a set of co-ordinates to reference the filter

i = [(1,1), (2,2), (1,3); (2,1), (2,3), (1,2); (3,3), (3,2), (3,1)]

so rhs would be someting like: [1 2 0; 0 1 1; 9 0 0]

These two functions agree, and if you put back my @assertions you will see that i is still an integer, and rhs is still a scalar. I suspect that one step closer would be

    ci = transforms[f, g, z]::CartesianIndex
    rhs = filters[f, g, ci]::Number

But the final result is still a 6-D array, whose values are depend on only 3 of the indices, I suspect this isn’t what you intend:

length(unique(res[:,:,1,:,1,1])) == 1

(similar is just a tidier way to make an empty array really. perm seems like an array you need not allocate.)

These loops won’t be differentiable. @tullio writes similar loops, and works out the corresponding derivative. It may struggle with CartesianIndexes, I’m not sure, depending exactly what the desired operation is.

Adding the assertions back in throws an error, i_set, (before the [2]) for example, is of type Array{CartesianIndex{2},2}

So is the main issue the CartesianIndexing, if I instead just reformulated things to index by integer rather than with CartesianIndex would this help?

And thanks for the heads up on @tullio, I’ll have a read

i_set as defined in the code above is an Int. “Before the [2]” it is what I just called ci. The Array called m_transformation_set isa Array{CartesianIndex{2},3}, with a 3. If you have a 2D array here, then the inputs must be different, and all bets are off.

I think the first issue is making sure you have one formulation in which the computer does what you intend… it’s even less good at guessing than I am.