Workaround for subarray/selectdim on gpu or how does Flux.Conv do it?

I’m trying to do something similar to what Flux.Conv does. I’ve tried to use the Conv layer, but it seems to be opinionated about how the array is structured and what it outputs, so it doesn’t quite match what I want. I have a 3-dimensional array of features x series x batch as input to the layer, and I want to run a Dense on flattened chunks of features x i:i+k x batch across that array.

It works on the CPU, but I want it to run on the GPU. Here’s a general idea of it:

    mapreduce(
        i -> reshape(dense(Flux.flatten(selectdim(x, dim, i:i+plus))), sizeOut),
        (x1, x2) -> cat(x1, x2; dims=dim),
        1:(size(x, dim)-plus)
    )

Flux.Conv is able to do something similar on the GPU:

using Flux, CUDA
CUDA.allowscalar(false)
gpu(Flux.Conv((2,2), 4=>2))(gpu(rand(2,2,4,1)))

Here’s an MWE to show the error that selectdim causes:

using Flux, CUDA
CUDA.allowscalar(false)
den = Flux.Dense(30 => 4) |> gpu
x = [i + j / 10 for i in 1:10, j in 1:20, b in 1:5] |> gpu
x2 = Flux.flatten(selectdim(x, 2, 1:3))
den(x2)
# ERROR: Scalar indexing is disallowed.

So there must be a way to do it, because Flux.Conv works. I tried to look at Conv’s source code, but it calls NNlib.conv, and that is implemented with multiple stages of generated functions, which is a bit difficult for me to piece together.

Any suggestions on how this might be accomplished?

Conv has an easier time of things because spatially, each window goes from width K to 1, whereas in your example each window seems to go from K1 to K2 because of the dense layer. What are the expected input and output dimensions? If you have an example of this using a different library, we could try to help port the code. Otherwise, the expected behaviour is still somewhat ambiguous to me.

There shouldn’t be any generated functions called in NNlib.conv (perhaps you were thinking of @eval?), but the multiple layers can be confusing. The default CPU implementation lives in conv_im2col.jl in FluxML/NNlib.jl, with a generic fallback in conv_direct.jl. For GPU we just use cuDNN, so that won’t be much help for your use case.
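The im2col idea itself might still be the shape of a solution here, though: gather every window into a column of one big matrix, then a single matmul handles all the windows at once. A rough CPU-only sketch of the concept (hypothetical helper, not NNlib’s actual code):

```julia
# im2col-style sketch (hypothetical, not NNlib's actual implementation):
# unfold every length-k window along dim 2 into one column, so a single matmul
# can replace the whole sliding loop.
function unfold_windows(x::AbstractArray{T,3}, k) where {T}
    A, B, C = size(x)
    n = B - k + 1                       # number of windows
    cols = similar(x, A * k, n * C)     # one column per (window, batch) pair
    for c in 1:C, i in 1:n
        cols[:, (c - 1) * n + i] = vec(x[:, i:i+k-1, c])
    end
    return cols
end

x = reshape(collect(1.0:20), 2, 5, 2)   # 2 features × 5 steps × 2 batches
cols = unfold_windows(x, 3)
size(cols)  # (6, 6): 2*3 rows, (5-3+1)*2 columns
```

The gather loop above uses scalar indexing, so on GPU it would need a non-scalar formulation, but the single-matmul structure is the part worth stealing.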

Right, I think that’s the primary difference between what I’m trying to do and Conv. I want to map to multiple values out of each window/region but Conv only outputs one. I figured I might sort of be able to use it if I moved the features into the channels dimension, but I’d rather not recreate all the data just for that one layer. At that point, copying the view to the GPU every iteration would probably be just as fast (or rather just as slow).

An example without batches (numbers arbitrary): an 8 x 100 matrix. I want to roll an 8 x 5 window across it and output 7 values for each position, using the same mapping each time (e.g. a Dense(40 => 7)). The result would be a 7 x 96 matrix (the second dim is smaller because there’s no padding).

In general:
input: an A x B x C array is transformed to a D x (B-(len-1)) x C array, where len is the desired window length and D is the Dense output width.
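To make that spec concrete, here’s a naive CPU-only reference of what I mean (hypothetical helper; I’ve written the Dense as a plain affine map W*v .+ b so the snippet is self-contained):

```julia
# Naive reference for the spec above (hypothetical helper, CPU only):
# slide a length-`len` window along dim 2 and apply the same affine map
# to each flattened window. A×B×C in, D×(B-len+1)×C out.
function sliding_dense(W::AbstractMatrix, b::AbstractVector,
                       x::AbstractArray{<:Any,3}, len::Int)
    A, B, C = size(x)
    D = size(W, 1)
    out = similar(x, D, B - len + 1, C)
    for c in 1:C, i in 1:(B - len + 1)
        out[:, i, c] = W * vec(x[:, i:i+len-1, c]) .+ b
    end
    return out
end

W = randn(7, 8 * 5); b = randn(7)    # plays the role of Dense(40 => 7)
x = randn(8, 100, 12)
size(sliding_dense(W, b, x, 5))      # (7, 96, 12)
```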

Here’s the whole layer implementation that does work on the CPU (sorry, the code isn’t pretty because I was just trying to get it to work):

struct SeqCombLayer{INT,T}
    plus::INT   # window length minus one
    layer::T    # the Dense applied to each flattened window
    function SeqCombLayer(pair::Pair; σ=identity)
        plus = last(first(pair)) - 1
        den = Dense(prod(first(pair)) => last(pair), σ)
        new{typeof(plus),typeof(den)}(plus, den)
    end
    SeqCombLayer(plus::Real, layer) = new{typeof(plus),typeof(layer)}(plus, layer)
end
Flux.@functor SeqCombLayer
function (m::SeqCombLayer)(x::AbstractArray)
    dim = ndims(x) - 1                    # the series dimension
    sz = size(x)
    lenOut = length(m.layer.bias)         # Dense output width
    szOut = (lenOut, sz[2:(dim-1)]..., 1, sz[dim+1:end]...)
    mapreduce(
        i -> reshape(m.layer(Flux.flatten(selectdim(x, dim, i:i+m.plus))), szOut),
        (x1, x2) -> cat(x1, x2; dims=dim),
        1:(size(x, dim)-m.plus)
    )
end

and example usage:

layer = fl.SeqCombLayer((8, 5) => 7)
out = layer(rand(8, 100, 12))
@assert size(out) == (7, 96, 12)

The above runs fine on the CPU. Sending to GPU will fail in the mapreduce because of the selectdim.

Ah, my mistake. I saw conv.jl in FluxML/NNlib.jl and thought it was generating functions, but it’s using @eval to do so, as you mention, and not the “generated functions” language feature.

You can reshape this (the original question) to just 3 calls of the Dense:

using Flux
den = Flux.Dense(30 => 4)
x = [i + j / 10 for i in 1:10, j in 1:20, b in 1:5]; # 10×20×5
selectdim(x, 2, 1:3);  # 10×3×5
x2 = Flux.flatten(selectdim(x, 2, 1:3))  # 30×5
den(x2)  # 4×5

# This is what I think your function is, but please provide code which runs:
myguess(den::Dense, sizeOut::Tuple, x; dim=2, plus=2) = mapreduce(
    i -> reshape(den(Flux.flatten(selectdim(x, dim, i:i+plus))), sizeOut),
    (x1, x2) -> cat(x1, x2; dims=dim),
    1:(size(x, dim)-plus)
)
y1 = myguess(den, (4, 1, 5), x); size(y1)

# Original, overlapping windows:
# out[c, i, z] := sum(a,j) w[c, (a,j)] * x[a, i+j, z]  (j in 1:3)

# Easier problem, non-overlapping:
# out[c, i, z] := sum(a,j) w[c, (a,j)] * x[a, j + 3i, z]  (j in 1:3)
# reshape, folding the window index j into the feature index:
# out[c, i, z] := sum(aj) w[c, aj] * x2[aj, i, z]
# then do offsets 3i, 3i+1, 3i+2 separately:

myfun(den, x) = hcat(
    den(reshape(x[:, 1:18, :], :, 6, 5)),
    den(reshape(x[:, 2:19, :], :, 6, 5)),
    den(reshape(x[:, 3:20, :], :, 6, 5)),
);
y3 = myfun(den, x);
y3 ≈ hcat(y1[:, 1:3:end, :], y1[:, 2:3:end, :], y1[:, 3:3:end, :])  # true

# about 10x faster, gradient 15x

I didn’t try it on the GPU, but it ought to work. And I see there’s now more code, posted after I started writing… something like this can surely be generalised; plus will control the number of terms you need.

Mine has the 2nd index in a different order; if that matters, you could of course fix it… by indexing afterwards, or perhaps by writing some kind of “inner not outer” variant of cat. The indexing x[:, 1:18, :] etc. could also be done better.
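For the record, undoing that ordering is just an inverse permutation; a small sketch (assuming n windows of width w, as above):

```julia
# Restoring sequential column order after the hcat trick (plain-Julia sketch).
# The hcat output has its columns in the order [1:w:n; 2:w:n; 3:w:n];
# sortperm of that order gives the inverse permutation.
n, w = 18, 3
order = reduce(vcat, [o:w:n for o in 1:w])   # column order the trick produces
perm = sortperm(order)                        # inverse permutation
# y_ordered = y3[:, perm, :]                  # back to windows 1, 2, 3, ...
order[perm] == collect(1:n)  # true
```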


Thank you! I think that will do the trick. I’ll probably need to do some eval or function generation or something to generalize it for window size, but that’s doable.

Here it is generalized except for dealing with variable window size:

struct SeqComb{INT,T}
    window::INT     # window length (the forward pass below is hardcoded to 3)
    outWidth::INT   # Dense output width
    layer::T
    function SeqComb(pair::Pair; σ=identity)
        window = last(first(pair))
        outWidth = last(pair)
        den = Dense(prod(first(pair)) => outWidth, σ)
        new{typeof(window),typeof(den)}(window, outWidth, den)
    end
    SeqComb(window::Integer, outWidth::Integer, layer) = new{typeof(window),typeof(layer)}(window, outWidth, layer)
end
Flux.@functor SeqComb
function (m::SeqComb)(x::AbstractArray)
    _, seriesCount, batchCount = size(x)
    seriesCombined, extra = divrem(seriesCount, m.window)
    len1 = seriesCombined * m.window                 # last column used at offset 1
    len2 = len1 + (1 > extra ? 1 - m.window : 1)     # drop a window when the offset runs past the end
    len3 = len1 + (2 > extra ? 2 - m.window : 2)
    return hcat(
        m.layer(reshape(x[:, 1:len1, :], :, len1 ÷ m.window, batchCount)),
        m.layer(reshape(x[:, 2:len2, :], :, len2 ÷ m.window, batchCount)),
        m.layer(reshape(x[:, 3:len3, :], :, len3 ÷ m.window, batchCount))
    )
end

function test()
    x = [i + j / 10 + 1000 * b for i in 1:10, j in 1:20, b in 1:5]
    layer1 = fl.SeqCombLayer((10, 3) => 4)
    layer2 = fl.SeqComb((10, 3) => 4)
    copyto!(layer1.layer.weight, layer2.layer.weight) # same weights for testing equality (both biases are zero-initialised)
    out1 = layer1(x)
    out2 = layer2(x)
    return out2 ≈ hcat(out1[:, 1:3:end, :], out1[:, 2:3:end, :], out1[:, 3:3:end, :])
end
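For completeness, here’s a sketch of how the variable-window generalisation could look (pure Julia, hypothetical helper; the Dense is again replaced by a plain affine map W*x .+ b so it runs standalone, and the columns come out in the same interleaved order as SeqComb):

```julia
# Hypothetical generalisation of SeqComb to any window size k (CPU sketch).
# One reshaped matmul per offset in 1:k, concatenated along dim 2.
function seqcomb_general(W, b, x, k)
    A, B, C = size(x)
    D = size(W, 1)
    parts = map(1:k) do offset
        n = (B - offset + 1) ÷ k              # full windows starting at `offset`
        stop = offset + n * k - 1
        xs = reshape(x[:, offset:stop, :], A * k, n * C)
        reshape(W * xs .+ b, D, n, C)         # affine map on every window at once
    end
    return cat(parts...; dims=2)              # interleaved order, like SeqComb
end

W = randn(4, 10 * 3); b = randn(4)            # plays the role of Dense(30 => 4)
x = randn(10, 20, 5)
size(seqcomb_general(W, b, x, 3))  # (4, 18, 5)
```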