Flux output layer with custom activation function

I am trying to create a mostly standard neural network, except that the output layer applies a sigmoid activation to the first of its four outputs and the identity to the other three.
I made a custom activation function following this thread:

custom_final_activation(x::AbstractArray{Float32,2}) = vcat(
    NNlib.sigmoid(x[1:1, :]), NNlib.identity(x[2:end, :])   # PROBLEMATIC?
)

and used it as the activation of the final layer of my model (the full, (not)-working minimal example is at the end of this post):

# append final layer to on-going list of layers
push!(layers, Dense(n_intermediate, n_OUT, custom_final_activation))
# construct model
model = Chain(layers...) |> gpu;

However, when I try to evaluate this model, I get the error

ERROR: LoadError: GPU broadcast resulted in non-concrete element type Union{}.
This probably means that the function you are broadcasting contains an error or type instability.

I suspect that I have defined my custom_final_activation function incorrectly, but I am not sure where the type instability is coming from…
I would appreciate any help!

Here is a (not)-working minimal example:

using Flux
using CUDA

n_IN = 10
n_OUT = 4 
n_intermediate = 128
intermediate_layers = 3

# example data 
xs = rand(Float32, n_IN, 100)
ys = rand(Float32, n_OUT, 100)

# build model
layers = Any[Dense(n_IN, n_intermediate, NNlib.relu)]
for i = 1:intermediate_layers
    push!(layers, Dense(n_intermediate, n_intermediate, NNlib.relu))
end
# final layer 
custom_final_activation(x::AbstractArray{Float32,2}) = vcat(
    NNlib.sigmoid(x[1:1, :]), NNlib.identity(x[2:end, :])   # PROBLEMATIC?
)
push!(layers, Dense(n_intermediate, n_OUT, custom_final_activation))
# chain layers together
model = Chain(layers...) |> gpu;

# pull out data 
const bs = 20
train_loader = Flux.DataLoader((xs, ys), batchsize=bs, shuffle = true);
batch = gpu(first(train_loader))

# evaluate model
val, grads = Flux.withgradient(model) do m
    result = m(batch[1])
    Flux.Losses.mse(result, batch[2])
end

You want something like Chain(Dense(n_intermediate => n_OUT), custom_final_activation).

Activation functions inside Dense are broadcast, i.e. applied to each number individually, not to the whole array. So Flux ends up calling custom_final_activation on single Float32 values, for which no method exists.

Without the gpu you get a clearer error:

julia> Dense(2 => 4, custom_final_activation)(rand32(2, 3))
ERROR: MethodError: no method matching custom_final_activation(::Float32)

Closest candidates are:
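For completeness, here is a minimal CPU sketch of the suggested fix (the sizes here are placeholders; note the dot in `sigmoid.` — sigmoid is a scalar function, so applying it to a row of the matrix requires broadcasting):

```julia
using Flux

# Hypothetical small sizes for illustration.
n_intermediate, n_OUT = 8, 4

# Apply sigmoid to the first row only; the other rows pass through unchanged.
custom_final_activation(x::AbstractMatrix) =
    vcat(Flux.sigmoid.(x[1:1, :]), x[2:end, :])

# The activation is its own Chain stage after Dense, so it receives
# the whole output matrix instead of being broadcast over scalars.
model = Chain(Dense(n_intermediate => n_OUT), custom_final_activation)

y = model(rand(Float32, n_intermediate, 5))
@assert size(y) == (4, 5)
@assert all(0 .< y[1, :] .< 1)   # first row is squashed into (0, 1)
```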

Thank you, this worked!