FlexLayer: A Custom Layer with Different Activation Functions, Non-negativity, and More

This is a follow-up to a half-dozen posts in the past couple of weeks. I’ve had some amazing help from very generous members of this community; it’s allowed a newbie like me to make some progress!

The specific posts this builds on are listed below. But I think the whole is greater than the sum of the parts, and I wanted to share in one place the working code I've assembled, in case it's of any use to others or serves as example code for others who are new to Julia.

The code fits three different outputs and plots the data, the overlaid fits, and the loss as training runs. The final layer is a custom layer ("FlexLayer") with a couple of unusual features: it can accept an array of activation functions, so that a different activation function is applied to each output. (This feature is used in the code to enforce a non-negativity constraint on one of the outputs.) It can also accept an ancillary 2D numeric array that can be used however one sees fit within the layer. (This feature is used to "un-normalize" each output, which is useful if the target had been scaled from some min/max range to [0,1].)
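For concreteness, here's a minimal usage sketch (the numbers are just placeholders) showing how such a layer is constructed and applied with the FlexLayer constructor defined in the full listing below: output 1 gets softplus (for non-negativity) and is un-normalized back to a [-200, 800] range with clipping, while outputs 2 and 3 get identity activations and a no-op [0, 1] un-normalization.

# assumes the FlexLayer definitions from the code block below have already been loaded
layer = FlexLayer(10, 3,                                        # 10 inputs, 3 outputs
                  [softplus, identity, identity],               # per-output activation functions
                  [-200.0 800.0 1.0; 0.0 1.0 0.0; 0.0 1.0 0.0]) # per-output [min max clip] for un-normalization
ŷ = layer(randn(10, 5))                                         # apply to a batch of 5 column vectors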

The full code is included in the block below. If you have any suggestions, comments, etc., I'm all ears; I still have a lot to learn. Following the code block, there's a second block that runs the demo with whichever option you choose.

using Distributions: Normal
using Tullio, ForwardDiff, Zygote
using Flux
using Random: shuffle
using Flux: mse, throttle, Params # (Params comes from Zygote)
using Base.Iterators: partition
using Plots

##### DATA #####################
num_samples = 50
x_noise_std = 0.01
y_noise_std = 0.1
function generate_data()
    x = reshape(range(0, stop=1.5π, length=num_samples), num_samples, 1)
    y_noise = rand(Normal(0,y_noise_std), 3, num_samples)
    
    # 3 outputs
    y = hcat(
        (sin.(x).^2 .-0.2 .+ y_noise[1,:]),
        cos.(x) .+ y_noise[2,:],
        cos.(2x).*sin.(x) .+ y_noise[3,:]
    )
    
    x = x'
    y = y'
    
    return x, y
end
X, Y = generate_data() # Training data: X has shape (1, num_samples), Y has shape (3, num_samples)

##### CALLBACK & PLOTS #################
LossLog = Array{Float64, 1}()
LossLog_T = Array{Int, 1}()
function evalcb()
    global LossLog
    global LossLog_T
    loss_value = loss(X, Y)
    push!(LossLog, loss_value)
    push!(LossLog_T, length(LossLog))
    if mod(length(LossLog_T),100)==1
        live_plots()
    end
end

function plot_with_fit(x,y,yfit,label)
    return plot([x' x'], [y' yfit'], color=[:black :red], lw=[0 2], marker=[:circle :none],alpha=0.7,legend=false,ylabel=label)
end

function live_plots()
    NumPlots = size(Y,1)
    p_arr = Array{Plots.Plot{Plots.GRBackend},1}(undef, NumPlots)
    for i in 1:NumPlots
        p_arr[i] = plot_with_fit(X, Y[i:i,:], m(X)[i:i,:], "y"*string(i))
    end
    
    p_stack = plot(p_arr...,layout=(NumPlots,1))
    p_loss = plot(LossLog_T, LossLog,yscale=:log10,legend=false,ylabel="Loss")

    IJulia.clear_output(true) # assumes this is being run in a Jupyter (IJulia) notebook
    p = plot(p_stack, p_loss, layout=(1,2))
    display(p)
end

##### AUXILIARY FUNCTIONS #################
"Applies each element of the array fs of functions to all columns
of the corresponding row.
See https://discourse.julialang.org/t/mutating-arrays-not-supported/42123/6?u=jmurray"
rowmap(fs, x) = @tullio y[r,c] := begin
         f = getindex(fs, r)
         @inbounds f(x[r,c])
       end grad=Dual;
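# Illustrative example (not part of the original demo): apply a different function to each row.
#   rowmap([sin, cos], zeros(2, 3))  ->  [0.0 0.0 0.0; 1.0 1.0 1.0]   (sin applied to row 1, cos to row 2)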

"Applies softplus to rows of x for which non-negativity is required, as specified
by whether the corresponding element of nonneg is > 0. (The number of elements in nonneg
should match the number of rows in x.)"
rowwise_nonneg(nonneg, x) = @tullio y[r,c] := begin
         nn_r = getindex(nonneg, r)
         @inbounds nn_r > 0. ? softplus(x[r,c]) : x[r,c]
       end grad=Dual;
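# Illustrative example (not part of the original demo): only row 1 is forced non-negative.
#   rowwise_nonneg([1.0, 0.0], [-2.0 3.0; -2.0 3.0])  ->  [softplus(-2.0) softplus(3.0); -2.0 3.0]   (row 2 unchanged)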

"Applies a row-dependent normalization to each element of x such that 
x_norm == (x - min) / (max - min) falls between 0 and 1. The array m
has as many rows as x and stores in its 2 columns the min and max (in that
order) value to which each value in that row is to be normalized. (Columns
in excess of 2 are allowed but ignored.)"
rowwise_norm(m, x) = @tullio y[r,c] := begin
    # columns of m are [min max]; maps min -> 0 and max -> 1
    (x[r,c] - m[r,1])/(m[r,2] - m[r,1])
    end grad=Dual;

"Applies a row-dependent un-normalization to each element of x.
Inverse of rowwise_norm(); see notes for rowwise_norm.
The matrix m consists of either 2 or 3 columns; the third column, if supplied, is
interpreted as a boolean telling us whether to clip negative values
to zero for the corresponding rows. Thus if we unnormalize after applying
a nonnegativity constraint via an activation function (softplus) we still
can retain nonnegativity."
function rowwise_unnorm(m, x)
    # columns of m are [min max] (optionally [min max clip]); maps 0 -> min and 1 -> max
    if size(m,2)==2
        @tullio y[r,c] := x[r,c]*(m[r,2] - m[r,1]) + m[r,1] grad=Dual;
    else
        @tullio y[r,c] := begin
            val = x[r,c]*(m[r,2] - m[r,1]) + m[r,1]
            m[r,3] > 0 && val < 0 ? zero(val) : val
            end grad=Dual;
    end
end
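# Illustrative checks of the [min max] convention above (values chosen here, not from the original demo):
#   rowwise_norm([0.0 10.0], [0.0 5.0 10.0])         ->  [0.0 0.5 1.0]
#   rowwise_unnorm([0.0 10.0], [0.0 0.5 1.0])        ->  [0.0 5.0 10.0]
#   rowwise_unnorm([0.0 10.0 1.0], [-0.2 0.5 1.0])   ->  [0.0 5.0 10.0]   (the -2.0 from un-normalizing -0.2 is clipped to zero via the 3rd column)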


##### CUSTOM FLEXIBLE LAYER #########
mutable struct FlexLayer{F,S<:AbstractArray,T<:AbstractArray,fa<:Array{Function,1},num_a<:Array{Float64,2}}
    W::S
    b::T
    σ::F
    fcn_array::fa # array of functions; used for per-row activation functions
    num_array::num_a # ancillary array for normalization or other purposes; must be batch-size independent!
                     # If output of layer has size (n_out, dataset_length), then to include a multiplier and
                     # offset for each output, num_array should be of size (n_out, 2). It can certainly be
                     # re-purposed, but as FlexLayer is written currently, num_array is used for un-normalizing
                     # each row and it should be of size (n_out, 2) or (n_out, 3), with the 1st and 2nd columns
                     # holding the min and max values to which the output should be unnormalized. A 3rd column,
                     # if present, should be 0 or 1 according to whether unnormalized values for that row should
                     # be clipped to zero if negative. (This helps retain any desired nonnegativity while unnormalizing.)
end

# Three useful "sub-classes" / special cases of FlexLayer:
#
# Apply different functions to each output, but do not do any unnormalization
#FlexLayer(W, b, fcn_array) = FlexLayer(W, b, fcn_array::Array{Function,1}, min_max::Array{Float64,2}(), identity)
FlexLayer(in::Integer, out::Integer, fcn_array::Array{Function,1}) = FlexLayer(in, out, fcn_array, Array{Float64}(undef, 0, 2), identity)
#
# Unnnormalize each row, without any activation function
#UnnormLayer(W, b, min_max) = FlexLayer(W, b, Array{Function,1}(), min_max::Array{Float64,2}, identity)
UnnormLayer(in::Integer, out::Integer, min_max::Array{Float64,2}) = FlexLayer(in, out, Array{Function,1}(), min_max, identity)
#
# Enforce non-negativity constraint at rows indicated by row_is_nonneg boolean array; applies softplus to those rows, identity elsewhere
# Using softplus keeps output non-negative without depressing fits to peaks (as sigmoid would do)
#NonnegLayer(W, b, row_is_nonneg::Array{Bool,1}) = FlexLayer(W, b, [ans ? softplus : identity for ans in row_is_nonneg]::Array{Function,1}, Array{Float64}(undef, 0, 2), identity)
NonnegLayer(in::Integer, out::Integer, row_is_nonneg::Array{Bool,1}) = FlexLayer(in, out, Function[ans ? softplus : identity for ans in row_is_nonneg], Array{Float64}(undef, 0, 2), identity)

function FlexLayer(in::Integer, out::Integer, fcn_array::Array{Function,1}, num_array::Array{Float64,2}, σ=identity) 
    return FlexLayer(randn(out, in), randn(out), σ, fcn_array, num_array)
end

Flux.@functor FlexLayer # registers the layer with Flux
Flux.trainable(a::FlexLayer) = (a.W, a.b) # only W and b are optimized; fcn_array and num_array stay fixed

"Custom dense layer having the option of different activations functions for each output.
This is especially useful for applying non-negativity constraints to certain outputs but
could be useful in other circumstances. Also included is the option to pass an ancillary 2D array,
which could be used for any purpose but is here used for per-row un-normalization of the output.

Note: If both are to be used, the relative order between un-normalization and applying activation 
functions matters. If un-normalization is applied first, then the (possibly large) outputs will be 
in the wings of the activation function and will 'kill the gradient', which is bad for training.
For this reason, activation functions are applied before unnormalization. However, un-normalization
will remove any non-negativity constraint enforced by the (softplus) activation function. This can 
be corrected: rather than passing a min_max array with two columns, pass in one with three columns,
the third column being one or zero according, respectively, to whether negative values (after 
unnormalization) should or should not be set to zero. See rowwise_unnorm()."
function (a::FlexLayer)(x::AbstractArray)
    x_out = a.W * x .+ a.b

    if length(a.fcn_array)==size(x_out, 1)
        x_out = rowmap(a.fcn_array, x_out)
    else
        x_out = a.σ.(x_out)
    end

    if size(a.num_array,1)==size(x_out, 1)
        x_out = rowwise_unnorm(a.num_array, x_out)
    end
    return x_out
end
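# Worked example of the activation-then-un-normalize ordering described above (illustrative numbers, not from the demo):
# suppose row 1 uses softplus and row 1 of num_array is [-200.0 800.0 1.0] (min, max, clip).
# A raw pre-activation of -5.0 becomes softplus(-5.0) ≈ 0.0067, which un-normalizes to
# 0.0067*(800 - (-200)) + (-200) ≈ -193.3; the clip flag in the 3rd column then sets it to 0.0,
# so the output remains non-negative even after un-normalization.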

loss(x, y) = mse(m(x), y)

# (Older Flux versions used @treelike for this; @functor is its replacement and is what Flux's own Dense uses.)

"Demonstration of FlexLayer and special cases.
    scale_data: if true, first row has large values (-200 to 800); otherwise, between approx. -0.2 and 0.8.
    use_unnorm: if true, unnormalize scaled data in first row.
    enforce_nonneg: if true, enforce nonnegativity on first row."
function FlexLayerDemo(mode::Integer)
    # decode the four flags from the four lowest bits of mode
    scale_data, use_unnorm, enforce_nonneg, act_fcn_demo = (mode & 2^i > 0 for i in 0:3)
    global X, Y, m
    global LossLog, LossLog_T
    LossLog = Float64[]
    LossLog_T = Int[]
    
    X, Y = generate_data() # Training data of shape (3,num_samples)
    
    ##### MODEL & TRAINING #####################
    n = 10 # number of neurons in each hidden layer
    act_f = tanh
    
    if scale_data
        Y[1,:] .*= 1000.
    end
    unnorm_min, unnorm_max = extrema(Y[1,:])
    
    if act_fcn_demo
        last_layer = FlexLayer(n, 3, [relu, sigmoid, tanh], [0. 1.; 0. 1.; 0. 1.])
    else
        if use_unnorm && enforce_nonneg
            # Because the non-negativity-inducing activation function is applied BEFORE un-normalization
            # (so as not to kill the gradient), un-normalization will remove the non-negativity constraint.
            # But by using a min_max array with *three* columns, the third column indicating whether negative
            # values should be set to 0, we recover non-negativity (though it does not go to zero smoothly)
            last_layer = FlexLayer(n, 3, [softplus, identity, identity], [unnorm_min unnorm_max 1.; 0. 1. 0.; 0. 1. 0.])
        elseif use_unnorm
            last_layer = UnnormLayer(n, 3, [unnorm_min unnorm_max; 0. 1.; 0. 1.])
        elseif enforce_nonneg
            last_layer = NonnegLayer(n, 3, [true, false, false])
        else
            last_layer = Dense(n, 3)
        end
    end

    m = Chain(Dense(size(X, 1), n, act_f), Dense(n, n, act_f), last_layer)
    opt = ADAM()

    println("Data Set Size: X: ", size(X), ", Y: ", size(Y))
    batchsize = 5
    NumEpochs = 2000
    for epoch in 1:NumEpochs
        dataset = [(X[:, i], Y[:, i]) for i in partition(shuffle(1:size(X, 2)), batchsize)] # create mini-batches
        Flux.train!(loss, Flux.params(m), dataset, opt; cb=throttle(evalcb, 0.1))
    end

    # To use DataLoader instead, declare
    #    using Flux.Data: DataLoader
    # and define (once, before the epoch loop)
    #    train_loader = DataLoader(X, Y, batchsize=batchsize, shuffle=true, partial=false)
    # then replace the body of the epoch loop with a single call:
    #    Flux.train!(loss, Flux.params(m), train_loader, opt; cb=throttle(evalcb, 0.1))
end

To try this out, run the code block below for different values of option (read the comments for each option prior to running). It’s suggested that they be tried in order (0, 1, …, 5). (Note that a few options give intentionally bad results but are included to illustrate some effect.)

# A few cases of interest, described below. Set option to a value between 0 and 5. Instructive to proceed in order.
option = 0

scale_data, use_unnorm, enforce_nonneg, act_fcn_demo = false, false, false, false
if option==0
    # ordinary, Dense layer
    scale_data, use_unnorm, enforce_nonneg = false, false, false
elseif option==1 # INTENTIONALLY BAD
    # Example of using a different activation function for each output. The choices AREN'T *good* choices,
    # but you can see that a different function is applied to each output. In this example, the other options are ignored.
    act_fcn_demo = true
elseif option==2
    # Example of non-negativity constraint applied to first output.
    scale_data, use_unnorm, enforce_nonneg = false, false, true
elseif option==3 # INTENTIONALLY BAD
    # ordinary, Dense layer, with first-row data intentionally having values well outside range of activation 
    # functions; will NOT work well, so we progress to option 4...
    scale_data, use_unnorm, enforce_nonneg = true, false, false
elseif option==4
    # Unnorm layer, so that the first output is un-normalized prior to calculating the loss with respect to
    # the large-valued first-row data. (In practice, because the values between -200 and 800 give absolute errors
    # much larger than the absolute errors from the outputs having values between -1 and 1, training will
    # prefer to fit the first output at the expense of the others.)
    scale_data, use_unnorm, enforce_nonneg = true, true, false
elseif option==5
    # Same as option 4, but also constrain the first output to be non-negative. More discussion in the code.
    scale_data, use_unnorm, enforce_nonneg = true, true, true
end

mode = sum([(x ? 1 : 0)*2^(i-1) for (i, x) in enumerate([scale_data, use_unnorm, enforce_nonneg, act_fcn_demo])])
FlexLayerDemo(mode)

The posts that were instrumental in building this up are listed below, along with the usernames of many who have helped. Any mistakes, inefficiencies, or exhibitions of bad coding style are entirely my own (and again, corrections are welcome; they'll help me learn!)

Flux: Custom Layer (6/15/20)
Thanks to @LudiWin for the initial example and @contradict for finding my bug and pointing me to Flux.DataLoader
(Here I followed up with a simple working custom layer – one which enforces non-negativity for (all) outputs.)

Array of Plot objects (6/18/20)
For an n-output Flux model, one can plot each output (overlaid on the corresponding target), for arbitrary n.
Thanks to @rdeits and @DrPapa for the syntax correction.

Flux: Custom Training and Logging (6/18/20)
Thanks to @contradict and @oxinabox for getting the docs updated for the custom training example in Flux.

Row-wise function application (different functions) (6/25/20)
Thanks to @contradict for a one-liner approach to applying an array of functions, one to each row of an array. And to @mforets for a fast solution using view and map!.

Mutating arrays not supported (6/26/20)
A huge thanks to @mcabbott for his Tullio package and his extensive help that enabled me to make changes to arrays inside a custom layer (including row-wise function application). This is what allowed my FlexLayer example to actually come together. His help with extending rowwise_unnorm to work with a nonnegativity condition was also appreciated.

Finally, a big thanks goes to @rajnrao for getting me started in all this…and for much offline help.
