Bayesian Neural Network (Multiclass Classification) in Turing.jl and Flux.jl

question
#1

I want to do a multiclass classification using Bayesian Neural Network (BNN) in Turing.jl and Flux.jl. There’s a good implementation already of the binary classification using BNN in Turing.jl, check it here. Hence, my goal is simply to extend this binary classification into multiclass. I raised this issue before in the TuringTutorials Github repository, check here, but there seems to be a problem in Flux’s softmax function (missing method).

Here’s my objective: I want to model the iris dataset classifying three species: setosa, versicolor, and virginica. In this case, the output layer is a multiclass and in Bayesian context this can be characterized by a Categorical distribution. Having said, the Categorical distribution takes probability as input, thus I added softmax on the network to normalize the output values between 0 and 1 (summing all values to 1). Here is the complete code:

using Flux
using RDatasets
using Measures
using StatsPlots
using PlotThemes
using Turing
pyplot()
theme(:orange)

iris = dataset("datasets", "iris");

@df iris scatter(:SepalLength, :SepalWidth, group = :Species,
    xlabel = "Length", ylabel = "Width", markersize = 5,
    markeralpha = 0.75, markerstrokewidth = 0, linealpha = 0, 
    m = (0.5, [:cross :hex :star7], 12),
    margin = 5mm)

inputs = convert(Matrix, iris[1:4]);
labels = map(x -> x == "setosa" ? 0 : x == "versicolor" ? 1 : 2, iris[end]);

function weights(theta::AbstractVector)
    W0 = reshape(theta[ 1:20], 5, 4); b0 = reshape(theta[21:25], 5)
    W1 = reshape(theta[26:45], 4, 5); b1 = reshape(theta[46:49], 4)
    WO = reshape(theta[50:61], 3, 4); bO = reshape(theta[61:63], 3)    

    return W0, b0, W1, b1, WO, bO
end

function feedforward(inp::Array{Float64, 2}, theta::AbstractVector)
    W0, b0, W1, b1, W2, b2 = weights(theta)
    model = Chain(
        Dense(W0, b0, tanh),
        Dense(W1, b1, tanh),
        Dense(W2, b2, σ),
        xs -> exp.(xs) / sum(exp.(xs))
    )

    return model(inp)
end

alpha = 0.09;              
sigma = sqrt(1.0 / alpha); 

@model bayesnn(inp, lab) = begin
    theta ~ MvNormal(zeros(63), sigma .* ones(63))
    
    preds = feedforward(inp, theta)
    for i = 1:length(lab)
        lab[i] ~ Categorical(preds[:, i])
    end
end

steps = 5000
chain = sample(bayesnn(Array(inputs'), labels), HMC(steps, 0.05, 4));

The line xs -> exp.(xs) / sum(exp.(xs)) is simply the softmax function, but if I replace that with the Flux function, that is replacing xs -> exp.(xs) / sum(exp.(xs)) with softmax, and run the last line: chain = sample(bayesnn(Array(inputs'), labels), HMC(steps, 0.05, 4)); I get the following error:

ERROR: MethodError: no method matching softmax!(::Array{ForwardDiff.Dual{ForwardDiff.Tag{getfield(Turing.Core, Symbol("#f#24")){Turing.Core.VarReplay.VarInfo,Turing.Model{Tuple{:theta},Tuple{:lab},getfield(Main, Symbol("###inner_function#571#13")){Array{Float64,2}},NamedTuple{(:lab,),Tuple{Array{Int64,1}}},NamedTuple{(:lab,),Tuple{Symbol}}},Turing.Sampler{HMC{Turing.Core.ForwardDiffAD{40},Any}}},Real},Float64,10},2}, ::Array{ForwardDiff.Dual{ForwardDiff.Tag{getfield(Turing.Core, Symbol("#f#24")){Turing.Core.VarReplay.VarInfo,Turing.Model{Tuple{:theta},Tuple{:lab},getfield(Main, Symbol("###inner_function#571#13")){Array{Float64,2}},NamedTuple{(:lab,),Tuple{Array{Int64,1}}},NamedTuple{(:lab,),Tuple{Symbol}}},Turing.Sampler{HMC{Turing.Core.ForwardDiffAD{40},Any}}},Real},Float64,10},2})
Closest candidates are:
  softmax!(::Any) at /Users/alahmadgaidasaad/.julia/packages/NNlib/x0XUf/src/softmax.jl:28
Stacktrace:
 [1] softmax(::Array{ForwardDiff.Dual{ForwardDiff.Tag{getfield(Turing.Core, Symbol("#f#24")){Turing.Core.VarReplay.VarInfo,Turing.Model{Tuple{:theta},Tuple{:lab},getfield(Main, Symbol("###inner_function#571#13")){Array{Float64,2}},NamedTuple{(:lab,),Tuple{Array{Int64,1}}},NamedTuple{(:lab,),Tuple{Symbol}}},Turing.Sampler{HMC{Turing.Core.ForwardDiffAD{40},Any}}},Real},Float64,10},2}) at /Users/alahmadgaidasaad/.julia/packages/NNlib/x0XUf/src/softmax.jl:29
 [2] applychain(::Tuple{typeof(softmax)}, ::Array{ForwardDiff.Dual{ForwardDiff.Tag{getfield(Turing.Core, Symbol("#f#24")){Turing.Core.VarReplay.VarInfo,Turing.Model{Tuple{:theta},Tuple{:lab},getfield(Main, Symbol("###inner_function#571#13")){Array{Float64,2}},NamedTuple{(:lab,),Tuple{Array{Int64,1}}},NamedTuple{(:lab,),Tuple{Symbol}}},Turing.Sampler{HMC{Turing.Core.ForwardDiffAD{40},Any}}},Real},Float64,10},2}) at /Users/alahmadgaidasaad/.julia/packages/Flux/8XpDt/src/layers/basic.jl:31 (repeats 4 times)
  :
  : 

However, if I use the above code as it is (not using softmax). I’m getting the following error:

ERROR: ArgumentError: Categorical: the condition isprobvec(p) is not satisfied.
Stacktrace:
 [1] macro expansion at /Users/alahmadgaidasaad/.julia/packages/Distributions/WHjOk/src/utils.jl:6 [inlined]
 [2] Type at /Users/alahmadgaidasaad/.julia/packages/Distributions/WHjOk/src/univariate/discrete/categorical.jl:27 [inlined]
 [3] Categorical(::Array{Float64,1}) at /Users/alahmadgaidasaad/.julia/packages/Distributions/WHjOk/src/univariate/discrete/categorical.jl:38
 [4] macro expansion at /Users/alahmadgaidasaad/.julia/packages/Turing/sghu6/src/core/compiler.jl:43 [inlined]
  :
  :

Theoretically, the softmax defined by this line xs -> exp.(xs) / sum(exp.(xs)) should already satisfy isprobvec, since the sum of the output of the network should be 1 and their values should be between 0 and 1. I tried to debug this by checking the isprobvec using the following codes:

theta = rand(MvNormal(zeros(63), sigma .* ones(63)), 1)
W0, b0, W1, b1, W2, b2 = weights(theta[:, 1])
model = Chain(
    Dense(W0, b0, tanh),
    Dense(W1, b1, tanh),
    Dense(W2, b2, σ),
    xs -> exp.(xs) / sum(exp.(xs))
)
model(inputs[1, :])
# 3-element Array{Float64,1}:
#  0.5650137463877715 
#  0.22707964752865994
#  0.2079066060835685

Now if we check the sum and isprobvec, we have the following:

sum(model(inputs[1, :]))
# 0.9999999999999999

isprobvec(model(inputs[1, :]))
# true

Though the sum is not 1.0, but that’s already it theoretically. In fact, the isprobvec confirmed it to be true already. However, if I use the softmax function instead, we have:

theta = rand(MvNormal(zeros(63), sigma .* ones(63)), 1)
W0, b0, W1, b1, W2, b2 = weights(theta[:, 1])
model = Chain(
    Dense(W0, b0, tanh),
    Dense(W1, b1, tanh),
    Dense(W2, b2, σ),
    softmax
)
model(inputs[1, :])
# 3-element Array{Float64,1}:
#  0.5730746731083979 
#  0.2157227988217102 
#  0.21120252806989193

sum(model(inputs[1, :]))
# 1.0

isprobvec(model(inputs[1, :]))
# true

We get exactly 1.0 instead of 0.999999999. I want to use xs -> exp.(xs) / sum(exp.(xs)) instead of softmax since I don’t have to bother with the missing method (see above error), but I’m not sure why isprobvec is not satisfied for the Categorical distribution, when in fact it is already true as seen in my initial debugging. Am I missing something? Thanks.

#2

I had to run off to an appointment and haven’t looked into this too much yet, but I printed out the preds = feedforward(inp, theta) array, and it doesn’t sum to one on any axis. Not sure why not.

I’ll take another look in a couple hours, but that seems to be why Categorical is throwing that error.

#3

Seems there’s two problems here. The first is that the softmax function in NNLib has some pretty restrictive typing, which means it’s not playing nice with the automatic differentiation. The other is that the anonymous function you defined as a replacement for softmax, xs -> exp.(xs) / sum(exp.(xs)) doesn’t actually work columnwise – it’ll take the sum of an entire rectangular array and divide every element by that sum. So the whole array will sum to one, but no individual column is a probability vector. I wrote my own softmax function that plays fairly nice with Flux:

function softmax_(arr::AbstractArray)
    ex = mapslices(x -> exp.(x), arr, dims=1)
    rows, cols = size(arr)
    val = similar(ex)
    for i in 1:cols
        s = sum(ex[:,i])
        for j in 1:rows
            val[j,i] = ex[j,i] / s
        end
    end
    return val
end

And changed the feedforward function to:

function feedforward(inp::AbstractArray, theta::AbstractVector)
    W0, b0, W1, b1, W2, b2 = weights(theta)
    model = Chain(
        Dense(W0, b0, tanh),
        Dense(W1, b1, tanh),
        Dense(W2, b2, σ),
        softmax_
    )

    return model(inp)
end

Note that you’ll also get a speedup from using Turing’s reverse mode AD (which uses Flux) by calling setadbackend(:reverse_diff). Reverse mode tends to be faster if you have a lot of parameters, as in the case of neural nets.

I haven’t investigated to see if the results are sensible, but here’s the full code I’ve modified to work:

using Flux
using RDatasets
using Measures
using StatsPlots
using PlotThemes
using Turing
pyplot()
theme(:orange)

iris = dataset("datasets", "iris");

@df iris scatter(:SepalLength, :SepalWidth, group = :Species,
    xlabel = "Length", ylabel = "Width", markersize = 5,
    markeralpha = 0.75, markerstrokewidth = 0, linealpha = 0,
    m = (0.5, [:cross :hex :star7], 12),
    margin = 5mm)

inputs = convert(Matrix, iris[1:4]);
labels = map(x -> x == "setosa" ? 0 : x == "versicolor" ? 1 : 2, iris[end]);

function softmax_(arr::AbstractArray)
    ex = mapslices(x -> exp.(x), arr, dims=1)
    rows, cols = size(arr)
    val = similar(ex)
    for i in 1:cols
        s = sum(ex[:,i])
        for j in 1:rows
            val[j,i] = ex[j,i] / s
        end
    end
    return val
end

function weights(theta::AbstractVector)
    W0 = reshape(theta[ 1:20], 5, 4); b0 = reshape(theta[21:25], 5)
    W1 = reshape(theta[26:45], 4, 5); b1 = reshape(theta[46:49], 4)
    WO = reshape(theta[50:61], 3, 4); bO = reshape(theta[61:63], 3)

    return W0, b0, W1, b1, WO, bO
end

function feedforward(inp::AbstractArray, theta::AbstractVector)
    W0, b0, W1, b1, W2, b2 = weights(theta)
    model = Chain(
        Dense(W0, b0, tanh),
        Dense(W1, b1, tanh),
        Dense(W2, b2, σ),
        softmax_
    )

    return model(inp)
end

alpha = 0.09;
sigma = sqrt(1.0 / alpha);

@model bayesnn(inp, lab) = begin
    theta ~ MvNormal(zeros(63), sigma .* ones(63))

    preds = feedforward(inp, theta)
    for i = 1:length(lab)
        lab[i] ~ Categorical(preds[:,i])
    end
end

setadbackend(:reverse_diff)
steps = 5000
chain = sample(bayesnn(Array(inputs'), labels), HMC(steps, 0.05, 4));

Let me know how it works out! It’s a nifty little project you’re doing here.

3 Likes
#4

Thank you very much for the detailed explanation. I will study this and check the results.

#5

Hi! I am interested in BNNs as well and am trying to run this code (it seems, there is a tiny typo in the expression for Wo - see the corrected code below). Anyhow, the samples do not display any variation, why could this be?

using Flux
using RDatasets
using Measures
using Turing

iris = dataset("datasets", "iris");

inputs = convert(Matrix, iris[1:4]);

labels = map(x -> x == "setosa" ? 0 : x == "versicolor" ? 1 : 2, iris[end]);

function softmax_(arr::AbstractArray)
    ex = mapslices(x -> exp.(x), arr, dims=1)
    rows, cols = size(arr)
    val = similar(ex)
    for i in 1:cols
        s = sum(ex[:,i])
        for j in 1:rows
            val[j,i] = ex[j,i] / s
        end
    end
    return val
end

function weights(theta::AbstractVector)
    W0 = reshape(theta[ 1:20], 5, 4); b0 = reshape(theta[21:25], 5)
    W1 = reshape(theta[26:45], 4, 5); b1 = reshape(theta[46:49], 4)
    W2 = reshape(theta[50:61], 3, 4); b2 = reshape(theta[62:64], 3)

    return W0, b0, W1, b1, W2, b2
end

function feedforward(inp::AbstractArray, theta::AbstractVector)
    W0, b0, W1, b1, W2, b2 = weights(theta)
    model = Chain(
        Dense(W0, b0, tanh),
        Dense(W1, b1, tanh),
        Dense(W2, b2, σ),
        softmax_
    )

    return model(inp)
end

alpha = 0.09;
sigma = sqrt(1.0 / alpha)

@model bayesnn(inp, lab) = begin
    theta ~ MvNormal(zeros(64), sigma .* ones(64))

    preds = feedforward(inp, theta)
    for i = 1:length(lab)
        lab[i] ~ Categorical(preds[:,i])
    end
end

Turing.turnprogress(false);

setadbackend(:reverse_diff);

steps = 100
chain = sample(bayesnn(Array(inputs'), labels), HMC(steps, 0.05, 4));

# Extract all weight and bias parameters.
theta = chain[:theta].value.data
#6

Hi! You code is mathematically correct but not numerically stable, mainly due to the softmax function being used there. Thus the HMC actually rejects each step it proposed due to numerical error. We can instead work only in the log space by introducing a customised distribution. See my adapted code below:

using Flux
using RDatasets
using Measures
using Turing
using StatsFuns: logsumexp

iris = dataset("datasets", "iris");

inputs = convert(Matrix, iris[1:4]);

labels = map(x -> x == "setosa" ? 0 : x == "versicolor" ? 1 : 2, iris[end]);

function weights(theta::AbstractVector)
    W0 = reshape(theta[ 1:20], 5, 4); b0 = reshape(theta[21:25], 5)
    W1 = reshape(theta[26:45], 4, 5); b1 = reshape(theta[46:49], 4)
    W2 = reshape(theta[50:61], 3, 4); b2 = reshape(theta[62:64], 3)
    return W0, b0, W1, b1, W2, b2
end

function feedforward(inp::AbstractArray, theta::AbstractVector)
    W0, b0, W1, b1, W2, b2 = weights(theta)
    model = Chain(
        Dense(W0, b0, tanh),
        Dense(W1, b1, tanh),
        Dense(W2, b2)
    )
    return model(inp)
end

# Create `CategoricalLogit` to prevent numerical issues
struct CategoricalLogit <: DiscreteUnivariateDistribution
    logitp
end

# Mainly the `logsumexp` avoids the numerical issue
function Distributions.logpdf(d::CategoricalLogit, k::Int)
    return (d.logitp .- logsumexp(d.logitp))[k+1]    # use k+1 as lab[i] is 0-indexed
end

alpha = 0.09;
sigma = sqrt(1.0 / alpha)

@model bayesnn(inp, lab) = begin
    theta ~ MvNormal(zeros(64), sigma .* ones(64))

    preds = feedforward(inp, theta)
    for i = 1:length(lab)
        lab[i] ~ CategoricalLogit(preds[:,i])
    end
end

Turing.turnprogress(false);

setadbackend(:forward_diff);
# setadbackend(:reverse_diff);

steps = 1_000
chain = sample(bayesnn(Array(inputs'), labels), HMC(steps, 5e-3, 5));

# Extract all weight and bias parameters.
theta = chain[:theta].value.data

Also it seems that the reverse mode AD provided by Flux.Tracker has some issue with this example … thus I turned on forward mode by setadbackend(:forward_diff). Not sure what’s the cause. I will look into it and update if I figured it out.