Bayesian Neural Network (Multiclass Classification) in Turing.jl and Flux.jl

alasaadstat · February 24, 2019, 6:25am

I want to do a multiclass classification using Bayesian Neural Network (BNN) in Turing.jl and Flux.jl. There’s a good implementation already of the binary classification using BNN in Turing.jl, check it here. Hence, my goal is simply to extend this binary classification into multiclass. I raised this issue before in the TuringTutorials Github repository, check here, but there seems to be a problem in Flux’s softmax function (missing method).

Here’s my objective: I want to model the iris dataset classifying three species: setosa, versicolor, and virginica. In this case, the output layer is a multiclass and in Bayesian context this can be characterized by a Categorical distribution. Having said, the Categorical distribution takes probability as input, thus I added softmax on the network to normalize the output values between 0 and 1 (summing all values to 1). Here is the complete code:

using Flux
using RDatasets
using Measures
using StatsPlots
using PlotThemes
using Turing
pyplot()
theme(:orange)

iris = dataset("datasets", "iris");

@df iris scatter(:SepalLength, :SepalWidth, group = :Species,
    xlabel = "Length", ylabel = "Width", markersize = 5,
    markeralpha = 0.75, markerstrokewidth = 0, linealpha = 0, 
    m = (0.5, [:cross :hex :star7], 12),
    margin = 5mm)

inputs = convert(Matrix, iris[1:4]);
labels = map(x -> x == "setosa" ? 0 : x == "versicolor" ? 1 : 2, iris[end]);

function weights(theta::AbstractVector)
    W0 = reshape(theta[ 1:20], 5, 4); b0 = reshape(theta[21:25], 5)
    W1 = reshape(theta[26:45], 4, 5); b1 = reshape(theta[46:49], 4)
    WO = reshape(theta[50:61], 3, 4); bO = reshape(theta[61:63], 3)    

    return W0, b0, W1, b1, WO, bO
end

function feedforward(inp::Array{Float64, 2}, theta::AbstractVector)
    W0, b0, W1, b1, W2, b2 = weights(theta)
    model = Chain(
        Dense(W0, b0, tanh),
        Dense(W1, b1, tanh),
        Dense(W2, b2, σ),
        xs -> exp.(xs) / sum(exp.(xs))
    )

    return model(inp)
end

alpha = 0.09;              
sigma = sqrt(1.0 / alpha); 

@model bayesnn(inp, lab) = begin
    theta ~ MvNormal(zeros(63), sigma .* ones(63))
    
    preds = feedforward(inp, theta)
    for i = 1:length(lab)
        lab[i] ~ Categorical(preds[:, i])
    end
end

steps = 5000
chain = sample(bayesnn(Array(inputs'), labels), HMC(steps, 0.05, 4));

The line xs -> exp.(xs) / sum(exp.(xs)) is simply the softmax function, but if I replace that with the Flux function, that is replacing xs -> exp.(xs) / sum(exp.(xs)) with softmax, and run the last line: chain = sample(bayesnn(Array(inputs'), labels), HMC(steps, 0.05, 4)); I get the following error:

ERROR: MethodError: no method matching softmax!(::Array{ForwardDiff.Dual{ForwardDiff.Tag{getfield(Turing.Core, Symbol("#f#24")){Turing.Core.VarReplay.VarInfo,Turing.Model{Tuple{:theta},Tuple{:lab},getfield(Main, Symbol("###inner_function#571#13")){Array{Float64,2}},NamedTuple{(:lab,),Tuple{Array{Int64,1}}},NamedTuple{(:lab,),Tuple{Symbol}}},Turing.Sampler{HMC{Turing.Core.ForwardDiffAD{40},Any}}},Real},Float64,10},2}, ::Array{ForwardDiff.Dual{ForwardDiff.Tag{getfield(Turing.Core, Symbol("#f#24")){Turing.Core.VarReplay.VarInfo,Turing.Model{Tuple{:theta},Tuple{:lab},getfield(Main, Symbol("###inner_function#571#13")){Array{Float64,2}},NamedTuple{(:lab,),Tuple{Array{Int64,1}}},NamedTuple{(:lab,),Tuple{Symbol}}},Turing.Sampler{HMC{Turing.Core.ForwardDiffAD{40},Any}}},Real},Float64,10},2})
Closest candidates are:
  softmax!(::Any) at /Users/alahmadgaidasaad/.julia/packages/NNlib/x0XUf/src/softmax.jl:28
Stacktrace:
 [1] softmax(::Array{ForwardDiff.Dual{ForwardDiff.Tag{getfield(Turing.Core, Symbol("#f#24")){Turing.Core.VarReplay.VarInfo,Turing.Model{Tuple{:theta},Tuple{:lab},getfield(Main, Symbol("###inner_function#571#13")){Array{Float64,2}},NamedTuple{(:lab,),Tuple{Array{Int64,1}}},NamedTuple{(:lab,),Tuple{Symbol}}},Turing.Sampler{HMC{Turing.Core.ForwardDiffAD{40},Any}}},Real},Float64,10},2}) at /Users/alahmadgaidasaad/.julia/packages/NNlib/x0XUf/src/softmax.jl:29
 [2] applychain(::Tuple{typeof(softmax)}, ::Array{ForwardDiff.Dual{ForwardDiff.Tag{getfield(Turing.Core, Symbol("#f#24")){Turing.Core.VarReplay.VarInfo,Turing.Model{Tuple{:theta},Tuple{:lab},getfield(Main, Symbol("###inner_function#571#13")){Array{Float64,2}},NamedTuple{(:lab,),Tuple{Array{Int64,1}}},NamedTuple{(:lab,),Tuple{Symbol}}},Turing.Sampler{HMC{Turing.Core.ForwardDiffAD{40},Any}}},Real},Float64,10},2}) at /Users/alahmadgaidasaad/.julia/packages/Flux/8XpDt/src/layers/basic.jl:31 (repeats 4 times)
  :
  :

However, if I use the above code as it is (not using softmax). I’m getting the following error:

ERROR: ArgumentError: Categorical: the condition isprobvec(p) is not satisfied.
Stacktrace:
 [1] macro expansion at /Users/alahmadgaidasaad/.julia/packages/Distributions/WHjOk/src/utils.jl:6 [inlined]
 [2] Type at /Users/alahmadgaidasaad/.julia/packages/Distributions/WHjOk/src/univariate/discrete/categorical.jl:27 [inlined]
 [3] Categorical(::Array{Float64,1}) at /Users/alahmadgaidasaad/.julia/packages/Distributions/WHjOk/src/univariate/discrete/categorical.jl:38
 [4] macro expansion at /Users/alahmadgaidasaad/.julia/packages/Turing/sghu6/src/core/compiler.jl:43 [inlined]
  :
  :

Theoretically, the softmax defined by this line xs -> exp.(xs) / sum(exp.(xs)) should already satisfy isprobvec, since the sum of the output of the network should be 1 and their values should be between 0 and 1. I tried to debug this by checking the isprobvec using the following codes:

theta = rand(MvNormal(zeros(63), sigma .* ones(63)), 1)
W0, b0, W1, b1, W2, b2 = weights(theta[:, 1])
model = Chain(
    Dense(W0, b0, tanh),
    Dense(W1, b1, tanh),
    Dense(W2, b2, σ),
    xs -> exp.(xs) / sum(exp.(xs))
)
model(inputs[1, :])
# 3-element Array{Float64,1}:
#  0.5650137463877715 
#  0.22707964752865994
#  0.2079066060835685

Now if we check the sum and isprobvec, we have the following:

sum(model(inputs[1, :]))
# 0.9999999999999999

isprobvec(model(inputs[1, :]))
# true

Though the sum is not 1.0, but that’s already it theoretically. In fact, the isprobvec confirmed it to be true already. However, if I use the softmax function instead, we have:

theta = rand(MvNormal(zeros(63), sigma .* ones(63)), 1)
W0, b0, W1, b1, W2, b2 = weights(theta[:, 1])
model = Chain(
    Dense(W0, b0, tanh),
    Dense(W1, b1, tanh),
    Dense(W2, b2, σ),
    softmax
)
model(inputs[1, :])
# 3-element Array{Float64,1}:
#  0.5730746731083979 
#  0.2157227988217102 
#  0.21120252806989193

sum(model(inputs[1, :]))
# 1.0

isprobvec(model(inputs[1, :]))
# true

We get exactly 1.0 instead of 0.999999999. I want to use xs -> exp.(xs) / sum(exp.(xs)) instead of softmax since I don’t have to bother with the missing method (see above error), but I’m not sure why isprobvec is not satisfied for the Categorical distribution, when in fact it is already true as seen in my initial debugging. Am I missing something? Thanks.

cpfiffer · February 24, 2019, 10:44pm

I had to run off to an appointment and haven’t looked into this too much yet, but I printed out the preds = feedforward(inp, theta) array, and it doesn’t sum to one on any axis. Not sure why not.

I’ll take another look in a couple hours, but that seems to be why Categorical is throwing that error.

cpfiffer · February 25, 2019, 2:14am

Seems there’s two problems here. The first is that the softmax function in NNLib has some pretty restrictive typing, which means it’s not playing nice with the automatic differentiation. The other is that the anonymous function you defined as a replacement for softmax, xs -> exp.(xs) / sum(exp.(xs)) doesn’t actually work columnwise – it’ll take the sum of an entire rectangular array and divide every element by that sum. So the whole array will sum to one, but no individual column is a probability vector. I wrote my own softmax function that plays fairly nice with Flux:

function softmax_(arr::AbstractArray)
    ex = mapslices(x -> exp.(x), arr, dims=1)
    rows, cols = size(arr)
    val = similar(ex)
    for i in 1:cols
        s = sum(ex[:,i])
        for j in 1:rows
            val[j,i] = ex[j,i] / s
        end
    end
    return val
end

And changed the feedforward function to:

function feedforward(inp::AbstractArray, theta::AbstractVector)
    W0, b0, W1, b1, W2, b2 = weights(theta)
    model = Chain(
        Dense(W0, b0, tanh),
        Dense(W1, b1, tanh),
        Dense(W2, b2, σ),
        softmax_
    )

    return model(inp)
end

Note that you’ll also get a speedup from using Turing’s reverse mode AD (which uses Flux) by calling setadbackend(:reverse_diff). Reverse mode tends to be faster if you have a lot of parameters, as in the case of neural nets.

I haven’t investigated to see if the results are sensible, but here’s the full code I’ve modified to work:

using Flux
using RDatasets
using Measures
using StatsPlots
using PlotThemes
using Turing
pyplot()
theme(:orange)

iris = dataset("datasets", "iris");

@df iris scatter(:SepalLength, :SepalWidth, group = :Species,
    xlabel = "Length", ylabel = "Width", markersize = 5,
    markeralpha = 0.75, markerstrokewidth = 0, linealpha = 0,
    m = (0.5, [:cross :hex :star7], 12),
    margin = 5mm)

inputs = convert(Matrix, iris[1:4]);
labels = map(x -> x == "setosa" ? 0 : x == "versicolor" ? 1 : 2, iris[end]);

function softmax_(arr::AbstractArray)
    ex = mapslices(x -> exp.(x), arr, dims=1)
    rows, cols = size(arr)
    val = similar(ex)
    for i in 1:cols
        s = sum(ex[:,i])
        for j in 1:rows
            val[j,i] = ex[j,i] / s
        end
    end
    return val
end

function weights(theta::AbstractVector)
    W0 = reshape(theta[ 1:20], 5, 4); b0 = reshape(theta[21:25], 5)
    W1 = reshape(theta[26:45], 4, 5); b1 = reshape(theta[46:49], 4)
    WO = reshape(theta[50:61], 3, 4); bO = reshape(theta[61:63], 3)

    return W0, b0, W1, b1, WO, bO
end

function feedforward(inp::AbstractArray, theta::AbstractVector)
    W0, b0, W1, b1, W2, b2 = weights(theta)
    model = Chain(
        Dense(W0, b0, tanh),
        Dense(W1, b1, tanh),
        Dense(W2, b2, σ),
        softmax_
    )

    return model(inp)
end

alpha = 0.09;
sigma = sqrt(1.0 / alpha);

@model bayesnn(inp, lab) = begin
    theta ~ MvNormal(zeros(63), sigma .* ones(63))

    preds = feedforward(inp, theta)
    for i = 1:length(lab)
        lab[i] ~ Categorical(preds[:,i])
    end
end

setadbackend(:reverse_diff)
steps = 5000
chain = sample(bayesnn(Array(inputs'), labels), HMC(steps, 0.05, 4));

Let me know how it works out! It’s a nifty little project you’re doing here.

alasaadstat · February 25, 2019, 2:17am

Thank you very much for the detailed explanation. I will study this and check the results.

elizavetasemenova · May 9, 2019, 10:51am

Hi! I am interested in BNNs as well and am trying to run this code (it seems, there is a tiny typo in the expression for Wo - see the corrected code below). Anyhow, the samples do not display any variation, why could this be?

using Flux
using RDatasets
using Measures
using Turing

iris = dataset("datasets", "iris");

inputs = convert(Matrix, iris[1:4]);

labels = map(x -> x == "setosa" ? 0 : x == "versicolor" ? 1 : 2, iris[end]);

function softmax_(arr::AbstractArray)
    ex = mapslices(x -> exp.(x), arr, dims=1)
    rows, cols = size(arr)
    val = similar(ex)
    for i in 1:cols
        s = sum(ex[:,i])
        for j in 1:rows
            val[j,i] = ex[j,i] / s
        end
    end
    return val
end

function weights(theta::AbstractVector)
    W0 = reshape(theta[ 1:20], 5, 4); b0 = reshape(theta[21:25], 5)
    W1 = reshape(theta[26:45], 4, 5); b1 = reshape(theta[46:49], 4)
    W2 = reshape(theta[50:61], 3, 4); b2 = reshape(theta[62:64], 3)

    return W0, b0, W1, b1, W2, b2
end

function feedforward(inp::AbstractArray, theta::AbstractVector)
    W0, b0, W1, b1, W2, b2 = weights(theta)
    model = Chain(
        Dense(W0, b0, tanh),
        Dense(W1, b1, tanh),
        Dense(W2, b2, σ),
        softmax_
    )

    return model(inp)
end

alpha = 0.09;
sigma = sqrt(1.0 / alpha)

@model bayesnn(inp, lab) = begin
    theta ~ MvNormal(zeros(64), sigma .* ones(64))

    preds = feedforward(inp, theta)
    for i = 1:length(lab)
        lab[i] ~ Categorical(preds[:,i])
    end
end

Turing.turnprogress(false);

setadbackend(:reverse_diff);

steps = 100
chain = sample(bayesnn(Array(inputs'), labels), HMC(steps, 0.05, 4));

# Extract all weight and bias parameters.
theta = chain[:theta].value.data

Kai_Xu · May 9, 2019, 3:08pm

Hi! You code is mathematically correct but not numerically stable, mainly due to the softmax function being used there. Thus the HMC actually rejects each step it proposed due to numerical error. We can instead work only in the log space by introducing a customised distribution. See my adapted code below:

using Flux
using RDatasets
using Measures
using Turing
using StatsFuns: logsumexp

iris = dataset("datasets", "iris");

inputs = convert(Matrix, iris[1:4]);

labels = map(x -> x == "setosa" ? 0 : x == "versicolor" ? 1 : 2, iris[end]);

function weights(theta::AbstractVector)
    W0 = reshape(theta[ 1:20], 5, 4); b0 = reshape(theta[21:25], 5)
    W1 = reshape(theta[26:45], 4, 5); b1 = reshape(theta[46:49], 4)
    W2 = reshape(theta[50:61], 3, 4); b2 = reshape(theta[62:64], 3)
    return W0, b0, W1, b1, W2, b2
end

function feedforward(inp::AbstractArray, theta::AbstractVector)
    W0, b0, W1, b1, W2, b2 = weights(theta)
    model = Chain(
        Dense(W0, b0, tanh),
        Dense(W1, b1, tanh),
        Dense(W2, b2)
    )
    return model(inp)
end

# Create `CategoricalLogit` to prevent numerical issues
struct CategoricalLogit <: DiscreteUnivariateDistribution
    logitp
end

# Mainly the `logsumexp` avoids the numerical issue
function Distributions.logpdf(d::CategoricalLogit, k::Int)
    return (d.logitp .- logsumexp(d.logitp))[k+1]    # use k+1 as lab[i] is 0-indexed
end

alpha = 0.09;
sigma = sqrt(1.0 / alpha)

@model bayesnn(inp, lab) = begin
    theta ~ MvNormal(zeros(64), sigma .* ones(64))

    preds = feedforward(inp, theta)
    for i = 1:length(lab)
        lab[i] ~ CategoricalLogit(preds[:,i])
    end
end

Turing.turnprogress(false);

setadbackend(:forward_diff);
# setadbackend(:reverse_diff);

steps = 1_000
chain = sample(bayesnn(Array(inputs'), labels), HMC(steps, 5e-3, 5));

# Extract all weight and bias parameters.
theta = chain[:theta].value.data

Also it seems that the reverse mode AD provided by Flux.Tracker has some issue with this example … thus I turned on forward mode by setadbackend(:forward_diff). Not sure what’s the cause. I will look into it and update if I figured it out.

Topic		Replies	Views
Issue with BNN example from Turing Tutorial Probabilistic Programming turing	3	643	December 24, 2019
Turing.jl, differentiation & categorical outputs -- `isprobvec` conundrum Machine Learning question , differentiation , turing , potential-bug	5	718	February 8, 2021
Bayesian Neural Network New to Julia question , package	5	1609	April 13, 2022
Bayesian logistic regression with Turing.jl Probabilistic Programming turing , monte-carlo	29	4496	May 18, 2021
How to make a basic classification in flux? Machine Learning question , flux , classification	2	1157	April 2, 2021

Bayesian Neural Network (Multiclass Classification) in Turing.jl and Flux.jl

Related topics