Bayesian logistic regression with Turing.jl

Hi everyone, first-time user of Turing.jl here. It is really nice, thanks! :star:

I am trying to implement a simple Bayesian logistic regression model, but I keep getting this warning from AdvancedHMC.jl:

┌ Warning: The current proposal will be rejected due to numerical error(s).
│   isfinite.((θ, r, ℓπ, ℓκ)) = (true, false, false, false)
└ @ AdvancedHMC ~/.julia/packages/AdvancedHMC/MIxdK/src/hamiltonian.jl:47

Is there a good way to debug such situations? The posterior samples in the chain are not updated at all, as you can see in the attached image.

Here is the baseline model, which consumes 3D data arrays for conditioning:

using Turing

@model function logistic(Δᵦ, Δₛ, Δᵣ, I)
    # Our prior belief
    θₒ ~ Normal(0.0, 5.0)
    θᵦ ~ Normal(0.0, 5.0)
    θₛ ~ Normal(0.0, 5.0)
    θᵣ ~ Normal(0.0, 5.0)

    # Data conditioning
    nx, ny, nt = size(I)
    for x in 1:nx, y in 1:ny, t in 1:nt
        z = θₒ + θᵦ*log(Δᵦ[x,y,t] + 1) +
                 θₛ*log(Δₛ[x,y,t] + 1) +
                 θᵣ*log(Δᵣ[x,y,t] + 1)

        p = 1 / (1 + exp(-z))

        I[x,y,t] ~ Bernoulli(p)
    end
end

When I feed the actual data to the model, I make sure that the values in the data arrays are in the range [0,1] (feature normalization) to avoid vanishing gradients:

samples = sample(logistic(Δᵦ./κ, Δₛ./κ, Δᵣ./κ, I), HMC(0.1, 5), MCMCThreads(), 1000, 3)

The constant κ above takes care of this normalization.
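For reference, κ is just a single global constant; roughly something like this (the exact computation in my real pipeline may differ):

# Hypothetical: one global constant so all three Δ arrays land in [0, 1].
κ = max(maximum(Δᵦ), maximum(Δₛ), maximum(Δᵣ))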

Do you see any conceptual problem with this model implementation? I can share example 3D data arrays via Google Drive if you need them to debug this further. I asked the same question on Turing’s Zulip channel, and people suggested using I[x,y,t] ~ BernoulliLogit(z) directly, as this distribution type from Turing.jl avoids corner cases with the logarithm function. I tried the suggestion, but it didn’t change the results.

Appreciate any help.

2 Likes

I just tried the following:

using Turing

@model function logistic(Δᵦ, Δₛ, Δᵣ, I)
    # Our prior belief
    θₒ ~ Normal(0.0, 5.0)
    θᵦ ~ Normal(0.0, 5.0)
    θₛ ~ Normal(0.0, 5.0)
    θᵣ ~ Normal(0.0, 5.0)

    # Data conditioning
    nx, ny, nt = size(I)
    for x in 1:nx, y in 1:ny, t in 1:nt
        z = θₒ + θᵦ*log(Δᵦ[x,y,t] + 1) +
                 θₛ*log(Δₛ[x,y,t] + 1) +
                 θᵣ*log(Δᵣ[x,y,t] + 1)

        I[x,y,t] ~ BernoulliLogit(z)
    end

    return (; θₒ, θᵦ, θₛ, θᵣ, I)
end

Δᵦ, Δₛ, Δᵣ = rand(2, 2, 2), rand(2, 2, 2), rand(2, 2, 2)

# Just wanted to check if the prior makes sense.
m_prior = logistic(Δᵦ, Δₛ, Δᵣ, fill(missing, size(Δᵦ)))
m_prior().I # <= accesses the sampled `I`

# Make some observations
I = rand(0:1, size(Δᵦ))
m = logistic(Δᵦ, Δₛ, Δᵣ, I)

chain = sample(m, NUTS(), 1_000)

and it works just fine:

Chains MCMC chain (1000×16×1 Array{Float64, 3}):

Iterations        = 1:1000
Thinning interval = 1
Chains            = 1
Samples per chain = 1000
parameters        = θᵣ, θᵦ, θₒ, θₛ
internals         = acceptance_rate, hamiltonian_energy, hamiltonian_energy_error, is_accept, log_density, lp, max_hamiltonian_energy_error, n_steps, nom_step_size, numerical_error, step_size, tree_depth

Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat 
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64 

          θᵣ    0.0467    3.9219     0.1240    0.0702   662.1581    0.9990
          θᵦ   -6.6730    3.5467     0.1122    0.1166   775.6216    1.0009
          θₒ    1.4792    2.5492     0.0806    0.0778   586.2699    1.0000
          θₛ    3.3406    3.8537     0.1219    0.1541   608.1839    1.0059

Quantiles
  parameters       2.5%     25.0%     50.0%     75.0%     97.5% 
      Symbol    Float64   Float64   Float64   Float64   Float64 

          θᵣ    -7.9517   -2.4478    0.0584    2.6041    7.8303
          θᵦ   -13.7322   -9.0594   -6.6033   -4.3794    0.2003
          θₒ    -3.7906   -0.2141    1.5834    3.0902    6.3588
          θₛ    -4.1569    0.7171    3.3497    5.9218   11.2146

So two things:

  1. Try sampling from the prior to see whether what it produces makes sense (see the sketch after this list).
  2. Use NUTS instead of HMC. The numerical errors are likely because the step size you’re using is too large. NUTS adaptively finds a suitable step size in an initial adaptation phase and then uses it for the rest of the run. This will likely help quite a bit.
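A quick sketch of both (assuming the `m` from above, and a Turing.jl version that ships the `Prior` sampler):

# (1) Sample from the prior only; another way to sanity-check it.
prior_chain = sample(m, Prior(), 1_000)

# (2) NUTS adapts the step size during warm-up, so no manual leapfrog tuning.
chain = sample(m, NUTS(), 1_000)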
2 Likes

You were completely right, @torfjelde. I just reduced the leapfrog step size from 0.1 to 0.01 and the trajectories started to make more sense. I will try to find good parameters using NUTS.

Also, in my use case the tensors have a size of approximately 100x100x100; do you think this will be a problem for inference? I will keep investigating the details. Your comment above is already very helpful. I will keep in mind that we can easily sample from the prior by passing missing for some of the inputs.

1 Like

Actually, try to use this instead to “inspect” your prior:

# Create a prior from which we can simulate observations.
m_prior = logistic(Δᵦ, Δₛ, Δᵣ, fill(missing, size(Δᵦ)))

# Create observations where the `θₒ, θᵦ, θₛ, θᵣ` samples are considered "true" parameters.
θₒ, θᵦ, θₛ, θᵣ, I = m_prior()
I = map(identity, I) # just a trick to convert `Array{Union{Missing, Int}}` to `Array{Int}` for slightly better efficiency.

# Observe.
m = logistic(Δᵦ, Δₛ, Δᵣ, I)

# Sample
chain = sample(m, NUTS(), 1_000)

Then, ideally, you’d be able to approximately recover the true parameters θₒ, θᵦ, θₛ, θᵣ with the mean of chain. If you can’t (and you don’t have convergence issues), then this indicates that there’s something weird going on with your model. E.g. when I tried to generate some data and condition on it, I noticed that only θₒ learned anything useful; the rest of the parameters were simply sampling from the prior (they looked like a Normal(0, 5)). It turned out that this was because my observations were all 0 or 1 :slight_smile: Hopefully this isn’t the case in your data, but regardless, the above is a useful exercise for inspecting your model :+1:
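That is, something along these lines (sketch):

true_values = (; θₒ, θᵦ, θₛ, θᵣ)  # the prior draws used to simulate `I`
mean(chain)                        # posterior means; ideally close to `true_values`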

2 Likes

Just to confirm: you meant the issue in your case was that all observations, including those for the inputs \Delta_\beta, \Delta_s, \Delta_r, were all 0 or 1, right? For the variable I we want only 0 or 1.

I meant only I was 0 or 1. The others were in the range [0, 1] :slight_smile:

In my case the variables \Delta \in [0,\kappa] and I \in \{0,1\}. That should be ok, right? Note the difference (brackets vs. curly braces).

On a related question, is it true that returning variables from the model will decrease performance? In my original model I was not returning any value; the last statement in the function body was a for loop. In your modified version you return the parameters with return (; θₒ, θᵦ, θₛ, θᵣ, I). I am running the sampler with this modified version and it is apparently much slower. Or perhaps this has to do with the fact that I am now sampling the prior with missing values for the I variable?

It could potentially be, yes, but it’s difficult to know ahead of time.

BUT it is for sure going to be unnecessarily slow when you write out the for-loop like that. So what I would suggest instead is to either:

  1. Implement a vectorized version of the BernoulliLogit likelihood (see the sketch after this list).
  2. Use Turing.@addlogprob! to add the log-likelihood directly to the logp accumulation in Turing. This way Turing won’t ever see I; it will just go “oh, something to accumulate”.
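For (1), here is a minimal sketch of what I mean. I'm using arraydist (exported by Turing.jl), which wraps an array of univariate distributions into a single distribution; to stay on the safe side I flatten everything to vectors with vec first:

@model function logistic_arraydist(δᵦ, δₛ, δᵣ, y)
    θₒ ~ Normal(0.0, 1.0)
    θᵦ ~ Normal(0.0, 1.0)
    θₛ ~ Normal(0.0, 1.0)
    θᵣ ~ Normal(0.0, 1.0)

    z = @. θₒ + θᵦ*log(δᵦ + 1) + θₛ*log(δₛ + 1) + θᵣ*log(δᵣ + 1)

    # One "observe" statement for all data points at once.
    y ~ arraydist(BernoulliLogit.(z))
end

m_vec = logistic_arraydist(vec(Δᵦ), vec(Δₛ), vec(Δᵣ), vec(I))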

For (2) I mean something like:

using Turing
import Turing: DynamicPPL, StatsFuns

# Log-pmf of a Bernoulli with success probability `p`, evaluated at `x ∈ {0, 1}`.
bernoullilogpdf(p, x) = x * log(p) + (1 - x) * log(1 - p)

@model function logistic_vectorized(Δᵦ, Δₛ, Δᵣ, I)
    # Our prior belief
    θₒ ~ Normal(0.0, 1.0)
    θᵦ ~ Normal(0.0, 1.0)
    θₛ ~ Normal(0.0, 1.0)
    θᵣ ~ Normal(0.0, 1.0)

    # Data conditioning: compute all the logits in one broadcasted expression
    # instead of looping over every element.
    z = @. (
        θₒ +
        θᵦ*log(Δᵦ + 1) +
        θₛ*log(Δₛ + 1) +
        θᵣ*log(Δᵣ + 1)
    )

    DynamicPPL.@addlogprob! sum(bernoullilogpdf.(StatsFuns.logistic.(z), I))
    return (; θₒ, θᵦ, θₛ, θᵣ)
end

This is also likely going to be the fastest approach.
It took ~8 minutes to obtain 1000 samples on my computer for a (100, 100, 100) tensor; in contrast, the “naive” version with the for-loop is still compiling :sweat_smile: Note, though, that you can’t sample I if it’s set to missing, since Turing.jl now doesn’t even “know” that I is an observation: it just assumes that you know what you’re doing with @addlogprob!.
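If you do need prior-predictive draws in that setup, you’d have to simulate them by hand outside the model, e.g. something like this (sketch, using plain Distributions):

using Distributions
import Turing: StatsFuns

# Draw parameters from the prior and simulate `I` manually.
θₒ, θᵦ, θₛ, θᵣ = rand(Normal(0.0, 1.0), 4)
z = @. θₒ + θᵦ*log(Δᵦ + 1) + θₛ*log(Δₛ + 1) + θᵣ*log(Δᵣ + 1)
I_sim = rand.(Bernoulli.(StatsFuns.logistic.(z)))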

We do have the .~ which should cover these cases, i.e.

I .~ BernoulliLogit.(z)

but it seems like this currently only supports a Vector of distributions, not arbitrary arrays of distributions. This is surprising to me though, so I’ll try getting this fixed!

1 Like

Highly doubt this is because of the return-values; that shouldn’t require any more allocations since I isn’t mutated.

But yeah, sampling missing will slow things down because we have to “check” whether I[x, y, t] is missing at every ~, since it’s an array with element type Union{Missing, Int}. This is why I did the map(identity, I) to convert it into an Array{Int} before doing the actual sampling.
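For example, a tiny sketch of that eltype-narrowing trick:

I_wide = Union{Missing, Int}[0, 1, 1, 0]  # element type of `m_prior().I`
I_narrow = map(identity, I_wide)

eltype(I_wide)    # Union{Missing, Int64}
eltype(I_narrow)  # Int64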

Yeah, so that’s what I did too, but with kappa = 1. I would guess that if you have a large variance on your prior + kappa is very large, then this could cause issues.

I did normalize the features with \kappa so at the end all \Delta \in [0,1].

Thank you @torfjelde for the very useful code snippets. The vectorized form will certainly be very useful in this baseline. I am also looking forward to the I .~ BernoulliLogit.(z) syntax; that would be great.

I plan to generalize this model, and perhaps the vectorized form won’t be possible. I understand that I can still provide a manual logprob term as you did, and performance-wise we will be somewhat OK, right?

I will keep investigating this issue. Thank you very much for the amazing support.

1 Like

I think you could also probably get away with a smaller scale on the priors. A standard deviation of 5.0 still allocates a lot of mass to extreme cases due to the logit. Setting it to something like 2.0 might also help sampling a bit while still being reasonable.
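To make that concrete, here is a rough sketch of what a Normal(0, 5) prior implies for the success probability after the logistic transform (the numbers are approximate):

using Distributions, StatsFuns

z = rand(Normal(0.0, 5.0), 100_000)
p = logistic.(z)

# With sd = 5, roughly a third of the prior mass lands on near-saturated
# probabilities (p < 0.01 or p > 0.99); with sd = 2 it drops to a few percent.
mean((p .< 0.01) .| (p .> 0.99))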

1 Like

I plan to generalize this model, and perhaps the vectorized form won’t be possible. I understand that I can still provide a manual logprob term as you did, and performance-wise we will be somewhat OK, right?

An alternative to vectorizing is multithreading.
In Turing.jl, you’re welcome to use things like Threads.@threads inside your model. I just gave your model a try using 6 threads and the following definition:

@model function logistic_threaded(Δᵦ, Δₛ, Δᵣ, I)
    # Our prior belief
    θₒ ~ Normal(0.0, 1.0)
    θᵦ ~ Normal(0.0, 1.0)
    θₛ ~ Normal(0.0, 1.0)
    θᵣ ~ Normal(0.0, 1.0)

    # Data conditioning
    # `@threads` doesn't support nested for-loops, so we use `CartesianIndices`
    # to linearize the loop instead. Could alternatively use `@spawn`.
    Threads.@threads for xyt in CartesianIndices(I)
        z = θₒ + θᵦ*log(Δᵦ[xyt] + 1) +
                 θₛ*log(Δₛ[xyt] + 1) +
                 θᵣ*log(Δᵣ[xyt] + 1)

        I[xyt] ~ BernoulliLogit(z)
    end

    return (; θₒ, θᵦ, θₛ, θᵣ)
end

ProgressMeter.jl is telling me that this will take a couple of hours, though, compared to a couple of minutes for the previous implementation. But this can be very useful in cases where you either have access to many cores or you really need to write out the for-loop. WARNING: Threads.@threads is only allowed for observations, not sampling (since those statements are inherently mutating). So you can’t use the above model to sample from the prior (unless you are using 1 thread). Nor does threading work with reverse-mode AD frameworks, e.g. Zygote.jl and ReverseDiff.jl.

The reason for this difference in performance is likely AD :confused: For-loops are generally quite difficult to AD through, while most AD frameworks have very fast implementations of vectorized operations, e.g. ForwardDiff.jl (which Turing.jl uses by default).
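If you want to experiment with the AD backend, it can be switched globally; depending on your Turing.jl version, something like:

# ForwardDiff.jl is the default; ReverseDiff.jl can pay off for models with
# many parameters and a vectorized likelihood.
Turing.setadbackend(:forwarddiff)
# Turing.setadbackend(:reversediff)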

@torfjelde just passing by to thank you again for the incredible help. I managed to scale the model using your suggestion with the DynamicPPL.@addlogprob! trick.

I will try NUTS now to figure out good leapfrog parameters for HMC. Is it correct to say that NUTS is the “most advanced” sampler when it comes to self-tuning its parameters?

2 Likes

Awesome; glad to hear! :slight_smile:

And yeah, NUTS is the go-to for self-tuning parameters :+1:
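If you do want to carry what NUTS finds over to plain HMC, a rough sketch (the adapted step size is stored in the chain internals shown earlier, e.g. step_size):

chain = sample(m, NUTS(), 1_000)

# Typical post-adaptation step size; can be reused as the leapfrog step size.
ϵ = mean(chain[:step_size])
hmc_chain = sample(m, HMC(ϵ, 10), 1_000)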

1 Like

Is manually adding the logprob generally the fastest method? I have a model with around 300 parameters, including 8 hyperparameters that all depend on each other, and 20,000 data points; it takes a couple of days to get 5,000 samples with NUTS. I’m happy to lose some of the elegance of the standard Turing syntax for a speedup at this point.

It can really help if the likelihood is represented as an array of univariate distributions. E.g. instead of doing something like x[i] ~ Normal(mu[i], sigma[i]) in a loop, it will be much faster to do @addlogprob! sum(map(StatsFuns.normlogpdf, mu, sigma, x)), or even to use something like Tullio.jl for the reductions.
This is because Turing.jl can’t automatically “unroll” your for-loop and vectorize the operation for you. By using @addlogprob! directly, you can do that yourself :slight_smile:
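As a rough sketch of what that looks like in practice (the model, names, and priors here are made up; the @addlogprob! line is the point):

using Turing
import Turing: StatsFuns

@model function fast_normal(x)
    m ~ Normal(0, 1)
    s ~ truncated(Normal(0, 1), 0, Inf)

    mu = fill(m, length(x))
    sigma = fill(s, length(x))

    # Instead of: for i in eachindex(x); x[i] ~ Normal(mu[i], sigma[i]); end
    Turing.@addlogprob! sum(map(StatsFuns.normlogpdf, mu, sigma, x))
end

chain = sample(fast_normal(randn(1_000)), NUTS(), 500)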

1 Like

Brilliant! Is it possible to use LoopVectorization.jl in this context? I’ve tried, but to no avail so far.