Logistic regression for data with missing values

Hello, I tried to adapt logistic regression for my case:

@model function logistic_regression(x, y, n, σ)
    intercept ~ Normal(0, σ)

    Gabel ~ Normal(0, σ)
    SUVmax_binned ~ Normal(0, σ)
    MTV_binned ~ Normal(0, σ)

    AHT ~ Normal(0, σ)
    OS_Mx ~ Normal(0, σ)
    Infiltration ~ Normal(0, σ)

    number_operations ~ Normal(0, σ)
    KLIN_P_T ~ Normal(0, σ)
    KLIN_P_M ~ Normal(0, σ)



    for i in 1:n
        v = logistic(intercept .+ Gabel * x[i, 1] + SUVmax_binned * x[i, 2] + MTV_binned * x[i, 3] 
        + AHT * x[i, 4] + OS_Mx * x[i, 5] + Infiltration * x[i, 6] 
        + number_operations * x[i, 7] + KLIN_P_T * x[i, 8] + KLIN_P_M * x[i, 9])
        y[i] ~ Bernoulli(v)
    end
end;

Then I use it:

n, _ = size(train)

# Sample using NUTS.
m = logistic_regression(train, train_label, n, 1)

sum(train_label)

chain = sample(m, NUTS(), MCMCThreads(), 1_500, 3)

The training data contains missing values.

However, I get this error:

ERROR: MethodError: no method matching logistic(::Missing)

Closest candidates are:
  logistic(::SparseConnectivityTracer.HessianTracer)
   @ SparseConnectivityTracerLogExpFunctionsExt ~/.julia/packages/SparseConnectivityTracer/4CmIb/src/overloads/hessian_tracer.jl:67
  logistic(::SparseConnectivityTracer.GradientTracer)
   @ SparseConnectivityTracerLogExpFunctionsExt ~/.julia/packages/SparseConnectivityTracer/4CmIb/src/overloads/gradient_tracer.jl:29
  logistic(::Tracker.TrackedReal)
   @ Tracker ~/.julia/packages/Tracker/6rnwO/src/lib/real.jl:82
  ...

Stacktrace:
  [1] macro expansion
    @ ./REPL[22]:19 [inlined]
  [2] macro expansion
    @ ~/.julia/packages/DynamicPPL/Awq82/src/compiler.jl:579 [inlined]
  [3] logistic_regression(__model__::DynamicPPL.Model{…}, __varinfo__::DynamicPPL.UntypedVarInfo{…}, __context__::DynamicPPL.DebugUtils.DebugContext{…}, x::Matrix{…}, y::Vector{…}, n::Int64, σ::Int64)
    @ Main ./REPL[22]:18
  [4] _evaluate!!
    @ ~/.julia/packages/DynamicPPL/Awq82/src/model.jl:975 [inlined]
  [5] evaluate_threadunsafe!!
    @ ~/.julia/packages/DynamicPPL/Awq82/src/model.jl:948 [inlined]
  [6] check_model_and_trace(rng::TaskLocalRNG, model::DynamicPPL.Model{…}; varinfo::DynamicPPL.UntypedVarInfo{…}, context::DynamicPPL.SamplingContext{…}, error_on_failure::Bool, kwargs::@Kwargs{})
    @ DynamicPPL.DebugUtils ~/.julia/packages/DynamicPPL/Awq82/src/debug_utils.jl:599
  [7] check_model_and_trace
    @ ~/.julia/packages/DynamicPPL/Awq82/src/debug_utils.jl:582 [inlined]
  [8] #check_model_and_trace#8
    @ ~/.julia/packages/DynamicPPL/Awq82/src/debug_utils.jl:580 [inlined]
  [9] check_model_and_trace
    @ ~/.julia/packages/DynamicPPL/Awq82/src/debug_utils.jl:579 [inlined]
 [10] check_model
    @ ~/.julia/packages/DynamicPPL/Awq82/src/debug_utils.jl:625 [inlined]
 [11] _check_model
    @ ~/.julia/packages/Turing/bUZEC/src/mcmc/Inference.jl:280 [inlined]
 [12] _check_model
    @ ~/.julia/packages/Turing/bUZEC/src/mcmc/Inference.jl:283 [inlined]
 [13] #sample#6
    @ ~/.julia/packages/Turing/bUZEC/src/mcmc/Inference.jl:331 [inlined]
 [14] sample
    @ ~/.julia/packages/Turing/bUZEC/src/mcmc/Inference.jl:321 [inlined]
 [15] #sample#5
    @ ~/.julia/packages/Turing/bUZEC/src/mcmc/Inference.jl:316 [inlined]
 [16] sample(model::DynamicPPL.Model{…}, alg::NUTS{…}, ensemble::MCMCThreads, N::Int64, n_chains::Int64)
    @ Turing.Inference ~/.julia/packages/Turing/bUZEC/src/mcmc/Inference.jl:308
 [17] top-level scope
    @ REPL[27]:1

How can I modify the logistic regression model to support missing values in the data?

This is a methodological question, not really a Julia question. Setting aside the complications of the logistic function for a second, linear regression tries to solve a system of equations y = Xb for b by computing X \ y. This of course only works if X and y contain actual numbers, not missing values. Most statistical software will drop observations with missing values; here’s R for example:

> df <- data.frame(y = c(0, 1, 1, 0, 1, 0), x = c(0.5, 0.9, 0.8, NA, 0.82, 0.9))
> glm(y ~ x, family = binomial(link = 'logit'), data = df)

Call:  glm(formula = y ~ x, family = binomial(link = "logit"), data = df)

Coefficients:
(Intercept)            x  
     -5.262        7.227  

Degrees of Freedom: 4 Total (i.e. Null);  3 Residual
  (1 observation deleted due to missingness)
Null Deviance:	    6.73 
Residual Deviance: 5.615 	AIC: 9.615
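
The same complete-case filtering can be done by hand in Julia before the data is passed to the model. The following is only a minimal sketch using Base Julia; the toy matrix `x` and labels `y` are made up for illustration and stand in for `train` and `train_label` from the question.

```julia
# Hypothetical toy data: 6 observations, 2 covariates, some entries missing.
x = [0.5     1.0;
     0.9     missing;
     0.8     0.0;
     missing 1.0;
     0.82    0.0;
     0.9     1.0]
y = [0, 1, 1, 0, 1, 0]

# Mark the rows in which no covariate is missing.
complete = [!any(ismissing, row) for row in eachrow(x)]

# Keep only those rows and narrow the element type
# from Union{Missing, Float64} to Float64.
x_complete = Matrix{Float64}(x[complete, :])
y_complete = y[complete]
```

With the filtered data, the model from the question could then be built as `logistic_regression(x_complete, y_complete, size(x_complete, 1), 1)`, and `logistic` would never see a `missing` value.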

If you don’t want to drop observations with missing values, you will have to either drop the covariates that introduce the missingness or impute the missing values.
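
As a rough illustration of the imputation route, here is a sketch of the simplest possible scheme, column-mean imputation, using only the `Statistics` standard library. The helper name `impute_column_means` and the toy matrix are hypothetical, and mean imputation is just one option: it ignores the uncertainty in the imputed values and may not be sensible for binned or categorical covariates.

```julia
using Statistics  # standard library, provides `mean`

# Hypothetical helper: replace every missing entry with the mean of the
# observed values in the same column. Assumes each column has at least
# one observed value.
function impute_column_means(x::AbstractMatrix)
    filled = Matrix{Float64}(undef, size(x))
    for j in axes(x, 2)
        col = view(x, :, j)
        col_mean = mean(skipmissing(col))        # mean over non-missing entries
        filled[:, j] = coalesce.(col, col_mean)  # missing -> column mean
    end
    return filled
end

# Made-up example data with one missing value per column.
x = [0.5     1.0;
     0.9     missing;
     missing 0.0]
x_imputed = impute_column_means(x)
```

The result is a plain `Matrix{Float64}` with the same number of rows as the input, so no observations are lost, at the cost of treating the filled-in values as if they had been observed.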
