Anyone developing multinomial logistic regression?

Hi all,

I’m looking to write a multinomial logistic regression function (including conditional logit). From what I can see, there is no extant package that can do this. Given the popularity of such a model, I’m surprised that it doesn’t exist yet in Julia, but would also be interested in collaborating. A good model for such a package would be the R mlogit package.

@Nosferican @mcreel

Not sure what mlogit is but I have one in my package, Haven’t fully tested it yet I’d love feedback

I also have classification measures for multiclass classifications, etc.

If there’s anything missing you’d like to see let me know and I probably have the wherewithall to add it.

Edit - Ah sorry I have a multinomial softmax regression. Sorry thats a different bird.

I will be registering my package in the next few days which has ologit and mlogit. The implementation for mlogit is similar to Stata’s mlogit rather than the R package. The R package allows for outcome specific features (e.g., price of y0, price of y1, etc) while Stata’s and mine only handle features for the unit of observation (e.g., age, education, etc.).


Awesome! That’s definitely a great help. I am basically looking for the equivalent of Stata’s asclogit command but I suspect that I can implement that using your package as a starting point.

I have implemented this in Matlab but will be curious to see how you handle the optimization.

I think softmax regression is the same as multinomial logistic regression. This package is really cool and I will take a look! Thanks for responding.

For nominal response models, it would be nice to add the generalization for conditional multinomial logistic (PR welcome). For ordinal logistic regression, I implemented the proportional odds logistic regression (POLR), but the generalization would also be welcomed as a PR. I am waiting for StatsModels to tag a new release for the Terms 2.0 era and for me to finish up the last final touches.

In terms of the optimization approach, (1) the random effects (Swamy-Arora) is fitted through a partial demeaning framework, (2) absorption of categorical fixed effects is done trough the method of alternating projections, (3) multinomial logistic regression is optimized through Fisher Scoring in a Vector Generalized Linear Model (VGLM) framework, and (4) ordinal logistic regression uses Optim.jl with a Newtonian with AD for the Hessian because I lost so much time trying to write up the analytical Hessian and failed.


Awesome, thanks! My Matlab implementation of Stata’s asclogit command is basically the equivalent of calling Optim.jl. I suspect Fisher scoring is faster, but I’ve never programmed that up before.

I had a peek at your source code this morning and will be in touch with a possible PR in the coming weeks.

1 Like

It might still need a tweak or two (e.g., StatsModels changes in master)… I want to get the beta test out in the coming weeks and then start working on optimizing the code and more robustness checks. That was a chapter in my defense (two weeks ago) so probably would be good to release it soon and have it stable by JuliaCon.


Can you provide the link to your package?


For those interested in regularised regression models including multinomial regressions, I’ve just released MLJLinearModels.jl which will soon be integrated with the MLJ environment; it’s competitive with Sklearn and similar R packages though rough around the edge and without docs (which is why it hasn’t been officially announced yet)


@tlienart - When this matures I’ll likely interface my package with it. Looks like a great start.

I am trying to use your package, but I really have no clue how!
Is there any documentation or example I can follow?


Yeah this is my fault really… there’s a docs branch but since starting a new job I’m struggling a bit to find the time to push it through

The tests provide quite good examples including comparisons with sklearn:

Long story short if you have a n x p matrix X and targets y encoded as 1,2,3,… (indices of the class) then you can just do the following:

mnr = MultinomialRegression(0.5) # 0.5 is the strength of the L2 regularisation
theta = fit(mnr, X, y)

not very dissimilar to how sklearn is called (and actually you can see the comparison code

To do the predictions you can apply X to theta then use a softmax:

preds   = apply_X(Xmatrix, θ, c) # c is the number of classes
preds .= softmax(preds)

for binary, same process but with ±1 labels and using sigmoid instead of softmax.

If you go the MLJ route, you can use the fit/predict machinery, see this example


Ok, Cool.
Indeed I looked at the tests, but still I was quite lost!

And how the matrix theta is structured?

X = rand(100,3)
y = rand([1,2,3],100)
λ = 0.5
mnr = MultinomialRegression(λ; fit_intercept=true)
θ  = fit(mnr, X, y)
X = rand(100,3)
y = rand([1,2,3],100)

parameters = reshape(θ,(3,4))

where parameters[1, 1:3] are the coefficient of the first class and parameters[1, 4] is the intercept?
Maybe I am dumb and not understanding how Logit works, but I am referring to this:

So I think the way the prediction work should help you there, basically (assuming no intercept first) what it does is:

  1. apply_X, this basically means (X * theta) = z
  2. apply a softmax on z

If you check the dimensions you will see that X is (n x p ) where n is the number of records, p is the dimensionality (number of features), and theta effectively corresponds to a (p x c) matrix where c is the number of classes (it’s stored as a single vector though, which is why you see this reshape; each column is one after the other)

Now if s = softmax z, it’s n x c (like z) with the rows summing to 1, basically each row contains the score for each class. For instance for 3 classes you would get predictions with rows like

[0.2, 0.5, 0.3]

meaning that the second class is the “most likely” for that point (most highly scored according to the model)

I hope this clarifies it a bit!

Edit: more specifically, in the no intercept case, each class gets p parameters, these are stacked one after the other in theta so 1:p then p+1:2p etc and theta has total length p*c (change p for p+1 for the case with intercept) if you want to compare it with what sklearn gives, you indeed have to do reshape(theta, (p, c)) or reshape(theta, (p+1, c)) (with intercept)

Edit2: code for apply might be useful too:

1 Like

Thank you!
The Mnist database got classified pretty well :slight_smile:

There was just something ‘odd’

I downloaded MNIST from

using Flux, Flux.Data.MNIST, Statistics
imgs = MNIST.images()
X = Array(transpose(hcat(float.(reshape.(imgs, :))...) ))
labels = MNIST.labels() .+1

The labels go from 0 to 9, but I had to translate them to [1,10] otherwise I got this error.
Looking at the number I had the intuition it was scoring over 1:9, like the class with 0 was ‘not welcome’

For the rest it worked smoothly,

Thanks a lot

DimensionMismatch("new dimensions (785, 10) must be consistent with array size 7056")
(::Base.var"#throw_dmrsa#197")(::Tuple{Int64,Int64}, ::Int64) at reshapedarray.jl:41
reshape at reshapedarray.jl:45 [inlined]
reshape at reshapedarray.jl:116 [inlined]
apply_X!(::Array{Float64,2}, ::Array{Float64,2}, ::Array{Float64,1}, ::Int64) at utils.jl:66
(::MLJLinearModels.var"#102#103"{GeneralizedLinearRegression{MultinomialLoss,ScaledPenalty{LPPenalty{2}}},Array{Float64,2},Array{Int64,1},Int64,Int64,Int64,Float64})(::Float64, ::Array{Float64,1}, ::Array{Float64,1}) at d_logistic.jl:149
(::NLSolversBase.var"#61#62"{NLSolversBase.InplaceObjective{Nothing,MLJLinearModels.var"#102#103"{GeneralizedLinearRegression{MultinomialLoss,ScaledPenalty{LPPenalty{2}}},Array{Float64,2},Array{Int64,1},Int64,Int64,Int64,Float64},Nothing,Nothing,Nothing},Float64})(::Array{Float64,1}, ::Array{Float64,1}) at incomplete.jl:45
value_gradient!!(::NLSolversBase.OnceDifferentiable{Float64,Array{Float64,1},Array{Float64,1}}, ::Array{Float64,1}) at interface.jl:82
initial_state(::Optim.LBFGS{Nothing,LineSearches.InitialStatic{Float64},LineSearches.HagerZhang{Float64,Base.RefValue{Bool}},Optim.var"#19#21"}, ::Optim.Options{Float64,Nothing}, ::NLSolversBase.OnceDifferentiable{Float64,Array{Float64,1},Array{Float64,1}}, ::Array{Float64,1}) at l_bfgs.jl:158
optimize(::NLSolversBase.OnceDifferentiable{Float64,Array{Float64,1},Array{Float64,1}}, ::Array{Float64,1}, ::Optim.LBFGS{Nothing,LineSearches.InitialStatic{Float64},LineSearches.HagerZhang{Float64,Base.RefValue{Bool}},Optim.var"#19#21"}, ::Optim.Options{Float64,Nothing}) at optimize.jl:33
#optimize#93 at interface.jl:116 [inlined]
optimize(::NLSolversBase.InplaceObjective{Nothing,MLJLinearModels.var"#102#103"{GeneralizedLinearRegression{MultinomialLoss,ScaledPenalty{LPPenalty{2}}},Array{Float64,2},Array{Int64,1},Int64,Int64,Int64,Float64},Nothing,Nothing,Nothing}, ::Array{Float64,1}, ::Optim.LBFGS{Nothing,LineSearches.InitialStatic{Float64},LineSearches.HagerZhang{Float64,Base.RefValue{Bool}},Optim.var"#19#21"}, ::Optim.Options{Float64,Nothing}) at interface.jl:115
optimize(::NLSolversBase.InplaceObjective{Nothing,MLJLinearModels.var"#102#103"{GeneralizedLinearRegression{MultinomialLoss,ScaledPenalty{LPPenalty{2}}},Array{Float64,2},Array{Int64,1},Int64,Int64,Int64,Float64},Nothing,Nothing,Nothing}, ::Array{Float64,1}, ::Optim.LBFGS{Nothing,LineSearches.InitialStatic{Float64},LineSearches.HagerZhang{Float64,Base.RefValue{Bool}},Optim.var"#19#21"}) at interface.jl:115
_fit(::GeneralizedLinearRegression{MultinomialLoss,ScaledPenalty{LPPenalty{2}}}, ::LBFGS, ::Array{Float64,2}, ::Array{Int64,1}) at newton.jl:114
#fit#144(::LBFGS, ::typeof(fit), ::GeneralizedLinearRegression{MultinomialLoss,ScaledPenalty{LPPenalty{2}}}, ::Array{Float64,2}, ::Array{Int64,1}) at default.jl:48
fit(::GeneralizedLinearRegression{MultinomialLoss,ScaledPenalty{LPPenalty{2}}}, ::Array{Float64,2}, ::Array{Int64,1}) at default.jl:38
top-level scope at test_LR.jl:159

1 Like

Yes that’s actually a requirements to have the encoding to be 1…c (partly related to if you have a categorical vector and want to encode it to integer)

Cool thanks for reporting this! Would you mind opening an issue with your reproducing code? I might add it as an example in these much needed docs :sweat_smile:

1 Like

Here you go.

I would say that this topic got solved :slight_smile:
Thanks a lot for the help

The example code:

# Classify MNIST digits with a simple multi-layer-perceptron
using Flux.Data.MNIST
using MLJLinearModels
# Get MNIST dataset and transpose for (records, features)
imgs = MNIST.images()
X = Array(transpose(hcat(float.(reshape.(imgs, :))...) )
# MNIST labels: Categorical labels must be 1...c, hence add .+1 to each label
labels = MNIST.labels() .+1
# and the number of classes
n_classes = length(Set(labels))
n_features = size(X,2)
# The MNIST database does not need the intercept
intercept = false
# deploy MultinomialRegression from MLJLinearModels, λ being the strenght of the reguliser
mnr = MultinomialRegression(λ; fit_intercept=intercept)
# Fit the model
θ  = fit(mnr, X, labels)
# The model parameters are organized such we can apply X⋅θ, the following is only to clarify
params = reshape(θ, n_features +Int(intercept), n_classes)
# Get the predictions X⋅θ 
preds = MLJLinearModels.softmax(MLJLinearModels.apply_X(X,θ,n_classes))
# map each vector to its maximal element 
targets = map(x->argmax(x),eachrow(preds))
#and evaluate the model over the labels
scores = 1- sum(targets-labels)/length(preds)