Unexpected Behavior in LogisticClassifier MLJLinearModels

I am experiencing some odd behavior from the LogisticClassifier in MLJLinearModels.jl that maybe somebody can help me understand. This MWE is not the research problem I am working on (that one is hard to boil down), but it shows behavior related to what I am seeing in my problem, and it confuses me.

using DataFrames, MLJ, Distributions, Plots

pos_samples = [[1 + rand(Normal(0, 0.1)), 1] for _ in 1:1000]
neg_samples = [[-1 + rand(Normal(0, 0.1)), 0] for _ in 1:1000]
samples = vcat(pos_samples, neg_samples)

data = DataFrame(mapreduce(permutedims, vcat, samples), [:sample, :class])
display(scatter(data.sample, data.class, legend = :none))

X_train = data[!, [:sample]]
y_train = coerce(data.class, Multiclass)

LC = @load LogisticClassifier pkg = MLJLinearModels
model = machine(LC(), X_train, y_train)
fit!(model)
println("Fitted Model.")

test_vals = LinRange(-2, 2, 101)
positive_prob = [p.prob_given_ref[2] for p in MLJ.predict(model, test_vals[:, :])]
plot(test_vals, positive_prob)

In this code, I generate noisy data points centered near +1 and near -1. Those centered near +1 are labeled as the “positive” class (label 1), and those centered near -1 as the “negative” class (label 0).

I use these points to train a logistic classifier and then generate a completely synthetic test set (uniformly spaced points between -2 and +2), expecting the probability of the positive class to be very close to zero for points to the left of zero and close to one for points to the right of zero. However, this is not what I find: the classifier is at best about “70% certain” of the class one way or the other.

Below is a visualization of my training samples and my plot of P(class 1) by test sample.
[figure: scatter of training samples]
[figure: P(class 1) vs. test sample]

I’d expect the probability curve to look more like a sigmoid function after the model has been trained.

Can somebody tell me if I’m just being silly and missing something, or if this kind of behavior is actually unexpected?

EDIT: I forgot to mention that in my original problem, simply switching from the LogisticClassifier to the DecisionTreeClassifier available through the DecisionTree package makes everything work just fine. The same holds for this simplified MWE: making that change produces an indicator-like probability curve that jumps from 0 to 1 near zero, which is exactly the behavior I’m trying to get from logistic regression.
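
For reference, the swap is just the following (a sketch reusing X_train, y_train, and test_vals from the MWE above; it assumes the MLJ interface to DecisionTree.jl is installed):

DTC = @load DecisionTreeClassifier pkg = DecisionTree
tree = machine(DTC(), X_train, y_train)
fit!(tree)

# predict on a single-column table matching the training schema
tree_prob = [p.prob_given_ref[2] for p in MLJ.predict(tree, DataFrame(sample = collect(test_vals)))]
plot(test_vals, tree_prob)  # jumps from ~0 to ~1 near zero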

That does look odd. @tlienart are you able to comment on this?

@liamfdoherty Have you tried, for comparison, the LogisticClassifier provided by scikit-learn? You can load it in MLJ with

@iload LogisticClassifier pkg=ScikitLearn
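
A minimal sketch of the comparison, reusing X_train, y_train, and test_vals from the MWE above (note that @iload will prompt to install the interface package if it is missing):

SKLC = @iload LogisticClassifier pkg=ScikitLearn
sk = machine(SKLC(), X_train, y_train)
fit!(sk)

# predict on a table with the same column name as the training data
sk_prob = [p.prob_given_ref[2] for p in MLJ.predict(sk, DataFrame(sample = collect(test_vals)))]
plot(test_vals, sk_prob)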

The training data is separable, meaning that x completely separates the two distinct y values. Consequently there is no unique solution: there are infinitely many points on the x axis (more generally, hyperplanes in x space) that separate the two classes.
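
You can check this directly on the generated data (a quick sketch using the data frame from the MWE; with noise of standard deviation 0.1 the two clusters will essentially never overlap):

# all class-0 samples lie strictly to the left of all class-1 samples
maximum(data.sample[data.class .== 0]) < minimum(data.sample[data.class .== 1])  # expected: true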

Different packages handle this in different ways. For example, GLM.jl:

using DataFrames, GLM

pos_samples = [[1 + 0.1*randn(), 1] for _ in 1:1000];
neg_samples = [[-1 + 0.1*randn(), 0] for _ in 1:1000];
samples     = vcat(pos_samples, neg_samples);
data        = DataFrame(mapreduce(permutedims, vcat, samples), [:sample, :class])

model = glm(@formula(class ~ 1 + sample), data, Bernoulli(), LogitLink());
coeftable(model)  # Large coefficient, large Std Errors
loglikelihood(model)

testvals = DataFrame(sample=LinRange(-2, 2, 101));
predict(model, testvals)

And MultinomialRegression.jl (which has different default settings from GLM.jl):

using DataFrames, MultinomialRegression

pos_samples = [[1 + 0.1*randn(), 1] for _ in 1:1000];
neg_samples = [[-1 + 0.1*randn(), 0] for _ in 1:1000];
samples     = vcat(pos_samples, neg_samples);
data        = DataFrame(mapreduce(permutedims, vcat, samples), [:sample, :class])

model = fit(@formula(class ~ 1 + sample), data);
coeftable(model)  # Large coefficient, large Std Errors
loglikelihood(model)

testvals = [[1, x] for x in LinRange(-2, 2, 101)];
predict.(Ref(model), testvals)

@ablaom Making that change gives me exactly what I would expect:
[figure: P(class 1) vs. test sample with DecisionTreeClassifier]

@jocklawrie You make a good point. I was expecting that, under the hood, MLJLinearModels would be doing something along the lines of maximum likelihood estimation, which I thought would give a result like the one above: yes, there are infinitely many separating “planes” for the data, but I assumed only some choices of the parameters would actually maximize the likelihood function. But if under the hood all that’s being done is something like gradient descent, I can see how there might be nothing left to optimize if the initial parameters already separated the data perfectly.
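
To make that point concrete, here is a small illustration of my own (plain Julia, not MLJLinearModels internals): with perfectly separated data, the unpenalized log-likelihood keeps increasing as the slope grows, so plain maximum likelihood would not pin down a finite coefficient here either.

# Unpenalized logistic log-likelihood on separable 1-D data: it keeps
# improving as the slope b grows, so there is no finite maximizer.
x = vcat(1 .+ 0.1 .* randn(1000), -1 .+ 0.1 .* randn(1000))
y = vcat(ones(1000), zeros(1000))

# log p(y | z) = y*z - log(1 + exp(z)), a numerically stable form of
# y*log(sigmoid(z)) + (1 - y)*log(1 - sigmoid(z))
loglik(b) = sum(y .* (b .* x) .- log.(1 .+ exp.(b .* x)))

for b in (1.0, 10.0, 100.0)
    println("slope = $b, log-likelihood = $(loglik(b))")
end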

Maybe this could explain the difference in behavior that I’m seeing? Again, DecisionTreeClassifier from DecisionTree works perfectly and does exactly what I want, but I saw this peculiarity with the LogisticClassifier and thought I would raise the flag here.

Appreciate it. I agree this looks off and have raised an issue.


It does actually look like a sigmoid if you plot it over a wider range of input values.
The default regularization constant of the LogisticClassifier is pretty large.
If you change it to a smaller value, e.g. LC(lambda = 1e-6), you get the result you expected.
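
With the MWE above, that change is just the following (a sketch reusing the existing names):

model = machine(LC(lambda = 1e-6), X_train, y_train)
fit!(model)
positive_prob = [p.prob_given_ref[2] for p in MLJ.predict(model, test_vals[:, :])]
plot(test_vals, positive_prob)  # should now look like a sharp sigmoid around zero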


Yes, as @jbrea points out, what you are seeing in your plot is an over-regularized model. The default value of the regularization parameter lambda is higher for the MLJLinearModels classifier than for the scikit-learn one.

This is likely to be a common gotcha, so the package author has changed the default value of lambda in MLJLinearModels from 1.0 to eps() in a new release. @liamfdoherty You should now see the “expected” behaviour.

Thanks for reporting.

Thanks to @jbrea for spotting the problem.


Yes, this was the issue! Thanks, @jbrea! Thanks also to @ablaom for opening the issue and to the package author for the new release. All works as expected now!
