I’m working with OnlineStats to learn a classification model. I have a sparse array of TF-IDF features and a binary target variable, and I have figured out how to train my model. The API documentation doesn’t explain how to use the regularization functions.
Specifically, I need to know how to adjust the parameter lambda.
I also need to know whether a bias term is added automatically or whether I need to add it myself. So far I have the following:
using OnlineStats
o = fit!(StatLearn(length(feature_array), SGD(), L1Penalty()), (train_tfidf, train_y))
StatLearn: SGD | mean(λ)=0.0 | 0.5 * (L2DistLoss) | L1Penalty | nobs=7782 | nvars=2446
I can gather predictions using:
test_y_pred = predict(o, test_tfidf)
1945-element Array{Float64,1}:
0.12856579087723008
-0.013671349299302107
0.13942378280298387
⋮
I can classify them using:
test_y_pred = classify(o, test_tfidf)
1945-element Array{Float64,1}:
1.0
-1.0
1.0
⋮
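For checking predictions like these against held-out labels, here is a minimal sketch in plain Julia. The helper name holdout_accuracy is hypothetical (it is not part of OnlineStats), and it assumes the labels are coded as 1.0 and -1.0, as the classify output above suggests. With 0/1 labels the threshold would need to change.

```julia
# Hypothetical helper, not an OnlineStats function.
# Assumes labels are coded as +1.0 / -1.0 and that a score >= 0 means class +1.
function holdout_accuracy(scores::AbstractVector{<:Real}, labels::AbstractVector{<:Real})
    length(scores) == length(labels) ||
        throw(DimensionMismatch("scores and labels differ in length"))
    # Turn real-valued scores into +1 / -1 class labels.
    predicted = ifelse.(scores .>= 0, 1.0, -1.0)
    # Fraction of held-out observations where prediction and label agree.
    return count(predicted .== labels) / length(labels)
end

# Toy values for illustration only (not results from the model above).
scores = [0.1286, -0.0137, 0.1394, -0.2]
labels = [1.0, -1.0, -1.0, -1.0]
acc = holdout_accuracy(scores, labels)
```

In practice, scores would be the vector returned by predict on the test rows and labels the matching test targets.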
Also, am I approaching this right? I am trying to learn the available interfaces for machine learning in Julia and am looking for a starting point. The sklearn port looks interesting, but I want to keep my work distributable with JuliaDB tables.
Specifically, I need to know how to adjust the parameter lambda.
You can provide a vector of lambdas (parameter-wise penalties), something like
StatLearn(p, .1 * ones(p))
or a single lambda to apply to each parameter:
StatLearn(p, .1)
I also need to know if a bias term is added automatically or if I need to do this
There is no bias term added automatically. You can take a look at BiasVec, which adds it lazily:
julia> BiasVec(rand(5))
6-element BiasVec{Float64,Array{Float64,1}}:
0.47513715528898093
0.45808733943617064
0.5337189993055129
0.6613951516035794
0.636024656190582
1.0
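To make "lazily" concrete, here is a rough sketch of how such a wrapper could work. LazyBias is a hypothetical type written for illustration and is not the actual BiasVec source. The idea is that no new array is allocated: the wrapper reports one extra element and returns 1 for it.

```julia
# Hypothetical illustration only; NOT the OnlineStats BiasVec implementation.
struct LazyBias{T, V<:AbstractVector{T}} <: AbstractVector{T}
    x::V   # the wrapped data, never copied
end

# Explicit outer constructor so that LazyBias(x) infers both type parameters.
LazyBias(x::AbstractVector{T}) where {T} = LazyBias{T, typeof(x)}(x)

# The wrapper looks one element longer than the data it wraps.
Base.size(v::LazyBias) = (length(v.x) + 1,)
Base.IndexStyle(::Type{<:LazyBias}) = IndexLinear()

# Indices 1..length(x) read through to x; the extra last index returns one(T).
function Base.getindex(v::LazyBias{T}, i::Int) where {T}
    @boundscheck checkbounds(v, i)
    return i <= length(v.x) ? v.x[i] : one(T)
end

x = [0.5, 2.0, -1.0]
b = LazyBias(x)

# A linear predictor then picks up the intercept as the last coefficient.
β = [1.0, 1.0, 1.0, 10.0]
η = sum(b .* β)
```

Because the wrapper behaves like an ordinary vector, the coefficient vector needs p + 1 entries, which is why the model further down is built with p + 1.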
Bringing it all together, if you have p predictors, you’d want something like
n, p = 10^6, 10
x = randn(n, p)
y = randn(n)
o = StatLearn(p + 1, L2Penalty(), vcat(.1 * ones(p), 0)) # avoid penalizing bias/intercept
fit!(o, zip((BiasVec(xi) for xi in eachrow(x)), y))
Oh great! This is awesome and exactly the kind of explanation I was looking for. Any ideas on how I might test on held-out data? I have tried the following:
fit!(o, zip((BiasVec(xi) for xi in OnlineStats.eachrow(train_tfidf)), train_y));
StatLearn: SGD | mean(λ)=0.09995913363302009 | 0.5 * (L2DistLoss) | L1Penalty | nobs=7782 | nvars=2447
EDIT:
I managed to figure out how to predict. I’m not sure why, but all observations have the same predicted probability:
[predict(o, r) for r in BiasVec.(OnlineStats.eachrow(test_tfidf))]
1945-element Array{Float64,1}:
0.034086370434273434
0.034086370434273434
0.034086370434273434
⋮
After fitting, all coefficients except the intercept have the same value:
coef(o)
2447-element Array{Float64,1}:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
⋮
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.11715854311395045
Hmm, my first guess is that your lambda is set high enough that the lasso penalty is setting everything to zero.
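To illustrate that guess: many stochastic lasso solvers apply a soft-thresholding step, which shrinks each coefficient toward zero by lambda and clips it at zero. I'm assuming, not asserting, that StatLearn does something comparable internally. The sketch below is plain Julia with a hypothetical soft_threshold function and made-up coefficient values.

```julia
# Soft-thresholding: shrink β toward zero by λ and clip at zero.
# Hypothetical standalone function, not taken from OnlineStats.
soft_threshold(β::Real, λ::Real) = sign(β) * max(abs(β) - λ, zero(β))

# Made-up coefficients, small in magnitude as is typical with TF-IDF features.
β = [0.03, -0.08, 0.5, -0.002]

small = soft_threshold.(β, 0.01)   # only the tiniest coefficient is zeroed
large = soft_threshold.(β, 0.1)    # everything with |β| <= 0.1 is zeroed

n_zero_small = count(iszero, small)
n_zero_large = count(iszero, large)
```

If most coefficients would naturally be smaller than lambda in magnitude, they all end up at exactly zero, so trying a much smaller lambda (say 0.001) is a cheap first check.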
Unfortunately, I think I was using the wrong syntax for predicting value outputs… I am still seeing only zeros for the coefficients, but at least I am getting different predictions for each tfidf row:
[predict(o, BiasVec(r)) for r in OnlineStats.eachrow(test_tfidf)]
1945-element Array{Float64,1}:
0.5037992943250542
0.3412787518465366
0.479159463786335
0.3412787518465366
0.4102191078164358
0.3872389891598027
⋮
It could also be that the model is largely uninformative. Closing this, as you have answered my question.