Why is this optimization problem so slow in Julia? In R it took 60s

it’s pretty awesome. Makes my code 60% faster!

I’m confused about the way you’ve formulated your problem. It doesn’t look like you have any actual parameters in your logistic function, for example. I haven’t run the code, but it’s really not clear what part of the data you’re actually using as actual data, as opposed to parameters in a model to fit. I would think that your objective should at the very least have one parameter and look something like this:

const DATA = ...
function negloglik(params)
    out = zero(eltype(params)) # for autodiff
    for row_j in eachrow(DATA)
        p1_m_p2 = row_j[1] - row_j[2]
        out += log(1/(1+exp(-params[1]*p1_m_p2)))
    end
    -out
end

or whatever. My reading of what your current code is doing is that you’re trying to recover (?) the data file from random inits, but not only is that problem ill-posed like @mschauer suggests, but I also wouldn’t really call it a model because it doesn’t really seem to have any “data” in the sense that people typically mean when describing a statistical model, i.e. something where you make some kind of parametric assumption about it and then try to optimize the induced likelihood under that assumed family of distributions or whatever.

Ah, I changed the data. I think the previous data might have had this issue; let me check. I’ve removed players with only wins or only losses, since their values can technically grow without a bound. That might be the issue.

There: the weights are the parameters. I suspect the issue is that you didn’t get my approach. I am just using the logistic (sigmoid) transform to turn a value into a probability.

Each player has a strength, which is encoded in the weights vector; for example, player 1’s strength is weights[1], player 2’s strength is weights[2], etc.

My dataframe contains player X vs player Y data, where X is the winner of the game.

So my data looks like this:

p1 p2
10 11
100 88

The above data means that player 10 played player 11 and won, and player 100 played player 88 and won.

I have about 5000 such games. I just want to estimate the strength of each player using the logit transform.
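A minimal sketch of that model in Julia (assuming the games are stored as (winner, loser) index pairs, matching the p1/p2 columns above; the `negloglik` name and toy data are illustrative, not from the original code):

```julia
# One strength per player; P(p1 beats p2) = sigmoid(weights[p1] - weights[p2]).
sigmoid(x) = inv(one(x) + exp(-x))

# games: vector of (winner, loser) index pairs, as in the p1/p2 columns
function negloglik(weights, games)
    out = zero(eltype(weights)) # zero of the parameter eltype, for autodiff
    for (winner, loser) in games
        out += log(sigmoid(weights[winner] - weights[loser]))
    end
    -out
end

games = [(1, 2), (2, 3), (1, 3)] # toy data: player 1 beat player 2, etc.
negloglik([1.0, 0.0, -1.0], games)
```

With all weights equal, every game has probability 1/2, so the objective is `ngames * log(2)`.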

oh oh oh, I’m sorry—you’re right, I was misreading the data. I thought those two numbers were some kind of rating, like an ELO in chess or something, not the literal player index. That’s what I get for not actually running the code and inspecting the objects, sorry. I follow your model now.

To be clear, though, I didn’t write that last comment just to be a jerk about models or something. I could have made this clearer, but I can’t help but think that some part of the way you’ve formulated the likelihood is either making the problem (nearly?) ill-posed or creating a brutally nonconvex likelihood surface. Out of curiosity, have you tried plugging in an AD Hessian and using a second-order method, like Optim’s NewtonTrustRegion? While it is nice that R gives an answer, I would want to sanity-check it to see whether Optim with a Hessian gives back something with a meaningfully lower likelihood. In my field of spatial statistics, a lot of people use R and are very pleased with their optimization because the code runs without an error, but I’ve seen multiple instances of the resulting "MLE"s being seriously incorrect just because the optimizers aren’t using good enough convergence criteria or are silently hitting the maxiter or whatever. Before getting too bummed out over Julia’s timing, I’d definitely want to make sure that the answer R gets so quickly is actually correct.

EDIT: oh, also, do you constrain the weights to be positive? Because that’s probably necessary for identifiability purposes.
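For reference, wiring up an AD Hessian with Optim looks roughly like this (a sketch: the objective `f` and start point `x0` here are toy placeholders standing in for the actual negative log-likelihood):

```julia
using Optim

# Toy strictly convex objective standing in for the negative log-likelihood
f(x) = sum(abs2, x .- 1)
x0 = zeros(3)

# autodiff = :forward supplies ForwardDiff gradients and Hessians,
# which NewtonTrustRegion() then uses for true second-order steps
td = TwiceDifferentiable(f, x0; autodiff = :forward)
res = optimize(td, x0, NewtonTrustRegion())
Optim.minimizer(res) # ≈ [1.0, 1.0, 1.0]
```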

Sorry for the spam here, but I have a clinical problem of playing around with other peoples’ problems instead of doing my own work sometimes.

If I change x = w1 - w2 to x = exp(w1) - exp(w2) as a lazy way to constrain the weights to be positive (fitting the log-weights instead) and use Optim.NewtonTrustRegion() with an AD gradient and Hessian, it looks like it converges okay, although it isn’t zippy and I killed it after a couple of minutes.
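Concretely, that reparameterization just means the optimizer works with log-weights, something like (a sketch; names are mine, and `games` is the (winner, loser) pair layout discussed above):

```julia
sigmoid(x) = inv(one(x) + exp(-x))

# Fit v = log-weights; exp keeps the effective weights positive
function negloglik_logw(v, games)
    out = zero(eltype(v))
    for (winner, loser) in games
        out += log(sigmoid(exp(v[winner]) - exp(v[loser])))
    end
    -out
end
```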

This doesn’t really address your issue, but I can’t help but notice that computing both your gradient and Hessian manually would be quite easy and efficient to do in a single pass. It’s neat that you don’t have to do that in R to get a fast answer (although, as I mentioned in my earlier post, it may be good to actually check the quality of that result), but this seems like a circumstance where you could really easily do true second-order optimization. It’s hard to think of a downside. Your Hessian is very sparse, so you wouldn’t even need to pay much in terms of allocations.
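A sketch of that single pass, using d(-log σ(x))/dx = σ(x) - 1 and d²(-log σ(x))/dx² = σ(x)(1 - σ(x)) (the pair layout and function name are my own assumptions, and the Hessian is dense here only for brevity):

```julia
sigmoid(x) = inv(one(x) + exp(-x))

# Negative log-likelihood, gradient, and Hessian, all in one pass over the games
function nll_grad_hess(w, games)
    n = length(w)
    f = 0.0
    g = zeros(n)
    H = zeros(n, n) # very sparse in practice; dense here for brevity
    for (i, j) in games # player i beat player j
        s = sigmoid(w[i] - w[j])
        f -= log(s)
        gij = s - 1      # d(-log s)/d(w[i]); opposite sign for w[j]
        g[i] += gij
        g[j] -= gij
        h = s * (1 - s)  # curvature contribution of this game
        H[i, i] += h; H[j, j] += h
        H[i, j] -= h; H[j, i] -= h
    end
    f, g, H
end
```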


PS: This is your regression problem:

using SparseArrays, StatsBase

sigmoid(x) = inv(one(x) + exp(-x))
d = 600
xtrue = 5*(2rand(d) .- 1)
xtrue[end] = 0.0 # Baseline player to make solution well defined
n = 4000

# Each row encodes one game: +1 for one player, -1 for the other
A = spzeros(n, d)
for i in 1:n
    a, b = sample(1:d, 2, replace=false)
    A[i, a] = 1
    A[i, b] = -1
end
y = vec(rand(n, 1) .< sigmoid.(A*xtrue))

And then GLM

using GLM, LinearAlgebra
res = glm(A[:, 1:end-1], y, Bernoulli(), LogitLink());
norm(xtrue[1:end-1] - coef(res)) # "small"

Maximum likelihood without a prior/regularisation doesn’t recover the true strengths with this number of observations, but for, say, fewer players and more games it works nicely. Speed for the above:
(0.077458 seconds (981 allocations: 43.469 MiB))


PS:

julia> 600*log(600)
3838.157793129688

4000 is barely enough games to order the players even if the better player is guaranteed to win.


I managed to get a proper rating by removing players that have never lost or never won, and also recursively removing players with fewer than 10 games.
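That recursive filtering could look something like this (a sketch under my own assumptions: games as (winner, loser) pairs, and "recursively" meaning repeat until every remaining player has at least 10 games):

```julia
# Drop players with fewer than `mingames` appearances, repeating
# because each round of removals changes the remaining counts
function filter_players(games, mingames = 10)
    while true
        counts = Dict{Int,Int}()
        for (a, b) in games
            counts[a] = get(counts, a, 0) + 1
            counts[b] = get(counts, b, 0) + 1
        end
        kept = filter(g -> counts[g[1]] >= mingames && counts[g[2]] >= minames_or(mingames), games)
        length(kept) == length(games) && return games
        games = kept
    end
end
```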

Here are the results, which I’m quite happy with:

http://daizj.net/baduk-go-weiqi-ratings/


@avx is still the “official” name and still works, which is why there wasn’t any announcement.

@vectorize is there for now, though, as an undocumented option; it was winning the poll. I’m also considering @brrr and @elrod (plus a bunch of other options from the poll). The problem with the @avx name is that it’s misleading: it isn’t AVX-specific.
When I decide on a name change, I’ll make a breaking release, add deprecation warnings to @avx, make an announcement, etc.


Here is the Zulip thread and poll:

Zulip Poll

Edit: fixed link. Thanks @carstenbauer .

I think you meant to link Zulip