If you’re estimating these probabilities from a corpus, then a simple (if a little old-fashioned) way to get them is from an n-gram language model. If you’d like to try it out, I’ve written a little Julia package to create these easily! I haven’t registered it, but you can install it from the REPL:
julia> ]add https://github.com/dellison/NGrams.jl
Here’s a simple example of how you might use it:
using NGrams
corpus = """
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;
Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,
And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.
I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.
"""
# get a vector of lowercased tokens, ignoring punctuation
tokens = lowercase.([m.match for m in eachmatch(r"\w+", corpus)])
# bigram (2-gram) language model
model = NGrams.LanguageModel(2)
NGrams.observe!(model, tokens)
# calculate probabilities of words coming after "i"
NGrams.prob(model, "i", "could") # p(could|i) = 2/9 = 0.22
NGrams.prob(model, "i", "never") # p(never|i) = 0/9 = 0.0
To calculate collocation probabilities using word2vec or GloVe, you can do something like this (a rough sketch, with the pretrained embedding lookups left as stubs):
using LinearAlgebra  # for dot

# using pretrained embeddings
input_embedding(w)::Vector = ...   # "input" / center-word vectors
output_embedding(w)::Vector = ...  # "output" / context-word vectors
vocab = ...

# rough estimate:
score(w1, w2) = exp(dot(input_embedding(w1), output_embedding(w2)))
# p(w1 near w2):
prob_near(w1, w2) = score(w1, w2) / sum(score(u, v) for u in vocab, v in vocab)
If I recall correctly.
But that only gives you a fairly loose notion of “near” (anywhere in the context window), not “immediately next to”.
And you need pretrained input and output embeddings, and most sources (including, IIRC, Embeddings.jl) only give you the input embeddings.
You can make an even rougher approximation by assuming the output embeddings are equal to the input embeddings, which is probably okay-ish.
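If you want to try that shortcut with the vectors Embeddings.jl provides, it would look roughly like this (the `load_embeddings` call and field names are from memory, so treat them as an assumption and check the package’s README); the `score`/`prob_near` definitions from the sketch above then work unchanged:
using Embeddings, LinearAlgebra

# pretrained GloVe vectors; Embeddings.jl only gives you "input" embeddings (API from memory)
glove = load_embeddings(GloVe{:en})
lookup = Dict(w => glove.embeddings[:, i] for (i, w) in enumerate(glove.vocab))

# the shortcut: pretend the output embeddings are the same as the input embeddings
input_embedding(w) = lookup[w]
output_embedding(w) = lookup[w]
vocab = collect(keys(lookup))  # note: the full GloVe vocab makes the double sum in prob_near huge; restrict it in practice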