NLP help needed: probability of two words being next to each other

Would somebody help me with NLP?
I am new to this topic.

I have several words: s1, s2, s3.
I need to find which of these words has the highest probability of appearing right after the root word.

So I want to compare Probability(“some words”, s1) with Probability(“some words”, s2)…

I guess that I have to use word2vec or GloVe, but I am not sure how to apply it here.

Thank you in advance.

Take a look at
https://juliatext.github.io/TextAnalysis.jl


If you’re estimating these probabilities from a corpus, then a simple (if a little old-fashioned) way to calculate them is with an n-gram language model. If you’d like to try it out, I’ve written a little Julia package to create these easily! I haven’t registered it, but you can install it from the REPL:

julia> ]add https://github.com/dellison/NGrams.jl

Here’s a simple example of how you might use it:

using NGrams

corpus = """
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;

Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,

And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.

I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.
"""

# get a vector of lowercased tokens, ignoring punctuation
tokens = lowercase.((m -> m.match).(eachmatch(r"\w+", corpus)))

# bigram (2-gram) language model
model = NGrams.LanguageModel(2)

NGrams.observe!(model, tokens)

# calculate probabilities of words coming after "i"
NGrams.prob(model, "i", "could") # p(could|i) = 2/9 = 0.22
NGrams.prob(model, "i", "never") # p(never|i) = 0/9 = 0.0
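
For intuition, a bigram estimate like the one above is just a ratio of counts: how often w2 directly follows w1, divided by how often w1 is followed by anything. Here is a minimal sketch of that idea without the package, assuming plain maximum-likelihood estimates with no smoothing. The helper name bigram_prob is made up for this example and is not part of NGrams.jl.

```julia
# Hypothetical helper (not part of NGrams.jl): maximum-likelihood bigram estimate,
# p(w2 | w1) = count(w1 followed by w2) / count(w1 followed by anything).
function bigram_prob(tokens::AbstractVector{<:AbstractString}, w1, w2)
    pair_count = 0   # times w1 is immediately followed by w2
    ctx_count = 0    # times w1 is followed by any token
    for i in 1:length(tokens)-1
        if tokens[i] == w1
            ctx_count += 1
            tokens[i+1] == w2 && (pair_count += 1)
        end
    end
    # An unseen context word gets probability 0.0 instead of dividing by zero.
    return ctx_count == 0 ? 0.0 : pair_count / ctx_count
end

tokens = split(lowercase("and sorry I could not travel both as far as I could"))
p_far = bigram_prob(tokens, "as", "far")    # "as" is followed by "far" once and "i" once
p_never = bigram_prob(tokens, "i", "never") # never observed in this snippet
```

With no smoothing, any pair that never occurs in the corpus gets probability zero, which is why a large corpus matters for this approach.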

Thank you. Yes, that’s easy, but I don’t have a corpus.
The question is about general English.

Unfortunately, https://juliatext.github.io/TextAnalysis.jl can’t help here.

You have to calculate probabilities without having any data?


It’s supposed to use word2vec or GloVe, but I am not very confident.

To calculate collocation probabilities using word2vec or GloVe, you do:

using LinearAlgebra  # for dot

# using pretrained embeddings
input_embedding(w)::Vector = ...
output_embedding(w)::Vector = ...
vocab = ...

# rough estimate:
score(w1, w2) = exp(dot(input_embedding(w1), output_embedding(w2)))

# p(w1 around w2), normalized over all word pairs in the vocabulary
p_around(w1, w2) = score(w1, w2) / sum(score(u, v) for u in vocab for v in vocab)

If I recall correctly.
But that only covers a fairly loose definition of “near”, not strictly “next to”.
And you need pretrained input and output embeddings, and most packages (including, IIRC, Embeddings.jl) only give you input embeddings.

You can make a worse approximation by assuming the output embeddings are equal to the input embeddings, which is probably okayish.
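
To make that concrete, here is a minimal sketch of ranking candidate words for a fixed root word, using the same-embeddings approximation. The vectors are toy values made up for illustration (real ones would be loaded from pretrained word2vec or GloVe files), and the name rank_candidates is hypothetical. Since the normalizer is the same for every candidate, the sketch normalizes over the candidates only, which is enough for a comparison.

```julia
using LinearAlgebra

# Toy 3-dimensional vectors, invented for this example; not real embeddings.
emb = Dict(
    "strong" => [0.9, 0.1, 0.0],
    "tea"    => [0.8, 0.2, 0.1],
    "rain"   => [0.1, 0.9, 0.2],
)

# Assumption: output embeddings are taken to be equal to the input embeddings.
score(w1, w2) = exp(dot(emb[w1], emb[w2]))

# Hypothetical helper: softmax over the candidates for one root word,
# returned as word => probability pairs, most likely first.
function rank_candidates(root, candidates)
    scores = [score(root, c) for c in candidates]
    probs = scores ./ sum(scores)
    order = sortperm(probs; rev=true)
    return [candidates[i] => probs[i] for i in order]
end

ranking = rank_candidates("strong", ["rain", "tea"])
```

With these toy vectors "tea" ranks above "rain" after "strong", but keep in mind this measures loose co-occurrence in a context window rather than strict adjacency.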
