If you’re estimating these probabilities from a corpus, then a simple (if a little old-fashioned) way to get them is from an n-gram language model. If you’d like to try it out, I’ve written a little Julia package to create these easily! I haven’t registered it, but you can install it from the REPL:
julia> ]add https://github.com/dellison/NGrams.jl
Here’s a simple example of how you might use it:
using NGrams
corpus = """
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;
Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,
And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.
I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.
"""
# get a vector of lowercased tokens, ignoring punctuation
tokens = lowercase.([m.match for m in eachmatch(r"\w+", corpus)])
# bigram (2-gram) language model
model = NGrams.LanguageModel(2)
NGrams.observe!(model, tokens)
# calculate probabilities of words coming after "i"
NGrams.prob(model, "i", "could") # p(could|i) = 2/9 = 0.22
NGrams.prob(model, "i", "never") # p(never|i) = 0/9 = 0.0
To calculate collocation probabilities using word2vec or GloVe, you can do something like this (a rough sketch, with the pretrained embedding lookups left as stubs):
using LinearAlgebra  # for dot

# using pretrained embeddings
input_embedding(w)::Vector = ...   # "input" / center-word vectors
output_embedding(w)::Vector = ...  # "output" / context-word vectors
vocab = ...

# rough estimate:
score(w1, w2) = exp(dot(input_embedding(w1), output_embedding(w2)))
# p(w1 near w2):
prob_near(w1, w2) = score(w1, w2) / sum(score(u, v) for u in vocab, v in vocab)
If I recall correctly.
But that only gives you a fairly loose notion of “near” (anywhere in the context window), not “immediately next to”.
And you need pretrained input and output embeddings, and most sources (including, IIRC, Embeddings.jl) only give you the input embeddings.
You can make an even rougher approximation by assuming the output embeddings are equal to the input embeddings, which is probably okay-ish.
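If you want to try that shortcut with the vectors Embeddings.jl provides, it would look roughly like this (the `load_embeddings` call and field names are from memory, so treat them as an assumption and check the package’s README); the `score`/`prob_near` definitions from the sketch above then work unchanged:
using Embeddings, LinearAlgebra

# pretrained GloVe vectors; Embeddings.jl only gives you "input" embeddings (API from memory)
glove = load_embeddings(GloVe{:en})
lookup = Dict(w => glove.embeddings[:, i] for (i, w) in enumerate(glove.vocab))

# the shortcut: pretend the output embeddings are the same as the input embeddings
input_embedding(w) = lookup[w]
output_embedding(w) = lookup[w]
vocab = collect(keys(lookup))  # note: the full GloVe vocab makes the double sum in prob_near huge; restrict it in practice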