Natural Language Processing: where do I start?


#1

I do not know enough about machine learning to ask an informed question. I am not even sure whether this is a machine learning question. Maybe.

I will tell you what I want to achieve in the end:

In short, we want to classify data based on text, using unsupervised machine learning. The data will be abstracts, keywords, and titles of scientific publications. We want to develop a meaningful subject classification system based on phrases and the distance between words in the data.
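As a starting point for the "distance between texts" idea, here is a minimal sketch that uses only Base Julia, so it does not depend on any package being ready for v1.0. It assumes a plain bag-of-words representation and cosine similarity, which is only one of many possible choices and not necessarily the right one for this data. The function names (`tokenize`, `termcounts`, `cosine`) and the sample titles are made up for illustration.

```julia
# Lowercase a document and split it into word tokens (ASCII letters only, an assumption).
tokenize(doc::AbstractString) = [String(m.match) for m in eachmatch(r"[a-z]+", lowercase(doc))]

# Count how often each token occurs in one document.
function termcounts(doc::AbstractString)
    counts = Dict{String,Int}()
    for tok in tokenize(doc)
        counts[tok] = get(counts, tok, 0) + 1
    end
    return counts
end

# Cosine similarity between two sparse count vectors (both assumed non-empty).
function cosine(a::Dict{String,Int}, b::Dict{String,Int})
    dotprod = sum(a[k] * get(b, k, 0) for k in keys(a))
    veclen(d) = sqrt(sum(v^2 for v in values(d)))
    return dotprod / (veclen(a) * veclen(b))
end

# Hypothetical sample titles, not real data.
docs = ["Deep learning for protein structure prediction",
        "Protein folding with deep neural networks",
        "Soil erosion in river catchments"]

counts = [termcounts(d) for d in docs]
sim12 = cosine(counts[1], counts[2])  # titles sharing "deep" and "protein"
sim13 = cosine(counts[1], counts[3])  # titles with no words in common
```

An unsupervised method such as clustering would then group documents using similarities like these. In practice, stop-word removal and some term weighting would probably be needed before the groups become meaningful.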

I know that Python has the well-developed nltk and good tutorials, and I still know Python better than Julia, but I would prefer to do this in Julia. Unfortunately, a lot of the related tools in Julia are not yet usable in v1.0, and there are so many Julia packages that I do not know where to start exploring.

I will appreciate some advice.


#2

Did you see this:


Text mining directly from HTML?


#3

No. Thanks for the link.


#4

I was trying a little

and found that matchall was removed from Julia 1.0

HISTORY.md says:
matchall has been deprecated in favor of collect(m.match for m in eachmatch(r, s)) (#26071).

It seems an unbelievable design decision, at least at first look. :stuck_out_tongue_winking_eye:

But there seems to be a mistake in HISTORY.md too:

matchall(r,s) = collect(m.match for m in eachmatch(r, s))  # this didn't work
matchall(r,s) = collect(m for m in eachmatch(r, s))  # this could help to run WebScraping.ipynb
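To make the difference between the two definitions concrete, here is a small sketch for Julia 1.0. The helper names `matchall_strings` and `matchall_objects` are made up for this example. As far as I can tell, the first form returns the matched substrings, while the second returns `RegexMatch` objects. So which one "works" may depend on whether the calling code expects strings or goes on to use fields such as `.match` or `.captures`.

```julia
# Hypothetical helper names; `matchall` itself no longer exists in Julia 1.0.

# The form suggested in HISTORY.md: a vector of matched substrings.
matchall_strings(r::Regex, s::AbstractString) = collect(m.match for m in eachmatch(r, s))

# The form from the post above: a vector of RegexMatch objects.
matchall_objects(r::Regex, s::AbstractString) = collect(eachmatch(r, s))

text = "a1b22c333"
strs = matchall_strings(r"\d+", text)  # substrings only
objs = matchall_objects(r"\d+", text)  # each element has .match, .captures, .offset

# The substrings can always be recovered from the match objects.
recovered = [m.match for m in objs]
```

If the notebook calls string functions on the results, the first form should fit. If it accesses `.match` or `.captures` afterwards, only the second will.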

A small problem could be the HTTP/1.1 429 Too Many Requests response from stackexchange… (which is probably understandable)