Find frequent words

Hello colleagues, I have recently learned Julia based on the book “Getting Started with Julia” and in it there is an example of finding frequent words in a text:

# code in chapter 5\word_frequency.jl:
# 1- read in text file:
str = readall("words1.txt")
# 2- replace non alphabet characters from text with a space:
nonalpha = r"(\W\s?)" # define a regular expression
str = replace(str, nonalpha, ' ')
digits = r"(\d+)"
str = replace(str, digits, ' ')
# 3- split text in words:
word_list = split(str, ' ')
# 4- make a dictionary with the words and count their frequencies:
word_freq = Dict{String, Int64}()
for word in word_list
word = strip(word)
if isempty(word) continue end
haskey(word_freq, word) ?
word_freq[word] += 1 :
word_freq[word] = 1
# 5- sort the words (the keys) and print out the frequencies:
println("Word : frequency \n")
words = sort!(collect(keys(word_freq)))
for word in words
println("$word : $(word_freq[word])")

In the example use cycles to solve my question is:
Is there any way to solve it without cycles?

1 Like

Welcome @IsraelSI!

I’m not sure if what you call “cycles” is the for loop (the part that starts with for word in word_list, etc.). If that’s the case, the answer is yes: there are other ways.

For instance, you can use broadcasting and filtering to move the operations that “clean” the word list out of the loop:

word_list = strip.(word_list)
filter!(!isempty, word_list)

And then transform the loop into a one-line dictionary comprehension like this:

Dict(word => count(isequal(word), word_list) for word in unique(word_list))

You could even use broadcasting for everything, although this is starting to look awkward:

keys = unique(word_list)
values = count.(isequal.(keys), Ref(word_list))
Dict(Pair.(keys, values))

But let aside the fact that there are many ways to do this (and other) things. Why would you want to avoid the loop? I’m asking just in case, because many people think that loops make things slow, but that’s not the case in Julia. Actually a quick testing shows that the loop presented in the book is the cheapest and quickest option, compared with the two alternatives that I suggested – and the worst one is the last I presented.

That usual in Julia: loops with simple instructions are often faster and consume less memory than alternative algorithms that try to do everything in one step (e.g. “vectorizing” operations).


Isn’t the loop version asymptotically faster?

FWIW, you could try use StatsBase.countmap here.


julia> s = "This is a test. A simple test but a test nonetheless."
"This is a test. A simple test but a test nonetheless."

julia> countmap(split(replace(s, '.' => "")," "))
Dict{SubString{String}, Int64} with 8 entries:
  "A"           => 1
  "test"        => 3
  "simple"      => 1
  "is"          => 1
  "This"        => 1
  "nonetheless" => 1
  "a"           => 2
  "but"         => 1