Find frequent words

IsraelSI · September 26, 2021, 8:27pm

Hello colleagues, I have recently learned Julia based on the book “Getting Started with Julia” and in it there is an example of finding frequent words in a text:

# code in chapter 5\word_frequency.jl:
# 1- read in text file:
str = readall("words1.txt")
# 2- replace non alphabet characters from text with a space:
nonalpha = r"(\W\s?)" # define a regular expression
str = replace(str, nonalpha, ' ')
digits = r"(\d+)"
str = replace(str, digits, ' ')
# 3- split text in words:
word_list = split(str, ' ')
# 4- make a dictionary with the words and count their frequencies:
word_freq = Dict{String, Int64}()
for word in word_list
word = strip(word)
if isempty(word) continue end
haskey(word_freq, word) ?
word_freq[word] += 1 :
word_freq[word] = 1
end
# 5- sort the words (the keys) and print out the frequencies:
println("Word : frequency \n")
words = sort!(collect(keys(word_freq)))
for word in words
println("$word : $(word_freq[word])")
end

In the example use cycles to solve my question is:
Is there any way to solve it without cycles?

heliosdrm · September 26, 2021, 9:45pm

Welcome @IsraelSI!

I’m not sure if what you call “cycles” is the for loop (the part that starts with for word in word_list, etc.). If that’s the case, the answer is yes: there are other ways.

For instance, you can use broadcasting and filtering to move the operations that “clean” the word list out of the loop:

word_list = strip.(word_list)
filter!(!isempty, word_list)

And then transform the loop into a one-line dictionary comprehension like this:

Dict(word => count(isequal(word), word_list) for word in unique(word_list))

You could even use broadcasting for everything, although this is starting to look awkward:

keys = unique(word_list)
values = count.(isequal.(keys), Ref(word_list))
Dict(Pair.(keys, values))

But let aside the fact that there are many ways to do this (and other) things. Why would you want to avoid the loop? I’m asking just in case, because many people think that loops make things slow, but that’s not the case in Julia. Actually a quick testing shows that the loop presented in the book is the cheapest and quickest option, compared with the two alternatives that I suggested – and the worst one is the last I presented.

That usual in Julia: loops with simple instructions are often faster and consume less memory than alternative algorithms that try to do everything in one step (e.g. “vectorizing” operations).

jzr · September 27, 2021, 5:53am

Isn’t the loop version asymptotically faster?

carstenbauer · September 27, 2021, 6:19am

FWIW, you could try use StatsBase.countmap here.

Example:

julia> s = "This is a test. A simple test but a test nonetheless."
"This is a test. A simple test but a test nonetheless."

julia> countmap(split(replace(s, '.' => "")," "))
Dict{SubString{String}, Int64} with 8 entries:
  "A"           => 1
  "test"        => 3
  "simple"      => 1
  "is"          => 1
  "This"        => 1
  "nonetheless" => 1
  "a"           => 2
  "but"         => 1

Topic		Replies	Views
Count words challenge Performance	24	2110	March 23, 2021
How to find most frequent element in an array? New to Julia question	7	2034	April 5, 2022
Counting words in StringArray General Usage	12	474	September 29, 2020
Efficiently finding the frequency of patterns in DataFrame columns New to Julia dataframes , dictionaries , splitapplycombine	12	1503	January 1, 2022
Text Mining: Detect Strings: Word Lookup in a Large Corpus of Phrases Using a Large Dictionary Performance question	27	2175	December 15, 2021

Find frequent words

Related topics