Implementation of Norvigs spellchecker (Code critique + performance help)

dunefox · January 6, 2020, 7:23pm

I know python relatively well and wanted to learn Julia, so for my first steps I tried to implement Norvigs spellchecker (How to Write a Spelling Corrector). It’s working as intended and I kept it as close as possible but I’d like to know how to improve the code in general and, since it runs pretty inefficiently, how to get it to the originals speed.
Mine: ~21 - 23 words per second
Python: 41 words per second

Any help is appreciated.

Edit: Should have posted the code as well… GitHub - dunefox/spellcheck-norvig: Naive Julia implementation of Peter Norvigs simple spell checker: https://norvig.com/spell-correct.html

include("spelling_corrector.jl")
@time spelltest(testset(readlines(open("spell-testset1.txt"))));
  74.8 of 270 correct (5.6000000000000005 unknown) at 23.57 words per second
  11.455881 seconds (53.94 M allocations: 3.772 GiB, 10.24% gc time)

Paul_Soderlind · January 6, 2020, 8:20pm

I cannot offer any advice on the performance, but I still want to say that it is nice to see this being done.

Sukera · January 6, 2020, 9:37pm

I’ve played around with this years ago but never got around to testing the speed since whatever I did counted occurrences of the and some others wrong - but you’ve made me look at it again and I’ve put it on github now. Here’s some timings on my machine:

Running spell-testset1.txt
74.4% of 270 correct (5.56% unknown) at 40.5 words per second        
  7.415867 seconds (59.03 M allocations: 3.677 GiB, 2.79% gc time)   
Running spell-testset2.txt
67.2% of 400 correct (10.8% unknown) at 43.9 words per second        
  9.109207 seconds (93.48 M allocations: 6.078 GiB, 3.25% gc time)

The output is produced via julia runTests.jl in the root directory of this repository.

Your timings on my machine:

Running spell-testset1.txt
0.748 of 270 correct (0.056 unknown) at 21.774 words per second
 13.394927 seconds (59.13 M allocations: 4.016 GiB, 12.41% gc time)
Running spell-testset2.txt
0.675 of 400 correct (0.108 unknown) at 26.249 words per second
 15.240278 seconds (93.31 M allocations: 6.619 GiB, 11.82% gc time)

Seems like my version is using the same allocations over all, but is somehow able to reduce GC time by a factor of ~4. Very interesting.

That being said, the code I wrote was more or less a one to one translation from Python to Julia - I’m somewhat certain that there’s still some speed left on the table, especially considering the humongous number of allocations.

Sukera · January 6, 2020, 9:44pm

One more thing: I’ve made WORDS a const variable in my code. If I leave it as non-const and time it again, I get these timings:

Running spell-testset1.txt
74.4% of 270 correct (5.56% unknown) at 33.5 words per second
  8.948835 seconds (73.19 M allocations: 3.890 GiB, 2.92% gc time)
Running spell-testset2.txt
67.2% of 400 correct (10.8% unknown) at 32.7 words per second
 12.237088 seconds (117.68 M allocations: 6.439 GiB, 3.46% gc time)

So a reduction of words per second by almost 20%! Mark your globals as const, yo!

dunefox · January 6, 2020, 9:48pm

Thanks for posting your solution. The numbers I got from the extracted words were off by a bit as well, which is why I commented out two unit tests and changed a couple others to see which ones were failing and why. So, overall the unit_tests aren’t passing for me as well which I find quite strange…
Also, I tried to stay as close to the Python version as I could but since I’m not fluent in Julia I believe my code is far from optimal…

➜  spellcheck-norvig git:(master) ✗ julia spelling_corrector.jl 
Running spell-testset1.txt
0.748 of 270 correct (0.056 unknown) at 28.562 words per second
 10.262803 seconds (52.28 M allocations: 3.912 GiB, 9.05% gc time)
Running spell-testset2.txt
0.675 of 400 correct (0.108 unknown) at 25.997 words per second
 15.386978 seconds (81.48 M allocations: 6.443 GiB, 11.29% gc time)

You’re right, the change to const WORDS gave me a couple of words per second! Still not enough though.

Edit: By just running both runs twice I get a speed up of 3 wps for the first dataset, but not for the second:

➜  spellcheck-norvig git:(master) ✗ julia spelling_corrector.jl
Running warmup1
0.748 of 270 correct (0.056 unknown) at 27.152 words per second
 10.763162 seconds (52.28 M allocations: 3.912 GiB, 9.18% gc time)
Running warmup2
0.675 of 400 correct (0.108 unknown) at 24.841 words per second
 16.102615 seconds (81.48 M allocations: 6.443 GiB, 11.29% gc time)
Running spell-testset1.txt
0.748 of 270 correct (0.056 unknown) at 30.208 words per second
  8.938255 seconds (47.14 M allocations: 3.670 GiB, 10.62% gc time)
Running spell-testset2.txt
0.675 of 400 correct (0.108 unknown) at 24.81 words per second
 16.122854 seconds (81.48 M allocations: 6.443 GiB, 11.37% gc time)

klaff · January 6, 2020, 9:59pm

I don’t have time to play with this right now but I am watching the thread. My virtual bet is that someone will get a julia version to between 200 and 1000 words per second.

Sukera · January 6, 2020, 10:11pm

I suspect those are cache loading effects - I get

Running spell-testset1.txt
74.4% of 270 correct (5.56% unknown) at 49.0 words per second
  5.512463 seconds (54.73 M allocations: 3.448 GiB, 4.79% gc time)
Running spell-testset2.txt
67.2% of 400 correct (10.8% unknown) at 44.0 words per second
  9.099374 seconds (93.84 M allocations: 6.053 GiB, 5.21% gc time)

if I run my code multiple times in the same session.

Unfortunately for you, I’m not able to locate the source of the slowdown in your code, even with copy pasting parts of my code into yours At this point, profiling and looking where the time is really spent would be best.

ffevotte · January 7, 2020, 10:39pm

Thanks @dunefox for this fun challenge! It’s interesting (IMO) to have relatively small codes like this one, which are simple enough to be studied in their globality, while still complex enough that there are interesting things to say about them.

Let’s try and fulfill @klaff’s prophecy. For reference, here is how the original code performs on my system (this is a second run):

julia> using BenchmarkTools
julia> @btime spelltest($(testset(readlines(open("spell-testset1.txt")))))
0.748 of 270 correct (0.056 unknown) at 27.933 words per second
0.748 of 270 correct (0.056 unknown) at 28.19 words per second
0.748 of 270 correct (0.056 unknown) at 28.487 words per second
0.748 of 270 correct (0.056 unknown) at 28.255 words per second
  9.556 s (53934856 allocations: 3.77 GiB)

Let’s start with a little profiling:

julia> using ProfileView

julia> @profview spelltest(testset(readlines(open("spell-testset1.txt"))))
0.748 of 270 correct (0.056 unknown) at 24.012 words per second

We see on the flame graph that there are a few occurrences of dynamic dispatch (displayed in red), but that dos not seem to be very much of a problem here: all of the red “bars” are stacked with “bars” nearly covering their length (i.e. the inner calls still account for nearly all the run time)

Overall, the candidates function accounts for almost all the computing time. Above its bar, the flame graph is decomposed into 3 “stacks”. From left to right:

unique, called from edits2
collect, called from edits2
known, directly called from candidates

More quantitative information can be obtained from the textual profile output:

julia> using Profile
julia> Profile.print(format=:flat, sortedby=:count)
Count File                                    Line  Function
[...]
 2458 spellcheck-norvig/spelling_corrector.jl   45  edits1(::String)                                             
 2639 ./dict.jl                                546  in(::String, ::Base.KeySet{SubString{String},Dict{SubStrin...
 2809 ./dict.jl                                545  haskey                                                       
 2809 ./set.jl                                  47  in                                                           
 2809 ./set.jl                                 145  unique_from                                                  
 3071 ./iterators.jl                           429  iterate                                                      
 3151 spellcheck-norvig/spelling_corrector.jl   35  known(::Array{String,1})                                     
 3157 ./generator.jl                            44  iterate                                                      
 3238 ./array.jl                               620  collect                                                      
 3760 ./dict.jl                                289  ht_keyindex(::Dict{SubString{String},Int64}, ::String)       
 4269 ./iterators.jl                           903  iterate                                                      
 4419 ./array.jl                               596  _collect(::UnitRange{Int64}, ::Base.Iterators.Flatten{Base...
 4419 ./array.jl                               560  collect(::Base.Iterators.Flatten{Base.Generator{Array{Stri...
 5422 ./set.jl                                 125  unique(::Array{String,1})                                    
 5540 ./multidimensional.jl                   1431  #unique#432                                                  
 5540 ./multidimensional.jl                   1433  _unique_dims                                                 
 5540 ./multidimensional.jl                   1431  unique                                                       
 5556 ./generator.jl                            47  iterate                                                      
 6366 ./array.jl                               713  grow_to!(::Array{String,1}, ::Base.Iterators.Flatten{Base....
 7539 spellcheck-norvig/spelling_corrector.jl   48  edits2(::SubString{String})                                  
 8077 ./array.jl                               686  grow_to!(::Array{String,1}, ::Base.Generator{Base.Iterator...
10634 spellcheck-norvig/spelling_corrector.jl   28  candidates(::SubString{String})                              
11071 spellcheck-norvig/spelling_corrector.jl   91  #spelltest#38(::Bool, ::typeof(spelltest), ::Array{Tuple{S...
11071 spellcheck-norvig/spelling_corrector.jl   19  correction(::SubString{String})                              
11072 spellcheck-norvig/spelling_corrector.jl   87  spelltest(::Array{Tuple{SubString{String},SubString{String...

If we focus on the functions from our code, roughly 75% of the time is spent in edits2, among which 25% come from edits1. The remaining 25% are spent in known.

But we know from the flame graph that these functions don’t do much on their own: all their run time is spent in calls to standard library functions. Prominent among these are unique and collect.

To recap, we spend most of our time building large arrays, then removing duplicate elements from
them. There should be a way to avoid that…

ffevotte · January 7, 2020, 10:41pm

For the reasons explained in the previous post, let’s focus first on edits1, edits2 and known, trying to avoid building large arrays and removing duplicates.

This PR is an attempt at fixing the most obvious issues:

https://github.com/dunefox/spellcheck-norvig/pull/1

Let’s focus on edits1 first since it’s the innermost function. From our vantage point focusing on collect and unique, edits1 can be described by 3 main stages:

Comprehensions are used to build several arrays containing the results of different types of modifications applied to one word. Each of these comprehensions has to allocate some space to store its results.
All subarrays are concatenated into one big array. This is more allocations, and data copies.
Duplicates are removed from the big array.

Let’s instead try to build the end array from scratch, by only adding candidates if they aren’t already present. Such a collection is essentially a set of unique elements, so let’s use a Set datastructure for that. Instead of comprehensions, we iterate with simple for loops (a collateral benefit is that some of the loops can be mutualized), build the candidates, and add them to the set. Due to the way sets work, this means only adding them if they are not duplicate.

function edits1(word)
    splits = Set([("", word)])
    for i in 1:length(word)
        push!(splits, (word[1:i], word[i + 1:end]))
    end

    edits = Set(String[])

    for (L, R) in splits
        if R != ""
            push!(edits, L * R[2:end])
            for c in letters
                push!(edits, L * c * R[2:end])
            end
        end

        length(R) > 1 && push!(edits, L * R[2] * R[1] * R[3:end])

        for c in letters
            push(edits, L * c * R)
        end
    end
    edits
end

Now, edits2. This function uses a big comprension to:

compute the set of edits produced by edits1 on the given word,
combine all sets of edits produced by calling edits1 on each word in the previous set.

Again, the idea would be to build this set from scratch, rather than building several subsets, concatenating them, and removing duplicates. We can modify edits1 so that it can add to an existing set instead of building a brand new set:

# This keeps the same semantics as before
edits1(word) = edits1!(Set(String[]), word)

# This one can modify an existing set
function edits1!(edits, word)
    # pushes new candidates to the provided set
    edits
end

edits2 can then be rewritten as:

function edits2(word)
    # this set will be expanded by each call to edits1!
    edits = Set(String[])
    
    for e1 in edits1(word)
        edits1!(edits, e1)
    end
    
    edits
end

Finally, known. Again, all this function does is use a comprehension to

build a new set of words (more allocations),
filtering out unknown words (again evidencing that we previously stored too many candidates).

We could easily tell edits1 to filter unknown words and not add them in the first place. Let’s add a filtering option to edits1!:

# The default filter always returns true
#   => no change in behaviour by default
function edits1!(edits, word, filter=w->true)
    # [...]

    # use this instead of push!(edits, ...):
    # candidates are added only if the predicate is true
    push_word(w) = filter(w) && push!(edits, w)
    
    for (L, R) in splits
        # [...] 
        length(R) > 1 && push_word(L * R[2] * R[1] * R[3:end])
        # [...]
    end
    edits
end

We can define a predicate that tests whether a word is known (a set of all known words is built beforehand).

const WORDS = countmap(words(read(open("./big.txt"), String)))
const KNOWN = Set(keys(WORDS))
known(word) = word in KNOWN

This predicate can be used to filter results from edits1. For example:

function edits2(word)
    edits = Set(String[])
    for e1 in edits1(word)
        edits1!(edits, e1, known)
    end
    edits
end

Putting all this together, we get the code proposed in the PR and yielding the following performances:

julia> using BenchmarkTools
julia> @btime spelltest($(testset(readlines(open("spell-testset1.txt")))))
0.748 of 270 correct (0.056 unknown) at 95.094 words per second
0.748 of 270 correct (0.056 unknown) at 95.808 words per second
0.748 of 270 correct (0.056 unknown) at 95.282 words per second
0.748 of 270 correct (0.056 unknown) at 93.151 words per second
0.748 of 270 correct (0.056 unknown) at 94.579 words per second
0.748 of 270 correct (0.056 unknown) at 96.582 words per second
0.748 of 270 correct (0.056 unknown) at 95.515 words per second
  2.796 s (35667972 allocations: 1.29 GiB)

We’re still a factor 2 short of @klaff’s most pessimistic bet. But there is still hope: a significant fraction of the time is now spent in string manipulations, which I am confident could be optimized further.

dunefox · January 7, 2020, 11:55pm

Thank you! That’s quite the speed increase.
I tried to use the Profiler on my version as well, but got an “out of memory error” for some reason.
This begs the question why the original Python version is so much faster than my (almost) equivalent version - it uses list comprehensions and the same edit1 function.
I’m kind of tempted to try this in other languages as well.

LaurentPlagne · January 8, 2020, 7:50am

This begs the question why the original Python version is so much faster than my (almost) equivalent version - it uses list comprehensions and the same edit1 function.
I’m kind of tempted to try this in other languages as well.

This is an interesting question (the so much is a little less than a factor of 2 right ?)
Does the @Sukera implementation give some hint ? It seems to offer the same level of performance as the Python’s implementation and maybe close enough to the original code.
What is clear is that array creation/destruction takes time in Julia and is not advisable in performance critical codes.

I do not know Python enough, but I once had a similar performance surprise while translating a code from Matlab (which I do not know very well neither) : In particular, it appeared that the array (matrix) concatenation seems to be rather idiomatic in Matlab (and Python ?) and performs better than its direct translation in Julia where it was (way) better to build the complete array directly (which is totally expected from a C++ programmer point of view). I did not investigate this but at the time I speculated about some kind of composite array structure (composed of pieces of continuous arrays) that could be natively handled by Matlab…

Another hypothesis is that the Julia string manipulation could be slower here. I did not check if this hypothesis is coherent with @ffevotte flame graph.

Obviously, the best way to go is to profile the Python code and compare it with the Julia’s one.

Sukera · January 8, 2020, 11:01am

I can take another look at this in the evening, but I suspect the main differences lie in the fact that I used some generators instead of creating all intermediate arrays and that i’ve use string interpolation instead of string concatenation (even though copy pasting those methods into OPs code did not seem to make up all the difference).

Skoffer · January 8, 2020, 11:10am

Actually, your implementation is not the same, you are using unique instead of Set. These two functions have different performance depending on the size of array. Compare these benchmarks

using BenchmarkTools
words = [join(rand('a':'z', rand(3:14))) for _ in 1:200]
x = rand(words, 300)

b1 = @benchmark unique($x)
b2 = @benchmark Set($x)

println(IOContext(stdout, :compact => false), b1)
println(IOContext(stdout, :compact => false), b2)

# BenchmarkTools.Trial: 
#   memory estimate:  17.48 KiB
#   allocs estimate:  21
#   --------------
#   minimum time:     36.176 μs (0.00% GC)
#   median time:      39.681 μs (0.00% GC)
#   mean time:        63.175 μs (1.93% GC)
#   maximum time:     9.635 ms (0.00% GC)
#   --------------
#   samples:          10000
#   evals/sample:     1

# BenchmarkTools.Trial: 
#   memory estimate:  5.09 KiB
#   allocs estimate:  8
#   --------------
#   minimum time:     19.840 μs (0.00% GC)
#   median time:      21.497 μs (0.00% GC)
#   mean time:        26.581 μs (0.00% GC)
#   maximum time:     4.033 ms (0.00% GC)
#   --------------
#   samples:          10000
#   evals/sample:     1

You may see that Set performs much better than unique. But with increased vector length situation is different

using BenchmarkTools
words = [join(rand('a':'z', rand(3:14))) for _ in 1:200]
x = rand(words, 10000)

b1 = @benchmark unique($x)
b2 = @benchmark Set($x)

println(IOContext(stdout, :compact => false), b1)
println(IOContext(stdout, :compact => false), b2)

# BenchmarkTools.Trial: 
#   memory estimate:  17.48 KiB
#   allocs estimate:  21
#   --------------
#   minimum time:     6.199 ms (0.00% GC)
#   median time:      7.467 ms (0.00% GC)
#   mean time:        7.656 ms (0.00% GC)
#   maximum time:     18.427 ms (0.00% GC)
#   --------------
#   samples:          652
#   evals/sample:     1
# 
# BenchmarkTools.Trial: 
#   memory estimate:  1.13 MiB
#   allocs estimate:  8
#   --------------
#   minimum time:     7.479 ms (0.00% GC)
#   median time:      8.879 ms (0.00% GC)
#   mean time:        8.820 ms (0.52% GC)
#   maximum time:     17.806 ms (0.00% GC)
#   --------------
#   samples:          567
#   evals/sample:     1

In case of spellchecker you are mostly in first situation, so unique performs badly and you fall behind.

Here is some tips.

Change unique to Set. This change mostly affects candidates function, so you may use something like this

function candidates(word)
    r = known([word])
    if !isempty(r) return r end
    r = known(edits1(word))
    if !isempty(r) return r end
    r = known(edits2(word))
    if !isempty(r) return r end
    Set([word])
end

Wrap your functions in module
Declare WORDS to be const
First testset is not quite correct, since you are adding compile time, so you may want to add warm up before main calculations (of course, generally it’s better to use BenchmarkTools)

# Warmup
spelltest(testset(readlines(open("spell-testset1.txt")))[1:2])

With these changes, the same code is now giving me

# Julia
Running spell-testset1.txt
0.748 of 270 correct (0.056 unknown) at 19.072 words per second
 14.157568 seconds (55.46 M allocations: 3.542 GiB, 7.32% gc time)
Running spell-testset2.txt
0.675 of 400 correct (0.108 unknown) at 18.014 words per second
 22.205295 seconds (92.66 M allocations: 5.976 GiB, 6.67% gc time)

# Python
75% of 270 correct (6% unknown) at 18 words per second 
68% of 400 correct (11% unknown) at 16 words per second

so, there is small performance improvement.

dunefox · January 8, 2020, 12:07pm

@LaurentPlagne Yeah, less than 2 is right - about 20 wps. When I have some free time I’ll have to compare Julia and Python more directly with some snippets. Also, I wonder how fast Python can be here if optimised. I’ll try that as well some time.

@Skoffer True, I actually used Sets first but I thought since I’m not really adding elements it wouldn’t make much of a difference. So much for that. Also, I had two test runs to account for compile time but I think I reset it when merging the PR.

kristoffer.carlsson · January 8, 2020, 12:23pm

P(word; N=sum(values(WORDS))) = get(WORDS, word, 0.0) / N

recomputes the sum every time P is called. So make that something like

const sumwords =  sum(values(WORDS))
P(word; N=sumwords) = get(WORDS, word, 0.0) / N

dunefox · January 8, 2020, 1:54pm

Good point. Your version gets around 100 wps.

➜  spellcheck-norvig git:(master) julia spelling_corrector.jl
Running spell-testset1.txt
0.748 of 270 correct (0.056 unknown) at 100.352 words per second
  3.579360 seconds (37.67 M allocations: 1.383 GiB, 5.78% gc time)
Running spell-testset2.txt
0.675 of 400 correct (0.108 unknown) at 86.333 words per second
  4.633582 seconds (61.76 M allocations: 2.189 GiB, 4.87% gc time)
Running spell-testset1.txt
0.748 of 270 correct (0.056 unknown) at 100.937 words per second
  2.675193 seconds (35.67 M allocations: 1.287 GiB, 4.81% gc time)
Running spell-testset2.txt
0.675 of 400 correct (0.108 unknown) at 86.169 words per second
  4.642394 seconds (61.76 M allocations: 2.189 GiB, 4.65% gc time)

Edit: For a moment I mistakenly thought countmap already maps tokens to relative frequencies, but of course I still need to normalise the values in the countmap.

Skoffer · January 9, 2020, 10:04pm

It was quite an interesting task, and it turns out that it is really possible to speed up, starting from @ffevotte brilliant solution.

The main idea is the same: remove as many unnecessary allocations as possible.

The first one is simple: we can remove redundant splits array in edits1! because what we really need is just a pair of two substrings without complete collection.
So, instead of

splits = Set([("", word)])
for i in 1:length(word)
    push!(splits, (word[1:i], word[i + 1:end]))
end

for (L, R) in splits

we can simply write

for i in 0:length(word)
    L = word[1:i]
    R = word[i+1:end]

This gave me +4 w/s. Good, but not too interesting.

The second one is more elaborated. Let’s took at this part of edits1! function

for c in 'a':'z'
    push_word(L * c * R)
end

What is wrong with this code? We are doing lots of concatenations (which is rather costly) and most of them unnecessary because corresponding words do not exist. Instead, we can do another way around, we can precompute all possible correct pairs of substrings.

function pairs(text)
    d = Dict{Tuple{String, String}, String}()
    for word in text
        for i in 1:length(word)
            L = word[1:(i-1)]
            R = word[(i+1):end]
            (P(get(d, (L, R), "")) < P(word)) && (d[(L, R)] = word)
		end
    end
    d
end
const PAIRS = pairs(keys(WORDS))

Also this function choose best correction from for all triplets (L, c, R), additionally minimizing calls to push_words.

After this change, I was finally able to get these numbers (also memory footprint was sufficiently minimized)

Running spell-testset1.txt
0.748 of 270 correct (0.056 unknown) at 1064.467 words per second
  0.253835 seconds (2.84 M allocations: 95.227 MiB, 6.72% gc time)
Running spell-testset2.txt
0.675 of 400 correct (0.108 unknown) at 956.837 words per second
  0.418432 seconds (4.94 M allocations: 164.120 MiB, 3.94% gc time)

dunefox · January 10, 2020, 2:56pm

Alright, 1000 wps. Nice, thanks!

klaff · January 10, 2020, 7:09pm

These are the most educational threads! Thanks to all the contributors!

dunefox · January 11, 2020, 2:00pm

Yep, thanks from me as well, this was very interesting!
Since I’m slowly going through algorithms like that in Julia I might post something like this once or twice if that’s alright.

Topic		Replies	Views
Count words challenge Performance	24	2180	March 23, 2021
Help to get my slow Julia code to run as fast as Rust/Java/Lisp Performance	100	4619	August 6, 2021
Best way to optimize code Performance question	29	2287	May 21, 2020
Rust / Julia Comparison Post Performance rust	56	5208	February 1, 2022
Text Mining: Detect Strings: Word Lookup in a Large Corpus of Phrases Using a Large Dictionary Performance question	27	2199	December 15, 2021

Implementation of Norvigs spellchecker (Code critique + performance help)

Related topics