Implementation of Norvig's spellchecker (Code critique + performance help)

This whole thread read like a really engaging story: an obstacle, an omen of performance, the journey to reach that promised land of hundreds of words per second.
And the ending: the journey overshot and ended up in another fucking galaxy.
Talk about emotional rollercoasters.

4 Likes

Another galaxy? How about another universe? But seriously, this thread needs some closure. The previous solution, despite being fast, is rather ugly and incomplete, so I’ve tried to push it to the maximum. Disclaimer: I am not an expert in this domain, so it is quite possible that this is a reinvention of the wheel. All I was trying to do was apply common sense.

My approach is the following. We have three components to this problem:

  1. An appropriate data structure that contains the dictionary in one form or another.
  2. An appropriate search algorithm that efficiently utilizes this data structure.
  3. The general idea that, in order to search through the data, we should generate all possible modifications of the source word and check their existence in the dictionary.

Let’s start with the data structure. We started with a simple dictionary, then added a dictionary of all (prefix, suffix) pairs. E.g., for the word "world" we can generate the pairs ("w", "orld"), ("wo", "rld") and so on, or alternatively we can use a Dict of Dicts and present it in the form ["w"]["orld"], ["wo"]["rld"]. How can this be generalized to higher-order transformations (the edit2 function in the original code)? More or less obviously, we can generate all possible triplets: ["w"]["o"]["rld"], ["wo"]["r"]["ld"] and so on. Repeating this procedure, we finally arrive at the well-known trie structure, where "world" is represented as the sequence ['w']['o']['r']['l']['d'].
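To make this concrete, here is a minimal trie sketch in Julia (the `TrieNode` struct and `addword!` helper are illustrative names of mine, not the ones used in the actual code):

```julia
# Minimal trie sketch: each node maps a character to a child node and stores
# the probability of the word ending here (0.0 if no word ends here).
mutable struct TrieNode
    children::Dict{Char, TrieNode}
    prob::Float64
    TrieNode() = new(Dict{Char, TrieNode}(), 0.0)
end

function addword!(root::TrieNode, word::AbstractString, prob::Real)
    node = root
    for c in word
        node = get!(TrieNode, node.children, c)  # create the child on demand
    end
    node.prob = prob
    return root
end

# Toy dictionary used in the examples below.
root = TrieNode()
for (w, p) in [("world", 0.01), ("word", 0.02), ("sword", 0.001)]
    addword!(root, w, p)
end
```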

Now we should evolve our search procedure to make efficient use of the newly built trie dictionary. One way is to build the Levenshtein distance matrix, pruning paths that deviate too far from the source word. Instead, we can apply our principle of generating new modifications of the source word. For example, if we want to check whether the word "worrld" can be corrected, we can do the following:

  1. split "worrld" to "wor r ld"
  2. verify the existence of the prefix “wor” in the dictionary
  3. skip letter “r”
  4. verify that “ld” exists in subtrie

It looks rather simple and generalizes easily to N = 2, 3 and so on, and to different kinds of operations; for example, a transposition at position i is just a validation that the corresponding subtrie starts with the sequence word[i+1], word[i]. A sketch of both checks follows.
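Here is roughly what those two checks look like on top of the hypothetical `TrieNode` above (a sketch of mine, assuming ASCII words so that plain integer string indexing is safe):

```julia
# Follow the characters in `chars` down from `node`;
# return the reached node, or `nothing` if the path breaks off.
function walk(node::TrieNode, chars)
    for c in chars
        node = get(node.children, c, nothing)
        node === nothing && return nothing
    end
    return node
end

# Deletion at position i: the prefix word[1:i-1] must exist in the trie,
# and the suffix word[i+1:end] must exist in the reached subtrie.
function exists_with_deletion(root::TrieNode, word::String, i::Int)
    node = walk(root, word[1:i-1])           # e.g. "wor"
    node === nothing && return false
    node = walk(node, word[i+1:end])         # skip word[i], then e.g. "ld"
    return node !== nothing && node.prob > 0
end

# Transposition at position i: the subtrie reached by the prefix must
# continue with the swapped pair word[i+1], word[i].
function exists_with_transposition(root::TrieNode, word::String, i::Int)
    node = walk(root, word[1:i-1])
    node === nothing && return false
    node = walk(node, (word[i+1], word[i]))  # the swapped pair
    node === nothing && return false
    node = walk(node, word[i+2:end])
    return node !== nothing && node.prob > 0
end

exists_with_deletion(root, "worrld", 4)  # true: skipping the extra 'r' yields "world"
```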

In the end, we have the following algorithm (sketched in code after the list):

  1. Fix the number of allowed mutations of the source word.
  2. At each step, either step forward synchronously in the source word and the subtrie, or generate a mutation of the source word, validate its existence in the subtrie, and decrease the number of available mutations by 1.
  3. If the number of available mutations is zero, verify the existence of the word’s suffix.
  4. To speed up the process, one may cache the last found word and update it with a new word if it has a higher probability.
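The real implementation is in NorvigsTrieSpellchecker.jl; the following is only a condensed sketch of mine in the same spirit, reusing the hypothetical `TrieNode` from above. Transposition is omitted for brevity, and unlike Norvig’s original it simply maximizes probability within the budget instead of preferring smaller edit distances first:

```julia
# Recursive search with a mutation budget. At each step we either match the
# next character exactly, or spend one mutation (deletion, insertion or
# substitution), keeping the highest-probability word found so far in `best`.
function search!(best::Ref{Tuple{Float64,String}}, node::TrieNode,
                 word::String, i::Int, budget::Int, prefix::String = "")
    if i > lastindex(word)
        # Whole source word consumed: record the word ending here, if any.
        node.prob > best[][1] && (best[] = (node.prob, prefix))
    elseif haskey(node.children, word[i])
        # Synchronous step: same character in the source word and the trie.
        search!(best, node.children[word[i]], word, i + 1, budget, prefix * word[i])
    end
    budget == 0 && return            # no mutations left: only exact steps allowed
    # Deletion: skip word[i] without moving in the trie.
    i <= lastindex(word) && search!(best, node, word, i + 1, budget - 1, prefix)
    for (c, child) in node.children
        # Insertion: consume a trie character without advancing in the word.
        search!(best, child, word, i, budget - 1, prefix * c)
        # Substitution: replace word[i] with c.
        if i <= lastindex(word) && c != word[i]
            search!(best, child, word, i + 1, budget - 1, prefix * c)
        end
    end
end

best = Ref((0.0, ""))
search!(best, root, "worrld", 1, 1)  # allow a single mutation
best[]                               # (0.01, "world") with the toy dictionary above
```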

It turns out that such an algorithm is rather easy to implement (NorvigsTrieSpellchecker.jl), and it has surprisingly good performance:

```
> julia norvigs.jl

0.748 of 270 correct (0.056 unknown) at 5382.885 words per second
  0.426627 seconds (1.21 M allocations: 52.142 MiB, 7.34% gc time)
0.675 of 400 correct (0.108 unknown) at 3664.988 words per second
  0.109498 seconds (475.77 k allocations: 14.612 MiB, 33.29% gc time)
Benchmark testset1, average speed: 5929.07 words per second
Benchmark testset2, average speed: 5665.096 words per second
```

Actually, the speed became so high that I hit the garbage collector, so I’ve added BenchmarkTools estimates of the function time.

As a side note, what I like about this approach is that, with a small adaptation, it can produce the set of words that are exactly distance d away from the source word, whereas most algorithms that can easily be googled generate the set of words that are `<= d` away from the source word.

As for further speedups, one can try at least two approaches (a sketch of the first follows the list):

  1. Store in each trie node the maximum probability over all leaf words in its subtrie, and early-stop paths whose maximum probability is lower than the probability of the already-found word.
  2. Since this is a tree-traversal problem, maybe some version of the A* algorithm can be adapted to choose paths efficiently.
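A tiny sketch of the first idea, again on top of the hypothetical `TrieNode` (an IdDict side table keeps the struct unchanged):

```julia
# Precompute, for every node, the maximum probability over all words in its
# subtrie. During search!, a whole branch can then be skipped as soon as its
# ceiling table[child] drops below the probability of the best word so far.
function max_leaf_prob!(table::IdDict{TrieNode,Float64}, node::TrieNode)
    m = node.prob
    for child in values(node.children)
        m = max(m, max_leaf_prob!(table, child))
    end
    table[node] = m
    return m
end

table = IdDict{TrieNode,Float64}()
max_leaf_prob!(table, root)
table[root]  # 0.02 for the toy dictionary: the probability ceiling of the whole trie
```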
16 Likes