For the reasons explained in the previous post, let’s focus first on `edits1`, `edits2` and `known`, trying to avoid building large arrays and removing duplicates.
This PR is an attempt at fixing the most obvious issues:
https://github.com/dunefox/spellcheck-norvig/pull/1
Let’s focus on `edits1` first since it’s the innermost function. From our vantage point focusing on `collect` and `unique`, `edits1` can be described by 3 main stages (sketched in code below):
- Comprehensions are used to build several arrays containing the results of different types of modifications applied to one word. Each of these comprehensions has to allocate some space to store its results.
- All subarrays are concatenated into one big array, which means more allocations and data copies.
- Duplicates are removed from the big array.
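For reference, here is a rough sketch of what the comprehension-based version looks like; `edits1_comprehensions` is a stand-in name (to avoid clashing with the rewrite below), and details may differ slightly from the repository code:

```julia
# Rough paraphrase of the comprehension-based edits1 under discussion;
# not the exact repository code
const letters = "abcdefghijklmnopqrstuvwxyz"  # as in Norvig’s original

function edits1_comprehensions(word)
    splits     = [(word[1:i], word[i+1:end]) for i in 0:length(word)]
    deletes    = [L * R[2:end]               for (L, R) in splits if R != ""]
    transposes = [L * R[2] * R[1] * R[3:end] for (L, R) in splits if length(R) > 1]
    replaces   = [L * c * R[2:end]           for (L, R) in splits for c in letters if R != ""]
    inserts    = [L * c * R                  for (L, R) in splits for c in letters]
    unique([deletes; transposes; replaces; inserts])
end
```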
Let’s instead try to build the final collection from scratch, only adding candidates that aren’t already present. Such a collection is essentially a set of unique elements, so let’s use a `Set` data structure for that. Instead of comprehensions, we iterate with plain `for` loops (a side benefit is that some of the loops can be shared), build the candidates, and add them to the set. Thanks to the way sets work, duplicates are skipped automatically.
```julia
function edits1(word)
    # All (prefix, suffix) splits of the word
    splits = Set([("", word)])
    for i in 1:length(word)
        push!(splits, (word[1:i], word[i + 1:end]))
    end

    edits = Set(String[])
    for (L, R) in splits
        if R != ""
            # Deletion of the first character of R
            push!(edits, L * R[2:end])
            # Replacement of the first character of R
            for c in letters
                push!(edits, L * c * R[2:end])
            end
        end
        # Transposition of the first two characters of R
        length(R) > 1 && push!(edits, L * R[2] * R[1] * R[3:end])
        # Insertion of a character between L and R
        for c in letters
            push!(edits, L * c * R)
        end
    end
    edits
end
```
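As a quick sanity check, the set-based version should produce exactly the same candidates as the comprehension-based sketch above (a hypothetical test, assuming both versions are in scope):

```julia
# Both implementations should agree on the candidate set
word = "speling"
@assert edits1(word) == Set(edits1_comprehensions(word))
```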
Now, `edits2`. This function uses a big comprehension (paraphrased below) to:
- compute the set of edits produced by `edits1` on the given word,
- combine all sets of edits produced by calling `edits1` on each word in the previous set.
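In other words, something like this (again a paraphrase built on the comprehension-based sketch above, not the exact repository code):

```julia
# Rough paraphrase of the comprehension-based edits2
edits2_comprehensions(word) =
    unique([e2 for e1 in edits1_comprehensions(word) for e2 in edits1_comprehensions(e1)])
```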
Again, the idea would be to build this set from scratch, rather than building several subsets, concatenating them, and removing duplicates. We can modify `edits1` so that it can add to an existing set instead of building a brand new one:
```julia
# This keeps the same semantics as before
edits1(word) = edits1!(Set(String[]), word)

# This one can modify an existing set
function edits1!(edits, word)
    # pushes new candidates to the provided set
    edits
end
```
`edits2` can then be rewritten as:
```julia
function edits2(word)
    # this set will be expanded by each call to edits1!
    edits = Set(String[])
    for e1 in edits1(word)
        edits1!(edits, e1)
    end
    edits
end
```
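The payoff is that duplicates across the inner `edits1` calls are never materialized. Once `edits1!` is fully implemented, a quick (hypothetical) measurement illustrates how much overlap there is:

```julia
# Count candidates across all inner calls (duplicates across calls included),
# and compare to the number of unique candidates edits2 actually keeps
word = "speling"
total = sum(length(edits1(e1)) for e1 in edits1(word))
dedup = length(edits2(word))
println("$total raw candidates, $dedup unique")
```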
Finally, `known`. Again, all this function does is use a comprehension to:
- build a new set of words (more allocations),
- filter out unknown words (again evidencing that we previously stored too many candidates).
We could easily tell `edits1` to filter unknown words out and not add them in the first place. Let’s add a filtering option to `edits1!`:
```julia
# The default filter always returns true
# => no change in behaviour by default
function edits1!(edits, word, filter=w->true)
    # [...]
    # use this instead of push!(edits, ...):
    # candidates are added only if the predicate is true
    push_word(w) = filter(w) && push!(edits, w)
    for (L, R) in splits
        # [...]
        length(R) > 1 && push_word(L * R[2] * R[1] * R[3:end])
        # [...]
    end
    edits
end
```
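For example, with a hypothetical predicate that only keeps candidates starting with an "s" (once the elided loop body is filled in with `push_word` calls):

```julia
# Hypothetical filter: keep only candidates starting with "s"
starts_with_s(w) = startswith(w, "s")
edits1!(Set(String[]), "speling", starts_with_s)
```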
We can define a predicate that tests whether a word is known (a set of all known words is built beforehand).
```julia
# `countmap` presumably comes from StatsBase;
# `words` is the tokenizer from the original code
const WORDS = countmap(words(read(open("./big.txt"), String)))
const KNOWN = Set(keys(WORDS))
known(word) = word in KNOWN
```
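Assuming `big.txt` is Norvig’s corpus, a quick REPL check should look something like this:

```julia
julia> known("the")   # extremely common word, certainly in the corpus
true

julia> known("qzxzq") # not an English word
false
```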
This predicate can be used to filter results from `edits1`. For example:
```julia
function edits2(word)
    edits = Set(String[])
    for e1 in edits1(word)
        edits1!(edits, e1, known)
    end
    edits
end
```
Note that the first-level call to `edits1` remains unfiltered: an unknown intermediate edit can still be one edit away from a known word, so filtering too early would lose valid candidates. Putting all this together, we get the code proposed in the PR, which yields the following performance:
```julia
julia> using BenchmarkTools

julia> @btime spelltest($(testset(readlines(open("spell-testset1.txt")))))
0.748 of 270 correct (0.056 unknown) at 95.094 words per second
0.748 of 270 correct (0.056 unknown) at 95.808 words per second
0.748 of 270 correct (0.056 unknown) at 95.282 words per second
0.748 of 270 correct (0.056 unknown) at 93.151 words per second
0.748 of 270 correct (0.056 unknown) at 94.579 words per second
0.748 of 270 correct (0.056 unknown) at 96.582 words per second
0.748 of 270 correct (0.056 unknown) at 95.515 words per second
  2.796 s (35667972 allocations: 1.29 GiB)
```
We’re still a factor of 2 short of @klaff’s most pessimistic bet. But there is still hope: a significant fraction of the time is now spent in string manipulations, which I am confident could be optimized further.