I’d like to announce a small, dependency-less new package, SuffixAutomata.jl. You can read about the algorithm here if you’re not familiar with it.
The big draws of this algorithm are:
- allows fast online updates
- very fast
O(m)
checking for the presence of a pattern/substring in the automaton, wherem
is the length of the search pattern/substring. So even if you have millions or billions of characters or tokens in your text (or whatever sort of collection your automaton represents), checking for the existence of a small pattern or substring is almost instantaneous.
This algorithm is traditionally used for strings, but it doesn’t need to be. A draw of my implementation is that a SuffixAutomaton
can operate on a Vector{T}
where T
is any iterable type. For example, I’m going to be using it for pattern detection in streaming audio data (which is discretized using another package that I have under development, SymbolicDiscretizers.jl).
In coming days and weeks I will address some shortcomings:
- make it more suitable for working with big out-of-core and streaming data
- the current
findall
implementation is pretty fast, but not super fastO(m)
likeoccursin
is; I’ll see what I can do to improve that - allow optimized representation for small, fixed vocabularies
A quick toy demo (lcs
stands for longest common substring):
a = SuffixAutomaton("ababc")
typeof(a) # SuffixAutomaton{Char}
eltype(a) # Char
occursin("ab", a) # true
findall("ab", a) # [1:2, 3:4]
lcs("abcdef", a) # ("abc", 1)
append!(a, "def")
lcs("abcdef", a) # ("abcdef", 1)