[ANN] SuffixAutomata.jl for fast, online updating and querying of generic collections

I’d like to announce a small, dependency-less new package, SuffixAutomata.jl. You can read about the algorithm here if you’re not familiar with it.

The big draws of this algorithm are:

  • allows fast online updates
  • very fast O(m) checking for the presence of a pattern/substring in the automaton, where m is the length of the search pattern/substring. So even if you have millions or billions of characters or tokens in your text (or whatever sort of collection your automaton represents), checking for the existence of a small pattern or substring is almost instantaneous.

This algorithm is traditionally used for strings, but it doesn’t need to be. A draw of my implementation is that a SuffixAutomaton can operate on a Vector{T} where T is any iterable type. For example, I’m going to be using it for pattern detection in streaming audio data (which is discretized using another package that I have under development, SymbolicDiscretizers.jl).

In coming days and weeks I will address some shortcomings:

  • make it more suitable for working with big out-of-core and streaming data
  • the current findall implementation is pretty fast, but not super fast O(m) like occursin is; I’ll see what I can do to improve that
  • allow optimized representation for small, fixed vocabularies

A quick toy demo (lcs stands for longest common substring):

a = SuffixAutomaton("ababc")
typeof(a)                    # SuffixAutomaton{Char}
eltype(a)                    # Char

occursin("ab", a)            # true
findall("ab", a)             # [1:2, 3:4]
lcs("abcdef", a)             # ("abc", 1)

append!(a, "def")
lcs("abcdef", a)             # ("abcdef", 1)
3 Likes