[ANN] SymbolicApproximators.jl: methods for symbolic time series discretization

Announcing SymbolicApproximators.jl, a Julia package implementing Symbolic Aggregate approXimation (SAX) and related methods for symbolic discretization and dimension reduction. These techniques let you convert continuous-valued sequences (including, but not limited to, time series) into compact discrete representations, so you can then apply string algorithms and machine learning methods designed for categorical data.

(Note: registration has a couple of hours left to complete, so you won’t be able to install it from the General Registry just yet.)

So far I’ve implemented only a few algorithms; more will follow soon. I’ve aimed for this to be the fastest available implementation of SAX and SAX-family algorithms, and some informal tests suggest it may be: basic SAX usage appears to be about 8x faster than R’s jmotif package and more than 3000x faster than Python’s saxpy.
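If you want to sanity-check the speed on your own data, a minimal timing sketch with BenchmarkTools might look like this (the parameters here are arbitrary, and the numbers will of course vary by machine):

using SymbolicApproximators, BenchmarkTools, Statistics

# time encoding of a z-normalized random series; parameters are arbitrary
x = randn(10_000)
x = (x .- mean(x)) ./ std(x)
sax = SAX(100, 10)   # word size 100, alphabet size 10
@btime encode($sax, $x)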

Here’s a quick demo. (See here for the full code.) If we have two continuous signals, let’s say a sine wave and a cosine wave, we can use this package to:

  • convert them to reduced-size discrete representations
    • specifically, here we’ll use the symbols a through j and a “word” size of 50
  • pass those reduced series into a suffix automaton structure, with which we can compute the longest common substring (lcs) between them

using SymbolicApproximators
using SuffixAutomata
using StatsBase
using GLMakie

signal1 = (sin.(range(0, 4π, length=100)) |> x -> (x .- mean(x)) ./ std(x))
signal2 = (cos.(range(0, 4π, length=100)) |> x -> (x .- mean(x)) ./ std(x))

sax = SAX(50, 10)  # we'll reduce with SAX to a "word" size of 50, alphabet size of 10
word1 = encode(sax, signal1)
word2 = encode(sax, signal2)
string1 = join(values(word1))
string2 = join(values(word2))

# find the longest common substring
automaton = SuffixAutomaton(string2)
pattern, position = lcs(string1, automaton)

You can also use these methods to calculate distances between approximated time series. Another demo here shows that the SAX distance approaches the true Euclidean distance of the original time series as you increase the word size and alphabet size parameters.
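In sketch form, the idea is something like the following. Note that sax_dist here is just an illustrative name for the lower-bounding SAX distance, not necessarily the package’s exported API; see the linked demo for the real code:

using SymbolicApproximators, Statistics, LinearAlgebra

# Sketch only: `sax_dist` is a placeholder name for the package's
# lower-bounding SAX distance function.
t = range(0, 4π, length=256)
s1 = (sin.(t) .- mean(sin.(t))) ./ std(sin.(t))
s2 = (cos.(t) .- mean(cos.(t))) ./ std(cos.(t))

true_dist = norm(s1 - s2)   # Euclidean distance between the originals
for (w, a) in [(8, 4), (32, 8), (128, 16)]
    sax = SAX(w, a)
    d = sax_dist(sax, encode(sax, s1), encode(sax, s2))  # illustrative call
    println("word=$w, alphabet=$a: sax distance = $d (true: $true_dist)")
end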


Hi, and thank you very much for the package.

If I may ask, I am wondering:

If we have two continuous signals, let’s say a sine wave and a cosine wave […]

Shouldn’t there be two identical LCSs in the 100-long original signal data?

Taking this opportunity, may I ask for a recommendation: assuming I have very long signals, two or even more of them, is it possible to use the package to find, say, the 10 longest common substrings, and to return their starting and ending points (times) in the original time series?

Shouldn’t there be two identical LCSs in the 100-long original signal data?

The two patterns are offset from each other by an amount that’s not necessarily a multiple of the segment size, so the elements of the resulting words don’t completely align. If, on the other hand, you were to deliberately make the lengths of the time series and the words all multiples of 4, like this:

signal1 = (sin.(range(0, 4π, length=256)) |> x -> (x .- mean(x)) ./ std(x))
signal2 = (cos.(range(π/2, 4π+π/2, length=256)) |> x -> (x .- mean(x)) ./ std(x))  # Offset by π/2
sax = SAX(32, 7)

Then you do get practically the whole shape matching as the longest common substring:

(Maybe this version would have been better for the demo. I will probably update it later.)

As for your other question, about finding the 10 longest common substrings: as a first step you could certainly use this package, SymbolicApproximators, to generate discretized versions of the time series at whatever resolution you feel is appropriate (or you could try several). Then, although I also wrote the SuffixAutomata package demonstrated here (which does the actual job of finding the lcs), I’m not sure it’s the best choice for your task; I don’t think there’s a trivial way to find the top n common substrings with it.

But what you could do is something like this: slide a window over the second series and iteratively find the lcs:

using SuffixAutomata

# two "time series" with common patterns
ts1 = "the quick brown fox jumps over the lazy dog"
ts2 = "my quick brown cat and the lazy fox went over the hill"

# build automaton from first series
sa = SuffixAutomaton(ts1)

# slide windows through second series and find matches
for window_size in [20, 15, 10], start in 1:5:length(ts2)-window_size
    window = ts2[start:start+window_size-1]
    match, pos_in_window = lcs(window, sa)
    if !isnothing(match) && length(match) >= 5
        pos_in_ts1 = findall(match, sa)[1]  # Get first occurrence in ts1
        pos_in_ts2 = start + pos_in_window - 1
        println("'$(match)' found at ts1[$pos_in_ts1] and ts2[$pos_in_ts2]")
    end
end

I’m not certain that this is the best way to go for you. But the point of this package is that it gives you a discrete representation of your time series, which opens up whole classes of algorithms that would otherwise be unavailable for continuous data. So there’s probably a better route than the suffix-automaton lcs I demonstrated; I just don’t yet know what it is.

Thank you very much for providing additional information. I noticed all the changes. I’m sorry for the slight delay in replying. I wanted to take some time to gather relevant papers and literature on the topic. Although my available time was limited, I was able to develop a somewhat better understanding of the subject, though I am still very much at a beginner level. That said, I find this package really interesting and I’m looking forward to understanding it better. At this point, if I may, I would like to ask three additional questions:

i) I noticed on the GitHub page that the roadmap includes support for streaming. May I ask how you are planning to implement this functionality?

ii) Do you think there is potential for this package to work in tandem with TransitionsInTimeseries.jl?

iii) Do you think it might make sense to have the word size, alphabet size, or even the rolling window adjust in a more dynamic way, for example, with more automation, to better capture dynamical behaviors with variable timing?

And one more thought: would you consider presenting SymbolicApproximators.jl at one of the JuliaDynamics monthly meetings? I think it would be a great chance for me (and probably many others in the Julia community) to learn more about this topic.

Thanks! I think it’s very interesting too; I didn’t know about these algorithms until recently.

Re: streaming functionality. I’m not entirely sure yet, but it’s next on my to-do list, and it’s important because it’s exactly the reason I made this package: I need it to detect recurring patterns in a live audio stream. I will probably finish working that out over the next few days.

You could already use it for streaming data in the package’s current state, although not optimally. For example, with IterTools:

using SymbolicApproximators, IterTools, Statistics

signal = sin.(range(0, 20π, length=2000))
sax = SAX(10, 5)

# stream through subsequences of length 200, stride 50
for win in partition(signal, 200, 50)
    normed = (win .- mean(win)) ./ std(win, corrected=false)
    word = encode(sax, normed)
    @show join(values(word))
end

Or with a CircularBuffer:

using SymbolicApproximators, DataStructures, Statistics

buf = CircularBuffer{Float64}(200)   # subsequence length = 200
sax = SAX(10, 5)

for _ in 1:1000   # simulate 1000 incoming samples
    push!(buf, randn())
    if length(buf) == 200
        normed = (buf .- mean(buf)) ./ std(buf, corrected=false)
        word = encode(sax, normed)
        println(join(values(word)))   # join the word's symbols into a string, as above
    end
end

What’s less than optimal here is cache locality when generating a great number of Words. I haven’t yet fully thought through how that’s going to work, but I think it will involve introducing a mutating encode!() method that lets you pass in a big preallocated container (maybe a Matrix, maybe a CircularBuffer) to store the actual data from the approximation/discretization; the Word structs themselves will then each just hold a view into their respective data in the container.
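Roughly, I’m imagining something like this (purely hypothetical: encode! doesn’t exist yet, and the eventual signature may differ):

using SymbolicApproximators, Statistics

# Hypothetical sketch of the planned encode!() API; nothing here is implemented yet.
signal = sin.(range(0, 20π, length=2000))
wordsize, winlen, stride = 10, 200, 50
nwin = (length(signal) - winlen) ÷ stride + 1

out = Matrix{Float64}(undef, wordsize, nwin)   # one preallocated column per window
sax = SAX(wordsize, 5)

for j in 1:nwin
    win = @view signal[(j-1)*stride+1 : (j-1)*stride+winlen]
    normed = (win .- mean(win)) ./ std(win, corrected=false)
    encode!(view(out, :, j), sax, normed)      # hypothetical mutating method
end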

TransitionsInTimeseries.jl looks interesting and similar in focus to this package, I will take a closer look.

Re: flexible parameters, e.g. for dynamical behaviors: yes, I think so. For one thing, some of the algorithms are adaptive or flexible, either in segment width or in alphabet size (e.g. the iSAX algorithm, which I will be implementing next). I haven’t finished reading the literature on these yet, so there’s some work to do, but I believe the package’s current structure can accommodate them.

Basically, my goal is to implement the basic framework and operations of these algorithms in a domain-agnostic manner, and then I hope you will have enough flexibility to adapt it to your use case. I think I’m just about there. Once I implement the mutating encode!() and support for iSAX, I will make some demos that should clarify a lot of this, especially the streaming part.

Yes, I would be happy to present. Feel free to message me here, or let me know when the next one is.
