Refactoring a search through a time-series


#1

I’ve written a short script to search a time-series for a specific sub-sequence, and I’m having trouble refactoring the code. I’ve linked to the question on Code Review Stack Exchange if someone wants to do a complete review. However, I was hoping I could get some general refactoring advice here for this type of code smell.


#2

Yours is a non-trivial refactoring question. Perhaps http://www.catb.org/esr/writings/taoup/html/optimizationchapter.html may be of help, not for optimizing your problem but for philosophically reconsidering it.


#3

That’s fair, and it has been in the back of my mind this whole time. My personal reason for posting is that this is actually a reduced example from a library which has this pattern of looping and searching replicated/copy-pasted across several functions at this point. I’d really like to generalize the iteration pattern and re-use it somehow? Maybe with these “macros”, which I don’t understand very clearly?


#4

I did not give your code a very close reading so it is possible I missed something, but I would do the following:

  1. write a function that returns the subsequences (using view) that the original one breaks up into,
  2. write two functions that normalize a subsequence, one by the mean, one by the standard deviation,
  3. write a function that does the comparisons (roughly sketched below).
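
A rough sketch of that decomposition could look like this; the names and the z-normalized Euclidean comparison are placeholders of mine, not the original code:

using Statistics

# overlapping windows of length m, as views into the original data
subsequences(data, m) = (view(data, k:k+m-1) for k in 1:length(data)-m+1)

# the two normalizations, applied to a single subsequence
demean(x)      = x .- mean(x)
standardize(x) = demean(x) ./ std(x)

# the comparison: squared Euclidean distance between a window and the query
sqdist(x, q) = sum(abs2, x .- q)

# putting the pieces together
best_offset(data, query) =
    argmin([sqdist(standardize(w), standardize(query)) for w in subsequences(data, length(query))])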

#5

Macros do syntax transformations, I don’t think you need them here.


#6

That’s a very healthy attitude! I would not recommend macros for this. The way I usually solve it is to have a single function with your iteration pattern, and extract any usage-specific code to functions that you pass as arguments. Example:

function pairwise_iteration(op, vector::AbstractVector)
    s = zero(eltype(vector))            # type-stable accumulator
    for i = 1:2:length(vector)          # assumes the vector has even length
        s += op(vector[i], vector[i+1]) # apply the caller-supplied operation to each pair
    end
    s
end

It can be called like this:

julia> pairwise_iteration(+, [1,2,3,4])
10

julia> pairwise_iteration(*, [1,2,3,4])
14

julia> my_op(a, b) = a^2 + b^3
my_op (generic function with 1 method)

julia> pairwise_iteration(my_op, [1,2,3,4])
82

Btw, also note the zero(eltype(vector)) for type stability; I saw that you didn’t use this in your euc_dist function. It’s not required, but it may improve performance.
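
To illustrate what that buys you (a made-up example, not taken from your code): with an integer literal the accumulator changes type on the first addition over a Float64 vector, which the compiler has to account for.

# accumulator starts as Int and becomes Float64 inside the loop: type-unstable
function sum_unstable(v)
    s = 0
    for x in v
        s += x
    end
    s
end

# accumulator has the element type of v from the start: type-stable
function sum_stable(v)
    s = zero(eltype(v))
    for x in v
        s += x
    end
    s
end

Running @code_warntype sum_unstable(rand(10)) highlights s::Union{Float64, Int64} in the first version, while sum_stable infers a plain Float64.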


#7

I ended up making an iterator, since I was repeatedly calculating the same view on an array.

"""
Duplicate data and indexing arithmetic to avoid
using a LIFO queue or negative indexing.
"""
mutable struct t_iter
    data::Vector{Float64}   # the full series being scanned
    t::Vector{Float64}      # doubled ring buffer; each sample is written twice
    length::Int             # length of `data`
    q_len::Int              # window (query) length

    function t_iter(data::Vector{Float64}, q_len::Int)
        # only the first 2*q_len slots of `t` are ever written,
        # so zeros(Float64, 2*q_len) would also be enough
        return new(data, zeros(Float64, 2*length(data)), length(data), q_len)
    end
end


function Base.iterate(data::t_iter, d_i=1)
    if d_i > data.length    # stop after the final sample
        return nothing
    end

    dat = data.data[d_i]

    # write the sample into the ring buffer twice, q_len slots apart, so that
    # the last q_len samples are always contiguous somewhere in data.t
    t_idx = ((d_i - 1) % data.q_len) + 1
    data.t[t_idx] = dat
    data.t[t_idx + data.q_len] = dat

    return ((d_i, dat, data.t), d_i + 1)
end
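
A minimal usage sketch, assuming the doubled buffer is meant to be read as one contiguous window of the last q_len samples (the window arithmetic below is a guess at the intended use, not code from the original library):

data = rand(100)
q_len = 10

for (i, x, buf) in t_iter(data, q_len)
    i < q_len && continue                        # wait until a full window has been written
    t_idx = ((i - 1) % q_len) + 1
    window = view(buf, t_idx+1 : t_idx+q_len)    # the last q_len samples, ending at sample i
    # ... compare `window` against the query here
end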

#8

Is IterTools.partition what you need? There is also Base.Iterators.partition, but I guess you need overlapping sliding windows. Note that IterTools.partition returns tuples while Base.Iterators.partition returns views. If your sub-sequence is small enough, IterTools.partition sounds like the way to go. (It would be nice if Base.Iterators.partition had the step argument that IterTools.partition has…)
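
Roughly, the difference looks like this (assuming IterTools is installed):

using IterTools

# overlapping sliding windows of length 3 with step 1: yields tuples
collect(IterTools.partition(1:6, 3, 1))
# => (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)

# non-overlapping blocks, no step argument: yields ranges/views
collect(Iterators.partition(1:6, 3))
# => 1:3, 4:6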


#9

Maybe? I’m still not at the point with my code (which is adapted from a C library) where I can confidently hold all the operations in my head.


#10

Just in case you don’t know this approach, here is a simple way to compute your ignore_bias objective using the convolution product.

Let m=length(query), N=length(data) and D[k]=data[k:k+m-1]; let Q be the normalized (zero-mean) query and DM[k] the constant vector such that the mean of D[k]-DM[k] is zero. Then you need to compute, for each k,

obj[k] = (D[k] - DM[k] - Q)^2 = D[k]^2 + DM[k]^2 + Q^2 - 2<D[k],DM[k]> - 2<D[k],Q> + 2<DM[k],Q>

Now, <DM[k],Q> = 0 for all k (Q has zero mean) and <D[k],DM[k]> = DM[k]^2 (because D[k]-DM[k] has zero mean), and hence your objective function is

obj[k] = D[k]^2 - DM[k]^2 + Q^2 - 2<D[k],Q>

Now Q^2 is independent of the offset, <D[k],Q> = convolution(D,Q)[k], and both D[k]^2 and DM[k]^2 can be computed for all k in O(N) with running sums.

Computing the convolution: https://github.com/aamini/FastConv.jl might work, if you locally merge the PRs that fix it for Julia 1.0. Otherwise, someone here will find you a package that does the job (I am not up to date with that part of the ecosystem).

If m is large, then you win a complexity class (via the fast Fourier transform for the convolution); if m is small, you still win, because somebody else has written fast SIMD/GPU code for convolutions with small kernels.
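
A minimal sketch of that recipe, using DSP.jl’s conv in place of FastConv.jl and cumulative sums for the per-window statistics (the function name and details are placeholders, not the original ignore_bias):

using DSP, Statistics

function ignore_bias_objective(data::Vector{Float64}, query::Vector{Float64})
    m, N = length(query), length(data)
    Q = query .- mean(query)                      # zero-mean (normalized) query

    # <D[k], Q> for every offset k via a single convolution with the reversed query
    cross = conv(data, reverse(Q))[m:N]           # cross[k] == dot(data[k:k+m-1], Q)

    # running sums give D[k]^2 and DM[k]^2 for all k in O(N)
    csum  = cumsum(vcat(0.0, data))
    csum2 = cumsum(vcat(0.0, data .^ 2))
    wsum  = csum[m+1:end]  .- csum[1:N-m+1]       # sum of each window
    wsum2 = csum2[m+1:end] .- csum2[1:N-m+1]      # sum of squares of each window

    # obj[k] = D[k]^2 - DM[k]^2 + Q^2 - 2<D[k], Q>
    return wsum2 .- wsum .^ 2 ./ m .+ sum(abs2, Q) .- 2 .* cross
end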

Homework exercise: Represent ignore_scale via convolutions.


#11

I think this is what is done in MASS. I had forgotten about it. Thank you for reminding me.