Refactoring a search through a time-series

I’ve written a short script to search a time-series for a specific sub-sequence, and I’m having trouble refactoring the code. I’ve linked to the question on Code Review StackExchange if someone wants to do a complete review. However, I was hoping I could get some general refactoring advice here for this type of code smell.

Yours is a non-trivial refactoring question. Perhaps Chapter 12 (Optimization) may be of help, not for optimizing your problem but for philosophically reconsidering it.

That’s fair, and it has been something in the back of my mind this whole time. My personal reason for posting this is that it’s actually a reduced example of a library which has this pattern of looping and searching replicated/copy-pasted across several functions at this point. I’d really like to generalize the iteration pattern and re-use it somehow, maybe with these “macros” which I don’t understand very clearly?

I did not give your code a very close reading, so it is possible I missed something, but I would do the following:

  1. write a function that returns the subsequences (using view) that the original one breaks up to,
  2. write two functions that normalize a subsequence, one by the mean, one by the standard deviation,
  3. write a function that does the comparisons (see the sketch below).
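
To make the three steps concrete, here is a minimal sketch, assuming a squared-Euclidean comparison; the names (subsequences, normalize_mean, normalize_std, best_match) are mine, not from the linked code:

using Statistics  # mean, std

# 1. Return the overlapping length-m windows as views (no copying).
subsequences(data, m) = (view(data, k:k+m-1) for k in 1:length(data)-m+1)

# 2. Two ways to normalize a subsequence.
normalize_mean(x) = x .- mean(x)
normalize_std(x)  = (x .- mean(x)) ./ std(x)

# 3. Compare the (normalized) query against every window.
function best_match(data, query; normalize = normalize_mean)
    q = normalize(query)
    best_k, best_d = 0, Inf
    for (k, w) in enumerate(subsequences(data, length(query)))
        d = sum(abs2, normalize(w) .- q)   # squared Euclidean distance
        if d < best_d
            best_k, best_d = k, d
        end
    end
    best_k, best_d
end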

Macros do syntax transformations; I don’t think you need them here.


That’s a very healthy attitude! I would not recommend macros for this. The way I usually solve it is to have a single function with your iteration pattern, and extract any usage-specific code to functions that you pass as arguments. Example:

# Applies op to consecutive pairs and sums the results; assumes an even-length vector.
function pairwise_iteration(op, vector::AbstractVector)
    s = zero(eltype(vector))   # type-stable accumulator
    for i = 1:2:length(vector)
        s += op(vector[i], vector[i+1])
    end
    s
end

Which can be called like this:

julia> pairwise_iteration(+, [1,2,3,4])
10

julia> pairwise_iteration(*, [1,2,3,4])
14

julia> my_op(a, b) = a^2 + b^3
my_op (generic function with 1 method)

julia> pairwise_iteration(my_op, [1,2,3,4])
82

By the way, also note the zero(eltype(vector)) for type stability. I saw that you didn’t use this in your euc_dist function. It’s not required, but it may improve performance.
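
For illustration (this is a hypothetical sketch of the pattern, not your actual euc_dist): starting the accumulator with a literal 0 makes it an Int that switches type on the first Float64 addition, whereas zero(eltype(v)) keeps one concrete type throughout.

function sum_of_squares(v::AbstractVector)
    s = zero(eltype(v))   # e.g. 0.0 for a Vector{Float64}; a literal 0 would be an Int
    for x in v
        s += x^2
    end
    s
end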

I ended up making an iterator, since I was repeatedly calculating the same view on an array.

"""
Duplicate data and indexing arithmetic to avoid
using a LIFO queue or negative indexing.
"""
mutable struct t_iter
    data::Vector{Float64}   # the full series
    t::Vector{Float64}      # duplicated buffer so the current window is always contiguous
    length::Int             # length of data
    q_len::Int              # query (window) length

    function t_iter(data::Vector{Float64}, q_len::Int)
        return new(data, zeros(Float64, 2*length(data)), length(data), q_len)
    end
end


function Base.iterate(data::t_iter, d_i=1)
    if d_i > data.length   # past the last element
        return nothing
    end

    dat = data.data[d_i]

    # Write the value twice, q_len apart, so that any window of the last
    # q_len values is contiguous in data.t.
    t_idx = ((d_i - 1) % data.q_len) + 1
    data.t[t_idx] = dat
    data.t[t_idx + data.q_len] = dat

    return ((d_i, dat, data.t), d_i+1)
end
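
A minimal usage sketch (the series and window length here are made up, not from the original code); each step yields the index, the current value, and the duplicated buffer:

it = t_iter([1.0, 2.0, 3.0, 4.0, 5.0], 3)
for (d_i, value, buf) in it
    # buf holds each recent value twice, q_len apart, so the latest window is contiguous
    println((d_i, value))
end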

Is IterTools.partition what you need? There is also Base.Iterators.partition, but I guess you need overlapping sliding windows. Note that IterTools.partition returns tuples while Base.Iterators.partition returns views. If your sub-sequence is small enough, IterTools.partition sounds like the way to go. (It would be nice if Base.Iterators.partition had the step argument as in IterTools.partition…)
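
A quick illustration of the difference, with small made-up data (the outputs in the comments are approximate):

using IterTools

data = [1, 2, 3, 4, 5]

# Overlapping windows of length 3 with step 1; each window is a Tuple.
collect(IterTools.partition(data, 3, 1))
# => [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

# Base's version has no step argument, so the chunks never overlap.
collect(Base.Iterators.partition(data, 3))
# => [[1, 2, 3], [4, 5]]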


Maybe? I’m still not at the point with my code (which is adapted from a C library) where I can confidently hold all the operations in my head.

Just in case you don’t know this approach, here is a simple way to handle your ignore_bias using the convolution product.

Let m = length(query), N = length(data), and D[k] = data[k:k+m-1]. Let Q be the normalized (zero-mean) query and DM[k] the constant vector such that the mean of D[k] - DM[k] is zero. Then you need to compute, for each k, the following:

obj[k] = (D[k] - DM[k] - Q)^2 = D[k]^2 + DM[k]^2 + Q^2 - 2<D[k], DM[k]> - 2<D[k], Q> + 2<DM[k], Q>

Now <DM[k], Q> = 0 for all k (Q has zero mean), and <D[k], DM[k]> = DM[k]^2 = m*mean(D[k])^2, and hence your objective function is

obj[k] = D[k]^2 - m*mean(D[k])^2 + Q^2 - 2<D[k], Q>

Now Q^2 is independent of the offset, <D[k], Q> = convolution(D, Q)[k], and both D[k]^2 and mean(D[k]) can be computed for all k in O(N) with running sums.
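
A rough sketch of this formula in plain Julia (the function and variable names are mine); the dot call is the part you would replace with an FFT-based convolution for large m:

using LinearAlgebra   # dot

function ignore_bias_objective(data::Vector{Float64}, query::Vector{Float64})
    m, N = length(query), length(data)
    Q  = query .- sum(query) / m   # zero-mean ("normalized") query
    Q2 = sum(abs2, Q)              # Q^2, independent of the offset k
    obj = zeros(N - m + 1)
    for k in 1:N-m+1
        D = view(data, k:k+m-1)    # the k-th window D[k]
        μ = sum(D) / m
        # D[k]^2 - m*mean(D[k])^2 + Q^2 - 2<D[k], Q>
        obj[k] = sum(abs2, D) - m * μ^2 + Q2 - 2 * dot(D, Q)
    end
    obj
end

The per-window sums sum(D) and sum(abs2, D) can be maintained incrementally to keep that part O(N).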

Computing the convolution: FastConv.jl (https://github.com/aamini/FastConv.jl, “Fast Convolutions in Julia”) might work, if you locally merge the PRs that fix it for Julia 1.0. Otherwise, someone here will find you a package that does the job (I am not up to date with that part of the ecosystem).

If m is large, you win a complexity class (via the fast Fourier transform for the convolution); if m is small, you still win, because somebody else has written fast SIMD/GPU code for convolutions with small kernels.
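
As one possible way to get the convolution today, here is a sketch with made-up data using DSP.jl’s conv; the offset k + m - 1 comes from flipping the kernel:

using DSP   # conv

data  = randn(10_000)
query = randn(50)
m = length(query)
Q = query .- sum(query) / m          # zero-mean query, as above

# dot(data[k:k+m-1], Q) for every k in one shot:
# <D[k], Q> == conv(data, reverse(Q))[k + m - 1]
cross = conv(data, reverse(Q))[m:end-m+1]   # length(data) - m + 1 values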

Homework exercise: Represent ignore_scale via convolutions.


I think this is what is done in MASS (Mueen’s Algorithm for Similarity Search). I had forgotten about it. Thank you for reminding me.