I’ve written a short script to search a time-series for a specific sub-sequence. I’m having trouble refactoring the code. I’ve linked to question on Code Review StackExchange if someone wants to do a complete review. However, I was hoping I could get some general refactoring advice here for this type of code smell.
Yours is a non - trivial refactoring question. Perhaps Chapter 12. Optimization may be of help, not on optimizing your problem but on philosophically reconsidering it.
That’s fair and has been something in the back of mind this whole time. My personal reason for posting this is it’s actually a reduced example of a library which has this pattern of looping and searching replicated/copy-pasted across several functions at this point. I’d really like to generalize the iteration pattern and re-use it somehow? Maybe with these “macros” which I don’t understand very clearly?
I did not give your code a very close reading so it is possible I missed something, but I would do the following:
- write a function that returns the subsequences (using
view
) that the original one breaks up to, - write two functions that normalize a subsequence, one by the mean, one by the standard deviation,
- write a function that does the comparisons.
Macros do syntax transformations, I don’t think you need them here.
That’s a very healthy attitude! I would not recommend macros for this. The way I usually solve it is to have a single function with your iteration pattern, and extract any usage-specific code to functions that you pass as arguments. Example:
function pairwise_iteration(op, vector::AbstractVector)
s = zero(eltype(vector))
for i = 1:2:length(vector)
s += op(vector[i], vector[i+1])
end
s
end
Which can be called like this:
julia> pairwise_iteration(+, [1,2,3,4])
10
julia> pairwise_iteration(*, [1,2,3,4])
14
julia> my_op(a, b) = a^2 + b^3
my_op (generic function with 1 method)
julia> pairwise_iteration(my_op, [1,2,3,4])
82
Btw, also note the zero(eltype(vector))
for type stability. I saw that you didn’t use this in your euc_dist
function. It’s not required, but may increase performance.
I ended up making an iterator, since I was repeatedly calculating the same view on an array.
"""
Duplicate data and indexing arithmetic to avoid
using a LIFO queue or negative indexing.
"""
mutable struct t_iter
data::Vector{Float64}
t::Vector{Float64}
length::Int
q_len::Int
function t_iter(data::Vector{Float64}, q_len::Int)
return new(data, zeros(Float64, 2*length(data)), length(data), q_len)
end
end
function Base.iterate(data::t_iter, d_i=1)
if d_i >= data.length
return nothing
end
dat = data.data[d_i]
t_idx = ((d_i - 1) % data.q_len) + 1
data.t[t_idx] = dat
data.t[t_idx + data.q_len] = dat
return ((d_i, dat, data.t), d_i+1)
end
Is IterTools.partition
what you need? There is also Base.Iterators.partition
but I guess you need overlapping sliding windows. Note that IterTools.partition
returns a tuple while Base.Iterators.partition
returns a view
. If your sub-sequence is small enough, IterTools.partition
sounds like a way to go. (It would be nice if Base.Iterators.partition
had the step
argument as in IterTools.partition
…)
Maybe? I’m still not at the point with my code (which is adapted from a C library) where I can confidently hold all the operations in my head.
Just in case you don’t know this approach, here the simple way for your ignore_bias using the convolution product.
Let m=length(query), N=length(data) and D[k]=data[k:k+m], let Q be the normalized query and DM[k] the constant vector such that the mean of D[k]-DM[k] is zero. Then, you need to compute for each k the following:
obj[k] = (D[k] - DM[k] -Q[k])^2 = D[k]^2 + DM[k]^2 + Q[k]^2 - 2<D[k],DM[k]> - 2 <D[k],Q> - 2 <DM[k],Q>
Now, <DM[k], Q> = <D[k], DM[k]> = 0 for all k, and hence your objective function is
obj[k] = D[k]^2 +Q^2 - 2<D[k],Q>
Now Q^2 is independent of the offset, and <D[k], Q>=convolution(D,Q)[k], and D[k]^2 can be computed for all k in O(N)
Computing the convolution: GitHub - aamini/FastConv.jl: Fast Convolutions in Julia might work, if you locally merge the PRs that fix this for julia 1.0. Otherwise, someone here will find you a package that does the job (I am not up to date with that part of the ecosystem).
If m is large, then you win a complexity class (via fast fourier transform for the convolution); if m is small, then you still win, because somebody else wrote fast simd/gpu code for convolutions with small kernels.
Homework exercise: Represent ignore_scale via convolutions.
I think this is what is done in MASS. I had forgotten about it. Thank you for reminding me.