Finding patterns in vectors


I’m trying to write a Julia implementation of Lempel-Ziv Complexity. Specifically I’m stuck at the pattern-finding part of the algorithm. Let’s say I have a vector: [1, 0, 0, 1]. It contains the pattern [1, 0] but doesn’t contain the pattern [1,1] [0,1,1]; so it should start reading from the left and go to the right. I need an operation that gives a logical true when I give the [1, 0] and the aforementioned vector and false when I give it [1,1]. The “issubset” function still gives a true when I type issubset([1, 0, 0, 1], [0, 1, 1]).

Any help is greatly appreciated,

What issubset(a, b) tests is “whether every element of a is also in b”, because this is the definition of subset, it disregards the order the elements are stored in the Vector because “order” does not exist in sets.

You are not entirely clear on what you want. Do you want to check if b is a prefix of a? In this case use: is_prefix(a, b) = @view(a[keys(b)]) == b; if you want to check if there is a sub-sequence inside a that matches b then you can use any((x -> is_prefix(x, b)), (a[i:(i+length(b)-1)] for i in firstindex(a):(lastindex(a) - length(b) + 1)))

1 Like

I was actually looking for the second one. Works like a charm, thanks so much! It seems should delve more into higher order functions.

1 Like

Could we do this with a simpler for loop?

# is y subarray of x?
function issubarray(y,x)
    ny = length(y)
    for i in eachindex(x)[begin:end-ny+1]
        y == view(x,i:i+ny-1) && (return true)
    return false


using IterTools
binc(b,c)=Tuple(b) ∈ IterTools.partition(c,length(b),1)

Also consider using the findfirst function if this is right for you

findfirst(pattern::AbstractVector{<:Union{Int8,UInt8}}, A::AbstractVector{<:Union{Int8,UInt8}})

@rocco_sprmnt21 is right. I did use an old version of Julia so I did not see this:

help?> findfirst([0x0, 0x01, 0x0], [0x0, 0x1])

  Find the first occurrence of sequence pattern in vector A.

  │ Julia 1.6
  │  This method requires at least Julia 1.6.


  julia> findfirst([0x52, 0x62], [0x40, 0x52, 0x62, 0x63])

You can just call findfirst(b, a) !== nothing then.

Thanks so much for your very helpful comments!!!

The approaches proposed so far have complexity O(m n) when you are searching for a pattern of length m in a vector of length n. When m is large, you may want to consider correlation in the Fourier domain that would result to O((m+n) \log(m+n)).

Here is a shameless plug to our work to compute the Pearson correlation coefficient in the Fourier domain and accurately locate patterns even when they have been scaled and translated: the registered package FastLocalCorrelationCoefficients.jl

A very similar problem is finding similar subsequences in a vector, or finding approximate matches of a pattern in a vector.
The package

contains several methods to look for such matches.