Finding position of a sequence in a UInt8 array?

Hello!

Suppose I have:

using Random
rand(MersenneTwister(0),UInt8,10)

10-element Array{UInt8,1}:
 0x40
 0x52
 0x99
 0x35
 0x90
 0xea
 0x00
 0x13
 0xdf
 0x18


Say I want to find the position of sequence “0x90 0xea 0x00”, how would I go about this?

Finding the position of “0x90” is not suitable, this is just a simplified example.

Kind regards

I have tried to use findfirst, skipchars etc. but only been able to use one char at a time currently

You can try the KMP algorithm.

Thanks for the suggestion, that might be what I have to do

Is what I am wanting “unreasonable” though? I thought that perhaps one would already be able to do this in native Julia

Kind regards

Certainly not unreasonable, just mostly used for strings.

A slightly perverse way is using findfirst with arrays converted to strings:

julia> a = String(UInt8[
       0x40
        0x52
        0x99
        0x35
        0x90
        0xea
        0x00
        0x13
        0xdf
        0x18
       ])
"@R\x995\x90\xea\0\x13\xdf\x18"

julia> findfirst(String(UInt8[0x90, 0xea, 0x00]), a)
5:7

I think @Vasily_Pisarev was suggesting implementing KMP in native Julia. But probably what you really want is something that is already implemented.

As it turns out, searching for subsequences of byte arrays is already implemented in Julia Base in order to implement substring searching. You can call it via the undocumented function:

Base._searchindex(a, [0x90,0xea,0x00], 1)

to search for the starting index of the subsequence 0x90,0xea,0x00 in a byte array a starting at the beginning of the array, for example. It returns 0 if the subsequence was not found.

It might be nice to add a high-level interface for this via findfirst and findnext — should be a pretty easy PR since the main code is already implemented.

6 Likes

Thanks for the try, this was what I hoped to avoid, but of course I did not specify that in my question, so that was my fault

Kind regards

EDIT: Included a quote by mistake

This is so awesome thanks!

I just tested it for my purposes and before I was using ‘readuntil’ which allocated more and took longer time, so using this gave me a 10x improvement in timing (in my initial fast test).

I think your suggestions of making this more accessible is good and thanks for putting a PR up or what it is called :slight_smile:

Lastly could you explain to me what the last “1” does in your example? I tried changing it, since I thought it would find an occurence two times if say it was “2” but I couldn’t figure it out. Maybe it is axis of search direction?

Kind regards

It’s the starting index for the search, the same as the start argument of findnext — it says to search starting at byte 1. Passing 2 means to search starting at byte 2.

In this way, you can find multiple occurrences:

i = Base._searchindex(a, [0x90,0xea,0x00], 1)
Base._searchindex(a, [0x90,0xea,0x00], i+1) # finds the second occurrence, if any
1 Like

Makes total sense, thanks!

I filed an issue suggesting the feature, but I didn’t create a pull request (PR) — a PR occurs when someone actually implements the feature and requests for it to be merged.

2 Likes

behold, there exists an actual PR now

4 Likes