Bad performance: using structs to find substrings

In case anyone is interested, my most efficient solution so far is below. For 10000 short strings and 17000 long strings, the old code would have taken ~50 min; the new solution finishes in 90 s.

  1. Throw all short strings into a single regex, separated by “|”.
    (I couldn't fit more than about 3000 short strings into one regex, so you create e.g. 3 regexes for 8000 short strings. I assume this is a bug, but it is circumventable.)
  2. Use findall.(regex, long_string[:]), where regex is constant and long_string[:] is an array with all the long strings you'd like to have checked.
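The two steps above can be sketched roughly as follows. This is only a minimal illustration, not my actual code: the sample data, the helper name `build_regexes`, and the `chunk_size` value are made up for the example, and it assumes the short strings contain no regex metacharacters (true for plain DNA letters).

```julia
# Hypothetical sample data; the real inputs were ~10000 short and ~17000 long strings.
short_strings = ["ACG", "TTA", "GGC", "CAT"]
long_strings = ["AACGTTAGG", "CCCCCC", "GGCATTA"]

# Assumed chunk size, based on the ~3000 alternatives that fit into one regex for me.
chunk_size = 3000

# Step 1: join the short strings with "|" and build one Regex per chunk.
# Assumes the strings contain no regex metacharacters.
function build_regexes(patterns, chunk_size)
    return [Regex(join(chunk, "|")) for chunk in Iterators.partition(patterns, chunk_size)]
end

regexes = build_regexes(short_strings, chunk_size)

# Step 2: for every long string, collect the match ranges found by each regex.
# findall(regex, string) returns a vector of index ranges, one per match.
hits = [reduce(vcat, [findall(re, s) for re in regexes]) for s in long_strings]
```

Note that findall on a regex returns non-overlapping matches by default, so a short string that overlaps an earlier match in the same long string can be missed. There is an `overlap=true` keyword, but even then each start position yields at most one alternative, so treat the result as a fast pre-filter if you need every hit.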

I would have liked to turn all strings into BioSequences with BioSequences.jl, but I didn't manage to find the equivalent of Regex() for biore, which I'd need since all my sequences are stored in variables. I also don't know whether findall is compatible with biore patterns the way it is with “normal” regular expressions.
