In case anyone is interested: my most efficient solution is the following (the old code would have taken ~50 min, the new solution takes 90 s for 10000 short strings and 17000 long strings):
- Throw all short strings into one regex, separated by “|”. (I couldn't fit more than 3000 short strings into one regex, so you create e.g. 3 regexes for 8000 short strings. I assume this is a bug, but it can be worked around.)
- Use findall.(regex, long_string[:]), where regex is constant and long_string[:] is an array with all the long strings you'd like to have checked.
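
For anyone who wants to try it, here is a minimal sketch of the two steps above. The data, the variable names and the chunk size of 2 are made up for illustration (in my case the limit was around 3000 short strings per regex), and it assumes the short strings contain no regex metacharacters, which holds for plain DNA/protein letters.

```julia
# Hypothetical example data, not my real sequences
short_strings = ["ACG", "TTA", "GGC", "CAT"]
long_strings  = ["AACGTTAGG", "GGCATCAT", "TTTTTT"]

# Assumed chunk size for the demo; pick whatever still compiles for your data
chunk_size = 2

# One alternation regex per chunk of short strings, e.g. r"ACG|TTA".
# No escaping is done here, so this only suits strings without regex metacharacters.
regexes = [Regex(join(chunk, "|"))
           for chunk in Iterators.partition(short_strings, chunk_size)]

# Broadcast findall over all long strings; Ref(re) holds the regex constant.
# hits[i][j] is the vector of match ranges of regex i in long string j.
hits = [findall.(Ref(re), long_strings) for re in regexes]
```

Note that findall returns non-overlapping matches by default, so a short string that overlaps an earlier match is skipped (in the second long string above, the "CAT" at 3:5 is not reported because it overlaps "GGC"). The overlap=true keyword of findall changes this if you need it.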
I would have liked to turn all strings into BioSequences with BioSequences.jl, but I didn't manage to find the equivalent of Regex() for biore, which I'd need since all my sequences are saved as variables. I also don't know whether findall is compatible with biore patterns the way it is with “normal” regular expressions.