Bad performance: using structs to find substrings

There is not, and unfortunately all these pieces of software are executables and expose no libraries (as is typical of bioinformatics, everything is done through the shell). However, bwa and samtools standardized the BAM file format, which is parsed by BioJulia’s XAM.jl. So your best bet is to shell out to these executables, then parse their output.

2 Likes

In case anyone is interested: my most efficient solution (old code would have taken ~50min, new solution does it in 90s - 10000 short strings, 17000 long strings) is the following:

  1. throw all short strings into a regex seperate by “|”
    (I couldnt fit more than 3000 short strings into one regex, so you creat e.g. 3 regex for 8000 short strings. I assume this is a bug, but it is cirmuventable)
  2. use findall.(regex, long_string[:]) where regex is constant and long_string[:] an array with all long strings you’d like to have checked.

I would have like to turn all strings into BioSequences with BioSequences.jl, but I didnt manage to find the equivalent of Regex() for biore which I’d need since all my sequences are saved as variables. I also don’t know if findall is compatible with biores as it is with “normal” regular expressions

1 Like

Findall is indeed not implemented for biosequences (though a PR implementing it was created yesterday).

2 Likes

Even standard command line tools for matching strings, like grep suffer from this. Standard grep operates on unicode under most conditions, but if you restrict it to ASCII only via:

LANG=C grep ...

You can see significant speedups.

2 Likes

If I had to write such a piece of code, I would program building a tree or a state machine from the short string list identifying the end of a short string then travel through it with the long strings; probably what the aforementioned projects do. This is to try to get the most efficiency; grep might be sufficient depending on amount of data and turnaround time.