Bad performance: using structs to find substrings

jakobnissen · January 7, 2022, 7:49pm

There is not, and unfortunately all these pieces of software are executables and expose no libraries (as is typical of bioinformatics, everything is done through the shell). However, bwa and samtools standardized the BAM file format, which is parsed by BioJulia’s XAM.jl. So your best bet is to shell out to these executables, then parse their output.

candidaorelmex · January 10, 2022, 8:10am

In case anyone is interested: my most efficient solution (old code would have taken ~50min, new solution does it in 90s - 10000 short strings, 17000 long strings) is the following:

throw all short strings into a regex seperate by “|”
(I couldnt fit more than 3000 short strings into one regex, so you creat e.g. 3 regex for 8000 short strings. I assume this is a bug, but it is cirmuventable)
use findall.(regex, long_string[:]) where regex is constant and long_string[:] an array with all long strings you’d like to have checked.

I would have like to turn all strings into BioSequences with BioSequences.jl, but I didnt manage to find the equivalent of Regex() for biore which I’d need since all my sequences are saved as variables. I also don’t know if findall is compatible with biores as it is with “normal” regular expressions

jakobnissen · January 11, 2022, 3:24pm

Findall is indeed not implemented for biosequences (though a PR implementing it was created yesterday).

woolstar · January 14, 2022, 11:08pm

Even standard command line tools for matching strings, like grep suffer from this. Standard grep operates on unicode under most conditions, but if you restrict it to ASCII only via:

LANG=C grep ...

You can see significant speedups.

amarcotte59 · January 17, 2022, 10:51pm

If I had to write such a piece of code, I would program building a tree or a state machine from the short string list identifying the end of a short string then travel through it with the long strings; probably what the aforementioned projects do. This is to try to get the most efficiency; grep might be sufficient depending on amount of data and turnaround time.

Topic		Replies	Views
Bad performance: using OOP with for loops to identify substring New to Julia performance	8	585	January 7, 2022
CSV woes and SubString documentation New to Julia	9	1485	December 24, 2017
How do I make the julia code efficient? General Usage question	3	299	September 21, 2022
Community string benchmark suite General Usage	3	161	October 1, 2024
Substring function? New to Julia strings , unicode	42	3953	July 18, 2022

Bad performance: using structs to find substrings

Related topics