Bad performance: using OOP with for loops to identify substring

candidaorelmex · January 6, 2022, 10:53pm

Hi Julia community,

I have two long lists: one contains long strings, the other one short strings. I need to identify which short strings can be found in which long strings - we need to test all possible “long string”-“short string” pairings. then we need to register where we found them, the coordination of these “mapped sites” are important.

My approach:

create a mutable struct “long string map” with attributes where I save the long string’s character sequence and which short strings I can map to it.
write a function to match a short string to a “long string map” object, save all relevant information in the attributes
iterate through all long and short string pairings

I did this with either 2 for loops and and also with 1 for loop and using f.(array_with_long_string_objects), both were too slow. do you guys have ideas which functions or functionalities i could use. I’m kinda disappointed because I expected Julia to be super fast, especially since it says everywhere we shouldnt be as shy with for loops as with R for example.

Here some code:

mutable struct long_string_object
    long_string_id::SubString{String}
    long_string_sequence::SubString{String}
    mapped_short_strings::Vector{String}
    mapped_short_strings_coordinates::Vector{UnitRange{Int64}}
    short_peptide_coverage
end

function map_short_string_to_long_string(short_string, long_string)
    if occursin(short_string, long_string)
        finder = findall(short_string, long_string_object.long_string_sequence)
        for i = 1:length(finder)
            long_string_object.short_peptide_coverage[finder[i],:] += ones(Int64, length(finder[i]), 1) 
        end
        push!(long_string_object.mapped_short_strings, short_string)
        append!(long_string_object.mapped_short_strings_coordinates, finder)
    end
    nothing
end

function map_array_of_short_strings_to_array_of_long_strings(short_string_array, long_string_object_array)
    for c = size(short_string_array,1)
        for d = 1:size(long_string_object_array,1)
            map_short_string_to_long_string(short_string_array[c], long_string_object_array[d])
        end
    end
    nothing
end

Thanks for your help!

PS: I modified all methods and functions - I need this for bioinformatics but I didnt want to confuse you with technicalities. if it helps: the long strings are 100-2000 characters long, the short ones are 5-20 characters long.

nilshg · January 7, 2022, 4:15pm

What do you mean by “bad performance” and “super fast”? How fast is your program and how fast should it be? How fast is your R implementation?

I know very little about this (you probably want to talk to @jakobnissen who I think has done work with strings in bioinformatics), but String can be hard on the garbage collector. Is your program spending a lot of time in GC? If so it might be worth trying https://github.com/JuliaStrings/InlineStrings.jl or something similar (it seems InlineString only supports up to 255 bytes, so your long strings might be too long).

jakobnissen · January 7, 2022, 4:30pm

@candidaorelmex So for this problem, there are a variety of approaches:

Use a dedicated biological sequence search tool like BLAST. There are no such tools implemented in Julia at the moment, but what you can do is to call BLAST from the shell within Julia, then use Julia to parse the result
You could use pairwise alignment to solve this. BioAlignments.jl has Needleman-Wunch and Smith-Waterman implemented in Julia. It’s not super optimized (i would estimate maybe 10 ms per pair?), so if performance is a concern, this might not be the way to go. But it’s certainly the most sensitive for biological sequences.
Represent both your subject and query sequences as BioSequence from BioSequences.jl. Then use the find functions like findnext to find the location of the smaller sequences. This is certainly faster than SW alignment, but not blazingly fast either.
For each length L of the small sequences, create a set of kmers of type DNAMer{L} from BioSequences.jl version 2 (very important w. version 2 - we don’t have good kmer functionality in v3!). Then for each long set, loop over each long sequence and use each to generate kmers and check for membership in each set.

candidaorelmex · January 7, 2022, 4:31pm

It’s half the speed of what my friend managed the same task in half the time with python - hence I think Im slow.

gc is like 3% I think… I will check out InlineStrings.jl, thanks for your helpful inputs!

candidaorelmex · January 7, 2022, 4:36pm

I’ll test that!
performance is the concer atm…I can give it a try though:)
Again, I’ll try that! I’m using findall atm, so that should work
I’m working with protein sequences, but I guess there’s an analog for that. stil version 2 if it’s proteins?

Thanks a lot!

jakobnissen · January 7, 2022, 4:39pm

Yeah - LongAminoAcidSeq for proteins. It looks like you want to estimate coverage of the short reads on the long sequences. For this purpose, in order to get good speed, you will need to use a dedicated short-read mapper like bwa or kma or bowtie2 or minimap2 - I don’t recommend rolling your own. These tools will able to handle imperfect matches.

candidaorelmex · January 7, 2022, 4:41pm

this might be what im looking for. I really appreciate your help!

are these functions in the BioSequences.jl package?

jakobnissen · January 7, 2022, 4:52pm

These are command-line tools written by other bioinformaticians. You will need to install and launch them separately from the shell.

candidaorelmex · January 7, 2022, 5:38pm

Ok gotcha. Thx!

Topic		Replies	Views
Bad performance: using structs to find substrings New to Julia performance	24	1788	January 17, 2022
How do I make the julia code efficient? General Usage question	3	296	September 21, 2022
Community string benchmark suite General Usage	3	159	October 1, 2024
CSV woes and SubString documentation New to Julia	9	1481	December 24, 2017
Substring function? New to Julia strings , unicode	42	3897	July 18, 2022

Bad performance: using OOP with for loops to identify substring

Related topics