Bad performance: using OOP with for loops to identify substring

Hi Julia community,

I have two long lists: one contains long strings, the other one short strings. I need to identify which short strings can be found in which long strings - we need to test all possible “long string”-“short string” pairings. then we need to register where we found them, the coordination of these “mapped sites” are important.

My approach:

  1. create a mutable struct “long string map” with attributes where I save the long string’s character sequence and which short strings I can map to it.
  2. write a function to match a short string to a “long string map” object, save all relevant information in the attributes
  3. iterate through all long and short string pairings

I did this with either 2 for loops and and also with 1 for loop and using f.(array_with_long_string_objects), both were too slow. do you guys have ideas which functions or functionalities i could use. I’m kinda disappointed because I expected Julia to be super fast, especially since it says everywhere we shouldnt be as shy with for loops as with R for example.

Here some code:

mutable struct long_string_object
    long_string_id::SubString{String}
    long_string_sequence::SubString{String}
    mapped_short_strings::Vector{String}
    mapped_short_strings_coordinates::Vector{UnitRange{Int64}}
    short_peptide_coverage
end
function map_short_string_to_long_string(short_string, long_string)
    if occursin(short_string, long_string)
        finder = findall(short_string, long_string_object.long_string_sequence)
        for i = 1:length(finder)
            long_string_object.short_peptide_coverage[finder[i],:] += ones(Int64, length(finder[i]), 1) 
        end
        push!(long_string_object.mapped_short_strings, short_string)
        append!(long_string_object.mapped_short_strings_coordinates, finder)
    end
    nothing
end

function map_array_of_short_strings_to_array_of_long_strings(short_string_array, long_string_object_array)
    for c = size(short_string_array,1)
        for d = 1:size(long_string_object_array,1)
            map_short_string_to_long_string(short_string_array[c], long_string_object_array[d])
        end
    end
    nothing
end

Thanks for your help!

PS: I modified all methods and functions - I need this for bioinformatics but I didnt want to confuse you with technicalities. if it helps: the long strings are 100-2000 characters long, the short ones are 5-20 characters long.

What do you mean by “bad performance” and “super fast”? How fast is your program and how fast should it be? How fast is your R implementation?

I know very little about this (you probably want to talk to @jakobnissen who I think has done work with strings in bioinformatics), but String can be hard on the garbage collector. Is your program spending a lot of time in GC? If so it might be worth trying GitHub - JuliaStrings/InlineStrings.jl: Fixed-width string types for Julia or something similar (it seems InlineString only supports up to 255 bytes, so your long strings might be too long).

@candidaorelmex So for this problem, there are a variety of approaches:

  1. Use a dedicated biological sequence search tool like BLAST. There are no such tools implemented in Julia at the moment, but what you can do is to call BLAST from the shell within Julia, then use Julia to parse the result

  2. You could use pairwise alignment to solve this. BioAlignments.jl has Needleman-Wunch and Smith-Waterman implemented in Julia. It’s not super optimized (i would estimate maybe 10 ms per pair?), so if performance is a concern, this might not be the way to go. But it’s certainly the most sensitive for biological sequences.

  3. Represent both your subject and query sequences as BioSequence from BioSequences.jl. Then use the find functions like findnext to find the location of the smaller sequences. This is certainly faster than SW alignment, but not blazingly fast either.

  4. For each length L of the small sequences, create a set of kmers of type DNAMer{L} from BioSequences.jl version 2 (very important w. version 2 - we don’t have good kmer functionality in v3!). Then for each long set, loop over each long sequence and use each to generate kmers and check for membership in each set.

It’s half the speed of what my friend managed the same task in half the time with python - hence I think Im slow.

gc is like 3% I think… I will check out InlineStrings.jl, thanks for your helpful inputs!

  1. I’ll test that!

  2. performance is the concer atm…I can give it a try though:)

  3. Again, I’ll try that! I’m using findall atm, so that should work

  4. I’m working with protein sequences, but I guess there’s an analog for that. stil version 2 if it’s proteins?

Thanks a lot!

Yeah - LongAminoAcidSeq for proteins. It looks like you want to estimate coverage of the short reads on the long sequences. For this purpose, in order to get good speed, you will need to use a dedicated short-read mapper like bwa or kma or bowtie2 or minimap2 - I don’t recommend rolling your own. These tools will able to handle imperfect matches.

this might be what im looking for. I really appreciate your help!

are these functions in the BioSequences.jl package?

These are command-line tools written by other bioinformaticians. You will need to install and launch them separately from the shell.

Ok gotcha. Thx!