Bad performance: using structs to find substrings

Hi everyone,

I have the following problem:
I have a list of long strings and a list of short strings. My goal: find which short strings are found within which long strings. for this i need to go through all short string- long string pairings. I then also need to store some information of these pairings such as which short strings were found in a long strings and the coordinated of these substrings

My approach:
1.) create a mutable struct long_string_object

mutable struct long_string_object
    long_string_identifier::SubString{String}
    long_string_sequence::SubString{String}
    found_short_strings::Vector{String}
    found_short_strings_coordinates::Vector{UnitRange{Int64}}
    short_string_coverage
end

2.) write a function to test a short string-long string pairing

function map_short_string_to_long_string(short_string, long_string_object)
    if(occursin(short_string, long_string_object.long_string_sequence))
        finder = findall(short_string, long_string_object.long_string_sequence) 
        for i = 1:length(finder)
            long_string_object.short_string_coverage[finder[i],:] += ones(Int64, length(finder[i]), 1) # add 1 to every position covered by ranges in finder
        end
        push!(long_string_object.mapped_short_strings, short_string) 
       append!(long_string_object.mapped_short_strings_coordinates, finder) 
    end
    nothing
end

3.) loop through every short string-long string pairing

function map_array_of_short_strings_to_array_of_long_string_objects(short_string_array, long_string_object_array)
    for c = 1:size(short_string_array,1)
        for d = 1:sizelong_string_object_array,1)
            map_short_string_to_long_string(short_string_array[c], long_string_object_array[d,1])
        end
    end
    nothing
end

I also tried removing one for loop and replacing it with a f.(long_string_object_array) functionality (which is the same as using map if I understand correctly, didnt work.

The question: is there a more efficient way of doing this? Im not happy with the performance, especially because
a) a friend managed to do it in half the time using python (by putting all short strings in to one string, seperated by a “|”, then using the function iterfind in python 3)
b) everyone says I shouldnt be shy with for loops in Julia, but I was hoping for more…

Also, do you have general criticism for my code?

1 ) Judging by the code below, you have within the struct either strings that don’t have to change or mutable objects which don’t have to change identity, hence you don’t need mutable:

struct LongStringObject
    long_string_identifier::SubString{String}
    long_string_sequence::SubString{String}
    found_short_strings::Vector{String}
    found_short_strings_coordinates::Vector{UnitRange{Int64}}
    short_string_coverage #::Matrix{Int}?
end

2 ) findall repeats some work already done by occursin, so I’d probably go with

function map_short_to_long(short, long::LongStringObject)
    seq = long.long_string_sequence
    coverage = long.short_string_coverage
    finder = findall(short, seq) 
    for inds in finder
        coverage[inds,:] .+= 1 # add 1 to every position covered by ranges in finder
    end
    if !isempty(finder)
        push!(long.mapped_short_strings, short) 
        append!(long.mapped_short_strings_coordinates, finder) 
    end
    return long
end

The low performance could originate from coverage[inds,:] += ones(...) which has to allocate a new array for the right-hand part each time and drop it immediately. Using broadcast coverage[inds,:] .+= 1 should improve the performance.
Also, note that Julia uses column-major storage for Matrix, so coverage[:,inds] .+= 1 is more performant than coverage[inds,:] .+= 1, so you may consider changing the order in the coverage array.
3 ) Nothing too bad, but for i in 1:size(array, 1); f(array[i]); end could be replaced by for x in array; f(x); end just like in Python.

3 Likes

Thank you for your feedback. I have some follow-up questions:

  1. Is there a performance advantage of struct vs mutable struct

  2. I first test if the short string is present in the long string with occursin. am I right in understanding that you’re telling me that this doesn’t improve the performance because if findall doesnt find anything it doesn’t take any longer than occursin

the += 1 and the column-major storage are also helpful, thanks!

  1. Ill test that, thanks!

Do you think an OOP approach is appropriate in this case or should start working with complex data frame/array types?

  1. structs are slightly faster because they can be stack allocated, while mutable structs are in fact references to the actual data stored on heap. In your case, that performance difference might be negligible because the actual work done on the data takes much more time than getting the data from reference. I prefer enforcing immutability for clarity – to show that I don’t expect any attribute to change its identity (note that arrays within the immutable structs can still change contents).

  2. It does improve performance if short strings are unlikely to be found because findall still has to allocate an empty array. If you expect most of the time substrings to be found, then your initial approach has some duplication of effort. Maybe some hybrid approach is more fruitful: check the occurence via findfirst, then you know exactly the index of first occurence, so that you can then pass a shorter string to findall:

firstoccur = findfirst(short, seq)
if firstoccur !== nothing
    finder = findall(short, @view seq[firstoccur.start:end])
    for inds in finder
        ...
    end
    push!(...)
    append!(...)
end

Also I cannot understand why you used two indices for short_string_coverage. Is it supposed to be a 2D array or you copy-pasted a code snippet from somewhere and it’s supposed to be a 1D array (i.e., Vector{Int}) in fact?
For broadcast, mind the dot in .+= 1, without it you’ll get an error (which in latest Julia versions reminds you to not forget the dot :slight_smile: ).

I can’t say if dataframes are better or worse for this task. If that’s the only problem you’d consider a dataframe for, then it’ll probably just add more overhead compared to a fairly small specialized type. If you see how you can leverage the existing libraries for your processing needs, then of course it makes sense to move to the types they use.

2 Likes

sorry, what the term OOP here stands for?

Sorry if I’m off-topic, but it’s worth emphasizing that this approach is not OOP. Using a data structure does not make an approach OOP, it is just a structure as in C since being OOP requires more stuff than encapsulating fields. Yes, it is clear why you use this term here but it may cause misunderstanding for someone else who has just started to work with Julia.

6 Likes
  1. The type will remain, but i will manipulate the content of the object attributes. using immutabale structs are a legitimate choice then.

  2. of 10000 short strings I expect 0 to 5 to be found in most cases. my database of short strings will hopefully increase to 100’000+ at some point, I wont ever find more than maybe 20 per long string (except for very few exceptions). Concerning the two indices for short_string_coverage is indeed unnecessary since its a Vector{Int64}.
    Do you think I can improve performance by using Int8 or Int32 instead of Int64? the numbers covered by an element in the vector will never go beyond 500

I’ll be sure to include the dot:)

I prefer defining a new type with attributes over a df for understandability’s sake, I used to do it in a df, it got difficult to look at and deal with quickly. Unless there’s a significant advantage, like performance penalties because it’s costly to call or change object attributes, I’d like to stick to the current approach.

Thank you for pointing that out. Would you mind elaborating on what keeps this from being OOP? I know that all I’m doing is creating a customized struct; I thought that’s pretty much OOP means.

How could I enhance my code with OOP? Or does the advantage of OOP only become obvious when the project that this issue is a part of grows further?

I use the term wrongly because I still understanding in what it means in practical terms. Any help would be appreciated :slight_smile: (I know there are resources online, but I dont quite get what OOP means in terms of practical implementation)

1 Like

Yes, you are encapsulating your fields within an object but it is not OOP. Since it is a long issue, I recommend following other topics that are dedicated for this. In short

  • OOP languages have more features besides encapsulation (polymorphism, inheritance, etc.)
  • In OOP languages, the object itself is automatically passed to the object method (or function) but in Julia, it is not the case.

Since Julia is not an OOP language, we generally encapsulate fields in structures and the method parameters help to select the correct version of the function, for example

object.method(some parameters)
other_object.method(some parameters)
something_else.method(some parameters)

becomes

method(object, some parameters)
method(other_object, some parameters)
method(something_else, some parameters)

and thank to multiple dispatch, the Julia compiler selects the correct method depending on the type of the object. That is the point in a nutshell.

By the way, C and Rust support user-defined data types through structs as in the same way that Julia uses -and of course- C and Rust are not OOP languages, but C++ and Java are, as they support the other features of OOP.

Not a big fault, but it is open to possible misunderstandings as I mentioned above.

2 Likes

I think it is more correct to write

function(object, some parameters)

since you don’t actually specify the method in the call. The correct method is chosen by the compiler based on the function name and input arguments.

1 Like

sure, just wanted to be consistent for comparing paradigms.

1 Like

depending on how long are your shorts + how often are you going to search and how similar they are you may also think about Rabin-Karp, or an automaton (Knuth-Morris-Pratt) based string matching.

2 Likes

Still allocates unless you use @views. See Allocation when assigning to a slice - #3 by stevengj

6 Likes

Do you want this for Biology problems? If so, I would highly recommend BioSequences. It has a lot of highly optimized algorithms for sequence matching that are much more efficient than solutions using strings. String represents unicode data, so a lot of operations on them are slower than if you are working with ACTG (or other fixed sets of characters), where you can store the information more efficiently and use more efficient algorithms. Specifically, this has the documentation for some highly optimized substring finding algorithms.

ok, gotcha. so it’s more a Julia thing than my style of solving this

never heard of it, but I’ll try! thanks for your suggestion

indeed, it’s all about finding peptide sequences in protein sequences. I’ll take a look, maybe that’s where my solution lies. thanks you!

For all who are interested: This is short read mapping, a common and well-studied task in bioinformatics. Lots of good programs have been developed that does this - bwa, minimap2, bowtie and kma come to mind.

Because this is a well-studied and important problem with a lot of effort put into it, creating software that can compete with the above programs in terms of speed and sensitivity is a major undertaking where you need to have at least some familiarity of previous approaches in order to be able to compete.

5 Likes

Are there bindings to the good libraries in Julia?