Memory issues generating a search space from protein sequences

Hi,

I wrote two functions in order to generate a peptide search space from protein sequences.
In the end, I need a dictionary that tells me in which protein I can find a certain peptide.
Now I have two questions about them.

a) When I generate a search space, my memory usage increases A LOT, like 20 GB (with 17’000 protein sequences). When the functions are done, RAM usage stays this high, even without anything running. How can I fix this, both reducing RAM usage while the functions are running and freeing it after they’re done? I tried looking into optimized memory allocation and function efficiency, but to no avail. This is quite the obstacle for my current pipeline.

b) In general, how would you change the functions to make them run faster and more smoothly, look more “Julian”, improve them stylistically, etc.? I implemented type declarations and excluded global variables, but those are the low-hanging fruit…

Function 1 - finding subsequences in one protein sequence:

function find_all_subsequences_of_length_x(protein_name::String, protein_sequence::String, length_x::Int, dic::Dict) # the dictionary is a parameter here because I want the option of expanding an existing dictionary; to make a new one I just pass Dict()
    for i in 1:(length(protein_sequence) - length_x + 1)
        substringy = protein_sequence[i:(length_x + i - 1)]
        if substringy in keys(dic)
            push!(dic[substringy], peptide(substringy, protein_name, length_x, i:(i + length_x - 1))) # peptide is an object with the attributes (peptide sequence, source protein name, peptide length, peptide coordinates in protein)
        else
            dic[substringy] = [peptide(substringy, protein_name, length_x, i:(i + length_x - 1))]
        end
    end
    return dic
end

Function 2 - use function 1 to iterate over a DataFrame of proteins with the columns “uniprot_id” and “seq” (the protein sequence):

function create_peptide_dictionary_with(fasta_df::DataFrame, min_peplen::Int, max_peplen::Int) # min_peplen and max_peplen define the range of peptide lengths I want to generate
    dicy = Dict()
    for length in min_peplen:max_peplen
        println("generating search space peptides of length $length")
        for protein in 1:nrow(fasta_df)
            find_all_subsequences_of_length_x(fasta_df.uniprot_id[protein],
                                              fasta_df.seq[protein],
                                              length,
                                              dicy)
        end
    end
    return dicy
end
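For reference, a minimal sketch of the peptide struct the functions assume (field names here are illustrative; the constructor is only called positionally), plus a hypothetical call:

struct peptide
    sequence::String       # peptide sequence
    protein::String        # source protein name
    len::Int               # peptide length
    coords::UnitRange{Int} # peptide coordinates in the protein
end

# hypothetical usage, assuming fasta_df was loaded from a FASTA file elsewhere:
search_space = create_peptide_dictionary_with(fasta_df, 5, 12)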

As a side note: using dictionaries for this is a practical decision. For the next steps in my pipeline I will search for a set of peptides, and it’s very easy to look them up in the dictionary and retrieve their information.

Thanks a lot for your help!

How much does your RAM usage increase by, roughly, and how big is your returned dictionary? I ask to verify my suspicion that the dictionary is what’s taking up so much RAM after the method is done. The dictionary lives on the heap, so it isn’t part of the stack memory released when methods return. And if a reference to the dictionary is retained, such as in a global variable, the garbage collector won’t free that memory.

RAM usage increases by ~20 GB; the dictionary has 129’154’624 entries.

How come the dictionary is kept entirely in RAM?

I tried to look up what the heap and stack are and how they come into play here, but I don’t understand how this relates to the dictionary being kept entirely in RAM…

You could think about using BioSequences.jl, and there is also Kmers.jl under development.

Thanks for the tip, I was considering using BioSequences - I will benchmark the difference.

Is Kmers also suitable for peptide sequences, or just nucleotide sequences? The readme only talks about nucleotides, and the docs link leads to a 404 :frowning:

Edit: BioSequences worsens the performance by ~25% in terms of execution time and memory allocations. I’d better stick to regular strings.

Hey - Kmers.jl is under development. It will have support for amino acid kmers of arbitrary length. This should speed up your code significantly. You can probably run the code with Kmers already now, but be warned, the package may not be properly tested and documented yet…

In general, this sounds like a mapping problem. You will probably get much, much better performance using a dedicated alignment tool like DIAMOND or MMseqs2.

That’s great to hear, looking forward to Kmers.jl!

I would like to generally improve RAM usage in my code - this is an extreme example of the resources my Julia scripts consume. I was hoping to learn some tricks that can help me avoid such issues.

Perhaps SubString would help here.

help?> SubString
search: SubString SubstitutionString

SubString(s::AbstractString, i::Integer, j::Integer=lastindex(s))
SubString(s::AbstractString, r::UnitRange{<:Integer})

Like getindex, but returns a view into the parent string s within range i:j or r respectively instead of making a copy.
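For instance, a minimal sketch of the extraction step with SubString (hypothetical values; the surrounding loop stays the same):

seq = "MKTAYIAKQR"  # hypothetical protein sequence
i, length_x = 3, 4
substringy = SubString(seq, i, length_x + i - 1)  # "TAYI", a view into seq rather than a copy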

If I’m remembering correctly, basically all the data a program actively works on is stored in RAM, so when people talk about memory usage in programs, they mean RAM. I mentioned the heap and stack to explain how memory is freed (or, in your case, not freed); both heap and stack memory live in RAM. You can store data on disk as files, but retrieving that data is much slower, so data is loaded into RAM to be worked on. Some packages make objects that “contain” disk data because the data is too big to fit in RAM at once, but even those objects load partial data into RAM to actually work on it. Dictionaries are not that sophisticated; they live fully on the heap.
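For example, a sketch of how that plays out (assuming the dictionary ends up bound to a variable like this):

search_space = create_peptide_dictionary_with(fasta_df, 5, 12)  # hypothetical call
# ... use search_space ...
search_space = nothing  # drop the last reference
GC.gc()                 # the garbage collector can now reclaim that heap memory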

It’s not surprising you’re using so much memory with 129 million entries; a sequence can generate a LOT of subsequences, and you have multiple sequences. 20 GB over that many entries is about 154 bytes per entry on average, which is actually sensible for a String representing a protein segment.

At the minimum, you could represent a subsequence as two indices (i, length_x + i - 1) to save memory, at the cost of having to index the full sequence on demand. Let’s see, Google says the largest protein is ~35k aa, so the smallest integer type that can index it is UInt16, which goes from 0 to 65,535. Two UInt16 indices = 4 bytes per entry, so your entries would add up to 517 MB. Bear in mind this is only the minimum; integer pairs alone aren’t useful keys for searching sequences like “ASDF”. SubString, however, can stand in for String and scales better in memory by containing 2 Int indices and an (Int-sized) pointer to a parent String (the full sequence would be a good choice).

P.S. I think you can use Base.summarysize to measure the full memory usage of an object.
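A quick sketch of such a measurement (exact numbers will vary by Julia version):

seq = repeat("MKTAYIAKQR", 100)  # hypothetical 1000-residue sequence
copies = [seq[i:i + 24] for i in 1:(length(seq) - 24)]             # owned 25-mer Strings
views  = [SubString(seq, i, i + 24) for i in 1:(length(seq) - 24)] # 25-mer views
Base.summarysize(copies)  # every 25-mer owns its own buffer
Base.summarysize(views)   # all views share seq's single buffer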


I’m going to give it a try. I remember avoiding SubStrings because I found them harder to work with than regular Strings, but I’ll give them another shot.

That was VERY insightful, thanks for taking the time to give such an elaborate answer!


If SubStrings are hard to work with, then it may be beneficial to keep SubStrings in the long-lived Dict, but make temporary Strings for short-term operations.

I later thought that this wouldn’t allow quickly looking up a key that is a SubString when you only have a String, but it seems to work:

julia> d=Dict(SubString("hcraoeuhrcaohrcu",3,7)=>3.0)
Dict{SubString{String}, Float64} with 1 entry:
  "raoeu" => 3.0

julia> d["raoeu"]
3.0

I cannot confirm it quickly, but it’s plausible that a SubString with the same value as a String computes the same hash value.

Also bear in mind that if two objects are isequal, they are treated as the same key (isequal objects also hash equally), meaning they will overwrite each other. Here are some examples of 1-entry dictionaries resulting from attempts at 2 entries:

x = Dict(SubString("ASDFx", 1:4) => 1, "ASDF" => 2)
y = Dict([2,3] => 1, view([1,2,3,4], 2:3) => 2)
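A quick way to check the hash claim above (I expect both to hold, since string hashing is defined on content):

hash("raoeu") == hash(SubString("hcraoeuhrcaohrcu", 3, 7))  # expected: true
isequal("ASDF", SubString("ASDFx", 1:4))                    # expected: true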

Hey, you could consider using a different associative data structure which does not store all the substrings. For example, prefix trees (tries) could reduce the space overhead. However, I don’t know if there already exists an efficient implementation for your purpose, but you can look into Tries.jl or the trie implementation from DataStructures.jl.

You could alternatively redesign your program so that it doesn’t require the entire search space in memory at once but lazily computes it when needed.
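A minimal sketch of that lazy approach (hypothetical names; each k-mer is produced as a view on demand instead of being stored):

# lazily iterate all k-mers of a sequence as SubString views
kmers(seq::AbstractString, k::Int) = (SubString(seq, i, i + k - 1) for i in 1:(length(seq) - k + 1))

for pep in kmers("MKTAYIAKQR", 4)  # hypothetical sequence
    # process each 4-mer here instead of putting it in a dictionary
end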


That’s a good idea. I’ll see how I can implement a transition from Strings to SubStrings.

Yeah, this is actually an issue. Two substrings that have the same character sequence can be very different in my case - they might have the same amino acid sequence but originate from two different proteins. Overwriting would cause issues.

I looked into DataStructures - tries won’t do it, I think, but the RobinDict structure decreased allocations by ~20%!

A redesign as suggested is difficult in my case. I tried other strategies, and the search-space version proved to be the least error-prone approach without trading off speed. The issue here is more the memory allocation, which I’m sure is solvable :)
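For anyone following along, the swap was roughly this (a sketch, assuming the peptide struct from my first post):

using DataStructures

dicy = RobinDict{String, Vector{peptide}}()  # used in place of Dict()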


Great goal!

  1. I second @Benny 's suggestion to store indices instead of sequences (see the sketch after this list). If I understand your code correctly, with the sequence QWWQWW and length 2, you end up storing the full sequence, plus each substring multiple times, plus the indices (as 64-bit ints). E.g., you’ll effectively have:

"QW" => [("name1", "QW", 1:2), ("name1", "QW", 4:5)]

That’s a lot of duplicated info. If this were just a little toy script with a few thousand things, it wouldn’t be so bad, but given the size of your dataset, I’m not surprised by the memory issues.

  2. Even if you’re not using Kmers.jl, definitely use BioSequences.jl, since it has a much more memory-efficient encoding than String.

  3. Depending on how you’re using the Peptide objects downstream, storing them in memory-mapped files might end up being a better bet. E.g., you could use Arrow.jl to store a table for each protein sequence or something. As already mentioned, disk access is slower than RAM access, but if it keeps your system from maxing out memory, it could end up being faster.
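A sketch of the index-only idea from point 1 (hypothetical names; it stores where a peptide occurs rather than its characters):

# compact occurrence record: which protein, and where in it
struct Occurrence
    protein_idx::Int32 # row index into fasta_df, instead of a name String
    start::UInt16      # 1-based start position (the largest proteins fit in UInt16)
    len::UInt16        # peptide length
end

# the peptide's sequence is recovered on demand from the parent sequence:
# SubString(fasta_df.seq[occ.protein_idx], Int(occ.start), Int(occ.start) + Int(occ.len) - 1)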

  1. I agree that having the sequence itself as a key and then again in the tuple is redundant. I benchmarked the difference, and in my tests it was negligible, but of course you’re right that this occupies unnecessary resources.

  2. I tried using BioSequences - it resulted in HIGHER memory use :( I assume I must have been using it wrong? It was the exact same code as above, but using LongAminoAcidSeqs. Are there caveats in how to use them? I went through the documentation several times but didn’t find my answer.

  3. Protein sequence storage isn’t an issue; I import the sequences from a FASTA file, which uses little memory and is quick. I would have loved to store my peptide dictionary as a JLD file, but it was so large I couldn’t even get it saved, and I’ve had trouble importing it into RStudio even though it’s supposed to be a generic HDF5 format.

Thanks for your help!


A LongSequence from BioSequences stores the sequence in a Vector{UInt64} and is itself a mutable struct, so it has a larger memory overhead than a String. To reduce memory usage, either use LongSubSeq, which is analogous to SubArray, or use kmers from Kmers.jl (analogous to StaticArrays; not a finished package, so expect rough edges).
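For illustration, a minimal sketch (assuming BioSequences v3, where the aa"..." macro builds a LongAA):

using BioSequences

seq = aa"MKTAYIAKQR"        # hypothetical amino acid sequence, packed encoding
pep = LongSubSeq(seq, 3:6)  # a view into seq's packed data, no copy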