Hi,
I wrote two functions in order to generate a peptide search space from protein sequences.
In the end, I need a dictionary that tells me in which protein(s) a certain peptide can be found.
Now I have two questions about them.
a) When I generate a search space, my memory usage increases A LOT — around 20 GB with 17'000 protein sequences. Even after the functions finish, RAM usage stays that high without anything running. How can I fix this, i.e. reduce RAM usage both while the functions are running and after they're done? I tried looking into optimized memory allocation and function efficiency, but to no avail. This is quite the obstacle for my current pipeline.
b) In general, how would you change the functions to make them run faster and more smoothly, look more "Julian", improve them stylistically, etc.? I implemented type declarations and avoided global variables, but those are the low-hanging fruit…
Function 1 - finding subsequences in one protein sequence:
```julia
function find_all_subsequences_of_length_x(protein_name::String, protein_sequence::String, length_x::Int, dic::Dict)
    # The dictionary is a parameter so I have the option of expanding an existing
    # dictionary; to make a new one I just pass Dict() as input.
    for i in 1:(length(protein_sequence) - length_x + 1)
        substringy = protein_sequence[i:(i + length_x - 1)]
        # peptide is an object with the attributes
        # (peptide sequence, source protein name, peptide length, peptide coordinates in protein)
        if haskey(dic, substringy)
            push!(dic[substringy], peptide(substringy, protein_name, length_x, i:(i + length_x - 1)))
        else
            dic[substringy] = [peptide(substringy, protein_name, length_x, i:(i + length_x - 1))]
        end
    end
    return dic
end
```
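For reference, a minimal sketch of the `peptide` type the function above constructs — the field names here are illustrative, the actual definition just needs the four attributes from the comment:

```julia
# Sketch of the peptide type; field names are illustrative.
struct peptide
    sequence::String        # peptide sequence
    protein_name::String    # source protein name
    len::Int                # peptide length
    coords::UnitRange{Int}  # peptide coordinates in the protein
end
```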
Function 2 - uses Function 1 to iterate over a DataFrame of proteins with the columns "uniprot_id" and "seq" (the protein sequence):
```julia
function create_peptide_dictionary_with(fasta_df::DataFrame, min_peplen::Int, max_peplen::Int)
    # min_peplen and max_peplen define the range of peptide lengths I want to generate
    dicy = Dict()
    for len in min_peplen:max_peplen  # "len" avoids shadowing Base.length
        println("generating search space peptides of length $len")
        for protein in 1:nrow(fasta_df)
            find_all_subsequences_of_length_x(fasta_df.uniprot_id[protein],
                                              fasta_df.seq[protein],
                                              len,
                                              dicy)
        end
    end
    return dicy
end
```
As a side note: using dictionaries for this is a practical decision. In the next steps of my pipeline I will look for a set of peptides, and it's very easy to look them up in the dictionary and retrieve their information.
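For example, a later lookup is a single `get` call (the key and dictionary contents here are stand-ins):

```julia
# Stand-in for the real peptide dictionary built above
dic = Dict("PEPTIDE" => ["hit1", "hit2"])

hits   = get(dic, "PEPTIDE", [])  # all recorded occurrences of "PEPTIDE"
misses = get(dic, "ABSENT", [])   # empty vector instead of a KeyError
```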
Thanks a lot for your help!