I’m importing six text files, each between 5 and 8 GB, using BioJulia’s FASTX.
I’m using the code below to read the data I want into an array for further analysis and processing.
```julia
using FASTX, BioSequences

function getsequences(file, WT)
    open(FASTQ.Reader, file) do reader
        s = []
        q = ExactSearchQuery(dna"N")
        l = length(WT)
        for r in reader
            # Keep reads with no undetermined bases (N) that are at least as long as WT
            if FASTX.FASTQ.seqlen(r) ≥ l && !occursin(q, sequence(r))
                push!(s, convert(LongRNA{2}, sequence(r)))
            end
        end
        return s
    end
end
```
I’m not overly concerned about time, but looking at the allocations (see below), I’m allocating far more memory than the combined size of the files themselves.
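For context, the numbers come from timing the call; a hypothetical invocation (the path and WT sequence here are just placeholders) looks like this:

```julia
using BioSequences

# Hypothetical invocation — the file path and reference sequence are placeholders.
WT = dna"ACGTACGTACGT"
@time seqs = getsequences("reads_sample.fastq", WT)
# @time prints the elapsed time along with the allocation count and total bytes allocated.
```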
I’m still trying to get some ideas squared away in my mind. Is this because I’m not preallocating space? And how would I do that algorithmically, given that line counts and file sizes vary? I know there is a way to pass a size hint that might improve performance. Or is this kind of overhead just inherent to file IO in general?
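For what it’s worth, here is a sketch of what I imagine the size-hinted version could look like. The per-record size estimate (roughly 2·l bytes, since a FASTQ record carries a quality string as long as the sequence, plus headers) and the name `getsequences_hinted` are my own guesses, not anything from FASTX:

```julia
using FASTX, BioSequences

function getsequences_hinted(file, WT)
    open(FASTQ.Reader, file) do reader
        s = LongRNA{2}[]          # concrete element type instead of Vector{Any}
        l = length(WT)
        # Rough capacity hint: each record is at least ~2l bytes on disk
        # (sequence + quality), so this slightly over-estimates the record count.
        sizehint!(s, filesize(file) ÷ (2l))
        q = ExactSearchQuery(dna"N")
        for r in reader
            if FASTX.FASTQ.seqlen(r) ≥ l && !occursin(q, sequence(r))
                push!(s, convert(LongRNA{2}, sequence(r)))
            end
        end
        return s
    end
end
```

Would the concrete element type plus `sizehint!` be enough here, or is there a better pattern?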
Can anybody share some wisdom on this topic?
JB