Large allocations for imported text files

I’m importing 6 text files, each between 5 and 8 GB, using BioJulia’s FASTX.

I’m using the code below to read the data I want into an array for further analysis and processing.

using FASTX
using BioSequences

function getsequences(file, WT)
    open(FASTQ.Reader, file) do reader
        s = []
        q = ExactSearchQuery(dna"N")   # query for undetermined bases
        l = length(WT)
        for r in reader
            # Keep reads with no undetermined bases that are at least as long as WT
            if FASTX.FASTQ.seqlen(r) ≥ l && !occursin(q, sequence(r))
                push!(s, convert(LongRNA{2}, sequence(r)))
            end
        end
        return s
    end
end

I’m not overly concerned about time, but looking at the allocations (see below), I’m allocating far more memory than the total size of the files themselves.

I’m still trying to get some ideas squared away in my mind. Is this because I’m not preallocating space? And how would I do that programmatically, given that the number of lines and the file size vary between files? I know there is a way to pass a hint about the size that might improve performance. I’m also wondering whether this kind of overhead just comes with file I/O in general.
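For example, is something like this what a size hint would look like? (Just a rough sketch on my part, with est_records being a guess rather than a measured number.)

est_records = 10_000_000        # rough guess at the number of records per file
s = []
sizehint!(s, est_records)       # ask the Vector to reserve capacity up front

Would that actually cut the allocations down, or does it only help with the resizing of s itself?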

Can anybody share some wisdom on this topic?

JB

I know of at least two options for checking where allocations originate. First, the easy one: use a profiler (I prefer this one). If you need more detail, you could run Julia with --track-allocation.
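For instance, a minimal sketch of the second option (the script name and the call to getsequences are just placeholders here):

# Run the whole script with line-by-line allocation tracking:
#   julia --track-allocation=user analyze_reads.jl
# After the session ends, Julia writes a .mem file next to each source file,
# attributing allocated bytes to individual lines.

# Or, on Julia 1.8+, sample allocations from within a session:
using Profile
Profile.Allocs.@profile sample_rate=0.01 getsequences("reads.fastq", WT)
Profile.Allocs.fetch()   # inspect the results, e.g. with PProf.jl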


Changing this to an array with a concrete element type may help. From the code, it seems you could initialize it with:

s = LongRNA{2}[]

(it seems from the code that the elements being pushed have that type).
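To illustrate why the concrete element type matters (a toy comparison, not code from the post above):

using BioSequences

s_any = []               # Vector{Any}: element type unknown to the compiler
s_rna = LongRNA{2}[]     # Vector{LongRNA{2}}: concretely typed container

seq = LongRNA{2}("ACGU")
push!(s_any, seq)        # works, but eltype(s_any) stays Any
push!(s_rna, seq)        # same data in a concretely typed vector

eltype(s_any)   # Any -> code that later works on the elements is type-unstable
eltype(s_rna)   # LongSequence{RNAAlphabet{2}} -> the compiler can specialize

With a concrete element type, anything you do with s afterwards can be compiled for that type instead of falling back to dynamic dispatch.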
