I’m importing six text files, each between 5 and 8 GB, using BioJulia’s FASTX.
I’m using the code below to read the data I want into an array for further analysis and processing.
```julia
using FASTX, BioSequences

function getsequences(file, WT)
    open(FASTQ.Reader, file) do reader
        s = []
        q = ExactSearchQuery(dna"N")
        l = length(WT)
        for r in reader
            # Keep reads with no undetermined bases (N) that are at least as long as WT
            if FASTX.FASTQ.seqlen(r) ≥ l && !occursin(q, sequence(r))
                push!(s, convert(LongRNA{2}, sequence(r)))
            end
        end
        return s
    end
end
```
I’m not overly concerned about time, but looking at the allocations (see below), I’m allocating far more memory than the combined size of the files themselves.
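For context, the numbers come from timing the call; a hypothetical invocation (the path and WT sequence here are just placeholders) looks like this:

```julia
using BioSequences

# Hypothetical invocation — the file path and reference sequence are placeholders.
WT = dna"ACGTACGTACGT"
@time seqs = getsequences("reads_sample.fastq", WT)
# @time prints the elapsed time along with the allocation count and total bytes allocated.
```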
I’m still trying to get some ideas squared away in my mind. Is this because I’m not preallocating space? And how would I do that algorithmically, given that line counts and file sizes vary? I know there is a way to pass a size hint that might improve performance. Or is this kind of overhead just inherent to file IO in general?
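For what it’s worth, here is a sketch of what I imagine the size-hinted version could look like. The per-record size estimate (roughly 2·l bytes, since a FASTQ record carries a quality string as long as the sequence, plus headers) and the name `getsequences_hinted` are my own guesses, not anything from FASTX:

```julia
using FASTX, BioSequences

function getsequences_hinted(file, WT)
    open(FASTQ.Reader, file) do reader
        s = LongRNA{2}[]          # concrete element type instead of Vector{Any}
        l = length(WT)
        # Rough capacity hint: each record is at least ~2l bytes on disk
        # (sequence + quality), so this slightly over-estimates the record count.
        sizehint!(s, filesize(file) ÷ (2l))
        q = ExactSearchQuery(dna"N")
        for r in reader
            if FASTX.FASTQ.seqlen(r) ≥ l && !occursin(q, sequence(r))
                push!(s, convert(LongRNA{2}, sequence(r)))
            end
        end
        return s
    end
end
```

Would the concrete element type plus `sizehint!` be enough here, or is there a better pattern?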
Can anybody share some wisdom on this topic?
JB