Missing records in my FASTA file when writing new FASTAs

Lamma · January 4, 2022, 2:15pm

I have a co-assembly and some alignment data telling me if a reads from a sample that helped spawn that co-assembly align to a contig. I am using this to write a fasta file for that sample.

    t1 = Threads.@spawn DataFrame(CSV.File(depth, delim="\t"))
    abundf = fetch(t1)
    rename!(abundf, [:Contig,:trimmed_mean])

    present = @subset!(abundf, :trimmed_mean .!= 0)

    passContig = Set(present[:,"Contig"])

Here I am reading in and removing all rows that have a depth of 0 indicating that no reads from that sample aligned to that contig (where each row is a contig in the co assembly)

reader = FASTA.Reader(GzipDecompressorStream(open(assembly))) #if gzipped
    # Generating the writer to push the output into
    writer = open(FASTA.Writer, "D:/OneDrive - University of Copenhagen/PhD/Projects/Termite_metagenomes/assembly/matriline/bySample/"*sample*".fasta")
    println("Writing to file")
    # Looping over the contigs
    record = FASTA.Record()
    while !eof(reader)
        read!(reader, record)
        if in(FASTA.identifier(record), passContig)
            write(writer, record)
        end
    end

    close(reader)

Next I am opening the co-assembly and going over its headers and seeing if that header is in the passContig list for the given sample.

However the length of passContig is not equal to the resulting number of reads in the fasta being written. This should not be the case as any contig present in the passContig is in the co-assembly.

> length(passContig)
121964

whilst the number of reads in the fasta is: 121955 (generated from grep -c ">" sample.fasta)

Therefore I next tried to identify what contigs were being missed:

reader = FASTA.Reader(GzipDecompressorStream(open(assembly))) #if gzipped
headers= Set([FASTA.identifier(rec) for rec in reader])
close(reader)

for r in passContig
    if !(r in headers)
        println(r)
    elseif r in headers
        println("it is here")
    end
end

However this results in nothing but “it is here”. Would anyone know why this is happening?

kevbonham · January 4, 2022, 4:49pm

Not knowing how the CSV file was generated, my only guess is that it has some headers that aren’t in your fasta. Was there a trimming or filtering step that happened after the CSV was generated, but before you’re doing this filtering?

If it were me debugging, I’d do a setdiff() to find which headers are in the dataframe but not in the fasta file, then start digging around in the file.

Lamma · January 5, 2022, 8:41am

I did many variation of checking (including how you suggested) but in the end it was because I did close the writer!!

jonathanBieler · January 5, 2022, 10:27am

As a side note you could give BioRecordsProcessing a try, I wrote it exactly to reduce the boilerplate when doing this kind of file operations :

out_path = "D:/OneDrive - University of Copenhagen/PhD/Projects/Termite_metagenomes/assembly/matriline/bySample/"

BioRecordsProcessing.process(FASTX.FASTA, assembly, out_path) do record
    in(FASTA.identifier(record), passContig) && return record
    return nothing
end

Topic		Replies	Views
Conditional statement stops file writer New to Julia	2	325	February 23, 2022
Indexing a fasta file with FASTX.jl Biology, Health, and Medicine question , package	1	1085	December 2, 2021
Optimizing performance with FASTX I/O stream and Codon Counting Biology, Health, and Medicine question	5	812	April 26, 2023
Reading/parsing .ab1 sequencing files Biology, Health, and Medicine question	10	3410	May 29, 2019
Possible to subset XAM BAM reader object by contig? Biology, Health, and Medicine question	5	356	July 18, 2023

Missing records in my FASTA file when writing new FASTAs

Related topics