Using readlines with large text files

I’m trying to read a genome string file into Julia using readlines() and getting some very strange results. The genome file consists of a gunzipped text file with a descriptive first line followed by a large number of strings of length 60 characters consisting of only ‘A’,‘G’,‘C’ and ‘T’. It should not contain a ‘Y’ and if I read the file into joe editor in Bash and perform a search there are no ‘Y’ characters.

So I read this file into Julia using:

function opengenome(fafile::String)
	pot = open(fafile)
	potgen = readlines(pot)
	return potgen[2:end]
end

The head of the result is like:

gen = opengenome(genome_athal)
1994471-element Vector{String}:
 "CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT"
 "CTTTAAATCCTACATCCATGAATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTT"
 "CTCTGGTTGAAAATCATTGTGTATATAATGATAATTTTATCGTTTTTATGTAATTGCTTA"
 "TTGTTGTGTGTAGATTTTTTAAAAATATCATTTGAGGTCAATACAAATCCTATTTCTTGT"

and all looks good.

But there is a problem. If I now do a search for ‘Y’ like so:

for i in 1:length(gen);
 if count('Y',gen[i]) > 0; 
    println(gen[i]); 
 end; 
end

The head of the output (there are 86 occurrences in the stored vector) is like:

GGTGACAAAGTTCCCGGCCAGTGYGTTTGCGGGTAACGACTGTCTTTGTGGCTCTCCACT
NNNNNNNNNNNNNNMTTWTKKCSNYTCYASTTWTTKMRWYTSWAKGWTWWWMWAMWSAWY
AAKWMAMWWWRSAYTAMRWMAAYWYRAACCAMGMWWMYTCAWRMYTCTCWWYKYTWTGAT
KSTSAACSCKWWGWTCTTAAMMSYKWWKKKYTTWRMAKYKTWTRRYWWKGAAKCRYTWMW
YWASKMWKCGSKAYYTYRYTSWKSKKSWWSKYKTWKAKKYWMTMRKWYWMWWKWCAWYYA

It looks like Julia might be stumbling over itself at some point. Maybe I am missing something simple?

Is it possible that joe does not search through the whole file but only through some loaded buffer? Have you tried grep Y yourfile?

Thanks, grep Y shows the “bad” sections in the original file.

In the interests of completeness, I should add that in fact the original file appears to be correct. I downloaded the original file a few more times and produced the same answer each time. Y, along with many other characters, are valid in a genome string in addition to the ACGT. See the Wikipedia entry for FASTA file for further information.

You might be interested in GitHub - BioJulia/FASTX.jl: Parse and process FASTA and FASTQ formatted files of biological sequences. and other packages from the BioJulia org

1 Like