I’m trying to read a genome sequence file into Julia using readlines() and I’m getting some very strange results. The genome file is a gunzipped text file with a descriptive first line followed by a large number of 60-character lines consisting only of ‘A’, ‘G’, ‘C’ and ‘T’. It should not contain a ‘Y’, and if I open the file in the joe editor from Bash and search for ‘Y’, there are no matches.
So I read this file into Julia using:
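function opengenome(fafile::String)
	# readlines(path) opens the file, reads every line, and closes it again,
	# so no file handle is left open
	potgen = readlines(fafile)
	return potgen[2:end]  # drop the descriptive first line
end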
The head of the result is like:
gen = opengenome(genome_athal)
1994471-element Vector{String}:
 "CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT"
 "CTTTAAATCCTACATCCATGAATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTT"
 "CTCTGGTTGAAAATCATTGTGTATATAATGATAATTTTATCGTTTTTATGTAATTGCTTA"
 "TTGTTGTGTGTAGATTTTTTAAAAATATCATTTGAGGTCAATACAAATCCTATTTCTTGT"
and all looks good.
But there is a problem. If I now do a search for ‘Y’ like so:
for line in gen
    # print every stored line that contains at least one 'Y'
    if occursin('Y', line)
        println(line)
    end
end
The head of the output (there are 86 occurrences in the stored vector) is like:
GGTGACAAAGTTCCCGGCCAGTGYGTTTGCGGGTAACGACTGTCTTTGTGGCTCTCCACT
NNNNNNNNNNNNNNMTTWTKKCSNYTCYASTTWTTKMRWYTSWAKGWTWWWMWAMWSAWY
AAKWMAMWWWRSAYTAMRWMAAYWYRAACCAMGMWWMYTCAWRMYTCTCWWYKYTWTGAT
KSTSAACSCKWWGWTCTTAAMMSYKWWKKKYTTWRMAKYKTWTRRYWWKGAAKCRYTWMW
YWASKMWKCGSKAYYTYRYTSWKSKKSWWSKYKTWKAKKYWMTMRKWYWMWWKWCAWYYA
It looks like Julia might be stumbling over itself at some point. Maybe I am missing something simple?
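To cross-check this against the editor, it may help to know exactly which file lines contain unexpected characters and how often each character occurs. The following is only a minimal sketch: `find_unexpected` and `tally_chars` are hypothetical helper names, and it assumes that only the first line is descriptive and that every other line should contain nothing but A, C, G and T.

```julia
# Hypothetical helper: collect (file_line_number, line) for every line that
# contains a character outside the allowed set. Assumes only line 1 is a header.
function find_unexpected(path::AbstractString; allowed = Set("ACGT"))
    hits = Tuple{Int,String}[]
    for (n, line) in enumerate(eachline(path))
        n == 1 && continue                  # skip the descriptive first line
        if any(c -> c ∉ allowed, line)
            push!(hits, (n, line))
        end
    end
    return hits
end

# Hypothetical helper: count how often each character occurs in a collection of lines.
function tally_chars(lines)
    counts = Dict{Char,Int}()
    for line in lines, c in line
        counts[c] = get(counts, c, 0) + 1
    end
    return counts
end
```

Calling `find_unexpected(genome_athal)` gives the line numbers as they appear in the file, so the same lines can be opened directly in the editor, and `tally_chars(last.(find_unexpected(genome_athal)))` shows which characters occur on those lines.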