I’m trying to read a genome string file into Julia using readlines()
and getting some very strange results. The genome file consists of a gunzipped text file with a descriptive first line followed by a large number of strings of length 60 characters consisting of only ‘A’,‘G’,‘C’ and ‘T’. It should not contain a ‘Y’ and if I read the file into joe editor in Bash and perform a search there are no ‘Y’ characters.
So I read this file into Julia using:
function opengenome(fafile::String)
pot = open(fafile)
potgen = readlines(pot)
return potgen[2:end]
end
The head of the result is like:
gen = opengenome(genome_athal)
1994471-element Vector{String}:
"CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT"
"CTTTAAATCCTACATCCATGAATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTT"
"CTCTGGTTGAAAATCATTGTGTATATAATGATAATTTTATCGTTTTTATGTAATTGCTTA"
"TTGTTGTGTGTAGATTTTTTAAAAATATCATTTGAGGTCAATACAAATCCTATTTCTTGT"
and all looks good.
But there is a problem. If I now do a search for ‘Y’ like so:
for i in 1:length(gen);
if count('Y',gen[i]) > 0;
println(gen[i]);
end;
end
The head of the output (there are 86 occurrences in the stored vector) is like:
GGTGACAAAGTTCCCGGCCAGTGYGTTTGCGGGTAACGACTGTCTTTGTGGCTCTCCACT
NNNNNNNNNNNNNNMTTWTKKCSNYTCYASTTWTTKMRWYTSWAKGWTWWWMWAMWSAWY
AAKWMAMWWWRSAYTAMRWMAAYWYRAACCAMGMWWMYTCAWRMYTCTCWWYKYTWTGAT
KSTSAACSCKWWGWTCTTAAMMSYKWWKKKYTTWRMAKYKTWTRRYWWKGAAKCRYTWMW
YWASKMWKCGSKAYYTYRYTSWKSKKSWWSKYKTWKAKKYWMTMRKWYWMWWKWCAWYYA
It looks like Julia might be stumbling over itself at some point. Maybe I am missing something simple?