Using readlines with large text files

Colin_Beckingham · October 15, 2024, 5:20pm

I’m trying to read a genome string file into Julia using readlines() and getting some very strange results. The genome file consists of a gunzipped text file with a descriptive first line followed by a large number of strings of length 60 characters consisting of only ‘A’,‘G’,‘C’ and ‘T’. It should not contain a ‘Y’ and if I read the file into joe editor in Bash and perform a search there are no ‘Y’ characters.

So I read this file into Julia using:

function opengenome(fafile::String)
	pot = open(fafile)
	potgen = readlines(pot)
	return potgen[2:end]
end

The head of the result is like:

gen = opengenome(genome_athal)
1994471-element Vector{String}:
 "CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAAT"
 "CTTTAAATCCTACATCCATGAATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTT"
 "CTCTGGTTGAAAATCATTGTGTATATAATGATAATTTTATCGTTTTTATGTAATTGCTTA"
 "TTGTTGTGTGTAGATTTTTTAAAAATATCATTTGAGGTCAATACAAATCCTATTTCTTGT"

and all looks good.

But there is a problem. If I now do a search for ‘Y’ like so:

for i in 1:length(gen);
 if count('Y',gen[i]) > 0; 
    println(gen[i]); 
 end; 
end

The head of the output (there are 86 occurrences in the stored vector) is like:

GGTGACAAAGTTCCCGGCCAGTGYGTTTGCGGGTAACGACTGTCTTTGTGGCTCTCCACT
NNNNNNNNNNNNNNMTTWTKKCSNYTCYASTTWTTKMRWYTSWAKGWTWWWMWAMWSAWY
AAKWMAMWWWRSAYTAMRWMAAYWYRAACCAMGMWWMYTCAWRMYTCTCWWYKYTWTGAT
KSTSAACSCKWWGWTCTTAAMMSYKWWKKKYTTWRMAKYKTWTRRYWWKGAAKCRYTWMW
YWASKMWKCGSKAYYTYRYTSWKSKKSWWSKYKTWKAKKYWMTMRKWYWMWWKWCAWYYA

It looks like Julia might be stumbling over itself at some point. Maybe I am missing something simple?

barucden · October 15, 2024, 5:23pm

Is it possible that joe does not search through the whole file but only through some loaded buffer? Have you tried grep Y yourfile?

Colin_Beckingham · October 15, 2024, 5:34pm

Thanks, grep Y shows the “bad” sections in the original file.

In the interests of completeness, I should add that in fact the original file appears to be correct. I downloaded the original file a few more times and produced the same answer each time. Y, along with many other characters, are valid in a genome string in addition to the ACGT. See the Wikipedia entry for FASTA file for further information.

barucden · October 15, 2024, 6:37pm

You might be interested in GitHub - BioJulia/FASTX.jl: Parse and process FASTA and FASTQ formatted files of biological sequences. and other packages from the BioJulia org

Topic		Replies	Views
Something like `grep -q String Filename` in Julia? General Usage	6	368	November 30, 2023
Trying to wrap my head around julia's file handling (still!) New to Julia	4	429	August 26, 2022
Tests success on local but fails in Travis CI General Usage	4	628	July 15, 2018
Julia simple String process and file I/O is slow Performance performance , parsing	12	2941	December 23, 2020
Reading a file from line x to line y General Usage csv	28	741	May 22, 2024

Using readlines with large text files

Related topics