I have a fast file that I loaded with FASTX and it seems that the data is stored as a LongDNA{4}. There are no ambiguities in the reads and I’d like to work with them as 2 bit sequences but I get an error when I try to convert these using BioSequences.
Is there someway to store them as 2bit using the FASTQ.Reader?
Hi may be you can try:
using BioSequences, FASTX # import required Bioinformatics packages
# import .fastq file
seqfile = FASTX.FASTQ.Reader(open("test.fastq","r"))
# get DNA sequences from file
for record in seqfile
dna_seq = FASTX.FASTQ.sequence(record)
end
#create a function to convert LongSequence{DNAAlphabet{4}} to 2 bit using a dictionary
function convert_to_2bit(dna_seq::LongSequence{DNAAlphabet{4}})
# Define a dictionary to map DNA bases to 2-bit codes
bit_dict = Dict('A' => 0b00, 'C' => 0b01, 'G' => 0b10, 'T' => 0b11)
# Convert the DNA sequence to a string and split into individual bases
dna_str = string(dna_seq)
bases = collect(dna_str)
# Convert each base to a 2-bit code and concatenate into a bitstring
bits = BitVector(undef, 2*length(bases))
for (i, base) in enumerate(bases)
code = bit_dict[base]
bits[2*(i-1) + 1] = code >> 1
bits[2*(i-1) + 2] = code & 0b01
end
return bits
end
#Calling the function to convert dna_seq and store it in two_bit_seq
two_bit_seq = convert_to_2bit(dna_seq)
two_bit_seq