FASTQ is a file format used for DNA sequencing, and we have a handy reader in the FASTX package.
using FASTX
testfile = "some/path.fastq"
for record in FASTQ.Reader(open(testfile))
    seq = FASTQ.sequence(record)
    # ...
end
These files are often quite large (I’m currently working with several hundred files that are ~500 MB each), but they usually come gzipped, which makes them ~10x smaller. Unfortunately, the FASTQ reader from FASTX doesn’t seem to be able to take the stream from GZip.jl:
using FASTX
using GZip  # provides GZip.open and the GZipStream type

testfilegz = "some/path.fastq.gz"
for record in FASTQ.Reader(GZip.open(testfilegz))
    seq = FASTQ.sequence(record)
    # ...
end
gives:
ERROR: LoadError: MethodError: no method matching isopen(::GZipStream)
Closest candidates are:
isopen(::Mmap.Anonymous) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Mmap/src/Mmap.jl:41
isopen(::Base.Filesystem.File) at filesystem.jl:94
isopen(::Base.BufferStream) at stream.jl:1217
...
Stacktrace:
[1] TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,GZipStream}(::TranscodingStreams.Noop, ::GZipStream, ::TranscodingStreams.State, ::Bool) at /home/kevin/.julia/packages/TranscodingStreams/MsN8d/src/stream.jl:25
[2] TranscodingStreams.TranscodingStream(::TranscodingStreams.Noop, ::GZipStream, ::TranscodingStreams.State; initialized::Bool) at /home/kevin/.julia/packages/TranscodingStreams/MsN8d/src/stream.jl:39
[3] TranscodingStreams.TranscodingStream(::TranscodingStreams.Noop, ::GZipStream, ::TranscodingStreams.State) at /home/kevin/.julia/packages/TranscodingStreams/MsN8d/src/stream.jl:39
[4] TranscodingStreams.TranscodingStream(::TranscodingStreams.Noop, ::GZipStream; bufsize::Int64, sharedbuf::Bool) at /home/kevin/.julia/packages/TranscodingStreams/MsN8d/src/noop.jl:41
[5] TranscodingStreams.TranscodingStream(::TranscodingStreams.Noop, ::GZipStream) at /home/kevin/.julia/packages/TranscodingStreams/MsN8d/src/noop.jl:34
[6] TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,S} where S<:IO(::GZipStream; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/kevin/.julia/packages/TranscodingStreams/MsN8d/src/noop.jl:28
[7] TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,S} where S<:IO(::GZipStream) at /home/kevin/.julia/packages/TranscodingStreams/MsN8d/src/noop.jl:28
[8] FASTX.FASTQ.Reader(::GZipStream; fill_ambiguous::Nothing) at /home/kevin/.julia/packages/FASTX/wcfDB/src/fastq/reader.jl:25
[9] FASTX.FASTQ.Reader(::GZipStream) at /home/kevin/.julia/packages/FASTX/wcfDB/src/fastq/reader.jl:19
[10] top-level scope at /augusta/students/danielle/resampling/subsample.jl:33
Based on the stacktrace, the problem seems to arise in TranscodingStreams.jl, which FASTX uses under the hood: the TranscodingStream constructor calls isopen on the stream it wraps, and there is no isopen method for GZipStream. I’m not familiar enough with the I/O packages to know for sure, though. Where should this method be added, and would it be straightforward to add? Both packages are pretty lightweight on their own, and I’m guessing neither would want to take a dependency on the other.
Or is there a way to do this in FASTX.jl without committing type piracy?
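One route that might sidestep the missing method entirely is sketched below. It assumes CodecZlib.jl (not used anywhere above) is an acceptable extra dependency: its GzipDecompressorStream is itself a TranscodingStream, so the reader should not need to wrap a GZipStream at all. The temporary file and the single made-up record exist only to keep the sketch self-contained, and FASTQ.sequence is used as in the snippets above, so the accessor may differ in other FASTX versions.

```julia
using FASTX      # FASTQ.Reader, as in the snippets above
using CodecZlib  # assumption: extra dependency, provides the Gzip*Stream types

# Write a tiny gzipped FASTQ file so the sketch runs on its own
# (hypothetical path and record, for illustration only).
testfilegz = joinpath(mktempdir(), "example.fastq.gz")
open(GzipCompressorStream, testfilegz, "w") do io
    write(io, "@read1\nACGT\n+\nIIII\n")
end

# Wrap the plain file handle in a decompressor stream and hand that to the
# reader, instead of passing a GZipStream from GZip.jl.
reader = FASTQ.Reader(GzipDecompressorStream(open(testfilegz)))
nrecords = 0
for record in reader
    seq = FASTQ.sequence(record)
    # ...
    global nrecords += 1
end
close(reader)  # also closes the underlying streams
```

This would not answer where an isopen method for GZipStream belongs, but it would avoid type piracy in either package.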