Streaming gzipped file to FASTQ.Reader - where to add method?

FASTQ is a file format used for DNA sequencing, and we have a handy reader in the FASTX package.

using FASTX
testfile = "some/path.fastq"

for record in FASTQ.Reader(open(testfile))
    seq = FASTQ.sequence(record)
    # ...
end

These files are often quite large (I’m currently working with several hundred files that are ~500 MB each), but they usually come gzipped (and are ~10x smaller). Unfortunately, the FASTQ reader from FASTX doesn’t seem to accept the stream from GZip.jl:

using GZip
testfilegz = "some/path.fastq.gz"

for record in FASTQ.Reader(GZip.open(testfilegz))
    seq = FASTQ.sequence(record)
    # ...
end

gives:

ERROR: LoadError: MethodError: no method matching isopen(::GZipStream)
Closest candidates are:
  isopen(::Mmap.Anonymous) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Mmap/src/Mmap.jl:41
  isopen(::Base.Filesystem.File) at filesystem.jl:94
  isopen(::Base.BufferStream) at stream.jl:1217
  ...

Stacktrace:
 [1] TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,GZipStream}(::TranscodingStreams.Noop, ::GZipStream, ::TranscodingStreams.State, ::Bool) at /home/kevin/.julia/packages/TranscodingStreams/MsN8d/src/stream.jl:25
 [2] TranscodingStreams.TranscodingStream(::TranscodingStreams.Noop, ::GZipStream, ::TranscodingStreams.State; initialized::Bool) at /home/kevin/.julia/packages/TranscodingStreams/MsN8d/src/stream.jl:39
 [3] TranscodingStreams.TranscodingStream(::TranscodingStreams.Noop, ::GZipStream, ::TranscodingStreams.State) at /home/kevin/.julia/packages/TranscodingStreams/MsN8d/src/stream.jl:39
 [4] TranscodingStreams.TranscodingStream(::TranscodingStreams.Noop, ::GZipStream; bufsize::Int64, sharedbuf::Bool) at /home/kevin/.julia/packages/TranscodingStreams/MsN8d/src/noop.jl:41
 [5] TranscodingStreams.TranscodingStream(::TranscodingStreams.Noop, ::GZipStream) at /home/kevin/.julia/packages/TranscodingStreams/MsN8d/src/noop.jl:34
 [6] TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,S} where S<:IO(::GZipStream; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/kevin/.julia/packages/TranscodingStreams/MsN8d/src/noop.jl:28
 [7] TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,S} where S<:IO(::GZipStream) at /home/kevin/.julia/packages/TranscodingStreams/MsN8d/src/noop.jl:28
 [8] FASTX.FASTQ.Reader(::GZipStream; fill_ambiguous::Nothing) at /home/kevin/.julia/packages/FASTX/wcfDB/src/fastq/reader.jl:25
 [9] FASTX.FASTQ.Reader(::GZipStream) at /home/kevin/.julia/packages/FASTX/wcfDB/src/fastq/reader.jl:19
 [10] top-level scope at /augusta/students/danielle/resampling/subsample.jl:33

Based on the stacktrace, it seems to be an issue with TranscodingStreams.jl, which FASTX uses under the hood, but I’m not familiar enough with the I/O packages to know for sure. Where should the missing isopen(::GZipStream) method be added, and would it be straightforward to add? Both packages are pretty lightweight on their own, and I’m guessing neither would want to take a dependency on the other.

Or is there a way to do this in FASTX.jl without committing type piracy?

Have you tried CodecZlib? I use it all the time to read gzip files directly as if I were reading an uncompressed file.

A side benefit is that it is the fastest reader I’ve found for gzip compressed files.

Oops, may not work for you as it is also based on TranscodingStreams. Would still be worth trying it.
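
For reference, here is a minimal sketch of what "reading a gzip file as if it were uncompressed" can look like with CodecZlib. It is an illustration rather than anything specific to FASTX: the temporary file and its two lines of text are made up for the example, and the only assumption is that CodecZlib is installed.

```julia
using CodecZlib

# Create a small gzip file to read back (temporary path, for illustration only).
path = tempname() * ".gz"
out = GzipCompressorStream(open(path, "w"))
write(out, "line one\nline two\n")
close(out)  # finishes the compressed data and closes the underlying file

# Wrap a plain file handle; the wrapper behaves like an ordinary IO object.
stream = GzipDecompressorStream(open(path))
lines = try
    collect(eachline(stream))
finally
    close(stream)  # also closes the wrapped file
end
```

Because the decompressor stream is just another IO, anything that accepts an IO should in principle be able to read from it; closing it in a `finally` block avoids leaking file handles when looping over many files.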


Nope! Didn’t know about it - thanks!

using CodecZlib

for record in FASTQ.Reader(GzipDecompressorStream(open(testfilegz)))
    # ...
end

works :smiley:
