I figured out what the extra three bytes are: they are the first three bytes of the big-endian Adler-32 checksum that ends a zlib stream. The fourth byte, 0x4a, got truncated. Additionally, if we prepend the zlib header bytes 0x78 0x9c and restore the missing 0x4a, the data becomes a proper zlib stream.
using CodecZlib
data = read("deflate_bug_demo.bin")
zlibstream = vcat(UInt8[0x78, 0x9c], data, UInt8[0x4a]) # header + file + missing Adler byte
out = transcode(ZlibDecompressor, zlibstream) # 16572 bytes
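For reference, here is a minimal sketch of why 0x78 0x9c is a plausible header to prepend. It decodes the two header bytes following my reading of the zlib format (RFC 1950). The function name `describe_zlib_header` and the field names are made up for illustration and are not part of CodecZlib.

```julia
# Decode the two-byte zlib header: CMF followed by FLG (per RFC 1950).
# `describe_zlib_header` is a hypothetical helper, not a library function.
function describe_zlib_header(cmf::UInt8, flg::UInt8)
    return (
        method      = cmf & 0x0f,         # CM: 8 means DEFLATE
        window_log2 = (cmf >> 4) + 8,     # CINFO + 8: log2 of the window size
        has_dict    = (flg & 0x20) != 0,  # FDICT: preset dictionary flag
        level       = flg >> 6,           # FLEVEL: compression-level hint
        check_ok    = ((UInt16(cmf) << 8) | flg) % 31 == 0,  # FCHECK rule
    )
end

hdr = describe_zlib_header(0x78, 0x9c)
println(hdr)
```

The header carries no length or checksum of its own, so the same two bytes should work for any raw DEFLATE stream that uses a 32 KiB window and no preset dictionary.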
We can compute the Adler-32 checksum:
using Zlib_jll
# Pure-Julia Adler-32 trailer: 4 bytes, big-endian, over the *uncompressed* data.
function adler32_trailer(data)
    a, b = 1, 0
    for byte in data
        a = (a + byte) % 65521
        b = (b + a) % 65521
    end
    checksum = (b << 16) | a
    return [UInt8(checksum >> s & 0xff) for s in (24, 16, 8, 0)] # big-endian
end
# zlib-backed Adler-32 trailer: 4 bytes, big-endian, over the *uncompressed* data.
function adler32_trailer_zlib(data)
    checksum = ccall((:adler32, libz), Culong,
                     (Culong, Ptr{UInt8}, Cuint), 1, data, length(data))
    return [UInt8(checksum >> s & 0xff) for s in (24, 16, 8, 0)] # big-endian
end
# Read the decompressed output and compute the trailer with both implementations.
out = read("deflate_bug_demo.out")
t_julia = adler32_trailer(out)
t_zlib = adler32_trailer_zlib(out)
println("input: deflate_bug_demo.out (", length(out), " bytes)")
println("adler32_trailer (pure Julia) = ", bytes2hex(t_julia))
println("adler32_trailer_zlib (Zlib_jll) = ", bytes2hex(t_zlib))
println("match: ", t_julia == t_zlib)
Here is the output:
input: deflate_bug_demo.out (16572 bytes)
adler32_trailer (pure Julia) = f31b334a
adler32_trailer_zlib (Zlib_jll) = f31b334a
match: true
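To make the "truncated trailer" claim checkable, here is a small sketch that tests whether a run of trailing bytes is a proper prefix of the 4-byte trailer. The helper name `is_truncated_trailer` is hypothetical, and the `tail` value is an assumption inferred from the checksum printed above. With the real file it would be `data[end-2:end]`.

```julia
# Hypothetical helper: is `tail` a proper prefix of the 4-byte Adler-32 trailer?
is_truncated_trailer(tail, trailer) =
    length(tail) < length(trailer) && tail == trailer[1:length(tail)]

trailer = hex2bytes("f31b334a")   # trailer value reported in the output above
tail = UInt8[0xf3, 0x1b, 0x33]    # assumed: the three extra bytes at the end of the file

println(is_truncated_trailer(tail, trailer))
```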
Is this a bug? Or is it a bug that other implementations did not report the extra three bytes?