Julia CodecZlib fails while Python succeeds

Hi all,

I have been trying to read Avro files in Julia using Avro.jl, and I received an error. I used Claude to track down the error, and it thinks that the problem is actually in CodecZlib. I asked it to extract an MWE, and the MWE indeed shows the difference.

The following is the Python code:

import zlib
compressed_file = "deflate_bug_demo.bin"
with open(compressed_file, "rb") as f:
    data = f.read()
result = zlib.decompress(data, -15)

And the corresponding Julia code:

using CodecZlib
compressed_file = "deflate_bug_demo.bin"
data = read(compressed_file)
result = transcode(DeflateDecompressor, data)

The data in deflate_bug_demo.bin are linked below. Does anyone (@quinnj) know what could be going wrong?

I use Julia 12.6 with the latest version of CodecZlib, v0.7.8.

I have put the data on Google Drive at https://drive.google.com/file/d/1BLExE_2pk_XLS3IhQe5LEM6kMzXczT-g/view?usp=sharing , since I cannot attach the file here.

Thanks a lot for your help.

The CodecZlib call looks fine and I can independently verify that the data is good:

julia> import Inflate

julia> length(Inflate.inflate(read("/home/gunnar/Downloads/deflate_bug_demo.bin")))
16572

I can also reproduce the problem, but I have no idea where or why things go wrong.

Please report the output of versioninfo(). I’ll look into it.

That -15 is a big clue: a negative wbits value tells zlib to expect a raw DEFLATE stream, without gzip or zlib headers. The file also has three extra bytes at the end. CodecZlib correctly decompresses the first 6881 bytes, but because it also supports concatenated streams, it then tries to decompress the remaining three bytes as a second stream, and those bytes are not a valid stream.

  using CodecZlib

  data = read("deflate_bug_demo.bin")          # 6884 bytes total

  # The real raw-DEFLATE stream is only the first 6881 bytes; the last
  # 3 bytes (f3 1b 33) are trailing junk that CodecZlib mis-reads as a
  # second concatenated stream, throwing "invalid literal/length code".
  stream = data[1:6881]

  out = transcode(DeflateDecompressor, stream)  # succeeds → 16572 bytes
  write("deflate_bug_demo.out", out)

Claude Code assisted with the evaluation of the problem and the generation of the code example.

I figured out what the extra three bytes are: they are the first three bytes of a big-endian Adler-32 checksum, whose last byte, 0x4a, was cut off. Additionally, if we add the bytes 0x78 0x9c to the beginning of the stream and append the missing 0x4a, we get a proper zlib stream.

  using CodecZlib
  data = read("deflate_bug_demo.bin")
  zlibstream = vcat(UInt8[0x78, 0x9c], data, UInt8[0x4a])   # header + file + missing Adler byte
  out = transcode(ZlibDecompressor, zlibstream)             # 16572 bytes
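For anyone who wants to see the framing without the file, here is a minimal Python sketch of the same repair on synthetic data (not the data from this thread). The payload and variable names are made up for illustration, and it assumes `zlib.compress` is called with default settings, so the output has a 2-byte header and no preset dictionary.

```python
import zlib

# Synthetic payload, NOT the contents of deflate_bug_demo.bin.
payload = b"avro block contents " * 500

# A zlib stream is: 2-byte header + raw DEFLATE data + 4-byte Adler-32 trailer.
full = zlib.compress(payload)
header, raw, trailer = full[:2], full[2:-4], full[-4:]

# The trailer is the big-endian Adler-32 of the *uncompressed* data.
assert trailer == zlib.adler32(payload).to_bytes(4, "big")

# Read as a big-endian 16-bit number, a zlib header is a multiple of 31 (RFC 1950).
assert int.from_bytes(header, "big") % 31 == 0
assert int.from_bytes(b"\x78\x9c", "big") % 31 == 0

# Mimic the file in this thread: raw DEFLATE followed by a truncated (3-byte) checksum.
damaged = raw + trailer[:3]

# Repair it as in the Julia snippet: prepend 0x78 0x9c, append the missing checksum byte.
repaired = b"\x78\x9c" + damaged + trailer[3:]
restored = zlib.decompress(repaired)
assert restored == payload
```

The point is only that the "junk" is what remains of a zlib trailer once the header and the last checksum byte are lost.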

We can compute the Adler-32 checksum:

  using Zlib_jll

  # Pure-Julia Adler-32 trailer: 4 bytes, big-endian, over the *uncompressed* data.
  function adler32_trailer(data)
      a, b = 1, 0
      for byte in data
          a = (a + byte) % 65521
          b = (b + a)    % 65521
      end
      checksum = (b << 16) | a
      return [UInt8(checksum >> s & 0xff) for s in (24, 16, 8, 0)]  # big-endian
  end

  # zlib-backed Adler-32 trailer: 4 bytes, big-endian, over the *uncompressed* data.
  function adler32_trailer_zlib(data)
      checksum = ccall((:adler32, libz), Culong,
                       (Culong, Ptr{UInt8}, Cuint), 1, data, length(data))
      return [UInt8(checksum >> s & 0xff) for s in (24, 16, 8, 0)]  # big-endian
  end

  # Read the decompressed output and compute the trailer with both implementations.
  out = read("deflate_bug_demo.out")
  t_julia = adler32_trailer(out)
  t_zlib  = adler32_trailer_zlib(out)

  println("input: deflate_bug_demo.out  (", length(out), " bytes)")
  println("adler32_trailer       (pure Julia) = ", bytes2hex(t_julia))
  println("adler32_trailer_zlib  (Zlib_jll)   = ", bytes2hex(t_zlib))
  println("match: ", t_julia == t_zlib)

Here is the output:

  input: deflate_bug_demo.out  (16572 bytes)
  adler32_trailer       (pure Julia) = f31b334a
  adler32_trailer_zlib  (Zlib_jll)   = f31b334a
  match: true

Is this a bug in CodecZlib? Or is it a bug that the other implementations (Python's zlib, Inflate.jl) did not report the extra three bytes?
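On the Python side, as far as I know, `zlib.decompress` stops at the end of the first complete stream and silently drops whatever follows, while a `decompressobj` keeps the leftover bytes in `unused_data`. A minimal sketch on synthetic data (again not the file from this thread; the three junk bytes are just reused from above as an example):

```python
import zlib

# Synthetic payload, for illustration only.
payload = b"some repetitive example data " * 200

# Build a raw DEFLATE stream (wbits=-15 means no zlib/gzip header or trailer).
comp = zlib.compressobj(wbits=-15)
raw = comp.compress(payload) + comp.flush()

# Append three trailing bytes, as in the file discussed here.
junk = bytes([0xF3, 0x1B, 0x33])
data = raw + junk

# One-shot decompression succeeds and ignores the trailing bytes.
assert zlib.decompress(data, -15) == payload

# A decompression object reports them instead of discarding them.
d = zlib.decompressobj(wbits=-15)
out = d.decompress(data)
leftover = d.unused_data
print("end of stream reached:", d.eof)
print("unused trailing bytes:", leftover.hex())
```

So a caller who cares about trailing bytes can detect them in Python, but has to ask for them explicitly.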

Issue created: `DeflateDecompressor`/`ZlibDecompressor` error on trailing bytes after a complete stream (concatenated-stream policy), Issue #107 in JuliaIO/CodecZlib.jl on GitHub.