Julia CodecZlib fails while Python succeeds

Hi all,

I have been trying to read Avro files in Julia using Avro.jl, and I got an error. I used Claude to track down the error and create an MWE, and it thinks the problem is actually in CodecZlib. I asked it to extract an MWE, and it does indeed show the difference.

The following is the Python code:

import zlib
compressed_file = "deflate_bug_demo.bin"
with open(compressed_file, "rb") as f:
    data = f.read()
result = zlib.decompress(data, -15)

And the corresponding Julia code:

using CodecZlib
compressed_file = "deflate_bug_demo.bin"
data = read(compressed_file)
result = transcode(DeflateDecompressor, data)

The data in deflate_bug_demo.bin are linked below. Does anyone (@quinnj) know what could be going wrong?

I use Julia 1.12.6 with the latest version of CodecZlib, v0.7.8.

I have put the data on Google Drive at https://drive.google.com/file/d/1BLExE_2pk_XLS3IhQe5LEM6kMzXczT-g/view?usp=sharing since I cannot attach the file here.

Thanks a lot for the help.

The CodecZlib call looks fine and I can independently verify that the data is good:

julia> import Inflate

julia> length(Inflate.inflate(read("/home/gunnar/Downloads/deflate_bug_demo.bin")))
16572

I can also reproduce the problem, but I have no idea where or why things go wrong.

Please report the output of versioninfo(). I’ll look into it.

That -15 is a big clue: a negative wbits value tells zlib to expect a raw DEFLATE stream without proper gzip or zlib headers. The file also has three extra bytes at the end. CodecZlib correctly decompresses the first 6881 bytes, but because it also supports concatenated streams, it then tries to decompress the remaining three bytes as a second stream, and those bytes are not a valid stream.

  using CodecZlib

  data = read("deflate_bug_demo.bin")          # 6884 bytes total

  # The real raw-DEFLATE stream is only the first 6881 bytes; the last
  # 3 bytes (f3 1b 33) are trailing junk that CodecZlib mis-reads as a
  # second concatenated stream, throwing "invalid literal/length code".
  stream = data[1:6881]

  out = transcode(DeflateDecompressor, stream)  # succeeds → 16572 bytes
  write("deflate_bug_demo.out", out)

Claude Code assisted with the evaluation of the problem and the generation of the code example.
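To see from the Python side why `zlib.decompress(data, -15)` raised no error, here is a minimal sketch using only the standard `zlib` module. It builds a small raw DEFLATE stream from made-up data (not the file from this thread), appends three stray bytes, and shows that the one-shot call returns the payload without complaint, while a `decompressobj` exposes the leftover bytes through `unused_data`. The payload and variable names are assumptions for illustration only.

```python
import zlib

# Made-up payload standing in for the real data (assumption, not the thread's file).
payload = b"example avro record " * 500

# wbits=-15 produces a raw DEFLATE stream: no zlib/gzip header, no checksum.
compressor = zlib.compressobj(wbits=-15)
raw = compressor.compress(payload) + compressor.flush()

# Simulate the file: a complete raw stream followed by three stray bytes.
junk = bytes([0xF3, 0x1B, 0x33])
blob = raw + junk

# The one-shot call stops at the end of the DEFLATE stream and says nothing
# about what follows it.
one_shot = zlib.decompress(blob, -15)

# A decompression object reports the bytes left over after the stream ended.
d = zlib.decompressobj(wbits=-15)
streamed = d.decompress(blob)

print(len(one_shot), len(streamed), d.eof, d.unused_data.hex())
```

A reader that wants to be strict could raise whenever `d.unused_data` is non-empty; a lenient one could just ignore it, which is effectively what the one-shot call does.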

I figured out what the extra three bytes are. They are the first three bytes of a big-endian Adler-32 checksum; the fourth byte, 0x4a, was cut off. Additionally, if we prepend the bytes 0x78 0x9c to the stream and append the missing 0x4a, the result is a proper zlib stream.

  using CodecZlib
  data = read("deflate_bug_demo.bin")
  zlibstream = vcat(UInt8[0x78, 0x9c], data, UInt8[0x4a])   # header + file + missing Adler byte
  out = transcode(ZlibDecompressor, zlibstream)             # 16572 bytes
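The framing described above (two header bytes, the raw DEFLATE data, then a big-endian Adler-32 of the uncompressed data) can also be checked with Python's standard `zlib` module. This is a sketch on made-up data, not on the file from this thread; it also shows what a stream looks like when the last checksum byte is cut off.

```python
import struct
import zlib

payload = b"example avro record " * 500  # made-up data (assumption)

# Raw DEFLATE stream, as produced with wbits=-15.
compressor = zlib.compressobj(wbits=-15)
raw = compressor.compress(payload) + compressor.flush()

# Adler-32 of the *uncompressed* data, stored big-endian in a zlib trailer.
trailer = struct.pack(">I", zlib.adler32(payload))

# 0x78 0x9c is a valid zlib header (32 KiB window, default compression).
zlib_stream = b"\x78\x9c" + raw + trailer
restored = zlib.decompress(zlib_stream)

# Dropping the last trailer byte reproduces the shape of the problem file:
# a complete raw stream followed by three bytes of a checksum.
truncated = raw + trailer[:3]
d = zlib.decompressobj(wbits=-15)
body = d.decompress(truncated)
leftover = d.unused_data

print(restored == payload, leftover == trailer[:3])
```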

We can compute the Adler-32 checksum:

  using Zlib_jll

  # Pure-Julia Adler-32 trailer: 4 bytes, big-endian, over the *uncompressed* data.
  function adler32_trailer(data)
      a, b = 1, 0
      for byte in data
          a = (a + byte) % 65521
          b = (b + a)    % 65521
      end
      checksum = (b << 16) | a
      return [UInt8(checksum >> s & 0xff) for s in (24, 16, 8, 0)]  # big-endian
  end

  # zlib-backed Adler-32 trailer: 4 bytes, big-endian, over the *uncompressed* data.
  function adler32_trailer_zlib(data)
      checksum = ccall((:adler32, libz), Culong,
                       (Culong, Ptr{UInt8}, Cuint), 1, data, length(data))
      return [UInt8(checksum >> s & 0xff) for s in (24, 16, 8, 0)]  # big-endian
  end

  # Read the decompressed output and compute the trailer with both implementations.
  out = read("deflate_bug_demo.out")
  t_julia = adler32_trailer(out)
  t_zlib  = adler32_trailer_zlib(out)

  println("input: deflate_bug_demo.out  (", length(out), " bytes)")
  println("adler32_trailer       (pure Julia) = ", bytes2hex(t_julia))
  println("adler32_trailer_zlib  (Zlib_jll)   = ", bytes2hex(t_zlib))
  println("match: ", t_julia == t_zlib)

Here is the output:

  input: deflate_bug_demo.out  (16572 bytes)
  adler32_trailer       (pure Julia) = f31b334a
  adler32_trailer_zlib  (Zlib_jll)   = f31b334a
  match: true

Is this a bug? Or is it a bug that other implementations did not report the extra three bytes?

Issue created: `DeflateDecompressor`/`ZlibDecompressor` error on trailing bytes after a complete stream (concatenated-stream policy) · Issue #107 · JuliaIO/CodecZlib.jl · GitHub

Thanks to everyone looking into the issue.
Here is the versioninfo output:

julia> versioninfo()
Julia Version 1.12.6
Commit 15346901f00 (2026-04-09 19:20 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 14 × Apple M4 Pro
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, apple-m4)
  GC: Built with stock GC
Threads: 1 default, 1 interactive, 1 GC (on 10 virtual cores)

Though I am not sure it is still needed.

The array of bytes to decompress was extracted by Avro.jl. I (Claude) found this issue when Avro.jl failed to read a file.

The byte range selection of Avro.jl is off by three bytes. Either it selected three extra bytes or it removed three extra bytes. Without knowing more of how you are using Avro.jl this is difficult to establish. Could you produce a MWE of your Avro.jl usage?

I put the example data here

but note that it is about 200 MB. The MWE is

using Avro
Avro.readtable("alice-apr2026.part-0000.opus-optimized.avro")

Thanks for help.

It’s a DEFLATE stream without headers. It’s neither less nor more proper than the ZLIB and GZIP wrapper formats.

The Avro specification says

deflate

The “deflate” codec writes the data block using the deflate algorithm as specified in RFC 1951, and typically implemented using the zlib library. Note that this format (unlike the “zlib format” in RFC 1950) does not have a checksum.

So it’s supposed to be a raw DEFLATE stream. Whatever wrote the three bytes of a partial Adler checksum probably made a mistake, but it would require more careful reading of the Avro specification to determine whether the extra bytes should be ignored or considered an error.

One hint from the specification is

A file data block consists of:

  • A long indicating the count of objects in this block.
  • A long indicating the size in bytes of the serialized objects in the current block, after any codec is applied
  • The serialized objects. If a codec is specified, this is compressed by that codec.
  • The file’s 16-byte sync marker.

It sounds like the reader should know quite well what to expect, so my guess, without looking at the MWE data, is that Avro.jl is not careful enough with what it passes to CodecZlib. The extra bytes could be junk filling out the file to the “16-byte sync marker”.

Alternatively the file does not follow the specification and other readers are ignoring the extra bytes by chance.
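The block layout quoted above can be sketched as a small reader. This is a minimal sketch, not Avro.jl's or any reference implementation's code: the helper names `write_long`, `read_long`, and `read_block` are hypothetical, the sync marker and payload are made up, and it assumes the `deflate` codec. It relies on Avro longs being zig-zag variable-length integers, and it returns any bytes left over inside the block after the DEFLATE stream ends, so the caller can decide whether to ignore or reject them.

```python
import io
import zlib


def write_long(n: int) -> bytes:
    """Encode an Avro long: zig-zag, then 7 bits per byte, low bits first."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(buf: io.BytesIO) -> int:
    """Decode an Avro long from the current position of buf."""
    result = 0
    shift = 0
    while True:
        byte = buf.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def read_block(buf: io.BytesIO, sync: bytes):
    """Read one deflate-coded block; return (count, objects, trailing bytes)."""
    count = read_long(buf)
    size = read_long(buf)
    data = buf.read(size)
    if buf.read(16) != sync:
        raise ValueError("sync marker mismatch")
    d = zlib.decompressobj(wbits=-15)  # raw DEFLATE, as the spec requires
    objects = d.decompress(data)
    if not d.eof:
        raise ValueError("incomplete DEFLATE stream")
    return count, objects, d.unused_data


# Build a made-up block whose size field covers three stray bytes.
sync = bytes(range(16))
payload = b"serialized objects " * 100
c = zlib.compressobj(wbits=-15)
data = c.compress(payload) + c.flush() + b"\xf3\x1b\x33"
block = write_long(100) + write_long(len(data)) + data + sync

count, objects, trailing = read_block(io.BytesIO(block), sync)
print(count, len(objects), trailing.hex())
```

Because the size field tells the reader exactly how many bytes belong to the block, a non-empty `trailing` here means the junk sits inside the declared block size, which matches the writer-side explanation rather than a reader slicing the wrong range.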

Edit: I looked at the MWE which, by the way, I think would still be a valid MWE if it is truncated to the first 11492 bytes.

As far as I can tell the file is not following the specification since it adds junk bytes to the compressed serialized objects. There is no way to explain them as arbitrary valid padding. It could be debated whether Avro.jl should be more lenient, but the error is definitely on the side of the writer.

I think one should check the behavior of the “reference” implementations in the apache/avro repository on GitHub.

If the reference implementations all agree, then the spec is just badly written, and Avro.jl OUGHT TO match their behavior.

If the reference implementations diverge, then the apache people need to get their shit together.

Writers of Avro files WOULD PROBABLY not include trailing junk; readers of Avro files SHOULD CONSIDER checking for trailing junk.

(cf RFC 6919 - Further Key Words for Use in RFCs to Indicate Requirement Levels :wink: )