Inefficient compression of very large GeoTIFFs with ArchGDAL compared to gdal_* binaries

I’m using ArchGDAL to save climate data as large 3D GeoTIFFs, but I’ve noticed that files created in ArchGDAL are roughly 3x larger than those created with gdal_* binaries. As a diagnostic I eliminated all my calculations and wrote an ArchGDAL function that just resaves its input file using a given compression method. Then I wrapped this in a test function that also recompresses the ArchGDAL output using gdal_translate and compares resulting file sizes.

Spoiler: small input GeoTIFFs have the same output size, but very large input files (4 GB) become much larger after passing through ArchGDAL.

First my test code:

using ArchGDAL, GDAL_jll

function copy_gtiff(infile, outfile; compressmethod="ZSTD")
    ArchGDAL.readraster(infile) do origdataset
        nbands = ArchGDAL.nraster(origdataset)
        band1 = ArchGDAL.getband(origdataset, 1)
        nodata = ArchGDAL.getnodatavalue(band1)
        isfile(outfile) && rm(outfile)
        origdata = origdataset[:,:,:]
        ArchGDAL.create(outfile,
            driver = ArchGDAL.getdriver("GTiff"),
            width = ArchGDAL.width(origdataset),
            height = ArchGDAL.height(origdataset),
            nbands = nbands,
            dtype = ArchGDAL.pixeltype(band1),
            options = ["BIGTIFF=YES", "COMPRESS=$compressmethod"]
        ) do dataset
            geotransform = ArchGDAL.getgeotransform(origdataset)
            proj = ArchGDAL.getproj(origdataset)
            ArchGDAL.setgeotransform!(dataset, geotransform)
            ArchGDAL.setproj!(dataset, proj)
            for b = 1:nbands
                band = ArchGDAL.getband(dataset, b)
                ArchGDAL.setnodatavalue!(band, nodata)
                ArchGDAL.write!(band, origdata[:,:,b])
            end
        end
        return origdata
    end
end

function recompress_gtiff(infile, outfile; compressmethod="ZSTD")
    gdal_translate_path() do gdal_translate
        run(`$gdal_translate -q -co BIGTIFF=YES -co COMPRESS=$compressmethod $infile $outfile`)
    end
end

filesizeMB(filename) = round(filesize(filename)/1024^2, digits=1)

function readalldata(filename)
    ArchGDAL.readraster(filename) do dataset
        return dataset[:,:,:]
    end
end

function testcopy(compressmethod="ZSTD")
    for nbands in [3, 30, 300, 1000, 2920]
        infile = "test_ZSTD_$(nbands)bands.tif"
        outfile1, outfile2 = tempname(), tempname()
        println("\nTesting with $nbands bands, compression method = $compressmethod:")
        print("  copying $infile with ArchGDAL...")
        time = round(@elapsed origdata = copy_gtiff(infile, outfile1; compressmethod); digits=1)
        println(" ($time s). Original file $(filesizeMB(infile)) MB, copy $(filesizeMB(outfile1)) MB.")
        print("  recompressing with gdal_translate...")
        time = round(@elapsed recompress_gtiff(outfile1, outfile2; compressmethod); digits=1)
        println(" ($time s). Recompressed file $(filesizeMB(outfile2)) MB.")
        print("  reading copied file...")
        time = round(@elapsed newdata = readalldata(outfile1); digits=1)
        println(" ($time s). Identical to original file: $(all(origdata .== newdata)).")
        print("  reading recompressed file...")
        time = round(@elapsed newdata = readalldata(outfile2); digits=1)
        println(" ($time s). Identical to original file: $(all(origdata .== newdata)).")
        rm(outfile1)
        rm(outfile2)
    end
end

Running the tests using ZSTD compression:

julia> testcopy("ZSTD")

Testing with 3 bands, compression method = ZSTD:
  copying test_ZSTD_3bands.tif with ArchGDAL... (1.3 s). Original file 3.9 MB, copy 3.9 MB.
  recompressing with gdal_translate... (1.2 s). Recompressed file 3.9 MB.
  reading copied file... (0.0 s). Identical to original file: true.
  reading recompressed file... (0.0 s). Identical to original file: true.

Testing with 30 bands, compression method = ZSTD:
  copying test_ZSTD_30bands.tif with ArchGDAL... (2.1 s). Original file 40.2 MB, copy 40.2 MB.
  recompressing with gdal_translate... (2.1 s). Recompressed file 40.2 MB.
  reading copied file... (0.2 s). Identical to original file: true.
  reading recompressed file... (0.3 s). Identical to original file: true.

Testing with 300 bands, compression method = ZSTD:
  copying test_ZSTD_300bands.tif with ArchGDAL... (8.6 s). Original file 405.8 MB, copy 405.8 MB.
  recompressing with gdal_translate... (8.4 s). Recompressed file 405.8 MB.
  reading copied file... (1.6 s). Identical to original file: true.
  reading recompressed file... (1.6 s). Identical to original file: true.

Testing with 1000 bands, compression method = ZSTD:
  copying test_ZSTD_1000bands.tif with ArchGDAL... (23.1 s). Original file 1355.7 MB, copy 1355.7 MB.
  recompressing with gdal_translate... (22.3 s). Recompressed file 1355.7 MB.
  reading copied file... (5.6 s). Identical to original file: true.
  reading recompressed file... (5.8 s). Identical to original file: true.

Testing with 2920 bands, compression method = ZSTD:
  copying test_ZSTD_2920bands.tif with ArchGDAL... (171.5 s). Original file 3968.5 MB, copy 10906.3 MB.
  recompressing with gdal_translate... (61.3 s). Recompressed file 3968.5 MB.
  reading copied file... (15.1 s). Identical to original file: true.
  reading recompressed file... (15.6 s). Identical to original file: true.

My 5 test files are available in a Box folder here. I get similar results for other compression methods. Some valid values are DEFLATE, LZW and NONE. My test results for these are in a log file in that Box folder. My installed versions: ArchGDAL v0.7.4 and GDAL_jll v300.202.100+0.

I suspect the problem is some kind of interaction with the BIGTIFF setting, which is required for creating GeoTIFFs larger than 4 GB. I’ll open an issue in the ArchGDAL repo for this, but first I wanted to make sure it’s not just me. For example, I could be missing some simple setting. Also I wonder if there is some kind of workaround I can use until the devs have time to look into this. Otherwise I’ll probably store my files in other format for now, e.g. HDF5.

Hmm odd, I see the same thing. So only for the 2920 band raster, with compression on, is there a difference in the file size, but not the content, of the file created by gdal_translate and the file written from ArchGDAL. It may not be an ArchGDAL issue but a GDAL issue rather. I printed the gdalinfo of both files and they are identical as well, so as far as I can see gdal_translate does not add other creation options. Though I don’t know if gdalinfo shows them all.

I’ve never come across GeoTIFFs with this many bands before though, so perhaps it’s not very common? I would think that a truly n-dimensional format like HDF5, netCDF or zarr would be a better match. Though if you prefer GeoTIFF, perhaps it’s good to make them tiled as well, for instance by using the Cloud Optimized GeoTIFF (COG) driver: COG – Cloud Optimized GeoTIFF generator — GDAL documentation.

Thanks @visr for taking the time to confirm! I’ll open an issue in the GDAL.jl repo then and switch to some other format for my application. I’m not married to GeoTiff, that’s just the format that SMHI (the Swedish Meteorological Institute) used when they provided the source data for me.

By the way, in these experiments I was surprised to see how fast and efficient the new ZSTD compression method was. On my data it performed the best across all three metrics: output file size, read and write speed. Do you know off the top of your head which alternative formats might support this or other modern compression schemes?

Yeah indeed. These benchmarks for tabular data also look pretty good: Feather V2 with Compression Support in Apache Arrow 0.17.0 · Ursa Labs.

There is zarr, which supports zstd compression as well, see https://meggart.github.io/Zarr.jl/latest/tutorial/#Compressors.

I saw the latest GDAL release, 3.4, which is almost out, also adds a Zarr driver.