Open XLSX from Zip Archive in Memory

I want to download this url, get the file named “ZIP5_DEC2022_FINAL.xlsx”, and read the table in as a DataFrame. But I can’t quite work out how to compose Zipfiles.jl with XLSX.jl.

Here is my code currently:

using DataFrames
using HTTP
using XLSX
using ZipFile

url = "https://www.cms.gov/files/zip/2022-end-year-zip-code-file.zip"
zipped_bytes = HTTP.get(url)
zip = ZipFile.Reader(IOBuffer(zipped_bytes.body))
file_index = findfirst((f) -> f.name == "ZIP5_DEC2022_FINAL.xlsx", zip.files)
file_contents = read(zip.files[file_index], String)
df = XLSX.openxlsx(IOBuffer(file_contents)) |> DataFrame
# ERROR: MethodError: no method matching openxlsx(::IOBuffer)

It’s possible I shouldn’t be reading the zipfile in as a String. But it’s extra strange to me that I can see in the XLSX.jl source code where the method is defined for openxlsx(::IO; kwargs), so why the error on arg type?

The code above works when I’m trying to open a CSV in a zip archive (just replacing the last line with CSV.read.

1 Like

Hi,

XLSX.openxlsx expects a filepath (as string) and cannot not handle IOBuffer. You have to convert you stream to a temp file to make it work with XLSX.readtable.

This code works :

using DataFrames
using HTTP
using XLSX
using ZipFile

url = "https://www.cms.gov/files/zip/2022-end-year-zip-code-file.zip"
zipped_bytes = HTTP.get(url)
zip = ZipFile.Reader(IOBuffer(zipped_bytes.body))
file_index = findfirst((f) -> f.name == "ZIP5_DEC2022_FINAL.xlsx", zip.files)
file_contents = read(zip.files[file_index], String)

# Open safe temp file
df = mktemp() do path, io
    # Store your content in the temp file
    write(io, file_contents)
    # read it (the first sheet, here unnamed)
    return DataFrame(XLSX.readtable(path, 1))
end

show(df)

3 Likes

Awesome! Thank you!

to extract in format useful to dataframe constructor I had to use this expression.

df = mktemp() do path, io
    # Store your content in the temp file
    write(io, file_contents)
    # read it (the first sheet, here unnamed)
    return XLSX.readtable(path, 1)
end
DataFrame(;Base.zip(df[2],df[1])...)

I wanted to know how to extract text files.

  1. Can any function of CSV be used?
    Which and which parameters to use?

  2. is it possible with readdlm? and how?

Looks like you might be on an old version of XLSX, readtable used to return column headers and a data matrix separately but is returning a Tables.table compatible object since 0.8

1 Like