Sending excel file as response is slow

I have an API which get csv file in the request and all I need to do is format and send the response as excel file. I am reading the data into dataframe and directly writing into excel.

The issue is that writing data into excel itself taking around 26 seconds and in total it’s taking around 3 mins(sometimes taking even more than that) which includes writing into excel file and reading the content from buffer for around 370k (70 - 80 MB) records.

I have added the code which I have currently written.

df = CSV.read(data, DataFrame)
@info "Rows **", nrow(df)
filename = "data.xlsx"
filecontent = @time create_excel_file(df)
HTTP.Response(200, [
                "Content-Type" => "application/vnd.ms-excel",
                "Content-Disposition" => "attachment; filename=\"$filename\""
            ], body=filecontent)

function create_excel_file(df)
    # Create an in-memory buffer for the Excel file
    io = IOBuffer()
    XLSX.openxlsx(io, mode="w") do xf
        sheet = xf[1]
        @time XLSX.writetable!(sheet, collect(DataFrames.eachcol(df)), DataFrames.names(df))
    end
    return take!(io)
end

Can you provide an example CSV file where your function is slow?

I believe it will not allow me to upload csv files here, but below is my df how it looks like. I have also added time macro and attached below the time it’s taken for each step to complete. Data written from df to excel in 25 seconds but the overall time taken to complete the function is 337 seconds which is >5 mins.

I need help with the better way to read the data from buffer as I am currently using take!(io) which is very slow. I am open to suggestion where we can directly read csv file from http request and pass it as is in the response but with .xlsx extension? The only format I need is date which needs to be in yyyy-mm-dd

10×21 DataFrame
 Row │ REPORT_DT  COMPANY_ID  NUM    DPRT     DPRT_TML  ARRV     ARRV_TML  MARKET   ENTITY   HUB      SUBFT    FLT    DEPS   BHRS     MILES  CAP    MILLIONS  SOURCE    YYYYMM  REPORT_YYYYMM  categ 
     │ String15   String3     Int64  String3  Int64     String3  Int64     String7  String3  String3  String3  Int64  Int64  Float64  Int64  Int64  Float64   String15  Int64   Int64          Int64 
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 7/1/2024   XY              1  AAA          2255  BBB           615  AAABBB   DEF      AAA      98D        123      1  16.3333   8442    257  2.16959   7/1/2024  202407         202407    322
   2 │ 7/1/2024   XY              2  BBB           845  AAA           915  AAABBB   DEF      AAA      98D        123      1  15.5      8442    257  2.16959   7/1/2024  202407         202407    322
   3 │ 7/1/2024   XY              3  CCC          1550  ABC           745  CCCABC   TEF      CCC      22A        345      1   8.9167   4432    167  0.740144  7/1/2024  202407         202407    455
   4 │ 7/1/2024   XY              4  XXX          1340  YYY          1755  YYYXXX   TEF      YYY      12A        564      1  10.25     4825    276  1.3317    7/1/2024  202407         202407    111
   5 │ 7/1/2024   XY              5  YYY          2025  XXX          1135  YYYXXX   TEF      YYY      12A        564      1   9.1667   4825    276  1.3317    7/1/2024  202407         202407    111
   6 │ 7/1/2024   XY              6  DEF          1645  YYY          1450  YYYDEF   DEF      YYY      98D        123      1  12.0833   6632    257  1.70442   7/1/2024  202407         202407    322
   7 │ 7/1/2024   XY              7  YYY          1045  DEF          1400  YYYDEF   DEF      YYY      98D        123      1  13.25     6632    257  1.70442   7/1/2024  202407         202407    322
   8 │ 7/1/2024   XY             12  ABC           950  CCC          1250  CCCABC   TEF      CCC      22A        345      1  10.0      4432    167  0.740144  7/1/2024  202407         202407    455
   9 │ 7/1/2024   XY             14  SDF          1855  XXX           715  SDFXXX   TEF      SDF      22A        345      1   7.3333   3457    167  0.577319  7/1/2024  202407         202407    455
  10 │ 7/1/2024   XY             17  XXX          1200  SDF          1510  SDFXXX   TEF      SDF      22A        345      1   8.1667   3457    167  0.577319  7/1/2024  202407         202407    455

Thanks for the example data, I did some profiling with StatProfilerHTML.jl and for me, about 80% of the time is spent calling ZipFile.jl/src/ZipFile.jl at a599b0aac5d17403fdbbc4ea63612be299cf8417 · fhs/ZipFile.jl · GitHub
So this seems to be the same issue you were having in File zipping taking longer for large files

Here is what I profiled:

using CSV
using DataFrames
using XLSX
using StatProfilerHTML

function gen_fake_csv(path::String, nlines::Int)
    header = "REPORT_DT,COMPANY_ID,NUM,DPRT,DPRT_TML,ARRV,ARRV_TML,MARKET,ENTITY,HUB,SUBFT,FLT,DEPS,BHRS,MILES,CAP,MILLIONS,SOURCE,YYYYMM,REPORT_YYYYMM,categ"
    line = "7/1/2024,XY,1,ABC,1234,ABC,1234,ABCDFE,XYZ,ABC,12A,789,1,16.3333,9876,456,2.16959,7/1/2024,202407,202407,787"
    open(path; write=true) do f
        println(f, header)
        join(f, Iterators.repeated(line, nlines), "\n")
    end
end

function create_excel_file(df)
    # Create an in-memory buffer for the Excel file
    io = IOBuffer()
    XLSX.openxlsx(io, mode="w") do xf
        sheet = xf[1]
        XLSX.writetable!(sheet, collect(DataFrames.eachcol(df)), DataFrames.names(df))
    end
    return take!(io)
end

test_file = "test.csv"
gen_fake_csv(test_file, 100000)
df = CSV.read(test_file, DataFrame)
@time create_excel_file(df) # compile the method
@profilehtml create_excel_file(df) # profile

It should be relatively easy to use ZipArchives.jl instead of ZipFile.jl in XLSX.jl/src/write.jl at e80dba0f9842dafc46ae35ac26aefd252ce9ad91 · felipenoris/XLSX.jl · GitHub to speed this up. If you figure out how to do that and it works, a PR to XLSX.jl would help others with this issue.

Thank you for your analysis, if you want me to make that change yeah I am more than happy to take a stab at it. Let me look.

I have question here, I think the issue is not with writetable function as it’s taking 24 seconds to write it(little high as you mentioned about it’s using ZipFile instead of ZipArchives). The real issue is with take!(io) which itself consuming around 3 mins.

I have updated the code to write data into excel file and use read(filename) instead to take the data and still seeing “read” being used more than 2 mins.

take!(io) should be almost instant. I think the slowest function is the call to XLSX.openxlsx. This is the function that writes the data to the io.