Reading gz'ed CSV does not work - length of provided header doesn't match the number of columns

I have a trivial program working on a simple gz’ed CSV (available here):

using CSV
df = CSV.File("Perf_Health_MSSQL\$PTSQL_Buffer Manager_Page life expectancy_21-8h.csv.gz",
    header=[:domain, :host, :feature, :oid, :largeversion, :clientid,
        :from, :to, :aggrlevel, :firstocc, :lastocc, :livesuntil,
        :ct, :sum, :min, :max, :g_lower, :g_upper, :g_ct, :g_sum],
    delim='|')

This fails with

ArgumentError: The length of provided header (20) doesn’t match the number of columns at row 1 (5).

Manually unpacking the file and reading it works perfectly. Can anyone tell me what the problem is, and maybe how to solve it?
In particular, I don't understand why there would be 5 columns in the first row …

// Edit: Suspecting that the delim does not work with a gz’ed file (but why??), I “analyzed” the first line with

using StatsBase
filter(((k, v),) -> v == 4, countmap(collect("PHARMATECHNIK|STA-WS174|Perf/Health:MSSQL\$PTSQL:Buffer Manager\\Page life expectancy:21-8h|15294|2019.11|STA-WS174|2019-07-29T21:00:00|2019-07-29T21:10:00|2|2019-07-29T21:00:06|2019-07-29T21:09:06|2019-08-03T21:10:00|10|7549.00000|484.00000|1024.00000||||")))

There are a few characters that occur exactly 4 times (any of which would split the line into 5 columns) - but none makes much sense as a default delimiter:

  'P' => 4
  '.' => 4
  'f' => 4
  'A' => 4
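
A more direct check, as a minimal sketch using only Base Julia, is to count the `|` characters in that same first line and split on them. The variable names here are illustrative and not part of the original program.

```julia
# First line of the decompressed file, as quoted above.
line = "PHARMATECHNIK|STA-WS174|Perf/Health:MSSQL\$PTSQL:Buffer Manager\\Page life expectancy:21-8h|15294|2019.11|STA-WS174|2019-07-29T21:00:00|2019-07-29T21:10:00|2|2019-07-29T21:00:06|2019-07-29T21:09:06|2019-08-03T21:10:00|10|7549.00000|484.00000|1024.00000||||"

# Count the pipe characters in the line.
n_pipes = count(==('|'), line)

# split keeps empty fields by default, so the trailing empty columns are counted too.
fields = split(line, '|')

println(n_pipes, " pipes, ", length(fields), " fields")
```

Splitting on `|` gives 20 fields, which matches the 20 names in the header vector, so the decompressed line itself is consistent with delim='|'.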

Hi,

I think the problem is that CSV.jl needs to be given a decompressed stream rather than the .gz file itself. You can use https://github.com/bicycle1885/CodecZlib.jl to do that. Something like:

using CSV, CodecZlib
io = open(yourfilename)                       # raw, still-compressed file handle
data = CSV.read(GzipDecompressorStream(io))   # CSV.jl parses the decompressed stream
close(io)

should work
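
Combining this with the header and delimiter from the original call might look like the sketch below. `read_gz_csv` is a hypothetical helper name, and the sketch assumes that `CSV.File` accepts an IO source together with the `header` and `delim` keywords used in the question; the exact behaviour may differ between CSV.jl versions.

```julia
using CSV, CodecZlib

# Hypothetical helper (not part of CSV.jl): read a gzip-compressed,
# delimited text file with explicitly provided column names.
function read_gz_csv(path; header, delim='|')
    open(path) do io
        # Wrap the raw file handle so that CSV.jl sees decompressed bytes.
        # The do-block closes the file handle once parsing has finished.
        CSV.File(GzipDecompressorStream(io); header=header, delim=delim)
    end
end
```

It would be called with the file name and the 20-element header vector from the question, leaving delim at its '|' default.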

Yes, thanks, works - and it seems that the .gz reading version actually ignores the delim parameter - because your code (without a delim) also claims that there are 5 columns in the file.

// Edit: … or, ahem, it simply does not support gz reading. I might be confused because CSVFiles claims to support it … But then, the error message is really unhelpful …

My code uses CSV.jl's delimiter autodetection feature (but this is unrelated to decompression). If you still have problems and can share the file, I can check.

A link for downloading the data file is in my first post - but does CSV actually read gz natively (without the Gz…Stream)?? I may have wrongly assumed this …

IIUC, CSVFiles.jl does, but CSV.jl does not.
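
Since the error message gives no hint that the input is still compressed, one way to check this up front is to look at the first two bytes of the file, which for gzip data are 0x1f 0x8b. This is only a sketch using Base Julia; `looks_gzipped` is a hypothetical helper name, not a function from CSV.jl or CodecZlib.jl.

```julia
# Hypothetical helper: true if the file starts with the gzip magic bytes 0x1f 0x8b.
function looks_gzipped(path)
    open(path) do io
        magic = read(io, 2)          # reads at most 2 bytes
        magic == UInt8[0x1f, 0x8b]
    end
end
```

If this returns true, the file should be wrapped in a GzipDecompressorStream before it is handed to a reader that does not decompress on its own.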

I have checked and my code works - you have to specify | as the delim kwarg (the reason is that autodetection gives preference to space, and splitting on space produces a valid output).

Thanks - now I understand all of what happens! Muchas gracias to all …