How to read a compressed CSV file?

Juan · January 16, 2019, 7:25pm

Hello.

I’m working with several large CSV files (around 2 GB each, on the order of a million rows, thousands of columns).
To save disk space (and speed up syncing online), they are compressed.

I was using R to work with them.
data.table’s fread lets me read the compressed file directly with:

myDT <- fread("7z e -y -bso0 -so mycompress.7z", stringsAsFactors = FALSE, na.strings = c("", "NA")) # and sometimes selecting columns or rows.

That transparently runs 7-Zip and pipes its output to fread.

How can I do something similar with Julia?

I’m also considering feather or HDF5, but I feel safer using CSV for now because it’s easier for other people to access the files.

davidanthoff · January 16, 2019, 7:52pm

If they are gz compressed, you can read them directly with CSVFiles.jl, see here. But that won’t work for 7z compression, I’m afraid…

Juan · January 16, 2019, 11:26pm

I first tried with “zip” but it didn’t work well together with fread.
Would you suggest any other compressed file format to share data between R and Julia?
For example “feather” doesn’t offer internal compression.
Maybe some fast database? (I’m mainly interested in Windows, but something multiplatform would be nice.)

stevengj · January 16, 2019, 11:45pm

Can’t you do

myDT = open(`7z e -y -bso0 -so mycompress.7z`, "r") do io
    load(Stream(format"CSV", io)) |> DataFrame
end

to load it from a pipe just like in R?

davidanthoff · January 17, 2019, 12:20am

Yes, that would probably also work; I just haven’t tried it.

Juan · January 17, 2019, 12:31am

using CSVFiles
using DataFrames
myDT = open(`7z e -y -bso0 -so mycompress.7z`, "r") do io
    load(Stream(format"CSV", io)) |> DataFrame
end

ERROR: UndefVarError: Stream not defined
Stacktrace:
[1] (::getfield(Main, Symbol("##9#10")))(::Base.Process) at .\REPL[15]:2
[2] open(::getfield(Main, Symbol("##9#10")), ::Cmd, ::String) at .\process.jl:617
[3] top-level scope at none:0

What else do I need to do?

davidanthoff · January 17, 2019, 12:39am

Ah, you also need using FileIO (and first add the FileIO.jl package)!

I should just reexport Stream from CSVFiles…
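
Putting the pieces of this thread together, a minimal sketch of the whole pipeline might look like the following. The helper name read_7z_csv is hypothetical (it is not part of any package), the 7z executable is assumed to be on the PATH, and the Stream(format"CSV", io) call is copied as posted above; newer FileIO releases may prefer a different spelling, so check the FileIO documentation for your version.

```julia
using FileIO      # provides load, Stream and the format"..." string macro
using CSVFiles    # CSV backend used by FileIO's load
using DataFrames

# Hypothetical helper, for illustration only.
function read_7z_csv(archive::AbstractString)
    # Same flags as in the thread: -so sends the extracted data to stdout,
    # -bso0 silences 7z's own status messages, -y answers prompts with yes.
    cmd = `7z e -y -bso0 -so $archive`
    open(cmd, "r") do io
        # Treat the pipe as a CSV stream and materialize it as a DataFrame.
        load(Stream(format"CSV", io)) |> DataFrame
    end
end

# Usage (requires 7z and an existing archive):
# myDT = read_7z_csv("mycompress.7z")
```

Interpolating $archive inside the backticks lets Julia handle quoting, so paths containing spaces should not need manual escaping.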

Juan · January 17, 2019, 12:45am

OK, thanks, it seems to work.

Is it supposed to read the data with possible missings?
How can I find out the number of missings in each column?
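
For the per-column count, one possible approach (a sketch, not something confirmed in this thread) is to apply count(ismissing, ...) to each column of the DataFrame. The toy table below is made up to mirror the small aa/bb sample discussed in this thread; the variable names are illustrative.

```julia
using DataFrames

# Toy table: column bb has one missing value, in the second row.
df = DataFrame(aa = [1, 2, 3], bb = [11, missing, 23])

# Count the missing entries in each column, keyed by column name.
nmissing = Dict(zip(names(df), [count(ismissing, col) for col in eachcol(df)]))

println(nmissing)
```

DataFrames also offers describe(df, :nmissing), which reports the same counts as a small summary table.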

davidanthoff · January 17, 2019, 1:16am

I’m not entirely sure what you mean by that… Are there rows with missing values? CSVFiles.jl should handle those just fine. If not, please open an issue.

Juan · January 17, 2019, 1:18am

Yes, some rows in the CSV have missing values. That is expected, because those values weren’t measured.

 aa , bb  
1   , 11
2   ,
3   , 23

davidanthoff · January 17, 2019, 1:23am

Ok, and I assume it loaded it properly? The way this should work is that the column bb in your DataFrame should now have a missing value in the second row.

I guess I’m just not sure whether there is a problem, or whether you are just reporting success.

Juan · January 17, 2019, 1:51am

It’s OK, thanks.
