How to read a compressed CSV file?

Hello.

I’m working with several large csv files. (Around 2GB each, million rows, thousands of columns).
In order to reduce the space on disk (and speed to sync online) they are compressed.

I was using R to work with them.
data.table’s fread let’s me read directly the compressed file with:

myDT <- fread("7z e -y -bso0 -so mycompress.7z", stringsAsFactors=F, na.strings=c("", "NA")) # and sometimes selecting columns or rows.

That executes transparently 7-zip and forwards the result to fread.

How can I do something similar with Julia?

I’m also considering using feather or hdf5 but I feel safer using csv for now, it’s easier for other people to access the files.

If they are gz compressed, you can read them directly with CSVFiles.jl, see here. But that won’t work for 7z compression, I’m afraid…

2 Likes

I first tried with “zip” but it didn’t work well together with fread.
Would you suggest any other compressed file format to share data between R and Julia?
For example “feather” doesn’t offer internal compression.
Maybe some fast database? (I’m interested on Windows but something multiplatform would be nice).

Can’t you do

myDT = open(`7z e -y -bso0 -so mycompress.7z`, "r") do io
    load(Stream(format"CSV", io)) |> DataFrame
end

to load it from a pipe just like in R?

3 Likes

Yes, that probably would also work, I just haven’t tried it :slight_smile:

using CSVFiles
using DataFrames
myDT = open(`7z e -y -bso0 -so mycompress.7z`, "r") do io
    load(Stream(format"CSV", io)) |> DataFrame
end

ERROR: UndefVarError: Stream not defined
Stacktrace:
[1] (::getfield(Main, Symbol(“##9#10”)))(::Base.Process) at .\REPL[15]:2
[2] open(::getfield(Main, Symbol(“##9#10”)), ::Cmd, ::String) at .\process.jl:617
[3] top-level scope at none:0

What else do I need to do?

Ah, you also need using FileIO (and first add the FileIO.jl package)!

I should just reexport Stream from CSVFiles

OK, thanks, it seems to work.

Is it supposed to be read with possible missings?
How can I know how the number of missings on each column?

I’m not entirely sure what you mean by that… Are there rows with missing values? CSVFiles.jl should handle those just fine. If not, please open an issue.

Yes, some rows on the csv have missings, it’s supposed to be like that because that value wasn’t measured.

 aa , bb  
1   , 11
2   ,
3   , 23

Ok, and I assume it loaded it properly? The way this should work is that the column bb in your DataFrame should now have a missing value in the second row.

I guess I’m just not sure whether there is a problem, or whether you are just reporting success :slight_smile:

it’s OK, thanks.

1 Like