How to read a compressed CSV file?


#1

Hello.

I’m working with several large csv files. (Around 2GB each, million rows, thousands of columns).
In order to reduce the space on disk (and speed to sync online) they are compressed.

I was using R to work with them.
data.table’s fread let’s me read directly the compressed file with:

myDT <- fread("7z e -y -bso0 -so mycompress.7z", stringsAsFactors=F, na.strings=c("", "NA")) # and sometimes selecting columns or rows.

That executes transparently 7-zip and forwards the result to fread.

How can I do something similar with Julia?

I’m also considering using feather or hdf5 but I feel safer using csv for now, it’s easier for other people to access the files.


#2

If they are gz compressed, you can read them directly with CSVFiles.jl, see here. But that won’t work for 7z compression, I’m afraid…


#3

I first tried with “zip” but it didn’t work well together with fread.
Would you suggest any other compressed file format to share data between R and Julia?
For example “feather” doesn’t offer internal compression.
Maybe some fast database? (I’m interested on Windows but something multiplatform would be nice).


#4

Can’t you do

myDT = open(`7z e -y -bso0 -so mycompress.7z`, "r") do io
    load(Stream(format"CSV", io)) |> DataFrame
end

to load it from a pipe just like in R?


#5

Yes, that probably would also work, I just haven’t tried it :slight_smile:


#6
using CSVFiles
using DataFrames
myDT = open(`7z e -y -bso0 -so mycompress.7z`, "r") do io
    load(Stream(format"CSV", io)) |> DataFrame
end

ERROR: UndefVarError: Stream not defined
Stacktrace:
[1] (::getfield(Main, Symbol("##9#10")))(::Base.Process) at .\REPL[15]:2
[2] open(::getfield(Main, Symbol("##9#10")), ::Cmd, ::String) at .\process.jl:617
[3] top-level scope at none:0

What else do I need to do?


#7

Ah, you also need using FileIO (and first add the FileIO.jl package)!

I should just reexport Stream from CSVFiles


#8

OK, thanks, it seems to work.

Is it supposed to be read with possible missings?
How can I know how the number of missings on each column?


#9

I’m not entirely sure what you mean by that… Are there rows with missing values? CSVFiles.jl should handle those just fine. If not, please open an issue.


#10

Yes, some rows on the csv have missings, it’s supposed to be like that because that value wasn’t measured.

 aa , bb  
1   , 11
2   ,
3   , 23

#11

Ok, and I assume it loaded it properly? The way this should work is that the column bb in your DataFrame should now have a missing value in the second row.

I guess I’m just not sure whether there is a problem, or whether you are just reporting success :slight_smile:


#12

it’s OK, thanks.