I’m working with several large CSV files (around 2 GB each, a million rows, thousands of columns).
To reduce the space they take on disk (and to speed up syncing them online), they are compressed.
I was using R to work with them.
data.table’s fread lets me read the compressed file directly with:
myDT <- fread("7z e -y -bso0 -so mycompress.7z", stringsAsFactors=F, na.strings=c("", "NA")) # and sometimes selecting columns or rows.
That transparently runs 7-Zip and pipes its output to fread.
How can I do something similar with Julia?
I’m also considering Feather or HDF5, but I feel safer using CSV for now because it’s easier for other people to access the files.
I first tried with “zip” but it didn’t work well together with fread.
Would you suggest any other compressed file format to share data between R and Julia?
For example, “feather” doesn’t offer internal compression.
Maybe some fast database? (I’m mainly interested in Windows, but something cross-platform would be nice.)
using CSVFiles
using DataFrames
myDT = open(`7z e -y -bso0 -so mycompress.7z`, "r") do io
    load(Stream(format"CSV", io)) |> DataFrame
end
ERROR: UndefVarError: Stream not defined
Stacktrace:
[1] (::getfield(Main, Symbol("##9#10")))(::Base.Process) at .\REPL[15]:2
[2] open(::getfield(Main, Symbol("##9#10")), ::Cmd, ::String) at .\process.jl:617
[3] top-level scope at none:0
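The `UndefVarError` in this trace suggests that `Stream` is simply not in scope: it is defined in FileIO.jl, and CSVFiles does not appear to re-export it. A minimal sketch of the same call with FileIO loaded follows. It assumes the `7z` executable is on the PATH and that `mycompress.7z` contains a single CSV file. Whether CSVFiles can parse directly from a process stream is also an assumption here, not something the trace confirms.

```julia
using CSVFiles
using FileIO      # assumption: this is what brings `Stream` into scope
using DataFrames

# Run 7-Zip, have it write the extracted file to stdout (-so),
# and hand that stream to CSVFiles as CSV-formatted input.
myDT = open(`7z e -y -bso0 -so mycompress.7z`, "r") do io
    load(Stream(format"CSV", io)) |> DataFrame
end
```

In more recent FileIO releases I believe the constructor is spelled `Stream{format"CSV"}(io)` instead, so the exact form may depend on the installed version.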
I’m not entirely sure what you mean by that… Are there rows with missing values? CSVFiles.jl should handle those just fine. If not, please open an issue.
Ok, and I assume it loaded it properly? The way this should work is that the column bb in your DataFrame should now have a missing value in the second row.
I guess I’m just not sure whether there is a problem or whether you are just reporting success.