Reading a large CSV file

I checked some answers and they all point to CSV.jl, but it is slow (I guess this is the “time to first plot” issue), it uses a lot of memory, and it produces some strange type of array as output… Do I have another choice?

Sometimes CSV.jl kills Julia (an out-of-memory issue).

Don’t worry about the types of the output. They will work just like any other Julia array.

Why is the time-to-first-plot issue a problem? This might mean you are calling a Julia script from the command line over and over again, which is not recommended.

How big is the CSV file? Is it bigger than your RAM?

The final result fits in memory.

Hmmm… I’m not sure, then. Maybe you need to specify the types more strictly via a keyword argument? What version of CSV.jl are you on?
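For instance, CSV.jl accepts a `types` keyword that pins column types up front instead of letting the parser detect them. A minimal sketch (the file contents and column names here are made up for illustration):

```julia
using CSV, DataFrames

# Write a small sample file so the example is self-contained.
path, io = mktemp()
write(io, "id,value\n1,2.5\n2,3.5\n")
close(io)

# Pin the column types explicitly so the parser does not have to
# guess them while scanning the file.
df = CSV.read(path, DataFrame; types = Dict(:id => Int, :value => Float64))
```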

I am already struggling with pooled arrays and categorical arrays :upside_down_face: and now there is something like a chained array?! :grimacing:

A chained array is a way to read in the data faster. I really wouldn’t worry about it; these are optimizations that are meant to stay hidden from the user while maximizing performance.

A pooled array is the same story. It’s only there for performance (saving memory).

There seem to be a few keyword arguments in the documentation that can reduce the memory footprint. Maybe those will help.
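One such keyword is `select`, which reads only the columns you name; skipping unneeded columns is one of the simplest ways to cut the memory footprint of a large file. A small self-contained sketch (column names are invented for the example):

```julia
using CSV, DataFrames

# Write a small sample file so the example is self-contained.
path, io = mktemp()
write(io, "a,b,c\n1,2,3\n4,5,6\n")
close(io)

# Only columns :a and :c are parsed and materialized;
# column :b is skipped entirely.
df = CSV.read(path, DataFrame; select = [:a, :c])
```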

any example files?

Is there a package to read and write Stata files (Stata 16)? Maybe that would be better than CSV?

Maybe this? IIRC, it uses the same C library as R’s haven does.
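Assuming the package being linked to is StatFiles.jl (which wraps the ReadStat C library, the same one haven uses), reading a Stata file would look roughly like this; `"data.dta"` is a placeholder path, so this sketch only runs against a real Stata file:

```julia
using StatFiles, DataFrames

# load() comes from the FileIO integration that StatFiles.jl provides;
# wrapping the result in DataFrame materializes it as a table.
df = DataFrame(load("data.dta"))
```

Note that StatFiles.jl is read-only; writing .dta files would need a different tool.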


Do you have a CSV file which is read in successfully and fast in R, Python, or similar, but which CSV.jl reads in very slowly or not at all? If so, I’m sure that would be considered a bug in CSV.jl, so it would be great to have a reproduction.

Time to first plot is unlikely to be an issue here if the file is very large, i.e. the parsing itself takes non-negligible time.


Additionally, the CSV.jl 1.0 release will be announced soon, in which @quinnj wants to resolve “time to first plot” as much as possible. As @nilshg commented, if you run into issues please open a reproducible issue in CSV.jl and it will be handled.

Finally, in order to get “standard” types, use pool=false and threaded=false, as this will turn off most of the optimizations that are causing you issues (note, though, that pool=false disables pooling of string columns, which will increase the memory footprint of the object you read in).
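Concretely, that call looks like the sketch below. Note these keywords match the CSV.jl versions current at the time of this thread (pre-1.0); the file contents are made up for illustration:

```julia
using CSV, DataFrames

# Write a small sample file so the example is self-contained.
path, io = mktemp()
write(io, "name,score\nalice,1\nbob,2\n")
close(io)

# pool=false disables pooling of string columns (plain string vectors,
# at the cost of memory for repeated values); threaded=false disables
# multithreaded parsing, which also avoids chained-array outputs.
df = CSV.read(path, DataFrame; pool = false, threaded = false)
```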


threaded = false helped (for me it is much faster than not setting it), so should I set it to false whenever I read large files?