Reading a large CSV file

I checked some answers and they all point to CSV.jl, but it is slow (I guess this is the “time to first plot” issue), it uses a lot of memory, and it produces some strange type of array as output… Do I have another choice?

Sometimes CSV.jl kills Julia (an out-of-memory issue).

Don’t worry about the types of the output. They will work just like any other Julia array.

Why is the time-to-first-plot issue a problem? This might mean you are calling a Julia script from the command line over and over again, which is not recommended.

How big is the CSV file? Is it bigger than your RAM?

The final result fits in memory.

Hmmm… I’m not sure, then. Maybe you need to specify the types more strictly via a keyword argument? What version of CSV.jl are you on?
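For instance, CSV.jl accepts a `types` keyword that pins column types up front instead of letting the parser detect them. A minimal sketch (the file contents and column names here are made up for illustration):

```julia
using CSV, DataFrames

# Write a small sample file so the example is self-contained.
path, io = mktemp()
write(io, "id,value\n1,2.5\n2,3.5\n")
close(io)

# Pin the column types explicitly so the parser does not have to
# guess them while scanning the file.
df = CSV.read(path, DataFrame; types = Dict(:id => Int, :value => Float64))
```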

I am already struggling with pooled arrays and categorical arrays :upside_down_face: and now there is something like a chained array?! :grimacing:

A chained array is a way to read in the data faster. I really wouldn’t worry about it; these are optimizations that are meant to stay hidden from the user while maximizing performance.

A pooled array is the same story. It’s only there for performance (saving memory).

There seem to be a few keyword arguments in the documentation that can reduce the memory footprint. Maybe those will help.
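One such keyword is `select`, which reads only the columns you name; skipping unneeded columns is one of the simplest ways to cut the memory footprint of a large file. A small self-contained sketch (column names are invented for the example):

```julia
using CSV, DataFrames

# Write a small sample file so the example is self-contained.
path, io = mktemp()
write(io, "a,b,c\n1,2,3\n4,5,6\n")
close(io)

# Only columns :a and :c are parsed and materialized;
# column :b is skipped entirely.
df = CSV.read(path, DataFrame; select = [:a, :c])
```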

any example files?

Is there a package to read and write Stata files (Stata 16)? Maybe that would be better than CSV?

Maybe this? IIRC, it uses the same C library as R’s haven does.
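Assuming the package being linked to is StatFiles.jl (which wraps the ReadStat C library, the same one haven uses), reading a Stata file would look roughly like this; `"data.dta"` is a placeholder path, so this sketch only runs against a real Stata file:

```julia
using StatFiles, DataFrames

# load() comes from the FileIO integration that StatFiles.jl provides;
# wrapping the result in DataFrame materializes it as a table.
df = DataFrame(load("data.dta"))
```

Note that StatFiles.jl is read-only; writing .dta files would need a different tool.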


Do you have a CSV file which is read in successfully and fast in R, Python, or similar, but which CSV.jl reads in very slowly or not at all? If so, I’m sure that would be considered a bug in CSV.jl, so it would be great to have a reproduction.

Time to first plot is unlikely to be an issue here if the file is very large, i.e. the parsing itself takes non-negligible time.


Additionally, the CSV.jl 1.0 release will be announced soon, in which @quinnj wants to resolve “time to first plot” as much as possible. As @nilshg commented, if you run into issues please open a reproducible issue in CSV.jl and it will be handled.

Finally, in order to get “standard” types, use pool=false and threaded=false, as this will turn off most of the optimizations that are causing you issues (note, though, that pool=false disables pooling of string columns, which will increase the memory footprint of the object you read in).
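Concretely, that call looks like the sketch below. Note these keywords match the CSV.jl versions current at the time of this thread (pre-1.0); the file contents are made up for illustration:

```julia
using CSV, DataFrames

# Write a small sample file so the example is self-contained.
path, io = mktemp()
write(io, "name,score\nalice,1\nbob,2\n")
close(io)

# pool=false disables pooling of string columns (plain string vectors,
# at the cost of memory for repeated values); threaded=false disables
# multithreaded parsing, which also avoids chained-array outputs.
df = CSV.read(path, DataFrame; pool = false, threaded = false)
```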


threaded = false helped (for me it is much faster than not setting it), so should I set it to false whenever I read large files?