Not able to read a csv file

sai_matcha · February 18, 2022, 9:51am

Earlier I used the CSV.read function very frequently, but now it takes a very long time to read a CSV file and does not give any result even after 10 minutes.

This is the code I'm using, which read the file successfully in previous instances.

df1 = CSV.read("path//data.csv",DataFrame,missingstring = "",header = 2)

Can someone help me understand why this is happening? Thank you.

Henrique_Becker · February 18, 2022, 1:43pm

What is the size of the file? Can you share it? What is your version of CSV.jl?

sai_matcha · February 21, 2022, 3:04am

This is my full package list:

(@v1.7) pkg> st
      Status `~/.julia/environments/v1.7/Project.toml`
  [cbdf2221] AlgebraOfGraphics v0.6.5
  [c52e3926] Atom v0.12.36
  [336ed68f] CSV v0.10.2
  [5d742f6a] CSVFiles v1.0.1
  [13f3f980] CairoMakie v0.7.3
  [8be319e6] Chain v0.4.10
  [a93c6f00] DataFrames v1.3.2
  [1313f7d8] DataFramesMeta v0.10.0
  [28b8d3ca] GR v0.64.0
  [c91e804a] Gadfly v1.3.4
  [7073ff75] IJulia v1.23.2
  [e5e0dc1b] Juno v0.8.4
  [bd3c0b08] MissingsAsFalse v0.1.0
  [3beb2ed1] PDFmerger v0.2.0
  [69de0a69] Parsers v2.2.2
  [91a5bcdd] Plots v1.25.10
  [d330b81b] PyPlot v2.10.0
  [1277b4bf] ShiftedArrays v1.0.0
  [f3b207a7] StatsPlots v0.14.33
  [ade2ca70] Dates

The data contains around 12,000 rows and 18 columns.

gustaphe · February 21, 2022, 5:58am

Try using skipto to read only a couple of lines.

sai_matcha · February 21, 2022, 6:54am

This is the code I used:

df1 = CSV.read("path//data.csv",DataFrame,missingstring = "",header = 2,skipto = 4)

It's still loading after more than 30 minutes.

At the same time, the load function in CSVFiles is very quick, even for big data.
My only question is this: I used to read the same data with CSV.read, but it suddenly started taking a very long time and never gives a result.

I want to be able to use CSV.read as well!

gustaphe · February 21, 2022, 7:03am

You’ll need to skip more than 4 lines to make a difference.

nilshg · February 21, 2022, 7:28am

Why not just limit = 2 if you only want to read a couple of lines?

sai_matcha · February 21, 2022, 9:48am

Setting limit to 11000 gives the same result.

nilshg · February 21, 2022, 10:32am

Just to be clear, skipto=n skips the first n rows, while limit=n will only read the first n rows. So if you only want to read 2 rows, you do limit=2 or skipto=x-2 where x is the total number of rows in your file.

That said, 11,000 rows isn’t very long and shouldn’t take more than a second or two, depending on how many columns you have. You still haven’t told us the size of the file.

sai_matcha · February 21, 2022, 11:06am

File size: 903 KB

nilshg · February 21, 2022, 11:15am

That size should take a fraction of a second to read. Can you share the file? Is this a problem for all files you are reading in or just a specific file?

sai_matcha · February 24, 2022, 5:04am

Hi, I tried with different CSV files. It's happening with all of them.

tk3369 · February 24, 2022, 7:27am

Is the file located on a network drive? Even so, the timing that you reported is still unreasonably long…

nilshg · February 24, 2022, 8:30am

It’s very hard to check what’s going on if you can’t share the csv file. What if you do:

julia> using CSV, DataFrames

julia> CSV.write("test.csv", DataFrame(rand(1_000_000, 10), :auto));

julia> filesize("test.csv")/1e6 # This is about a 200MB csv
192.697339

julia> @time CSV.read("test.csv", DataFrame);
 10.003148 seconds (42.79 M allocations: 1.836 GiB, 3.27% gc time, 85.23% compilation time)

julia> @time CSV.read("test.csv", DataFrame);
  1.297614 seconds (40.00 M allocations: 1.723 GiB, 15.92% gc time)

The first call gives a sense of the compilation overhead; the second call is the “typical” time after compilation. So reading in a 200MB csv takes about a second on my machine. This is with a single thread; when adding threads I get (in a new session, second call to CSV.read):

julia> Threads.nthreads()
2

julia> @time CSV.read("test.csv", DataFrame);
  0.351343 seconds (3.62 k allocations: 81.311 MiB)

mgp · March 1, 2022, 11:40am

I had an equivalent issue. After trawling discussions, I found a suggestion of setting Parsers to version 2.2.0, which fixed it for me:

add Parsers @2.2.0
pin Parsers
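
For anyone who prefers to do this from a script rather than the Pkg REPL, a rough equivalent using the Pkg API might look like the sketch below. It assumes the active environment is the one where CSV is installed and that Parsers 2.2.0 is compatible with the other installed packages; the version number is simply the one mentioned above.

```julia
using Pkg

# Install the specific Parsers version mentioned above (needs network access).
Pkg.add(name="Parsers", version="2.2.0")

# Pin it so later `Pkg.update()` calls leave this version alone.
Pkg.pin("Parsers")

# To undo the pin later:
# Pkg.free("Parsers")
```

Restart Julia after changing the version so that the pinned package is the one actually loaded.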

sai_matcha · March 2, 2022, 5:42am

Thank you, I tried it… but still no result.

jar1 · March 2, 2022, 5:48am

CSV.read sometimes hangs during compilation, so I switched to DelimitedFiles.
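
A minimal sketch of that alternative, assuming a simple comma-separated file with one header line and purely numeric data (the in-memory text below is made up for illustration and stands in for a real file path):

```julia
using DelimitedFiles

# Hypothetical in-memory data standing in for a real file.
csv_text = """
x,y
1,2.5
3,4.5
"""

# With header=true, readdlm returns a (data, header) tuple.
# `data` is a plain matrix, not a DataFrame.
data, header = readdlm(IOBuffer(csv_text), ','; header=true)
```

Note that readdlm returns a matrix rather than a DataFrame, so column types and missing values are not handled the way CSV.read handles them.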

ceysa75 · March 2, 2022, 7:26am

Doesn’t skipto=n mean which row to start with? So the term “skip” is a bit confusing here, or am I wrong?

gustaphe · March 2, 2022, 7:32am

I don’t know, seems pretty self-explanatory to me. “Skip to the nth line” isn’t any less clear than “start at the nth line”, and can’t be confused with Startat, my local tattoo parlor.

nilshg · March 2, 2022, 7:57am

I think the only confusion was my choice of words - indeed skipto skips to the n-th row, rather than skipping the first n rows (which would imply skipping to row n+1).
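
To make the difference concrete, here is a small sketch using a made-up five-line file held in memory (the data is hypothetical, and the header is assumed to be on line 1):

```julia
using CSV, DataFrames

# Hypothetical file: one header line followed by four data rows.
csv_text = """
a,b
1,10
2,20
3,30
4,40
"""

# limit=2 keeps only the first two data rows.
first_two = CSV.read(IOBuffer(csv_text), DataFrame; limit=2)

# skipto=4 starts parsing data at line 4 of the file,
# so the data rows on lines 2 and 3 are skipped.
from_line_4 = CSV.read(IOBuffer(csv_text), DataFrame; skipto=4)
```

Here first_two holds the rows from lines 2 and 3, while from_line_4 holds the rows from lines 4 and 5.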
