Issues reading a big CSV file despite using CSV.Rows

I’d like to import a .tsv file which CSV.read(file, DataFrame) doesn’t like.

The error I get is

ArgumentError: length argument to Parsers.PosLen (204996615) is too large; max length allowed is 1048575

I found an alternative that should work:

using CSV, DataFrames

csv_rows_file = CSV.Rows(file_path)  # lazily iterate over the file row by row
df = DataFrame([[] for _ in csv_rows_file.names], csv_rows_file.names)  # create an empty df with the correct column names

for row in csv_rows_file  # iterate over every row
    push!(df, [row[i] for i in 1:ncol(df)])  # turn the CSV.Row2 values into a vector and push it onto the df
end

I still get the same error!

ArgumentError: length argument to Parsers.PosLen (204996615) is too large; max length allowed is 1048575

Shouldn’t my approach circumvent the memory issue? Is there a workaround for this?
Also, the limit of 1048575 seems quite low…

Thanks in advance for your help!

This is not a memory issue, but an issue with strings in your file being too long, see:

https://github.com/JuliaData/CSV.jl/issues/1009

I see.

If I understand correctly: at a given row, one of the string values has a length of 204996615, which is too long to be imported?

Yes, currently only strings of up to 1048575 characters (2^20 - 1, so roughly one million) are supported; as far as I know, Parsers.jl packs the position and length of each cell into a single integer, with only 20 bits reserved for the length. You have a string with roughly 205 million (!) characters.

If you didn’t expect this, maybe the delimiter has been incorrectly identified by CSV.jl. I believe the first 10 rows are used to figure out the columns and the delimiter, so if your file has some other information in the first rows this might fail. Try setting the delim kwarg explicitly to see if that helps.
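For example (assuming file_path points at your .tsv file):

using CSV, DataFrames

# Explicitly tell CSV.jl the file is tab-delimited instead of relying
# on the automatic delimiter detection
df = CSV.read(file_path, DataFrame; delim = '\t')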

CC @quinnj

I didn’t scroll through each row, but I’ve worked quite a lot with the output tables of this software (the one I’m using to generate the tables I’d like to look at), and 205 million sounds like about 204.9999 million characters too many per cell…

Changing the delimiter didn’t work.

You could also try limiting the number of rows you’re reading in, e.g. CSV.read(file, DataFrame; limit = 10), to first make sure it works for part of the file, and then home in on the row where it fails and inspect that further.
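Something along these lines (just a sketch; skipto and limit are CSV.jl keyword arguments, and the row numbers are placeholders to bisect with):

using CSV, DataFrames

# Check that the first few rows parse at all
head = CSV.read(file_path, DataFrame; limit = 10)

# Then skip ahead and read a small window to narrow down where parsing breaks
window = CSV.read(file_path, DataFrame; skipto = 1_000_000, limit = 100_000)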

I tried that; the delimiter is correctly recognized by CSV.read. When I try to import only one column

csv_rows_file = CSV.read(file_path, DataFrame; select = [:Peptide])

which contains a maximum of 25 characters per cell, I still get the error

ArgumentError: length argument to Parsers.PosLen (204996615) is too large; max length allowed is 1048575

Does that mean there’s a bug?

My suggestion was to read in only the first few rows, not just one column - what do you get from that?

My guess is that the most likely problem is a string with a missing closing quotation mark. So hunt for that.
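One quick way to hunt for that (a minimal sketch, assuming the quote character is the default "):

# Print any line with an odd number of quote characters - a good hint
# that a closing quotation mark is missing somewhere
open(file_path) do io
    for (i, line) in enumerate(eachline(io))
        isodd(count(==('"'), line)) && println("line $i has an unbalanced quote")
    end
end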

That’s what I did. It worked just fine. Rows can be read up to roughly row 2.1 million; then the error pops up.

I get the same error when I read just one column with floating-point values, so I doubt it’s an issue of quotation marks.

I managed to import the table with R’s read.delim function.

CSV.jl still has to know where the delimiters are. If there is a missing closing quotation mark, it wouldn’t be able to keep track of the columns well enough to read just the one you selected.
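A quick toy illustration of why quoting matters for column tracking (not your file, just a made-up two-column example):

using CSV, DataFrames

# Inside a quoted cell the delimiter is part of the value, not a column break,
# so an unclosed quote would swallow all following delimiters into one cell.
io = IOBuffer("a\tb\n\"one\tcell\"\t2\n")
CSV.read(io, DataFrame; delim = '\t')  # 1×2 DataFrame: a = "one\tcell", b = 2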

The software that generated the table has never, in my experience, produced a file where a string has misplaced quotation marks.

R can import the file just fine. I counted the number of characters for each column and plotted them on histograms.

36 columns, so 36 histograms. None of them had any values beyond 100.
Unless read.delim in R does some magic to randomly insert quotation marks, I think we can exclude that possibility - good suggestion though!
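For reference, a rough Julia equivalent of that per-column check (a sketch, assuming df is the DataFrame obtained via the R workaround below, with missings skipped):

using DataFrames

# Longest cell (in characters) per column
for col in names(df)
    vals = collect(skipmissing(df[!, col]))
    longest = isempty(vals) ? 0 : maximum(length ∘ string, vals)
    println(col, " => longest cell: ", longest, " characters")
end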

I found a workaround through R, in case someone runs into the same problem:

using RCall
x = reval("read.delim('/folder/file_path')")
df = rcopy(x)

gives a dataframe as expected.

(Substitute “/folder/file_path” with your file’s location, and don’t forget the quotation marks inside the R string!)
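If you prefer not to hand-edit the path inside the R string, RCall’s R"" string macro also supports interpolating a Julia variable (a small variation on the same workaround):

using RCall, DataFrames

file_path = "/folder/file_path"   # replace with your actual file location

# $file_path is interpolated into the R call by the R"" string macro
df = rcopy(R"read.delim($file_path)")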