Issues reading a big CSV file despite using CSV.Rows

I’d like to import a .tsv file which CSV.read(file, DataFrame) doesn’t like.

The error I get is

ArgumentError: length argument to Parsers.PosLen (204996615) is too large; max length allowed is 1048575

I found an alternative that should work:

using CSV, DataFrames

csv_rows_file = CSV.Rows(file_path)  # lazily iterate over the file row by row
df = DataFrame([[] for _ in csv_rows_file.names], csv_rows_file.names)  # create an empty df with the correct column names

for row in csv_rows_file  # iterate over every row
    push!(df, [row[i] for i in 1:ncol(df)])  # turn the CSV.Row2 values into a vector and push it onto the df
end

I still get the same error!

ArgumentError: length argument to Parsers.PosLen (204996615) is too large; max length allowed is 1048575

Shouldn’t my approach circumvent the memory issue? Is there a workaround for this?
Also, the limit of 1048575 seems quite low…

Thanks in advance for your help!

This is not a memory issue, but an issue with strings in your file being too long, see:

https://github.com/JuliaData/CSV.jl/issues/1009

I see.

If I understand correctly: at a given row, one of the string values has a length of 204996615, which is too long to be imported?

Yes, currently only strings of up to 1048575 characters (2^20 - 1, so roughly one million) are supported; as far as I know, Parsers.jl packs the position and length of each cell into a single integer, with only 20 bits reserved for the length. You have a string with roughly 205 million (!) characters.

If you didn’t expect this, maybe the delimiter has been incorrectly identified by CSV.jl. I believe the first 10 rows are used to figure out the columns and the delimiter, so if your file has some other information in the first rows this might fail. Try setting the delim kwarg explicitly to see if that helps.
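For example (assuming file_path points at your .tsv file):

using CSV, DataFrames

# Explicitly tell CSV.jl the file is tab-delimited instead of relying
# on the automatic delimiter detection
df = CSV.read(file_path, DataFrame; delim = '\t')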

CC @quinnj

I didn’t scroll through each row, but I’ve worked quite a lot with the output tables of this software (the one I’m using to generate the tables I’d like to look at), and 205 million sounds like about 204.9999 million characters too many per cell…

Changing the delimiter didn’t work.

You could also try limiting the number of rows you’re reading in, e.g. CSV.read(file, DataFrame; limit = 10), to first make sure it works for part of the file, and then home in on the row where it fails and inspect that further.
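Something along these lines (just a sketch; skipto and limit are CSV.jl keyword arguments, and the row numbers are placeholders to bisect with):

using CSV, DataFrames

# Check that the first few rows parse at all
head = CSV.read(file_path, DataFrame; limit = 10)

# Then skip ahead and read a small window to narrow down where parsing breaks
window = CSV.read(file_path, DataFrame; skipto = 1_000_000, limit = 100_000)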

I tried that; the delimiter is correctly recognized by CSV.read. When I try to import only one column

csv_rows_file = CSV.read(file_path, DataFrame; select = [:Peptide])

which contains a maximum of 25 characters per cell, I still get the error

ArgumentError: length argument to Parsers.PosLen (204996615) is too large; max length allowed is 1048575

Does that mean there’s a bug?

My suggestion was to read in only the first few rows, not just one column - what do you get from that?

My guess is that the most likely problem is a string with a missing closing quotation mark. So hunt for that.
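One quick way to hunt for that (a minimal sketch, assuming the quote character is the default "):

# Print any line with an odd number of quote characters - a good hint
# that a closing quotation mark is missing somewhere
open(file_path) do io
    for (i, line) in enumerate(eachline(io))
        isodd(count(==('"'), line)) && println("line $i has an unbalanced quote")
    end
end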

That’s what I did. It worked just fine. Rows can be read up to roughly row 2.1 million; then the error pops up.

I get the same error when I read just one column with floating-point values, so I doubt it’s an issue of quotation marks.

I managed to import the table with R’s read.delim function.

CSV.jl still has to know where the delimiters are. If there is a missing closing quotation mark, it wouldn’t be able to keep track of the columns well enough to read just the one you selected.
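A quick toy illustration of why quoting matters for column tracking (not your file, just a made-up two-column example):

using CSV, DataFrames

# Inside a quoted cell the delimiter is part of the value, not a column break,
# so an unclosed quote would swallow all following delimiters into one cell.
io = IOBuffer("a\tb\n\"one\tcell\"\t2\n")
CSV.read(io, DataFrame; delim = '\t')  # 1×2 DataFrame: a = "one\tcell", b = 2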

The software that generated the table has never, in my experience, produced a file where a string has misplaced quotation marks.

R can import the file just fine. I counted the number of characters for each column and plotted them on histograms.

36 columns, so 36 histograms. None of them had any values beyond 100.
Unless read.delim in R does some magic to randomly insert quotation marks, I think we can exclude that possibility - good suggestion though!
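For reference, a rough Julia equivalent of that per-column check (a sketch, assuming df is the DataFrame obtained via the R workaround below, with missings skipped):

using DataFrames

# Longest cell (in characters) per column
for col in names(df)
    vals = collect(skipmissing(df[!, col]))
    longest = isempty(vals) ? 0 : maximum(length ∘ string, vals)
    println(col, " => longest cell: ", longest, " characters")
end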

I found a workaround through R, in case someone runs into the same problem:

using RCall
x = reval("read.delim('/folder/file_path')")
df = rcopy(x)

gives a dataframe as expected.

(Substitute “/folder/file_path” with your file’s location, and don’t forget the quotation marks inside the R string!)
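If you prefer not to hand-edit the path inside the R string, RCall’s R"" string macro also supports interpolating a Julia variable (a small variation on the same workaround):

using RCall, DataFrames

file_path = "/folder/file_path"   # replace with your actual file location

# $file_path is interpolated into the R call by the R"" string macro
df = rcopy(R"read.delim($file_path)")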