I’d like to import a .tsv file which CSV.read(file, DataFrame) doesn’t like.
The error I get is
ArgumentError: length argument to Parsers.PosLen (204996615) is too large; max length allowed is 1048575
I found an alternative that should work
using CSV, DataFrames
csv_rows_file = CSV.Rows(file_path) # lazily iterate over the rows of the file
df = DataFrame([[] for _ in csv_rows_file.names], csv_rows_file.names) # create an empty df with the correct column names
for row in csv_rows_file # iterate over every row
    push!(df, [row[i] for i in 1:ncol(df)]) # add new row by turning CSV.Row2 object values into a vector and pushing it into the df
end
I still get the same error!
ArgumentError: length argument to Parsers.PosLen (204996615) is too large; max length allowed is 1048575
Shouldn’t my approach circumvent the memory issue? Is there a workaround for this?
Also, the limit of 1048575 seems quite low…
Yes, currently only strings up to c. 1 million characters (1,048,575 = 2^20 - 1) are supported. You have a string which has roughly 205 million (!) characters.
If you didn’t expect this, maybe the delimiter has been incorrectly identified by CSV.jl. I believe the first 10 rows are used to figure out the columns and the delimiter, so if your file has some other information in the first rows, detection might fail. Try setting the delim kwarg explicitly to see if that helps.
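One way to sanity-check the delimiter before passing delim explicitly is to count candidate delimiters in the first few lines. This is only a sketch in plain Julia (no CSV.jl involved); the function name delimiter_counts, the candidate list, and the sample data are made up for illustration, and it does not reproduce how CSV.jl actually detects delimiters.

```julia
# Count how often each candidate delimiter appears in the first `nlines` lines.
# A real delimiter should show the same non-zero count on every line.
function delimiter_counts(io::IO; candidates = ['\t', ',', ';', '|'], nlines = 10)
    counts = Dict(c => Int[] for c in candidates)
    for (i, line) in enumerate(eachline(io))
        i > nlines && break
        for c in candidates
            push!(counts[c], count(==(c), line))
        end
    end
    return counts
end

# Tiny in-memory sample standing in for the real file (hypothetical data).
sample = IOBuffer("id\tname\tnote\n1\talpha\ta,b\n2\tbeta\tc\n")
counts = delimiter_counts(sample)
```

On the real file this could be called as open(delimiter_counts, file_path). If the tab counts are constant across the header and data lines, CSV.read(file, DataFrame; delim = '\t') is the natural thing to try.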
I didn’t scroll through every row, but I’ve worked with this software’s output tables quite a lot (it’s the one I’m using to generate the tables I’d like to look at), and 205 million sounds like 204.9999 million too many per cell…
You could also try limiting the number of rows you’re reading in, e.g. CSV.read(file, DataFrame; limit = 10), to first make sure it works for part of the file, and then home in on the row where it fails and inspect that further.
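To home in on a suspicious row without going through the parser at all, the file can be scanned line by line for oversized fields. The following is a rough sketch in plain Julia; long_fields and MAX_FIELD_LEN are hypothetical names, the split on the delimiter is naive and ignores quoting, and the sample data is invented.

```julia
const MAX_FIELD_LEN = 1_048_575  # limit quoted in the error message (2^20 - 1)

# Return (line_number, field_index, length_in_bytes) for every field longer
# than `maxlen`, splitting each line naively on `delim` (quoting is ignored).
function long_fields(io::IO; delim = '\t', maxlen = MAX_FIELD_LEN)
    hits = Tuple{Int,Int,Int}[]
    for (lineno, line) in enumerate(eachline(io))
        for (j, field) in enumerate(split(line, delim))
            n = ncodeunits(field)
            n > maxlen && push!(hits, (lineno, j, n))
        end
    end
    return hits
end

# Small in-memory sample with one over-long field (hypothetical data),
# checked against a deliberately low threshold.
sample = IOBuffer("a\tb\n1\t" * "x"^50 * "\n2\tok\n")
hits = long_fields(sample; maxlen = 20)
```

If open(long_fields, file_path) comes back empty on the real file, no single tab-separated field is too long on its own, which would suggest the parser is merging many fields into one, for instance because of quoting.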
CSV.jl still has to know where the delimiters are. If a closing quotation mark is missing, it can’t keep track of the columns well enough to read just the one you selected.
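A quick way to test the missing-quote hypothesis is to look for lines containing an odd number of quote characters. This is a minimal sketch in plain Julia under the assumption that the quote character is the double quote; unbalanced_quote_lines is a made-up helper name and the sample data is invented.

```julia
# Line numbers on which `quotechar` occurs an odd number of times,
# i.e. a quote that is opened but never closed on that line.
function unbalanced_quote_lines(io::IO; quotechar = '"')
    return [i for (i, line) in enumerate(eachline(io))
            if isodd(count(==(quotechar), line))]
end

# In-memory sample: line 3 contains a lone quote, as in 5" (hypothetical data).
sample = IOBuffer("id\tlabel\n1\t\"quoted\"\n2\t5\" pipe\n3\tplain\n")
bad = unbalanced_quote_lines(sample)
```

Any line reported by open(unbalanced_quote_lines, file_path) is worth inspecting by hand. If the quotes turn out to be ordinary data rather than field wrappers, the quoting-related keyword arguments of CSV.jl (such as quotechar, or quoted = false if your version supports it) may be worth a look in the documentation.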
The software that generated the table has never produced a file in which a string had misplaced quotation marks.
R can import the file just fine. I counted the number of characters in each column and plotted them as histograms.
36 columns, so 36 histograms. None of them had any values beyond 100.
Unless read.delim in R does some magic to randomly insert quotation marks, we can exclude that possibility. Good suggestion though!