Inconsistencies in the number of lines in a CSV file

I had some adventures in reading a CSV file created on a windows computer in Julia, and it as left me with some confusion on what counts as a line. In particular, it sees like CSV.jl and readlines have different opinions, which is a pain.

Lets demonstrate with a MWE.

julia> # We create test-file

julia> begin
           file_content = "a, b\r\r\r\nc, d\r\r\n1, 2\r\r\n3, 4"
           file1 = joinpath(homedir(), "file1.txt")
           write(file1, file_content);
       end
26

julia> # How many lines is that according to `readlines`?

julia> readlines(file1) |> length
4

julia> # So if we read line 4, we get "3, 4"

julia> readlines(file1)[4]
"3, 4"

julia> # We will now use CSV. We start by creating another test-file, 

julia> # to show that the argument "header" does as expected

julia> using CSV

julia> begin
           file_content2 = "a, b\nc, d\n1, 2\n3, 4"
           file2 = joinpath(homedir(), "file2.txt")
           write(file2, file_content2)
       end
19

julia> # kwarg "header" sets that line as the header, and the following lines as the content:

julia> CSV.File(file2, header=2)
2-element CSV.File:
 CSV.Row: (c = 1, var" d" = 2)
 CSV.Row: (c = 3, var" d" = 4)

julia> # We indeed see that "c, d" on line two is the header, and lines 3 and 4 end up as the content

julia> # And if we set the header to line 4, we get no content:

julia> CSV.File(file2, header=4)
0-element CSV.File

julia> # So if we tell CSV that the header of the original file is line 4, we expect no lines after:

julia> CSV.File(file1, header=4)
2-element CSV.File:
 CSV.Row: (c = 1, var" d" = 2)
 CSV.Row: (c = 3, var" d" = 4)

julia> # Why did we get a header, PLUS two elements, when starting from line 4? 

julia> # Lets investigate this with n_lines_CSVFile

julia> #                                The number of content lines      + header row + the number of lines skipped

julia> n_lines_CSVFile(header, file) = length(CSV.File(file, header=header)) +      1     +        (header-1)
n_lines_CSVFile (generic function with 1 method)

julia> # So n_lines_CSVFile should now return the number of lines in the file, right?

julia> # We test it on file2:

julia> [n_lines_CSVFile(header, file2) for header in 1:4]
4-element Vector{Int64}:
 4
 4
 4
 4

julia> # and we get 4 lines in any case. But what about the original file?

julia> [n_lines_CSVFile(header, file1) for header in 1:4]
4-element Vector{Int64}:
 4
 4
 5
 6

The result is seen to vary based on where we start in the file! It turns out that CSV uses the line numners that are consistent with opening the file in an editor:
image

Almost at least. None of the methods report 8 lines. But at least if I specify header as line X, I get the content seen in a text editor on line X as my header. But only for the header argument, and not the number of content lines. This is why the number of lines counted varies.

So on one side, there is CSV.read and how the content is displayed in an editor. On the other side is readlines. But readlines is not alone. It gets support from readdlm from DelimitedFiles.jl:

julia> using DelimitedFiles

julia> readdlm(file1, ',')
4×2 Matrix{Any}:
  "a"   " b\r"
  "c"   " d"
 1     2
 3     4

And even CSV.Rows:

julia> let n = 0
           for row in CSV.Rows(file1)
               # @show row
               n += 1
           end
           @show n
       end
n = 3

( 3 content rows + 1 header line = 4)

I have to say I am partial to how a normal text editor displays it, as it is usually how a user would manually see “okay this content is on line X”. This btw also includes opening the CSV in a spreadsheet editor like libreoffice calc, and I assume Microsoft excel. But that interpretation of the number of lines seems to be in minority with the tools I tested here.

In any case, this discrepancy is bad. And I was actually unable to solve my use-case, which was reading a CSV file from a line of a given input. This is because I tried to look for some specific content, but I was unable to find a way to read line by line until I found what I was looking for, and pass that line as header, because both CSV.row and readlines count differently than the header keyword. I was also unable to make CSV.read not count the empty lines.

So should CSV change it’s behaviour? Should it add an argument to parse lines like readlines? Or should readlines get an option to parse lines as they are displayed in text-editors? I feel like something has to change, as there currently is simply no way to unite readlines and CSV.read. At least not that I have found in this adventure, which is much more than can be expected by the average user trying to read a CSV file.

\r = CR = carriage return is not a new line, but just move cursor to the beginning of the line.
\n = LF = linefeed which is move the paper a line above (or the cursor into the next line down).

(Exact details depend on OS and editors)

In your data these are mixed clearly to show your problem.

I don’t have a strong opinion, but my tendency is always: if data is not well defined, like in your example the line separator, the data needs to be pruned and not the function for reading needs be be automagically intelligent.

My typical data are excel files/exports from biologist. To read this universe of possible data problems my customized read function would be inconceivable complex.

So, my strategy with data reading is: be extremely strict and error out anything not matching the strict rules. Edit the data until everything is according the rules.

readlines has no option to define the line separator nor has CSV.read. That’s what I would prefer. To give explicit rules for what is a line and what is a column. From the docs of CSV.jl:

Newlines

For all parsing functionality, newlines are detected/parsed automatically, regardless if they’re present in the data as a single newline character ('\n'), single return character ('\r'), or full CRLF sequence ("\r\n").

Because this is the typical parsing rule, I use a set of dos2unix, unix2dos, sed, gawk, tr, … to bring my data into shape.

I think it’s somehow standard or just not well defined which is the best approach, so tools do as they wish.

3 Likes

Linking a related posted on counting the lines in a CSV file.

2 Likes

Editing the data to simplify parsing seems like a good general approach. I will try to remember that in the future.

I shy away from it because often multiple people use the sama data-files, so it could be messy to if everyone made different versions of the data. But I suppose making a dummy-file, or doing the editing in RAM and not on disk (i.e. on the variables defines, meaning that it would be redine each session), would be just fine.

On a more general note, I do not care if lines are interpreted in one way or another. It is the fact that different functions do it differently, without the option to change the behaviour, that is problematic.