I had some adventures in reading a CSV file created on a windows computer in Julia, and it as left me with some confusion on what counts as a line. In particular, it sees like CSV.jl and readlines
have different opinions, which is a pain.
Lets demonstrate with a MWE.
julia> # We create test-file
julia> begin
file_content = "a, b\r\r\r\nc, d\r\r\n1, 2\r\r\n3, 4"
file1 = joinpath(homedir(), "file1.txt")
write(file1, file_content);
end
26
julia> # How many lines is that according to `readlines`?
julia> readlines(file1) |> length
4
julia> # So if we read line 4, we get "3, 4"
julia> readlines(file1)[4]
"3, 4"
julia> # We will now use CSV. We start by creating another test-file,
julia> # to show that the argument "header" does as expected
julia> using CSV
julia> begin
file_content2 = "a, b\nc, d\n1, 2\n3, 4"
file2 = joinpath(homedir(), "file2.txt")
write(file2, file_content2)
end
19
julia> # kwarg "header" sets that line as the header, and the following lines as the content:
julia> CSV.File(file2, header=2)
2-element CSV.File:
CSV.Row: (c = 1, var" d" = 2)
CSV.Row: (c = 3, var" d" = 4)
julia> # We indeed see that "c, d" on line two is the header, and lines 3 and 4 end up as the content
julia> # And if we set the header to line 4, we get no content:
julia> CSV.File(file2, header=4)
0-element CSV.File
julia> # So if we tell CSV that the header of the original file is line 4, we expect no lines after:
julia> CSV.File(file1, header=4)
2-element CSV.File:
CSV.Row: (c = 1, var" d" = 2)
CSV.Row: (c = 3, var" d" = 4)
julia> # Why did we get a header, PLUS two elements, when starting from line 4?
julia> # Lets investigate this with n_lines_CSVFile
julia> # The number of content lines + header row + the number of lines skipped
julia> n_lines_CSVFile(header, file) = length(CSV.File(file, header=header)) + 1 + (header-1)
n_lines_CSVFile (generic function with 1 method)
julia> # So n_lines_CSVFile should now return the number of lines in the file, right?
julia> # We test it on file2:
julia> [n_lines_CSVFile(header, file2) for header in 1:4]
4-element Vector{Int64}:
4
4
4
4
julia> # and we get 4 lines in any case. But what about the original file?
julia> [n_lines_CSVFile(header, file1) for header in 1:4]
4-element Vector{Int64}:
4
4
5
6
The result is seen to vary based on where we start in the file! It turns out that CSV uses the line numners that are consistent with opening the file in an editor:
Almost at least. None of the methods report 8 lines. But at least if I specify header
as line X, I get the content seen in a text editor on line X as my header. But only for the header argument, and not the number of content lines. This is why the number of lines counted varies.
So on one side, there is CSV.read
and how the content is displayed in an editor. On the other side is readlines
. But readlines
is not alone. It gets support from readdlm
from DelimitedFiles.jl:
julia> using DelimitedFiles
julia> readdlm(file1, ',')
4Ă—2 Matrix{Any}:
"a" " b\r"
"c" " d"
1 2
3 4
And even CSV.Rows
:
julia> let n = 0
for row in CSV.Rows(file1)
# @show row
n += 1
end
@show n
end
n = 3
( 3 content rows + 1 header line = 4)
I have to say I am partial to how a normal text editor displays it, as it is usually how a user would manually see “okay this content is on line X”. This btw also includes opening the CSV in a spreadsheet editor like libreoffice calc, and I assume Microsoft excel. But that interpretation of the number of lines seems to be in minority with the tools I tested here.
In any case, this discrepancy is bad. And I was actually unable to solve my use-case, which was reading a CSV file from a line of a given input. This is because I tried to look for some specific content, but I was unable to find a way to read line by line until I found what I was looking for, and pass that line as header
, because both CSV.row
and readlines
count differently than the header
keyword. I was also unable to make CSV.read
not count the empty lines.
So should CSV change it’s behaviour? Should it add an argument to parse lines like readlines
? Or should readlines
get an option to parse lines as they are displayed in text-editors? I feel like something has to change, as there currently is simply no way to unite readlines
and CSV.read
. At least not that I have found in this adventure, which is much more than can be expected by the average user trying to read a CSV file.