Countlines() bug?

RandomString123 · February 1, 2018, 2:21pm

I am trying to figure out whether this is a bug or by design. There is an oddity in how countlines() reports the number of lines in a file: it seems to underreport the count by 1 whenever the file does not end with a newline character. I would expect countlines() to return the same number as the length of the vector returned by readlines(). Thoughts?

x = """
       abcd
       efgh"""

"abcd\nefgh"


countlines(IOBuffer(x))
1

readlines(IOBuffer(x))
2-element Array{String,1}:
 "abcd"
 "efgh"

x = """
       abcd
       efgh
       """

"abcd\nefgh\n"

countlines(IOBuffer(x))
2

readlines(IOBuffer(x))
2-element Array{String,1}:
 "abcd"
 "efgh"

x = """
       abcd
       efgh
        """
"abcd\nefgh\n "

countlines(IOBuffer(x))
2

readlines(IOBuffer(x))
3-element Array{String,1}:
 "abcd"
 "efgh"
 " "

kristoffer.carlsson · February 1, 2018, 2:28pm

Countlines simply counts the number of newline characters.
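As a rough illustration of that description, here is a minimal sketch that counts newline characters directly and compares the result with the length of readlines() for the three strings from the first post. The helper `count_newlines` is a hypothetical stand-in written for this example, not the Base implementation of countlines, so the result does not depend on which Julia version is installed.

```julia
# Hypothetical helper (not Base.countlines): count '\n' characters only.
count_newlines(s::AbstractString) = count(c -> c == '\n', s)

# The three inputs from the original post.
inputs = ["abcd\nefgh", "abcd\nefgh\n", "abcd\nefgh\n "]

# Number of newline characters in each input.
newline_counts = [count_newlines(s) for s in inputs]

# Number of elements readlines() returns for each input.
line_counts = [length(readlines(IOBuffer(s))) for s in inputs]

println(newline_counts)  # [1, 2, 2]
println(line_counts)     # [2, 2, 3]
```

The two counts agree only when the input ends with a newline.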

nalimilan · February 1, 2018, 5:17pm

I agree it’s kind of unintuitive, though. FWIW, wc -l returns 1 even when there is no new line in the input.

stevengj · February 1, 2018, 6:13pm

I agree that one probably wants countlines to match the length of the eachline iterator (or readlines). Currently it does not:

julia> collect(eachline(IOBuffer("abcd\nefgh")))
2-element Array{String,1}:
 "abcd"
 "efgh"

julia> countlines(IOBuffer("abcd\nefgh"))
1

stevengj · February 1, 2018, 6:33pm

I’ve posted a PR to make this change: https://github.com/JuliaLang/julia/pull/25845

giordano · February 1, 2018, 6:57pm

Because that’s the POSIX definition of line

Something not terminated by a <newline> character is an incomplete line. There are several tools that behave unexpectedly with incomplete lines, like cat or wc. Also, git highlights incomplete lines at the end of files. I’m pretty sure that some compilers warn (or used to warn) about missing newlines.

nalimilan · February 2, 2018, 8:27am

Sure, but what matters the most is that Julia be consistent internally, and as @stevengj noted, eachline returns incomplete lines, so it would make sense for countlines to give the number of elements eachline returns.
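One way to picture a count that agrees with eachline is to count the newline characters and add one more when the input ends in an incomplete line. The sketch below is only an illustration under that assumption; `lines_like_eachline` is a hypothetical name, and this is not necessarily how the linked PR implements the change.

```julia
# Hypothetical helper: a line count meant to agree with eachline/readlines.
# Counts '\n' characters, plus one if the input ends with an incomplete line.
function lines_like_eachline(s::AbstractString)
    n = count(c -> c == '\n', s)
    if !isempty(s) && !endswith(s, "\n")
        n += 1  # trailing text without a terminating newline
    end
    return n
end

for s in ["abcd\nefgh", "abcd\nefgh\n", "abcd\nefgh\n ", ""]
    # Compare against what readlines actually returns.
    println(repr(s), " => ", lines_like_eachline(s), " vs ", length(readlines(IOBuffer(s))))
end
```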

giordano · February 2, 2018, 10:25am

Yes, I can see the point of the proposed change; I was just giving some context as to why Unix tools behave unexpectedly when the newline at the end of a file is missing.

RandomString123 · February 2, 2018, 1:17pm

I agree with the consistency part most of all. Developers will most likely use countlines() and {read,each}line(s)() together. If I were going to write code that reads in a large dataset of unknown length (e.g. 20 million lines), I would want to pre-allocate a vector to prevent GC thrashing. It is much faster to count the number of lines in the file by scanning for newline characters than to read in the whole dataset. So I would write some code along the following lines (not working, but codeish):

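# MyType and parse_mytype are placeholders for the user's own type and parser.
v = Vector{MyType}(undef, countlines(file))
for (i, l) in enumerate(eachline(file))
    v[i] = parse_mytype(l)  # out of bounds if eachline yields more lines than countlines reported
end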

The above could end up throwing an exception when eachline and countlines don’t behave the same.
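A defensive variant, sketched here under the assumption that the count may be off by one, is to treat the count purely as a capacity hint and grow the vector with push!. The helper `parse_all` and the integer-parsing example are hypothetical and stand in for the real type and parser.

```julia
# Hypothetical helper: parse every line of `io` with `parse_line`.
# `nhint` is used only as a capacity hint, never as the final length.
function parse_all(parse_line, ::Type{T}, io::IO, nhint::Integer) where {T}
    v = T[]
    sizehint!(v, nhint)          # reserve space up front to limit reallocation
    for l in eachline(io)
        push!(v, parse_line(l))  # safe even if there are more lines than nhint
    end
    return v
end

data = "1\n2\n3"                      # no trailing newline
nhint = count(c -> c == '\n', data)   # newline count, one less than the number of lines
v = parse_all(l -> parse(Int, l), Int, IOBuffer(data), nhint)
println(v)
```

This keeps most of the benefit of pre-allocation while staying correct whichever way the last incomplete line is counted.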
