Countlines() bug?


#1

I am trying to figure out whether this is a bug or by design. There is something odd in how countlines() reports the number of lines in a file: it seems to underreport the count by 1 whenever the file does not end with a newline character. I would expect countlines() to return the same number as the length of the vector returned by readlines(). Thoughts?

x = """
       abcd
       efgh"""

"abcd\nefgh"


countlines(IOBuffer(x))
1

readlines(IOBuffer(x))
2-element Array{String,1}:
 "abcd"
 "efgh"

x = """
       abcd
       efgh
       """

"abcd\nefgh\n"

countlines(IOBuffer(x))
2

readlines(IOBuffer(x))
2-element Array{String,1}:
 "abcd"
 "efgh"

x = """
       abcd
       efgh
        """
"abcd\nefgh\n "

countlines(IOBuffer(x))
2

readlines(IOBuffer(x))
3-element Array{String,1}:
 "abcd"
 "efgh"
 " "

#2

Countlines simply counts the number of newline characters.
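
As a rough illustration of that behaviour (not how Base implements it), here is a minimal sketch that counts newline bytes in a stream. The name `count_newlines` is made up for this example and is not a Base function.

```julia
# Hypothetical helper, not part of Base: count '\n' bytes in a stream.
function count_newlines(io::IO)
    n = 0
    while !eof(io)
        # Read one byte at a time; fine for a sketch, slow for large files.
        n += read(io, UInt8) == UInt8('\n')
    end
    return n
end

count_newlines(IOBuffer("abcd\nefgh"))     # 1
count_newlines(IOBuffer("abcd\nefgh\n"))   # 2
count_newlines(IOBuffer("abcd\nefgh\n "))  # 2
```

These match the three countlines results in the first post: text after the last newline never adds to the count.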


#3

I agree it’s kind of unintuitive, though. FWIW, wc -l behaves the same way: it also returns 1 for this input when there is no trailing newline.


#4

I agree that one probably wants countlines to match the length of the eachline iterator (or readlines). Currently it does not:

julia> collect(eachline(IOBuffer("abcd\nefgh")))
2-element Array{String,1}:
 "abcd"
 "efgh"

julia> countlines(IOBuffer("abcd\nefgh"))
1
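
For comparison, a count that agrees with eachline by construction can be sketched by simply iterating. `count_iterated_lines` is a hypothetical name used only for this example.

```julia
# Hypothetical helper: count lines exactly as eachline iterates them,
# so a final line without a trailing newline is still counted.
function count_iterated_lines(io::IO)
    n = 0
    for _ in eachline(io)
        n += 1
    end
    return n
end

count_iterated_lines(IOBuffer("abcd\nefgh"))     # 2
count_iterated_lines(IOBuffer("abcd\nefgh\n"))   # 2
count_iterated_lines(IOBuffer("abcd\nefgh\n "))  # 3
```

This allocates a String per line, so it is slower than counting newline bytes; it only shows the result one would expect for consistency.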

#5

I’ve posted a PR to make this change: https://github.com/JuliaLang/julia/pull/25845


#6

Because that’s the POSIX definition of a line.

Something not terminated by a <newline> character is an incomplete line. Several tools behave unexpectedly with incomplete lines, like cat or wc. git also highlights an incomplete line at the end of a file. I’m pretty sure that some compilers warn (or used to warn) about a missing final newline.
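
To make the distinction concrete, here is a small sketch that checks whether a string ends in an incomplete line in that sense. `has_incomplete_last_line` is a made-up name for this example.

```julia
# Hypothetical helper: true when the text is non-empty and its last
# line is not terminated by a newline (an "incomplete line").
has_incomplete_last_line(s::AbstractString) = !isempty(s) && !endswith(s, "\n")

has_incomplete_last_line("abcd\nefgh")     # true
has_incomplete_last_line("abcd\nefgh\n")   # false
has_incomplete_last_line("abcd\nefgh\n ")  # true
```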


#7

Sure, but what matters most is that Julia be internally consistent. As @stevengj noted, eachline returns incomplete lines, so it would make sense for countlines to give the number of elements eachline returns.


#8

Yes, I can see the point of the proposed change. I was just giving some context for why Unix tools behave unexpectedly when the newline at the end of a file is missing.


#9

I agree with the consistency part most of all. Developers will most likely use countlines() and {read,each}line(s)() together. If I were writing code that reads a large dataset of unknown length (e.g. 20 million lines), I would want to pre-allocate a vector to prevent GC thrashing. Counting the lines in the file by scanning for newline characters is much faster than reading in the whole dataset. So I would write something along the following lines (a sketch, not tested):

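# MyType and parse_mytype are placeholders for your own type and parser.
# On Julia 0.6 write Vector{MyType}(countlines(file)) instead of using undef.
v = Vector{MyType}(undef, countlines(file))
for (i, l) in enumerate(eachline(file))
    v[i] = parse_mytype(l)
end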

The above throws a BoundsError on the last line whenever eachline yields more lines than countlines reported, which is exactly the case when the file does not end with a newline.
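
One way to sidestep the mismatch, sketched here under a few assumptions, is to treat the count as a capacity hint only and grow the vector with push!. `parse_lines` is a hypothetical helper, and `parse_line` stands in for whatever parser you would actually use.

```julia
# Hypothetical helper: parse every line of the file at `path` into a Vector{T}.
function parse_lines(parse_line, ::Type{T}, path::AbstractString) where {T}
    v = T[]
    # countlines is used only as a capacity hint, so being off by one
    # for a file without a trailing newline is harmless.
    sizehint!(v, countlines(path) + 1)
    for l in eachline(path)
        push!(v, parse_line(l))
    end
    return v
end

# Usage: a temporary file whose last line has no trailing newline.
path, io = mktemp()
write(io, "1\n2\n3")
close(io)
parse_lines(l -> parse(Int, l), Int, path)  # [1, 2, 3]
```

This still makes two passes over the file, but the result no longer depends on countlines and eachline agreeing.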