I am trying to figure out whether this is a bug or by design. There is a weirdness in how countlines() reports the number of lines in a file: it seems to underreport the count by 1 whenever the file does not end with a newline character. I would expect countlines() to return the same number as the length of the vector returned by readlines(). Thoughts?
julia> x = """
abcd
efgh"""
"abcd\nefgh"
julia> countlines(IOBuffer(x))
1
julia> readlines(IOBuffer(x))
2-element Array{String,1}:
"abcd"
"efgh"
julia> x = """
abcd
efgh
"""
"abcd\nefgh\n"
julia> countlines(IOBuffer(x))
2
julia> readlines(IOBuffer(x))
2-element Array{String,1}:
"abcd"
"efgh"
julia> x = "abcd\nefgh\n "
"abcd\nefgh\n "
julia> countlines(IOBuffer(x))
2
julia> readlines(IOBuffer(x))
3-element Array{String,1}:
"abcd"
"efgh"
" "
countlines simply counts the number of newline characters.
I agree it's kind of unintuitive, though. FWIW, wc -l behaves the same way: it returns 1 for the first example, because it also counts newline characters and so ignores the unterminated last line.
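To make the newline-counting behaviour concrete, here is a small sketch that counts '\n' characters directly in the three test strings from the first post. The variable names are only for illustration. The counts line up with the countlines results reported above (1, 2 and 2).

```julia
# Count '\n' characters directly in each of the three test strings.
inputs = ["abcd\nefgh", "abcd\nefgh\n", "abcd\nefgh\n "]
newline_counts = [count(c -> c == '\n', s) for s in inputs]
println(newline_counts)  # [1, 2, 2]
```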
I agree that one probably wants countlines to match the length of the eachline iterator (or readlines). Currently it does not:
julia> collect(eachline(IOBuffer("abcd\nefgh")))
2-element Array{String,1}:
"abcd"
"efgh"
julia> countlines(IOBuffer("abcd\nefgh"))
1
Because that's the POSIX definition of a line: something not terminated by a <newline> character is an incomplete line. Several tools behave unexpectedly with incomplete lines, like cat or wc. Also, git highlights incomplete lines at the end of files. I'm pretty sure that some compilers warn (or used to warn) about missing newlines.
Sure, but what matters most is that Julia be consistent internally, and as @stevengj noted, eachline returns incomplete lines, so it would make sense for countlines to give the number of elements eachline returns.
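As a rough sketch of what that consistent behaviour could look like (this is not how countlines is implemented, and count_lines_inclusive is a hypothetical name): count the newline bytes, then add one if the input ends with an unterminated line.

```julia
# Hypothetical helper: counts lines the way eachline yields them, so a
# trailing unterminated line still counts as a line.
function count_lines_inclusive(io::IO)
    n = 0
    prev = UInt8('\n')            # treat empty input as already terminated
    while !eof(io)
        prev = read(io, UInt8)
        prev == UInt8('\n') && (n += 1)
    end
    # add one for a final line that lacks a terminating newline
    return prev == UInt8('\n') ? n : n + 1
end

# The result agrees with readlines on all three strings from the first post.
for s in ("abcd\nefgh", "abcd\nefgh\n", "abcd\nefgh\n ")
    @assert count_lines_inclusive(IOBuffer(s)) == length(readlines(IOBuffer(s)))
end
```

Reading byte by byte keeps the sketch short; a real implementation would presumably read in larger chunks.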
Yes, I can see the point of the proposed change. I was just giving some context on why Unix tools behave unexpectedly when the newline at the end of a file is missing.
I agree with the consistency part most of all. Developers will most likely use countlines() together with readlines() or eachline(). If I were going to write code that reads a large dataset of unknown length (e.g. 20 million lines), I would want to pre-allocate a vector to prevent GC thrashing. It is much faster to count the lines in the file by scanning for newline characters than to read in the whole dataset. So I would write some code along the following lines (not working code, just a sketch):
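v = Vector{MyType}(undef, countlines(file))  # pre-allocate from the line count
for (i, l) in enumerate(eachline(file))
    v[i] = parse_mytype(l)  # out of bounds if eachline yields more lines than counted
end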
The above could throw an exception (a BoundsError) when eachline and countlines don't agree on the number of lines.
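One way to stay safe whichever way countlines behaves, as a sketch only: treat the count as a capacity hint rather than a fixed length, and push! the parsed records. Here parse_record and read_records are hypothetical stand-ins for parse_mytype and the reading loop.

```julia
# Hypothetical parser standing in for parse_mytype.
parse_record(line) = uppercase(line)

function read_records(io::IO, expected::Integer)
    v = String[]
    sizehint!(v, expected)          # reserve capacity without fixing the length
    for l in eachline(io)
        push!(v, parse_record(l))   # still grows safely if the hint was too small
    end
    return v
end

data = "abcd\nefgh"
n = countlines(IOBuffer(data))      # may or may not include the unterminated last line
records = read_records(IOBuffer(data), n)
```

The trade-off is that a low hint can cost one extra reallocation at the end, but an off-by-one count can no longer cause an out-of-bounds write.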