readlines(filepath) and read(filepath, String) return different content

Hey Julianners,
This is fairly strange behavior: readlines(file_path) doesn’t read the last newline from the file. Why is that?

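file_path = "test/playground/test/src/profile_startup.jl"
lines = readlines(file_path, keep=false)  # keep=false is the default
@show lines
show(join(lines, "\n"))  # the lines re-joined with "\n"
println()
show(read(file_path, String))  # the raw file content
println()
@assert join(lines, "\n") == read(file_path, String)  # fails when the file ends with "\n"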

If I remove the empty line from the file, then they read the same.
Is this intentional? It caused quite a headache to figure out how the file had changed for no apparent reason; the cause was that the two places read the file in different ways.

Using:
lines = collect(eachline(open(file_path)))
also gives the same ambiguous behavior.

OK, this is actually pretty strange. So I just moved forward with this solution:
lines = split(read(file_path, String), "\n")

But it may be worth noting that readlines doesn’t actually return the last empty line of the file in certain cases. And if there is no empty line, then it does return the last line… So it is pretty ambiguous.
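
To make the difference concrete, here is a small sketch on an in-memory string instead of a file (the string literal is just an illustrative stand-in for the file content). Note that splitting on "\n" leaves any "\r" in place, so it only round-trips cleanly for files with Unix line endings.

```julia
# Stand-in for the file content; it ends with a newline.
content = "1\n2\n3\n"

# split keeps an empty trailing field after the final "\n".
parts = split(content, "\n")
@show parts                          # 4 elements; the last one is ""
@show join(parts, "\n") == content   # true: nothing was lost

# readlines (keep=false, the default) drops the final line terminator.
lines = readlines(IOBuffer(content))
@show lines                          # 3 elements
@show join(lines, "\n") == content   # false: the trailing "\n" is gone
```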

There is a keyword parameter keep in readlines, which defaults to false; if true, it keeps the newline characters in the strings. This may be related.

At least, the function readlines is not documented accurately, as keep is just indirectly documented in the examples. So I would think this is a bug or at least bad documentation. You should open an issue for readlines.

I saw that flag, but sadly it causes even more problems.
With keep=true, it seems to parse as if each \n were a double \n, except for the last \n, which counts as one \n. So although it is not impossible to write a parser that reconstructs the number of \ns accurately, it is still pretty bad behavior.
readlines should return everything that is in the file, like read does.

It’s a little philosophical: what is a “line”? Is a file that contains just a "\n" one line or two? Does a line end with a \n or does \n split lines? How many lines does "1\n2\n3\n" have and is it the same as "1\n2\n3"?

Yes, readlines with the default keep argument treats files with a final \n as exactly the same as those without it. If you need to detangle the two, use keep=true and it’ll return everything that is in the file like the read does — just concatenate the outputs (without adding extra \ns) and you’ll get back the same thing as read(_, String).

I think it’s good to play around a bit with these things to get an understanding of the behaviors:

julia> readlines(IOBuffer("1\n2\n3"))
3-element Vector{String}:
 "1"
 "2"
 "3"

julia> readlines(IOBuffer("1\n2\n3\n"))
3-element Vector{String}:
 "1"
 "2"
 "3"

julia> readlines(IOBuffer("1\n2\n3"), keep=true)
3-element Vector{String}:
 "1\n"
 "2\n"
 "3"

julia> readlines(IOBuffer("1\n2\n3\n"), keep=true)
3-element Vector{String}:
 "1\n"
 "2\n"
 "3\n"
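
A quick way to check the keep=true claim is to concatenate the pieces with join and no delimiter. This is only a sketch over a few hand-picked inputs (including a "\r\n" one and an empty one), not an exhaustive test.

```julia
for content in ("1\n2\n3", "1\n2\n3\n", "1\r\n2\r\n", "")
    kept = readlines(IOBuffer(content); keep=true)
    # join with no delimiter simply concatenates the pieces,
    # so the original text comes back unchanged.
    @assert join(kept) == content
end

# With the default keep=false, both "\n" and "\r\n" terminators are stripped.
@assert readlines(IOBuffer("1\r\n2\r\n")) == ["1", "2"]
```

Joining the keep=true output with "\n" instead would add a second newline after every line except the last, which is probably where the "double \n" impression comes from.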

Yet, the intended behavior should be documented clearly.

readlines(IOBuffer("1\n2\n3\n"))

Thank you for the answers!

Well, a little bit philosophical indeed. I would of course vote for the version that matches what we see in the file view and doesn’t lose data during the transformation. So if I see an empty line at the end of the file in the file view, then I would say there is a “line” at the end. But I understand that here we interpret it as a “newline” with nothing after it… so we drop it when keep is false.
It would be nice to have a chance to vote on what the standard is. One objective argument for following the file view and returning that extra empty "" at the end is that otherwise we lose information by calling readlines and then joining the result back together. It would be good to be able to readlines and join back together as many times as you want without losing information.
So this:

julia> readlines(IOBuffer("1\n2\n3\n"))
4-element Vector{String}:
 "1"
 "2"
 "3"
 ""  # this would be the way to go I think to not allow losing data

Anyway, it is just an opinion, of course.

I mean… without reading the docs I would expect it to preserve all information in every case. :smiley:

What file view? Different tools display/treat files without a final \n (or \r\n) differently. There’s not really a standard; one such standard says it’s just incomplete. GitHub doesn’t even show the line; instead it flags final lines that are missing their newline. wc — a utility whose only purpose is to count — doesn’t count incomplete lines. 1\n2 only has one line in its view. Some editors even silently add missing final newlines upon save.
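
As a rough illustration of the wc point: `wc -l` counts newline characters, so an unterminated final line is not counted. The helper name `count_newlines` below is made up for this sketch.

```julia
# Hypothetical helper: count complete lines the way `wc -l` does,
# i.e. one per newline character.
count_newlines(s) = count(==('\n'), s)

@show count_newlines("1\n2")                 # 1: the "2" is an incomplete line
@show count_newlines("1\n2\n")               # 2
@show length(readlines(IOBuffer("1\n2")))    # 2: readlines still returns the "2"
```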

If you don’t want to “lose data” (including the newline flavor), use keep=true.

But empty lines at the end of a file already mean something and you’d want to preserve them, too. ["1","2","3",""] is what you get from readlines(IOBuffer("1\n2\n3\n\n")).

The “inverse” of the default readlines(x) is not join with \n, but rather foreach(println, _). And the “inverse” of readlines(x; keep=true) is foreach(print, _). Yes, the default might add a final newline if your file doesn’t already have it. It might also change the line separator. It’s an opinionated take that matches how many tools work with files.
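
Here is a minimal sketch of those two “inverses”, writing into an in-memory buffer rather than a file. The helper `write_lines` is a name invented for this example.

```julia
# Hypothetical helper: write each element of `lines` with `f` (print or println)
# into a buffer and return the resulting text.
function write_lines(f, lines)
    io = IOBuffer()
    foreach(line -> f(io, line), lines)
    return String(take!(io))
end

with_nl    = "1\n2\n3\n"
without_nl = "1\n2\n3"

# keep=true + print reproduces the input exactly in both cases.
@assert write_lines(print, readlines(IOBuffer(with_nl); keep=true)) == with_nl
@assert write_lines(print, readlines(IOBuffer(without_nl); keep=true)) == without_nl

# default readlines + println normalizes: a final newline is always written,
# so both inputs come back as the newline-terminated version.
@assert write_lines(println, readlines(IOBuffer(with_nl))) == with_nl
@assert write_lines(println, readlines(IOBuffer(without_nl))) == with_nl
```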

I think it’s also worth scrolling up a bit for the POSIX definition of a line, “A sequence of zero or more non-<newline> characters plus a terminating <newline> character”.

The “inverse” of the default readlines(x) is not join with \n, but rather foreach(println, _).
For me:

julia> lines=readlines(IOBuffer("1\n2\n3\n"))
3-element Vector{String}:
 "1"
 "2"
 "3"

julia> foreach(println, lines)
1
2
3

julia> lines=readlines(IOBuffer("1\n2\n3"))
3-element Vector{String}:
 "1"
 "2"
 "3"

julia> foreach(println, lines)
1
2
3

julia> 

So it isn’t an inverse of it, as in the process we lose the information about whether the last line ended with a newline or not.

I am fine with not using it. I understand how it works, I appreciate your ideas, and I respect this behavior, so I don’t want to change it.

Of course I would be curious who wants to lose information occasionally (I guess 90% of people are not aware of this behavior and luckily don’t get errors because of it). :smiley:

I think it’s more that final trailing newlines (or the lack thereof) aren’t used to convey meaningful information in the vast majority of use-cases, and it’s generally considered best practice to ensure that text files end with that final newline.

I’m sorry you got caught out here, it’s always frustrating when that happens.

I notice the comment there:

Now, on non POSIX compliant systems (nowadays that’s mostly Windows), the point is moot: files don’t generally end with a newline, and the (informal) definition of a line might for instance be “text that is separated by newlines” (note the emphasis). This is entirely valid. However, for structured data (e.g. programming code) it makes parsing minimally more complicated: it generally means that parsers have to be rewritten.

Julia could have gone with the (informal) “Windows” way and read the last incomplete line as a full line. I’m wondering, would it be really bad? Less surprising to most users? Another option would be to throw an error for incomplete lines as a compromise.
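
For what it’s worth, the “throw an error” compromise can be sketched in user code on top of keep=true. The function name `readlines_strict` is invented here; nothing like it exists in Base as far as I know.

```julia
# Hypothetical strict variant: error if the input has an unterminated last line.
function readlines_strict(io::IO)
    lines = readlines(io; keep=true)
    if !isempty(lines) && !endswith(last(lines), "\n")
        throw(ArgumentError("input does not end with a newline"))
    end
    # chomp removes one trailing "\n" or "\r\n" from each line.
    return String.(chomp.(lines))
end

readlines_strict(IOBuffer("1\n2\n"))   # returns ["1", "2"]
# readlines_strict(IOBuffer("1\n2"))   # would throw an ArgumentError
```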

I think readlines and eachline would mostly be used for text; I doubt anyone is going to be implementing cat or working with binary data. I think you can’t even do that, since keep=true isn’t the default (nor do you want it to be). [You of course want the possibility to work with binary data too, somehow, and if changing Julia then not in a breaking way, and would throwing an exception be considered a breaking change?]

Would most Windows editors also add a newline? I.e. avoid the issue, or how did OP actually get into this mess?

That’s what we do. The default is just to strip trailing newline character(s), whether they exist or not. If you care about the existence of newlines, keep them.

Yeah, most people won’t care about every line, so most of the time people will luckily avoid running into issues due to this behavior, I know.
In my case, I checked whether the file had changed since the previous run; the check used readlines, while the earlier code used read, which reads the file exactly as it is. So it randomly raised a “file changed” signal… It was a pretty well-hidden issue. XD
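
A minimal sketch of that kind of check, assuming the snapshot and the comparison both go through read(path, String). The helper name `file_changed` and the temporary file content are made up for the example.

```julia
# Hypothetical helper: has the file's content changed relative to `previous`?
file_changed(path, previous::AbstractString) = read(path, String) != previous

mktemp() do path, io
    write(io, "a = 1\nb = 2\n")
    flush(io)

    # Same reader on both sides: no spurious change is reported.
    snapshot = read(path, String)
    @assert !file_changed(path, snapshot)

    # Mixing readers: readlines + join drops the final "\n",
    # so the untouched file looks "changed".
    rebuilt = join(readlines(path), "\n")
    @assert file_changed(path, rebuilt)
end
```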