Read a file separated by blank lines?

I have a file containing several multi-line blocks of text separated by blank lines. A blank line may contain just a line break, or also some whitespace. Is there something similar to readline() that lets me read these blocks one by one, each as a string?

I am not aware of a direct way to do this, but you could still read the file line by line. To separate your blocks (i.e., to detect blank lines), you could use something like this:

open("foo_data.txt") do f
    for l in eachline(f)
        # all(isspace, l) is true for empty and whitespace-only lines
        if all(isspace, l)
            println("found blank line")
        else
            # do something with the read data
        end
    end
end

The else branch (and/or the loop) needs to be adapted depending on your desired final data format.
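As one possible adaptation, here is a minimal sketch that collects each block into a single string. The function name readblocks and its argument src are made up for illustration; it assumes that a run of several blank lines should count as one separator and that the lines of a block should be joined with "\n".

```julia
# Hypothetical helper: collect blank-line-separated blocks as strings.
# `src` may be a file name or an already opened IO object.
function readblocks(src)
    blocks = String[]      # finished blocks
    current = String[]     # lines of the block being read
    for line in eachline(src)
        if all(isspace, line)
            # A blank line ends the current block, if there is one.
            if !isempty(current)
                push!(blocks, join(current, "\n"))
                empty!(current)
            end
        else
            push!(current, line)
        end
    end
    # The last block may not be followed by a blank line.
    isempty(current) || push!(blocks, join(current, "\n"))
    return blocks
end

blocks = readblocks(IOBuffer("a\nb\n  \n\nc\n"))
```

With this input, blocks should hold two strings, one per block. Pass a file name instead of the IOBuffer to read from disk.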

For example:

function readparagraph(io)
    buf = IOBuffer()
    while !eof(io)
        line = readline(io; keep=true)
        all(isspace, line) && break
        print(buf, line)
    end
    return String(take!(buf))
end
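
A short usage sketch for the function above. Note that readparagraph returns an empty string when it hits a blank line immediately (for example, with two blank lines in a row), so the hypothetical wrapper readparagraphs below skips empty results. The wrapper name and the sample input are assumptions for illustration only.

```julia
# Repeated from above so that this snippet runs on its own.
function readparagraph(io)
    buf = IOBuffer()
    while !eof(io)
        line = readline(io; keep=true)
        all(isspace, line) && break
        print(buf, line)
    end
    return String(take!(buf))
end

# Hypothetical wrapper: read every paragraph until the stream is exhausted.
function readparagraphs(io)
    paragraphs = String[]
    while !eof(io)
        p = readparagraph(io)
        # Consecutive blank lines produce empty paragraphs; skip them.
        isempty(p) || push!(paragraphs, p)
    end
    return paragraphs
end

paragraphs = readparagraphs(IOBuffer("a\nb\n  \n\nc\nd\n"))
```

Because of keep=true, each paragraph keeps its line breaks, including the final one. For a file, wrap the call as open(readparagraphs, "foo_data.txt").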

Thanks, I’ll use it!

A question about the last line of your function: is it possible to convert Vector{Char} to String without copying the data and allocating memory, similar to how take! constructs Vector{Char} without copying? I guess that’s not possible in Julia because the vector is mutable and the string is immutable?

String(take!(buf)) already constructs the string without making a copy. It does so from a Vector{UInt8} holding the UTF-8 encoding, not from a Vector{Char}.

(Vector{Char} requires 4 bytes per character, similar to UTF-32, which is different from the encoding used by String, and is not generally a recommended way to store strings.)
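
To make the size difference concrete, here is a small sketch; the sample string is arbitrary.

```julia
s = "héllo"             # 5 characters; 'é' needs 2 bytes in UTF-8
chars = collect(s)      # Vector{Char}: one 4-byte element per character

nchars = length(s)          # number of characters
nbytes_str = sizeof(s)      # bytes stored by the String (UTF-8)
nbytes_vec = sizeof(chars)  # bytes stored by the Vector{Char}

# take! on an IOBuffer hands back the underlying bytes as a Vector{UInt8}.
buf = IOBuffer()
print(buf, s)
bytes = take!(buf)
t = String(bytes)       # String built from those bytes
```

Here the String occupies 6 bytes while the Vector{Char} occupies 20. After String(bytes), the vector is truncated to length zero, so later changes to it cannot affect the string.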

More generally, it is possible to construct a String(vec) from a vec::Vector{UInt8} without making a copy of vec, but only if the Vector{UInt8} is specially allocated. This special allocation is done by IOBuffer objects and also by read(io, numbytes), as documented in the String docstring, but it can also be accomplished using the undocumented vec = Base.StringVector(numbytes) constructor. See also "Document/export copy-free string allocation?" (Issue #19945, JuliaLang/julia on GitHub) and "Conversion of Vector{UInt8} to String without copy".
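
A minimal sketch of both routes. Keep in mind that Base.StringVector is internal and undocumented, so it may change between Julia versions; the sample data is arbitrary.

```julia
# Route 1: bytes returned by read(io, nb).
io = IOBuffer("hello world")
bytes = read(io, 5)          # Vector{UInt8} with the first 5 bytes
s = String(bytes)            # takes over the bytes of `bytes`

# Route 2: the undocumented Base.StringVector constructor (internal API).
v = Base.StringVector(3)     # uninitialized Vector{UInt8} of length 3
copyto!(v, codeunits("abc")) # fill it with some bytes
t = String(v)

# In both cases the vector is truncated to length zero afterwards,
# so it can no longer be used to mutate the string.
emptied = isempty(bytes) && isempty(v)
```

If the bytes are still needed after the conversion, String(copy(bytes)) keeps the vector intact at the cost of a copy.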

You can also use StringViews.jl to create a String-like object (another subtype of AbstractString) from an AbstractVector{UInt8} (e.g. a subarray) without making a copy.


(In principle, you could make it even more efficient than this by using lower-level APIs to read bytes directly into the IOBuffer without allocating intermediate string objects via readline, but it's probably not worth the effort. Another alternative would be to use mmap to access the file as an array of bytes, which you could then scan for ASCII newline and whitespace characters. You could then use the StringViews.jl package to create string-like views of the mmap-ed data without making a copy. There are lots of ways to wring more speed out of Julia if you care enough.)
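
A rough sketch of the mmap idea, using only the Mmap standard library. The names isblankbyte and blockranges are hypothetical, and the sketch assumes that "whitespace" means ASCII space, tab, and carriage return. It returns byte ranges for each block; the final step copies each range into a String, which is the copy a StringViews.jl view would avoid.

```julia
using Mmap

# Assumption: only space, tab and carriage return count as blank bytes.
isblankbyte(b::UInt8) = b == UInt8(' ') || b == UInt8('\t') || b == UInt8('\r')

# Hypothetical helper: byte ranges of the blank-line-separated blocks.
function blockranges(bytes::AbstractVector{UInt8})
    ranges = UnitRange{Int}[]
    blockstart = 0   # first byte of the current block, 0 if no block is open
    blockstop = 0    # last byte of the latest non-blank line
    linestart = 1
    n = length(bytes)
    while linestart <= n
        nl = findnext(==(UInt8('\n')), bytes, linestart)
        linestop = nl === nothing ? n : nl         # include the newline
        contentstop = nl === nothing ? n : nl - 1  # exclude the newline
        if all(isblankbyte, view(bytes, linestart:contentstop))
            # A blank line closes the open block, if any.
            if blockstart != 0
                push!(ranges, blockstart:blockstop)
                blockstart = 0
            end
        else
            blockstart == 0 && (blockstart = linestart)
            blockstop = linestop
        end
        linestart = linestop + 1
    end
    blockstart != 0 && push!(ranges, blockstart:blockstop)
    return ranges
end

# Demo on a temporary file.
path, tmpio = mktemp()
write(tmpio, "a\nb\n  \n\nc\nd\n")
close(tmpio)

bytes = Mmap.mmap(path)                # file contents as a Vector{UInt8}
ranges = blockranges(bytes)
blocks = [String(bytes[r]) for r in ranges]  # bytes[r] copies the range
```

Scanning raw bytes for '\n' is safe in UTF-8 because ASCII bytes never occur inside a multi-byte character.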
