How to read only the last line of a file (.txt)?

Hello,

I have a very large txt file and would like to read only the last line of it. What is the best way to do it?

For example if I have this file

some header option
header other option
random metadata
A,B,C,X,Y,Z
1,1,1,2.0,0.0,102.0
1,1,2,2.0,0.0,202.0
1,1,3,2.0,0.0,302.0
1,2,1,3.0,1.0,103.0
1,2,2,3.0,1.0,203.0
1,2,3,3.0,1.0,303.0
1,3,1,4.0,2.0,104.0
1,3,2,4.0,2.0,204.0
1,3,3,4.0,2.0,304.0
1,4,1,5.0,3.0,105.0
1,4,2,5.0,3.0,205.0
1,4,3,5.0,3.0,305.0
1,5,1,6.0,4.0,106.0
1,5,2,6.0,4.0,206.0
1,5,3,6.0,4.0,306.0
1,6,1,7.0,5.0,107.0
1,6,2,7.0,5.0,207.0
1,6,3,7.0,5.0,307.0

but it is 100 GB and I only want to rapidly read the last line (1,6,3,7.0,5.0,307.0).

4 Likes

Are you using Linux? An external call to tail could help.
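
On Linux that could look something like this (a minimal sketch; "file.txt" stands in for the actual path):

# run the external `tail` program and capture its output without the trailing newline
last_line = readchomp(`tail -n 1 file.txt`)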

Unfortunately I am not using Linux.

I am trying something like this

function _read_last_line(file)
    io = Base.open(file)   # use the function argument, not a global path
    seekend(io)
    current_char = 'a'
    pos = 2
    seek(io, position(io) - pos) # this is a \n
    while current_char != '\n'
        @show current_char
        current_char = Base.read(io, Char)
        pos += 1
        seek(io, position(io) - pos)
        if pos >= 40
            break
        end
    end
    last_line = Base.read(io, String)
    Base.close(io)
    return last_line
end
1 Like

Your solution seems to be fast enough.

This is working for now but seems very ugly:

function _read_last_line(file)
    io = Base.open(file)
    seekend(io)
    current_char = 'a'
    pos = 1
    seek(io, position(io) - pos) # this is a \n
    while current_char != '\n'
        seek(io, position(io) - pos)
        current_char = Base.read(io, Char)
        pos += 1
    end
    seek(io, position(io) - pos)
    last_line = readlines(io)[end]
    Base.close(io)
    return last_line
end
1 Like
julia> function read_last(file)
         open(file) do io
           seekend(io)
           seek(io, position(io) - 1)
           while Char(peek(io)) != '\n'
             seek(io, position(io) - 1)
           end
           read(io, Char)
           read(io, String)
         end
       end
read_last (generic function with 1 method)

julia> read_last("file.txt")
"1,6,3,7.0,5.0,307.0"

maybe?

13 Likes
function lastLine(file)
    ea = eachline(file)
    local line
    for l in ea
        line = l
    end
    line
end

Should be plenty fast enough.

That seems to be it, but why do you have to read the io as a Char and then as a String?

That’s just to skip the \n. Do give @Sukera’s solution a try first though.

One thing to note: if your data always looks like that CSV and you only ever want to read the last row, do you have the option of accessing that new data directly, instead of going through the large CSV? That would save you from having to interact with the 100 GB at all.

Thank you @pfitzseb and @Sukera. I benchmarked the solutions and here are the results:

julia> using BenchmarkTools

julia> file = "gerter.csv"
"gerter.csv"

julia> function lastLine(file)
           ea = eachline(file)
           local line
           for l in ea
               line = l
           end
           line
       end
lastLine (generic function with 1 method)

julia> 

julia> function read_last(file)
           open(file) do io
               seekend(io)
               seek(io, position(io) - 2)
               while Char(peek(io)) != '\n'
                   seek(io, position(io) - 1)
               end
               Base.read(io, Char)
               Base.read(io, String)
           end
       end
read_last (generic function with 1 method)

julia> @btime lastLine(file)
  2.125 ms (21829 allocations: 1.00 MiB)
"15,2,720,152720.0,720.0,152.0"

julia> @btime read_last(file)
  128.200 μs (17 allocations: 1.27 KiB)
"15,2,720,152720.0,720.0,152.0\n"

I have tested on a not particularly big file, only 1 MB.

1 Like

Yeah, I wouldn’t expect the eachline solution to be faster - it actually has to materialize each line and goes through the file from the beginning, whereas the solution by @pfitzseb walks in from the end. Depends on your performance needs I guess.

Could you check this version of “eachline”:

function lastline2(file)
    last = ""
    open(file) do io
        while !eof(io)
            last = readline(io)
        end
    end
    return last
end

Another one: foldl((x,y)->y, eachline(filename))
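
For example (a minimal sketch; "file.txt" is a placeholder, and the init keyword is only there so an empty file gives "" instead of an error):

# fold over the lines, keeping only the most recent one seen
last_line = foldl((_, line) -> line, eachline("file.txt"); init = "")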

FWIW, I tested the functions above on two CSV files: (1) 146,240 rows x 8 columns; and (2) 275 rows x 427 columns.

In the first case @pfitzseb’s function is ~ 250x faster, while in the second case the “eachline” solutions are ~ 14x faster.

Yeah, eachline scales worse with more rows: it has to construct and throw away O(rows) strings, whereas the go-from-the-end solution is O(1) in the number of rows. It could be improved by reading a cacheline's worth of bytes from the end at a time, checking whether any of them are a '\n', and then doing the read(io, String) from that index.
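
A rough sketch of that chunked idea could look like the following (untested; the blocksize default, the trailing-newline handling, and the function name are assumptions, not something from this thread):

# Sketch: scan backwards in fixed-size blocks until a '\n' is found,
# then read everything after it as the last line.
function read_last_chunked(file; blocksize = 64)
    open(file) do io
        seekend(io)
        endpos = position(io)                # total size of the file in bytes
        if endpos > 0                        # ignore a single trailing newline
            seek(io, endpos - 1)
            read(io, UInt8) == UInt8('\n') && (endpos -= 1)
        end
        pos = endpos
        while pos > 0
            start = max(pos - blocksize, 0)  # read one block, moving backwards
            seek(io, start)
            chunk = read(io, pos - start)
            idx = findlast(==(UInt8('\n')), chunk)
            if idx !== nothing
                seek(io, start + idx)        # first byte after the newline
                return String(read(io, endpos - (start + idx)))
            end
            pos = start
        end
        seek(io, 0)                          # no newline at all: file is a single line
        return String(read(io, endpos))
    end
end

Calling read_last_chunked("file.txt") should return the same string as read_last above, just with fewer seek calls on long last lines.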

The advantage of the eachline solution is that IO can read ahead a lot and cache that, improving performance for short files. Not sure if seek seeks from the start though… would have to be tested.

2 Likes

The equivalent command on Windows (PowerShell) is:
Get-Content -Tail 1 "filename"

(NB: replace 1 with the number of lines desired)

Is it possible to run this powershell command from the Julia REPL and get back the output string in Julia?

1 Like

Sure:

julia> read(`powershell Get-content -tail 1 "file.txt"`, String)
"1,6,3,7.0,5.0,307.0\r\n"
2 Likes

@pfitzseb, brilliant.
There seems to be a fixed ~200 ms lag/overhead in this call; is that normal?

Well, PowerShell has to be started as well; that's presumably not free.

On top of which, tail would fundamentally be doing something similar, so I’d guess at best you’d still have to pay that overhead.

3 Likes