How to read only the last line of a file (.txt)?

Hello,

I have a very large txt file and would like to read only the last line of it. What is the best way to do it?

For example if I have this file

some header option
header other option
random metadata
A,B,C,X,Y,Z
1,1,1,2.0,0.0,102.0
1,1,2,2.0,0.0,202.0
1,1,3,2.0,0.0,302.0
1,2,1,3.0,1.0,103.0
1,2,2,3.0,1.0,203.0
1,2,3,3.0,1.0,303.0
1,3,1,4.0,2.0,104.0
1,3,2,4.0,2.0,204.0
1,3,3,4.0,2.0,304.0
1,4,1,5.0,3.0,105.0
1,4,2,5.0,3.0,205.0
1,4,3,5.0,3.0,305.0
1,5,1,6.0,4.0,106.0
1,5,2,6.0,4.0,206.0
1,5,3,6.0,4.0,306.0
1,6,1,7.0,5.0,107.0
1,6,2,7.0,5.0,207.0
1,6,3,7.0,5.0,307.0

but it is 100 GB and I only want to rapidly read the last line (1,6,3,7.0,5.0,307.0).

4 Likes

Are you using Linux? An external call to tail could help.
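
On Linux that could look something like this (a minimal sketch; "file.txt" stands in for the actual path):

# run the external `tail` program and capture its output without the trailing newline
last_line = readchomp(`tail -n 1 file.txt`)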

Unfortunately I am not using Linux.

I am trying something like this

function _read_last_line(file)
    io = Base.open(file)   # use the function argument, not a global path
    seekend(io)
    current_char = 'a'
    pos = 2
    seek(io, position(io) - pos) # this is a \n
    while current_char != '\n'
        @show current_char
        current_char = Base.read(io, Char)
        pos += 1
        seek(io, position(io) - pos)
        if pos >= 40
            break
        end
    end
    last_line = Base.read(io, String)
    Base.close(io)
    return last_line
end
1 Like

Your solution seems to be fast enough.

This is working for now but seems very ugly:

function _read_last_line(file)
    io = Base.open(file)
    seekend(io)
    current_char = 'a'
    pos = 1
    seek(io, position(io) - pos) # this is a \n
    while current_char != '\n'
        seek(io, position(io) - pos)
        current_char = Base.read(io, Char)
        pos += 1
    end
    seek(io, position(io) - pos)
    last_line = readlines(io)[end]
    Base.close(io)
    return last_line
end
1 Like
julia> function read_last(file)
         open(file) do io
           seekend(io)
           seek(io, position(io) - 1)
           while Char(peek(io)) != '\n'
             seek(io, position(io) - 1)
           end
           read(io, Char)
           read(io, String)
         end
       end
read_last (generic function with 1 method)

julia> read_last("file.txt")
"1,6,3,7.0,5.0,307.0"

maybe?

13 Likes
function lastLine(file)
    ea = eachline(file)
    local line
    for l in ea
        line = l
    end
    line
end

Should be plenty fast enough.

That seems to be it, but why do you have to read the io as a Char and then as a String?

That’s just to skip the \n. Do give @Sukera’s solution a try first though.

One thing to note: if your data always looks like that CSV and you only ever want to read the last row, do you have the option of accessing that new data directly, instead of going through the large CSV? That would save you from having to interact with the 100 GB at all.

Thank you @pfitzseb and @Sukera. I benchmarked the solutions and here are the results:

julia> using BenchmarkTools

julia> file = "gerter.csv"
"gerter.csv"

julia> function lastLine(file)
           ea = eachline(file)
           local line
           for l in ea
               line = l
           end
           line
       end
lastLine (generic function with 1 method)

julia> 

julia> function read_last(file)
           open(file) do io
               seekend(io)
               seek(io, position(io) - 2)
               while Char(peek(io)) != '\n'
                   seek(io, position(io) - 1)
               end
               Base.read(io, Char)
               Base.read(io, String)
           end
       end
read_last (generic function with 1 method)

julia> @btime lastLine(file)
  2.125 ms (21829 allocations: 1.00 MiB)
"15,2,720,152720.0,720.0,152.0"

julia> @btime read_last(file)
  128.200 μs (17 allocations: 1.27 KiB)
"15,2,720,152720.0,720.0,152.0\n"

I have tested on a not particularly big file, only 1 MB.

1 Like

Yeah, I wouldn’t expect the eachline solution to be faster - it actually has to materialize each line and goes through the file from the beginning, whereas the solution by @pfitzseb walks in from the end. Depends on your performance needs I guess.

Could you check this version of “eachline”:

function lastline2(file)
    last = ""
    open(file) do io
        while !eof(io)
            last = readline(io)
        end
    end
    return last
end

Another one: foldl((x,y)->y, eachline(filename))
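
For example (a minimal sketch; "file.txt" is a placeholder, and the init keyword is only there so an empty file gives "" instead of an error):

# fold over the lines, keeping only the most recent one seen
last_line = foldl((_, line) -> line, eachline("file.txt"); init = "")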

FWIW, I tested the functions above on two CSV files: (1) 146,240 rows x 8 columns; and (2) 275 rows x 427 columns.

In the first case @pfitzseb’s function is ~ 250x faster, while in the second case the “eachline” solutions are ~ 14x faster.

Yeah, eachline scales worse with more rows: it has to construct and throw away O(rows) strings, whereas the go-from-the-end solution is O(1) in the number of rows. It could be improved by reading a cacheline's worth of bytes from the end at a time, checking whether any of them are a '\n', and then doing the read(io, String) from that index.
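
A rough sketch of that chunked idea could look like the following (untested; the blocksize default, the trailing-newline handling, and the function name are assumptions, not something from this thread):

# Sketch: scan backwards in fixed-size blocks until a '\n' is found,
# then read everything after it as the last line.
function read_last_chunked(file; blocksize = 64)
    open(file) do io
        seekend(io)
        endpos = position(io)                # total size of the file in bytes
        if endpos > 0                        # ignore a single trailing newline
            seek(io, endpos - 1)
            read(io, UInt8) == UInt8('\n') && (endpos -= 1)
        end
        pos = endpos
        while pos > 0
            start = max(pos - blocksize, 0)  # read one block, moving backwards
            seek(io, start)
            chunk = read(io, pos - start)
            idx = findlast(==(UInt8('\n')), chunk)
            if idx !== nothing
                seek(io, start + idx)        # first byte after the newline
                return String(read(io, endpos - (start + idx)))
            end
            pos = start
        end
        seek(io, 0)                          # no newline at all: file is a single line
        return String(read(io, endpos))
    end
end

Calling read_last_chunked("file.txt") should return the same string as read_last above, just with fewer seek calls on long last lines.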

The advantage of the eachline solution is that IO can read ahead a lot and cache that, improving performance for short files. Not sure if seek seeks from the start though… would have to be tested.

2 Likes

The equivalent command on Windows (PowerShell) is:
Get-Content -Tail 1 "filename"

(NB: replace 1 with the number of lines desired)

Is it possible to run this powershell command from the Julia REPL and get back the output string in Julia?

1 Like

Sure:

julia> read(`powershell Get-content -tail 1 "file.txt"`, String)
"1,6,3,7.0,5.0,307.0\r\n"
2 Likes

@pfitzseb, brilliant.
There seems to be a fixed ~200 ms lag/overhead in this call; is that normal?

Well, PowerShell has to be started as well; that's presumably not free.

On top of which, tail would fundamentally be doing something similar, so I’d guess at best you’d still have to pay that overhead.

3 Likes