Fastest way to read fixed-width tables embedded in a larger text file

I’ve got a bunch of .txt files that contain a few hundred lines of “junk” at the beginning, then two tables embedded near the end. Those two tables always begin the same number of lines from the end of the file, and the columns are all fixed width. It looks something like this:

junk
junk

Name        V1  V2  V3
A          XXX   1  22
B          XXX   3 444

junk

Name        V1  V2  V3
A          XXX   1  22
B          XXX   3 444

junk

Code to generate the above test.txt:

testStr = rpad("Name", 10)*lpad("V1", 4)*lpad("V2", 4)*lpad("V3", 4)*"\n"*
          rpad("A", 10)*lpad("XXX", 4)*lpad("1", 4)*lpad("22", 4)*"\n"*
          rpad("B", 10)*lpad("XXX", 4)*lpad("3", 4)*lpad("444", 4)*"\n"*"\n"*"junk"*"\n"

testStr = "junk\n"*"junk\n"*"\n"*testStr*"\n"*testStr

path_run = pwd()
fname    = joinpath(path_run, "test.txt")
write(fname, testStr)

What I would like to do is read the first column as strings and a subset of the other columns as integers into a dataframe (unless a dataframe is going to be far slower than some other multi-type structure), and then store the two dataframes in a vector (or similar):

using DataFrames, BenchmarkTools, Profile
# Constant
keepcols = [1; 3:4]
colnames = ["Name", "V2", "V3"]
nr = 2

nchar     = [10; repeat([4], 3)]
col_end   = cumsum(nchar)
col_start = col_end - nchar .+ 1
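
# For reference, a quick sanity check of the column boundaries this gives
# (added illustration, not part of the benchmarked code):
# "Name" occupies bytes 1:10 and V1, V2, V3 occupy 4 bytes each.
@assert col_start == [1, 11, 15, 19]
@assert col_end   == [10, 14, 18, 22]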

function read_txt(fname, keepcols, colnames, nr, col_end, col_start)
    # Variable
    input_str = readlines(fname)
    len       = length(input_str)
    idx_start = [len - 2 - 2*(nr) - 3, len - 1 - nr]   # first data row of each table, counted from the end
    idx_end   = [len - 2 - nr - 4, len - 2]            # last data row of each table

    res = Vector{DataFrame}(undef, 2)
    @views for i in 1:2
        strvec = input_str[idx_start[i]:idx_end[i]]
        mat  = Matrix{SubString{String}}(undef, length(strvec), length(keepcols))
        for j in eachindex(strvec)
            str = strvec[j]
            cnt = 1
            for k in keepcols
                mat[j, cnt] = str[col_start[k]:col_end[k]]
                cnt += 1
            end
        end
        df = DataFrame(parse.(Int16, mat[:, 2:end]), colnames[2:end])
        insertcols!(df, 1, "Name" => mat[:, 1])

        res[i] = df
    end
    return res
end


path_run = pwd()
fname    = joinpath(path_run, "test.txt")
res      = read_txt(fname, keepcols, colnames, nr, col_end, col_start)
@benchmark res = read_txt($fname, $keepcols, $colnames, $nr, $col_end, $col_start)

Desired Output:

2-element Vector{DataFrame}:
 2×3 DataFrame
 Row │ Name        V2     V3    
     │ SubStrin…   Int16  Int16 
─────┼──────────────────────────
   1 │ A               1     22
   2 │ B               3    444
 2×3 DataFrame
 Row │ Name        V2     V3    
     │ SubStrin…   Int16  Int16 
─────┼──────────────────────────
   1 │ A               1     22
   2 │ B               3    444

Benchmark:

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  14.991 μs … 542.371 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.905 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.038 μs ±   9.227 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▃▅██▇▆▅▄▃▃▂▁▁               ▂▁▁▅▄▃▃▂▁                      ▂
  ▅▇██████████████▇▆▆▆▅▆▅▄▅▇▅▇▇███████████▇▆▇▇█▇█▇▇▆▆▆▆▅▆▅▅▅▅▆ █
  15 μs         Histogram: log(frequency) by time      33.5 μs <

 Memory estimate: 5.82 KiB, allocs estimate: 85.

Profiling shows the bottlenecks are readlines(), parse(), and DataFrame().
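
For reference, a profile can be collected like this (a minimal sketch with the Profile stdlib, not the exact run I used; sample counts will vary by machine):

using Profile

read_txt(fname, keepcols, colnames, nr, col_end, col_start)   # compile first
Profile.clear()
@profile for _ in 1:10_000
    read_txt(fname, keepcols, colnames, nr, col_end, col_start)
end
Profile.print(format=:flat, sortedby=:count)   # readlines, parse, DataFrame dominate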

  1. Is there a way to read the file from the end, i.e., skip the rest and check only the last n lines of the file? Memory is not an issue here, only speed. Or is there a way to skip some of the overhead of readlines? I did try split(read(fname, String), "\n")[1:end-1] and saw fewer, but larger, allocations, and it ended up being slower.

  2. A dataframe is the easiest for me to work with and debug, but if another solution is far faster (e.g., 2x), it may be worth switching to that. I’d really prefer to keep the string “Name” column and the integer columns in the same rows of a table.

  3. How about the conversion from SubString to Int? Is there any way to improve on that? (One idea is sketched just below.)
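
For question 3, one idea I haven’t benchmarked is a hand-rolled digit loop instead of parse; a sketch, assuming the fields are non-negative ASCII integers padded with spaces (fastint is just an illustrative name, not from the code above):

# Hypothetical helper: parse a space-padded, non-negative ASCII integer field
# without calling `parse`
function fastint(s::AbstractString)
    v = 0
    for c in codeunits(s)
        c == UInt8(' ') && continue    # skip the fixed-width padding
        v = 10v + (c - UInt8('0'))     # accumulate decimal digits
    end
    return v % Int16
end

fastint(" 444")   # Int16(444)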

I figured out that using a struct instead of a dataframe does give a substantial boost:

# Constant
struct TxtTable
    name   :: Vector{SubString{String}}
    cnames :: Vector{String}
    vals   :: Matrix{Int16}
end

keepcols = [3, 4]
colnames = ["V2", "V3"]
nr = 2

nchar     = [10; repeat([4], 3)]
col_end   = cumsum(nchar)
col_start = col_end - nchar .+ 1

function read_txt(fname, keepcols, colnames, nr, col_end, col_start)
    # Variable
    input_str = readlines(fname)
    len       = length(input_str)
    idx_start = [len - 2 - 2*(nr) - 3, len - 1 - nr]   # first data row of each table, counted from the end
    idx_end   = [len - 2 - nr - 4, len - 2]            # last data row of each table

    res = Vector{TxtTable}(undef, 2)
    @views for i in 1:2
        # fresh arrays for each table so the two results don't share (and overwrite) storage
        namevec = Vector{SubString{String}}(undef, nr)
        mat     = Matrix{Int16}(undef, nr, length(keepcols))
        strvec = input_str[idx_start[i]:idx_end[i]]
        for j in eachindex(strvec)
            str        = strvec[j]
            namevec[j] = str[col_start[1]:col_end[1]]
            cnt = 1
            for k in keepcols
                mat[j, cnt] = parse(Int16, str[col_start[k]:col_end[k]])
                cnt += 1
            end
        end
        res[i] = TxtTable(namevec, colnames, mat)
    end
    return res
end

path_run = pwd()
fname    = joinpath(path_run, "test.txt")
res      = read_txt(fname, keepcols, colnames, nr, col_end, col_start)
@benchmark res = read_txt($fname, $keepcols, $colnames, $nr, $col_end, $col_start)

Output:

2-element Vector{TxtTable}:
 TxtTable(SubString{String}["A         ", "B         "], ["V2", "V3"], Int16[1 22; 3 444])
 TxtTable(SubString{String}["A         ", "B         "], ["V2", "V3"], Int16[1 22; 3 444])

Benchmark

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.786 μs … 379.308 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.388 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.081 μs ±   7.851 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▃▄▅█▇▇▆▅▃▃▂▂▂▁      ▅▅▄▆▄▃▃▁▁▁                              ▂
  ████████████████▇▇▇▇▇███████████████▇▇█▇▇▇▆█▇▇█▆▇▆▆▅▆▃▄▅▅▅▄▅ █
  10.8 μs       Histogram: log(frequency) by time      24.8 μs <

 Memory estimate: 2.16 KiB, allocs estimate: 43.

Perhaps I’ll pay for this later when manipulating data based on the name column, though.
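
If I do need a DataFrame for that later, a single TxtTable can be converted on demand with the same constructors as in the first version (a sketch):

t  = res[1]
df = insertcols!(DataFrame(t.vals, t.cnames), 1, "Name" => t.name)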

Also, it still seems to me like readlines is much slower than necessary here; that line alone takes ~10 μs in my benchmark.

FixedWidthTables.jl is nice for reading such tables. You can try:

using FixedWidthTables

nlines = countlines(file)
nskip = nlines - nrowstable
tbl = FixedWidthTables.read(file, (
     x=(1:2, Int),
     y=(3:5, Float64),
     z=(6:6, Char),
     ...
); skiprows=1:nskip)

Should be reasonably performant, although it’s hard to beat specialized handwritten code.


Thanks, I looked into it, but these tables really are embedded: there are extra lines before, between, and after the tables, so just skipping leading lines is not enough.

In the end, the solution will probably be to avoid these files altogether, but I was trying to avoid changing that part.

Check also this simple approach using vectors of named tuples, which can then be converted to dataframes.

Reading as vectors of named tuples
# https://discourse.julialang.org/t/parsing-strings-as-strings/13833/6
kcparse(T, x::AbstractString) = parse(T, x)
kcparse(::Type{T}, x::AbstractString) where {T <: AbstractString} = T(x)

function read_txt2(fname, colnames, types, colpos)
    res = Any[]                 # same performance as: Vector{NamedTuple{colnames, Tuple{types...}}}[]
    N = length(colnames)
    cname = string(first(colnames))
    open(fname) do io
        while !eof(io)
            str = split(readline(io))
            if !isempty(str) && str[1] == cname
                vnt = NamedTuple{colnames, Tuple{types...}}[]
                str = split(readline(io))
                while !isempty(str)
                    push!(vnt, NamedTuple{colnames}(kcparse.(types, str[colpos])))
                    str = split(readline(io))
                end
                push!(res, vnt)
            end
        end
    end
    return res
end

fname = "test.txt"
colnames = (:Name, :V2, :V3)
types = (String, Int, Int)
colpos = [1, 3, 4]

res2 = read_txt2(fname, colnames, types, colpos)
@btime read_txt2($fname, $colnames, $types, $colpos)     # 45 μs (160 allocs: 9.4 KiB)

using DataFrames
df1 = DataFrame(res2[1])
df2 = DataFrame(res2[2])

Hmm, I see… Would it help to have an option like takerows=43:128? It would ignore everything before row 43 and everything after row 128.


I benchmarked a few alternatives to readlines on one of my larger actual files (not the test.txt), and was able to figure out a way to speed it up ~3x.

It required assuming I only care about the last ~4000 bytes.

NB: @btime reports the minimum time, so below the improvement looks like only ~2x.

function readComm(fname)
    out  = String[]
    for line in eachline(fname)
        if length(line) == 98
            push!(out, line)
        end
    end
    return out
end


function readComm2(fname)
    out = Vector{String}(undef, 36)
    i = 1
    for line in eachline(fname)
        if length(line) == 98
            out[i] = line
            i += 1
        end
    end
    return out
end


function readComm3(fname)
    x = read(fname, String)
    return split(x[(end - 4000):(end-1)], "\n")
end


function readComm4(fname)
    io = open(fname, "r")
    seekend(io)
    seek(io, position(io) - 4000)
    x = read(io, String)
    close(io)

    return split(x[1:end-1], "\n")
end
@btime readlines($fname);
# 21.88 μs (399 allocations: 25.52 KiB) 

@btime split(read($fname, String)[1:(end-1)], "\n");
# 19.34 μs (27  allocations: 44.10 KiB) 

@btime readComm($fname);
# 24.62 μs (195 allocations: 15.34 KiB) 

@btime readComm2($fname);
# 25.61 μs (194 allocations: 15.21 KiB) 

@btime readComm3($fname);
# 10.79 μs (25  allocations: 21.01 KiB) 

@btime readComm4($fname);
#  9.72 μs (25  allocations: 14.57 KiB)

So it looks like the last is a clear winner. I may be able to use seek to skip a few more junk lines as well.

Finally, a do-block variant of readComm4 also worked but was slightly slower. My understanding is that it guarantees the file is closed even if an error is thrown, so perhaps that is worth the cost.
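
The do-block variant looks roughly like this (a sketch under the same last-4000-bytes assumption as readComm4; readComm5 is just an illustrative name):

function readComm5(fname)
    open(fname, "r") do io
        seekend(io)
        seek(io, position(io) - 4000)    # jump to the last ~4000 bytes
        x = read(io, String)
        return split(x[1:end-1], "\n")   # drop the trailing newline, then split
    end
end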

I think so, especially for large files, since fewer lines would then need to be parsed.

Thanks, I’ll take a look. Those may be easier to work with than the structs.

I haven’t tried it out yet, but looking at the code, I see there is a startswith argument:

https://github.com/JuliaAPlavin/FixedWidthTables.jl/blob/master/src/FixedWidthTables.jl

An endswith argument would also help for cases like mine. There are also multiple tables in the same file, so if it could loop over a vector of those, it would save opening the file multiple times.