I’ve got a bunch of .txt files that contain a few hundred lines of “junk” at the beginning, then two tables embedded at the end. Those two tables always begin the same number of lines from the end of the file. Further, the columns are all fixed width. It looks something like this:
junk
junk
Name        V1  V2  V3
A          XXX   1  22
B          XXX   3 444
junk
Name        V1  V2  V3
A          XXX   1  22
B          XXX   3 444
junk
Code to generate the above test.txt:
testStr = rpad("Name", 10)*lpad("V1", 4)*lpad("V2", 4)*lpad("V3", 4)*"\n"*
rpad("A", 10)*lpad("XXX", 4)*lpad("1", 4)*lpad("22", 4)*"\n"*
rpad("B", 10)*lpad("XXX", 4)*lpad("3", 4)*lpad("444", 4)*"\n"*"\n"*"junk"*"\n"
testStr = "junk\n"*"junk\n"*"\n"*testStr*"\n"*testStr
path_run = pwd()
fname = joinpath(path_run, "test.txt")
write(fname, testStr)
What I would like to do is get the first column as a string and a subset of the other columns as integers into a DataFrame (unless a DataFrame is going to be far slower than some other multi-type structure), and finally store the two DataFrames in a vector (or similar):
using DataFrames, BenchmarkTools, Profile
# Constant
keepcols = [1; 3:4]
colnames = ["Name", "V2", "V3"]
nr = 2
nchar = [10; repeat([4], 3)]
col_end = cumsum(nchar)
col_start = col_end - nchar .+ 1
function read_txt(fname, keepcols, colnames, nr, col_end, col_start)
    # Variable
    input_str = readlines(fname)
    len = length(input_str)
    # Data rows of the two tables, counted back from the end of the file
    idx_start = [len - 2 - 2*nr - 3, len - 1 - nr]
    idx_end = [len - 2 - nr - 4, len - 2]
    res = Vector{DataFrame}(undef, 2)
    @views for i in 1:2
        strvec = input_str[idx_start[i]:idx_end[i]]
        # Slice the fixed-width fields to keep as SubStrings (no copying)
        mat = Matrix{SubString{String}}(undef, length(strvec), length(keepcols))
        for j in eachindex(strvec)
            str = strvec[j]
            cnt = 1
            for k in keepcols
                mat[j, cnt] = str[col_start[k]:col_end[k]]
                cnt += 1
            end
        end
        # Parse the numeric columns, then prepend the string "Name" column
        df = DataFrame(parse.(Int16, mat[:, 2:end]), colnames[2:end])
        insertcols!(df, 1, "Name" => mat[:, 1])
        res[i] = df
    end
    return res
end
path_run = pwd()
fname = joinpath(path_run, "test.txt")
res = read_txt(fname, keepcols, colnames, nr, col_end, col_start)
@benchmark res = read_txt($fname, $keepcols, $colnames, $nr, $col_end, $col_start)
Desired Output:
2-element Vector{DataFrame}:
 2×3 DataFrame
 Row │ Name        V2     V3
     │ SubStrin…   Int16  Int16
─────┼──────────────────────────
   1 │ A               1     22
   2 │ B               3    444
 2×3 DataFrame
 Row │ Name        V2     V3
     │ SubStrin…   Int16  Int16
─────┼──────────────────────────
   1 │ A               1     22
   2 │ B               3    444
Benchmark:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 14.991 μs … 542.371 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 16.905 μs ┊ GC (median): 0.00%
Time (mean ± σ): 19.038 μs ± 9.227 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▅██▇▆▅▄▃▃▂▁▁ ▂▁▁▅▄▃▃▂▁ ▂
▅▇██████████████▇▆▆▆▅▆▅▄▅▇▅▇▇███████████▇▆▇▇█▇█▇▇▆▆▆▆▅▆▅▅▅▅▆ █
15 μs Histogram: log(frequency) by time 33.5 μs <
Memory estimate: 5.82 KiB, allocs estimate: 85.
Profiling shows the bottlenecks are readlines(), parse(), and DataFrame().
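For context, this kind of profile can be collected roughly as follows (a minimal sketch; the repetition count of 10_000 is arbitrary):
using Profile
Profile.clear()
@profile for _ in 1:10_000
    read_txt(fname, keepcols, colnames, nr, col_end, col_start)
end
Profile.print(mincount = 50)   # only show frames hit at least 50 times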
- Is there a way to read the file from the end, i.e., skip the rest and check only the last n lines of the file? Memory is not an issue here, only speed. Or is there a way to skip some of the overhead of readlines()? I did try
  split(read(fname, String), "\n")[1:end-1]
  and saw fewer, but larger, allocations, which ended up being slower. (A rough sketch of the kind of tail-only read I have in mind is below, after this list.)
- A DataFrame is the easiest for me to work with and debug, but if another solution is far (e.g., 2x) faster, it may be worth switching over to that. I'd really prefer keeping the string "Name" column and the integer columns in the same rows of one table.
- How about converting from SubString to Int? Is there any way to improve on that? (A rough idea of an alternative is sketched at the very end.)
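For the first question, is something like the following viable? A rough, untested sketch of reading only the tail of the file; tail_bytes is a made-up over-estimate of how many bytes the two tables plus trailing junk could occupy:
# Read only the last `tail_bytes` bytes of the file and split them into lines.
# If the seek lands mid-line, the first element will be a partial line.
function read_tail_lines(fname, tail_bytes)
    open(fname, "r") do io
        fsize = filesize(io)
        seek(io, max(fsize - tail_bytes, 0))   # jump close to the end of the file
        chunk = read(io, String)               # read only the tail into memory
        return split(chunk, '\n')
    end
end
tail_lines = read_tail_lines(fname, 512)   # 512 bytes is plenty for this toy file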
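And for the last question, would a hand-rolled digit loop be worth trying instead of parse? Something like this untested sketch, which assumes a field holds only spaces and ASCII digits (no sign, no overflow checks):
# Convert a fixed-width field such as "  22" to an Int16 by accumulating digits.
function parse_fixed_int16(s::AbstractString)
    v = 0
    for c in s
        c == ' ' && continue      # skip the padding
        v = 10v + (c - '0')       # Char arithmetic yields the digit value
    end
    return v % Int16
end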