In the following example, the input text file contains numeric data spread over several rows.
Assuming that the number of columns (=7) is known and that all input data can be read as Float64, is there a better way to read such data into a matrix?
# 1 - WRAPPED DATA INPUT
const input = """
0.000E+00
00000000 4.999E-03 8.771E-03 00001001
5.000E-04 5.087E-01
1.500E-02
00000001 2.112E-03 3.462E-03 00001002
7.000E-04 2.186E-01
3.000E-02
00000002 3.020E-03 2.212E-03 00001003
9.000E-04 3.383E-01
"""
file = "data_wrapped.txt"
write(file, input)
# 2 - READ FILE WITH WRAPPED DATA
using Scanf
function read_n_wrapped_floats(file, n)
    str = repeat("%f ", n) * '\n'
    fmt = Scanf.Format(str)
    m = NTuple{n + 1, Float64}[]       # scanf returns (match_count, v1, ..., vn)
    types = Vector{Float64}(undef, n)  # default values for the n floats
    open(file, "r") do io
        while !eof(io)
            push!(m, scanf(io, fmt, types...))
        end
    end
    return hcat(collect.(m)...)'       # note: first column holds the match count
end
n = 7
read_n_wrapped_floats(file, n)
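Not part of the original post, but for comparison: since the column count `n` is known and every token parses as `Float64`, a compact sketch can split the whole file into tokens and reshape, ignoring where the line breaks fall (file name and sample data here are illustrative).

```julia
# Hypothetical sketch: read all tokens at once, then reshape to n columns.
input = """
0.000E+00
00000000 4.999E-03 8.771E-03 00001001
5.000E-04 5.087E-01
"""
file = "data_wrapped_demo.txt"                       # demo file, one 7-sample batch
write(file, input)
n = 7
tokens = parse.(Float64, split(read(file, String)))  # split on any whitespace
M = permutedims(reshape(tokens, n, :))               # one batch of n samples per row
```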
Thanks.
Hmm, your example is confusing my simple algo to detect time columns
julia> D = gmtread("data_wrapped.txt")
Attributes: Dict("Timecol" => "3,4,6,7")
BoundingBox: [0.0, 0.03, 0.0, 2.0, 0.002112, 0.004999, 0.002212, 0.008771, 1001.0, 1003.0, 0.0005, 0.0009, 0.2186, 0.5087] 3×7 GMTdataset{Float64, 2}
 Row │   col.1    col.2                     Time                    Time2    col.5    Time3    Time4
     │ Float64  Float64                  Float64                  Float64  Float64  Float64  Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────
   1 │   0.0       0.0   1970-01-01T00:00:00.005  1970-01-01T00:00:00.009   1001.0   0.0005   0.5087
   2 │   0.015     1.0   1970-01-01T00:00:00.002  1970-01-01T00:00:00.003   1002.0   0.0007   0.2186
   3 │   0.03      2.0   1970-01-01T00:00:00.003  1970-01-01T00:00:00.002   1003.0   0.0009   0.3383
but it's not fatal
julia> D.data
3×7 Matrix{Float64}:
0.0 0.0 0.004999 0.008771 1001.0 0.0005 0.5087
0.015 1.0 0.002112 0.003462 1002.0 0.0007 0.2186
0.03 2.0 0.00302 0.002212 1003.0 0.0009 0.3383
Good enough?
Thanks Joaquim.
In Julia 1.8.0, using GMT v0.43.1, it throws an error:
gmtread [WARNING]: Mismatch between actual (3) and expected (4) fields near line 2 in file
gmtread [ERROR]: Mismatch between actual (3) and expected (4) fields near line 3 in file data_wrapped.txt
ERROR: Failed to read file "data_wrapped.txt"
Stacktrace:
[1] error(s::String)
@ Base .\error.jl:35
[2] gmtread(fname::String; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ GMT C:\Users\..\.julia\packages\GMT\mzT4h\src\gmtreadwrite.jl:149
[3] gmtread(fname::String)
@ GMT C:\Users\..\.julia\packages\GMT\mzT4h\src\gmtreadwrite.jl:68
[4] top-level scope
@ REPL[19]:1
Ultimately, what Julia functions is GMT calling to read such wrapped files?
This new function performs better than the previous one (which used Scanf):
function read_n_wrapped_floats2(file, n)
    m = Vector{Float64}[]
    open(file, "r") do io
        while !eof(io)
            k = 0
            m1 = Float64[]
            while k < n
                x = parse.(Float64, split(readline(io)))
                push!(m1, x...)
                k += length(x)
            end
            push!(m, m1)
        end
    end
    return reduce(hcat, m)'
end
That error means the file's first row has 3 columns and the second has 4, which is not accepted by gmtread, where all rows must have the same number of columns. gmtread reads the file on the C side and doesn't use a Julia function for that.
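Not from the thread, but one workaround sketch under that constraint: rewrite the wrapped file as a rectangular n-column file first, then point gmtread at the rewritten file. The function and file names here are made up for illustration.

```julia
using DelimitedFiles

# Hypothetical helper: flatten a wrapped file into a rectangular n-column
# file that gmtread (or readdlm) can parse; one record per output line.
function unwrap_to_rectangular(src, dst, n)
    tokens = parse.(Float64, split(read(src, String)))
    length(tokens) % n == 0 || error("token count not divisible by ", n)
    writedlm(dst, permutedims(reshape(tokens, n, :)))
    return dst
end

# unwrap_to_rectangular("data_wrapped.txt", "data_rect.txt", 7)
# D = gmtread("data_rect.txt")   # should now see 3 rows of 7 columns
```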
Okay, thank you. So that's not what's asked. In fact, there are 7 data points from 7 different curves spread over several lines (with carriage returns), and the pattern repeats for the next 7 samples from each of the 7 curves, etc.
NB: I have edited the original post so that each batch of 7 samples spreads over 3 rows in the input text file, but it could be another number.
function read_n_per_row(file, n)
    dat = Float64[]
    for line in eachline(file), word in split(line)
        push!(dat, parse(Float64, word))
    end
    l = length(dat)
    iszero(l % n) || error("Number of file entries ", l, " not divisible by ", n)
    m = l ÷ n
    return transpose(reshape(dat, (n, m)))
end
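To see why the final transpose is needed: Julia's `reshape` fills column-major, so `reshape(dat, (n, m))` places each consecutive batch of n values in a column, and the transpose turns those batches into rows. A tiny check (my own example, not from the thread):

```julia
v = collect(1.0:6.0)    # two consecutive batches of n = 3 values
A = reshape(v, (3, 2))  # column-major fill: each column is one batch
B = transpose(A)        # now each row is one batch, as in the file
# B == [1.0 2.0 3.0; 4.0 5.0 6.0]
```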
using BenchmarkTools
@btime read_n_wrapped_floats2($file, $n); # 8.074 μs (87 allocations: 5.48 KiB)
@btime read_n_per_row($file, $n) # 6.747 μs (43 allocations: 4.07 KiB)
read_n_wrapped_floats2(file, n) == read_n_per_row(file, n) # true
Thank you Peter for such a nice, clear and efficient solution. A pleasure to read.