Reading text files with wrapped numerical data

In the following example, the input text file contains numeric data spread over several rows.

Assuming that the number of columns (=7) is known and that all input data can be read as Float64, is there a better way to read such data into a matrix?

# 1 - WRAPPED DATA INPUT

const input = """
0.000E+00
00000000     4.999E-03       8.771E-03   00001001
5.000E-04    5.087E-01
1.500E-02
00000001     2.112E-03       3.462E-03   00001002
7.000E-04    2.186E-01
3.000E-02
00000002     3.020E-03       2.212E-03   00001003
9.000E-04    3.383E-01
"""

file = "data_wrapped.txt"
write(file, input)

# 2 - READ FILE WITH WRAPPED DATA

using Scanf

function read_n_wrapped_floats(file, n)
   str = repeat("%f ",n) * '\n'
   fmt = Scanf.Format(str)
   m = NTuple{n+1, Float64}[]
   types = Vector{Float64}(undef, n)
   open(file, "r") do io
      while !eof(io)
         push!(m, scanf(io, fmt, types...))
      end
   end
   return hcat(collect.(m)...)'
end

n = 7
read_n_wrapped_floats(file, n)

Thanks.

Hmm, your example is confusing my simple algo to detect time columns

julia> D = gmtread("data_wrapped.txt")
Attributes:  Dict("Timecol" => "3,4,6,7")
BoundingBox: [0.0, 0.03, 0.0, 2.0, 0.002112, 0.004999, 0.002212, 0.008771, 1001.0, 1003.0, 0.0005, 0.0009, 0.2186, 0.5087]
3×7 GMTdataset{Float64, 2}
 Row │   col.1    col.2                     Time                    Time2    col.5    Time3    Time4
     │ Float64  Float64                  Float64                  Float64  Float64  Float64  Float64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────
   1 │   0.0        0.0  1970-01-01T00:00:00.005  1970-01-01T00:00:00.009   1001.0   0.0005   0.5087
   2 │   0.015      1.0  1970-01-01T00:00:00.002  1970-01-01T00:00:00.003   1002.0   0.0007   0.2186
   3 │   0.03       2.0  1970-01-01T00:00:00.003  1970-01-01T00:00:00.002   1003.0   0.0009   0.3383

but not fatal

julia> D.data
3×7 Matrix{Float64}:
 0.0    0.0  0.004999  0.008771  1001.0  0.0005  0.5087
 0.015  1.0  0.002112  0.003462  1002.0  0.0007  0.2186
 0.03   2.0  0.00302   0.002212  1003.0  0.0009  0.3383

Good enough?

Thanks Joaquim.
In Julia 1.8.0, using GMT v0.43.1, it throws an error:

ERROR
gmtread [WARNING]: Mismatch between actual (3) and expected (4) fields near line 2 in file      
gmtread [ERROR]: Mismatch between actual (3) and expected (4) fields near line 3 in file data_wrapped.txt
ERROR:  Failed to read file "data_wrapped.txt"

Stacktrace:
 [1] error(s::String)
   @ Base .\error.jl:35
 [2] gmtread(fname::String; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ GMT C:\Users\..\.julia\packages\GMT\mzT4h\src\gmtreadwrite.jl:149
 [3] gmtread(fname::String)
   @ GMT C:\Users\..\.julia\packages\GMT\mzT4h\src\gmtreadwrite.jl:68
 [4] top-level scope
   @ REPL[19]:1

Ultimately, which Julia functions does GMT call to read such wrapped files?

This new function performs better than previous one (which was using Scanf):

function read_n_wrapped_floats2(file, n)
    m = Vector{Float64}[]
    open(file, "r") do io
        while !eof(io)
            k = 0
            m1 = Float64[]
            while k < n
                x = parse.(Float64, split(readline(io)))
                push!(m1, x...)
                k += length(x)
            end
            push!(m, m1)
        end
    end
    return reduce(hcat, m)'
end

That error means the first row of the file has 3 columns and the second has 4, which gmtread does not accept: all rows must have the same number of columns. gmtread reads the file on the C side and doesn’t use a Julia function for that.
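To see why a reader that expects rectangular input rejects this file, one can count the whitespace-separated fields on each line. This is only a minimal sketch: `sample`, `path` and `fields_per_line` are names made up for the illustration, and the data is the first two records of the example input.

```julia
# Two records from the example input, written to a temporary file (hypothetical path).
sample = """
0.000E+00
00000000     4.999E-03       8.771E-03   00001001
5.000E-04    5.087E-01
1.500E-02
00000001     2.112E-03       3.462E-03   00001002
7.000E-04    2.186E-01
"""
path = tempname()
write(path, sample)

# Number of whitespace-separated fields on each line of the file.
fields_per_line = [length(split(line)) for line in eachline(path)]

# A rectangular table would have the same count on every line.
is_rectangular = length(unique(fields_per_line)) == 1

println(fields_per_line)   # 1, 4, 2 repeating
println(is_rectangular)    # false
```

Each record of 7 values is wrapped as 1 + 4 + 2 fields, so no two consecutive lines have the same width.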

Okay, thank you. So that’s not what’s asked. In fact, there are 7 data points from 7 different curves spread over several lines (with carriage returns), and the pattern repeats for the next 7 samples from each of the 7 curves, etc.

NB: I have edited the original post so that each batch of 7 samples spreads over 3 rows in the input text file, but it could be any other number of rows.
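Since the line breaks carry no meaning here, one way to picture the layout is as a single stream of tokens cut into groups of 7. The following is a rough sketch of that idea, not a tuned solution: it works on the example text directly, and `tokens`, `records` and `data` are illustrative names. It assumes the total number of values is a multiple of `n`; otherwise the last group is shorter and `hcat` fails with a dimension error.

```julia
# The example input from the first post.
sample = """
0.000E+00
00000000     4.999E-03       8.771E-03   00001001
5.000E-04    5.087E-01
1.500E-02
00000001     2.112E-03       3.462E-03   00001002
7.000E-04    2.186E-01
3.000E-02
00000002     3.020E-03       2.212E-03   00001003
9.000E-04    3.383E-01
"""
n = 7  # values per record (known in advance)

# Splitting on whitespace treats spaces and line breaks alike.
tokens = split(sample)

# Cut the token stream into consecutive groups of n and parse each group.
records = [parse.(Float64, chunk) for chunk in Iterators.partition(tokens, n)]

# One record per row: hcat gives an n×m matrix, permutedims turns it into m×n.
data = permutedims(reduce(hcat, records))
```

For the example input this yields a 3×7 `Matrix{Float64}`, regardless of how many lines each record was wrapped over.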

OK, so perhaps readdlm, now in DelimitedFiles:

https://docs.julialang.org/en/v1/stdlib/DelimitedFiles/

function read_n_per_row(file, n)
    dat = Float64[]
    for line in eachline(file), word in split(line)
        push!(dat, parse(Float64, word))
    end
    l = length(dat)
    iszero(l % n) || error("Number of file entries ", l, " not divisible by ", n)
    m = l ÷ n
    return transpose(reshape(dat, (n, m)))
end

using BenchmarkTools

@btime read_n_wrapped_floats2($file, $n); # 8.074 μs (87 allocations: 5.48 KiB)

@btime read_n_per_row($file, $n) # 6.747 μs (43 allocations: 4.07 KiB)

read_n_wrapped_floats2(file, n) == read_n_per_row(file, n) # true
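
The divisibility check in read_n_per_row can be exercised with a deliberately short file. This is just a sketch: the function body is copied from the post above so that the block runs on its own, while `bad_file` and `threw` are hypothetical names used only for this illustration.

```julia
# Copied from the post above so this block is self-contained.
function read_n_per_row(file, n)
    dat = Float64[]
    for line in eachline(file), word in split(line)
        push!(dat, parse(Float64, word))
    end
    l = length(dat)
    iszero(l % n) || error("Number of file entries ", l, " not divisible by ", n)
    m = l ÷ n
    return transpose(reshape(dat, (n, m)))
end

# Hypothetical malformed input: 5 values, which is not a multiple of 3.
bad_file = tempname()
write(bad_file, "1.0 2.0 3.0\n4.0 5.0\n")

# The call should raise an ErrorException rather than return a truncated matrix.
threw = try
    read_n_per_row(bad_file, 3)
    false
catch err
    err isa ErrorException
end
```

With a complete file the same call returns one record per row; with a truncated one it fails loudly instead of silently dropping values.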

Thank you Peter for such a nice, clear and efficient solution. A pleasure to read.
