Loading a csv-like file with a multi-line header

Hi,

I would like to load a IAGA-2002 file like http://www.bcmt.fr/DATABANK/VARIATION/ppt/sec/2024/03/ppt20240301vsec.sec

 Format                 IAGA-2002                                    |
 Source of Data         IPGP (France)                                |
 Station Name           Pamatai                                      |
 IAGA Code              PPT                                          |
 Geodetic Latitude      -17.567                                      |
 Geodetic Longitude     210.426                                      |
 Elevation              357.0                                        |
 Reported               XYZF                                         |
 Sensor Orientation     HDZF                                         |
 Digital Sampling       0.2 second                                   |
 Data Interval Type     Filtered 1-second                            |
 Data Type              variation                                    |
 #                                                                   |
 # File added to the data base on the 2024-03-02                     |
 # Conditions of use:                                                |
 # License: CC BY-NC 4.0                                             |
 #           http://creativecommons.org/licenses/by-nc/4.0/          |
 # DOI citation:  http://dx.doi.org/10.18715/BCMT.MAG.VAR            |
 #                                                                   |
 # For any enquiry, please contact: bcmt@ipgp.fr                     |
 #                                                                   |
 # Formulas for computing non-reported elements:                     |
 # H=SquareRoot(X*X+Y*Y), tan(D)=Y/X, tan(I)=Z/SquareRoot(X*X+Y*Y)   |
 # 1-second values are computed using a 24 bits digitizer and a      |
 # Gaussian filter of 15 coefficients.                               |
DATE       TIME         DOY     PPTX      PPTY      PPTZ      PPTF   |
2024-03-01 00:00:00.000 061     29138.14   5945.07 -18657.18  35103.95
2024-03-01 00:00:01.000 061     29138.14   5945.09 -18657.15  88888.00
2024-03-01 00:00:02.000 061     29138.12   5945.11 -18657.10  88888.00
2024-03-01 00:00:03.000 061     29138.11   5945.13 -18657.09  88888.00
2024-03-01 00:00:04.000 061     29138.13   5945.12 -18657.11  88888.00
2024-03-01 00:00:05.000 061     29138.14   5945.17 -18657.11  35103.98
2024-03-01 00:00:06.000 061     29138.14   5945.21 -18657.08  88888.00
2024-03-01 00:00:07.000 061     29138.12   5945.22 -18657.04  88888.00
etc. 86400 lines of data

I don’t need the lines ending with β€˜|’, so I did:

file=readlines("ppt20240301vsec.sec")
filter!(x -> !contains(x, '|'), file)

for stringline in file
    splitted=split(stringline)
    do_something_else
end

I’m stuck there, I would like first to end up with a 86400 lines x 7 columns matrix or array.
Then better have as columns types: Date Date Int Float Float Float Float, or maybe regroup the two Date columns.

I tried stuff, but too complicated for something i feel Julia could do simply.

Any help is welcome.

Have you tried using CSV.jl? It can handle all sorts of delimited files (not just commas) and can skip lines and infer column types.

1 Like

I tried it, but the problem is that the header can have a variable number of lines.

You could try to detect the size of the header yourself, and then pass that to CSV:

function count_header_lines(file)
    idx = 1
    for line in eachline(file)
        endswith(line, "|") || return idx
        idx += 1
     end
     error("Found no header lines ending with `|`")
end

CSV.File(file; skipto = count_header_lines(file))
3 Likes

try this.
It is based on a specific characteristic of the file you addressed.
If the others are built β€œsimilarly” it should work

rows=CSV.File("testmh.txt"; header=false,comment=" ")

names=tab[1][1:end-1]

select(DataFrame(Tables.rows(rows[2:end])),1=>ByRow(split)=>names)

The title should also be changed: it is misleading. We are not talking about multiline headers in this case (which would still be handled by CSV).

Hmmm, geomagnetic data files. You got me nostalgic so I made some changes in gmtread (GMT.jl) and now with master version we can do:

using GMT

D = gmtread("ppt20240301vsec.sec", table=true, stride=5)
Comment:        ["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "DATE       TIME         DOY     PPTX      PPTY      PPTZ      PPTF   |"]
Attribute table
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Timecol β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚       1 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
BoundingBox: [1.7092512e9, 1.7092512e9, 0.0, 23.99861111111111, 61.0, 61.0, 29056.38, 29138.14, 5852.03, 5946.78, -18715.51, -18656.79, 35046.35, 35104.03]

17280Γ—7 GMTdataset{Float64, 2}
 Row β”‚                Time  TIME   DOY     PPTX     PPTY      PPTZ     PPTF
─────┼──────────────────────────────────────────────────────────────────────
   1 β”‚ 2024-03-01T00:00:00   0.0  61.0  29138.1  5945.07  -18657.2  35103.9
   2 β”‚ 2024-03-01T00:00:05   0.0  61.0  29138.1  5945.17  -18657.1  35104.0
   3 β”‚ 2024-03-01T00:00:10   0.0  61.0  29138.1  5945.26  -18657.1  35104.0
...

But note that in order to get those column names I had to add a # as first char in line 26 to follow the standard that comment lines should start with a hash (people at Chambon should take more care on how they write the file headers).

I also read only one every other 5th row because the others do not have the total field and X,Y,Z components are just repeated. To see it:

plot(D, xvar=:Time, yvar=:PPTF, show=true)

2 Likes

Thanks everyone ! I’ll dig into that.

About the line 26:

Check β€œThe mandatory data header record”.
And

Page 123, labels don’t need a #.

They may not need a # when writing a reader for a particular format but they do need it when writing generic code that aims to read many different files with variable number of headers. Go to Matlab and try to read text files where comments do not start with a % and see what happens. Anyway, you may not add that # but then the column names is not found. And, BTW, having two columns for time (one for date and another for time) is not the best of the ideas too.

Well it’s not me or Chambon who decide the International Association of Geomagnetism and Aeronomy standards :slightly_smiling_face:

Sure. :slight_smile: Just a small complain from one of those who was bitten by those inconsistencies.

By the way, the second column in GMTdataset should be part of the time, as it’s the milliseconds of the ISO 8601 hh:mm:ss.sss standard.

The second column (the time in the original file) was added to the date (first column) as can be seen in the short print. I decided to leave it with zeros to avoid copying the entire table when removing an inner column.