Loading a csv-like file with a multi-line header

Ju_ska · May 3, 2024, 5:09pm

Hi,

I would like to load a IAGA-2002 file like http://www.bcmt.fr/DATABANK/VARIATION/ppt/sec/2024/03/ppt20240301vsec.sec

 Format                 IAGA-2002                                    |
 Source of Data         IPGP (France)                                |
 Station Name           Pamatai                                      |
 IAGA Code              PPT                                          |
 Geodetic Latitude      -17.567                                      |
 Geodetic Longitude     210.426                                      |
 Elevation              357.0                                        |
 Reported               XYZF                                         |
 Sensor Orientation     HDZF                                         |
 Digital Sampling       0.2 second                                   |
 Data Interval Type     Filtered 1-second                            |
 Data Type              variation                                    |
 #                                                                   |
 # File added to the data base on the 2024-03-02                     |
 # Conditions of use:                                                |
 # License: CC BY-NC 4.0                                             |
 #           http://creativecommons.org/licenses/by-nc/4.0/          |
 # DOI citation:  http://dx.doi.org/10.18715/BCMT.MAG.VAR            |
 #                                                                   |
 # For any enquiry, please contact: bcmt@ipgp.fr                     |
 #                                                                   |
 # Formulas for computing non-reported elements:                     |
 # H=SquareRoot(X*X+Y*Y), tan(D)=Y/X, tan(I)=Z/SquareRoot(X*X+Y*Y)   |
 # 1-second values are computed using a 24 bits digitizer and a      |
 # Gaussian filter of 15 coefficients.                               |
DATE       TIME         DOY     PPTX      PPTY      PPTZ      PPTF   |
2024-03-01 00:00:00.000 061     29138.14   5945.07 -18657.18  35103.95
2024-03-01 00:00:01.000 061     29138.14   5945.09 -18657.15  88888.00
2024-03-01 00:00:02.000 061     29138.12   5945.11 -18657.10  88888.00
2024-03-01 00:00:03.000 061     29138.11   5945.13 -18657.09  88888.00
2024-03-01 00:00:04.000 061     29138.13   5945.12 -18657.11  88888.00
2024-03-01 00:00:05.000 061     29138.14   5945.17 -18657.11  35103.98
2024-03-01 00:00:06.000 061     29138.14   5945.21 -18657.08  88888.00
2024-03-01 00:00:07.000 061     29138.12   5945.22 -18657.04  88888.00
etc. 86400 lines of data

I don’t need the lines ending with ‘|’, so I did:

file=readlines("ppt20240301vsec.sec")
filter!(x -> !contains(x, '|'), file)

for stringline in file
    splitted=split(stringline)
    do_something_else
end

I’m stuck there, I would like first to end up with a 86400 lines x 7 columns matrix or array.
Then better have as columns types: Date Date Int Float Float Float Float, or maybe regroup the two Date columns.

I tried stuff, but too complicated for something i feel Julia could do simply.

Any help is welcome.

tbeason · May 3, 2024, 5:37pm

Have you tried using CSV.jl? It can handle all sorts of delimited files (not just commas) and can skip lines and infer column types.

Ju_ska · May 3, 2024, 6:18pm

I tried it, but the problem is that the header can have a variable number of lines.

ericphanson · May 3, 2024, 6:28pm

You could try to detect the size of the header yourself, and then pass that to CSV:

function count_header_lines(file)
    idx = 1
    for line in eachline(file)
        endswith(line, "|") || return idx
        idx += 1
     end
     error("Found no header lines ending with `|`")
end

CSV.File(file; skipto = count_header_lines(file))

rocco_sprmnt21 · May 3, 2024, 7:55pm

try this.
It is based on a specific characteristic of the file you addressed.
If the others are built “similarly” it should work

rows=CSV.File("testmh.txt"; header=false,comment=" ")

names=tab[1][1:end-1]

select(DataFrame(Tables.rows(rows[2:end])),1=>ByRow(split)=>names)

The title should also be changed: it is misleading. We are not talking about multiline headers in this case (which would still be handled by CSV).

joa-quim · May 6, 2024, 1:17am

Hmmm, geomagnetic data files. You got me nostalgic so I made some changes in gmtread (GMT.jl) and now with master version we can do:

using GMT

D = gmtread("ppt20240301vsec.sec", table=true, stride=5)
Comment:        ["", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "DATE       TIME         DOY     PPTX      PPTY      PPTZ      PPTF   |"]
Attribute table
┌─────────┐
│ Timecol │
├─────────┤
│       1 │
└─────────┘
BoundingBox: [1.7092512e9, 1.7092512e9, 0.0, 23.99861111111111, 61.0, 61.0, 29056.38, 29138.14, 5852.03, 5946.78, -18715.51, -18656.79, 35046.35, 35104.03]

17280×7 GMTdataset{Float64, 2}
 Row │                Time  TIME   DOY     PPTX     PPTY      PPTZ     PPTF
─────┼──────────────────────────────────────────────────────────────────────
   1 │ 2024-03-01T00:00:00   0.0  61.0  29138.1  5945.07  -18657.2  35103.9
   2 │ 2024-03-01T00:00:05   0.0  61.0  29138.1  5945.17  -18657.1  35104.0
   3 │ 2024-03-01T00:00:10   0.0  61.0  29138.1  5945.26  -18657.1  35104.0
...

But note that in order to get those column names I had to add a # as first char in line 26 to follow the standard that comment lines should start with a hash (people at Chambon should take more care on how they write the file headers).

I also read only one every other 5th row because the others do not have the total field and X,Y,Z components are just repeated. To see it:

plot(D, xvar=:Time, yvar=:PPTF, show=true)

Ju_ska · May 6, 2024, 9:39am

Thanks everyone ! I’ll dig into that.

About the line 26:

Check “The mandatory data header record”.
And

Page 123, labels don’t need a #.

joa-quim · May 6, 2024, 10:51am

They may not need a # when writing a reader for a particular format but they do need it when writing generic code that aims to read many different files with variable number of headers. Go to Matlab and try to read text files where comments do not start with a % and see what happens. Anyway, you may not add that # but then the column names is not found. And, BTW, having two columns for time (one for date and another for time) is not the best of the ideas too.

Ju_ska · May 6, 2024, 11:35am

Well it’s not me or Chambon who decide the International Association of Geomagnetism and Aeronomy standards

joa-quim · May 6, 2024, 4:28pm

Sure. Just a small complain from one of those who was bitten by those inconsistencies.

Ju_ska · May 19, 2024, 11:49pm

By the way, the second column in GMTdataset should be part of the time, as it’s the milliseconds of the ISO 8601 hh:mm:ss.sss standard.

joa-quim · May 20, 2024, 12:02am

The second column (the time in the original file) was added to the date (first column) as can be seen in the short print. I decided to leave it with zeros to avoid copying the entire table when removing an inner column.

Topic		Replies	Views
Reading headers of delimited files General Usage csv	1	585	January 3, 2023
How to read a large TSV file using Julia? General Usage question , csv , io	9	2513	October 25, 2021
CSV, DataFrame read data file with string and Float64 columns New to Julia dataframes	3	78	September 3, 2024
CSVFiles line by line New to Julia	5	2702	December 29, 2018
Reading a non-uniform CSV file Data csv	1	658	November 10, 2020

Loading a csv-like file with a multi-line header

Related topics