I often have to import data from various measurement equipment or software. These come in plethora of formats and conventions. Some come with metadata. Some need to be appended with metadata (e.g. external parameters not captured by the instrument).
Right now I just write a custom script and will just scrape specific lines to extract relevant metadata. This is not sustainable. Can you suggest me a good workflow (cleanup → import → annotate) that will be easy to reuse/share? Any packages that can help with this nightmare?
You can construct pipelines once and reuse them every time you get new data. It works with any kind of table and database implementing the Tables.jl interface. We use it a lot in our projects, and it saves us a lot of headache.
Bonus: if you need to handle geospatial tables, such as images, it works out of the box with the inclusion of GeoStatsBase.jl in the same session, which specializes the pipelines with additional methods.
*Most often some metadata is there in the file or is encoded in the file name. Sometimes that metadata needs to be added manually (e.g. column names).
I will share some files with some of their quirks:
Source-measure unit
ADC sometimes saturates and for a few seconds measurements are wildly off. Would be nice to detect, produce a warning and clean up the offending points.
Data is hierarchical and hierarchy is in a offset tree. Tabular data is delimetered with a pair od [ ... ] but the opening brackets are always the first character of a line (ignores the hierarchy depth). Some files have one tabular set of data, some have multiple sets with different parameters. If temperature controller is not connected a NaN °C is recorded. I’d prefer a missing.
Header
Format Revsion 1.0
Type IV Data
Contents
Format Revsion 2.0
Type IV
Attribute
Device ID Device-1
DateTime 6/21/2023 2:32:08 PM
InvariantDateTime 06/21/2023 14:32:08
Thermometer Channel-1 NaN °C
Thermometer Channel-2 NaN °C
ThermoStream Air NaN °C
ThermoStream DUT NaN °C
ThermalPlate NaN °C
Comment
Category
Device MOSFET
MOSFET ID_VDS
Characterization VDS ID
Data
VGS = 18 V
[
VDS ID VGS IG VDS(Status) ID(Status) VGS(Status) IG(Status) RDS RGS gfs
-0.8146 -9.737 17.9992 8.6E-05 0 0 0 0 0.0836602649686762 209293.023255814 -113220.930232558 4466.66666683578
-0.8103 -9.603 17.99923 8.5E-05 0 0 0 0 0.0843798812870978 211755.647058824 -112976.470588235 -5389.99999982109
-0.8068 -9.4675 17.99915 8.5E-05 0 0 0 0 0.0852178505413256 211754.705882353 -111382.352941176 -1529.41176469986
0.9307 8.548 17.99911 8.1E-05 0 0 0 0 0.108879270004679 222211.234567901 105530.864197531 -3419.99999988649
0.941 8.637 17.99916 8.6E-05 0 0 0 0 0.108949866851916 209292.558139535 100430.23255814 1595.45454541188
0.9497 8.7235 17.99922 8.4E-05 0 0 0 0 0.108866853900384 214276.428571429 103851.19047619 1441.66666663587
]
VGS = 18.4 V
[
VDS ID VGS IG VDS(Status) ID(Status) VGS(Status) IG(Status) RDS RGS gfs
-0.8138 -9.7005 18.39916 8.4E-05 0 0 0 0 0.0838925828565538 219037.619047619 -115482.142857143 -11350.0000004297
-0.81 -9.587 18.39915 8.3E-05 0 0 0 0 0.0844894127464275 221676.506024096 -115506.024096386 24400.0000009237
-0.8063 -9.4565 18.39917 8.1E-05 0 0 0 0 0.0852641040554116 227150.24691358 -116746.913580247 3671.42857138122
]
LabView generated files
Uses commas as a delimiter and individual channels are recorded sequentially because of wrong VI config. Data needs to be reshaped and column names need to be manually added.
LabVIEW Measurement
Writer_Version 2
Reader_Version 2
Separator Tab
Decimal_Separator ,
Multi_Headings No
X_Columns One
Time_Pref Relative
Operator Lab PC
Date 2023/06/22
Time 10:32:30,3262085914611816406
***End_of_Header***
Channels 1
Samples 5
Date 2023/06/22
Time 10:32:30,3262085914611816406
X_Dimension Time
X0 0,0000000000000000E+0
Delta_X 1,000000
***End_of_Header***
X_Value Untitled Comment
0,000000 0,566565
1,000000 0,001000
2,000000 2,000000
3,000000 5,722850
4,000000 5,617282
0,000000 0,576423
1,000000 0,001500
2,000000 2,002500
3,000000 5,721466
4,000000 5,616231
0,000000 0,583428
1,000000 0,002000
2,000000 2,005000
3,000000 5,721620
4,000000 5,618589
LTSpice
File is sometimes encoded in UTF-16 depending on the platform.
I think the “right” way would be to write a small package for each file format, to handle reading and offer suitable options. Of course if some formats are related (e.g. different LabView output formats) they could be handled by the same package.
To go one step further you could then register the packages with FileIO.
Specifically for LabView: Is the format that way because you cannot control the software?
Otherwise, for CSV etc. there is an option to control decimal comma vs. dot. Furthermore, NI TDMS format is better structured, esp. in respect to metadata, as compared to CSV. Unfortunately there are no Julia packages to read TDMS, but there exist a Python package, which could be used to convert TDMS to HDF5 or probably be called from Julia. I must confess I just read TDMS into Excel XLSX and then XLSX to Julia.
Another option: Labview can flatten data to JSON, which can be read by Julia.
This seems like a reasonable solution. In case of formats that are useful for other people it is a good idea to share. Custom stuff, e.g. if someone made a custom one-off logger could go in our private repo.
Sometimes it is that. Sometimes it is because there is already a plethora of data generated so it is easier to just handle the output than change the writer and redo the experiments. I know LabView can be massaged to output almost anything.