Read data file and create a dictionary from the metadata

Austin_Weber · April 6, 2023, 10:54pm

I have a spectral data file in which the contents are stored like this:

#FORMAT      : EMSA
#VERSION     : 1.0
#TITLE       : 2023-Apr-06
#NPOINTS     : 4096
#NCOLUMNS    : 1
#SPECTRUM    :
0.00,        0.0
10.00,        4.0
20.00,        2.0
30.00,        7.0
40.00,        15.0
50.00,        21.0
.
.
.

As you can see, the metadata are stored at the top of the file, and I want to create a dictionary of this metadata without having to manually generate it (like below):

# Manually-generated dictionary of metadata
meta_dict = Dict("#FORMAT" => "EMSA", "#VERSION" => 1.0, "#TITLE" => "2023-Apr-06", "#NPOINTS" => 4096, "#NCOLUMNS" => 1)

In the actual data file there are many more than 5 lines of metadata, so I am looking for a way to build the dictionary automatically.

rafael.guerra · April 6, 2023, 11:42pm

Some ideas (edited):

str = """
#FORMAT      : EMSA
#VERSION     : 1.0
#TITLE       : 2023-Apr-06
#NPOINTS     : 4096
#NCOLUMNS    : 1
#SPECTRUM    :
0.00,        0.0
10.00,        4.0
20.00,        2.0
30.00,        7.0
40.00,        15.0
50.00,        21.0
"""

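io = IOBuffer(str)
lines = readlines(io)
# Index of the last metadata line (the one before the first line not starting with '#')
ix = findfirst(x->first(x)!='#', lines) - 1
# Drop the leading '#' from each metadata line, then split key from value at ':'
meta = split.(chop.(lines[1:ix], head=1, tail=0),":")
# Strip the padding around keys and values and build the dictionary
dic = Dict(strip.(first.(meta)) .=> strip.(last.(meta)))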
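
A minimal sketch that wraps the steps above in a reusable function, so it can read from a file or any IO. The name `read_metadata` is hypothetical. It assumes every metadata line starts with `#` and that the first `:` on a line separates key from value. Unlike the snippet above, it keeps the leading `#` in the keys, and it splits with `limit=2` so that a value containing a colon (a time stamp, for example) stays intact.

```julia
# Hypothetical helper: collect the leading "#KEY : value" lines into a Dict.
# Assumes the first ':' on each line separates the key from the value.
function read_metadata(io::IO)
    meta = Dict{String,String}()
    for line in eachline(io)
        startswith(line, "#") || break      # first non-metadata line ends the header
        parts = split(line, ':'; limit=2)   # limit=2 keeps colons inside the value
        key = strip(parts[1])
        val = length(parts) == 2 ? strip(parts[2]) : ""
        meta[key] = val
    end
    return meta
end

# Convenience method for a file path
read_metadata(path::AbstractString) = open(read_metadata, path)
```

It can be called as `read_metadata(IOBuffer(str))` or `read_metadata("datafile.msa")`. Because it stops at the first line that does not start with `#`, it also returns cleanly when the input contains only metadata.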

Austin_Weber · April 7, 2023, 12:26am

Your solution works pretty well, thank you. The only issue is that in this case the keys and values in the dictionary maintain the white space before and after the colon. For instance, instead of getting

"#FORMAT" => "EMSA"

the output is

"#FORMAT " => " EMSA"

But I have worked out a way to fix this using a for-loop!

new_dic = Dict()
for (key, val) in dic
    new_key = strip(key)
    new_value = strip(val)
    new_dic[new_key] = new_value
end

Now the code is working how I want it to. I appreciate the help.
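
For reference, the same clean-up can be sketched as a single generator expression instead of a loop. The sample `dic` below is made up for illustration and assumes string keys and values.

```julia
# Sample input with stray whitespace, made up for illustration
dic = Dict("#FORMAT " => " EMSA", "#NPOINTS " => " 4096")

# Build a new Dict with whitespace stripped from every key and value.
# String(...) converts the SubString returned by strip back to a String.
new_dic = Dict(String(strip(k)) => String(strip(v)) for (k, v) in dic)
```

This also gives the result a concrete `Dict{String, String}` type rather than the `Dict{Any, Any}` produced by `Dict()`.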

NicholasWMRitchie · April 7, 2023, 1:21am

Austin, you might find the package NeXLSpectrum (usnistgov/NeXLSpectrum.jl on GitHub: EDS spectrum analysis tools within the NeXL toolset) interesting. For one, it reads EMSA spectrum files.

rafael.guerra · April 7, 2023, 5:33am

I think we just need to strip it here:

dic = Dict(strip.(first.(meta)) .=> strip.(last.(meta)))

I have edited the code above.

rocco_sprmnt21 · April 7, 2023, 7:25am

data="""#FORMAT      : EMSA
#VERSION     : 1.0
#TITLE       : 2023-Apr-06
#NPOINTS     : 4096
#NCOLUMNS    : 1
#SPECTRUM    :
0.00,        0.0
10.00,        4.0
20.00,        2.0
30.00,        7.0
40.00,        15.0
50.00,        21.0
"""

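io=IOBuffer(data)
el=eachline(io)
md=startswith("#")               # predicate: is this a metadata line?
mdd=Dict{String, Any}()

itr,_=iterate(el)                # first line
while md(itr)
    k=findfirst(' ',itr)-1       # key ends just before the first space
    v=findfirst(':',itr)+2       # value starts two characters after the colon
    mdd[itr[begin:k]]=itr[v:end]
    itr,_=iterate(el)            # next line (assumes data lines follow the metadata)
end

mdd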

Using Dictionaries.jl preserves the order of the metadata:

using Dictionaries, BenchmarkTools

julia> @btime begin
       io=IOBuffer(data)
       el=eachline(io)
       md=startswith("#")
       mdd=Dictionary{String, Any}()

       itr,_=iterate(el)
       while md(itr)
           k=findfirst(' ',itr)-1
           v=findfirst(':',itr)+2
           insert!(mdd,itr[begin:k],itr[v:end])
           itr,_=iterate(el)
       end
       mdd
       end
  1.190 μs (43 allocations: 2.23 KiB)
6-element Dictionary{String, Any}
   "#FORMAT" │ "EMSA"
  "#VERSION" │ "1.0"
    "#TITLE" │ "2023-Apr-06"
  "#NPOINTS" │ "4096"
 "#NCOLUMNS" │ "1"
 "#SPECTRUM" │ ""

rocco_sprmnt21 · April 7, 2023, 10:47am

I don’t know if it’s already available, but it would be nice to have a multi-tryparse function that takes a list of dynamically supplied types.
Just to give an idea, something like the following:

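using Dates

# Try each type in TS in order; if none parses, try a "y-u-d" date (e.g. 2023-Apr-06),
# and fall back to the original string.
function mtryparse(str,TS...)
    str==""&&return str
    i=1
    dfrm=DateFormat("y-u-d")
    v=tryparse(TS[i],str)
    while isnothing(v)&& (i<length(TS))
        i+=1
        v=tryparse(TS[i],str)
    end
    !isnothing(v) ? v : (try; Date(str,dfrm); catch; str; end)
end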


julia> begin
           io=IOBuffer(data)
           el=eachline(io)
           md=startswith("#")
           mdd=Dictionary{String, Any}()

           itr,_=iterate(el)
           while md(itr)
               k=findfirst(' ',itr)-1
               v=findfirst(':',itr)+2
               pv=mtryparse(itr[v:end],Int,Float64,Date)
               insert!(mdd,itr[begin:k],pv)
               itr,_=iterate(el)
           end
           mdd
       end
6-element Dictionary{String, Any}
   "#FORMAT" │ "EMSA"
  "#VERSION" │ 1.0
    "#TITLE" │ Date("2023-04-06")
  "#NPOINTS" │ 4096
 "#NCOLUMNS" │ 1
 "#SPECTRUM" │ ""

Austin_Weber · April 7, 2023, 3:33pm

Hi Nicholas, it’s funny – I’ve actually watched your YouTube videos on using DTSA-II, so it’s interesting that you would come across my question. I wasn’t aware that you had made a Julia package for working with .msa files, so thank you very much for sharing.

I only started using Julia about a month ago, so I’m still getting used to the syntax and understanding the documentation. Have you made any videos showing how to use the NeXLSpectrum package? I’m mostly interested in making customizable plots, but I’d also like to know how to perform P/B-ZAF corrections and to quantify peak intensity ratios.

I appreciate the help!

NicholasWMRitchie · April 7, 2023, 6:26pm

Austin,
There is documentation here: Home · NeXLSpectrum.jl
Specifically, fitting and quantification are documented here: Fitting K412 (simple API) · NeXLSpectrum.jl
I haven’t implemented peak-to-background corrections (only φ(ρz)) but, if you’d like to, …
You might find these pages helpful too: Core - Part of the NeXL X-ray Microanalysis Library · NeXLCore and MatrixCorrection - Part of the NeXL X-ray Microanalysis Library · NeXLMatrixCorrection.jl

rafael.guerra · April 7, 2023, 7:06pm

Perhaps we should also read the spectrum matrix data into the Dict object?

Example using DelimitedFiles:

str = """
#FORMAT      : EMSA
#VERSION     : 1.0
#TITLE       : 2023-Apr-06
#NPOINTS     : 4096
#NCOLUMNS    : 1
#SPECTRUM    :
0.00,        0.0
10.00,        4.0
20.00,        2.0
30.00,        7.0
40.00,        15.0
50.00,        21.0
"""

io = IOBuffer(str)
lines = readlines(io)
ix = findfirst(x->first(x)!='#', lines) - 1
meta = split.(chop.(lines[1:ix], head=1, tail=0),":")
dic = Dict{AbstractString, Any}(strip.(first.(meta)) .=> strip.(last.(meta)))

using DelimitedFiles
# Skip the ix metadata lines so that parsing starts at the first data row
dic["SPECTRUM"] = readdlm(IOBuffer(str), ',', skipstart=ix)

Austin_Weber · April 7, 2023, 8:00pm

Certainly, although I have just been reading the numerical data directly into a data frame like so:

using CSV, DataFrames

# The dictionary from earlier in the conversation
dic = Dict(strip.(first.(meta)) .=> strip.(last.(meta)))

skip2 = length(keys(dic)) + 1;
data= CSV.read("datafile.msa",
    DataFrame,
    skipto=skip2,
    delim=",",
    header=false,
    ignorerepeated=true,
    footerskip=1);
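
One caveat with deriving `skipto` from the dictionary size: a `Dict` keeps one entry per key, so a repeated key in the header would make the count too small. A sketch that locates the first data row from the lines themselves. `first_data_row` is a hypothetical helper name, and it assumes all metadata lines come before the data.

```julia
# Hypothetical helper: 1-based index of the first line that is not metadata.
# Returns nothing when every line starts with '#'.
first_data_row(lines) = findfirst(!startswith("#"), lines)

# Made-up header with a repeated key, followed by two data rows
lines = ["#FORMAT : EMSA", "#TITLE : a", "#TITLE : b", "#SPECTRUM :", "0.00, 0.0", "10.00, 4.0"]
row = first_data_row(lines)
```

The result could then be passed as `skipto`, for example `skipto=first_data_row(readlines("datafile.msa"))`.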

rafael.guerra · April 7, 2023, 8:12pm

Thank you. May I ask what is the benefit to your work of having such a numeric matrix with the spectrum in a data frame?

Austin_Weber · April 7, 2023, 8:22pm

I don’t really know if there is a benefit to it, but I find data frames easy to understand and they are straightforward to work with. I’m very new to Julia, so my opinions on the best way to do things are still malleable.

rafael.guerra · April 7, 2023, 8:46pm

Right, it’s up to you to find your comfort zone.

In that case, have you considered adding the dic dictionary to your dataframe data as metadata?

Something like:

metadata!(data, "Meta", dic, style=:note)
metadata(data, "Meta")

Austin_Weber · April 8, 2023, 4:09pm

I’ll add that to my tool belt, thanks!
