Read data file and create a dictionary from the metadata

I have a spectral data file in which the contents are stored like this:

#FORMAT      : EMSA
#VERSION     : 1.0
#TITLE       : 2023-Apr-06
#NPOINTS     : 4096
#NCOLUMNS    : 1
#SPECTRUM    :
0.00,        0.0
10.00,        4.0
20.00,        2.0
30.00,        7.0
40.00,        15.0
50.00,        21.0
.
.
.

As you can see, the metadata are stored at the top of the file, and I want to create a dictionary of this metadata without having to manually generate it (like below):

# Manually-generated dictionary of metadata
meta_dict = Dict("#FORMAT" => "EMSA", "#VERSION" => 1.0, "#TITLE" => "2023-Apr-06", "#NPOINTS" => 4096, "#NCOLUMNS" => 1)

In the actual data file there are many more than 5 lines of metadata, so I am looking for a way to avoid building the dictionary by hand.

Some ideas (edited):

str = """
#FORMAT      : EMSA
#VERSION     : 1.0
#TITLE       : 2023-Apr-06
#NPOINTS     : 4096
#NCOLUMNS    : 1
#SPECTRUM    :
0.00,        0.0
10.00,        4.0
20.00,        2.0
30.00,        7.0
40.00,        15.0
50.00,        21.0
"""

io = IOBuffer(str)
lines = readlines(io)
ix = findfirst(x->first(x)!='#', lines) - 1
meta = split.(chop.(lines[1:ix], head=1, tail=0),":")
dic = Dict(strip.(first.(meta)) .=> strip.(last.(meta)))
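The same steps can be wrapped in a function that takes a file path. This is only a minimal sketch, not a tested EMSA parser: the name `read_meta` is made up for illustration, and it assumes the header is a leading run of lines of the form `#KEY : value`. Passing `limit=2` to `split` keeps a value that itself contains a colon (a time stamp, for instance) in one piece, and reading with `eachline` stops at the first data row instead of loading the whole spectrum.

```julia
# Hypothetical helper: collect the leading "#KEY : value" lines of a file into a Dict.
function read_meta(path::AbstractString)
    meta = Dict{String,String}()
    for line in eachline(path)
        startswith(line, "#") || break       # first non-metadata line ends the header
        parts = split(chop(line, head=1, tail=0), ':'; limit=2)
        length(parts) == 2 || continue       # skip header lines without a colon
        meta[strip(parts[1])] = strip(parts[2])
    end
    return meta
end
```

Usage would then be something like `meta = read_meta("datafile.msa")`. As in the code above, the leading `#` is dropped from the keys.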

Your solution works pretty well, thank you. The only issue is that in this case the keys and values in the dictionary maintain the white space before and after the colon. For instance, instead of getting

"#FORMAT" => "EMSA"

the output is

"#FORMAT " => " EMSA"

But I have worked out a way to fix this using a for-loop!

new_dic = Dict()
for (key, val) in dic
    new_key = strip(key)
    new_value = strip(val)
    new_dic[new_key] = new_value
end

Now the code is working how I want it to. I appreciate the help.
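For what it's worth, the same clean-up can also be written as a one-line generator, which gives the new dictionary concrete key and value types instead of `Dict{Any, Any}`. The small `dic` below is just a stand-in for the one with padded keys and values.

```julia
# Stand-in for the dictionary whose keys and values still carry whitespace.
dic = Dict("#FORMAT " => " EMSA", "#NPOINTS " => " 4096")

# Strip whitespace from every key and value in one pass.
new_dic = Dict(strip(key) => strip(val) for (key, val) in dic)
```

Lookups with ordinary strings, such as `new_dic["#FORMAT"]`, still work even though `strip` returns `SubString`s.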

Austin, You might find the package NeXLSpectrum (GitHub - usnistgov/NeXLSpectrum.jl: EDS spectrum analysis tools within the NeXL toolset) interesting. For one, it reads EMSA spectrum files.


I think we just need to strip it here:

dic = Dict(strip.(first.(meta)) .=> strip.(last.(meta)))

I have edited the code above.

data="""#FORMAT      : EMSA
#VERSION     : 1.0
#TITLE       : 2023-Apr-06
#NPOINTS     : 4096
#NCOLUMNS    : 1
#SPECTRUM    :
0.00,        0.0
10.00,        4.0
20.00,        2.0
30.00,        7.0
40.00,        15.0
50.00,        21.0
"""

io=IOBuffer(data)
el=eachline(io)
md=startswith("#")
mdd=Dict{String, Any}()

itr,_=iterate(el)
while md(itr)
    k=findfirst(' ',itr)-1
    v=findfirst(':',itr)+2
    mdd[itr[begin:k]]=itr[v:end]
    itr,_=iterate(el)
end

mdd

Using Dictionaries.jl preserves the order of the metadata:

using Dictionaries, BenchmarkTools  # BenchmarkTools provides @btime

julia> @btime begin
       io=IOBuffer(data)
       el=eachline(io)
       md=startswith("#")
       mdd=Dictionary{String, Any}()

       itr,_=iterate(el)
       while md(itr)
           k=findfirst(' ',itr)-1
           v=findfirst(':',itr)+2
           insert!(mdd,itr[begin:k],itr[v:end])
           itr,_=iterate(el)
       end
       mdd
       end
  1.190 μs (43 allocations: 2.23 KiB)
6-element Dictionary{String, Any}
   "#FORMAT" │ "EMSA"
  "#VERSION" │ "1.0"
    "#TITLE" │ "2023-Apr-06"
  "#NPOINTS" │ "4096"
 "#NCOLUMNS" │ "1"
 "#SPECTRUM" │ ""

I don’t know if it’s already available, but it would be nice to have a multi-tryparse function that tries a list of dynamically supplied types in turn.
Just to give an idea, something like the following sketch:

using Dates  # for Date and DateFormat

# Try each type in TS in turn; fall back to a "y-u-d" date, then to the raw string.
function mtryparse(str,TS...)
    str==""&&return str
    i=1
    dfrm=DateFormat("y-u-d")
    v=tryparse(TS[i],str)
    while isnothing(v)&& (i<length(TS))
        i+=1
        v=tryparse(TS[i],str)
    end
    !isnothing(v) ? v : (try; Date(str,dfrm); catch; str; end)
end


julia> begin
           io=IOBuffer(data)
           el=eachline(io)
           md=startswith("#")
           mdd=Dictionary{String, Any}()

           itr,_=iterate(el)
           while md(itr)
               k=findfirst(' ',itr)-1
               v=findfirst(':',itr)+2
               pv=mtryparse(itr[v:end],Int,Float64,Date)
               insert!(mdd,itr[begin:k],pv)
               itr,_=iterate(el)
           end
           mdd
       end
6-element Dictionary{String, Any}
   "#FORMAT" │ "EMSA"
  "#VERSION" │ 1.0
    "#TITLE" │ Date("2023-04-06")
  "#NPOINTS" │ 4096
 "#NCOLUMNS" │ 1
 "#SPECTRUM" │ ""
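A plain loop over the candidate types gives the same fall-through behaviour with a little less bookkeeping. This is only a sketch: `parse_first` is a made-up name, not an existing Base function, and the date handling from `mtryparse` is left out to keep it short.

```julia
# Hypothetical helper: return the first successful tryparse, else the string itself.
function parse_first(str::AbstractString, types...)
    for T in types
        v = tryparse(T, str)
        v === nothing || return v
    end
    return str
end

parse_first("4096", Int, Float64)   # 4096 (an Int)
parse_first("1.0", Int, Float64)    # 1.0 (a Float64)
parse_first("EMSA", Int, Float64)   # "EMSA", returned unchanged
```

The order of the types matters: putting `Float64` first would turn `"4096"` into `4096.0`.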

Hi Nicholas, it’s funny – I’ve actually watched your YouTube videos on using DTSA-II, so it’s interesting that you would come across my question. I wasn’t aware that you had made a Julia package for working with .msa files, so thank you very much for sharing.

I only started using Julia about a month ago, so I’m still getting used to the syntax and understanding the documentation. Have you made any videos showing how to use the NeXLSpectrum package? I’m mostly interested in making customizable plots, but I’d also like to know how to perform P/B-ZAF corrections and to quantify peak intensity ratios.

I appreciate the help!

Austin,
There is documentation here: Home · NeXLSpectrum.jl
Specifically, fitting and quantification are documented here: Fitting K412 (simple API) · NeXLSpectrum.jl
I haven’t implemented peak-to-background corrections (only φ(ρz)) but, if you’d like to, …
You might find these pages helpful too: Core - Part of the NeXL X-ray Microanalysis Library · NeXLCore and MatrixCorrection - Part of the NeXL X-ray Microanalysis Library · NeXLMatrixCorrection.jl


Perhaps we should also read the spectrum matrix data into the Dict object?

Example using DelimitedFiles:
str = """
#FORMAT      : EMSA
#VERSION     : 1.0
#TITLE       : 2023-Apr-06
#NPOINTS     : 4096
#NCOLUMNS    : 1
#SPECTRUM    :
0.00,        0.0
10.00,        4.0
20.00,        2.0
30.00,        7.0
40.00,        15.0
50.00,        21.0
"""

io = IOBuffer(str)
lines = readlines(io)
ix = findfirst(x->first(x)!='#', lines) - 1
meta = split.(chop.(lines[1:ix], head=1, tail=0),":")
dic = Dict{AbstractString, Any}(strip.(first.(meta)) .=> strip.(last.(meta)))

using DelimitedFiles
# lines[1:ix] are the metadata, so skip exactly ix lines to keep the first data row
dic["SPECTRUM"] = readdlm(IOBuffer(str), ',', skipstart=ix)

Certainly, although I have just been reading the numerical data directly into a data frame like so:

using CSV, DataFrames

# The dictionary from earlier in the conversation
dic = Dict(strip.(first.(meta)) .=> strip.(last.(meta)))

# Count metadata lines rather than dictionary keys, so a repeated key cannot shift the start row
skip2 = length(meta) + 1;
data= CSV.read("datafile.msa",
    DataFrame,
    skipto=skip2,
    delim=",",
    header=false,
    ignorerepeated=true,
    footerskip=1);

Thank you. May I ask what is the benefit to your work of having such a numeric matrix with the spectrum in a data frame?

I don’t really know if there is a benefit to it, but I find data frames easy to understand and they are straightforward to work with. I’m very new to Julia, so my opinions on the best way to do things are still malleable.

Right, it’s up to you to find your comfort zone.

In that case, have you considered adding the dic dictionary to your dataframe data as metadata?

Something like:

metadata!(data, "Meta", dic, style=:note)
metadata(data, "Meta")

I’ll add that to my tool belt, thanks!