XML to DataFrame converter

tp2750 · September 19, 2023, 7:19pm

Many instruments write data in XML format, and I often find that I spend way too much time writing parsers to convert it to DataFrame.

Do we have a package that can take a file like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ExperimentResults>
  <ExperimentInfo Name="Assay001">
    <ExpDescription/>
    <RunID>81EC636D</RunID>
  </ExperimentInfo>
  <AssayParamsInfo/>
  <KineticsData>
    <Step>
      <CommonData>
        <SampleLocation>1</SampleLocation>
        <Temperature>20.6</Temperature>
      </CommonData>
      <CycleTime>0.2</CycleTime>
    </Step>
    <Step>
      <CommonData>
        <SampleLocation>2</SampleLocation>
        <Temperature>20</Temperature>
      </CommonData>
      <CycleTime>0.2</CycleTime>
    </Step>
    <Step>
      <CommonData>
        <SampleLocation>9</SampleLocation>
        <Temperature>20</Temperature>
      </CommonData>
      <CycleTime>0.2</CycleTime>
    </Step>
  </KineticsData>
</ExperimentResults>

and give a result like this:

3×5 DataFrame
 Row │ ExperimentInfo_Name  RunID     SampleLocation  Temperature  CycleTime 
     │ String               String    Int64           Float64      Float64   
─────┼───────────────────────────────────────────────────────────────────────
   1 │ Assay001             81EC636D               1         20.6        0.2
   2 │ Assay001             81EC636D               2         20.0        0.2
   3 │ Assay001             81EC636D               9         20.0        0.2

without having to explicitly point to the individual elements with XPath like this;

using EzXML, DataFrames
doc = EzXML.readxml("ex1.xml")
DataFrame( ExperimentInfo_Name = EzXML.attributes(findfirst("//ExperimentInfo", doc))[1].content,
RunID = findfirst("//RunID", doc).content,
SampleLocation = map(x->x.content, findall("//SampleLocation", doc)),
Temperature = map(x->x.content, findall("//Temperature", doc))
)

It would be great to have a generic function that will just do

XML2DataFrame(“ex1.xml”) and return the result.

bkamins · September 19, 2023, 8:37pm

To solve your issue probably EzXML.jl should expose Tables.jl compatible view of data in it as a table. I am not sure if it does/could do (probably maintainers of this package know best).

tp2750 · September 20, 2023, 4:48pm

That would be very elegant, but also difficult in general.

I have some ideas on how to approach this, but I would hate to solve a problem that is already solved
I remember seeing announcements here, where it turns out that a similar package already exists.

pdeffebach · September 20, 2023, 5:08pm

You could try converting XML to JSON and and using JSONTables.jl.

tbeason · September 20, 2023, 5:28pm

Not sure how you could do this for any general XML file.

Your code seems succinct enough and not too problematic, what is wrong with it?

tp2750 · September 20, 2023, 5:56pm

Thanks for the reply @pdeffebach . I did not think about that. Do you happen to know a good package for this? I’ll probably try XMLDict the to JSON.

tp2750 · September 20, 2023, 6:14pm

Thank you for the comment @tbeason .
It’s not that it is wrong, I just dislike having to analyze the schema of the XML to figure out how to get the data out. I’ll expect there to be software to do that

It might be I’m too optimistic, and looking at tutorials they basically all do something similar to my draft code above: fish out data based on knowledge of tag names in the XML.

My wish is for code to analyze the structure of the XML and recognize parts that are “table like”: repeated sections that have the same tag structure so they can be interpreted as vectors.

The <Step> sections in the example is of this form: 3 repeated sections with the exact same tag structure. I expect that to be reasonably easy to recognize.

It will also be fine to get a vector of 2 DataFrames:

2-element Vector{DataFrame}:
 1×2 DataFrame
 Row │ ExperimentInfo_Name  RunID    
     │ String               String   
─────┼───────────────────────────────
   1 │ Assay001             81EC636D
 3×2 DataFrame
 Row │ SampleLocation  Temperature 
     │ String          String      
─────┼─────────────────────────────
   1 │ 1               20.6
   2 │ 2               20
   3 │ 9               20

Topic		Replies	Views
XML to DataFrame General Usage question	7	3649	April 29, 2021
XML to DataFrame (or just to a dict) General Usage xml	2	1169	January 4, 2021
Convert XML Paths to DataFrame Columns Performance xml , ezxml	0	103	April 17, 2024
Excel data to dataframes Data data , dataframes , xlsx , excel	16	4914	July 25, 2021
Converting XLSX.Worksheet to DataFrame Data xlsx , excel	5	1040	February 16, 2024

XML to DataFrame converter

Related topics