XML to DataFrame converter

Many instruments write data in XML format, and I often find that I spend way too much time writing parsers to convert it to DataFrame.

Do we have a package that can take a file like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ExperimentResults>
  <ExperimentInfo Name="Assay001">
    <ExpDescription/>
    <RunID>81EC636D</RunID>
  </ExperimentInfo>
  <AssayParamsInfo/>
  <KineticsData>
    <Step>
      <CommonData>
        <SampleLocation>1</SampleLocation>
        <Temperature>20.6</Temperature>
      </CommonData>
      <CycleTime>0.2</CycleTime>
    </Step>
    <Step>
      <CommonData>
        <SampleLocation>2</SampleLocation>
        <Temperature>20</Temperature>
      </CommonData>
      <CycleTime>0.2</CycleTime>
    </Step>
    <Step>
      <CommonData>
        <SampleLocation>9</SampleLocation>
        <Temperature>20</Temperature>
      </CommonData>
      <CycleTime>0.2</CycleTime>
    </Step>
  </KineticsData>
</ExperimentResults>

and give a result like this:

3×5 DataFrame
 Row │ ExperimentInfo_Name  RunID     SampleLocation  Temperature  CycleTime 
     │ String               String    Int64           Float64      Float64   
─────┼───────────────────────────────────────────────────────────────────────
   1 │ Assay001             81EC636D               1         20.6        0.2
   2 │ Assay001             81EC636D               2         20.0        0.2
   3 │ Assay001             81EC636D               9         20.0        0.2

without having to explicitly point to the individual elements with XPath like this;

using EzXML, DataFrames
doc = EzXML.readxml("ex1.xml")
DataFrame( ExperimentInfo_Name = EzXML.attributes(findfirst("//ExperimentInfo", doc))[1].content,
RunID = findfirst("//RunID", doc).content,
SampleLocation = map(x->x.content, findall("//SampleLocation", doc)),
Temperature = map(x->x.content, findall("//Temperature", doc))
)

It would be great to have a generic function that will just do

XML2DataFrame(“ex1.xml”) and return the result.

To solve your issue probably EzXML.jl should expose Tables.jl compatible view of data in it as a table. I am not sure if it does/could do (probably maintainers of this package know best).

That would be very elegant, but also difficult in general.

I have some ideas on how to approach this, but I would hate to solve a problem that is already solved :smile:
I remember seeing announcements here, where it turns out that a similar package already exists.

You could try converting XML to JSON and and using JSONTables.jl.

Not sure how you could do this for any general XML file.

Your code seems succinct enough and not too problematic, what is wrong with it?

Thanks for the reply @pdeffebach . I did not think about that. Do you happen to know a good package for this? I’ll probably try XMLDict the to JSON.

Thank you for the comment @tbeason .
It’s not that it is wrong, I just dislike having to analyze the schema of the XML to figure out how to get the data out. I’ll expect there to be software to do that :smile:

It might be I’m too optimistic, and looking at tutorials they basically all do something similar to my draft code above: fish out data based on knowledge of tag names in the XML.

My wish is for code to analyze the structure of the XML and recognize parts that are “table like”: repeated sections that have the same tag structure so they can be interpreted as vectors.

The <Step> sections in the example is of this form: 3 repeated sections with the exact same tag structure. I expect that to be reasonably easy to recognize.

It will also be fine to get a vector of 2 DataFrames:

2-element Vector{DataFrame}:
 1×2 DataFrame
 Row │ ExperimentInfo_Name  RunID    
     │ String               String   
─────┼───────────────────────────────
   1 │ Assay001             81EC636D
 3×2 DataFrame
 Row │ SampleLocation  Temperature 
     │ String          String      
─────┼─────────────────────────────
   1 │ 1               20.6
   2 │ 2               20
   3 │ 9               20