XML to DataFrame

jonjilla · September 5, 2019, 2:24pm

I have an xml file that has the following list of lines under a node. It looks like this (several thousand rows like the following)

<Obj Time="2019-09-01T23:40:13+08:00" attr1="L" attr2="1" attr3="a"/>
<Obj Time="2019-09-01T23:40:14+08:00" attr1="R" attr2="1" attr3="a"/>
<Obj Time="2019-09-01T23:40:15+08:00" attr1="L" attr2="11" attr3="d"/>
<Obj Time="2019-09-01T23:40:16+08:00" attr1="L" attr2="13" attr3="c"/>

I’m trying to use LightXML to extract this, but I can’t find the command to extract this. I’d like to extract this into a DataFrame but an array of Dicts would be OK also.

Anyone help with this?

longemen3000 · September 5, 2019, 10:44pm

DISCLAIMER, i have basic knowlegdge of XML manipulation, i just want to answer this to learn.

your XML is something like this?

<?xml version = "1.0"?>
<db>
<Obj Time="2019-09-01T23:40:13+08:00" attr1="L" attr2="1" attr3="a"/>
<Obj Time="2019-09-01T23:40:14+08:00" attr1="R" attr2="1" attr3="a"/>
<Obj Time="2019-09-01T23:40:15+08:00" attr1="L" attr2="11" attr3="d"/>
<Obj Time="2019-09-01T23:40:16+08:00" attr1="L" attr2="13" attr3="c"/>
</db>

jonjilla · September 5, 2019, 10:46pm

Yes, basically.

longemen3000 · September 5, 2019, 10:58pm

this code produces a dataframe from a xml structured in that way

using LightXML, DataFrames

#length of an unkwown iterable, maybe implementing this in LIghtXML would help
function ilength(iter)
    n=0
        for i in iter  
            n+=1
        end
    return n
end

function xml_parse_main()

    obj_vector = xroot["Obj"] #obtain all Obj in the xml
    attr_length = ilength(attributes(obj_vector[1]))

   
    #all xml things are strings, so we are gonna store all as a string first and transform later
    attr_vector = Vector{String}(undef,attr_length)
    for (i,a) in enumerate(attributes(obj_vector[1]))
        attr_vector[i] = name(a)
    end

    datamatrix = Matrix{String}(undef,length(obj_vector),attr_length)
    for (i,ai) in enumerate(attributes(obj_vector[1]))
        for j in 1:length(obj_vector)
            datamatrix[j,i] = value(ai) 
        end
    end
    return (attr_vector,datamatrix)
end

titles,data = xml_parse_main()
df = DataFrame(titles)
names!(df, Symbol.(titles))

the result is this:

4×4 DataFrame
│ Row │ Time                      │ attr1  │ attr2  │ attr3  │
│     │ String                    │ String │ String │ String │
├─────┼───────────────────────────┼────────┼────────┼────────┤
│ 1   │ 2019-09-01T23:40:13+08:00 │ L      │ 1      │ a      │
│ 2   │ 2019-09-01T23:40:13+08:00 │ L      │ 1      │ a      │
│ 3   │ 2019-09-01T23:40:13+08:00 │ L      │ 1      │ a      │
│ 4   │ 2019-09-01T23:40:13+08:00 │ L      │ 1      │ a      │

probably and surely, the next steps is converting all the values to the corresponding types (DateTime, char,Int64,char)

fun experience!

longemen3000 · September 5, 2019, 11:12pm

to parse in the corresponding types

using TimeZones #i had to install this
df.Time = ZonedDateTime.(df.Time, "yyyy-mm-ddTHH:MM:SSzzzz")
df.attr2 = parse.(Int64,df.attr2)

bicycle1885 · September 6, 2019, 3:58am

EzXML.jl supports XPath and thus it is very easy to extract <Obj> tags. Also, you don’t need to care about memory leak because EzXML.jl does memory management for you.

shell> cat test.xml
<data>
<Obj Time="2019-09-01T23:40:13+08:00" attr1="L" attr2="1" attr3="a"/>
<Obj Time="2019-09-01T23:40:14+08:00" attr1="R" attr2="1" attr3="a"/>
<Obj Time="2019-09-01T23:40:15+08:00" attr1="L" attr2="11" attr3="d"/>
<Obj Time="2019-09-01T23:40:16+08:00" attr1="L" attr2="13" attr3="c"/>
</data>

julia> using EzXML

julia> doc = readxml("test.xml")
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007fce9c2ab8b0>))

julia> elements = findall("//Obj", doc.root)
4-element Array{EzXML.Node,1}:
 EzXML.Node(<ELEMENT_NODE[Obj]@0x00007fce9c2f6130>)
 EzXML.Node(<ELEMENT_NODE[Obj]@0x00007fce9c7d1fe0>)
 EzXML.Node(<ELEMENT_NODE[Obj]@0x00007fce9c7c6ea0>)
 EzXML.Node(<ELEMENT_NODE[Obj]@0x00007fce9c298430>)

julia> elements[1]["Time"]
"2019-09-01T23:40:13+08:00"

julia> elements[2]["attr3"]
"a"

jonjilla · September 6, 2019, 2:35pm

thanks everyone for your help. I’ll try to test these out later today.

tp2750 · April 29, 2021, 3:01pm

Had a similar situation, and just share my solution here in case it can be useful for others.
I need the numerical values to be parsed as Float64.

import EzXML, DataFrames

function XPathDataFrame(path,doc)
    elements = findall(path, doc)
    reduce(vcat, DataFrames.DataFrame.(nodeDict.(elements)))
end

function nodeDict(n::EzXML.Node)
    my_keys = map(x -> x.name, EzXML.attributes(n))
    my_vals = map(my_keys) do x
        if tryparse(Float64, n[x]) == nothing
            return(n[x])
        end
        parse(Float64, n[x])
    end
    Dict{String, Union{String, Float64}}(zip(my_keys,my_vals))
end

Then we get

doc = EzXML.readxml("test.xml")

XPathDataFrame("//Obj",doc)
4×4 DataFrame
 Row │ Time                       attr1   attr2    attr3  
     │ String                     String  Float64  String 
─────┼────────────────────────────────────────────────────
   1 │ 2019-09-01T23:40:13+08:00  L           1.0  a
   2 │ 2019-09-01T23:40:14+08:00  R           1.0  a
   3 │ 2019-09-01T23:40:15+08:00  L          11.0  d
   4 │ 2019-09-01T23:40:16+08:00  L          13.0  c

Topic		Replies	Views
XML to DataFrame converter Data dataframes , xml	6	760	September 20, 2023
XML to DataFrame (or just to a dict) General Usage xml	2	1169	January 4, 2021
Convert XML Paths to DataFrame Columns Performance xml , ezxml	0	103	April 17, 2024
How to convert EzXML.Document file into a dictionary New to Julia	1	251	August 7, 2023
Converting XLSX.Worksheet to DataFrame Data xlsx , excel	5	1040	February 16, 2024

XML to DataFrame

Related topics