XML to DataFrame

I have an xml file that has the following list of lines under a node. It looks like this (several thousand rows like the following)

<Obj Time="2019-09-01T23:40:13+08:00" attr1="L" attr2="1" attr3="a"/>
<Obj Time="2019-09-01T23:40:14+08:00" attr1="R" attr2="1" attr3="a"/>
<Obj Time="2019-09-01T23:40:15+08:00" attr1="L" attr2="11" attr3="d"/>
<Obj Time="2019-09-01T23:40:16+08:00" attr1="L" attr2="13" attr3="c"/>

I’m trying to use LightXML to extract this, but I can’t find the command to extract this. I’d like to extract this into a DataFrame but an array of Dicts would be OK also.

Anyone help with this?

DISCLAIMER, i have basic knowlegdge of XML manipulation, i just want to answer this to learn.

your XML is something like this?

<?xml version = "1.0"?>
<db>
<Obj Time="2019-09-01T23:40:13+08:00" attr1="L" attr2="1" attr3="a"/>
<Obj Time="2019-09-01T23:40:14+08:00" attr1="R" attr2="1" attr3="a"/>
<Obj Time="2019-09-01T23:40:15+08:00" attr1="L" attr2="11" attr3="d"/>
<Obj Time="2019-09-01T23:40:16+08:00" attr1="L" attr2="13" attr3="c"/>
</db>

Yes, basically.

this code produces a dataframe from a xml structured in that way

using LightXML, DataFrames

#length of an unkwown iterable, maybe implementing this in LIghtXML would help
function ilength(iter)
    n=0
        for i in iter  
            n+=1
        end
    return n
end

function xml_parse_main()

    obj_vector = xroot["Obj"] #obtain all Obj in the xml
    attr_length = ilength(attributes(obj_vector[1]))

   
    #all xml things are strings, so we are gonna store all as a string first and transform later
    attr_vector = Vector{String}(undef,attr_length)
    for (i,a) in enumerate(attributes(obj_vector[1]))
        attr_vector[i] = name(a)
    end

    datamatrix = Matrix{String}(undef,length(obj_vector),attr_length)
    for (i,ai) in enumerate(attributes(obj_vector[1]))
        for j in 1:length(obj_vector)
            datamatrix[j,i] = value(ai) 
        end
    end
    return (attr_vector,datamatrix)
end

titles,data = xml_parse_main()
df = DataFrame(titles)
names!(df, Symbol.(titles))

the result is this:

4Γ—4 DataFrame
β”‚ Row β”‚ Time                      β”‚ attr1  β”‚ attr2  β”‚ attr3  β”‚
β”‚     β”‚ String                    β”‚ String β”‚ String β”‚ String β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 2019-09-01T23:40:13+08:00 β”‚ L      β”‚ 1      β”‚ a      β”‚
β”‚ 2   β”‚ 2019-09-01T23:40:13+08:00 β”‚ L      β”‚ 1      β”‚ a      β”‚
β”‚ 3   β”‚ 2019-09-01T23:40:13+08:00 β”‚ L      β”‚ 1      β”‚ a      β”‚
β”‚ 4   β”‚ 2019-09-01T23:40:13+08:00 β”‚ L      β”‚ 1      β”‚ a      β”‚

probably and surely, the next steps is converting all the values to the corresponding types (DateTime, char,Int64,char)

fun experience!

3 Likes

to parse in the corresponding types

using TimeZones #i had to install this
df.Time = ZonedDateTime.(df.Time, "yyyy-mm-ddTHH:MM:SSzzzz")
df.attr2 = parse.(Int64,df.attr2)

EzXML.jl supports XPath and thus it is very easy to extract <Obj> tags. Also, you don’t need to care about memory leak because EzXML.jl does memory management for you.

shell> cat test.xml
<data>
<Obj Time="2019-09-01T23:40:13+08:00" attr1="L" attr2="1" attr3="a"/>
<Obj Time="2019-09-01T23:40:14+08:00" attr1="R" attr2="1" attr3="a"/>
<Obj Time="2019-09-01T23:40:15+08:00" attr1="L" attr2="11" attr3="d"/>
<Obj Time="2019-09-01T23:40:16+08:00" attr1="L" attr2="13" attr3="c"/>
</data>

julia> using EzXML

julia> doc = readxml("test.xml")
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x00007fce9c2ab8b0>))

julia> elements = findall("//Obj", doc.root)
4-element Array{EzXML.Node,1}:
 EzXML.Node(<ELEMENT_NODE[Obj]@0x00007fce9c2f6130>)
 EzXML.Node(<ELEMENT_NODE[Obj]@0x00007fce9c7d1fe0>)
 EzXML.Node(<ELEMENT_NODE[Obj]@0x00007fce9c7c6ea0>)
 EzXML.Node(<ELEMENT_NODE[Obj]@0x00007fce9c298430>)

julia> elements[1]["Time"]
"2019-09-01T23:40:13+08:00"

julia> elements[2]["attr3"]
"a"
4 Likes

thanks everyone for your help. I’ll try to test these out later today.

Had a similar situation, and just share my solution here in case it can be useful for others.
I need the numerical values to be parsed as Float64.

import EzXML, DataFrames

function XPathDataFrame(path,doc)
    elements = findall(path, doc)
    reduce(vcat, DataFrames.DataFrame.(nodeDict.(elements)))
end

function nodeDict(n::EzXML.Node)
    my_keys = map(x -> x.name, EzXML.attributes(n))
    my_vals = map(my_keys) do x
        if tryparse(Float64, n[x]) == nothing
            return(n[x])
        end
        parse(Float64, n[x])
    end
    Dict{String, Union{String, Float64}}(zip(my_keys,my_vals))
end

Then we get

doc = EzXML.readxml("test.xml")

XPathDataFrame("//Obj",doc)
4Γ—4 DataFrame
 Row β”‚ Time                       attr1   attr2    attr3  
     β”‚ String                     String  Float64  String 
─────┼────────────────────────────────────────────────────
   1 β”‚ 2019-09-01T23:40:13+08:00  L           1.0  a
   2 β”‚ 2019-09-01T23:40:14+08:00  R           1.0  a
   3 β”‚ 2019-09-01T23:40:15+08:00  L          11.0  d
   4 β”‚ 2019-09-01T23:40:16+08:00  L          13.0  c
7 Likes