I have a lot of gzipped XML files (about 1.5 GB in size) containing approximately 400,000 <REC> elements that I am interested in.
The structure of the file:
<records>
    <REC>
    ...
    </REC>
    etc.
</records>
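For anyone who wants to try this on something small, a tiny gzipped file with the same overall structure can be generated with CodecZlib; the path and the UID values below are just made-up placeholders:

    using CodecZlib

    # Write a small gzipped sample file with the <records>/<REC> structure above.
    # "/tmp/sample.xml.gz" and the UID values are placeholders.
    io = GzipCompressorStream(open("/tmp/sample.xml.gz", "w"))
    println(io, "<records>")
    for i in 1:5
        println(io, "<REC><UID>UID-$i</UID></REC>")
    end
    println(io, "</records>")
    close(io)  # flushes the gzip stream and closes the underlying file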
I first looked at LightXML but could not figure out how to do streaming with it. Then I tried EzXML in the way shown below; it works for a few records but then just hangs, after a random number of records, in the function eenrekord():
using EzXML
using CodecZlib
function eenrekord(x)
    r = parsexml(string(x)) # This is where the process hangs - sometimes after 20-80 iterations.
    ut = nodecontent(findfirst(root(r), "//UID"))
    ut, string(r)
end
function xml_records(gzipfile, tabel, zipfile)
    l = GzipDecompressorStream(open(gzipfile))
    reader = EzXML.StreamReader(l)
    count = 0
    rcount = 0
    conn = dbcon()   # opens the database connection (function not shown here)
    list = []
    while !done(reader)   # old (Julia 0.6) iteration protocol; see the for-loop sketch further down
        if nodetype(reader) == 1 && nodename(reader) == "REC"   # 1 == element start
            rcount += 1
            if rcount % 2 == 1
                x = expandtree(reader)
                ut, xml = eenrekord(x)
                push!(list, [zipfile, gzipfile, ut, xml])
                count += 1
                if count % 20000 == 0
                    println(count)
                    # write_list_sql_table - another function not shown here.
                    list = []
                end
            end
        end
    end
    if count % 20000 != 0   # write out the remaining records
        println(count)
        # write_list_sql_table - another function not shown here.
    end
    close(conn)
end
gzipfile = "/tmp/some.xml.gz"
zipfile = "somezipfile.zip"
tabel = "core_2018"
xml_records(gzipfile,tabel,zipfile)
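As far as I can tell from the EzXML README, on Julia 1.x the stream reader is iterated with a plain for loop rather than the old done() interface. A minimal sketch of that pattern (untested on my real data; the path is a placeholder) would be:

    using EzXML
    using CodecZlib

    # Sketch only: each iteration of the for loop advances the reader
    # and yields the type of the current node.
    io = GzipDecompressorStream(open("/tmp/sample.xml.gz"))
    reader = EzXML.StreamReader(io)
    for typ in reader
        if typ == EzXML.READER_ELEMENT && nodename(reader) == "REC"
            rec = expandtree(reader)   # expand the current <REC> into a node tree
            # The expanded node is only valid until the next read,
            # so copy whatever is needed (here the serialized XML) immediately.
            xmlstr = string(rec)
            # ... collect xmlstr (plus file names, UID, ...) here ...
        end
    end
    close(reader)

I do not know whether this avoids the hang, but it at least removes the old done() call.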
I have also looked at LibExpat.jl, whose README indicates streaming from a file using xp_streaming_parsefile, but I have no idea how to use it and I could not find any example that does.
At this stage I just want to store each <REC>...</REC> element as an XML-typed record in the SQL table, from where I will use PostgreSQL's XPath capabilities to manipulate it. In total there are about 60 million records to process in this way.
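The omitted write_list_sql_table is essentially a bulk insert. A rough sketch of what I have in mind, assuming LibPQ.jl and a table core_2018 with columns (zipfile, gzipfile, uid, xml) - the connection string and column layout here are assumptions, not working code:

    using LibPQ

    conn = LibPQ.Connection("dbname=wos")   # placeholder connection string
    stmt = prepare(conn,
        "INSERT INTO core_2018 (zipfile, gzipfile, uid, xml) VALUES (\$1, \$2, \$3, \$4)")
    execute(conn, "BEGIN")
    for row in list                         # list as built up in xml_records above
        execute(stmt, row)
    end
    execute(conn, "COMMIT")

    # Later, PostgreSQL's xpath() can be used on the stored column, e.g.:
    execute(conn, "SELECT xpath('//UID/text()', xml) FROM core_2018 LIMIT 5")
    close(conn)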
I would appreciate any references to good examples or tutorials on how to do this, as well as any comments on mistakes in the code shown above.