Parsing XML as IO-stream

johann.spies · July 10, 2018, 12:59pm

I have a lot of XML files (gzipped, about 1.5 GB in size) containing approximately 400,000 elements (<REC></REC>) in which I am interested.

The structure of the file:

<records>
<REC>
...
</REC>
etc.
</records>

I first looked at LightXML but could not figure out how to do streaming. Then I tried EzXML as shown below. It worked for a few records, but then it hangs after a random number of records in the function eenrekord():

using EzXML
using CodecZlib

function eenrekord(x)
    r = parsexml(string(x)) # This is where the process hangs - sometimes after 20 - 80 iterations.
    ut = nodecontent(findfirst(root(r), "//UID"))
    ut, string(r)
end

function xml_records(gzipfile,tabel,zipfile)
    l = GzipDecompressorStream(open(gzipfile))
    reader = EzXML.StreamReader(l)
    count = 0
    rcount = 0
    conn = dbcon()
    list = []
    while !done(reader)
        if nodetype(reader) == 1 && nodename(reader) == "REC"
            rcount += 1
            if rcount % 2 == 1
                x = expandtree(reader)
                ut, xml = eenrekord(x)
                push!(list, [zipfile, gzipfile, ut, xml])
                count += 1
                if count % 20000 == 0
                    println(count)
                    # write_list_sql_table - another function not shown here.
                    list = []
                end
            end
        end
    end
    if count % 20000 != 0
        println(count)
        # write_list_sql_table - another function not shown here.
    end
    close(conn)
end
gzipfile = "/tmp/some.xml.gz"
zipfile = "somezipfile.zip"
tabel = "core_2018"
xml_records(gzipfile,tabel,zipfile)

I have looked at LibExpat.jl, whose README indicates streaming from a file using xp_streaming_parsefile, but I have no idea how to use it and I could not find any example using it.

At this stage I just want to store each <REC> </REC> as an XML-type record in the SQL table, from where I will use PostgreSQL’s XPath capabilities to manipulate it. In total I have about 60 million records to process in this way.

I would appreciate any references to good examples or tutorials on how to do this, and any comments about mistakes I made in the code shown above.

kevbonham · July 12, 2018, 2:03am

On my phone, so forgive the lack of code/links, but I recall running into a similar problem with a Python library for streaming XML. The issue was that a huge chunk of each record was kept in memory: I was only grabbing a single field from very large records, but all the parts of the record that were not in my query were held onto. I had to explicitly flush in every loop, otherwise the memory would fill up and the process would stall.

I think it’s unlikely that the libraries you tried have this problem, but thought I’d mention it just in case.

bicycle1885 · July 12, 2018, 4:22am

Hi, I’m a developer of EzXML and CodecZlib.

First of all, you don’t need to use CodecZlib because EzXML detects gzip compression and implicitly decompresses it. Also, calling expandtree is inefficient and may be unsafe. I’d like to know what you meant by “hang”; does the process run forever without warning messages?
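A minimal sketch of the first suggestion, assuming a recent EzXML release (1.x) in which a StreamReader is iterated with a for loop rather than the while !done(reader) form used in the original post. The helper name count_recs and the sample data are hypothetical. The demonstration uses a small uncompressed temporary file; whether a .gz path is decompressed transparently rests on the statement above and on how the underlying libxml2 was built, so that part is not verified here.

```julia
using EzXML

# Count <REC> start tags in an XML file, letting EzXML open the file itself
# instead of wrapping it in a GzipDecompressorStream first.
function count_recs(path::AbstractString)
    reader = open(EzXML.StreamReader, path)
    n = 0
    try
        for typ in reader
            # READER_ELEMENT is the named constant for the node type 1
            # that the original post compares against.
            if typ == EzXML.READER_ELEMENT && nodename(reader) == "REC"
                n += 1
            end
        end
    finally
        close(reader)  # release the underlying libxml2 reader
    end
    return n
end

# Demonstration on a tiny temporary file (hypothetical data).
path, io = mktemp()
write(io, "<records><REC/><REC/></records>")
close(io)
n = count_recs(path)
```

Closing the reader in a finally block means it is released even if parsing throws partway through a large file.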

johann.spies · July 12, 2018, 8:14am

Thanks for the remark that EzXML detects gzip compression.

I used expandtree to get a document that I could search for a specific element. Is there another way of doing it?

The process gave no error or warning messages and nothing was happening. I used print statements to determine the exact piece of code where this seemed to go into some endless loop. This happened sometimes after retrieving anything between 12 and 80 records from the same file.

bicycle1885 · July 12, 2018, 8:35am

Using a streaming parser is a little bit difficult; you’ll need to keep track of the current parsing state using a stack or something.
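One way the state tracking could look, as a sketch only: it assumes EzXML 1.x (for-loop iteration over the reader) and uses a hypothetical helper collect_uids with made-up sample data. It keeps a stack of open element names, aligned with the reader's depth, and collects the text of every <UID> directly inside a <REC> without calling expandtree. It does not reproduce the original goal of storing the whole <REC> as a string.

```julia
using EzXML

# Collect the text of every <UID> that sits directly inside a <REC>.
# `path` holds the names of the elements that are open at the cursor.
function collect_uids(io::IO)
    reader = EzXML.StreamReader(io)
    path = String[]
    uids = String[]
    try
        for typ in reader
            depth = nodedepth(reader)
            if typ == EzXML.READER_ELEMENT
                # Drop names of elements that have already closed. Using the
                # depth also handles empty elements such as <x/>, which
                # produce no separate end-element event.
                resize!(path, min(length(path), depth))
                push!(path, nodename(reader))
            elseif typ == EzXML.READER_TEXT
                # A text node at depth d belongs to the element at depth d - 1.
                resize!(path, min(length(path), depth))
                if length(path) >= 2 && path[end] == "UID" && path[end-1] == "REC"
                    push!(uids, nodecontent(reader))
                end
            end
        end
    finally
        close(reader)
    end
    return uids
end

# Hypothetical sample data, read from memory instead of a file.
sample = """
<records>
  <REC><UID>A:1</UID><title>first</title></REC>
  <REC><empty/><UID>A:2</UID></REC>
</records>
"""
uids = collect_uids(IOBuffer(sample))
```

Aligning the stack with nodedepth, instead of popping on end-element events, avoids a stale entry when a record contains a self-closing element.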

It would be helpful for me to debug the problem if you could create reproducible code with some (smaller) data and file an issue at https://github.com/bicycle1885/EzXML.jl/issues.

johann.spies · July 12, 2018, 8:38am

Thanks @kevbonham. I have had the same experience in Python where I did:

 element.clear()
 gc.collect()

I am not exactly sure how I am going to do this in Julia. Will investigate.

johann.spies · July 12, 2018, 8:46am

Thanks @bicycle1885.
I am willing to do that, but the data has strict licensing restrictions and I am reluctant to put even a portion of it out in the public domain. I will file an issue if I succeed in replicating the problem with “safe” data.
