Parsing XML as IO-stream


#1

I have a lot of XML files (gzipped, about 1.5 GB in size) containing approximately 400,000 elements (<REC></REC>) in which I am interested.

The structure of the file:

<records>
<REC>
...
</REC>
etc.
</records>

I first looked at LightXML but could not figure out how to do streaming. Then I tried EzXML as shown below. It works for a few records, but then hangs after a random number of records inside the function eenrekord():

using EzXML
using CodecZlib

function eenrekord(x)
    r = parsexml(string(x)) # This is where the process hangs - sometimes after 20 - 80 iterations.
    ut = nodecontent(findfirst(root(r), "//UID"))
    ut, string(r)
end

function xml_records(gzipfile,tabel,zipfile)
    l = GzipDecompressorStream(open(gzipfile))
    reader = EzXML.StreamReader(l)
    count = 0
    rcount = 0
    conn = dbcon()
    list = []
    while !done(reader)
        if nodetype(reader) == 1 && nodename(reader) == "REC"
            rcount += 1
            if rcount % 2 == 1
                x = expandtree(reader)
                ut, xml = eenrekord(x)
                push!(list, [zipfile, gzipfile, ut, xml])
                count +=1
                if count % 20000 == 0
                    println(count)
                    # write_list_sql_table - another function not shown here.
                    list = []
                end
            end
        end
    end
    # Flush the final, partially filled batch (the counter is `count`, not `telling`).
    if count % 20000 != 0
        println(count)
        # write_list_sql_table - another function not shown here.
    end
    close(conn)
end
gzipfile = "/tmp/some.xml.gz"
zipfile = "somezipfile.zip"
tabel = "core_2018"
xml_records(gzipfile,tabel,zipfile)

I have also looked at LibExpat.jl. Its README indicates that streaming from a file is possible using xp_streaming_parsefile, but I have no idea how to use it and I could not find any example using it.

At this stage I just want to store each <REC> </REC> as an XML-type record in the SQL table, from where I will use PostgreSQL’s XPath capabilities to manipulate it. In total I have to process about 60 million records in this way.

I would appreciate any references to good examples or tutorials on how to do this, as well as any comments about mistakes I made in the code shown above.


#2

On my phone, so forgive the lack of code/links, but I recall running into a similar problem with a Python library for streaming XML. The issue was that a huge chunk of each record was kept in memory: I was only grabbing a single field from very large records, but all of the parts of the record that were not in my query were held onto. I had to explicitly flush in every loop, otherwise the memory would fill up and the process would stall.

I think it’s unlikely that the libraries you tried have this problem, but thought I’d mention it just in case 🙂


#3

Hi, I’m a developer of EzXML and CodecZlib.

First of all, you don’t need to use CodecZlib because EzXML detects gzip compression and implicitly decompresses it. Also, calling expandtree is inefficient and may be unsafe. I’d like to know what you meant by “hang”; does the process run forever without warning messages?


#4

Thanks for pointing out that EzXML detects gzip compression.

I have used expandtree to be able to get a document which I could use to search for a specific element. Is there another way of doing it?

The process gave no error or warning messages and nothing was happening. I used print statements to determine the exact piece of code where this seemed to go into some endless loop. This happened sometimes after retrieving anything between 12 and 80 records from the same file.


#5

Using a streaming parser is a little bit difficult; you’ll need to keep track of the current parsing state using a stack or something.
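To illustrate what keeping track of the parsing state with a stack can look like, here is a minimal sketch in Python using the standard library's xml.sax module. It is not EzXML's API; the class name `RecordHandler` and the helper `extract_fields` are hypothetical, and the tag names REC and UID are taken from the question above.

```python
import io
import xml.sax


class RecordHandler(xml.sax.ContentHandler):
    """Collect the text of every <UID> that occurs somewhere inside a <REC>."""

    def __init__(self, record_tag="REC", field_tag="UID"):
        super().__init__()
        self.record_tag = record_tag
        self.field_tag = field_tag
        self.stack = []    # names of the elements that are currently open
        self.buffer = []   # text pieces of the field currently being read
        self.values = []   # collected field values, in document order

    def _in_field(self):
        # True when the innermost open element is the field and a record encloses it.
        return (bool(self.stack)
                and self.stack[-1] == self.field_tag
                and self.record_tag in self.stack)

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self._in_field():
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._in_field():
            self.buffer.append(content)

    def endElement(self, name):
        if self._in_field():
            self.values.append("".join(self.buffer))
        self.stack.pop()


def extract_fields(stream):
    """Parse a file-like object and return the collected field values."""
    handler = RecordHandler()
    xml.sax.parse(stream, handler)
    return handler.values


if __name__ == "__main__":
    doc = b"<records><REC><UID>A1</UID></REC><REC><UID>B2</UID></REC></records>"
    print(extract_fields(io.BytesIO(doc)))
```

The stack is what lets the handler tell a UID inside a REC apart from a UID elsewhere in the file, without ever building a full tree for the document.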

It would be helpful for me to debug the problem if you could create reproducible code with some (smaller) data and file an issue at https://github.com/bicycle1885/EzXML.jl/issues.


#6

Thanks @kevbonham. I have had the same experience in Python where I did:

 element.clear()
 gc.collect()

I am not exactly sure how I am going to do this in Julia. Will investigate.
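For reference, the Python pattern mentioned above can be sketched roughly as follows, using only the standard library. This is an assumed reconstruction rather than the code actually used: the function name `iter_records` is hypothetical, the tag names REC and UID come from the first post, and the explicit gc.collect() call is left out because clearing the root element is usually what releases the memory.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def iter_records(stream, tag="REC", id_tag="UID"):
    """Yield (uid, xml_string) for each <tag> element, releasing memory as we go."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            elem.tail = None  # drop whitespace that follows the closing tag
            uid = elem.findtext(f".//{id_tag}")
            yield uid, ET.tostring(elem, encoding="unicode")
            # Detach finished records from the root so they can be freed.
            root.clear()


if __name__ == "__main__":
    sample = b"<records><REC><UID>A1</UID></REC><REC><UID>B2</UID></REC></records>"
    with gzip.open(io.BytesIO(gzip.compress(sample)), "rb") as fh:
        for uid, xml in iter_records(fh):
            print(uid, xml)
```

Calling only element.clear() empties each record but leaves the emptied elements attached to the root, which is why this sketch clears the root instead.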


#7

Thanks @bicycle1885.
I am willing to do that, but the data has strict licensing restrictions and I am reluctant to put even a portion of it in the public domain. I will file an issue if I succeed in replicating the problem with “safe” data.