Released EzXML.jl - a new package for XML/HTML

I released a new package, EzXML.jl:
XML/HTML handling tools for primates. The initial release was about a week ago,
and Windows is supported as of the latest tagged version (v0.2.0).

But as you know, we already have several packages for XML. So why did I make yet
another? The main reason is that I was not satisfied with the APIs of the other
packages. I also wanted to support more features. The highlights are:

  • Consistent and Julian APIs.
  • Intuitive namespace handling.
  • Searching elements with XPath.
  • Streaming reader for large files.
  • Automatic memory management.

Let me explain each feature in a little more detail. Since EzXML.jl and
LightXML.jl are both built on top of libxml2, the clearest way to compare them
is by example. I hope the examples convince you that EzXML.jl is useful.

Here is a comparison of some APIs of EzXML.jl with those of LightXML.jl:

# EzXML.jl version.
using EzXML
xdoc = readxml("ex1.xml")
xroot = root(xdoc)
for c in eachnode(xroot)
    println(nodetype(c))
    if iselement(c)
        println(name(c))
    end
end

# LightXML.jl version.
using LightXML
xdoc = parse_file("ex1.xml")
xroot = root(xdoc)
for c in child_nodes(xroot)
    println(nodetype(c))
    if is_elementnode(c)
        e = XMLElement(c)
        println(name(e))
    end
end

As you may have noticed, some functions have different names. For example,
parse_file is renamed to readxml because “read” is used consistently in the
standard library for reading data from a file. “parse” is reserved for parsing
a string, so that function is named parsexml. child_nodes is renamed to
eachnode, which mimics the names of Base.eachindex and Base.eachline.
Types have different names, too. In EzXML.jl, Node is the only type that
represents a node in XML, while LightXML.jl has XMLElement, XMLNode and
XMLAttr. A whole XML document is stored in a Document object, but it is just
a thin wrapper around a Node and most operations are delegated to that node. So,
in a sense, everything is a Node object in EzXML.jl. No function name exported
from EzXML.jl contains an underscore, following the usual Julian convention.
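
To make the read/parse distinction concrete, here is a minimal sketch that parses
a document from a string instead of a file. It uses only the function names that
appear in this post (parsexml, root, name, eachnode, nodetype, iselement); the
inline XML is made up for illustration and is not the ex1.xml used above.

```julia
using EzXML

# Parse an XML document from a string (readxml would read it from a file).
# The document content below is a hypothetical example.
doc = parsexml("""
<bookstore>
    <book category="cooking"><title>Everyday Italian</title></book>
</bookstore>
""")

# The Document is a thin wrapper; root() gives the root Node.
r = root(doc)
println(name(r))

# Child nodes include text nodes (whitespace) as well as elements,
# so filter with iselement before asking for an element name.
for c in eachnode(r)
    if iselement(c)
        println(name(c))
    end
end
```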

Attribute access has a different syntax. In EzXML.jl, getindex is overloaded
for this purpose, so you can get an attribute value with
elem["attribute-name"]. Of course, setindex! and delete! are also
overloaded and work as you would expect. Moreover, the attribute name may be
prefixed by a namespace, so you can handle XML documents with namespaces:

# EzXML.jl version (xroot is the root node from the first example).
using EzXML
elm = firstelement(xroot)
println(elm["category"])
elm["like"] = "yes"
delete!(elm, "tag")
println(elm)

# LightXML.jl version (xroot is the root element from the first example).
using LightXML
elm = first(child_elements(xroot))
println(attribute(elm, "category"))
set_attribute(elm, "like", "yes")
# I don't know how I can delete an attribute.
println(elm)
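
The snippets above only touch attributes without a prefix. As a rough sketch of
the namespaced case, assuming the prefixed form is written as "prefix:name" and
that the prefix is declared on an ancestor element, it might look like this. The
document, the x prefix and its URI are hypothetical.

```julia
using EzXML

# Hypothetical document: the "x" prefix is declared on the root element.
doc = parsexml("""
<root xmlns:x="http://example.com/x">
    <item x:id="42"/>
</root>
""")

item = firstelement(root(doc))

# Assumption: a namespaced attribute is addressed as "prefix:name".
println(item["x:id"])
```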

The XPath query language is supported in EzXML.jl.
find(<document or node>, <xpath>) is overloaded to find all matching nodes
under a document or a node. As with attribute names, namespace prefixes are
registered automatically and can be used in a query:

using EzXML
find(xdoc, "/bookstore/book")    # A vector of "book" elements under "bookstore".
findfirst(xroot, "book")         # The first "book" element under the root.
content.(find(xdoc, "//title"))  # A vector of title strings.
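
The queries above do not use a prefix, so here is a small sketch of a query with
one. It assumes that a prefix declared in the document can be used as-is in the
XPath expression, as described above; the document and the x prefix are made up
for illustration.

```julia
using EzXML

# Hypothetical document with one namespaced and one plain element.
doc = parsexml("""
<root xmlns:x="http://example.com/x">
    <x:item>first</x:item>
    <item>second</item>
</root>
""")

# The "x" prefix comes from the document itself; nothing is registered by hand.
items = find(doc, "//x:item")
println(content.(items))
```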

In addition, a streaming reader is supported. This is especially important when
you want to parse extremely large XML files with limited memory. Since the
streaming reader does not construct an XML document tree in memory, it can parse
large files with a low, constant memory footprint:

reader = open(XMLReader, "ex1.xml")
while !done(reader)
    # reader has a current reading node in it.
    if nodetype(reader) == EzXML.READER_ELEMENT && name(reader) == "book"
        println(reader["category"])
    end
    next(reader)
end

Another important advantage of EzXML.jl over LightXML.jl is automatic
memory management. Nodes in EzXML.jl automatically free their memory when
they are no longer accessible from your code. This is far more difficult than
you may expect, because nodes are connected to each other inside libxml2 and
those links are invisible to Julia’s garbage collector. EzXML.jl solves (I
believe) this problem by keeping an owner node in each node object and by
guaranteeing a uniqueness property of node proxy objects. The “uniqueness
property” here means that two Node objects pointing to the same node in a
document are identical, which makes it possible to avoid a “double free” of nodes.
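
One way to observe the uniqueness property from user code is to fetch the same
node twice and compare the results with ===. This is only a sketch of what the
property implies, not a description of how it is implemented; the inline
document is hypothetical.

```julia
using EzXML

# Hypothetical one-element document.
doc = parsexml("<root><child/></root>")

# Both calls refer to the same underlying libxml2 node.
a = root(doc)
b = root(doc)

# If proxy objects are unique per node, these are the very same Julia object.
println(a === b)
```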

EzXML.jl should already be usable in your projects. However, the package is still
young (I started it this month), so you may encounter problems with it. I
really welcome any feedback from users, including bug reports, feature requests,
and API suggestions. Try it now and let me know what you think!
