Manipulating HTML DOM using Julia

Hi, I am trying to manipulate the DOM elements of an HTML document (e.g. adding a parent node and changing the hierarchy of some nodes). I could parse the HTML using Gumbo.jl, but I could not manipulate the DOM with that package. I also looked into LightXML.jl and EzXML.jl, but they were not well suited to parsing HTML with JavaScript inside script tags and produced multiple parsing errors. Could someone please let me know if there are any other resources for manipulating the HTML DOM in Julia (e.g. adding parent nodes, adding child nodes, changing tag names and changing the content of nodes)?

I think I figured out how to manipulate the DOM using Gumbo.jl manually. However, I am having difficulty changing the actual tree.

using Gumbo
import AbstractTrees
doc = parsehtml("""
<html>
    <head>
        <title>Title</title>
    </head>
    <body>
        <span>this is a span 1</span>
        <div>
            <span>this is a span 2</span>
        </div>
    </body>
</html>
""");

Suppose I want to change each <span> to a <p>:

for elem in AbstractTrees.PreOrderDFS(doc.root)
    if isa(elem, HTMLElement)
        if tag(elem) == :span
            elem = HTMLElement{:p}(elem.children,elem.parent,elem.attributes)
        end
    end
end

This does not work; the document is unchanged:

julia> doc
HTML Document:
<!DOCTYPE >
<HTML>
  <head>
    <title>
      Title
    </title>
  </head>
  <body>
    <span>
      this is a span 1
    </span>
    <div>
      <span>
        this is a span 2
      </span>
    </div>
  </body>
</HTML>

Could someone please let me know how to do this correctly?

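The problem is not the traversal itself: PreOrderDFS yields the actual nodes, but the assignment “elem = ...” only rebinds the loop variable to a new object. The new element is never stored in the tree, so doc.root is unchanged.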

I have had success changing doc.root[2][3] and then calling prettyprint(doc.root) again.

So you have to walk through doc.root yourself to find all the span elements in the hierarchy.
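A minimal sketch of such a walk, assuming Gumbo's node types are mutable structs with the children, parent and attributes fields used earlier in this thread. The helper name retag! is made up for illustration and is not part of Gumbo.jl.

```julia
using Gumbo

# Hypothetical helper (not part of Gumbo.jl): replace every <from> element
# below `node` with a <to> element by writing the new node back into the
# parent's `children` vector.
function retag!(node::HTMLElement, from::Symbol, to::Symbol)
    for (i, child) in enumerate(node.children)
        child isa HTMLElement || continue       # skip text nodes
        if tag(child) == from
            # The new element reuses the old children and attributes.
            new = HTMLElement{to}(child.children, node, child.attributes)
            # Assumption: nodes are mutable, so the grandchildren can be
            # re-pointed at their new parent.
            for grandchild in new.children
                grandchild.parent = new
            end
            node.children[i] = new              # this is what changes the tree
            child = new
        end
        retag!(child, from, to)                 # recurse into the subtree
    end
    return node
end

doc = parsehtml("<html><body><span>one</span><div><span>two</span></div></body></html>")
retag!(doc.root, :span, :p)
prettyprint(doc.root)
```

Only existing slots in the children vector are overwritten, so its length never changes while it is being iterated.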

I was working on this, but I gave up because it became too cumbersome to do anything without modifying the struct types in Gumbo.jl.

I am copying the issue I opened on Gumbo.jl if it helps someone figure out what to do.

I was trying to use Gumbo.jl to manipulate the DOM by changing the tags and content of nodes. However, I am confused by the behavior of some code.

doc = parsehtml("""
<html>
    <head>
        <title>Title</title>
    </head>
    <body>
        <span>this is a span 1</span>
        <div>
        <h1>this is a heading</h1>
        <span>this is a span 2</span>
        </div>
    </body>
</html>
""");

Suppose I want to replace the first <span> node with an <abc> node:

julia> elem = doc.root[2][1]
HTMLElement{:span}:
<span>
  this is a span 1
</span>

The following does not work. It creates a new element, but the document is unchanged:

julia> elem = HTMLElement{:abc}(elem.children,elem.parent,elem.attributes)
HTMLElement{:abc}:
<abc>
  this is a span 1
</abc>

julia> doc
HTML Document:
<!DOCTYPE >
<HTML>
  <head>
    <title>
      Title
    </title>
  </head>
  <body>
    <span>
      this is a span 1
    </span>
    <div>
      <h1>
        this is a heading
      </h1>
      <span>
        this is a span 2
      </span>
    </div>
  </body>
</HTML>

For it to work, I have to assign the new element into the parent's children, and for that I have to know the position of the node within its parent:

elem.parent.children[1] = HTMLElement{:abc}(elem.children,elem.parent,elem.attributes)

julia> doc
HTML Document:
<!DOCTYPE >
<HTML>
  <head>
    <title>
      Title
    </title>
  </head>
  <body>
    <abc>
      this is a span 1
    </abc>
    <div>
      <h1>
        this is a heading
      </h1>
      <span>
        this is a span 2
      </span>
    </div>
  </body>
</HTML>

Is this behavior intended? I thought the entry in the parent's children and the child node itself would point to the same location.
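For what it is worth, this looks like ordinary Julia variable binding rather than anything specific to Gumbo. A minimal sketch without Gumbo, where Item is a made-up stand-in for a tree node:

```julia
# `Item` is a made-up stand-in for a tree node; it is not a Gumbo type.
mutable struct Item
    name::String
end

items = [Item("span"), Item("div")]

x = items[1]             # x and items[1] refer to the same object
x = Item("abc")          # rebinds the name x only; the vector is untouched
name_after_rebind = items[1].name

y = items[1]
y.name = "p"             # mutates the shared object, so the vector sees it
name_after_mutation = items[1].name

items[1] = Item("abc")   # stores a new object in the container itself
name_after_setindex = items[1].name
```

In Gumbo the tag is a type parameter (HTMLElement{:span}), so it cannot be changed by mutating the existing node. A tag change needs a new object, which then has to be stored in the parent's children.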

I also see that the position within the parent is available as index_within_parent in the underlying Node struct. Would it be possible to expose this on each node, alongside parent, children and attributes? With that information, the issue above could be worked around.

struct Node{T}
    gntype::Int32  # enum
    parent::Ptr{Node}
    index_within_parent::Csize_t
    parse_flags::Int32  # enum
    v::T
end
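Without such a field, the position can be recovered by searching the parent's children for the node by identity. A rough sketch follows; index_in_parent and replace_node! are hypothetical helper names, not Gumbo.jl functions, and the search is linear in the number of siblings.

```julia
using Gumbo

# Hypothetical helper: position of `node` in its parent's `children` vector,
# found by object identity (===). Returns `nothing` if it is not found.
# Not meant for the root node, whose parent has no children field.
index_in_parent(node) = findfirst(c -> c === node, node.parent.children)

# Hypothetical helper: put `new` into the slot currently occupied by `old`.
function replace_node!(old, new)
    i = index_in_parent(old)
    i === nothing && error("node is not among its parent's children")
    old.parent.children[i] = new
    return new
end

doc = parsehtml("<html><body><span>one</span><div><span>two</span></div></body></html>")
elem = doc.root[2][2][1]    # the <span> inside the <div>
replace_node!(elem, HTMLElement{:abc}(elem.children, elem.parent, elem.attributes))
prettyprint(doc.root)
```

As in the example earlier in this post, the children of the replaced node still have the old element as their parent; they would need re-pointing if later code relies on the parent links.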

Please let me know your thoughts, or whether I am approaching this in entirely the wrong way. Is there a more straightforward way to manipulate the nodes?