Asking for help with a particular task (showing an XML sitemap as an abstract tree), and also for direction on how to approach this topic more broadly. The general goal is to learn to work with graph structures, starting with a tree of parent/child relationships in an undirected graph, progressing to other forms such as DAGs, and perhaps getting to the point of using ‘GraphModularDecomposition.jl’, which @StefanKarpinski developed and linked to here: “Develop simple open source graph visualization library - #2 by StefanKarpinski”. For more inspiration, there is “The DAG of Julia packages” by @juliohm, visualized here: “The DAG of Julia packages - Systematic Learning”.
- All amazing stuff, but crawl… walk… run… So:
Have as input XML sitemaps that adhere to a standard schema (https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd).
An example .xml file is reproduced further down.
Have learned to use ‘walkdir’ to create a tree view of a local directory (awesome!).
Essentially want to do the same thing, but with web URIs.
To illustrate, desired output below was generated by building out the example structure in a local directory, and using the ‘fstree.jl’ example in the ‘AbstractTrees.jl’ package (https://github.com/Keno/AbstractTrees.jl/blob/master/examples/fstree.jl).
weburl_tree_example
└─ webaddress.tld
   ├─ category1
   │  ├─ page1
   │  │  ├─ file.csv
   │  │  ├─ file.txt
   │  │  └─ pdf_file.pdf
   │  └─ page2
   ├─ category2
   │  └─ page1
   └─ category3
Here is the code that generated that hierarchy output:
using AbstractTrees
import AbstractTrees: children, printnode

struct File
    path::String
end

children(f::File) = ()

struct Directory
    path::String
end

function children(d::Directory)
    contents = readdir(d.path)
    children = Vector{Union{Directory,File}}(undef,length(contents))
    for (i,c) in enumerate(contents)
        path = joinpath(d.path,c)
        children[i] = isdir(path) ? Directory(path) : File(path)
    end
    return children
end

printnode(io::IO, d::Directory) = print(io, basename(d.path))
printnode(io::IO, f::File) = print(io, basename(f.path))

#dirpath = realpath(joinpath(dirname(pathof(AbstractTrees)),".."))
#d = Directory(dirpath)
dirpath = pwd() # This assumes current working directory is aligned with tree to create.
d = Directory(dirpath)
print_tree(d)
Can’t attach an .xml file directly, so here is the example file as output from EzXML:
prettyprint(rootnode)
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.webaddress.tld/category1/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page1/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page1/file.csv</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page1/file.txt</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page1/pdf_file.pdf</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page2/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category2/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category2/page1/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category3/</loc>
  </url>
</urlset>
Could perhaps parse the XML file as plain text (found bash script examples online that do this), but want to work within Julia and be able to handle XML files in general.
- Found a couple of packages for this: EzXML.jl (https://github.com/bicycle1885/EzXML.jl) and LightXML.jl (GitHub - JuliaIO/LightXML.jl: a light-weight Julia package for XML based on libxml2), which looks similar. Went with the first.
One way to load the XML using EzXML is:
doc = readxml("sitemap_example.xml")
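With the document loaded this way, the `<loc>` values could probably be read straight from the DOM, which would avoid depending on the exact whitespace in the file. A minimal sketch, assuming the layout of the example file in this post (a `urlset` root whose `url` children each hold a `loc`). `extract_locs` is a made-up helper name, and `parsexml` on an inline string stands in for `readxml("sitemap_example.xml")` so the snippet is self-contained:

```julia
using EzXML

# Inline stand-in for the example file (assumption: same layout as the real sitemap).
xml = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.webaddress.tld/category1/</loc></url>
  <url><loc>https://www.webaddress.tld/category1/page1/file.csv</loc></url>
</urlset>
"""

# Hypothetical helper: collect the text of every <loc> element under the root.
function extract_locs(doc::EzXML.Document)
    locs = String[]
    for url in eachelement(root(doc))      # each <url> element
        for el in eachelement(url)         # element children of <url>
            if nodename(el) == "loc"       # compare by element name
                push!(locs, strip(nodecontent(el)))
            end
        end
    end
    return locs
end

locs = extract_locs(parsexml(xml))
```

Walking elements by name sidesteps the default namespace on `urlset`, which an XPath query would have to declare explicitly.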
Went with a ‘streaming’ approach (need to use functions here, a work in progress!):
using EzXML
reader = open(EzXML.StreamReader, "sitemap_example.xml") # see the Streaming API section of https://bicycle1885.github.io/EzXML.jl/stable/manual/
#urlset=Array{String,1} # setting type, so as to avoid expected error "MethodError: Cannot `convert` an object of type Array{Any,1} to an object of type DataFrame" when converting array to df for using CSV.write to save.
@show reader.type # the initial state is READER_NONE; comment this line out once working
iterate(reader); # advance the reader's state from READER_NONE to READER_ELEMENT
@show reader.type # show state is READER_ELEMENT; comment this line out once working
#@show reader.content # show the string of url's, comment this line out once working
rawlist = reader.content;
close(reader)
#rawlist
#typeof(rawlist)
#strippedlist = strip(rawlist, "\n \n ") # MethodError: objects of type String are not callable
# so, try 'replace', but for multiple occurrences
# see: https://discourse.julialang.org/t/replacing-multiple-strings-errors/13654/9 for this method solved by @bkamins and @bennedich added the 'foldl' bit:
# reduce(replace, ["A"=>"a", "B"=>"b", "C"=>"c"], init="ABC")
# not sure, that form didn't work, this does:
replacedlist = (replace(rawlist, "\n \n"=>""))
replacedlist = (replace(replacedlist, " https://www."=>""))
urlset = split(replacedlist, " \n")
replacedlist=nothing
rawlist=nothing
urlset
This is the output, which seems to be going in the right direction:
9-element Array{SubString{String},1}:
 "webaddress.tld/category1/"
 "webaddress.tld/category1/page1/"
 "webaddress.tld/category1/page1/file.csv"
 "webaddress.tld/category1/page1/file.txt"
 "webaddress.tld/category1/page1/pdf_file.pdf"
 "webaddress.tld/category1/page2/"
 "webaddress.tld/category2/"
 "webaddress.tld/category2/page1/"
 "webaddress.tld/category3/"
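On the multiple-replacement question in the code comments: folding `replace` over a list of pairs should work with Base alone. A small sketch using a made-up input string; the real `reader.content` has whatever whitespace the file has, so both the input and the patterns here are assumptions:

```julia
# Made-up stand-in for reader.content; the real whitespace will differ.
raw = "\n https://www.webaddress.tld/category1/\n \n https://www.webaddress.tld/category2/\n"

# Apply each pattern => replacement in turn, threading the string through replace.
rules = ["https://www." => "", " " => ""]
cleaned = foldl(replace, rules; init = raw)

# keepempty=false drops the empty strings left by leading, trailing, and doubled newlines.
urls = split(cleaned, "\n"; keepempty = false)
```

On Julia 1.7 or newer, `replace(raw, rules...)` should give the same result for patterns like these, since `replace` accepts several pairs there.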
I think using split creates an Array of SubString:
typeof(urlset)
Array{SubString{String},1}
Thought these would need to be broken apart, so:
for i in urlset
    line = split(i, "/")
    println(line)
end
Output:
SubString{String}["webaddress.tld", "category1", ""]
SubString{String}["webaddress.tld", "category1", "page1", ""]
SubString{String}["webaddress.tld", "category1", "page1", "file.csv"]
SubString{String}["webaddress.tld", "category1", "page1", "file.txt"]
SubString{String}["webaddress.tld", "category1", "page1", "pdf_file.pdf"]
SubString{String}["webaddress.tld", "category1", "page2", ""]
SubString{String}["webaddress.tld", "category2", ""]
SubString{String}["webaddress.tld", "category2", "page1", ""]
SubString{String}["webaddress.tld", "category3", ""]
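The trailing empty strings come from the URLs that end in `/`. Passing `keepempty=false` to `split` should drop them, leaving only the path segments a tree would need. A short sketch with two of the URLs from the list above (`segments` is a made-up helper name):

```julia
# Hypothetical helper: split a URL path into its non-empty segments,
# converting the SubStrings that split returns into plain Strings.
segments(url::AbstractString) = String.(split(url, "/"; keepempty = false))

a = segments("webaddress.tld/category1/")
b = segments("webaddress.tld/category1/page1/file.csv")
```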
Am reading the AbstractTrees source to understand how Parent/Child relationships are identified and parsed to create a list of the relationships. Also looking at source by @tkoolen (https://github.com/JuliaRobotics/RigidBodyDynamics.jl/blob/master/src/graphs/directed_graph.jl and https://github.com/JuliaRobotics/RigidBodyDynamics.jl/blob/master/test/test_graph.jl).
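As far as I can tell from fstree.jl, AbstractTrees does not work out the parent/child relationships itself; it only asks each node for its `children`. So one possible route is to build the hierarchy first, by inserting each URL's segments into a nested node structure, and then define `children` and `printnode` for that node type. A sketch under those assumptions; `URLNode`, `insert_path!` and `build_tree` are made-up names, not part of any package:

```julia
using AbstractTrees
import AbstractTrees: children, printnode

# Hypothetical node type: a name plus child nodes keyed by name.
struct URLNode
    name::String
    kids::Dict{String,URLNode}
end
URLNode(name::AbstractString) = URLNode(String(name), Dict{String,URLNode}())

# Walk down one segment at a time, creating nodes that do not exist yet.
function insert_path!(root::URLNode, segs)
    node = root
    for s in segs
        node = get!(() -> URLNode(s), node.kids, String(s))
    end
    return root
end

# Build a tree from URL strings such as "webaddress.tld/category1/page1/".
function build_tree(urls; rootname = "weburl_tree_example")
    root = URLNode(rootname)
    for u in urls
        insert_path!(root, split(u, "/"; keepempty = false))
    end
    return root
end

# AbstractTrees interface: children sorted by name so the printout is stable.
children(n::URLNode) = [n.kids[k] for k in sort!(collect(keys(n.kids)))]
printnode(io::IO, n::URLNode) = print(io, n.name)

urls = ["webaddress.tld/category1/",
        "webaddress.tld/category1/page1/file.csv",
        "webaddress.tld/category2/page1/",
        "webaddress.tld/category3/"]
tree = build_tree(urls)
print_tree(tree)
```

Because shared prefixes land on the same node, a page and the files beneath it end up under one parent, which is the shape of the desired output near the top of this post. Keying children in a `Dict` is just one choice; a sorted vector would also work.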
- Feel like this is going in the right direction, but it has taken a good while and am floundering at this point.
- Any suggestions & advice welcome!