Directory Path -> URL Mining; output a Tree

Asking for help with a particular task (showing an XML sitemap as an abstract tree), and also for direction on how to approach this topic more broadly. The general goal is to learn to work with graph structures, starting with a tree of parent/child relationships in a directed graph, then progressing to other forms such as DAGs, and perhaps getting to the point of using ‘GraphModularDecomposition.jl’, which @StefanKarpinski developed and linked to here: Develop simple open source graph visualization library - #2 by StefanKarpinski. For more inspiration, there is “The DAG of Julia packages” by @juliohm, visualized here: The DAG of Julia packages - Systematic Learning

Have learned to use ‘walkdir’ to create a tree view of a local directory (awesome!).
Essentially, I want to do the same thing, but with web URLs.
To illustrate, the desired output below was generated by building out the example structure in a local directory and using the ‘fstree.jl’ example in the ‘AbstractTrees.jl’ package (https://github.com/Keno/AbstractTrees.jl/blob/master/examples/fstree.jl).

weburl_tree_example
└─ webaddress.tld
   ├─ category1
   │  ├─ page1
   │  │  ├─ file.csv
   │  │  ├─ file.txt
   │  │  └─ pdf_file.pdf
   │  └─ page2
   ├─ category2
   │  └─ page1
   └─ category3

Here is the code that generated that hierarchy output:

using AbstractTrees
import AbstractTrees: children, printnode

struct File
    path::String
end

children(f::File) = ()

struct Directory
    path::String
end

function children(d::Directory)
    contents = readdir(d.path)
    children = Vector{Union{Directory,File}}(undef,length(contents))
    for (i,c) in enumerate(contents)
        path = joinpath(d.path,c)
        children[i] = isdir(path) ? Directory(path) : File(path)
    end
    return children
end

printnode(io::IO, d::Directory) = print(io, basename(d.path))
printnode(io::IO, f::File) = print(io, basename(f.path))

#dirpath = realpath(joinpath(dirname(pathof(AbstractTrees)),".."))
#d = Directory(dirpath)
dirpath = pwd() # This assumes current working directory is aligned with tree to create.
d = Directory(dirpath)
print_tree(d)

Can’t attach an .xml file directly, so here is the example file as output from EzXML:
prettyprint(rootnode)

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.webaddress.tld/category1/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page1/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page1/file.csv</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page1/file.txt</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page1/pdf_file.pdf</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page2/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category2/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category2/page1/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category3/</loc>
  </url>
</urlset>

Perhaps I could parse the XML file directly (there are bash script examples online that do this), but I want to work within Julia and be able to handle XML files in general.

One way to load the xml using EzXML is:
doc = readxml("sitemap_example.xml")

Went with a ‘streaming’ approach (this still needs to be organized into functions; a work in progress!):

using EzXML
# Streaming API: https://bicycle1885.github.io/EzXML.jl/stable/manual/
reader = open(EzXML.StreamReader, "sitemap_example.xml")
@show reader.type # the initial state is READER_NONE; comment this line out once working
iterate(reader);  # advance the reader's state from READER_NONE to READER_ELEMENT
@show reader.type # show the state is READER_ELEMENT; comment this line out once working
#@show reader.content # show the string of URLs; comment this line out once working
rawlist = reader.content;
close(reader)
# strip(rawlist, "\n  \n    ") throws a MethodError: strip trims characters,
# not a substring, so use 'replace' for the multiple occurrences instead.
# See https://discourse.julialang.org/t/replacing-multiple-strings-errors/13654/9
# for the method solved by @bkamins (@bennedich added the 'foldl' bit):
#   reduce(replace, ["A"=>"a", "B"=>"b", "C"=>"c"], init="ABC")
# Couldn't get that form working here, but this does:
replacedlist = replace(rawlist, "\n  \n" => "")
replacedlist = replace(replacedlist, "    https://www." => "")
urlset = split(replacedlist, "  \n")
replacedlist = nothing # release the intermediate strings
rawlist = nothing
urlset
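As an aside, the reduce/replace pattern from that Discourse thread does evaluate as expected when written as a left fold with an init keyword; a toy sketch, independent of the sitemap content:

```julia
# Fold a list of Pairs into the string: each step calls
# replace(accumulated_string, pair), left to right.
pairs = ["A" => "a", "B" => "b", "C" => "c"]
result = foldl(replace, pairs; init = "ABC")
# result == "abc"
```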

This is the output, which seems to be going in the right direction:

9-element Array{SubString{String},1}:
 "webaddress.tld/category1/"
 "webaddress.tld/category1/page1/"
 "webaddress.tld/category1/page1/file.csv"
 "webaddress.tld/category1/page1/file.txt"
 "webaddress.tld/category1/page1/pdf_file.pdf"
 "webaddress.tld/category1/page2/"
 "webaddress.tld/category2/"
 "webaddress.tld/category2/page1/"
 "webaddress.tld/category3/"

Using split produces an array of SubString{String} (views into the original string), not String:
typeof(urlset)
Array{SubString{String},1}

Thought these would need to be broken apart, so:

for i in urlset
    line = split(i, "/")
    println(line)
end

Output:
SubString{String}["webaddress.tld", "category1", ""]
SubString{String}["webaddress.tld", "category1", "page1", ""]
SubString{String}["webaddress.tld", "category1", "page1", "file.csv"]
SubString{String}["webaddress.tld", "category1", "page1", "file.txt"]
SubString{String}["webaddress.tld", "category1", "page1", "pdf_file.pdf"]
SubString{String}["webaddress.tld", "category1", "page2", ""]
SubString{String}["webaddress.tld", "category2", ""]
SubString{String}["webaddress.tld", "category2", "page1", ""]
SubString{String}["webaddress.tld", "category3", ""]
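These segment arrays can be folded into a Dict of Dicts, which captures the parent/child relationships directly. A minimal sketch using a hand-written subset of the segments above (the empty trailing segment from a trailing slash is skipped):

```julia
# Build a nested Dict keyed by path segment; each value is the Dict of
# that segment's children.
paths = [["webaddress.tld", "category1", ""],
         ["webaddress.tld", "category1", "page1", "file.csv"],
         ["webaddress.tld", "category2", "page1", ""]]

tree = Dict{String,Any}()
for segs in paths
    node = tree
    for s in segs
        isempty(s) && continue          # skip the "" from a trailing slash
        node = get!(node, s, Dict{String,Any}())
    end
end
# tree["webaddress.tld"]["category1"] now contains "page1", and so on down.
```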

Am reading the AbstractTrees source to understand how Parent/Child relationships are identified and parsed to create a list of the relationships. Also looking at source by @tkoolen (https://github.com/JuliaRobotics/RigidBodyDynamics.jl/blob/master/src/graphs/directed_graph.jl and https://github.com/JuliaRobotics/RigidBodyDynamics.jl/blob/master/test/test_graph.jl).

  • Feel like this is going in right direction, but this has taken a good while and am floundering at this point.
  • Any suggestions & advice welcome!

Ok, still haven’t gotten the tree part working, but have learned to use EzXML a bit better, and it’s a nice package, created by @bicycle1885. Now, having learned its advantages (memory management, etc.) over the LightXML.jl package, I realize this was the correct choice!

First, regarding parsing a file: EzXML can parse directly from a string, so the working example above was easy to read in:

xml_string = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.webaddress.tld/category1/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page1/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page1/file.csv</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page1/file.txt</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page1/pdf_file.pdf</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category1/page2/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category2/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category2/page1/</loc>
  </url>
  <url>
    <loc>https://www.webaddress.tld/category3/</loc>
  </url>
</urlset>
"""

Then, simply parse this using the parsexml function:
xml_doc = parsexml(xml_string)
Out:
EzXML.Document(EzXML.Node(<DOCUMENT_NODE@0x000000002d5fb180>))
We set the root_node using the root function:
root_node = root(xml_doc)
Out:
EzXML.Node(<ELEMENT_NODE[urlset]@0x000000002f44ccd0>)
and we can see the elements:
doc_elements = elements(root_node)
Out:

9-element Array{EzXML.Node,1}:
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f44d9d0>)
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f44e6d0>)
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f44ced0>)
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f44d350>)
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f44d750>)
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f44de50>)
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f4501d0>)
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f44f050>)
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f450a50>)

We can also get there using the findall function with the namespace. First, find the namespace:
namespaces(xml_doc.root) # show that the namespace has an empty prefix
Out:

1-element Array{Pair{String,String},1}:
 "" => "http://www.sitemaps.org/schemas/sitemap/0.9"

Get the namespace for use with findall:
ns = namespace(xml_doc.root) # get the namespace
Out:
"http://www.sitemaps.org/schemas/sitemap/0.9"
then use this with findall:
element_array = findall("/x:urlset/x:url", xml_doc.root, ["x"=>ns]) # specify its prefix as "x"
Out:

9-element Array{EzXML.Node,1}:
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f44d9d0>)
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f44e6d0>)
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f44ced0>)
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f44d350>)
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f44d750>)
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f44de50>)
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f4501d0>)
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f44f050>)
 EzXML.Node(<ELEMENT_NODE[url]@0x000000002f450a50>)

Used a similar method with a loop to get the actual URL values and add them to an array:

url_list = []
for i in element_array
    push!(url_list, strip(i.content)) # push! avoids rebuilding the array each pass
end
url_list

Out:

9-element Array{Any,1}:
 "https://www.webaddress.tld/category1/"                  
 "https://www.webaddress.tld/category1/page1/"            
 "https://www.webaddress.tld/category1/page1/file.csv"    
 "https://www.webaddress.tld/category1/page1/file.txt"    
 "https://www.webaddress.tld/category1/page1/pdf_file.pdf"
 "https://www.webaddress.tld/category1/page2/"            
 "https://www.webaddress.tld/category2/"                  
 "https://www.webaddress.tld/category2/page1/"            
 "https://www.webaddress.tld/category3/" 
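As an aside, an element-type-annotated comprehension would give a Vector{String} directly instead of the Array{Any,1} above; in the loop it would look like String[strip(i.content) for i in element_array]. Sketched below on plain strings so it stands alone:

```julia
# A String[...] comprehension converts each strip result (a SubString)
# to String, so no separate conversion step is needed later.
raw = Any["  https://www.webaddress.tld/category1/  ",
          "  https://www.webaddress.tld/category2/page1/  "]
urls = String[strip(s) for s in raw]
# eltype(urls) == String
```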

Hoping to treat these strings as “paths” in order to use the AbstractTrees functionality. The dirname and splitdir functions work, but splitpath does not unless the element is a String:

println("dirname = ", dirname(url_list[3]))
println("basename = ", basename(url_list[3]))
println("splitdir = ", splitdir(url_list[3]))
println("splitpath = ", splitpath(url_list[3]))

Out:

dirname = https://www.webaddress.tld/category1/page1
basename = file.csv
splitdir = ("https://www.webaddress.tld/category1/page1", "file.csv")

MethodError: no method matching splitpath(::SubString{String})
Closest candidates are:
  splitpath(!Matched::String) at path.jl:231

Stacktrace:
 [1] top-level scope at In[377]:4

Ok, so I converted the Array from {Any} to {String}:
url_string = String.(url_list)
and now splitpath works nicely and is cleaner than the split approach earlier:

println("dirname = ", dirname(url_string[3]))
println("basename = ", basename(url_string[3]))
println("splitdir = ", splitdir(url_string[3]))
println("splitpath = ", splitpath(url_string[3]))
splitpath_array = splitpath.(url_string)

Out:

dirname = https://www.webaddress.tld/category1/page1
basename = file.csv
splitdir = ("https://www.webaddress.tld/category1/page1", "file.csv")
splitpath = ["https:/", "www.webaddress.tld", "category1", "page1", "file.csv"]

9-element Array{Array{String,1},1}:
 ["https:/", "www.webaddress.tld", "category1"]                         
 ["https:/", "www.webaddress.tld", "category1", "page1"]                
 ["https:/", "www.webaddress.tld", "category1", "page1", "file.csv"]    
 ["https:/", "www.webaddress.tld", "category1", "page1", "file.txt"]    
 ["https:/", "www.webaddress.tld", "category1", "page1", "pdf_file.pdf"]
 ["https:/", "www.webaddress.tld", "category1", "page2"]                
 ["https:/", "www.webaddress.tld", "category2"]                         
 ["https:/", "www.webaddress.tld", "category2", "page1"]                
 ["https:/", "www.webaddress.tld", "category3"] 
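Jumping ahead to the tree question: one way to get clean print_tree output is to fold these segment arrays into a small custom node type and implement AbstractTrees’ children and printnode for it, just as fstree.jl does for Directory/File. This is only a sketch, assuming AbstractTrees is available; UrlNode and buildtree are names invented here:

```julia
using AbstractTrees
import AbstractTrees: children, printnode

# Each node holds its display name and a Dict of child nodes.
struct UrlNode
    name::String
    children::Dict{String,UrlNode}
end
UrlNode(name) = UrlNode(name, Dict{String,UrlNode}())

# Insert each list of path segments, creating nodes as needed.
function buildtree(rootname, segment_lists)
    root = UrlNode(rootname)
    for segs in segment_lists
        node = root
        for s in segs
            node = get!(() -> UrlNode(s), node.children, s)
        end
    end
    return root
end

children(n::UrlNode) = sort!(collect(values(n.children)); by = c -> c.name)
printnode(io::IO, n::UrlNode) = print(io, n.name)

urls = ["https://www.webaddress.tld/category1/page1/file.csv",
        "https://www.webaddress.tld/category1/page2/",
        "https://www.webaddress.tld/category2/"]
# Drop the "https:/" and hostname segments; use the host as the root name.
root = buildtree("webaddress.tld", (splitpath(u)[3:end] for u in urls))
print_tree(root)
```

With the full nine-URL list, this should reproduce the tree shown at the top of the post (minus the outer weburl_tree_example directory).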

Now, was hoping to use this, or a variant of it, with the directory segments as the “test” for whether an entry is a ‘directory’ or a ‘file’, and build a Dict for parsing into a tree. There is this bit of code in Julia Base’s file.jl:

    dirs = Vector{eltype(content)}()
    files = Vector{eltype(content)}()
    for name in content
        if isdir(joinpath(root, name))
            push!(dirs, name)
        else
            push!(files, name)
        end
    end

and using the basename function: if the result is an empty string, it’s a directory path; otherwise, it’s a file. (This is the test because I couldn’t get isdir to return true on these strings, which makes sense, since they aren’t local filesystem paths.) These two examples show the difference, with [1] being a directory and [3] being a file path:

println("basename = ", basename(url_string[1]))
println("basename = ", basename(url_string[3]))
Out:
basename = 
basename = file.csv
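An equivalent, and perhaps more direct, test is endswith, since in this sitemap the directory entries all carry a trailing slash (sample strings below):

```julia
# true for "directory" URLs, false for file URLs.
urls = ["https://www.webaddress.tld/category1/",
        "https://www.webaddress.tld/category1/page1/file.csv"]
is_dir_like = endswith.(urls, "/")
# is_dir_like == [true, false]
```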

So, if these strings can be parsed into a Dict with the needed structure, I was thinking the print_tree function in AbstractTrees would then work, using this example from the source as a guide:

julia> print_tree(STDOUT,Dict("a"=>"b","b"=>['c','d']))
Dict{String,Any}("b"=>['c','d'],"a"=>"b")
├─ b
│  ├─ c
│  └─ d
└─ a
   └─ b

But running that code (which predates Julia 1.0, hence STDOUT) doesn’t produce the clean output shown in the source example; instead it produces this:
print_tree(Dict("a"=>"b","b"=>['c','d']))
Out:

Dict{String,Any}
├─ Array{Char,1}
│  ├─ 'c'
│  └─ 'd'
└─ "b"

This doesn’t have the clean tree look given as the goal in the OP (which was generated by walking an actual local directory).
So, even if I can parse the URLs into the proper structure, a plain Dict will likely produce a wonky output.

Question: is this an appropriate direction, or should I be shifting gears to accomplish the goal of creating a tree-view output from a list of URLs?

Also, one other thing tried in order to understand the mechanics: using the ‘fstree.jl’ script above from
https://github.com/Keno/AbstractTrees.jl/blob/master/examples/fstree.jl
I modified the children function with some print statements to see the flow:

function children(d::Directory)
    contents = readdir(d.path)
    println("*** contents=$contents")
    children = Vector{Union{Directory,File}}(undef,length(contents))
    for (i,c) in enumerate(contents)
        path = joinpath(d.path,c)
        println("*** path=$path")
        println("*** d.path=", d.path)
        println("*** c=$c")
        children[i] = isdir(path) ? Directory(path) : File(path)
    end
    return children
end

Empty directories show up in the trace with an empty contents array:

contents=String[]

Out:


weburl_tree_example
*** contents=["webaddress.tld"]
*** path=C://weburl_tree_example/webaddress.tld
*** d.path=C://weburl_tree_example
*** c=webaddress.tld
└─ webaddress.tld
*** contents=["category1", "category2", "category3"]
*** path=C://weburl_tree_example/webaddress.tld/category1
*** d.path=C://weburl_tree_example/webaddress.tld
*** c=category1
*** path=C://weburl_tree_example/webaddress.tld/category2
*** d.path=C://weburl_tree_example/webaddress.tld
*** c=category2
*** path=C://weburl_tree_example/webaddress.tld/category3
*** d.path=C://weburl_tree_example/webaddress.tld
*** c=category3
   ├─ category1
*** contents=["page1", "page2"]
*** path=C://weburl_tree_example/webaddress.tld/category1/page1
*** d.path=C://weburl_tree_example/webaddress.tld/category1
*** c=page1
*** path=C://weburl_tree_example/webaddress.tld/category1/page2
*** d.path=C://weburl_tree_example/webaddress.tld/category1
*** c=page2
   │  ├─ page1
*** contents=["file.csv", "file.txt", "pdf_file.pdf"]
*** path=C://weburl_tree_example/webaddress.tld/category1/page1/file.csv
*** d.path=C://weburl_tree_example/webaddress.tld/category1/page1
*** c=file.csv
*** path=C://weburl_tree_example/webaddress.tld/category1/page1/file.txt
*** d.path=C://weburl_tree_example/webaddress.tld/category1/page1
*** c=file.txt
*** path=C://weburl_tree_example/webaddress.tld/category1/page1/pdf_file.pdf
*** d.path=C://weburl_tree_example/webaddress.tld/category1/page1
*** c=pdf_file.pdf
   │  │  ├─ file.csv
   │  │  ├─ file.txt
   │  │  └─ pdf_file.pdf
   │  └─ page2
*** contents=String[]
   ├─ category2
*** contents=["page1"]
*** path=C://weburl_tree_example/webaddress.tld/category2/page1
*** d.path=C://weburl_tree_example/webaddress.tld/category2
*** c=page1
   │  └─ page1
*** contents=String[]
   └─ category3
*** contents=String[]

So, this is a bit more work than originally understood: there doesn’t appear to be a simple command in URL-land equivalent to walkdir in filesystem-land. Will need to build the parent/child relationships in a graph structure, perhaps recursively. While that is more work, it’s also well understood and documented, so will come back to it in time…

For now, have a hack that works: create a mirror of the URL structure in the local filesystem, then run the fstree.jl script above. Can use wget to do this on open directories:

$ wget -r --spider -l <depth> www.your-target-website.tld

For regular directory structures, can use the command-line sitemap crawler I’ve been using, process the sitemap with EzXML to extract all the directories, use a loop to create them locally in the filesystem, and then walk that:

IRS_nber2
└─ data.nber.org
   └─ tax-stats
      ├─ 990
      ├─ county
      │  ├─ 1989
      │  │  ├─ 89xls
      │  │  ├─ CI89
      │  │  └─ desc
      │  ├─ 1990
      │  │  ├─ 1990CountyIncome
      │  │  └─ desc
...

This doesn’t provide the option of showing filenames in the structure (a full mirror, e.g. with rsync, is needed for that), but it lets me move on…

bls-timeseries
└─ download.bls.gov
   └─ pub
      └─ time.series
         ├─ ap
         │  ├─ ap.area
         │  ├─ ap.contacts
         │  ├─ ap.data.0.Current
         │  ├─ ap.data.1.HouseholdFuels
         │  ├─ ap.data.2.Gasoline
         │  ├─ ap.data.3.Food
         │  ├─ ap.footnote
...

Also, lesson learned: ask for more focused help with a succinct OP.