I’m working on wrapping a specific REST API with a Julia package. I am downloading and parsing the HTML of the API’s documentation page to generate the in-Julia API. Specifically, I’m looking at each of the tables on the website and parsing those tables into DataFrame instances using Gumbo.jl. I have previously asked a question about that process.
I can successfully retrieve a list of tables, but I’d like to find the nearest preceding heading (h1, h2, h3, etc.) associated with each table. For example, the first table is under the “Common Parameters” heading. Does anyone know how to find that programmatically with Gumbo, rather than by visually inspecting the webpage?
MWE
using HTTP, Gumbo, Cascadia, DataFrames
const URL = "https://ssd-api.jpl.nasa.gov/doc/horizons.html"
const MANUAL = String(HTTP.get(URL).body)
"""
Given an HTML file, return a list of `DataFrame` instances which
represent each table in the file.

!!! note
    This was totally and completely copied from
    [sudete](https://discourse.julialang.org/u/sudete)'s Julia
    Discourse [comment](https://discourse.julialang.org/t/is-there-a-ready-made-function-to-convert-a-gumbo-jl-parsed-html-table-into-a-table-like-dataframes-dataframe/55973/3)
"""
function htmltables(body::AbstractString)
    n = parsehtml(body)
    dfs = DataFrame[]
    for table in eachmatch(sel"table", n.root)
        # Get column names from the table's header row
        headers = eachmatch(sel"thead tr th", table) .|> nodeText
        # Tables without header cells are skipped
        if !isempty(headers)
            # Create an empty dataframe with one `Any`-typed column per header
            df = DataFrame(headers .=> Ref(Any[]))
            # Fill dataframe with rows from the table
            for row in eachmatch(sel"tbody tr", table)
                row_texts = eachmatch(sel"td", row) .|> nodeText
                push!(df, row_texts)
            end
            push!(dfs, df)
        end
    end
    return dfs
end
const TABLES = htmltables(MANUAL)
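One possible direction, sketched under assumptions and not verified against the Horizons page: walk the parsed tree in document order, remember the most recent heading element seen, and record it each time a table is reached. The names `table_headings` and `HEADING_TAGS` are hypothetical, introduced only for this illustration. The sketch relies on Gumbo's `tag` function and the `children` field of `HTMLElement`, and it applies the same "has header cells" filter as `htmltables` so that the two results have the same length.

```julia
using Gumbo, Cascadia

# Hypothetical helper constant: the tags treated as section headings.
const HEADING_TAGS = (:h1, :h2, :h3, :h4, :h5, :h6)

# Hypothetical helper: for every table that has `thead tr th` cells, return
# the text of the closest heading that precedes it in document order
# (or `nothing` if no heading has been seen yet).
function table_headings(body::AbstractString)
    doc = parsehtml(body)
    headings = Union{String,Nothing}[]
    current = Ref{Union{String,Nothing}}(nothing)

    function walk(node)
        # Text nodes have no tag and no children, so skip them
        node isa HTMLElement || return
        if tag(node) in HEADING_TAGS
            current[] = String(strip(nodeText(node)))
        elseif tag(node) === :table && !isempty(eachmatch(sel"thead tr th", node))
            push!(headings, current[])
        end
        # Visit children in order, which gives a pre-order traversal
        foreach(walk, node.children)
    end

    walk(doc.root)
    return headings
end
```

If Cascadia's `eachmatch` also returns tables in document order (an assumption worth checking), then `collect(zip(table_headings(MANUAL), TABLES))` would pair each heading with its table.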