I’m working on wrapping a specific REST API with a Julia package. I am downloading and parsing the HTML of the API’s documentation page to generate the in-Julia API. Specifically, I’m looking at each of the tables on the website and parsing those tables into DataFrame instances using Gumbo.jl. I have previously asked a question about that process.
I can successfully retrieve a list of tables, but I’d like to find the previous header (h1, h2, h3, etc.) associated with each table. For example, the first table is under the “Common Parameters” header. Does anyone know how to find that with Gumbo, as opposed to visually looking at the webpage?
MWE
using HTTP, Gumbo, Cascadia, DataFrames
const URL = "https://ssd-api.jpl.nasa.gov/doc/horizons.html"
const MANUAL = String(HTTP.get(URL).body)
"""
Given an HTML file, return a list of `DataFrame` instances which
represent each table in the file.

!!! note
    This was totally and completely copied from
    [sudete](https://discourse.julialang.org/u/sudete)'s Julia
    Discourse [comment](https://discourse.julialang.org/t/is-there-a-ready-made-function-to-convert-a-gumbo-jl-parsed-html-table-into-a-table-like-dataframes-dataframe/55973/3).
"""
function htmltables(body::AbstractString)
    n = parsehtml(body)
    dfs = DataFrame[]
    for table in eachmatch(sel"table", n.root)
        # Get column names from the table header
        headers = eachmatch(sel"thead tr th", table) .|> nodeText
        # Tables without header cells are skipped
        if !isempty(headers)
            # Create an empty dataframe with one `Any`-typed column per header
            df = DataFrame(headers .=> Ref(Any[]))
            # Fill the dataframe with rows from the table body
            for row in eachmatch(sel"tbody tr", table)
                row_texts = eachmatch(sel"td", row) .|> nodeText
                push!(df, row_texts)
            end
            push!(dfs, df)
        end
    end
    return dfs
end
const TABLES = htmltables(MANUAL)
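
To make concrete what I'm after, here is a rough sketch of one approach I could imagine: walk the parsed tree in document order, remember the text of the most recent heading element, and record it every time a table is reached. The names `HEADING_TAGS`, `collectheadings!`, and `tableheadings` are made up for this sketch and are not part of Gumbo or Cascadia. It assumes that `HTMLElement` exposes its child nodes through the `children` field and that `tag` returns the element name as a lowercase `Symbol`. It applies the same `thead tr th` filter as `htmltables`, so the result should line up index by index with `TABLES`.

```julia
using Gumbo, Cascadia

# Heading tags to track (hypothetical constant for this sketch)
const HEADING_TAGS = (:h1, :h2, :h3, :h4, :h5, :h6)

# Depth-first, document-order walk. `current` is the text of the most
# recent heading seen so far (or `nothing`). It is threaded through the
# return value so that later siblings see headings found in earlier subtrees.
function collectheadings!(out, node, current)
    node isa HTMLElement || return current
    if tag(node) in HEADING_TAGS
        current = String(strip(nodeText(node)))
    elseif tag(node) == :table && !isempty(eachmatch(sel"thead tr th", node))
        # Same filter as `htmltables`: only tables with header cells count
        push!(out, current)
    end
    for child in node.children
        current = collectheadings!(out, child, current)
    end
    return current
end

"""
Return, for each table with header cells, the text of the nearest
preceding heading, or `nothing` if no heading comes before the table.
"""
function tableheadings(body::AbstractString)
    out = Union{Nothing,String}[]
    collectheadings!(out, parsehtml(body).root, nothing)
    return out
end
```

If this works as intended, `tableheadings(MANUAL) .=> TABLES` would pair each heading with its table. "Nearest preceding heading" here means nearest in document order, which may not match the visual layout if a page nests headings inside other containers.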