Find last section header with Gumbo.jl

cadojo · July 12, 2022, 6:44pm

I’m working on wrapping a specific REST API with a Julia package. I am downloading and parsing the HTML of the API’s documentation page to generate the in-Julia API. Specifically, I’m looking at each of the tables on the website and parsing those tables into DataFrame instances using Gumbo.jl. I have previously asked a question about that process.

I can successfully retrieve a list of tables, but I’d like to find the closest preceding header (h1, h2, h3, etc.) associated with each table. For example, the first table is under the “Common Parameters” header. Does anyone know how to find that with Gumbo, as opposed to visually looking at the webpage?

MWE

using HTTP, Gumbo, Cascadia, DataFrames

const URL = "https://ssd-api.jpl.nasa.gov/doc/horizons.html"
const MANUAL = String(HTTP.get(URL).body)


"""
Given the HTML source of a page as a string, return a list of `DataFrame`
instances, one for each table that has `thead` header cells. Tables without
such headers are skipped.

!!! note
    This was totally and completely copied from
    [sudete](https://discourse.julialang.org/u/sudete)'s Julia
    Discourse [comment](https://discourse.julialang.org/t/is-there-a-ready-made-function-to-convert-a-gumbo-jl-parsed-html-table-into-a-table-like-dataframes-dataframe/55973/3)

"""
function htmltables(body::AbstractString)

    n = parsehtml(body)

    dfs = DataFrame[]
    for table in eachmatch(sel"table", n.root)

        # Get column names from table
        headers = eachmatch(sel"thead tr th", table) .|> nodeText

        if !isempty(headers)

            # Create an empty dataframe with one `Any`-typed column per header
            df = DataFrame(headers .=> Ref(Any[]))

            # Fill dataframe with rows from the table
            for row in eachmatch(sel"tbody tr", table)
                row_texts = eachmatch(sel"td", row) .|> nodeText
                push!(df, row_texts)
            end

            push!(dfs, df)

        end

    end

    return dfs
end

const TABLES = htmltables(MANUAL)
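
One possible way to get at the preceding header, sketched here under assumptions rather than taken from the Gumbo.jl docs: walk the parsed tree in document order, remember the text of the most recent `h1` to `h6` element, and record it each time a `table` element is reached. The names `HEADING_TAGS`, `collectheadings!`, and `tableheadings` are hypothetical helpers made up for this sketch. It assumes that "previous header" means the last heading seen before the table in document order, and it relies on the `children` field of `HTMLElement` plus `nodeText` from Cascadia.

```julia
using Gumbo, Cascadia

# Heading tags that count as "section headers" (an assumption of this sketch).
const HEADING_TAGS = (:h1, :h2, :h3, :h4, :h5, :h6)

# Depth-first, document-order walk. `current` is the text of the most recent
# heading seen so far (or `nothing`); it is pushed to `out` at every <table>.
function collectheadings!(out, current, node)
    node isa HTMLElement || return current   # text nodes have no children
    if tag(node) in HEADING_TAGS
        current = String(strip(nodeText(node)))
    elseif tag(node) === :table
        push!(out, current)
    end
    for child in node.children
        current = collectheadings!(out, current, child)
    end
    return current
end

"""
Return one entry per `<table>` in `body`, in document order: the text of the
closest preceding heading, or `nothing` if no heading comes before the table.
"""
function tableheadings(body::AbstractString)
    out = Union{Nothing,String}[]
    collectheadings!(out, nothing, parsehtml(body).root)
    return out
end
```

Calling `tableheadings(MANUAL)` should then give one heading per `table` element. Note that `htmltables` skips tables without `thead` headers, so the two results only line up one-to-one if the same filter is applied in both places.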
