Is there a ready-made function to convert an HTML table parsed with Gumbo.jl into a table like DataFrames.DataFrame?

I tried to read through Gumbo.jl pages but I don’t see a function to convert a table tag to a table.

Does such a function exist, or do I need to roll my own?

I don’t think this is implemented, though there’s talk at https://github.com/JuliaWeb/Gumbo.jl/issues/71 of supporting it.

Here’s what I ended up using:

using Gumbo
using Cascadia
using StringEncodings
using DataFrames

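function html2df(file; encoding=enc"UTF-8")
    # Allow the encoding to be passed by name as a plain string
    if encoding isa AbstractString
        encoding = Encoding(encoding)
    end

    # Decode the file with StringEncodings, then parse the HTML
    n = parsehtml(read(file, String, encoding))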

    dfs = DataFrame[]
    for table in eachmatch(sel"table", n.root)
        # Get column names from table
        headers = eachmatch(sel"thead tr th", table) .|> nodeText

        # Create an empty dataframe with one String column per header
        # (DataFrame(types, names) is not available in current DataFrames releases)
        columns = [String[] for _ in headers]
        df = DataFrame(columns, headers)

        # Fill dataframe with rows from the table
        for row in eachmatch(sel"tbody tr", table)
            row_texts = eachmatch(sel"td", row) .|> nodeText
            push!(df, row_texts)
        end

        push!(dfs, df)
    end

    return dfs
end

It takes a file name and returns an array of dataframes, one for each table in the file. Every column is of String type, and each table is expected to have thead and tbody sections.
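
Since every column comes back as String, you may want to convert the numeric ones afterwards. Here is a rough sketch of one way to do that; `parse_numeric_columns!` is a name I made up for illustration (it is not part of DataFrames), and it assumes that a column counts as numeric only if every entry parses as Float64:

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames): replace each String column
# with a Float64 column when every entry in it parses as a number.
function parse_numeric_columns!(df::DataFrame)
    for name in names(df)
        # tryparse returns `nothing` for entries that are not numbers
        parsed = tryparse.(Float64, df[!, name])
        if !isempty(parsed) && !any(isnothing, parsed)
            df[!, name] = Float64.(parsed)
        end
    end
    return df
end

# Small stand-in for one of the dataframes produced from an HTML table
df = DataFrame(city = ["Oslo", "Lima"], population = ["700000", "10000000"])
parse_numeric_columns!(df)
```

With the function above, this could be applied to every table with something like `foreach(parse_numeric_columns!, html2df("page.html"))`, where `page.html` is just a placeholder file name.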
