[ANN] TableScraper.jl - an easy way to scrape WELL-FORMED tables from webpages


Can also be scraped here.


Nice. Is it possible to use this to get Wikipedia tables as LaTeX?

Specifically this one would look great in my LaTeX equation sheet, as opposed to my current screenshot.


The table looks well-formed, so you can scrape it. But you need to know some CSS and HTML to do it properly, I'd say.

I don't, unfortunately… Oh well.

Just wanted to check out the package quickly; something like this seemed to kind of work:

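using Chain, DataFrames, TableScraper
using Cascadia: nodeText  # assumption: nodeText (text of an HTML node) comes from Cascadia.jl

@chain begin
    # `identity` keeps every cell as a raw HTML node instead of converting it to text
    scrape_tables("https://en.wikipedia.org/wiki/Z-transform", identity)
    _[8]  # pick the 8th table scraped from the page
    DataFrame
    transform(1 => ByRow(nodeText) => :number)
    # the formulas are images; read the `alt` attribute of each one
    transform(2:4 .=> ByRow(function (x)
        try
            x.children[1].children[2].attributes["alt"]
        catch
            missing  # cell does not have the expected structure
        end
    end) .=> ["Signal", "Z-Transform", "ROC"])
    select(Not(1:4))  # keep only the newly created columns
end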
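To get from there to something you can paste into a LaTeX document, a rough sketch along these lines might do. It uses only Base Julia. The names `strip_displaystyle` and `latex_table` are made up for illustration, and the sample row is illustrative, written in the `{\displaystyle ...}` shape that Wikipedia's formula alt text usually has, not actual scraped output.

```julia
# Remove the `{\displaystyle ...}` wrapper, if present, and trim whitespace.
function strip_displaystyle(s::AbstractString)
    t = strip(s)
    m = match(r"^\{\\displaystyle\s*(.*)\}$"s, t)
    return m === nothing ? String(t) : String(strip(m.captures[1]))
end

# Build a plain `tabular` environment: one left-aligned column per header entry,
# each non-missing cell wrapped in inline math, missing cells left empty.
function latex_table(header, rows)
    lines = String[]
    push!(lines, "\\begin{tabular}{" * "l"^length(header) * "}")
    push!(lines, join(header, " & ") * " \\\\")
    push!(lines, "\\hline")
    for row in rows
        cells = [ismissing(c) ? "" : "\$" * strip_displaystyle(c) * "\$" for c in row]
        push!(lines, join(cells, " & ") * " \\\\")
    end
    push!(lines, "\\end{tabular}")
    return join(lines, "\n")
end

# Illustrative input only (hypothetical values, not real scraper output).
header = ["Signal", "Z-Transform", "ROC"]
rows = [
    [raw"{\displaystyle \delta [n]}", raw"{\displaystyle 1}", missing],
]
println(latex_table(header, rows))
```

With the DataFrame from the snippet above, something like `latex_table(names(df), eachrow(df))` should presumably work too, since both are just iterated, but I have not checked that against the real page.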

Well done. I guess it's not that hard.

Just noticed this announcement for TableScraper.jl; it looks good.

BTW, about an easy way to scrape "WELL-FORMED tables" from webpages:
I believe you can (mostly) eliminate the caveat/requirement/limitation
of "WELL-FORMED tables" by using tidy-html5:

tidy-html5 (Tidy tidies HTML and XML), GitHub code repo: https://github.com/htacg/tidy-html5

About "regular" Tidy: https://www.html-tidy.org/

It can tidy your documents by itself, and developers can easily integrate its
features into even more powerful tools.
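For what it's worth, here is a minimal sketch of calling Tidy from Julia before handing HTML to a scraper. It assumes the `tidy` command-line executable from tidy-html5 is installed and on the PATH. The function name `tidy_html` is made up for illustration. Whether the cleaned-up output actually meets TableScraper's notion of "well-formed" is not something this thread confirms.

```julia
# Sketch only: shells out to the `tidy` executable (assumed to be on PATH).
function tidy_html(html::AbstractString)
    Sys.which("tidy") === nothing && error("tidy executable not found on PATH")
    mktemp() do path, io
        write(io, html)
        close(io)
        # -q: quiet, -asxhtml: emit XHTML, --show-warnings no: suppress warnings.
        # tidy exits with 1 on warnings and 2 on errors, so ignore the status.
        cmd = ignorestatus(`tidy -q -asxhtml --show-warnings no $path`)
        read(pipeline(cmd; stderr=devnull), String)
    end
end

# A fragment with unclosed <td> and <tr> tags.
messy = "<table><tr><td>a<td>b</table>"
if Sys.which("tidy") !== nothing
    println(tidy_html(messy))
end
```

Going through a temporary file keeps the process handling simple; piping the string to Tidy's standard input would avoid the file but needs a bit more care.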

And upon reflection, I believe you might even guess at the existence of
something like "tidy-html5" from the fact that browsers
can display almost all tables, be they WELL-FORMED, ILL-FORMED, or not :)

HTH,
Marc


Interesting. TIL about these tidy tools. I might look at them if the need arises. Currently, the scraper works pretty well for my basic needs.
