Reading HTML file for parsing

edas · December 19, 2022, 7:41pm

I want to read an HTML file and then parse it (not sure if I'm using the word “parse” correctly).

I saved example.com to the file example.html.
I am also using the EzXML library:

using Cascadia, Gumbo, HTTP, AbstractTrees
using EzXML

r = EzXML.readhtml("example.html")

print(r)

This prints the following result (HTML):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> Example Domain

<meta charset="utf-8"/>
<meta http-equiv="Content-type" content="text/html; charset=utf-8"/>    <meta name="viewport" content="width=device-width, initial-scale=1"/>
<style type="text/css"><![CDATA[
body {
    background-color: #f0f0f2;
    margin: 0;
    padding: 0;
    font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;     

}
div {
    width: 600px;
    margin: 5em auto;
    padding: 2em;
    background-color: #fdfdff;
    border-radius: 0.5em;
    box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
}
a:link, a:visited {
    color: #38488f;
    text-decoration: none;
}
@media (max-width: 700px) {
    div {
        margin: 0 auto;
        width: auto;
    }
}
]]></style>

Example Domain

This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.

More information...

How do I work further with this HTML?

h = parsehtml(String(r.body))

This gives an error:
ERROR: type Document has no field body

Cascadia commands do not work either.

EzXML.readhtml() reads simple HTML files, but gives errors on more complex files.

What library should I use?
Or did I miss some steps?

Thanks

avik · December 19, 2022, 8:10pm

You seem to be importing Gumbo.jl and Cascadia.jl in your code, so I don’t really understand why you are using EzXML’s parser. Can’t you use Gumbo’s parsehtml method directly?

XML-based parsers such as EzXML often have difficulty working with actual HTML.
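
A minimal sketch of that route, assuming Gumbo.jl and Cascadia.jl are installed. Note that `.body` is a field of an `HTTP.Response`, not of an `EzXML.Document`, which is why `r.body` fails in the question; for a file on disk, the contents can be read into a `String` and passed to `parsehtml`. The inline HTML and the temporary file below are illustrative stand-ins for the saved example.html, and the selectors are only examples:

```julia
using Gumbo, Cascadia

# Stand-in for the saved page; with a real file, skip the write step
# and point `path` at example.html instead.
html = """
<!doctype html>
<html><head><title>Example Domain</title></head>
<body><div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div></body></html>
"""
path = joinpath(mktempdir(), "example.html")
write(path, html)

# Read the file as a String and let Gumbo parse it.
doc = parsehtml(read(path, String))

# Cascadia selectors run against the root element of the Gumbo document.
links = eachmatch(sel"a", doc.root)
hrefs = [getattr(a, "href") for a in links]

# nodeText (from Cascadia) collects the text inside an element.
heading = strip(nodeText(eachmatch(sel"h1", doc.root)[1]))

println(hrefs)
println(heading)
```

From here, `doc.root` can also be walked with AbstractTrees (for example `PreOrderDFS(doc.root)`) if selectors are not enough.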
