Counting the number of paragraphs ("p") between containers ("div")

Nash · January 5, 2023, 9:49pm

I am trying to parse the content of a website. Part of the logic for getting the right information is counting how many paragraphs of text (element “p”) occur between containers of one particular type (“div.ttt-subhead”, i.e., not all divs, only those with that class).

Can I use Gumbo or Cascadia for this? If so, how? If not, then how can I find the count?

nilshg · January 5, 2023, 11:57pm

What have you tried? Why is it not working as expected? Can you post an MWE?

Nash · January 6, 2023, 12:04am

Well, I am quite out of my depth. I have been able to extract p’s and div’s using the following code, but I don’t know how to “relate” the p’s and the div’s to each other:

using Gumbo, Cascadia

# `body` is assumed to come from something like eachmatch(Selector("body"), dom.root)
# Print every div.ttt-subhead, running the query only once
for d in eachmatch(Selector("div.ttt-subhead"), body[1])
    println(d)
end

# Print every paragraph
for p in eachmatch(Selector("p"), body[1])
    println(p)
end

Here is a basic HTML example to use:

<!DOCTYPE html>
<html lang="en">

</head>

<body style="text-align: center;">

	 <div class="ttt-subhead"><span>8:00</span> Breakfast</div>


        <div class="ttt-row">
		<p>
			Eggs.
		</p>
		<p>
			Bacon.
		</p>
        </div>


	  <div class="ttt-subhead"><span>12:00</span> Lunch</div>

        <div class="ttt-row">
		<p>
			Burger.
		</p>
		<p>
			Fries.
		</p>
		<p>
			Coke.
		</p>
        </div>

	  <div class="ttt-subhead"><span>18:00</span> Dinner</div>

        <div class="ttt-row">
		<p>
			Salad.
		</p>
        </div>

</body>

</html>

skleinbo · January 6, 2023, 7:50am

Maybe your code works, but the HTML in your example is not correct. Gumbo doesn’t parse your example correctly. It should read

<div class="ttt-subhead">

Then this works

julia> using Cascadia, Gumbo

julia> dom = parsehtml(myhtml)

julia> map(eachmatch(Selector("div.ttt-subhead"), dom.root)) do d
         length(eachmatch(Selector("p"), d))
       end
3-element Vector{Int64}:
 1
 2
 1

where myhtml is the (corrected) example from your post.

Nash · January 6, 2023, 8:29am

Sorry, I must learn to ask the right question. I appreciate your help. Unfortunately, a better description of my problem is given by the following HTML (I have also edited the HTML in my original post):

<!DOCTYPE html>
<html lang="en">

</head>

<body style="text-align: center;">

	 <div class="ttt-subhead"><span>8:00</span> Breakfast</div>


        <div class="ttt-row">
		<p>
			Eggs.
		</p>
		<p>
			Bacon.
		</p>
        </div>


	  <div class="ttt-subhead"><span>12:00</span> Lunch</div>

        <div class="ttt-row">
		<p>
			Burger.
		</p>
		<p>
			Fries.
		</p>
		<p>
			Coke.
		</p>
        </div>

	  <div class="ttt-subhead"><span>18:00</span> Dinner</div>

        <div class="ttt-row">
		<p>
			Salad.
		</p>
        </div>

</body>

</html>

skleinbo · January 6, 2023, 8:36am

Still not sure what you are trying to match. In your revised example, the div.ttt-subhead elements do not enclose the paragraphs.

Nash · January 6, 2023, 9:04am

The answer to my question using the above HTML is

count = [2,3,1]

This vector is obtained by counting the paragraphs between the “Breakfast” and “Lunch” containers, between the “Lunch” and “Dinner” containers, and after the “Dinner” container.

skleinbo · January 6, 2023, 9:59am

What about:

<div class="ttt-subhead"><span>8:00</span> Breakfast</div>
        <div class="ttt-row">
		<p>
			Eggs.
		</p>
		<p>
			Bacon.
		</p>
        </div>
        <div class="ttt-row">
		<p>
			Hashbrowns.
		</p>
        </div>
<div class="ttt-subhead"><span>12:00</span> Lunch</div>

I assume this should count as 3 rather than [2,1]?

Nash · January 6, 2023, 10:36am

Yes, in your example, the solution is [2,1].

But it is a little different from my example. In your example, there are two

<div class="ttt-row">

between the

<div class="ttt-subhead">

elements. In my example, there is only one

<div class="ttt-row">

for each

<div class="ttt-subhead">

skleinbo · January 6, 2023, 10:39am

Then won’t just changing Selector("div.ttt-subhead") to Selector("div.ttt-row") in my proposal do the trick?

Nash · January 6, 2023, 10:50am

Please see my comment above, since it probably makes a difference.

lawless-m · January 6, 2023, 10:58am

I transliterated Python’s HTML Parser if that helps at all

skleinbo · January 6, 2023, 10:59am

Have you tried it out? If all paragraphs are wrapped in div.ttt-row, this should give you what you want.

Nash · January 6, 2023, 11:08am

Yes, unfortunately it didn’t work in my case. Could you perhaps try and modify your example so it matches mine, and then see if your prior solution works? If it does, then maybe it is because my simplified example does not capture my case.

Nash · January 6, 2023, 11:10am

I am not familiar with that package. Given your knowledge of it, would it work?

skleinbo · January 6, 2023, 11:11am

julia> using Cascadia

julia> myhtml = """<!DOCTYPE html>
       <html lang="en">

       </head>

       <body style="text-align: center;">

                <div class="ttt-subhead"><span>8:00</span> Breakfast</div>


               <div class="ttt-row">
                       <p>
                               Eggs.
                       </p>
                       <p>
                               Bacon.
                       </p>
               </div>


                 <div class="ttt-subhead"><span>12:00</span> Lunch</div>

               <div class="ttt-row">
                       <p>
                               Burger.
                       </p>
                       <p>
                               Fries.
                       </p>
                       <p>
                               Coke.
                       </p>
               </div>

                 <div class="ttt-subhead"><span>18:00</span> Dinner</div>

               <div class="ttt-row">
                       <p>
                               Salad.
                       </p>
               </div>

       </body>

       </html>"""
"<!DOCTYPE html>\n<html lang=\"en\">\n\n</head>\n\n<body style=\"text-align: center;\">\n\n         <div class=\"ttt-subhead\"><span>8:00</span> Breakfast</div>\n\n\n        <div class=\"ttt-row\">\n                <p>\n                        Eggs.\n                </p>\n                <p>\n                        Bacon.\n                </p>\n        </div>\n\n\n          <div class=\"ttt-subhead\"><span>12:00</span> Lunch</div>\n\n        <div class=\"ttt-row\">\n                <p>\n                        Burger.\n                </p>\n                <p>\n                        Fries.\n                </p>\n                <p>\n                        Coke.\n                </p>\n        </div>\n\n          <div class=\"ttt-subhead\"><span>18:00</span> Dinner</div>\n\n        <div class=\"ttt-row\">\n                <p>\n                        Salad.\n                </p>\n        </div>\n\n</body>\n\n</html>"

julia> using Gumbo

julia> dom = parsehtml(myhtml)
HTML Document:
<!DOCTYPE html>
HTMLElement{:HTML}:<HTML lang="en">
  <head></head>
  <body style="text-align: center;">
    <div class="ttt-subhead">
      <span>
        8:00
      </span>
      Breakfast
    </div>
    <div class="ttt-row">
      <p>
        Eggs.
      </p>
      <p>
        Bacon.
      </p>
    </div>
    <div class="ttt-subhead">
      <span>
        12:00
...


julia> map(eachmatch(Selector("div.ttt-row"), dom.root)) do d
         length(eachmatch(Selector("p"), d))
       end
3-element Vector{Int64}:
 2
 3
 1

lawless-m · January 6, 2023, 11:11am

It is certainly more manual: you feed the HTML into it and it invokes callbacks on open tag, close tag, and data (the text inside tags).

You would then inspect the attributes manually and set flags to keep track of where you are in the document.

So it would definitely work, but so do you
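
The flag-based idea described above can also be tried without a callback parser. What follows is only a sketch, and it swaps in a different technique: a recursive walk over the tree that Gumbo already builds. Entering a div.ttt-subhead plays the role of the open-tag callback, and the "flag" is simply whichever counter is currently last in the vector. The name walk! and the shortened HTML are assumptions made for illustration; they do not come from the thread or from any package.

```julia
using Gumbo

# Hypothetical helper: visit elements in document order and keep one
# counter per div.ttt-subhead seen so far.
function walk!(counts::Vector{Int}, node::HTMLElement)
    classes = split(get(attrs(node), "class", ""))
    if tag(node) == :div && "ttt-subhead" in classes
        push!(counts, 0)                 # "open tag" of a new subhead
    elseif tag(node) == :p && !isempty(counts)
        counts[end] += 1                 # paragraph after the latest subhead
    end
    for child in children(node)
        child isa HTMLElement && walk!(counts, child)   # skip text nodes
    end
    return counts
end

# Shortened version of the example HTML from the thread
html = """
<body>
<div class="ttt-subhead"><span>8:00</span> Breakfast</div>
<div class="ttt-row"><p>Eggs.</p><p>Bacon.</p></div>
<div class="ttt-subhead"><span>12:00</span> Lunch</div>
<div class="ttt-row"><p>Burger.</p><p>Fries.</p><p>Coke.</p></div>
<div class="ttt-subhead"><span>18:00</span> Dinner</div>
<div class="ttt-row"><p>Salad.</p></div>
</body>
"""

counts = walk!(Int[], parsehtml(html).root)
println(counts)
```

Because the walk ignores div.ttt-row entirely, the result should not depend on how many rows sit between two subheads. Paragraphs that appear before the first subhead are dropped in this sketch.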

Nash · January 6, 2023, 11:30am

I can see that your original proposal works in your example. Unfortunately, it does not work for me. Perhaps the solution you had in mind above (the one that gave 3, and not [2,1]) is the right one in my case. How would your code be changed to accommodate that?

skleinbo · January 6, 2023, 11:35am

It’s verbatim your example!

Nash · January 6, 2023, 11:52am

Yes, you are right. I am sorry for not being very clear.

In my case, it turns out that the number of “p” elements inside each <div class="ttt-row"> is always the same. So, running your code, I get a vector with many instances of one number.

However, the number of “p” elements between consecutive <div class="ttt-subhead"> elements in my actual case varies, because the number of <div class="ttt-row"> elements may vary, like this, for example:

subhead

row

subhead

row
row
row

subhead

row
row

So, I think your solution of 3, rather than [2,1], was right all along. My question is: supposing that the answer is 3 for your HTML, what code finds that number? (Again, I am really sorry for botching my explanation of the problem!)
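
One possible way to get that number, offered as a sketch rather than a confirmed answer from the thread: select the subhead divs and the paragraphs together with a selector group, then walk the matches in order, starting a new counter at each subhead. This assumes that Cascadia accepts the comma-separated group "div.ttt-subhead, p" and that eachmatch returns matches in document order. The function name count_paragraphs_per_subhead and the compact HTML are made up for the example.

```julia
using Gumbo, Cascadia

# Hypothetical helper (not from the thread): count the <p> elements that
# follow each div.ttt-subhead, up to the next div.ttt-subhead.
function count_paragraphs_per_subhead(root)
    counts = Int[]
    # Assumption: matches come back in document order.
    for node in eachmatch(Selector("div.ttt-subhead, p"), root)
        if tag(node) == :div
            push!(counts, 0)       # a subhead starts a new counter
        elseif !isempty(counts)
            counts[end] += 1       # paragraph belongs to the latest subhead
        end                        # paragraphs before the first subhead are ignored
    end
    return counts
end

# Compact version of the two-row example discussed above
html = """
<body>
<div class="ttt-subhead"><span>8:00</span> Breakfast</div>
<div class="ttt-row"><p>Eggs.</p><p>Bacon.</p></div>
<div class="ttt-row"><p>Hashbrowns.</p></div>
<div class="ttt-subhead"><span>12:00</span> Lunch</div>
<div class="ttt-row"><p>Burger.</p></div>
</body>
"""

counts = count_paragraphs_per_subhead(parsehtml(html).root)
println(counts)
```

Under those assumptions the Breakfast block counts as 3, regardless of how its paragraphs are split across div.ttt-row elements. If paragraphs can also occur inside a subhead div itself, they would be counted as well, so the selector may need tightening for a real page.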
