Counting the number of paragraphs ("p") between containers ("div")

I am trying to parse the content of a website. Part of the logic for extracting the right information is counting how many paragraphs of text (the element “p”) occur between containers of a particular type (“div.ttt-subhead”, i.e., not all divs, only those with that class).

Can I use Gumbo or Cascadia for this? If so, how? If not, then how can I find the count?

What have you tried? Why is it not working as expected? Can you post an MWE?

Well, I am quite out of my depth. I have been able to extract p’s and div’s using the following code, but I don’t know how to “relate” the p’s and the div’s to each other:

using Cascadia, Gumbo

# `body` comes from the parsed page (not shown here).
# Print every div.ttt-subhead found under body[1].
for d in eachmatch(Selector("div.ttt-subhead"), body[1])
    println(d)
end

# Print every p found under body[1].
for p in eachmatch(Selector("p"), body[1])
    println(p)
end

Here is a basic HTML to use:

<!DOCTYPE html>
<html lang="en">

</head>

<body style="text-align: center;">

	 <div class="ttt-subhead"><span>8:00</span> Breakfast</div>


        <div class="ttt-row">
		<p>
			Eggs.
		</p>
		<p>
			Bacon.
		</p>
        </div>


	  <div class="ttt-subhead"><span>12:00</span> Lunch</div>

        <div class="ttt-row">
		<p>
			Burger.
		</p>
		<p>
			Fries.
		</p>
		<p>
			Coke.
		</p>
        </div>

	  <div class="ttt-subhead"><span>18:00</span> Dinner</div>

        <div class="ttt-row">
		<p>
			Salad.
		</p>
        </div>

</body>

</html>

Maybe your code works, but the HTML in your example is not correct. Gumbo doesn’t parse your example correctly. It should read

<div class="ttt-subhead">

Then this works

julia> using Cascadia, Gumbo

julia> dom = parsehtml(myhtml)

julia> map(eachmatch(Selector("div.ttt-subhead"), dom.root)) do d
         length(eachmatch(Selector("p"), d))
       end
3-element Vector{Int64}:
 1
 2
 1

where myhtml is the (corrected) example from your post.


Sorry, I must learn to ask the right question. I appreciate your help. Unfortunately, a better description of my problem is given by the following HTML (I have also edited the HTML in my original post):

<!DOCTYPE html>
<html lang="en">

</head>

<body style="text-align: center;">

	 <div class="ttt-subhead"><span>8:00</span> Breakfast</div>


        <div class="ttt-row">
		<p>
			Eggs.
		</p>
		<p>
			Bacon.
		</p>
        </div>


	  <div class="ttt-subhead"><span>12:00</span> Lunch</div>

        <div class="ttt-row">
		<p>
			Burger.
		</p>
		<p>
			Fries.
		</p>
		<p>
			Coke.
		</p>
        </div>

	  <div class="ttt-subhead"><span>18:00</span> Dinner</div>

        <div class="ttt-row">
		<p>
			Salad.
		</p>
        </div>

</body>

</html>

Still not sure what you are trying to match. In your revised example, the div.ttt-subhead elements do not enclose the paragraphs.

The answer to my question using the above HTML is

count = [2,3,1]

This vector is obtained by counting the number of paragraphs between the “Breakfast” and “Lunch” containers, between the “Lunch” and “Dinner” containers, and after the “Dinner” container.

What about:

<div class="ttt-subhead"><span>8:00</span> Breakfast</div>
        <div class="ttt-row">
		<p>
			Eggs.
		</p>
		<p>
			Bacon.
		</p>
        </div>
        <div class="ttt-row">
		<p>
			Hashbrowns.
		</p>
        </div>
<div class="ttt-subhead"><span>12:00</span> Lunch</div>

I assume this should count as 3 rather than [2,1]?


Yes, in your example, the solution is [2,1].

But it is a little different from my example. In your example, there are two <div class="ttt-row"> elements between the <div class="ttt-subhead"> elements. In my example, there is only one <div class="ttt-row"> for each <div class="ttt-subhead">.


Then won’t just changing Selector("div.ttt-subhead") to Selector("div.ttt-row") in my proposal do the trick?

Please see my comment above, since it probably makes a difference.

I transliterated Python’s HTML Parser if that helps at all

Have you tried it out? If all paragraphs are wrapped in div.ttt-row, this should give you what you want.

Yes, unfortunately it didn’t work in my case. Could you perhaps try and modify your example so it matches mine, and then see if your prior solution works? If it does, then maybe it is because my simplified example does not capture my case.

I am not familiar with that package. Given your knowledge of it, would it work?

julia> using Cascadia

julia> myhtml = """<!DOCTYPE html>
       <html lang="en">

       </head>

       <body style="text-align: center;">

                <div class="ttt-subhead"><span>8:00</span> Breakfast</div>


               <div class="ttt-row">
                       <p>
                               Eggs.
                       </p>
                       <p>
                               Bacon.
                       </p>
               </div>


                 <div class="ttt-subhead"><span>12:00</span> Lunch</div>

               <div class="ttt-row">
                       <p>
                               Burger.
                       </p>
                       <p>
                               Fries.
                       </p>
                       <p>
                               Coke.
                       </p>
               </div>

                 <div class="ttt-subhead"><span>18:00</span> Dinner</div>

               <div class="ttt-row">
                       <p>
                               Salad.
                       </p>
               </div>

       </body>

       </html>"""
"<!DOCTYPE html>\n<html lang=\"en\">\n\n</head>\n\n<body style=\"text-align: center;\">\n\n         <div class=\"ttt-subhead\"><span>8:00</span> Breakfast</div>\n\n\n        <div class=\"ttt-row\">\n                <p>\n                        Eggs.\n                </p>\n                <p>\n                        Bacon.\n                </p>\n        </div>\n\n\n          <div class=\"ttt-subhead\"><span>12:00</span> Lunch</div>\n\n        <div class=\"ttt-row\">\n                <p>\n                        Burger.\n                </p>\n                <p>\n                        Fries.\n                </p>\n                <p>\n                        Coke.\n                </p>\n        </div>\n\n          <div class=\"ttt-subhead\"><span>18:00</span> Dinner</div>\n\n        <div class=\"ttt-row\">\n                <p>\n                        Salad.\n                </p>\n        </div>\n\n</body>\n\n</html>"

julia> using Gumbo

julia> dom = parsehtml(myhtml)
HTML Document:
<!DOCTYPE html>
HTMLElement{:HTML}:<HTML lang="en">
  <head></head>
  <body style="text-align: center;">
    <div class="ttt-subhead">
      <span>
        8:00
      </span>
      Breakfast
    </div>
    <div class="ttt-row">
      <p>
        Eggs.
      </p>
      <p>
        Bacon.
      </p>
    </div>
    <div class="ttt-subhead">
      <span>
        12:00
...


julia> map(eachmatch(Selector("div.ttt-row"), dom.root)) do d
         length(eachmatch(Selector("p"), d))
       end
3-element Vector{Int64}:
 2
 3
 1

It is certainly more manual: you feed the HTML into it and it invokes callbacks on open tags, close tags, and data (the text inside tags).

You then would inspect the attributes manually and set flags to know where you are in the document.

So it would definitely work, but so do you :slight_smile:
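To make the callback-and-flags idea concrete, here is a minimal sketch in plain Julia. It does not use that package's API: the names ParagraphCounter, on_open_tag! and feed! are hypothetical, and the event loop is a simple regex that only finds opening tags, which is a stand-in for a real parser and would break on comments, scripts, or unusual attribute quoting.

```julia
# Hypothetical handler state: one counter per subhead seen so far.
mutable struct ParagraphCounter
    counts::Vector{Int}
end

# Mimics an "open tag" callback: inspect the tag name and its raw attribute text.
function on_open_tag!(state::ParagraphCounter, name::AbstractString, attributes::AbstractString)
    class = match(r"class=\"([^\"]*)\"", attributes)
    if name == "div" && class !== nothing && "ttt-subhead" in split(class.captures[1])
        push!(state.counts, 0)      # a new subhead starts a new counter
    elseif name == "p" && !isempty(state.counts)
        state.counts[end] += 1      # paragraph belongs to the latest subhead
    end
    return state
end

# Stand-in for the parser's event loop (an assumption, not a real HTML parser).
function feed!(state::ParagraphCounter, html::AbstractString)
    for m in eachmatch(r"<([A-Za-z][A-Za-z0-9]*)([^>]*)>", html)
        on_open_tag!(state, lowercase(m.captures[1]), m.captures[2])
    end
    return state.counts
end
```

Calling feed!(ParagraphCounter(Int[]), myhtml) would then return one count per div.ttt-subhead, in document order.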

I can see that your original proposal works in your example. Unfortunately, it does not work for me. Perhaps the solution you had in mind above (the one that gave 3, and not [2,1]) is the right one in my case. How would your code be changed to accommodate that?

It’s verbatim your example!

Yes, you are right. I am sorry for not being very clear.

In my case, it turns out that the number of “p” elements inside each <div class="ttt-subrow"> is always the same. So, running your code, I get a vector with many instances of one number.

However, the number of “p” elements between consecutive <div class="ttt-subhead"> elements in my actual case varies, because the number of <div class="ttt-subrow"> elements may vary, like this, for example:

subhead

subrow

subhead

subrow
subrow
subrow

subhead

subrow
subrow

So, I think your solution of 3, rather than [2,1], was right all along. So, my question is: supposing that the answer is 3 in your HTML, what code finds that number? (Again, I am really sorry for botching up my explanation of the problem!)
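One possible way to get that number is to walk the parsed tree in document order, start a new counter at every div.ttt-subhead, and add each later “p” to the most recent counter, no matter how many row divs wrap the paragraphs. The sketch below uses only Gumbo; the helper names is_subhead and count_paragraphs! are made up for illustration, and it assumes the paragraphs you care about always come after the subhead they belong to.

```julia
using Gumbo

# Hypothetical helper: true for a <div> whose class list contains "ttt-subhead".
is_subhead(el::HTMLElement) =
    tag(el) == :div && "ttt-subhead" in split(get(attrs(el), "class", ""))

# Visit elements in document order. Each subhead opens a new counter;
# each <p> seen afterwards increments the most recent counter.
function count_paragraphs!(counts::Vector{Int}, node::HTMLElement)
    if is_subhead(node)
        push!(counts, 0)
    elseif tag(node) == :p && !isempty(counts)
        counts[end] += 1
    end
    for child in node.children
        count_paragraphs!(counts, child)
    end
    return counts
end

# Text nodes (anything that is not an element) contribute nothing.
count_paragraphs!(counts::Vector{Int}, ::HTMLNode) = counts

html = """
<div class="ttt-subhead"><span>8:00</span> Breakfast</div>
<div class="ttt-row"><p>Eggs.</p><p>Bacon.</p></div>
<div class="ttt-row"><p>Hashbrowns.</p></div>
<div class="ttt-subhead"><span>12:00</span> Lunch</div>
<div class="ttt-row"><p>Burger.</p></div>
"""

counts = count_paragraphs!(Int[], parsehtml(html).root)
println(counts)  # [3, 1]
```

Paragraphs that appear before the first subhead are ignored here; if they should be counted, the `!isempty(counts)` check is the place to change.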