I am trying to parse the content of a website, and part of the logic for getting the right information is counting the number of times a paragraph of text (the element “p”) occurs between containers of a particular type (“div.ttt-subhead”, i.e., not all divs, only those with that class).
Can I use Gumbo or Cascadia for this? If so, how? If not, then how can I find the count?
Well, I am quite out of my depth. I have been able to extract p’s and div’s using the following code, but I don’t know how to “relate” the p’s and the div’s to each other:
for d in eachmatch(Selector("div.ttt-subhead"), body[1])
    println(d)
end
for p in eachmatch(Selector("p"), body[1])
    println(p)
end
Maybe your code works, but the HTML in your example is not correct, so Gumbo doesn’t parse it correctly. It should read:
<div class="ttt-subhead">
Then this works:
julia> using Cascadia, Gumbo
julia> dom = parsehtml(myhtml)
julia> map(eachmatch(Selector("div.ttt-subhead"), dom.root)) do d
           length(eachmatch(Selector("p"), d))
       end
3-element Vector{Int64}:
1
2
1
where myhtml is the (corrected) example from your post.
Sorry, I must learn to ask the right question. I appreciate your help. Unfortunately, a better description of my problem is given by the following HTML (I have also edited the HTML in my original post):
This vector is obtained by counting the number of paragraphs between the “Breakfast” and “Lunch” container, the “Lunch” and “Dinner” container, and after the “Dinner” container.
Yes, unfortunately it didn’t work in my case. Could you perhaps try and modify your example so it matches mine, and then see if your prior solution works? If it does, then maybe it is because my simplified example does not capture my case.
I can see that your original proposal works in your example. Unfortunately, it does not work for me. Perhaps the solution you had in mind above (the one that gave 3, and not [2,1]) is the right one in my case. How would your code be changed to accommodate that?
Yes, you are right. I am sorry for not being very clear.
In my case, it turns out that the number of “p” elements inside each <div class="ttt-subrow"> is always the same. So, running your code, I get a vector with many instances of one number.
However, the number of “p” elements between each <div class="ttt-subhead"> in my actual case varies, because the number of <div class="ttt-subrow"> elements may vary, like this, for example:
subhead
subrow
subhead
subrow
subrow
subrow
subhead
subrow
subrow
So, I think your solution of 3, rather than [2,1], was right all along. So my question is: supposing that the answer is 3 in your HTML, what code finds that number? (Again, I am really sorry for botching my explanation of the problem!)
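One possible approach, sketched here under assumptions: if the “ttt-subhead” divs are siblings of the “ttt-subrow” divs rather than their parents, a nested selector cannot relate them, but a walk over the tree in document order can. The sketch below starts a new counter each time it meets a div with the class “ttt-subhead” and adds every later “p” to the most recent counter. The sample HTML and the names `is_subhead` and `count_p_after_subheads!` are made up for illustration; the HTML only mirrors the subhead/subrow layout listed above, not the real page. It uses Gumbo alone (`parsehtml`, `tag`, `getattr`, and the `children` field of an element).

```julia
using Gumbo

# Hypothetical stand-in for the real page: subheads and subrows are siblings.
html = """
<div class="ttt-subhead">Breakfast</div>
<div class="ttt-subrow"><p>a</p></div>
<div class="ttt-subhead">Lunch</div>
<div class="ttt-subrow"><p>b</p></div>
<div class="ttt-subrow"><p>c</p></div>
<div class="ttt-subrow"><p>d</p></div>
<div class="ttt-subhead">Dinner</div>
<div class="ttt-subrow"><p>e</p></div>
<div class="ttt-subrow"><p>f</p></div>
"""

# True for <div> elements whose class list contains "ttt-subhead".
is_subhead(node) = node isa HTMLElement && tag(node) == :div &&
    "ttt-subhead" in split(getattr(node, "class", ""))

# Visit nodes in document order. Each subhead opens a new counter;
# each <p> seen afterwards is added to the most recent counter.
function count_p_after_subheads!(counts, node)
    node isa HTMLElement || return counts   # skip text nodes
    if is_subhead(node)
        push!(counts, 0)
    elseif tag(node) == :p && !isempty(counts)
        counts[end] += 1
    end
    for child in node.children
        count_p_after_subheads!(counts, child)
    end
    return counts
end

doc = parsehtml(html)
counts = count_p_after_subheads!(Int[], doc.root)
println(counts)
```

For the sample above this should give one count per subhead (1, 3 and 2). Paragraphs that appear before the first subhead are ignored, and a “p” nested inside a subhead div itself would be counted towards that subhead, so the rule may need adjusting for the real page.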