To my dismay, I often find that the data I am after on a website is missing when I fetch it with the following lines of code:
```julia
using Gumbo  # provides parsehtml

page = parsehtml(read(download(url), String))  # download lives in the Downloads stdlib on Julia ≥ 1.6
collected = string(page)
```
Yet, when I use Inspect (Ctrl-Shift-I) in the browser, the information is there, clear as day.
So, my question is simple. How do I download all the data shown in the inspector, convert it to a string, and fetch the parts I want using the regularities that are there?
It is possible that the page is populated dynamically through AJAX. In that case you either need to make the necessary calls yourself, or, in more complicated scenarios, use something like Selenium.
What do you mean by “make necessary calls yourself” as opposed to using Selenium?
My task is not very complex, I don't think. Could you elaborate on what you mean by making a call? Perhaps provide an example?
I mean that you can go through the page source, identify the AJAX calls, and, since these calls are just requests to some other resource, issue them yourself. Alternatively, you can open the
Network tab in the browser's Web Developer Tools (usually opened with the F12 key) and, after a refresh, see all sorts of intermediate calls that can be used.
As an example, consider this page: https://www.nasdaq.com/market-activity/earnings If you try to
download it, the corresponding HTML table will be empty. But if you turn on the Tools, open the Network tab, and refresh the page, after some investigation you will find that there is a call to https://api.nasdaq.com/api/calendar/earnings?date=2021-05-19 which actually populates the table. So, instead of downloading the earnings HTML page, you can request
api.nasdaq.com directly and process the response data.
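To sketch what "doing the call yourself" could look like in Julia, here is a minimal example using the HTTP.jl and JSON3.jl packages. The response shape (`data.rows`), the field names, and the need for browser-like headers are all assumptions based on inspecting the endpoint, not guarantees:

```julia
using HTTP, JSON3

# Helper: pull the table rows out of the JSON body.
# The "data" -> "rows" nesting is an assumption from inspecting the response.
rows_from_json(body) = JSON3.read(body).data.rows

function earnings_rows(date::AbstractString)
    url = "https://api.nasdaq.com/api/calendar/earnings?date=$date"
    # A browser-like User-Agent is assumed to be needed; the API may reject bare clients.
    resp = HTTP.get(url, ["User-Agent" => "Mozilla/5.0", "Accept" => "application/json"])
    rows_from_json(resp.body)
end
```

Calling `earnings_rows("2021-05-19")` then returns the rows that the page's JavaScript would have rendered into the table, and you can read fields off each row (e.g. `row.symbol`, field name assumed) instead of regex-scraping HTML.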
What @Skoffer refers to is that on modern (dynamic) web pages, the content you see is typically loaded in a second step. Your `parsehtml(read(download(url), String))` does only the first step of this process.
Selenium is what you need, as @Skoffer recommended.