Downloading all information from Ctrl-Shift-I

To my dismay, I often find that the data I am after on a website is missing when I download the page with the following code:

using Gumbo
using Cascadia

page = parsehtml(read(download(url), String))
collected = string(page)

Yet, when I inspect the page with Ctrl-Shift-I in the browser, the information is there, clear as day.

So, my question is simple. How do I download all the data shown in Ctrl-Shift-I, convert it to a string, and fetch the parts I want using the regularities that are there?

It is possible that the page is loaded dynamically through AJAX. In this case you either need to make the necessary calls yourself or, in more complicated scenarios, you can use something like Selenium.


What do you mean by “make necessary calls yourself” as opposed to using Selenium?

My task is not very complex, I don’t think. Could you elaborate on what you mean by making a call? Perhaps provide an example?

I mean, you can go through the page source and identify the AJAX calls. Since these calls are just requests to some other resource, you can make them yourself. Or you can use the Network tab in the browser's Web Developer Tools (usually opened with the F12 key); after a refresh you can see all sorts of intermediate calls, which you can reuse.

As an example, consider this page: https://www.nasdaq.com/market-activity/earnings If you try to download it, the corresponding html will be empty. But you can open the Developer Tools, switch to the Network tab, refresh the page, and after some investigation you will find that there is a call to https://api.nasdaq.com/api/calendar/earnings?date=2021-05-19 which actually populates the table. So, instead of downloading the earnings html page, you can request api.nasdaq.com directly and process the response data.
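A minimal sketch of calling that endpoint directly from Julia. It assumes the HTTP.jl and JSON.jl packages are installed; the function names `earnings_url` and `fetch_earnings` are made up for illustration, and the `User-Agent` header is an assumption (some APIs reject requests that do not look like they come from a browser). The layout of the returned JSON is not described here, so the example only inspects the top-level keys.

```julia
using HTTP    # assumed installed: ] add HTTP
using JSON    # assumed installed: ] add JSON
using Dates

# Build the endpoint URL seen in the Network tab for a given date.
# (`earnings_url` is a hypothetical helper name.)
earnings_url(date::Date) =
    "https://api.nasdaq.com/api/calendar/earnings?date=" *
    Dates.format(date, "yyyy-mm-dd")

# Request the endpoint and decode the JSON body into Julia Dicts/Arrays.
# The headers are an assumption, not something the API is known to require.
function fetch_earnings(date::Date)
    headers = ["User-Agent" => "Mozilla/5.0", "Accept" => "application/json"]
    resp = HTTP.get(earnings_url(date), headers)
    return JSON.parse(String(resp.body))
end

# Usage (requires network access):
# data = fetch_earnings(Date(2021, 5, 19))
# println(keys(data))   # look at the structure before extracting fields
```

Once you know the structure of the response, you can index into the resulting Dict instead of searching the html for patterns.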

This is different from Selenium, which basically drives a full browser: it executes all the JavaScript on the page, so you do not need to work through the calls or read the source. You can just grab the resulting html page.


What @Skoffer refers to is that on modern (dynamic) web pages the content you see is typically loaded in a second step.
The first step is that the browser loads the html, css and javascript code.
The second step is that the javascript code is executed and fills additional content into html containers. This additional content is requested from the server by the javascript code (AJAX).

download(url) does only the first step of this process.
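To see what this means in practice, here is a small sketch using Gumbo and Cascadia on a hand-written html string that stands in for what download(url) would return from such a page. The html, the `#app` container id and the `app.js` script name are invented for illustration; on a real page you would look for whichever container the browser's inspector shows your data in.

```julia
using Gumbo
using Cascadia

# Stand-in for the raw html of a dynamic page: an empty container
# plus a script that would fill it in a real browser.
html = """
<html><body>
  <div id="app"></div>
  <script src="app.js"></script>
</body></html>
"""

page = parsehtml(html)

# The container exists in the static html ...
containers = eachmatch(Selector("#app"), page.root)
println("containers found: ", length(containers))

# ... but it has no children, and no table rows exist yet,
# because the javascript that creates them was never executed.
println("container is empty: ", isempty(containers[1].children))
rows = eachmatch(Selector("table tr"), page.root)
println("table rows found: ", length(rows))
```

If the same check on your downloaded page shows an empty container where the browser shows data, the content is being loaded in the second step.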

Selenium is what you need, as @Skoffer recommended.
