Downloading all information from Ctrl-Shift-I

Nash · May 19, 2021, 11:11am

To my dismay, I often find that the data I am after on a website is missing from what I get with the following code:

using Gumbo
using Cascadia

page = parsehtml(read(download(url), String))
collected = string(page)

Yet when I inspect the page in the browser with Ctrl-Shift-I, the information is there, clear as day.

So, my question is simple. How do I download all the data contained in Ctrl-Shift-I, convert it to string, and fetch the parts I want using the regularities that are there?

Skoffer · May 19, 2021, 11:33am

It is possible that the page content is loaded dynamically through AJAX. In that case you either need to make the necessary calls yourself or, in more complicated scenarios, use something like Selenium.

Nash · May 19, 2021, 11:47am

What do you mean by “make necessary calls yourself” as opposed to using Selenium?

My task is not very complex, I don’t think. Could you elaborate on what you mean by making a call? Perhaps provide an example?

Skoffer · May 19, 2021, 11:58am

I mean that you can go through the page source and identify the AJAX calls; since these calls are just requests to some other resource, you can make them yourself. Or you can use the Network tab in the browser's web developer tools (usually opened with the F12 key): after a refresh you can see all sorts of intermediate calls, which you can use.
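To illustrate the "go through the page source" route, here is a minimal sketch that lists absolute URLs appearing inside inline script blocks of an already downloaded HTML string. The function name `script_urls` is made up for this example, and the regexes are a rough heuristic rather than a real HTML parser. Calls made from external JavaScript files or built from relative paths will not show up this way, which is why the Network tab is usually the quicker route.

```julia
# Hypothetical helper: collect absolute URLs found inside inline <script> blocks.
# This is a heuristic text scan, not a full HTML or JavaScript parser.
function script_urls(html::AbstractString)
    urls = String[]
    # "i" = case-insensitive, "s" = let "." match newlines inside the script body
    for block in eachmatch(r"<script\b[^>]*>(.*?)</script>"is, html)
        # Take anything that looks like an http(s) URL up to a quote, bracket or whitespace
        for m in eachmatch(r"https?://[^\s\"'<>)]+", block.captures[1])
            push!(urls, m.match)
        end
    end
    return unique(urls)
end
```

Any URL it returns is only a candidate; you would still need to check in the browser which of them actually delivers the data you want.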

As an example, consider this page: https://www.nasdaq.com/market-activity/earnings If you try to download it, the corresponding HTML will be empty. But you can open the developer tools, go to the Network tab, refresh the page, and after some investigation you will find that there is a call to https://api.nasdaq.com/api/calendar/earnings?date=2021-05-19 which actually populates the table. So, instead of downloading the earnings HTML page, you can send a request directly to api.nasdaq.com and process the response data.
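A minimal sketch of requesting that endpoint directly, using only the `Dates` and `Downloads` standard libraries (this assumes Julia 1.6 or later). The names `EARNINGS_API`, `earnings_url` and `fetch_body` are made up for this example. The browser-like `User-Agent` header is an assumption: some servers reject or stall requests without one, so adjust the headers to whatever the Network tab shows for the real call.

```julia
using Dates
using Downloads

# Endpoint as it appears in the Network tab (taken from the post above).
const EARNINGS_API = "https://api.nasdaq.com/api/calendar/earnings"

# Build the request URL for a given day, e.g. ...?date=2021-05-19
earnings_url(day::Date) = string(EARNINGS_API, "?date=", Dates.format(day, "yyyy-mm-dd"))

# Fetch the raw response body as a String.
# The headers below are an assumption, not something the API is known to require.
function fetch_body(url::AbstractString; timeout = 30)
    headers = ["User-Agent" => "Mozilla/5.0", "Accept" => "application/json"]
    buffer = Downloads.download(url, IOBuffer(); headers = headers, timeout = timeout)
    return String(take!(buffer))
end
```

Calling `fetch_body(earnings_url(Date(2021, 5, 19)))` should return the raw response text, which you can then hand to a JSON package of your choice. The layout of the response is not shown here; inspect it before relying on any particular field.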

This is different from Selenium, which is basically a full browser: it executes all the JavaScript on the page, so you do not need to work through the calls or read the source. You can just grab the resulting HTML page.

oheil · May 19, 2021, 12:37pm

What @Skoffer refers to is that on modern (dynamic) web pages the content you see is typically loaded in a second step.
In the first step, the browser loads the HTML, CSS and JavaScript code.
In the second step, the JavaScript code is executed and fills additional content into the HTML containers. This additional content is loaded from the server through AJAX requests made by the JavaScript code.

download(url) does only the first step of this process.

Selenium is what you need, as @Skoffer recommended.
