I am trying to get the HTML source of a public web page using the
HTTP.request function, in such a way that it is feasible to scrape some numeric data.
r = HTTP.request("GET", "https://www.euronext.com/en/products/bonds/NL0000168714-XAMS/market-information")
The data set I want to scrape (and only that set) is always enclosed in
<strong>...</strong>, which is rather convenient. Unfortunately, the
HTTP.request response does not carry those
<strong>...</strong> tags. However, I can get the data, for example, by first saving the web page to a file and reading it afterwards.
I can also use
getPageSource() in Selenium WebDriver in Java, or possibly
Selenium.jl, which, I believe, calls Python, which in turn calls Java. But this way of doing it is somewhat undesirable, also because one first has to open the web page so that Selenium can get the page source.
Is there a simpler way of doing this in Julia?
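For reference, a minimal sketch of what I am attempting, assuming the <strong> tags were actually present in the response (the HTML fragment below is invented for illustration; the real request is shown commented out):

```julia
using HTTP

# The actual request (HTTP.get is shorthand for HTTP.request("GET", ...)):
# r = HTTP.get("https://www.euronext.com/en/products/bonds/NL0000168714-XAMS/market-information")
# html = String(r.body)

# Invented sample fragment of the kind of HTML I expected to receive:
html = "<td>Coupon</td><td><strong>2.5</strong></td><td><strong>Perpetual</strong></td>"

# Collect everything enclosed in <strong>...</strong> (non-greedy, "s" flag
# lets "." match across newlines)
values = [m.captures[1] for m in eachmatch(r"<strong>(.*?)</strong>"s, html)]
```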
Those tags aren’t in the HTTP response because they’re not actually present in the page’s source. Try loading the site, right-clicking, and doing “view source”. You’ll see that, for example, the
Perpetual bond label is nowhere to be found.
I’m afraid I don’t know much about using Selenium, but I suspect your options are to either:
Thank you very much for your detailed reply.
If you want to look at a Julia solution, Blink.jl provides something similar.
Given how much effort they are taking to prevent scraping, I’d urge you to check the terms of service for the website. In my experience, scraping exchange websites for pricing data is unfortunately never successful long term.
It seems like this page simply loads parts of the HTML after the main layout and scripts have been loaded. Looking at the Network tab in the Firefox developer tools, I see two additional requests for HTML:
These return plain HTML with the
<strong> tags you are looking for, so perhaps you can scrape these URLs instead of the original page.
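A sketch of how one of those fragment URLs might be requested directly. The URL is a placeholder (substitute the actual request URL you see in the Network tab), and the headers are an assumption: some endpoints only respond when the request looks like the browser's XHR call.

```julia
using HTTP

# Placeholder URL -- replace with the real fragment URL from the Network tab
url = "https://www.euronext.com/..."

# Hypothetical headers mimicking the browser's background request
headers = ["User-Agent" => "Mozilla/5.0",
           "X-Requested-With" => "XMLHttpRequest"]

# Network call hedged out; uncomment to fetch the fragment:
# r = HTTP.get(url, headers)
# fragment = String(r.body)
# strongs = [m.captures[1] for m in eachmatch(r"<strong>(.*?)</strong>"s, fragment)]
```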
I don’t think the website attempts to prevent scraping, but I’d indeed check their terms and maybe contact them directly for clarification if you are going to take this project to production.
This request solves the problem. In the meantime I had managed to get this data using
PyCall + PyQt5.QtWebEngineWidgets (which implements a “lite” web browser based on Chromium), but although that is a generic approach, it is way too cumbersome and adds huge dependencies…
Thank you very much for all the replies. It is actually very easy to get price data from this exchange website, since they provide a download button! The easy way in which they make this information available should be commended.
However, there is no such facility for the factsheet information about each of the many bond issues listed there (>5000). This is the information I am looking for. I contacted them about two weeks ago about getting this data, but if I understood correctly they outsource the management of their website to an external company, so it is understandable that it will take some time to get an answer.
Just to add on: most websites have a robots.txt file at the root (see https://www.euronext.com/robots.txt) which tells you what they would like you to do or not do. I am not sure whether it is enforceable by law, as I have heard of a case where LinkedIn tried to stop a company from scraping their website and the judge allowed the scraper to continue operating.
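A quick sketch of inspecting robots.txt from Julia. The sample body below is invented for illustration; uncomment the HTTP call to fetch the real file:

```julia
using HTTP

# Fetch the real file:
# body = String(HTTP.get("https://www.euronext.com/robots.txt").body)

# Invented sample body for illustration:
body = """
User-agent: *
Disallow: /admin/
Disallow: /search/
"""

# Collect the paths the site asks crawlers not to visit
disallowed = [strip(split(line, ":"; limit=2)[2])
              for line in split(body, '\n')
              if startswith(line, "Disallow:")]
```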