Scraping a "scrape-unfriendly" web page in Julia


#1

I am trying to get the HTML source of a public web page using the HTTP.request function, so that I can scrape some numeric data from it.

using HTTP
r = HTTP.request("GET", "https://www.euronext.com/en/products/bonds/NL0000168714-XAMS/market-information")

The data I want to scrape (and only that data) always sits between <strong>...</strong> tags, which is rather convenient. Unfortunately, the HTTP.request response does not contain those <strong>...</strong> tags. However, I can get the data by, for example, first saving the web page to a file and then reading that file.

I could also use getPageSource() in Selenium WebDriver from Java, or possibly Selenium.jl, which I believe calls Python, which in turn calls Java. But that approach is somewhat undesirable, partly because one first has to open the web page so that Selenium can get the page source.

Is there a simpler way of doing this in Julia?
Thank you


#2

Those tags aren’t in the HTTP response because they’re not actually present in the page’s source. Try loading the site, right-clicking, and doing “view source”. You’ll see that, for example, the Perpetual bond label is nowhere to be found.

What’s happening is that the page loads some JavaScript code that queries their database and then modifies the displayed page. When you download the page from your web browser, you’re getting the page after it’s been modified by JavaScript.

So, essentially, the data you want doesn’t exist until after some JavaScript has run, which can’t happen in a regular HTTP request.

I’m afraid I don’t know much about using Selenium, but I suspect your options are to either:

  1. Scrape after letting javascript run
  2. Call the API that the page’s javascript is querying directly, rather than trying to scrape it from a website.
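
A quick way to confirm this from Julia is to test whether the raw response body contains the tag at all. The sketch below does not touch the real site: both HTML strings and the helper name `has_strong` are made up for illustration, one mimicking a JavaScript-driven shell and the other the page after the scripts have run.

```julia
# Hypothetical page bodies, invented for this example.
raw_source = """
<html><body>
<div id="factsheet"></div>
<script src="/widget.js"></script>
</body></html>
"""

rendered_source = """
<html><body>
<div id="factsheet"><strong>Perpetual bond</strong></div>
</body></html>
"""

# True when the HTML text contains at least one opening <strong> tag.
has_strong(html::AbstractString) = occursin(r"<strong\b[^>]*>"i, html)

println(has_strong(raw_source))       # false: the tag is not in the raw source
println(has_strong(rendered_source))  # true: the tag exists after rendering
```

With HTTP.jl, the same check could be applied to a live response via `has_strong(String(r.body))`.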

#3

Thank you very much for your detailed reply.

I see, so it will be difficult to find a simple way of doing this. I suppose, therefore, that the best solution is to let JavaScript run first. Is there a way to access those JavaScript-modified pages from Julia without resorting to saving them to a file?


#4

I’m afraid I don’t know. But if I were trying to solve this myself, I would investigate option 2 by trying to figure out what the javascript code itself is doing. If the javascript on the page is just querying some other API, then perhaps you can also query that API directly and get the raw data without any scraping at all.


#5

Yes, given that the page is built by JavaScript, an HTTP-request-based scraping method will not work. You essentially need a real browser executing the JavaScript. Selenium (or Watir) is a good way to accomplish that.

If you want to look at a Julia solution, Blink.jl provides something similar.

Given how much effort they are taking to prevent scraping, I’d urge you to check the terms of service for the website. In my experience, scraping exchange websites for pricing data is unfortunately never successful long term.

Regards

Avik


#6

Another option that lets you scrape the page after Javascript has done its thing is PhantomJS.jl. Here’s an old blog post that demonstrates how to do it.


#7

It seems like this page simply loads parts of the HTML after the main layout and scripts have been loaded. Looking at the Network tab in the Firefox developer tools, I see two additional requests for HTML:

https://www.euronext.com/en/nyx-market-data-widget?isin=NL0000168714&mic=XAMS&productType=bonds
https://www.euronext.com/en/factsheet-ajax?instrument_id=NL0000168714-XAMS&instrument_type=bonds

These return plain HTML containing the <strong> tags you are looking for, so perhaps you can scrape these URLs instead of the original page.
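
As a rough sketch of that idea, the helpers below build the two fragment URLs for a given instrument and pull the text out of every <strong>...</strong> pair. The function names `fragment_urls` and `strong_values` are hypothetical, the assumption that the URL pattern holds for other ISIN/MIC pairs is untested, and the sample HTML is invented rather than taken from the site. A regex is used only because the target is a single simple tag; a real HTML parser would be more robust.

```julia
# Build the two fragment URLs for an ISIN and market identifier code (MIC).
# The pattern is copied from the two requests listed above; whether it holds
# for other instruments is an assumption.
function fragment_urls(isin::AbstractString, mic::AbstractString;
                       product::AbstractString = "bonds")
    base = "https://www.euronext.com/en"
    return (
        widget = "$base/nyx-market-data-widget?isin=$isin&mic=$mic&productType=$product",
        factsheet = "$base/factsheet-ajax?instrument_id=$isin-$mic&instrument_type=$product",
    )
end

# Return the text inside every <strong>...</strong> pair, with any nested
# tags removed and surrounding whitespace trimmed.
function strong_values(html::AbstractString)
    pattern = r"<strong\b[^>]*>(.*?)</strong>"is
    return [String(strip(replace(m.captures[1], r"<[^>]+>" => "")))
            for m in eachmatch(pattern, html)]
end

# Invented sample fragment, only to show the extraction step.
sample = """
<tr><td>Coupon</td><td><strong> 2.5% </strong></td></tr>
<tr><td>Type</td><td><strong>Perpetual bond</strong></td></tr>
"""

urls = fragment_urls("NL0000168714", "XAMS")
println(urls.factsheet)
println(strong_values(sample))
```

With HTTP.jl, the two steps could be combined as `strong_values(String(HTTP.get(urls.factsheet).body))`.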

I don’t think the website attempts to prevent scraping, but I’d indeed check their terms and maybe contact them directly for clarification if you are going to take this project to production.


#8

This request solves the problem. In the meantime I had managed to get this data using PyCall + PyQt5.QtWebEngineWidgets (which implements a “lite” web browser based on Chromium). Although that is a generic approach, it is far too cumbersome and adds huge dependencies…

Thank you very much for all the replies. It is very easy to get price data from this exchange website, since they provide a download button! They should be commended for making this information so easy to obtain.

However, there is no such facility for the factsheet information about each of the many bond issues listed there (>5000). This is the information I am looking for. I contacted them about 2 weeks ago about getting this data, but if I understood correctly, they outsource the management of their website to an external company. Therefore, it is understandable that it will take some time to get an answer.


#9

Just to add: most websites have a robots.txt at the root (see https://www.euronext.com/robots.txt), which tells you what they would or would not like crawlers to do. I am not sure whether it’s enforceable by law, as I have heard of a case where LinkedIn tried to stop a company from scraping its website and the judge allowed the scraper to continue operating.
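
For what it’s worth, here is a minimal sketch of reading such a file in Julia. It only handles the simplest case (Disallow lines under `User-agent: *`, matched as plain path prefixes) and ignores Allow rules, wildcards, and grouped user agents. The helper names and the sample text are hypothetical and are not the contents of the real Euronext file.

```julia
# Collect the Disallow path prefixes that apply to every crawler ("*").
function disallowed_prefixes(robots::AbstractString)
    prefixes = String[]
    applies = false                      # are we inside a "User-agent: *" group?
    for line in split(robots, '\n')
        text = strip(first(split(line, '#')))   # drop trailing comments
        isempty(text) && continue
        parts = split(text, ':'; limit = 2)
        length(parts) == 2 || continue
        field = lowercase(strip(parts[1]))
        value = strip(parts[2])
        if field == "user-agent"
            applies = value == "*"
        elseif field == "disallow" && applies && !isempty(value)
            push!(prefixes, value)
        end
    end
    return prefixes
end

# A path is treated as allowed when no disallowed prefix matches it.
is_allowed(path::AbstractString, prefixes) = !any(p -> startswith(path, p), prefixes)

# Invented robots.txt content, for illustration only.
sample_robots = """
User-agent: *
Disallow: /admin/
Disallow: /search   # internal search

User-agent: examplebot
Disallow: /
"""

rules = disallowed_prefixes(sample_robots)
println(rules)
println(is_allowed("/en/products/bonds", rules))
println(is_allowed("/admin/login", rules))
```

The real file could be fetched with HTTP.jl and passed in as `disallowed_prefixes(String(HTTP.get("https://www.euronext.com/robots.txt").body))`.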