Web pages are often the primary source of information for keeping track of things. Twitter bots and journalists base much of their news on updated web pages. For example, organizations and businesses update their web pages when:
- changing prices
- adding new (financial) reports
- showing job offers
- updating legislation
Sometimes these things can be tracked via an RSS reader, but often not. Wouldn’t it be nice to have a script which checks some web pages for changes at a regular interval?
To solve this, there are some services available such as https://visualping.io/, https://www.wachete.com/, https://versionista.com/ and more. These all cost a few to many dollars a month.
This got me thinking. Basically, monitoring a website for changes requires three things:
- Some storage to remember what the page used to look like.
- Some server to check the page on a schedule.
- A way to send a notification.
This can all be done via GitHub. The storage is a git repository, which has the added benefit of diffs; the schedule can be handled via GitHub Actions; and notifications can be sent via GitHub issue comments. (I plan to add GitLab support later if enough people use the package.)
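The three ingredients fit together roughly as sketched below. This is a minimal illustration in Python, not Skans.jl's actual code. A local directory stands in for the git repository, a plain function call stands in for the scheduled GitHub Actions run, and a callback stands in for the issue comment. The names `check_page`, `state_dir`, and `notify` are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable


def check_page(url: str, current: str, state_dir: Path,
               notify: Callable[[str, str, str], None]) -> bool:
    """Compare `current` with the stored copy for `url`; return True on change.

    `current` is the page content already fetched by the caller, so this
    sketch never touches the network.
    """
    state_dir.mkdir(parents=True, exist_ok=True)
    # One file per URL; hashing the URL gives a filesystem-safe name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"
    path = state_dir / name

    previous = path.read_text(encoding="utf-8") if path.exists() else None
    path.write_text(current, encoding="utf-8")  # remember for the next run

    if previous is None or previous == current:
        return False  # first sighting or unchanged: nothing to report
    notify(url, previous, current)
    return True
```

Calling `check_page` once per tracked URL on every scheduled run is enough for this sketch. With a git repository as the storage, committing the state directory after each run would also keep the full history of every page.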
So, I’ve created Skans.jl (https://skans.dev). The template to get started is at SkansTemplate.
This package does the following things for each website that it is set up to track:
- Download the HTML page.
- Sanitize the HTML. (This aims to make the diff between the previous and current page versions more readable and to reduce the number of false positives, that is, to reduce the chance of getting notified when the page didn’t visibly change.)
- Compare the current HTML to the previous HTML.
- If the two HTML strings differ, create a diff and send a notification with the diff.
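The sanitize, compare, and diff steps above can be sketched as follows. This is an illustration in Python using only the standard library, not the package's implementation (Skans.jl relies on Gumbo for parsing). The specific sanitizing rules here are assumptions chosen for the example: drop attributes, comments, and `script`/`style` contents, and collapse whitespace. The names `sanitize` and `page_diff` are hypothetical.

```python
import difflib
import re
from html.parser import HTMLParser


class _Sanitizer(HTMLParser):
    """Keep tags and text; drop attributes, comments, and script/style bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []       # one tag or text fragment per entry
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            self.lines.append(f"<{tag}>")  # attributes are dropped on purpose

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self.lines.append(f"</{tag}>")

    def handle_data(self, data):
        text = re.sub(r"\s+", " ", data).strip()  # collapse whitespace
        if text and self._skip_depth == 0:
            self.lines.append(text)


def sanitize(html):
    """Return the page as a list of normalized lines suitable for diffing."""
    parser = _Sanitizer()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.lines


def page_diff(previous_html, current_html):
    """Return a unified diff of the sanitized pages, or None if they match."""
    old, new = sanitize(previous_html), sanitize(current_html)
    if old == new:
        return None
    return "\n".join(
        difflib.unified_diff(old, new, "previous", "current", lineterm="")
    )
```

Because the comparison runs on the sanitized lines, a changed CSS class or an inline script that embeds a timestamp does not trigger a notification in this sketch, while a changed price in the visible text does.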
For example, for testing, I monitor a site which always changes: https://bbc.com. The most recent notification for the BBC looks as follows (source):
My hope for the future of this package is that it will draw many new people to Julia! For example, if you want to monitor a large number of pages, you can either pay $99 per month or use Skans.jl and a GitHub runner for free, optionally with a GitHub Runner on your own computer. Also, with Skans.jl, you have much more control over what exactly the software is doing.
Thanks again to the Julia community and especially BinaryBuilder for providing Gumbo, which can gracefully sanitize “the insanity that passes for HTML out on the wild, wild web” (Gumbo.jl).