Pull data from websites in Julia

Albert_Zevelev · December 1, 2022, 7:25pm

As ski resorts begin to open, they update data on their websites throughout the day, every day.
Two factors skiers care about are: (1) the number of trails currently open, (2) the current snow base.

Each resort website posts this info:
Park City trails:
Park City snow:

Vail Trails:
Vail Snow:

Whistler Trails:
Whistler Snow:

Can I automatically pull this data from the links above using Julia?
(skicentral.com does this, but not particularly well & I’d like to learn how to do it in Julia if possible)

Jeff_Emanuel · December 1, 2022, 7:34pm

Scraping with Julia: Scraping web pages with Julia HTTP & Gumbo: Tutorial
Someone’s scraping project for ski snow reports, but not in Julia: Web Scraping for custom API (DEV Community)

You should be able to combine parts from both to get what you want.

Scraping is notoriously fussy and fragile.

For example, you can get more results at once from Snow Report | Colorado Ski Country USA.

Albert_Zevelev · December 1, 2022, 8:52pm

Here is my very raw, very naive attempt.
What’s amazing is I have zero experience “webscraping” & I still got it to work.

using HTTP;
l_pc ="https://www.parkcitymountain.com/the-mountain/mountain-conditions/terrain-and-lift-status.aspx";
l_v  ="https://www.vail.com/the-mountain/mountain-conditions/terrain-and-lift-status.aspx";
l_w  = "https://www.whistlerblackcomb.com/the-mountain/mountain-conditions/terrain-and-lift-status.aspx";

# Number of "runs" open in Park City
r  = HTTP.get(l_pc); # get link 
rs = String(r.body);  # make into long string 
a1="id=\"runs\">\r\n\r\n                                            <div class=\"terrain_summary__circle\"\r\n                                                    data-open="
a2 = findfirst(a1, rs)  # index range of the first occurrence of a1 in rs
a3 = rs[(a2[end] + 2):(a2[end] + 4)]  # the 3 characters after the opening quote
#################

# Number of "runs" open in  Vail 
r  = HTTP.get(l_v); # get link 
rs = String(r.body);  # make into long string 
a1="id=\"runs\">\r\n\r\n                                            <div class=\"terrain_summary__circle\"\r\n                                                    data-open="
a2 = findfirst(a1, rs)  # index range of the first occurrence of a1 in rs
a3 = rs[(a2[end] + 2):(a2[end] + 4)]  # the 3 characters after the opening quote
#################

# Number of "runs" open in Whistler
r  = HTTP.get(l_w); # get link 
rs = String(r.body);  # make into long string 
a1="id=\"runs\">\r\n\r\n                                            <div class=\"terrain_summary__circle\"\r\n                                                    data-open="
a2 = findfirst(a1, rs)  # index range of the first occurrence of a1 in rs
a3 = rs[(a2[end] + 2):(a2[end] + 4)]  # the 3 characters after the opening quote
#################

Jeff_Emanuel · December 1, 2022, 9:00pm

All three of those are owned by Vail Resorts, so it’s not too surprising that the snow report html is similar. You’ll need something different for other resorts.

Albert_Zevelev · December 1, 2022, 9:02pm

Yeah, I’m currently on Epic pass…
Here is a cleaner way to do it with a single search string. (Still no good, and this morning I didn’t know what “webscraping” meant.)

using HTTP;
l_pc ="https://www.parkcitymountain.com/the-mountain/mountain-conditions/terrain-and-lift-status.aspx";
l_v  ="https://www.vail.com/the-mountain/mountain-conditions/terrain-and-lift-status.aspx";
l_w  = "https://www.whistlerblackcomb.com/the-mountain/mountain-conditions/terrain-and-lift-status.aspx";

# Marker string that sits right before the number of runs open.
a1="id=\"runs\">\r\n\r\n                                            <div class=\"terrain_summary__circle\"\r\n                                                    data-open="
RUNS = [];

for resort in [l_pc, l_v, l_w]
    r  = HTTP.get(resort)   # fetch the page
    rs = String(r.body)     # response body as one long string
    a2 = findfirst(a1, rs)  # index range of the marker a1 in rs
    # Take the 3 characters after the opening quote.
    # For a two-digit count this also picks up the closing quote.
    num_runs_open = rs[(a2[end] + 2):(a2[end] + 4)]
    push!(RUNS, num_runs_open)
end

julia> RUNS
3-element Vector{Any}:
 "120"
 "86\""
 "30\""

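The trailing quote in "86\"" comes from always slicing exactly three characters. A minimal sketch of one way around that, assuming the markup keeps the shape of the marker string above (an element with id="runs" followed by a div carrying data-open="..."): capture however many digits there are with a regex. The helper name runs_open and the sample strings are made up for illustration and are not real resort data.

```julia
# Hypothetical helper: return the data-open count that follows id="runs",
# or nothing if the pattern is not found.
function runs_open(html::AbstractString)
    m = match(r"id=\"runs\">\s*<div[^>]*?data-open=\"(\d+)\"", html)
    m === nothing && return nothing
    return parse(Int, m.captures[1])
end

# Made-up snippet shaped like the marker string (assumption, not fetched from a site).
sample(n) = "<div id=\"runs\">\r\n\r\n    <div class=\"terrain_summary__circle\"\r\n        data-open=\"$(n)\" data-total=\"300\">"

two_digit   = runs_open(sample(86))
three_digit = runs_open(sample(120))
not_found   = runs_open("<p>no terrain summary here</p>")
```

This still depends on the exact page layout, so it is only marginally less fragile than the fixed-width slice.
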
chris-b1 · December 1, 2022, 9:39pm

Here’s a slightly more robust solution using an HTML/XML parser. The heavy lifting is the "//div"... line, an XPath query that finds a div whose data-terrain-status-id attribute equals "runs" and then selects its child div. The data-open attribute of that node holds the number of open runs.

using EzXML
using HTTP

function get_open(url)
    r = HTTP.get(url)
    tree = EzXML.parsehtml(r.body)
    node = findfirst("//div[@data-terrain-status-id=\"runs\"]/div", tree)
    val = node["data-open"]
    return parse(Int, val)
end
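
To try the XPath query without hitting the network, the parsing step can be separated from the download. The following is a sketch under assumptions: the helper name runs_from_html and the sample HTML (including its numbers) are invented for illustration, and the data-total attribute is assumed to sit next to data-open. It also returns nothing instead of throwing when the node is absent.

```julia
using EzXML

# Hypothetical helper: parse already-downloaded HTML and return the run counts,
# or nothing if the expected node is not on the page.
function runs_from_html(html::AbstractString)
    doc = EzXML.parsehtml(html)
    node = findfirst("//div[@data-terrain-status-id=\"runs\"]/div", doc)
    node === nothing && return nothing
    return (open_runs = parse(Int, node["data-open"]),
            total_runs = parse(Int, node["data-total"]))
end

# Made-up page fragment (assumption about the markup, not real data).
sample = """
<html><body>
<div id="runs" data-terrain-status-id="runs">
  <div class="terrain_summary__circle" data-open="86" data-total="300"></div>
</div>
</body></html>
"""

found  = runs_from_html(sample)
absent = runs_from_html("<html><body><p>closed</p></body></html>")
```

With a live page, something like runs_from_html(String(HTTP.get(url).body)) would plug the two steps back together.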

j-fu · December 1, 2022, 10:19pm

Try Gumbo.jl. It parses HTML and gives you structured access to the elements.
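
A rough sketch of what that could look like for the runs count. Note the assumptions: Cascadia.jl (CSS selectors on top of Gumbo) is an extra package not mentioned in this thread, and the sample HTML and its numbers are made up.

```julia
using Gumbo, Cascadia

# Made-up page fragment (assumption about the markup, not real data).
html = """
<html><body>
<div id="runs" data-terrain-status-id="runs">
  <div class="terrain_summary__circle" data-open="86" data-total="300"></div>
</div>
</body></html>
"""

doc = Gumbo.parsehtml(html)

# CSS selector: direct child div of the div whose data-terrain-status-id is "runs".
matches = eachmatch(Selector("div[data-terrain-status-id=\"runs\"] > div"), doc.root)

# Read the attribute from the first match, if there is one.
open_runs = isempty(matches) ? nothing : parse(Int, getattr(first(matches), "data-open"))
```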

Here is an example from the time of the coronaplotting craze.

Albert_Zevelev · December 2, 2022, 12:27am

@chris-b1 thanks!

using HTTP, EzXML
s1 = "https://www."
# Named `resorts` rather than `names` to avoid shadowing Base.names
resorts = ["parkcitymountain", "vail", "whistlerblackcomb", "beavercreek", "breckenridge", "northstarcalifornia"]
sruns = ".com/the-mountain/mountain-conditions/terrain-and-lift-status.aspx"
ssnow = ".com/the-mountain/mountain-conditions/snow-and-weather-report.aspx"
RUNS = []
for resort in resorts
    # Terrain page: read the open/total run counts from the "runs" summary node
    link_runs = s1 * resort * sruns
    r = HTTP.get(link_runs)
    tree = EzXML.parsehtml(r.body)
    node_runs = findfirst("//div[@data-terrain-status-id=\"runs\"]/div", tree)
    num_runsopen  = parse(Int, node_runs["data-open"])
    num_runstotal = parse(Int, node_runs["data-total"])
    push!(RUNS, [resort num_runsopen num_runstotal])  # one 1x3 row per resort
    # Snow page: fetched and parsed, but nothing is extracted yet
    link_snow = s1 * resort * ssnow
    r = HTTP.get(link_snow)
    tree = EzXML.parsehtml(r.body)
    # How do we get: snowfall.Depth.Inches etc???
end
pushfirst!(RUNS, ["Resort" "Runs Open" "Runs Total"])  # header row
RUNS = vcat(RUNS...)  # stack the rows into a single matrix

This gives a matrix with a header row (Resort, Runs Open, Runs Total) and one row per resort.

I wonder if there is an easy way to scrape all the ski resort websites from https://www.epicpass.com/?

Also it looks more complicated to scrape snowfall “BASE DEPTH” & “CURRENT SEASON”.
(not sure the approach used for RUNS above would work…)

Jeff_Emanuel · December 2, 2022, 3:43pm

Try scraping for the individual links at Snow and Weather Reports | Snow.com
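
A minimal sketch of collecting such links with EzXML, assuming the index page exposes them as ordinary a elements with an href. The helper name report_links, the keyword filter, and the example.com URLs are all hypothetical; the real page layout has not been checked here.

```julia
using EzXML

# Hypothetical helper: return the unique hrefs on a page that contain `keyword`.
function report_links(html::AbstractString; keyword = "snow-and-weather-report")
    doc = EzXML.parsehtml(html)
    hrefs = [a["href"] for a in findall("//a[@href]", doc)]  # only anchors that have an href
    return unique(filter(h -> occursin(keyword, h), hrefs))
end

# Made-up index page (assumption, not the real site).
sample = """
<html><body>
<a href="https://www.example.com/resort-a/snow-and-weather-report.aspx">Resort A</a>
<a href="https://www.example.com/about.aspx">About</a>
<a href="https://www.example.com/resort-b/snow-and-weather-report.aspx">Resort B</a>
<a href="https://www.example.com/resort-a/snow-and-weather-report.aspx">Resort A again</a>
<a name="no-href">anchor without a link</a>
</body></html>
"""

links = report_links(sample)
```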

chris-b1 · December 2, 2022, 4:26pm

Also it looks more complicated to scrape snowfall “BASE DEPTH” & “CURRENT SEASON”.
(not sure the approach used for RUNS above would work…)

A trick that can help: if you select the element in Chrome (right click, Inspect), there is an option to copy an XPath query for that element. Sometimes you’ll want to clean up or modify that query, but it can be a helpful starting point.

//*[@id="snow_report_1"]/div[2]/ul/li[6]/div/h5/text()
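
A sketch of how a copied query like this could be run with EzXML. The helper name xpath_text and the sample HTML are made up to match the shape of the query; the real snow page may look different. One possible reason for a query to find nothing (an assumption, not verified for these sites) is that the value is filled in by JavaScript after the page loads, in which case it is not in the HTML that HTTP.get returns. The helper returns nothing in that case instead of erroring.

```julia
using EzXML

# Hypothetical helper: run an XPath query and return the text of the first hit,
# or nothing if the query matches no node.
function xpath_text(html::AbstractString, query::AbstractString)
    doc = EzXML.parsehtml(html)
    node = findfirst(query, doc)
    node === nothing && return nothing
    return strip(nodecontent(node))
end

# Made-up page: six list items inside the second div under #snow_report_1.
items  = join(["<li><div><h5> value $i </h5></div></li>" for i in 1:6])
sample = "<html><body><div id=\"snow_report_1\"><div></div><div><ul>$items</ul></div></div></body></html>"

query = "//*[@id=\"snow_report_1\"]/div[2]/ul/li[6]/div/h5/text()"

hit  = xpath_text(sample, query)
miss = xpath_text("<html><body><div id=\"other\"></div></body></html>", query)
```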

Albert_Zevelev · December 2, 2022, 8:33pm

mehhh, no luck pulling snow data (base depth & current season), but thanks anyways…
