Getting data directly from a website


#1

Hi. I’m working on a script that does some analysis of energy prices. The first step in processing these values is getting them. They are available on a public website, in a five-column table of historical prices. Today, what I do is type them manually into a .csv or a .dat file, then read it with Julia.

I think this procedure is very inefficient, so I’d like to know if there is a way to read the table from this website with Julia. Data is available here.

Thanks.


#2

That site renders all of its content using JavaScript, so downloading and manually parsing the HTML isn’t an option. Your best bet, if you want a Julia solution, is probably going to be using PyCall.jl with one of the many Python packages for scraping the web. Only a few of them support JavaScript-rendered content; I’d probably start by trying out Selenium.


#3

Try this URL instead.

In general, any data loaded into a page asynchronously still has to come from some source. In many cases, you can inspect the network requests your browser makes (e.g. in Firefox: Menu -> Web Developer -> Toggle Tools -> Network tab in the newly opened window) and find the one that loads the data. The corresponding resource is often static and easy to grab, like the one I posted above.


#4

@non-Jedi thank you very much! I’ll read more about PyCall and Selenium; they seem very useful.

@dfdx very interesting! But to get the data from this URL, do you suggest using one of @non-Jedi’s options, or something else?


#5

Ah! Very good. If the resource is available statically, Beautiful Soup or one of the many Python libraries built for scraping but simpler than Selenium would probably be better. However, for the simple task of parsing HTML for the data you need, I’d try one of the HTML parsing libraries available in Julia rather than Python: https://github.com/search?q=html%20language%3Ajulia

Possibly look at EzXML or Gumbo.
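To make the suggestion concrete, here is a minimal sketch of pulling table rows out of HTML with Gumbo.jl plus Cascadia.jl (for CSS selectors). The helper name `table_rows` is mine, and the packages are third-party, not part of the thread’s tested code:

```julia
using Gumbo, Cascadia  # third-party packages: Pkg.add.(["Gumbo", "Cascadia"])

# Hypothetical helper: return every row of the first <tbody> as a
# vector of cell strings.
function table_rows(html::AbstractString)
    doc = parsehtml(html)
    [[nodeText(td) for td in eachmatch(Selector("td"), row)]
     for row in eachmatch(Selector("tbody tr"), doc.root)]
end
```

With the page downloaded (e.g. via HTTP.jl as in the posts below), `table_rows(String(r.body))` would give the price table without any manual regex work.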


#6

Edit:
The cookies parameter helps to get the proper data:

r = HTTP.get("https://www.ccee.org.br/preco/precoMedio.do", cookies=true);

HTTP.jl was unsuccessful here too:

0.7.0-beta.189> r = HTTP.get("https://www.ccee.org.br/preco/precoMedio.do");

0.7.0-beta.189> String(r.body)
"<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">\n<html><head>\n<title>302 Found</title>\n</head><body>\n<h1>Found</h1>\n<p>The document has moved <a href=\"https://login.ccee.org.br:443/oam/server/obrareq.cgi?encquery%3DZYUTYHbstezO3EAE%2FL6JC1pA2hoxJ4M8DcqXXvRIS%2FzRduVNGiWRZT1vy1ILgB9%2BuSxtZZU6dWtKAFdaB7wCNm2ozR1%2FU56731QJK7A0iDOFSIAqPBoLc6t61ju%2Fi8HpDE0UxlDKXQR61q794LYOF7GT%2BN15JSTZHZZjeT5TqJ7cw5A6li0zMZQcekPGrLscV00v5VNu3vsiS3svRYbL6dU4L1CjEFpgpSwZZKtfMKdGl4rzSfQZjx4ah1OYH5Tn3B9mPFRw7dVk3DqAsszZUXHv4fuPyzmPKak1eSBDL%2F78lCo6L25l8KreuL3t%2BZVt%20agentid%3Dagentewg%20ver%3D1%20crmethod%3D2\">here</a>.</p>\n</body></html>\n"

BTW it could be a few lines in Python (note that `io` has to be imported for `StringIO`):

import io
import requests
import pandas as pd
r = requests.get("https://www.ccee.org.br/preco/precoMedio.do")
df = pd.read_html(io.StringIO(r.text))[0]

It could be even simpler, but pd.read_html (like HTTP.jl without the cookies parameter) ran into the same problem: HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.


#8

I am learning Julia too, so I just tried to get the data with “brute force”.

I would really understand if you don’t like it. :wink:

using HTTP, DataFrames

r = HTTP.get("https://www.ccee.org.br/preco/precoMedio.do", cookies=true);
data = String(r.body);

# apply a list of pattern=>replacement pairs to a string, in order
function replacer(a, b)
    for i in b
        a = replace(a, i)
    end
    a
end

csv_data = "Mês,SE/CO,S,NE,N\n" *  # adding the header
   replacer(
     data[findfirst(r"<tbody.*?>(.*)</tbody>"s, data)],   # keep just the table body
     ["\t"=>"",   # clear the mess
      "\r\n"=>"",
      r"<tr.*?>"=>"",
      "</tr>"=>"\n",   # <tr>...</tr> as lines
      r"</td><td>"=>"\",\"",  # comma between values + quotes
      r"<td>(.*?)</td>"=>s"\"\1\"",  # quotes at the beginning and end of each line
      r"<.?tbody>"=>""]);  # clear the mess

df = readtable(IOBuffer(csv_data))
182×5 DataFrame
│ Row │ Mês     │ SE_CO  │ S      │ NE     │ N      │
├─────┼─────────┼────────┼────────┼────────┼────────┤
│ 1   │ 06/2018 │ 472,87 │ 472,87 │ 441,96 │ 441,96 │
│ 2   │ 05/2018 │ 325,46 │ 325,46 │ 211,57 │ 159,47 │
│ 3   │ 04/2018 │ 109,71 │ 109,71 │ 108,64 │ 51,13  │
│ 4   │ 03/2018 │ 219,23 │ 219,23 │ 218,14 │ 40,16  │
│ 5   │ 02/2018 │ 188,79 │ 188,54 │ 178,54 │ 42,91  │
⋮
│ 177 │ 10/2003 │ 25,82  │ 26,30  │ 19,88  │ 25,73  │
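One more note: the prices come back as strings with a comma as the decimal separator (e.g. "472,87"), so they still need converting before numeric analysis. A minimal sketch (the helper name `parse_price` is mine, not from the code above):

```julia
# Convert a price string with a comma decimal separator to a Float64,
# e.g. "472,87" -> 472.87
parse_price(s::AbstractString) = parse(Float64, replace(s, "," => "."))
```

This could be broadcast over a column, e.g. `parse_price.(df[:S])`.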

#9

@Liso WOW! Thank you very much!! It was very, very helpful. But unfortunately I had a problem running your code. I got this for csv_data:

"Mês,SE/CO,S,NE,N\n"\1"\n"\1"\n"\1"\n"\1"\n"\1"\n"\1"\n" … (the same "\1" repeated for every row)

Then I got this error for df:

ERROR: DimensionMismatch("Number of columns (1) and number of column names (5) are not equal")

Could you help just one more time, please? :grin:

Thanks


#10

It seems Julia 0.6.x has some problems with regex capture groups in replace. This works for me on 0.6.1:

julia> csv_data = "Mês,SE/CO,S,NE,N\n" *  # adding header
         replacer(
            data[findfirst(r"<tbody.*?>(.*)</tbody>"s, data)],   # just data for table body
            ["\t"=>"",   # clear the mess
             "\r\n"=>"",  
             r"<tr.*?>"=>"",  
             "</tr>"=>"\n",   # <tr>...</tr> as lines
             r"</td><td>"=>"\",\"",  # comma between values + quotes
             r"<[/]?td>"=>"\"",  # quotes at the beginning and end of each line
             r"<[/]?tbody>"=>""]);  # clear the mess

#11

@Liso It worked perfectly now! Thank you very much, you helped me a lot. :grinning: