I’m trying to learn how to do some really simple web stuff with Julia, like automatically filling out forms, and am kind of at a loss. I found this Python article, which has a simple example using DuckDuckGo that I am trying to replicate. The HTTP.jl examples mostly cover servers, and the POST example is for a REST API, so I’ve only gotten as far as
url = "https://html.duckduckgo.com/html" # URL for submitting the POST
payload = Dict("q" => "julialang") # name of the search field => the search term
req = HTTP.request("POST", url,... ????) # Proper format here?
results = String(req.body) # Figure out way to extract results
As another example, the forms I would like to access are more like web calculators, such as this one, where the inputs and outputs are better defined. For that calculator (modeled on the DuckDuckGo example), the parameters would be
url = "https://www.brewingcalculators.com/celsius-c-to-fahrenheit-f/?" # URL for submitting the POST
payload = Dict("?" => 34) # name of the field => a number
So, my first question is: what would be the proper format for the POST in these cases? And then, is there a well-defined way to get the results out of the response?
Thanks for any help!
HTTP.jl could use more docs/examples, but this should work: wrap your payload in an HTTP.Form (from the API docs here)
using HTTP
payload = Dict("q" => "julialang")
form = HTTP.Form(payload)
resp = HTTP.post("https://html.duckduckgo.com/html", [], form) # empty header list, form as the body
html = String(resp.body)
In terms of doing something with the results, that’s a project of its own, but you could use something like EzXML.jl to parse the HTML and pull out data. Python has some more mature HTML/XML scraping libraries, so you could also call out to them with
But here’s a simple example with EzXML, pulling the text out of all the h2 tags in the HTML
using EzXML
tree = parsehtml(html)
titles = strip.(nodecontent.(findall("//h2", tree)))
"The Julia Programming Language"
"GitHub - JuliaLang/julia: The Julia Programming Language"
"The Julia Language - Posts | Facebook"
"The Julia Language (@JuliaLanguage) | Твиттер"
Great, thanks! I’ll play around with EzXML.
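For playing around without hitting the network, here is a small self-contained variant of the EzXML example above. The HTML string is made up for illustration and only stands in for a real results page; the second XPath shows that attributes (here `href`) can be selected the same way as element text.

```julia
using EzXML

# Made-up HTML standing in for a real search-results page (an assumption, not real output).
sample = """
<html><body>
  <h2><a href="https://julialang.org/">The Julia Programming Language</a></h2>
  <h2><a href="https://github.com/JuliaLang/julia">GitHub - JuliaLang/julia</a></h2>
</body></html>
"""

doc = parsehtml(sample)

# Text content of every <h2>, with surrounding whitespace stripped.
titles = strip.(nodecontent.(findall("//h2", doc)))

# An @href XPath matches attribute nodes; nodecontent returns each attribute's value.
links = nodecontent.(findall("//h2/a/@href", doc))
```

Real pages are rarely this tidy, so the XPath expressions usually need adjusting after looking at the actual HTML.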
Look to Chrome’s developer tools… you can actually record the outgoing POST with your form submission and see the exact format of what is sent (along with the headers). It’s a bit brute force, but you can then lovingly handcraft the header and data portions of the POST request.
That at least gets you a minimum working example to expand from.
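A rough sketch of what that handcrafting could look like in HTTP.jl. Browsers usually send simple HTML forms as `application/x-www-form-urlencoded`, so this builds that body and header by hand. The helper names `encode_form` and `submit_form` and the `User-Agent` value are placeholders made up for illustration, and it assumes `HTTP.escapeuri` (which comes from URIs.jl) is available in your HTTP.jl version. Whether a given site accepts exactly this request is also an assumption; copy the real headers from the Network tab.

```julia
using HTTP

# Encode form fields as "key=value&key2=value2", percent-escaping where needed.
encode_form(fields) = HTTP.escapeuri(fields)

# Hypothetical helper: POST the fields roughly the way a browser form would.
function submit_form(url, fields)
    headers = [
        "Content-Type" => "application/x-www-form-urlencoded",
        "User-Agent"   => "Mozilla/5.0 (placeholder)",  # placeholder; copy the real value from dev tools
    ]
    resp = HTTP.post(url, headers, encode_form(fields))
    return String(resp.body)
end

# Example call (needs network access, so it is left commented out):
# html = submit_form("https://html.duckduckgo.com/html", Dict("q" => "julialang"))
```

Keeping the encoding step separate from the request makes it easy to compare the generated body against what the developer tools recorded before sending anything.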
Thanks for this example. I understand XPath far better than anything else, and it helped me pull out the hrefs from a huge HTML table:
for n in findall("//tr/td/a/@href", doc)