Scraping Site

I have the following simple Python script that works:

import requests

url = "https://ABCDEFG.com/user/login"

url2= "https://ABCDEFG.com/en/admin/events/page-download"

payload = {'name':'FOO',
           'user_pin':'1234',
           'destination':'https://ABC.com/en',
           'commit':'Login'}

with requests.Session() as session:
    post = session.post(url,data=payload)
    r = session.get(url2)
    print(r.text) #or do something else..
    

Is it possible to do something similar in Julia using HTTP.jl or another package?

EDIT:
Using HTTP.jl I’ve tried the following without success:

using HTTP, JSON

url = "https://ABCDEFG.com/user/login"

url2= "https://ABCDEFG.com/en/admin/events/page-download"

payload = json(Dict("name"=> "FOO",
           "user_pin" => "1234",
           "destination" => "https://ABC.com/en",
           "commit" => "Login"))

r = HTTP.post(url,body=payload, cookies=true)

julia> r.status
200
# so far things look promising

julia> HTTP.get(url2)
ERROR: HTTP.ExceptionRequest.StatusError(403, HTTP.Messages.Response:
HTTP/1.1 403 Forbidden...

@avik Is it possible to open a session in HTTP.jl similar to requests.Session in Python?

What do you mean exactly?

I’m not sure what is actually unclear in the initial question.

I'm unable to provide a perfect MWE because I need to protect the credentials and the site.

What do I need to clarify?

Edit: I may be using the wrong terms, given that I'm not very familiar with networking and HTTP. 'Open a session' is likely the wrong term. I'm referring to the Python code above, where a session is created…

I think the problem is that you already have the result stored in the variable r. In the line

HTTP.get(url2)

you're not actually passing the payload, right?

Providing the payload in the GET request (the payload is just the credentials) still results in the same error.

Sorry for the slow response here. Looking things over a bit, it seems like cookies=true should work for you (in terms of replicating the requests.Session functionality). There might be some kind of bug in the code here: https://github.com/JuliaWeb/HTTP.jl/blob/master/src/CookieRequest.jl. If you could provide more details, we could perhaps figure out exactly what's going wrong.

My personal approach would be to find a way to do the equivalent of HTTP.jl's verbose=2 with requests, to see exactly which request/response headers are being sent and coming back. If you could see which cookies/headers the requests library is storing and sending, it should be straightforward to see which ones HTTP.jl isn't sending. We could then figure out why not, or, as a work-around, you could pass the right headers/cookies yourself.
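For reference, a minimal, untested sketch of that approach is below (the URLs and credentials are the placeholders from above, and the exact positional vs. keyword argument forms may differ between HTTP.jl versions). Note that the Python script sends the payload form-encoded via data=, not as JSON, so the Julia request probably needs a form-encoded body as well:

using HTTP

url  = "https://ABCDEFG.com/user/login"
url2 = "https://ABCDEFG.com/en/admin/events/page-download"

# Same fields as the Python `data=payload`; sent form-encoded rather than as JSON.
payload = Dict("name"        => "FOO",
               "user_pin"    => "1234",
               "destination" => "https://ABC.com/en",
               "commit"      => "Login")

body = HTTP.URIs.escapeuri(payload)   # "name=FOO&user_pin=1234&..."

# verbose=2 prints the request/response headers so they can be compared with
# what requests sends; cookies=true asks HTTP.jl to store and resend any
# session cookie from the login response.
r  = HTTP.post(url, ["Content-Type" => "application/x-www-form-urlencoded"], body;
               cookies=true, verbose=2)
r2 = HTTP.get(url2; cookies=true, verbose=2)
println(String(r2.body))

# Work-around if the cookie layer misbehaves: pull the cookie out of the login
# response and send it back explicitly (you may need to keep only the
# name=value part of the Set-Cookie header).
setcookie = HTTP.header(r, "Set-Cookie")
r3 = HTTP.get(url2, ["Cookie" => setcookie])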

I wish the video also had the code visible to accompany the description.

Some code for @avik's video.

From https://github.com/Algocircle/Cascadia.jl

using Cascadia, Gumbo, HTTP

# Fetch the page and parse the HTML body into a Gumbo document.
r = HTTP.get("http://stackoverflow.com/questions/tagged/julia-lang")
h = parsehtml(String(r.body))

# Select every question summary on the page with a CSS selector.
qs = eachmatch(Selector(".question-summary"), h.root)

println("StackOverflow Julia Questions (votes  answered?  url)")

for q in qs
    votes = nodeText(eachmatch(Selector(".votes .vote-count-post "), q)[1])
    answered = length(eachmatch(Selector(".status.answered"), q)) > 0
    href = eachmatch(Selector(".question-hyperlink"), q)[1].attributes["href"]
    println("$votes  $answered  http://stackoverflow.com$href")
end
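
If the page you actually need sits behind the login discussed above, the same Cascadia parsing can be applied to a response fetched with cookies enabled. A rough sketch, assuming a prior login request with cookies=true in the same session (the ".event-row" selector is a made-up placeholder; use whatever matches the real page):

using Cascadia, Gumbo, HTTP

# Assumes an earlier HTTP.post(...; cookies=true) login in the same session.
r = HTTP.get("https://ABCDEFG.com/en/admin/events/page-download"; cookies=true)
h = parsehtml(String(r.body))

# ".event-row" is a placeholder selector; substitute one that matches the page.
for row in eachmatch(Selector(".event-row"), h.root)
    println(nodeText(row))
end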