I want to scrape GCN Circulars for optical events.
For example, I want to get the following information from this web page,
and similarly for other web pages containing optical events, and store it in CSV or Excel format.
Fortunately, many NASA pages provide JSON versions as well, so you can just use a JSON package such as GitHub - quinnj/JSON3.jl.
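A minimal sketch of that approach, assuming the circular is also served as JSON at a .json URL (the endpoint is an assumption - verify it against the GCN documentation):

using HTTP, JSON3

url = "https://gcn.nasa.gov/circulars/34030.json"   # assumed JSON endpoint
r = HTTP.get(url)
j = JSON3.read(r.body)
println(keys(j))   # inspect the available fields, then pick out what you need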
How can I extract the information and compile it into a CSV table for many web pages automatically?
By "automatically" do you mean "without writing any code"? You would probably still have to write some code to guide the process.
My code for one web page looks like this:
using HTTP, Gumbo, Cascadia, AbstractTrees

url = "https://gcn.nasa.gov/circulars/34030"
r = HTTP.get(url)                 # fetch the page
h = parsehtml(String(r.body))     # parse the HTML
body = h.root[2]                  # the <body> element
eachmatch(Selector("p"), body)    # all <p> elements
Div = eachmatch(Selector(".usa-accordion__button.usa-banner__button"), body)
Div[1]
See the Pluto notebook output below.
I want to scrape that text information from every web page, as shown below:
At this point, it looks like you have already managed to extract the relevant content/text from HTML.
Gumbo/Cascadia will not help to get the text into structured data (since you have raw text, not an HTML table or other elements).
Gumbo.jl conveniently provides the text function, which extracts the text from any HTML element. In your scenario, that is text(Div[1]).
However, this will output a string that is still not formatted per your needs - and Gumbo.jl has no helper functions for transforming a raw string into structured data.
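For example, continuing from the snippet above (using the Div variable from your code):

using Gumbo
raw = text(Div[1])   # one plain String with all of the element's visible text
println(raw)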
A very simple parser for the format above can look like this:
using DataFrames

txt = """
JD (mid) | Telescope | Filter | Exposure (s) | Magnitude (AB) |
----------------------------------------------------------------------
2460115.3875 | OHP-T120 | R | 3900 | 20.70 +/- 0.12 |
2460115.413706 | OHP-T193/MISTRAL | r' | 4560 | 20.84 +/- 0.04 |
2460115.440972 | OHP-T120 | V | 4200 | 20.85 +/- 0.07 |"""

lines = split(txt, "\n")
# Split a row on "|", strip whitespace, and drop the empty trailing field.
parseline(line) = strip.(split(line, "|"))[1:end-1]
header = parseline(lines[1])      # column names
rows = parseline.(lines[3:end])   # data rows (skip the dashed separator)
# Column name => column values (note: a Dict does not preserve column order).
d = Dict(k => [getindex(row, i) for row in rows] for (i, k) in enumerate(header))
DataFrame(d)
And it will produce something like this:
Now, if the pages contain the same text somewhere in the content, you can create a matching pattern to find the start and the end of the desired text, and then use something similar to the code above to extract it as a data frame (and finally write it out as CSV).
However, please note that this is beyond Gumbo.jl capabilities.
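A minimal sketch of that matching step, assuming the photometry table starts with a "JD (mid)" header line and ends at a blank line (both anchors are assumptions - adjust them to the actual circular text):

using CSV, Gumbo

fulltext = text(Div[1])                          # the whole circular as one string
start = findfirst("JD (mid)", fulltext)          # assumed start anchor
stop = findnext("\n\n", fulltext, last(start))   # assumed end anchor (blank line)
# In real code, check both anchors for nothing before indexing.
table_txt = fulltext[first(start):first(stop)-1]
# Feed table_txt to the parser above to build a DataFrame df, then:
CSV.write("circular_34030.csv", df)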
What are the other possible ways to extract data? I mean, are there other libraries that can help in getting out the desired data?
I think with the matching-pattern approach you're pretty much done.
Btw, for the web scraping part, you can also use Harbest:
using Harbest

html = read_html("https://gcn.nasa.gov/circulars/34030")
data = html_elements(html, ["main", "div"])[12]   # the index is page-specific
html_text3(data)
For simple examples like the one you provided, I think the best way is to stick with pure Julia (see the example I provided - you can improve on it; I just put something together to get you started). For example, you can see that I didn't convert things from strings to numerical values - the goal was to give you a minimal example regarding data extraction.
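For instance, a conversion step on the DataFrame built above could look like this (column names taken from the example table):

df = DataFrame(d)
df[!, "JD (mid)"] = parse.(Float64, df[!, "JD (mid)"])
df[!, "Exposure (s)"] = parse.(Int, df[!, "Exposure (s)"])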
I don't have working experience with specific parser libraries in Julia, but I have stumbled upon andrewcooke/ParserCombinator.jl a few times.
I think it is easier and faster to just put together your own parser for scenarios like the above. However, if somebody else is aware of a better way to do this, I am curious myself whether there are good parsing libraries in Julia (besides the language-related parsers).
OK, so I will learn ParserCombinator.jl. Thank you!
Don't get me wrong - I mentioned that package because of familiarity alone; I am not suggesting it is the right solution for your specific problem. As I said, I consider writing some Julia code specific to your use case a better approach (a parser combinator might be overkill).
However, learning will not hurt - I am just ensuring I am not pointing you in the wrong direction.
Also - please check the code I shared - that is already working for your specific use case (although it might not be complete - you might want to add some conversions and make sure you isolate the snippet from the larger text content).
Have fun.
I have been able to do the following things using ParserCombinator.jl.
It shows a ParserException for line 1 of the text. Please see the last line of the picture above.
I suggest you go deeper into the documentation of the package.
For example, parse_one returns a single result or throws a ParserException. This is exactly what happened when the parser failed to detect the intended pattern ("Date") - and the behavior is the correct/intended one.
Maybe try to review the tutorials and examples provided by the package documentation.
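To illustrate, here is a toy sketch (the matcher is a placeholder, not your actual grammar - see the package docs for the matcher macros):

using ParserCombinator

m = E"Date" + p".*"                # require the line to start with "Date"
parse_one("Date: 2023-06-19", m)   # succeeds, returns the matched remainder
parse_one("JD (mid) | ...", m)     # throws ParserException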
Why does parse_one terminate after only the first index and not move forward in the for loop? I mean, it should give output for line 5 when marching through the for loop.
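parse_one throws on the first line that does not match, and an uncaught exception aborts the entire for loop, so line 5 is never reached. Wrapping the call in try/catch lets the loop continue - a sketch, where m stands for your matcher and lines for your split text:

using ParserCombinator

for (i, line) in enumerate(lines)
    try
        println(i, " => ", parse_one(line, m))
    catch err
        err isa ParserException || rethrow()
        # skip lines that do not match (e.g. header and separator lines)
    end
end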
I am able to handle the exception now.