I want to scrape GCN Circulars for optical events.
For example, I want to get the following information from this web page,
and similarly for other web pages containing optical events, and store it in CSV or Excel format.
Fortunately, many NASA pages provide JSON versions as well, so you can just use a JSON package such as GitHub - quinnj/JSON3.jl.
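A minimal sketch of that approach, assuming the circular is also served as JSON at a .json URL (the endpoint is an assumption - verify it against the GCN documentation):

using HTTP, JSON3

url = "https://gcn.nasa.gov/circulars/34030.json"   # assumed JSON endpoint
r = HTTP.get(url)
j = JSON3.read(r.body)
println(keys(j))   # inspect the available fields, then pick out what you need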
How can I extract the information and compile it into a CSV table for many web pages automatically?
By "automatically" do you mean "without writing any code"? You would probably still have to write some code to guide the process.
My code for one web page looks like this:
using HTTP, Gumbo, Cascadia, AbstractTrees

url = "https://gcn.nasa.gov/circulars/34030"
r = HTTP.get(url)                 # fetch the page
h = parsehtml(String(r.body))     # parse the HTML
body = h.root[2]                  # the <body> element
eachmatch(Selector("p"), body)    # all <p> elements
Div = eachmatch(Selector(".usa-accordion__button.usa-banner__button"), body)
Div[1]
See the Pluto notebook output below.
I want to scrape that text information from every web page, as shown below:
At this point, it looks like you have already managed to extract the relevant content/text from HTML.
Gumbo/Cascadia will not help to get the text into structured data (since you have raw text, not an HTML table or other elements).
Gumbo.jl conveniently provides the text function, which extracts the text from any HTML element. In your scenario, that is text(Div[1]).
However, this will output a string that is still not formatted per your needs - and Gumbo.jl has no helper functions for transforming a raw string into structured data.
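For example, continuing from the snippet above (using the Div variable from your code):

using Gumbo
raw = text(Div[1])   # one plain String with all of the element's visible text
println(raw)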
A very simple parser for the format above can look like this:
using DataFrames

txt = """
JD (mid) | Telescope | Filter | Exposure (s) | Magnitude (AB) |
----------------------------------------------------------------------
2460115.3875 | OHP-T120 | R | 3900 | 20.70 +/- 0.12 |
2460115.413706 | OHP-T193/MISTRAL | r' | 4560 | 20.84 +/- 0.04 |
2460115.440972 | OHP-T120 | V | 4200 | 20.85 +/- 0.07 |"""

lines = split(txt, "\n")
# Split a row on "|", strip whitespace, and drop the empty trailing field.
parseline(line) = strip.(split(line, "|"))[1:end-1]
header = parseline(lines[1])      # column names
rows = parseline.(lines[3:end])   # data rows (skip the dashed separator)
# Column name => column values (note: a Dict does not preserve column order).
d = Dict(k => [getindex(row, i) for row in rows] for (i, k) in enumerate(header))
DataFrame(d)
And it will produce something like this:
Now, if the pages contain the same text somewhere in the content, you can create a matching pattern to find the start and the end of the desired text, and then use something similar to the code above to extract it as a data frame (and finally write it out as CSV).
However, please note that this is beyond Gumbo.jl capabilities.
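A minimal sketch of that matching step, assuming the photometry table starts with a "JD (mid)" header line and ends at a blank line (both anchors are assumptions - adjust them to the actual circular text):

using CSV, Gumbo

fulltext = text(Div[1])                          # the whole circular as one string
start = findfirst("JD (mid)", fulltext)          # assumed start anchor
stop = findnext("\n\n", fulltext, last(start))   # assumed end anchor (blank line)
# In real code, check both anchors for nothing before indexing.
table_txt = fulltext[first(start):first(stop)-1]
# Feed table_txt to the parser above to build a DataFrame df, then:
CSV.write("circular_34030.csv", df)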
What are the other possible ways to extract data? I mean, are there other libraries that can help in getting out the desired data?
I think with the matching-pattern approach you're pretty much done.
Btw, for the web scraping part, you can also use Harbest:
using Harbest

html = read_html("https://gcn.nasa.gov/circulars/34030")
data = html_elements(html, ["main", "div"])[12]   # the index is page-specific
html_text3(data)
For simple examples like the one you provided, I think the best way is to stick with pure Julia (see the example I provided - you can improve on it; I just put something together to get you started). For example, you can see that I didn't convert things from strings to numerical values - the goal was to give you a minimal example regarding data extraction.
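For instance, a conversion step on the DataFrame built above could look like this (column names taken from the example table):

df = DataFrame(d)
df[!, "JD (mid)"] = parse.(Float64, df[!, "JD (mid)"])
df[!, "Exposure (s)"] = parse.(Int, df[!, "Exposure (s)"])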
I don't have working experience with specific parser libraries in Julia, but I have stumbled upon andrewcooke/ParserCombinator.jl a few times.
I think it is easier and faster to just put together your own parser for scenarios like the above. However, if somebody else is aware of a better way to do this, I am curious myself whether there are good parsing libraries in Julia (besides the language-related parsers).
OK, so I will learn ParserCombinator.jl. Thank you!
Don't get me wrong - I mentioned that package because of familiarity alone; I am not suggesting it is the right solution for your specific problem. As I said, I consider writing some Julia code specific to your use case a better approach (a parser combinator might be overkill).
However, learning will not hurt - I am just ensuring I am not pointing you in the wrong direction.
Also - please check the code I shared - that is already working for your specific use case (although it might not be complete - you might want to add some conversions and make sure you isolate the snippet from the larger text content).
Have fun.
I have been able to do the following things using ParserCombinator.jl.
It shows a ParserException for line 1 of the text. Please see the last line of the picture above.
I suggest you go deeper into the documentation of the package.
For example, parse_one returns a single result or throws a ParserException. This is exactly what happened when the parser failed to detect the intended pattern ("Date") - and the behavior is the correct/intended one.
Maybe try to review the tutorials and examples provided by the package documentation.
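To illustrate, here is a toy sketch (the matcher is a placeholder, not your actual grammar - see the package docs for the matcher macros):

using ParserCombinator

m = E"Date" + p".*"                # require the line to start with "Date"
parse_one("Date: 2023-06-19", m)   # succeeds, returns the matched remainder
parse_one("JD (mid) | ...", m)     # throws ParserException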
Why does parse_one terminate after only the first index and not move forward in the for loop? I mean, it should give output for line 5 when marching through the for loop.
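parse_one throws on the first line that does not match, and an uncaught exception aborts the entire for loop, so line 5 is never reached. Wrapping the call in try/catch lets the loop continue - a sketch, where m stands for your matcher and lines for your split text:

using ParserCombinator

for (i, line) in enumerate(lines)
    try
        println(i, " => ", parse_one(line, m))
    catch err
        err isa ParserException || rethrow()
        # skip lines that do not match (e.g. header and separator lines)
    end
end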
I am able to handle the exception now.