Web scraping of GCN NASA circulars

I want to scrape GCN Circulars for optical events.

For example, I want to get the following information from this web page :blush:

[screenshot of the circular page]

and similarly for other web pages containing optical events, and store the results in CSV or Excel format.

Fortunately, many NASA pages provide JSON versions as well, so you can just use a JSON package such as JSON3.jl (quinnj/JSON3.jl on GitHub).
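
For example, a minimal sketch (the .json suffix and the field names below are assumptions - verify the URL in your browser and inspect keys(circ) on the real response):

using HTTP, JSON3

# Assumption: the same circular is also served as JSON under a .json suffix.
resp = HTTP.get("https://gcn.nasa.gov/circulars/34030.json")
circ = JSON3.read(resp.body)
circ.subject, circ.body   # hypothetical field names - check keys(circ)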


How can I extract the information and compile it into a CSV table for many web pages automatically? :face_with_monocle:

By “automatically” do you mean “without writing any code”? You would probably still have to write some code to guide the process.

My code for the web page looks like:

using HTTP, Gumbo, Cascadia, AbstractTrees

url = "https://gcn.nasa.gov/circulars/34030"
r = HTTP.get(url)
h = parsehtml(String(r.body))
body = h.root[2]                 # the <body> element
eachmatch(Selector("p"), body)   # all <p> elements
Div = eachmatch(Selector(".usa-accordion__button.usa-banner__button"), body)
Div[1]

see the Pluto notebook below :blush:

I want to scrape that text information for every web page, as shown below: :face_with_monocle:

[screenshot of the circular text]

At this point, it looks like you have already managed to extract the relevant content/text from HTML.

Gumbo/Cascadia will not help you turn the text into structured data (since you have raw text, not an HTML table or other structured elements).

Gumbo.jl conveniently provides the text function that extracts the text from any HTML element. In your scenario, text(Div[1]).

However, this will output a string that is not yet formatted per your needs - and Gumbo.jl has no helper functions for transforming a raw string into structured data.
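
For example, continuing from your snippet (an untested sketch - it assumes Div[1] is the element that actually holds the circular text):

raw = text(Div[1])           # the element's text as a single String
first(split(raw, "\n"), 5)   # peek at the first few lines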

A very simple parser for the format above can look like this:

using DataFrames

txt = """
JD (mid) | Telescope |  Filter | Exposure (s) | Magnitude (AB) |
----------------------------------------------------------------------
2460115.3875 | OHP-T120 | R | 3900 | 20.70 +/- 0.12 |
2460115.413706 | OHP-T193/MISTRAL | r' | 4560 | 20.84 +/- 0.04 |
2460115.440972 | OHP-T120 | V | 4200 | 20.85 +/- 0.07 |"""

lines = split(txt, "\n")
# split a row on "|", strip the whitespace, drop the empty trailing field
parseline(line) = strip.(split(line, "|"))[1:end-1]
header = parseline(lines[1])
rows = parseline.(lines[3:end])   # lines[2] is the dashed separator
d = Dict(k => [row[i] for row in rows] for (i, k) in enumerate(header))
DataFrame(d)

And it will produce a 3×5 DataFrame with the parsed values (still as strings).

Now, if the pages contain the same kind of text somewhere in their content, you can create a matching pattern to find the start and the end of the desired text, then use something similar to the code above to extract it as a data frame (and finally write it to CSV).
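
For instance, a minimal sketch of that idea - it assumes the table's header line always contains "JD (mid)" and that every data row uses "|" separators, both assumptions about the circulars' formatting, so adjust the anchors to what the pages actually contain:

using DataFrames, CSV

function extract_table(raw::AbstractString)
    ls = split(raw, "\n")
    start = findfirst(l -> occursin("JD (mid)", l), ls)   # header line
    start === nothing && return nothing                   # no table on this page
    stop = start + 1                                      # the dashed separator
    while stop < length(ls) && occursin("|", ls[stop+1])
        stop += 1                                         # extend over the data rows
    end
    parseline(line) = strip.(split(line, "|"))[1:end-1]
    header = parseline(ls[start])
    rows = parseline.(ls[start+2:stop])                   # skip the dashed line
    DataFrame(Dict(k => [row[i] for row in rows] for (i, k) in enumerate(header)))
end

df = extract_table(text(Div[1]))   # raw text from the Gumbo snippet earlier
df === nothing || CSV.write("circular_34030.csv", df)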

However, please note that this is beyond Gumbo.jl's capabilities.


What are the other possible ways to extract the data? I mean, are there other libraries that can help in getting the desired data out? :thinking:

I think with the matching-pattern approach you’re pretty much done.

Btw, for the web scraping part, you can also use Harbest:

using Harbest

html = read_html("https://gcn.nasa.gov/circulars/34030")
# the 12th matching element under <main> holds the circular text in this page's layout
data = html_elements(html, ["main", "div"])[12]
html_text3(data)


For simple examples like the one you provided, I think the best way is to stick with pure Julia (see the example I provided - you can improve on it; I just put something together to get you started). For example, you can see that I didn’t convert things from strings to numerical values - the goal was to give you a minimal example of the data extraction.
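
For completeness, a sketch of those conversions (the column names come from the example table above, and the " +/- " separator is an assumption about the magnitude column):

using DataFrames

# Continuing from the `d` Dict built in the parser example above.
df = DataFrame(d)
df[!, "JD (mid)"]     = parse.(Float64, df[!, "JD (mid)"])
df[!, "Exposure (s)"] = parse.(Int,     df[!, "Exposure (s)"])

# "20.70 +/- 0.12" -> separate magnitude and uncertainty columns
parts = split.(df[!, "Magnitude (AB)"], " +/- ")
df[!, "mag"]     = parse.(Float64, first.(parts))
df[!, "mag_err"] = parse.(Float64, last.(parts))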

I don’t have working experience with specific parser libraries in Julia, but I stumbled upon andrewcooke/ParserCombinator.jl a few times.

I think it is easier and faster to just put together your own parser for scenarios like the above. However, if somebody else is aware of a better way to do this, I am curious myself whether there are some good parsing libraries in Julia (besides the language-related parsers).


Ok, so I will learn ParserCombinator.jl. Thank you!

Don’t get me wrong - I mentioned that package out of familiarity alone; I am not suggesting it is the right solution for your specific problem. As I said, I consider writing some Julia code specific to your use case a better approach (a parser combinator might be overkill).

However, learning it will not hurt - I just want to make sure I am not pointing you in the wrong direction.

Also - please check the code I shared: it already works for your specific use case (although it might not be complete - you might want to add some conversions and make sure you isolate the snippet from the larger text content).

Have fun.

I have been able to do the following using ParserCombinator.jl :blush:

[screenshot of the ParserCombinator.jl attempt in Pluto]

It shows a ParserException for line 1 of the text. Please see the last line of the picture above. :thinking:

I suggest you go deeper into the documentation of the package.

For example, parse_one returns a single result or throws a ParserException. This is exactly what happened when the parser failed to detect the intended pattern (“Date”) - and the behavior is the correct/intended one.

Maybe try to review the tutorials and examples provided by the package documentation.


Why does parse_one terminate after only the first index and not move forward in the for loop? I mean, it should give output for line 5 when marching through the for loop.

  1. It throws an exception: if you want the loop to keep iterating, you must handle the exception (see the sketch below).
  2. See 1.
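
A minimal sketch of point 1, assuming lines holds the text lines you iterate over and matcher is the pattern you built:

using ParserCombinator

for (i, line) in enumerate(lines)
    try
        println(i, " => ", parse_one(line, matcher))
    catch e
        e isa ParserException || rethrow()   # only swallow failed matches
        # no match on this line - skip it and keep iterating
    end
end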

I am able to handle the exception. :smiley:

[screenshot of the working loop]
