Web scraping of GCN NASA circulars TEXT

algunion · June 26, 2023, 6:26am

At this point, it looks like you have already managed to extract the relevant content/text from HTML.

Gumbo/Cascadia will not help you turn that text into structured data, since what you have is raw text rather than an HTML table or other elements.

Gumbo.jl conveniently provides the text function, which extracts the text from any HTML element. In your scenario, that would be text(Div[1]).

However, this returns a plain string that is not yet formatted for your needs, and Gumbo.jl has no helper functions for transforming a raw string into structured data.
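For reference, here is a minimal sketch of that extraction step, assuming Gumbo.jl and Cascadia.jl are installed. The HTML fragment and the `div` selector are made up for illustration; the real circular page will have different markup.

```julia
using Gumbo
using Cascadia

# Hypothetical page fragment (an assumption, not the real GCN markup).
html = """
<html><body>
<div>JD (mid) | Telescope | Filter |
2460115.3875 | OHP-T120 | R |</div>
</body></html>
"""

doc = parsehtml(html)                          # Gumbo: parse the HTML string
divs = eachmatch(Selector("div"), doc.root)    # Cascadia: all <div> elements
raw = Gumbo.text(divs[1])                      # Gumbo: text content of the first <div>
println(raw)
```

The result is a single plain string, which is why the parsing itself has to happen outside Gumbo.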

A very simple parser for the format above can look like this:

using DataFrames

txt = """
JD (mid) | Telescope |  Filter | Exposure (s) | Magnitude (AB) |
----------------------------------------------------------------------
2460115.3875 | OHP-T120 | R | 3900 | 20.70 +/- 0.12 | 
2460115.413706 | OHP-T193/MISTRAL | r' | 4560 | 20.84 +/- 0.04 | 
2460115.440972 | OHP-T120 | V | 4200 | 20.85 +/- 0.07 |"""

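# Split the text into lines; line 1 is the header, line 2 the dashed separator.
lines = split(txt, "\n")
# Split a line on "|", trim each cell, and drop the empty cell after the trailing "|".
parseline(line) = strip.(split(line, "|"))[1:end-1]
header = parseline(lines[1])
rows = parseline.(lines[3:end])
# Build one column per header field. A Dict is unordered, so the columns may not come out in header order.
d = Dict(k => [getindex(row, i) for row in rows] for (i, k) in enumerate(header))
DataFrame(d)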

This produces a DataFrame with three rows and five columns, one per header field, with every value still stored as a string.
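If you need numbers rather than strings, the cells can be converted afterwards. Here is a minimal sketch using only Base Julia, assuming every magnitude cell has the form `value +/- error`. The `parse_measurement` helper is a hypothetical name, and the two vectors stand in for the corresponding columns of the data frame.

```julia
# Cells as they come out of the parser: everything is still a string.
jd  = ["2460115.3875", "2460115.413706", "2460115.440972"]
mag = ["20.70 +/- 0.12", "20.84 +/- 0.04", "20.85 +/- 0.07"]

# Hypothetical helper: split "value +/- error" into two Float64 numbers.
function parse_measurement(cell)
    value, err = split(cell, "+/-")
    return parse(Float64, strip(value)), parse(Float64, strip(err))
end

jd_num = parse.(Float64, jd)               # Julian dates as Float64
measurements = parse_measurement.(mag)     # vector of (value, error) tuples
mag_val = first.(measurements)             # magnitudes
mag_err = last.(measurements)              # uncertainties
```

The same broadcasted calls can be applied directly to the data frame columns if you prefer to keep everything in one place.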

Now, if the pages all contain a table in this same format somewhere in their content, you can write a matching pattern (a regex, for example) to find the start and the end of the desired text, then use something similar to the code above to extract it as a data frame (and finally as CSV).
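A rough sketch of that idea, assuming DataFrames.jl and CSV.jl are installed. The sample `body`, the `extract_table` function, its `start_pattern` keyword and the output file name are all made up for illustration. It also assumes the table starts at the "JD (mid)" header line and ends at the first line without a "|".

```julia
using DataFrames
using CSV

# Hypothetical circular body: free text around the pipe-delimited table.
body = """
We observed the field with several telescopes.

JD (mid) | Telescope |  Filter | Exposure (s) | Magnitude (AB) |
----------------------------------------------------------------------
2460115.3875 | OHP-T120 | R | 3900 | 20.70 +/- 0.12 |
2460115.413706 | OHP-T193/MISTRAL | r' | 4560 | 20.84 +/- 0.04 |

Magnitudes are not corrected for extinction.
"""

# Same line parser as above: split on "|", trim, drop the trailing empty cell.
parseline(line) = strip.(split(line, "|"))[1:end-1]

function extract_table(body; start_pattern = r"^\s*JD \(mid\)\s*\|")
    lines = split(body, "\n")
    # Start of the table: first line matching the header pattern.
    start = findfirst(l -> occursin(start_pattern, l), lines)
    start === nothing && return nothing
    header = parseline(lines[start])
    rows = Vector{Vector{SubString{String}}}()
    for line in lines[start+1:end]
        if occursin(r"^\s*-+\s*$", line)
            continue                      # skip the dashed separator
        elseif occursin("|", line)
            cells = parseline(line)
            # Keep only rows with as many cells as the header.
            length(cells) == length(header) && push!(rows, cells)
        else
            break                         # end of the table: first line without "|"
        end
    end
    # A vector of pairs keeps the columns in header order.
    return DataFrame([Symbol(k) => [row[i] for row in rows] for (i, k) in enumerate(header)])
end

df = extract_table(body)
CSV.write(joinpath(tempdir(), "photometry.csv"), df)
```

If the circulars do not all use the same header, the `start_pattern` is the part you would need to adapt.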

However, please note that this is beyond Gumbo.jl's capabilities.
