I want to scrape tables from these websites, where the content is separated by |, so I need some way to split the string on | . if occursin("MASTER", txt)
is used to pick only the MASTER-telescope circulars.
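A minimal sketch of the splitting step itself, on an invented sample line shaped like the MASTER rows (Base Julia only):

```julia
# Hypothetical '|'-delimited row, shaped like the MASTER tables
line = "22662 | 2023-06-19 09:49:13 | MASTER-OAFA | C | 180 | 12.5"

# split on '|' and strip the surrounding whitespace from every field
fields = strip.(split(line, '|'))

println(fields[3])   # MASTER-OAFA
```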
The code that I have prepared so far is:
using HTTP, DataFrames, CSV

function doanalysis()
    dfg = nothing
    for x in 34010:34038
        print("\r peeking at $x ")
        try
            url = "https://gcn.nasa.gov/circulars/$x"
            resp = HTTP.get(url)
            status = resp.status
            print(" ", status, " ")
            if status == 404; println("status=", status); continue; end
            txt = String(resp.body)
            if occursin("MASTER", txt)
                println(" Master report")
                hb, he = findfirst(r"^Tmid-T0 "im, txt)
                lr, _ = findnext("\n\nThe", txt, he)
                cltxt = replace(txt[hb:lr], " +/- " => "+/-", r" +(\w)" => s"\t\1", r" +(>)" => s"\t>")
                cltxt = replace(cltxt, ">" => "\t>")
                # println("cltxt="); print(cltxt)
                df = CSV.read(IOBuffer(cltxt), DataFrame, delim='\t')
                df.x = [x for i in 1:nrow(df)]
                if isnothing(dfg) # x == 33037
                    dfg = df
                else
                    dfg = vcat(dfg, df)
                end # if x is first
                CSV.write("data-$(x).csv", df)
            end # if occursin
        catch e
            println("error ")
        end # trycatch
    end # for loop
    println()
    if !isnothing(dfg)
        CSV.write("data-all.csv", dfg)
    else
        @info "no dfg to write"
    end # !isnothing
end # function doanalysis

doanalysis()
I suggest adding some additional context/detail to your MWE.
As it stands, those who want to help need to run the code and spend a great deal of time checking whether the result is correct (by comparing it with the tables from all those URLs).
Ideas:
- try to indicate what goes wrong (vs. your expectations/goal)
- are there variations in the tables, and you need help with parsing?
- is the code producing a specific error that you need help fixing?
You have to specify the words that delimit the table; besides, some tables look "cleaner" than others.
using CSV, DataFrames, HTTP

url = "https://gcn.nasa.gov/circulars/34021/raw"
txt = String(HTTP.get(url).body)
# the m flag treats the ^ and $ tokens as matching the start and end of
# individual lines, as opposed to the whole string
he = first(findfirst(r"^Tmid"im, txt))
lr = first(findnext("\nFilter", txt, he)) - 1
cltxt = txt[he:lr]
julia> df=CSV.read(IOBuffer(cltxt), DataFrame, delim='|', skipto=3)
9×8 DataFrame
 Row │ Tmid-T0  Date Time            Site         Coord (J2000)                      Filt.    Expt.  Limit    Comment
     │ Int64    String31             String31     String                             String7  Int64  Float64  String15
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │   22662  2023-06-19 09:49:13  MASTER-      (23h 42m 22.90s , +81d 39m 09.5s)  C          180     18.7
   2 │   22862  2023-06-19 09:52:33  MASTER-      (00h 35m 00.64s , +81d 38m 18.1s)  C          180     16.6
   3 │   23062  2023-06-19 09:55:53  MASTER-      (23h 45m 21.09s , +79d 46m 41.0s)  C          180     18.5
   4 │   23263  2023-06-19 09:59:13  MASTER-      (00h 28m 33.71s , +79d 44m 16.3s)  C          180     16.1
   5 │   23463  2023-06-19 10:02:34  MASTER-      (23h 41m 51.58s , +81d 41m 09.0s)  C          180     18.4
   6 │   23664  2023-06-19 10:05:55  MASTER-      (00h 34m 39.01s , +81d 38m 46.3s)  C          180     16.4
   7 │   23865  2023-06-19 10:09:15  MASTER-      (23h 44m 51.68s , +79d 45m 15.0s)  C          180     18.5
   8 │   23938  2023-06-19 10:10:29  MASTER-OAFA  (04h 21m 32.87s , -20d 16m 47.6s)  C          180     12.5
   9 │   24065  2023-06-19 10:12:36  MASTER-      (00h 28m 16.55s , +79d 43m 58.3s)  C          180     16.3
julia> df=CSV.read(IOBuffer(cltxt), DataFrame, delim='|', skipto=3, dateformat="yyyy-mm-dd HH:MM:SS")
9×8 DataFrame
 Row │ Tmid-T0  Date Time            Site         Coord (J2000)                      Filt.    Expt.  Limit    Comment
     │ Int64    Dates.DateTime       String31     String                             String7  Int64  Float64  String15
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │   22662  2023-06-19T09:49:13  MASTER-      (23h 42m 22.90s , +81d 39m 09.5s)  C          180     18.7
   2 │   22862  2023-06-19T09:52:33  MASTER-      (00h 35m 00.64s , +81d 38m 18.1s)  C          180     16.6
   3 │   23062  2023-06-19T09:55:53  MASTER-      (23h 45m 21.09s , +79d 46m 41.0s)  C          180     18.5
   4 │   23263  2023-06-19T09:59:13  MASTER-      (00h 28m 33.71s , +79d 44m 16.3s)  C          180     16.1
   5 │   23463  2023-06-19T10:02:34  MASTER-      (23h 41m 51.58s , +81d 41m 09.0s)  C          180     18.4
   6 │   23664  2023-06-19T10:05:55  MASTER-      (00h 34m 39.01s , +81d 38m 46.3s)  C          180     16.4
   7 │   23865  2023-06-19T10:09:15  MASTER-      (23h 44m 51.68s , +79d 45m 15.0s)  C          180     18.5
   8 │   23938  2023-06-19T10:10:29  MASTER-OAFA  (04h 21m 32.87s , -20d 16m 47.6s)  C          180     12.5
   9 │   24065  2023-06-19T10:12:36  MASTER-      (00h 28m 16.55s , +79d 43m 58.3s)  C          180     16.3
Many earlier GCNs don't have | as delimiter. How can I make the code work for them as well? Can you send me some links about regex tokens in Julia (\n, \w, \s, etc.)? I still have difficulty learning regular expressions even after reading the Julia manual. My final code looks like:
using HTTP, DataFrames, CSV

function doanalysis()
    dfg = nothing
    for x in 34020:34038
        print("\r peeking at GCN $x ")
        try
            url = "https://gcn.nasa.gov/circulars/$x/raw"
            resp = HTTP.get(url)
            status = resp.status
            print(" ", status, " ")
            if status == 404; println("status=", status); continue; end
            txt = String(resp.body)
            if occursin("V. Lipunov", txt)
                println(" MASTER report")
                he = first(findfirst(r"^Tmid"im, txt))
                lr = first(findnext("\nFilter", txt, he)) - 1
                cltxt = txt[he:lr]
                df = CSV.read(IOBuffer(cltxt), DataFrame, delim='|', skipto=3)
                df.x = [x for i in 1:nrow(df)]
                if isnothing(dfg)
                    dfg = df
                else
                    dfg = vcat(dfg, df)
                end # if x is first
                CSV.write("data-$(x).csv", df)
            end # if occursin
        catch e
            println("error ")
        end # trycatch
    end # for loop
    println()
    if !isnothing(dfg)
        CSV.write("data-all.csv", dfg)
    else
        @info "no dfg to write"
    end # !isnothing
end # function doanalysis

doanalysis()
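On the regex tokens asked about above: Julia's r"..." literals use PCRE syntax, so \n is a literal newline, \w a word character, \s whitespace, and \d a digit; an m after the closing quote makes ^ and $ anchor at line boundaries, and i makes matching case-insensitive. A small Base-Julia illustration (sample string invented):

```julia
txt = "Tmid-T0 | Date\n22662 | 2023-06-19"

# \d{4}-\d{2}-\d{2}: four digits, dash, two digits, dash, two digits
occursin(r"\d{4}-\d{2}-\d{2}", txt)    # true

# (\w+) captures the word characters before "-T0" ('-' is not a \w char)
match(r"(\w+)-T0", txt).captures[1]    # "Tmid"

# with the m flag, ^ anchors at the start of each line, not just the string
occursin(r"^\d+ \|"m, txt)             # true: matches the second line
```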
Can I use the readuntil function instead of the lr line above?
Although I've heard good things about it, I've never had a chance to use this function.
From what I understand, unlike findfirst (which gives you the position of the part of the string you are interested in), readuntil captures the string from the start of the stream up to the delimiter (i.e. the part of the string implicitly dropped by findfirst).
readuntil(IOBuffer(txt),"\nTmid-T0")
It may be used after the first findfirst …
readuntil(IOBuffer(txt[he:end]),"\nFilter")
# lr=first(findnext("\nFilter",txt,he))-1
# cltxt=txt[he:lr]
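A toy comparison of the two approaches on an invented string; both should recover the same slice:

```julia
txt = "header\nTmid-T0 table rows here\nFilter info"

# index arithmetic with findfirst/findnext
he = first(findfirst("Tmid-T0", txt))
lr = first(findnext("\nFilter", txt, he)) - 1
a  = txt[he:lr]

# readuntil consumes the stream up to (and excluding) the delimiter
b = readuntil(IOBuffer(txt[he:end]), "\nFilter")

a == b   # true
```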
@rocco_sprmnt21 @rafael.guerra @algunion
Can you please tell me how to scrape tabular data from the GROND telescope circulars? My code for GCN 21200 looks like:
using HTTP, CSV, DataFrames

function doanalysis()
    dfg = nothing
    for x in 21200
        print("\r peeking at GCN $x ")
        try
            url = "https://gcn.nasa.gov/circulars/$x/raw"
            resp = HTTP.get(url)
            status = resp.status
            print(" ", status, " ")
            if status == 404; println("status=", status); continue; end
            txt = String(resp.body)
            if occursin(r"GRB ?\d{6}([A-G]|(\.\d{2}))?", txt)
                m = match(r"GRB ?\d{6}([A-G]|(\.\d{2}))?", txt)
                print(m.match)
            end
            if occursin("GROND observations", txt)
                println("GROND report")
                he = first(findfirst(r"^(g'|r'|i'|z'|J|H|K) ="im, txt))
                lr = first(findnext(r"^(?:[\t ]*(?:\r?\n|\r))+"im, txt, he))
                cltxt = txt[he:lr]
                df = CSV.read(IOBuffer(cltxt), DataFrame, delim='\t', skipto=3, header=0)
                df.GCN = [x for i in 1:nrow(df)]
                df.GRB = [m.match for i in 1:nrow(df)]
                if isnothing(dfg)
                    dfg = df
                else
                    dfg = vcat(dfg, df)
                end
                if isnothing(dfg)      # note: this block repeats the one above
                    dfg = df
                else
                    @show dfg = vcat(dfg, df)
                end # if x is first
            end # if occursin
        catch e
            println("error ")
        end # trycatch
    end # for loop
end

doanalysis()
but I want to generalise it for other GROND telescope data extraction. It is printing the table twice.
If I understand correctly what you are looking for, I think these are the right tools.
Iโve never used them so far
Hey @raman_kumar, I am answering this because you mentioned me in your post.
I am eager to answer any of these types of questions:
- Julia-related questions (e.g., you don't understand some behavior/concept, getting an error, etc.)
- Juliaโs package-related questions (missing documentation, errors, package suggestions, etc.)
Usually, if not a conceptual question, you must include certain MWE that fails or produces some unexpected output.
However, in your scenario, it seems like your above request doesn't fit any of the above scenarios: to answer this successfully, somebody needs to go and do the work of understanding the HTML structure and then think about a way to parse that (and finally either work to adapt your code or provide specific instructions concerning the code you should write).
Maybe others might find this a legitimate question and might want to invest the time and answer you - however, my feeling is that it is always better to help somebody by contributing to refining their fishing tools instead of doing the fishing (or even part of the fishing) for them.
My advice for you is to go deeper into the structure of the HTML you want to parse and then start working on adapting your existing code - if your attempt fails and you get either errors or unexpected results, we might take it from there and answer targeted issues with your work.
From what I was able to conclude from inspecting 1-2 HTML files, the content is entirely raw text under a single div - so HTML/CSS-related tools will not help much past retrieving the big text chunk - you might actually need to parse the text and extract the relevant data in the desired format.
I want to extract the text starting from g' to the end of the K row in the image below: https://gcn.nasa.gov/circulars/21200?query=GROND
Please tell me about the changes I need to make in the code below.
he=first(findfirst(r"^(g'|r'|i'|z'|J|H|K)"im,txt))
lr=first(findnext(r"^(?:[\t ]*(?:\r?\n|\r))+"im,txt,he))
@rocco_sprmnt21
You could try it like this, but I don't know if it's generic (yet specific) enough for all your cases.
You have to look at it with a little patience.
he=first(findfirst(r"\n\n^(\w')"im,txt))+2
lr=first(findnext(r"(\.)\n\n"m,txt,he))-1
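A quick sanity check of those two patterns on an inlined fragment shaped like a GROND report (text invented for illustration):

```julia
txt = "GROND observations\n\ng' = 20.75 +/- 0.08 mag,\nr' = 20.46 +/- 0.05 mag.\n\nThe end."

# \n\n^(\w') : a blank line followed by a line starting with a letter and a prime
he = first(findfirst(r"\n\n^(\w')"im, txt)) + 2
# (\.)\n\n  : the final period just before the next blank line
lr = first(findnext(r"(\.)\n\n"m, txt, he)) - 1

print(txt[he:lr])   # the g'/r' block, without the trailing period
```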
using HTTP, CSV, DataFrames

function doanalysis()
    dfg = nothing
    for x in 21200
        print("\r peeking at GCN $x ")
        try
            url = "https://gcn.nasa.gov/circulars/$x/raw"
            resp = HTTP.get(url)
            status = resp.status
            print(" ", status, " ")
            if status == 404; println("status=", status); continue; end
            txt = String(resp.body)
            if occursin(r"GRB ?\d{6}([A-G]|(\.\d{2}))?", txt)
                m = match(r"GRB ?\d{6}([A-G]|(\.\d{2}))?", txt)
                print(m.match)
            end
            if occursin("GROND observations", txt)
                println(" GROND report")
                he = first(findfirst(r"^(g'|r'|i'|z'|J|H|K)"m, txt))
                lr = first(findnext(r"^(?:[\t ]*(?:\r?\n|\r))+"m, txt, he))
                cltxt = replace(txt[he:lr], r" ?(=|>)" => "|", "+/-" => "|")
                df = CSV.read(IOBuffer(cltxt), DataFrame, delim="|", header=0)
                df.GCN = [x for i in 1:nrow(df)]
                df.GRB = [m.match for i in 1:nrow(df)]
                if isnothing(dfg)
                    @show dfg = df
                else
                    @show dfg = vcat(dfg, df)
                end # if x is first
            end # if occursin
        catch e
            println("error ")
        end # trycatch
    end # for loop
end

doanalysis()
which gives the output shown below for GCN 21200.
The missing lines were due to the kwarg skipto=3; the duplication most likely comes from the if/then/else you wrote.
Write a script with no checks, test that it does what you want, then add the checks bit by bit and test them one by one.
julia> function doanalysis()
           dfg = nothing
           for x in 30574
               print("\r peeking at GCN $x ")
               try
                   url = "https://gcn.nasa.gov/circulars/$x/raw"
                   resp = HTTP.get(url)
                   status = resp.status
                   print(" ", status, " ")
                   if status == 404; println("status=", status); continue; end
                   txt = String(resp.body)
                   if occursin(r"GRB ?\d{6}([A-G]|(\.\d{2}))?", txt)
                       m = match(r"GRB ?\d{6}([A-G]|(\.\d{2}))?", txt)
                       print(m.match)
                   end
                   if occursin("GROND observations", txt)
                       println("GROND report")
                       he = first(findfirst(r"\n\n^(\w')"im, txt)) + 2
                       lr = first(findnext(r"(\.)\n\n"m, txt, he)) - 1
                       cltxt = txt[he:lr]
                       df = CSV.read(IOBuffer(cltxt), DataFrame, delim='\t', header=0)
                       df.GCN = [x for i in 1:nrow(df)]
                       df.GRB = [m.match for i in 1:nrow(df)]
julia> doanalysis()
peeking at GCN 30574 200 GRB 210731AGROND report
dfg = vcat(dfg, df) = 14×3 DataFrame
 Row │ Column1                      GCN    GRB
     │ String31                     Int64  SubStrin…
─────┼───────────────────────────────────────────────
   1 │ g' = 18.71 +/- 0.01 mag,     30574  GRB 210731A
   2 │ r' = 18.44 +/- 0.01 mag,     30574  GRB 210731A
   3 │ i' = 18.19 +/- 0.01 mag,     30574  GRB 210731A
   4 │ z' = 18.01 +/- 0.01 mag,     30574  GRB 210731A
   5 │ J = 17.66 +/- 0.02 mag,      30574  GRB 210731A
   6 │ H = 17.38 +/- 0.02 mag, and  30574  GRB 210731A
   7 │ K = 17.14 +/- 0.15 mag       30574  GRB 210731A
   8 │ g' = 18.71 +/- 0.01 mag,     30574  GRB 210731A
   9 │ r' = 18.44 +/- 0.01 mag,     30574  GRB 210731A
  10 │ i' = 18.19 +/- 0.01 mag,     30574  GRB 210731A
  11 │ z' = 18.01 +/- 0.01 mag,     30574  GRB 210731A
  12 │ J = 17.66 +/- 0.02 mag,      30574  GRB 210731A
  13 │ H = 17.38 +/- 0.02 mag, and  30574  GRB 210731A
  14 │ K = 17.14 +/- 0.15 mag       30574  GRB 210731A
No, please see my final code in the last edited post. Now my code is working fine. Your code is still producing a doubled table: 14 rows instead of the actual 7 rows in the text.
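The doubling itself is easy to reproduce without any scraping: the two copy-pasted if isnothing(dfg) blocks append df twice, so every row shows up two times. A plain-vector sketch of the effect:

```julia
df  = ["g' = 18.71", "r' = 18.44"]   # stand-in for one parsed table

dfg = nothing
dfg = isnothing(dfg) ? df : vcat(dfg, df)   # first if-block: dfg takes df
dfg = isnothing(dfg) ? df : vcat(dfg, df)   # duplicated block: appends df again

length(dfg)   # 4 rows instead of 2
```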
using HTTP, CSV, DataFrames, JSON3

julia> function scrapjson(gcn)
           grurl = "https://gcn.nasa.gov/circulars/$gcn/json"
           gresp = HTTP.get(grurl)
           js1 = JSON3.read(IOBuffer(String(gresp.body)))
           txt = js1.body
           he = first(findfirst(r"\n\n^( *\w')"im, txt)) + 2
           lr = first(findnext(r"\n\n"m, txt, he)) - 1
           cltxt = txt[he:lr]
           df = CSV.read(IOBuffer(cltxt), DataFrame, header=0)
           if startswith(js1.subject, "GRB")
               df.GRB .= readuntil(IOBuffer(js1.subject), ',')
           else
               grb = findfirst(r"GRB *\d+\w", js1.subject)
               df.GRB .= js1.subject[grb]
           end
           df.GCN .= gcn
           "Column2" ∈ names(df) ? df[:, Not(:Column2)] : df
       end
scrapjson (generic function with 1 method)
julia> df1=scrapjson(31522)
6×3 DataFrame
 Row │ Column1    GRB          GCN
     │ String15   String       Int64
─────┼────────────────────────────────
   1 │ g' > 23.2  GRB 220117A  31522
   2 │ r' > 23.4  GRB 220117A  31522
   3 │ i' > 23.0  GRB 220117A  31522
   4 │ J > 21.4   GRB 220117A  31522
   5 │ H > 21.1   GRB 220117A  31522
   6 │ K > 19.3.  GRB 220117A  31522
julia> df2=scrapjson(32383)
7×3 DataFrame
 Row │ Column1                      GRB          GCN
     │ String31                     String       Int64
─────┼─────────────────────────────────────────────────
   1 │ g' > 23.0                    GRB 220711B  32383
   2 │ r' > 23.5                    GRB 220711B  32383
   3 │ i' > 23.2                    GRB 220711B  32383
   4 │ z' > 19.8                    GRB 220711B  32383
   5 │ J > 21.7                     GRB 220711B  32383
   6 │ H > 21.1                     GRB 220711B  32383
   7 │ K > 20.1 (AB mag; 3 sigma).  GRB 220711B  32383
julia> df3=scrapjson(21200)
7×3 DataFrame
 Row │ Column1              GRB         GCN
     │ String31             String      Int64
─────┼────────────────────────────────────────
   1 │ g' = 20.75 +/- 0.08  GRB170604A  21200
   2 │ r' = 20.46 +/- 0.05  GRB170604A  21200
   3 │ i' = 20.35 +/- 0.06  GRB170604A  21200
   4 │ z' = 20.21 +/- 0.08  GRB170604A  21200
   5 │ J = 19.6 +/- 0.1     GRB170604A  21200
   6 │ H = 19.6 +/- 0.3     GRB170604A  21200
   7 │ K > 19.6             GRB170604A  21200
I tried to automate the table-capture operation a little more, using these packages (maybe someone who has experience with how to do these things can step in and give some indications on how to do it "better").
Since the source of the data is very "messy", there is still a little tinkering needed to manage the "recoverable" situations.
urlb = "https://gcn.nasa.gov/circulars?query=GROND" # page=1&limit=100
# urlq = "https://gcn.nasa.gov/circulars?page=2&limit=50&query=GROND"
using HTTP, CSV, DataFrames, JSON3
using Cascadia, Gumbo

gresp = HTTP.get(urlb)
h = parsehtml(String(gresp.body))
s1 = sel"ol li" # this one is good!
qs = eachmatch(s1, h.root)
res = Tuple{String, String}[]
for q in qs
    txt = q.children[1].children[1].text
    if contains(txt, "GROND") && contains(txt, "GRB")
        push!(res, (q.attributes["value"], txt))
    end
end
#----------
#------------
function scrapjson(gcn)
    grurl = "https://gcn.nasa.gov/circulars/$gcn/json"
    gresp = HTTP.get(grurl)
    js1 = JSON3.read(IOBuffer(String(gresp.body)))
    txt = js1.body
    he = first(findfirst(r"\n\n^( *g'*)"m, txt)) + 2
    lr = first(findnext(r"\n\n"m, txt, he)) - 1
    cltxt = replace(txt[he:lr], ' ' => "")
    df = CSV.read(IOBuffer(cltxt), DataFrame, header=0)
    ptn = r"(GRB *\d+\w)[:|,]*"
    df.GRB .= match(ptn, js1.subject).match
    df.GCN .= gcn
    "Column2" ∈ names(df) ? df[:, Not(:Column2)] : df
end

df1 = scrapjson(31522)
df2 = scrapjson(30703)
df3 = scrapjson(26066)
df4 = scrapjson(23814)

for (gcn, _) in res[1:25]
    try println(scrapjson(gcn)) catch e; println("\n" * gcn * "--->NOK\n") end
end
Some results
julia> for (gcn, _) in res[1:25]
           try println(scrapjson(gcn)) catch e; println("\n" * gcn * "--->NOK\n") end
       end
7×3 DataFrame
 Row │ Column1                GRB          GCN
     │ String31               SubStrin…    String
─────┼────────────────────────────────────────────
   1 │ g'>23.0                GRB 220711B  32383
   2 │ r'>23.5                GRB 220711B  32383
   3 │ i'>23.2                GRB 220711B  32383
   4 │ z'>19.8                GRB 220711B  32383
   5 │ J>21.7                 GRB 220711B  32383
   6 │ H>21.1                 GRB 220711B  32383
   7 │ K>20.1(ABmag;3sigma).  GRB 220711B  32383
6×3 DataFrame
 Row │ Column1  GRB          GCN
     │ String7  SubStrin…    String
─────┼──────────────────────────────
   1 │ g'>25.2  GRB 220706A  32339
   2 │ r'>24.9  GRB 220706A  32339
   3 │ i'>24.2  GRB 220706A  32339
   4 │ J>21.7   GRB 220706A  32339
   5 │ H>21.0   GRB 220706A  32339
   6 │ K>19.9.  GRB 220706A  32339
7×3 DataFrame
 Row │ Column1          GRB          GCN
     │ String15         SubStrin…    String
─────┼──────────────────────────────────────
   1 │ g'=23.31+/-0.11  GRB 220627A  32304
   2 │ r'=22.70+/-0.06  GRB 220627A  32304
   3 │ i'=22.50+/-0.12  GRB 220627A  32304
   4 │ z'=22.23+/-0.21  GRB 220627A  32304
   5 │ J>21.4           GRB 220627A  32304
   6 │ H>20.6           GRB 220627A  32304
   7 │ K>19.7.          GRB 220627A  32304
6×3 DataFrame
 Row │ Column1  GRB          GCN
     │ String7  SubStrin…    String
─────┼──────────────────────────────
   1 │ g'>23.2  GRB 220117A  31522
   2 │ r'>23.4  GRB 220117A  31522
   3 │ i'>23.0  GRB 220117A  31522
   4 │ J>21.4   GRB 220117A  31522
   5 │ H>21.1   GRB 220117A  31522
   6 │ K>19.3.  GRB 220117A  31522
7×3 DataFrame
 Row │ Column1  GRB          GCN
     │ String7  SubStrin…    String
─────┼──────────────────────────────
   1 │ g'>24.8  GRB 211106A  31069
   2 │ r'>25.0  GRB 211106A  31069
   3 │ i'>24.0  GRB 211106A  31069
   4 │ z'>22.0  GRB 211106A  31069
   5 │ J>21.7   GRB 211106A  31069
   6 │ H>21.1   GRB 211106A  31069
   7 │ K>19.8.  GRB 211106A  31069
7×3 DataFrame
 Row │ Column1        GRB          GCN
     │ String15       SubStrin…    String
─────┼────────────────────────────────────
   1 │ g'>24.3        GRB 210905A  30781
   2 │ r'>24.3        GRB 210905A  30781
   3 │ i'>23.8        GRB 210905A  30781
   4 │ z'=21.6+/-0.2  GRB 210905A  30781
   5 │ J=20.2+/-0.2   GRB 210905A  30781
   6 │ H=20.1+/-0.2   GRB 210905A  30781
   7 │ K>18.2.        GRB 210905A  30781
6×3 DataFrame
 Row │ Column1  GRB          GCN
     │ String7  SubStrin…    String
─────┼──────────────────────────────
   1 │ g'>23.5  GRB 210901A  30755
   2 │ r'>23.6  GRB 210901A  30755
   3 │ i'>22.6  GRB 210901A  30755
   4 │ J>20.1   GRB 210901A  30755
   5 │ H>19.7   GRB 210901A  30755
   6 │ K>16.3.  GRB 210901A  30755
3×3 DataFrame
 Row │ Column1           GRB          GCN
     │ String31          SubStrin…    String
─────┼───────────────────────────────────────
   1 │ g'=20.37+/-0.09   GRB 210822A  30703
   2 │ r'=20.10+/-0.05   GRB 210822A  30703
   3 │ i'=19.92+/-0.05.  GRB 210822A  30703
1×3 DataFrame
 Row │ Column1  GRB          GCN
     │ String7  SubStrin…    String
─────┼──────────────────────────────
   1 │ g'>22.6  GRB 210820A  30695
30584--->NOK
7×3 DataFrame
 Row │ Column1             GRB          GCN
     │ String31            SubStrin…    String
─────┼─────────────────────────────────────────
   1 │ g'=18.71+/-0.01mag  GRB 210731A  30574
   2 │ r'=18.44+/-0.01mag  GRB 210731A  30574
   3 │ i'=18.19+/-0.01mag  GRB 210731A  30574
   4 │ z'=18.01+/-0.01mag  GRB 210731A  30574
   5 │ J=17.66+/-0.02mag   GRB 210731A  30574
   6 │ H=17.38+/-0.02mag   GRB 210731A  30574
   7 │ K=17.14+/-0.15mag.  GRB 210731A  30574
4×3 DataFrame
 Row │ Column1   GRB          GCN
     │ String15  SubStrin…    String
─────┼───────────────────────────────
   1 │ g�>23.3   GRB 191004A  26324
   2 │ r�>23.7   GRB 191004A  26324
   3 │ i�>23.1   GRB 191004A  26324
   4 │ z�>22.7   GRB 191004A  26324
7×3 DataFrame
 Row │ Column1          GRB        GCN
     │ String15         SubStrin…  String
─────┼────────────────────────────────────
   1 │ g'=16.81+/-0.03  GRB191016  26176
   2 │ r'=16.33+/-0.03  GRB191016  26176
   3 │ i'=15.84+/-0.04  GRB191016  26176
   4 │ z'=15.51+/-0.04  GRB191016  26176
   5 │ J=15.28+/-0.05   GRB191016  26176
   6 │ H=14.80+/-0.05   GRB191016  26176
   7 │ K=14.83+/-0.08   GRB191016  26176
7×3 DataFrame
 Row │ Column1     GRB          GCN
     │ String15    SubStrin…    String
─────┼─────────────────────────────────
   1 │ g'>25.5mag  GRB 191024A  26066
   2 │ r'>25.6mag  GRB 191024A  26066
   3 │ i'>24.8mag  GRB 191024A  26066
   4 │ z'>23.4mag  GRB 191024A  26066
   5 │ J>21.9mag   GRB 191024A  26066
   6 │ H>21.4mag   GRB 191024A  26066
   7 │ K>20.2mag   GRB 191024A  26066
26042--->NOK
25992--->NOK
25960--->NOK
7×3 DataFrame
 Row │ Column1     GRB          GCN
     │ String15    SubStrin…    String
─────┼─────────────────────────────────
   1 │ g'>23.7mag  GRB 191004A  25959
   2 │ r'>24.2mag  GRB 191004A  25959
   3 │ i'>23.5mag  GRB 191004A  25959
   4 │ z'>23.3mag  GRB 191004A  25959
   5 │ J>21.4mag   GRB 191004A  25959
   6 │ H>20.4mag   GRB 191004A  25959
   7 │ K>19.9mag   GRB 191004A  25959
25791--->NOK
25789--->NOK
25652--->NOK
25651--->NOK
7×3 DataFrame
 Row │ Column1            GRB          GCN
     │ String31           SubStrin…    String
─────┼────────────────────────────────────────
   1 │ g���=20.30+/-0.03  GRB 190829A  25569
   2 │ r���=19.34+/-0.03  GRB 190829A  25569
   3 │ i���=18.77+/-0.03  GRB 190829A  25569
   4 │ z���=18.21+/-0.03  GRB 190829A  25569
   5 │ J=17.34+/-0.06     GRB 190829A  25569
   6 │ H=16.68+/-0.06     GRB 190829A  25569
   7 │ Ks=16.40+/-0.08.   GRB 190829A  25569
7×3 DataFrame
 Row │ Column1             GRB          GCN
     │ String31            SubStrin…    String
─────┼─────────────────────────────────────────
   1 │ g'=22.79+/-0.05mag  GRB 190613B  24831
   2 │ r'=22.05+/-0.04mag  GRB 190613B  24831
   3 │ i'=21.56+/-0.05mag  GRB 190613B  24831
   4 │ z'=21.15+/-0.07mag  GRB 190613B  24831
   5 │ J=20.5+/-0.1mag     GRB 190613B  24831
   6 │ H=20.2+/-0.2mag     GRB 190613B  24831
   7 │ K>19.8mag           GRB 190613B  24831
7×3 DataFrame
 Row │ Column1          GRB          GCN
     │ String15         SubStrin…    String
─────┼──────────────────────────────────────
   1 │ g<23.0mag        GRB 190129B  23814
   2 │ r=22.7+/-0.3mag  GRB 190129B  23814
   3 │ i=21.8+/-0.3mag  GRB 190129B  23814
   4 │ z=21.7+/-0.4mag  GRB 190129B  23814
   5 │ J=20.4+/-0.4mag  GRB 190129B  23814
   6 │ H=19.1+/-0.2mag  GRB 190129B  23814
   7 │ K=19.0+/-0.4mag  GRB 190129B  23814
PS
You can, of course, adapt the scripts to dig into subsequent pages as well
I don't understand anything about the contents of the tables, but I think that in this form the data is easier to read
julia> urlb="https://gcn.nasa.gov/circulars?query=GROND" # page=1&limit=100
"https://gcn.nasa.gov/circulars?query=GROND"
julia> using HTTP , CSV, DataFrames, JSON3
julia> using Cascadia, Gumbo
julia> gresp = HTTP.get(urlb);
julia> h = parsehtml(String(gresp.body));
julia> s1 = sel"ol li" # this one is good!
Selector(Cascadia.var"#51#52"{Selector, Selector}(Selector(Cascadia.var"#5#6"{String}("ol")), Selector(Cascadia.var"#5#6"{String}("li"))))
julia> qs = eachmatch(s1,h.root);
julia> res=Tuple{String, String}[]
Tuple{String, String}[]
julia> for q in qs
           txt = q.children[1].children[1].text
           if contains(txt, "GROND") && contains(txt, "GRB")
               push!(res, (q.attributes["value"], txt))
           end
       end
julia> function scrapjson(gcn)
           grurl = "https://gcn.nasa.gov/circulars/$gcn/json"
           gresp = HTTP.get(grurl)
           js1 = JSON3.read(IOBuffer(String(gresp.body)))
           txt = js1.body
           he = first(findfirst(r"\n\n^( *g'*)"m, txt)) + 2
           lr = first(findnext(r"\n\n"m, txt, he)) - 1
           cltxt = replace(txt[he:lr], ' ' => "")
           df = CSV.read(IOBuffer(cltxt), DataFrame, header=0)
           ptn = r"(GRB *\d+\w)[:|,]*"
           df.GRB .= match(ptn, js1.subject)[1]
           df.GCN .= gcn
           "Column2" ∈ names(df) ? df[:, Not(:Column2)] : df
       end
scrapjson (generic function with 1 method)
julia> df=scrapjson(res[1][1]);
julia> dfnok=DataFrame(GCN=String[])
0×1 DataFrame
 Row │ GCN
     │ String
─────┴────────
julia> for (gcn, _) in res[2:25]
           try df = vcat(df, scrapjson(gcn), cols=:union) catch e; push!(dfnok, (GCN=gcn,)) end
       end
julia> df.mag = replace.(df.Column1, "'" => "", '�' => "");
julia> df=select(df,[:GCN, :GRB],:mag=>ByRow(x->[x[1],x[2:end]])=>[:tel,:mag1]);
julia> udf=unstack(df,[:GCN, :GRB],:tel,:mag1)
17×9 DataFrame
...
julia> vcat(udf,dfnok,cols=:union)
25×9 DataFrame
 Row │ GCN     GRB          g                 r                 i                 z                 J                 H            ⋯
     │ String  SubStrin…?   String?           String?           String?           String?           String?           String?      ⋯
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 32383   GRB 220711B  >23.0             >23.5             >23.2             >19.8             >21.7             >21.1        ⋯
   2 │ 32339   GRB 220706A  >25.2             >24.9             >24.2             missing           >21.7             >21.0
   3 │ 32304   GRB 220627A  =23.31+/-0.11     =22.70+/-0.06     =22.50+/-0.12     =22.23+/-0.21     >21.4             >20.6
   4 │ 31522   GRB 220117A  >23.2             >23.4             >23.0             missing           >21.4             >21.1
   5 │ 31069   GRB 211106A  >24.8             >25.0             >24.0             >22.0             >21.7             >21.1        ⋯
   6 │ 30781   GRB 210905A  >24.3             >24.3             >23.8             =21.6+/-0.2       =20.2+/-0.2       =20.1+/-0.2
   7 │ 30755   GRB 210901A  >23.5             >23.6             >22.6             missing           >20.1             >19.7
   8 │ 30703   GRB 210822A  =20.37+/-0.09     =20.10+/-0.05     =19.92+/-0.05.    missing           missing           missing
   9 │ 30695   GRB 210820A  >22.6             missing           missing           missing           missing           missing      ⋯
  10 │ 30574   GRB 210731A  =18.71+/-0.01mag  =18.44+/-0.01mag  =18.19+/-0.01mag  =18.01+/-0.01mag  =17.66+/-0.02mag  =17.38+/-0.
  11 │ 26324   GRB 191004A  >23.3             >23.7             >23.1             >22.7             missing           missing
  ⋮  │   ⋮          ⋮              ⋮                 ⋮                 ⋮                 ⋮                 ⋮                ⋮        ⋱
  16 │ 24831   GRB 190613B  =22.79+/-0.05mag  =22.05+/-0.04mag  =21.56+/-0.05mag  =21.15+/-0.07mag  =20.5+/-0.1mag    =20.2+/-0.2
  17 │ 23814   GRB 190129B  <23.0mag          =22.7+/-0.3mag    =21.8+/-0.3mag    =21.7+/-0.4mag    =20.4+/-0.4mag    =19.1+/-0.2  ⋯
  18 │ 30584   missing      missing           missing           missing           missing           missing           missing
  19 │ 26042   missing      missing           missing           missing           missing           missing           missing
  20 │ 25992   missing      missing           missing           missing           missing           missing           missing
  21 │ 25960   missing      missing           missing           missing           missing           missing           missing      ⋯
  22 │ 25791   missing      missing           missing           missing           missing           missing           missing
  23 │ 25789   missing      missing           missing           missing           missing           missing           missing
  24 │ 25652   missing      missing           missing           missing           missing           missing           missing
  25 │ 25651   missing      missing           missing           missing           missing           missing           missing      ⋯
                                                                                                      2 columns and 4 rows omitted
Why are you using JSON3? Is there any special advantage over the code in my 12th post in this discussion?
julia> js1=JSON3.read(IOBuffer(String(gresp.body)))
JSON3.Object{Base.CodeUnits{UInt8, String}, Vector{UInt64}} with 6 entries:
:subject => "GROND observations of GRB 180620B"
:createdOn => 1529579936000
:submitter => "Patricia Schady at MPE/Swift <pschady@mpe.mpg.de>"
:circularId => 22819
:email => "pschady@mpe.mpg.de"
:body => "Tassilo Schweyer and Patricia Schady (MPE Garching) report:\n\nWe observed the field of GRB 180620B (Swift trigg
It is not essential, but it seemed more convenient to me to access the :subject and :body fields (and you could easily add the info about the :submitter and others) to obtain the information to put in the tables.
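For reference, a minimal JSON3 sketch on an invented payload shaped like the /json endpoint response; the parsed object exposes the JSON keys as properties, which is the convenience being described:

```julia
using JSON3

# toy circular, same shape as the /json endpoint (values invented)
raw = """{"subject":"GROND observations of GRB 180620B","circularId":22819,"body":"report text"}"""

js = JSON3.read(raw)
println(js.subject)      # GROND observations of GRB 180620B
println(js.circularId)   # 22819
```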