Scrape a table from the NASA GCN Circulars website

@rocco_sprmnt21 @rafael.guerra @algunion
Can you please tell me how to scrape tabular data from GROND telescope circulars? My code for GCN 21200 looks like this :point_down:

using HTTP, CSV, DataFrames
function doanalysis()
    dfg = nothing
    for x in 21200
        print("\r peeking at GCN $x ")
        try
            url = "https://gcn.nasa.gov/circulars/$x/raw"
            resp = HTTP.get(url)
            status = resp.status
            print(" ", status, " ")
            if status == 404; println("status=", status); continue; end
            txt = String(resp.body)
            if occursin(r"GRB ?\d{6}([A-G]|(\.\d{2}))?", txt)
                m = match(r"GRB ?\d{6}([A-G]|(\.\d{2}))?", txt)
                print(m.match)
            end

            if occursin("GROND observations", txt)
                println("GROND report")

                he = first(findfirst(r"^(g'|r'|i'|z'|J|H|K) ="im, txt))
                lr = first(findnext(r"^(?:[\t ]*(?:\r?\n|\r))+"im, txt, he))
                cltxt = txt[he:lr]

                df = CSV.read(IOBuffer(cltxt), DataFrame, delim='\t', skipto=3, header=0)
                df.GCN = [x for i in 1:nrow(df)]
                df.GRB = [m.match for i in 1:nrow(df)]
                if isnothing(dfg)
                    dfg = df
                else
                    dfg = vcat(dfg, df)
                end
                if isnothing(dfg)
                    dfg = df
                else
                    @show dfg = vcat(dfg, df)
                end # if x is first
            end # if occursin
        catch e
            println("error ")
        end # trycatch
    end # for loop
end
doanalysis()

But I want to generalise it to extract data from other GROND circulars. It is also printing the table twice :upside_down_face:.

If I understand correctly what you are looking for, I think these are the right tools.
I’ve never used them so far

Hey @raman_kumar, I am answering this because you mentioned me in your post.

I am eager to answer any of these types of questions:

  1. Julia-related questions (e.g., you don’t understand some behavior/concept, getting an error, etc.)
  2. Julia’s package-related questions (missing documentation, errors, package suggestions, etc.)

Usually, unless it is a conceptual question, you should include an MWE (minimal working example) that fails or produces unexpected output.

However, your request doesn’t seem to fit either of the above categories: to answer it successfully, somebody needs to do the work of understanding the HTML structure and then think of a way to parse it (and finally either adapt your code or give specific instructions about the code you should write).

Maybe others might find this a legitimate question and might want to invest the time and answer you - however, my feeling is that it is always better to help somebody by contributing to refining their fishing tools instead of doing the fishing (or even part of the fishing) for them.

My advice for you is to go deeper into the structure of the HTML you want to parse and then start working on adapting your existing code - if your attempt fails and you get either errors or unexpected results, we might take it from there and answer targeted issues with your work.

From inspecting one or two of the HTML files, the content appears to sit entirely as raw text under a single div, so HTML/CSS-related tools will not help much beyond retrieving that one big chunk of text. You will most likely need to parse the text itself and extract the relevant data in the desired format.
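As a rough illustration of that kind of text parsing, here is a minimal sketch that pulls the first block of band lines out of a plain-text body. The sample text is made up, `extract_block` is a hypothetical helper name, and the pattern assumes each table line starts with a band label followed by `=`, `<` or `>`; real circulars may not follow that layout.

```julia
# Sample text standing in for the body of a circular (made-up values).
txt = """
We observed the field with GROND and derive the following magnitudes:

g' = 20.75 +/- 0.08
r' = 20.46 +/- 0.05
J = 19.6 +/- 0.1
K > 19.6

Given magnitudes are not corrected for Galactic extinction.
"""

# Return the lines of the first block that starts with a band label at the
# beginning of a line and ends at the next blank line (or at the end of text).
function extract_block(txt::AbstractString)
    start = findfirst(r"^[ \t]*(?:[griz]'|[JHK]s?)[ \t]*[=<>]"m, txt)
    start === nothing && return String[]
    stop = findnext(r"\n[ \t]*\n", txt, first(start))
    block = stop === nothing ? txt[first(start):end] :
                               txt[first(start):prevind(txt, first(stop))]
    return String.(strip.(split(block, '\n')))
end

rows = extract_block(txt)
foreach(println, rows)
```

The function returns an empty vector when no table is found, which makes it easy to skip circulars without a recognisable block instead of relying on a `try`/`catch`.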


I want to extract the text from the g' row through the end of the K row in this circular: https://gcn.nasa.gov/circulars/21200?query=GROND

Please tell me what changes I need to make in the code below.

 he=first(findfirst(r"^(g'|r'|i'|z'|J|H|K)"im,txt))
 lr=first(findnext(r"^(?:[\t ]*(?:\r?\n|\r))+"im,txt,he))

@rocco_sprmnt21

You could try something like this, though I don’t know whether it is generic enough (while still specific enough) for all your cases.
You will have to work through it with a little patience.

he=first(findfirst(r"\n\n^(\w')"im,txt))+2
lr=first(findnext(r"(\.)\n\n"m,txt,he))-1

using HTTP, CSV, DataFrames
function doanalysis()
    dfg=nothing
    for x in 21200
    print("\r peeking at GCN $x ")
        try
            url = "https://gcn.nasa.gov/circulars/$x/raw"
            resp = HTTP.get(url) 
            status=resp.status
            print(" ",status," "); 
            if status == 404 ; println("status=",status); continue; end          
            txt = String(resp.body)
            if occursin(r"GRB ?\d{6}([A-G]|(\.\d{2}))?",txt)
                m=match(r"GRB ?\d{6}([A-G]|(\.\d{2}))?",txt)
                print(m.match)
            end

            if occursin("GROND observations", txt)
                println(" GROND report")                
                he=first(findfirst(r"^(g'|r'|i'|z'|J|H|K)"m,txt))
                lr=first(findnext(r"^(?:[\t ]*(?:\r?\n|\r))+"m,txt,he))
                cltxt=replace(txt[he:lr], r" ?(=|>)"=>"|" , "+/-"=>"|")
                df=CSV.read(IOBuffer(cltxt), DataFrame, delim="|" ,header=0)
                df.GCN=[x for i in 1:nrow(df)]
                df.GRB=[m.match for i in 1:nrow(df)]                  
                if isnothing(dfg)
                    @show dfg=df
                else
                    @show dfg=vcat(dfg,df)
                end # if x is first
            end # if occursin
        catch e
            println("error ")                    
        end # trycatch
    end # for loop
end
doanalysis()

This gives the output shown below for GCN 21200 :point_down:
image

The missing lines were due to the keyword argument skipto=3; the duplication most likely comes from the if/else blocks you added.
Write a script with no checks first, test that it does what you want, then add the checks bit by bit and test them one by one.
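To make the duplication concrete, here is a small sketch that uses plain vectors instead of DataFrames. `accumulate_rows` and the sample rows are hypothetical stand-ins, not part of the scripts in this thread; the point is only that running the "create or append" step twice for the same circular doubles the rows.

```julia
# Hypothetical stand-in for the rows parsed from one circular.
parsed = ["g' = 18.71", "r' = 18.44", "i' = 18.19"]

# Append `rows` to the accumulator, creating it on first use.
accumulate_rows(acc, rows) = isnothing(acc) ? copy(rows) : vcat(acc, rows)

# Step 1: a single accumulation per circular gives one copy of the rows.
acc = nothing
acc = accumulate_rows(acc, parsed)
@show length(acc)   # length(acc) = 3

# Step 2: running the same accumulation twice for the same circular
# (as two consecutive if/else blocks would) doubles the rows.
dup = nothing
dup = accumulate_rows(dup, parsed)
dup = accumulate_rows(dup, parsed)
@show length(dup)   # length(dup) = 6
```

Checking the row count after each added step, as in the two `@show` calls, is one simple way to test the checks one by one.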

julia> function doanalysis()
           dfg=nothing
           for x in 30574
               print("\r peeking at GCN $x ")
               try
                   url = "https://gcn.nasa.gov/circulars/$x/raw"
                   resp = HTTP.get(url)
                   status=resp.status
                   print(" ",status," ");
                   if status == 404 ; println("status=",status); continue; end
                   txt = String(resp.body)
                   if occursin(r"GRB ?\d{6}([A-G]|(\.\d{2}))?",txt)
                       m=match(r"GRB ?\d{6}([A-G]|(\.\d{2}))?",txt)
                       print(m.match)
                   end

                   if occursin("GROND observations", txt)
                       println("GROND report")

                       he=first(findfirst(r"\n\n^(\w')"im,txt))+2
                       lr=first(findnext(r"(\.)\n\n"m,txt,he))-1
                       cltxt=txt[he:lr]

                       df=CSV.read(IOBuffer(cltxt), DataFrame, delim='\t' ,header=0)
                       df.GCN=[x for i in 1:nrow(df)]
                       df.GRB=[m.match for i in 1:nrow(df)]
julia> doanalysis()
 peeking at GCN 30574  200 GRB 210731AGROND report
dfg = vcat(dfg, df) = 14×3 DataFrame
 Row │ Column1                       GCN    GRB
     │ String31                      Int64  SubStrin…
─────┼──────────────────────────────────────────────────
   1 │ g' = 18.71 +/- 0.01 mag,      30574  GRB 210731A
   2 │ r' = 18.44 +/- 0.01 mag,      30574  GRB 210731A
   3 │ i' = 18.19 +/- 0.01 mag,      30574  GRB 210731A
   4 │ z' = 18.01 +/- 0.01 mag,      30574  GRB 210731A
   5 │ J  = 17.66 +/- 0.02 mag,      30574  GRB 210731A
   6 │ H  = 17.38 +/- 0.02 mag, and  30574  GRB 210731A
   7 │ K  = 17.14 +/- 0.15 mag       30574  GRB 210731A
   8 │ g' = 18.71 +/- 0.01 mag,      30574  GRB 210731A
   9 │ r' = 18.44 +/- 0.01 mag,      30574  GRB 210731A
  10 │ i' = 18.19 +/- 0.01 mag,      30574  GRB 210731A
  11 │ z' = 18.01 +/- 0.01 mag,      30574  GRB 210731A
  12 │ J  = 17.66 +/- 0.02 mag,      30574  GRB 210731A
  13 │ H  = 17.38 +/- 0.02 mag, and  30574  GRB 210731A
  14 │ K  = 17.14 +/- 0.15 mag       30574  GRB 210731A

No, please see my final code in my last edited post. My code is working fine now. :laughing: Your code still gives a doubled table: 14 rows instead of the 7 rows in the text.


Your code does not parse GCN 21200 properly. :rofl: :joy: :stuck_out_tongue_winking_eye: And my code does not work for GCN 32383.

using HTTP , CSV, DataFrames, JSON3

julia> function scrapjson(gcn)
           grurl="https://gcn.nasa.gov/circulars/$gcn/json"
           gresp = HTTP.get(grurl)
           js1=JSON3.read(IOBuffer(String(gresp.body)))
           txt=js1.body
           he=first(findfirst(r"\n\n^( *\w')"im,txt))+2
           lr=first(findnext(r"\n\n"m,txt,he))-1
           cltxt=txt[he:lr]
           df=CSV.read(IOBuffer(cltxt), DataFrame, header=0)
           if startswith(js1.subject,"GRB")
               df.GRB .= readuntil(IOBuffer(js1.subject), ',')
           else
               grb=findfirst(r"GRB *\d+\w",js1.subject)
               df.GRB .= js1.subject[grb]
           end
           df.GCN .= gcn
           "Column2" ∈ names(df) ? df[:,Not(:Column2)] : df
       end
scrapjson (generic function with 1 method)

julia> df1=scrapjson(31522)
6×3 DataFrame
 Row │ Column1     GRB          GCN   
     │ String15    String       Int64
─────┼────────────────────────────────
   1 │ g' > 23.2   GRB 220117A  31522
   2 │ r' > 23.4   GRB 220117A  31522
   3 │ i' > 23.0   GRB 220117A  31522
   4 │ J  > 21.4   GRB 220117A  31522
   5 │ H  > 21.1   GRB 220117A  31522
   6 │ K  > 19.3.  GRB 220117A  31522

julia> df2=scrapjson(32383)
7×3 DataFrame
 Row │ Column1                          GRB          GCN   
     │ String31                         String       Int64
─────┼─────────────────────────────────────────────────────
   1 │   g' > 23.0                      GRB 220711B  32383
   2 │   r' > 23.5                      GRB 220711B  32383
   3 │   i' > 23.2                      GRB 220711B  32383
   4 │   z' > 19.8                      GRB 220711B  32383
   5 │   J  > 21.7                      GRB 220711B  32383
   6 │   H  > 21.1                      GRB 220711B  32383
   7 │   K  > 20.1  (AB mag; 3 sigma).  GRB 220711B  32383

julia> df3=scrapjson(21200)
7×3 DataFrame
 Row │ Column1              GRB         GCN   
     │ String31             String      Int64
─────┼────────────────────────────────────────
   1 │ g' = 20.75 +/- 0.08  GRB170604A  21200
   2 │ r' = 20.46 +/- 0.05  GRB170604A  21200
   3 │ i' = 20.35 +/- 0.06  GRB170604A  21200
   4 │ z' = 20.21 +/- 0.08  GRB170604A  21200
   5 │ J = 19.6 +/- 0.1     GRB170604A  21200
   6 │ H = 19.6 +/- 0.3     GRB170604A  21200
   7 │ K > 19.6             GRB170604A  21200
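
The Column1 values in the outputs above still mix the band, the relation sign, the value and trailing notes such as "mag," or "(AB mag; 3 sigma).". A possible next step, sketched here with Base Julia only, is to split each line into named fields. `parse_row` is a hypothetical helper and the pattern only covers the line shapes seen in this thread.

```julia
# Split one photometry line into band, relation, value and (optional) error.
# Anything after the numbers, such as "mag," or "(AB mag; 3 sigma).", is ignored.
function parse_row(line::AbstractString)
    pattern = r"^\s*([griz]'|[JHK]s?)\s*([=<>])\s*(\d+(?:\.\d+)?)(?:\s*\+/-\s*(\d+(?:\.\d+)?))?"
    m = match(pattern, line)
    m === nothing && return nothing
    err = m[4] === nothing ? missing : parse(Float64, m[4])
    return (band = String(m[1]), rel = String(m[2]),
            value = parse(Float64, m[3]), err = err)
end

# Lines copied from the outputs shown in this thread.
for line in ["g' = 20.75 +/- 0.08", "K  > 19.3.",
             "  K  > 20.1  (AB mag; 3 sigma).", "H  = 17.38 +/- 0.02 mag, and"]
    println(parse_row(line))
end
```

Limits (`>` or `<`) carry no uncertainty, so `err` is `missing` for them; lines that do not look like a measurement return `nothing` and can be skipped.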

I tried to automate the table capture a little more, using these packages (maybe someone with experience in this kind of thing can step in with pointers on how to do it “better”).
Since the source data is very “messy”, some tinkering is still needed to handle the “recoverable” cases.

urlb="https://gcn.nasa.gov/circulars?query=GROND" # page=1&limit=100

#urlq="https://gcn.nasa.gov/circulars?page=2&limit=50&query=GROND"

using HTTP , CSV, DataFrames, JSON3

using Cascadia, Gumbo

gresp = HTTP.get(urlb)
h = parsehtml(String(gresp.body)) 

s1=sel"ol li"  # this one works well!

qs = eachmatch(s1,h.root)

res=Tuple{String, String}[]

for q in qs
    txt=q.children[1].children[1].text
    if contains(txt,"GROND") && contains(txt,"GRB")
        push!(res, (q.attributes["value"],txt))
    end
end


#----------
#------------


function scrapjson(gcn)
    grurl="https://gcn.nasa.gov/circulars/$gcn/json"
    gresp = HTTP.get(grurl)
    js1=JSON3.read(IOBuffer(String(gresp.body)))
    txt=js1.body
    he=first(findfirst(r"\n\n^( *g'*)"m,txt))+2
    lr=first(findnext(r"\n\n"m,txt,he))-1
    cltxt=replace(txt[he:lr],' '=>"")
    df=CSV.read(IOBuffer(cltxt), DataFrame, header=0)
    ptn=r"(GRB *\d+\w)[:|,]*"
    df.GRB .= match(ptn,js1.subject).match
    df.GCN .= gcn
    "Column2" ∈ names(df) ? df[:,Not(:Column2)] : df
end

df1=scrapjson(31522)
df2=scrapjson(30703)
df3=scrapjson(26066)
df4=scrapjson(23814)

for (gcn, _) in res[1:25]
    try println(scrapjson(gcn)) catch e; println("\n"*gcn*"--->NOK\n") end
end

Some results
julia> for (gcn, _) in res[1:25]
           try println(scrapjson(gcn)) catch e; println("\n"*gcn*"--->NOK\n") end
       end
7×3 DataFrame
 Row │ Column1                GRB          GCN    
     │ String31               SubStrin…    String
─────┼────────────────────────────────────────────
   1 │ g'>23.0                GRB 220711B  32383
   2 │ r'>23.5                GRB 220711B  32383
   3 │ i'>23.2                GRB 220711B  32383
   4 │ z'>19.8                GRB 220711B  32383
   5 │ J>21.7                 GRB 220711B  32383
   6 │ H>21.1                 GRB 220711B  32383
   7 │ K>20.1(ABmag;3sigma).  GRB 220711B  32383
6×3 DataFrame
 Row │ Column1  GRB          GCN    
     │ String7  SubStrin…    String
─────┼──────────────────────────────
   1 │ g'>25.2  GRB 220706A  32339
   2 │ r'>24.9  GRB 220706A  32339
   3 │ i'>24.2  GRB 220706A  32339
   4 │ J>21.7   GRB 220706A  32339
   5 │ H>21.0   GRB 220706A  32339
   6 │ K>19.9.  GRB 220706A  32339
7×3 DataFrame
 Row │ Column1          GRB          GCN    
     │ String15         SubStrin…    String
─────┼──────────────────────────────────────
   1 │ g'=23.31+/-0.11  GRB 220627A  32304
   2 │ r'=22.70+/-0.06  GRB 220627A  32304
   3 │ i'=22.50+/-0.12  GRB 220627A  32304
   4 │ z'=22.23+/-0.21  GRB 220627A  32304
   5 │ J>21.4           GRB 220627A  32304
   6 │ H>20.6           GRB 220627A  32304
   7 │ K>19.7.          GRB 220627A  32304
6×3 DataFrame
 Row │ Column1  GRB          GCN    
     │ String7  SubStrin…    String
─────┼──────────────────────────────
   1 │ g'>23.2  GRB 220117A  31522
   2 │ r'>23.4  GRB 220117A  31522
   3 │ i'>23.0  GRB 220117A  31522
   4 │ J>21.4   GRB 220117A  31522
   5 │ H>21.1   GRB 220117A  31522
   6 │ K>19.3.  GRB 220117A  31522
7×3 DataFrame
 Row │ Column1  GRB          GCN    
     │ String7  SubStrin…    String
─────┼──────────────────────────────
   1 │ g'>24.8  GRB 211106A  31069
   2 │ r'>25.0  GRB 211106A  31069
   3 │ i'>24.0  GRB 211106A  31069
   4 │ z'>22.0  GRB 211106A  31069
   5 │ J>21.7   GRB 211106A  31069
   6 │ H>21.1   GRB 211106A  31069
   7 │ K>19.8.  GRB 211106A  31069
7×3 DataFrame
 Row │ Column1        GRB          GCN    
     │ String15       SubStrin…    String
─────┼────────────────────────────────────
   1 │ g'>24.3        GRB 210905A  30781
   2 │ r'>24.3        GRB 210905A  30781
   3 │ i'>23.8        GRB 210905A  30781
   4 │ z'=21.6+/-0.2  GRB 210905A  30781
   5 │ J=20.2+/-0.2   GRB 210905A  30781
   6 │ H=20.1+/-0.2   GRB 210905A  30781
   7 │ K>18.2.        GRB 210905A  30781
6×3 DataFrame
 Row │ Column1  GRB          GCN    
     │ String7  SubStrin…    String
─────┼──────────────────────────────
   1 │ g'>23.5  GRB 210901A  30755
   2 │ r'>23.6  GRB 210901A  30755
   3 │ i'>22.6  GRB 210901A  30755
   4 │ J>20.1   GRB 210901A  30755
   5 │ H>19.7   GRB 210901A  30755
   6 │ K>16.3.  GRB 210901A  30755
3×3 DataFrame
 Row │ Column1           GRB          GCN    
     │ String31          SubStrin…    String
─────┼───────────────────────────────────────
   1 │ g'=20.37+/-0.09   GRB 210822A  30703
   2 │ r'=20.10+/-0.05   GRB 210822A  30703
   3 │ i'=19.92+/-0.05.  GRB 210822A  30703
1×3 DataFrame
 Row │ Column1  GRB          GCN    
     │ String7  SubStrin…    String
─────┼──────────────────────────────
   1 │ g'>22.6  GRB 210820A  30695

30584--->NOK

7×3 DataFrame
 Row │ Column1             GRB          GCN    
     │ String31            SubStrin…    String
─────┼─────────────────────────────────────────
   1 │ g'=18.71+/-0.01mag  GRB 210731A  30574
   2 │ r'=18.44+/-0.01mag  GRB 210731A  30574
   3 │ i'=18.19+/-0.01mag  GRB 210731A  30574
   4 │ z'=18.01+/-0.01mag  GRB 210731A  30574
   5 │ J=17.66+/-0.02mag   GRB 210731A  30574
   6 │ H=17.38+/-0.02mag   GRB 210731A  30574
   7 │ K=17.14+/-0.15mag.  GRB 210731A  30574
4×3 DataFrame
 Row │ Column1   GRB          GCN    
     │ String15  SubStrin…    String
─────┼───────────────────────────────
   1 │ g�>23.3   GRB 191004A  26324
   2 │ r�>23.7   GRB 191004A  26324
   3 │ i�>23.1   GRB 191004A  26324
   4 │ z�>22.7   GRB 191004A  26324
7×3 DataFrame
 Row │ Column1          GRB        GCN    
     │ String15         SubStrin…  String
─────┼────────────────────────────────────
   1 │ g'=16.81+/-0.03  GRB191016  26176
   2 │ r'=16.33+/-0.03  GRB191016  26176
   3 │ i'=15.84+/-0.04  GRB191016  26176
   4 │ z'=15.51+/-0.04  GRB191016  26176
   5 │ J=15.28+/-0.05   GRB191016  26176
   6 │ H=14.80+/-0.05   GRB191016  26176
   7 │ K=14.83+/-0.08   GRB191016  26176
7×3 DataFrame
 Row │ Column1     GRB          GCN    
     │ String15    SubStrin…    String
─────┼─────────────────────────────────
   1 │ g'>25.5mag  GRB 191024A  26066
   2 │ r'>25.6mag  GRB 191024A  26066
   3 │ i'>24.8mag  GRB 191024A  26066
   4 │ z'>23.4mag  GRB 191024A  26066
   5 │ J>21.9mag   GRB 191024A  26066
   6 │ H>21.4mag   GRB 191024A  26066
   7 │ K>20.2mag   GRB 191024A  26066

26042--->NOK


25992--->NOK


25960--->NOK

7×3 DataFrame
 Row │ Column1     GRB          GCN    
     │ String15    SubStrin…    String
─────┼─────────────────────────────────
   1 │ g'>23.7mag  GRB 191004A  25959
   2 │ r'>24.2mag  GRB 191004A  25959
   3 │ i'>23.5mag  GRB 191004A  25959
   4 │ z'>23.3mag  GRB 191004A  25959
   5 │ J>21.4mag   GRB 191004A  25959
   6 │ H>20.4mag   GRB 191004A  25959
   7 │ K>19.9mag   GRB 191004A  25959

25791--->NOK


25789--->NOK


25652--->NOK


25651--->NOK

7×3 DataFrame
 Row │ Column1            GRB          GCN    
     │ String31           SubStrin…    String
─────┼────────────────────────────────────────
   1 │ g���=20.30+/-0.03  GRB 190829A  25569
   2 │ r���=19.34+/-0.03  GRB 190829A  25569
   3 │ i���=18.77+/-0.03  GRB 190829A  25569
   4 │ z���=18.21+/-0.03  GRB 190829A  25569
   5 │ J=17.34+/-0.06     GRB 190829A  25569
   6 │ H=16.68+/-0.06     GRB 190829A  25569
   7 │ Ks=16.40+/-0.08.   GRB 190829A  25569
7×3 DataFrame
 Row │ Column1             GRB          GCN    
     │ String31            SubStrin…    String
─────┼─────────────────────────────────────────
   1 │ g'=22.79+/-0.05mag  GRB 190613B  24831
   2 │ r'=22.05+/-0.04mag  GRB 190613B  24831
   3 │ i'=21.56+/-0.05mag  GRB 190613B  24831
   4 │ z'=21.15+/-0.07mag  GRB 190613B  24831
   5 │ J=20.5+/-0.1mag     GRB 190613B  24831
   6 │ H=20.2+/-0.2mag     GRB 190613B  24831
   7 │ K>19.8mag           GRB 190613B  24831
7×3 DataFrame
 Row │ Column1          GRB          GCN    
     │ String15         SubStrin…    String
─────┼──────────────────────────────────────
   1 │ g<23.0mag        GRB 190129B  23814
   2 │ r=22.7+/-0.3mag  GRB 190129B  23814
   3 │ i=21.8+/-0.3mag  GRB 190129B  23814
   4 │ z=21.7+/-0.4mag  GRB 190129B  23814
   5 │ J=20.4+/-0.4mag  GRB 190129B  23814
   6 │ H=19.1+/-0.2mag  GRB 190129B  23814
   7 │ K=19.0+/-0.4mag  GRB 190129B  23814

PS
You can, of course, adapt the scripts to dig into subsequent pages as well
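
A minimal sketch of how the page URLs could be generated, assuming the `page`, `limit` and `query` parameters behave as in the commented-out URL in the script above. `search_url` is a hypothetical helper, the query is not URL-encoded, and how many pages actually exist is not known here.

```julia
# Build the search URL for a given result page; the parameter names are the
# ones visible in the URL quoted in the script above.
search_url(page; limit = 50, query = "GROND") =
    "https://gcn.nasa.gov/circulars?page=$page&limit=$limit&query=$query"

urls = [search_url(p) for p in 1:3]
foreach(println, urls)
```

Each of these URLs could then be fetched and parsed in the same way as the first page.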


I don’t understand anything about the contents of the tables, but I think the data is easier to read in this form.

julia> urlb="https://gcn.nasa.gov/circulars?query=GROND" # page=1&limit=100
"https://gcn.nasa.gov/circulars?query=GROND"

julia> using HTTP , CSV, DataFrames, JSON3

julia> using Cascadia, Gumbo

julia> gresp = HTTP.get(urlb);

julia> h = parsehtml(String(gresp.body));

julia> s1=sel"ol li"  # this one works well!
Selector(Cascadia.var"#51#52"{Selector, Selector}(Selector(Cascadia.var"#5#6"{String}("ol")), Selector(Cascadia.var"#5#6"{String}("li"))))

julia> qs = eachmatch(s1,h.root);

julia> res=Tuple{String, String}[]
Tuple{String, String}[]

julia> for q in qs
           txt=q.children[1].children[1].text
           if contains(txt,"GROND") && contains(txt,"GRB")
               push!(res, (q.attributes["value"],txt))
           end
       end

julia> function scrapjson(gcn)
           grurl="https://gcn.nasa.gov/circulars/$gcn/json"
           gresp = HTTP.get(grurl)
           js1=JSON3.read(IOBuffer(String(gresp.body)))
           txt=js1.body
           he=first(findfirst(r"\n\n^( *g'*)"m,txt))+2
           lr=first(findnext(r"\n\n"m,txt,he))-1
           cltxt=replace(txt[he:lr],' '=>"")
           df=CSV.read(IOBuffer(cltxt), DataFrame, header=0)
           ptn=r"(GRB *\d+\w)[:|,]*"
           df.GRB .= match(ptn,js1.subject)[1]
           df.GCN .= gcn
           "Column2" ∈ names(df) ? df[:,Not(:Column2)] : df
       end
scrapjson (generic function with 1 method)

julia> df=scrapjson(res[1][1]);

julia> dfnok=DataFrame(GCN=String[])
0×1 DataFrame
 Row │ GCN    
     │ String
─────┴────────

julia> for (gcn, _) in res[2:25]
           try df=vcat(df,scrapjson(gcn), cols=:union) catch e; push!(dfnok,(GCN=gcn,)) end
       end

julia> df.mag=replace.(df.Column1, "'"=>"",'�'=>"");

julia> df=select(df,[:GCN, :GRB],:mag=>ByRow(x->[x[1],x[2:end]])=>[:tel,:mag1]);

julia> udf=unstack(df,[:GCN, :GRB],:tel,:mag1)
17×9 DataFrame
...
julia> vcat(udf,dfnok,cols=:union)
25×9 DataFrame
 Row │ GCN     GRB          g                 r                 i                 z                 J                 H           ⋯
     │ String  SubStrin…?   String?           String?           String?           String?           String?           String?     ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 32383   GRB 220711B  >23.0             >23.5             >23.2             >19.8             >21.7             >21.1       ⋯
   2 │ 32339   GRB 220706A  >25.2             >24.9             >24.2             missing           >21.7             >21.0        
   3 │ 32304   GRB 220627A  =23.31+/-0.11     =22.70+/-0.06     =22.50+/-0.12     =22.23+/-0.21     >21.4             >20.6        
   4 │ 31522   GRB 220117A  >23.2             >23.4             >23.0             missing           >21.4             >21.1        
   5 │ 31069   GRB 211106A  >24.8             >25.0             >24.0             >22.0             >21.7             >21.1       ⋯
   6 │ 30781   GRB 210905A  >24.3             >24.3             >23.8             =21.6+/-0.2       =20.2+/-0.2       =20.1+/-0.2  
   7 │ 30755   GRB 210901A  >23.5             >23.6             >22.6             missing           >20.1             >19.7        
   8 │ 30703   GRB 210822A  =20.37+/-0.09     =20.10+/-0.05     =19.92+/-0.05.    missing           missing           missing      
   9 │ 30695   GRB 210820A  >22.6             missing           missing           missing           missing           missing     ⋯
  10 │ 30574   GRB 210731A  =18.71+/-0.01mag  =18.44+/-0.01mag  =18.19+/-0.01mag  =18.01+/-0.01mag  =17.66+/-0.02mag  =17.38+/-0.  
  11 │ 26324   GRB 191004A  >23.3             >23.7             >23.1             >22.7             missing           missing      
  ⋮  │   ⋮          ⋮              ⋮                 ⋮                 ⋮                 ⋮                 ⋮                 ⋮    ⋱
  16 │ 24831   GRB 190613B  =22.79+/-0.05mag  =22.05+/-0.04mag  =21.56+/-0.05mag  =21.15+/-0.07mag  =20.5+/-0.1mag    =20.2+/-0.2  
  17 │ 23814   GRB 190129B  <23.0mag          =22.7+/-0.3mag    =21.8+/-0.3mag    =21.7+/-0.4mag    =20.4+/-0.4mag    =19.1+/-0.2 ⋯
  18 │ 30584   missing      missing           missing           missing           missing           missing           missing      
  19 │ 26042   missing      missing           missing           missing           missing           missing           missing      
  20 │ 25992   missing      missing           missing           missing           missing           missing           missing      
  21 │ 25960   missing      missing           missing           missing           missing           missing           missing     ⋯
  22 │ 25791   missing      missing           missing           missing           missing           missing           missing      
  23 │ 25789   missing      missing           missing           missing           missing           missing           missing      
  24 │ 25652   missing      missing           missing           missing           missing           missing           missing      
  25 │ 25651   missing      missing           missing           missing           missing           missing           missing     ⋯
                                                                                                       2 columns and 4 rows omitted
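
One caveat about the `select` step in the session above: `x[1]` keeps only the first character as the band, so an entry such as "Ks=16.40+/-0.08." would be split into "K" and "s=16.40+/-0.08.". A possible alternative, sketched here with plain Dicts rather than DataFrames, is to split at the first relation sign. `split_band` and `widen` are hypothetical helpers.

```julia
# Split a compact entry such as "Ks=16.40+/-0.08." at the first relation
# sign, so that multi-character band names stay intact.
function split_band(entry::AbstractString)
    i = findfirst(in("=<>"), entry)
    i === nothing && return (entry, "")
    return (entry[1:prevind(entry, i)], entry[i:end])
end

# Pivot (gcn, entry) records into one Dict of band => measurement per circular.
function widen(records)
    wide = Dict{String, Dict{String, String}}()
    for (gcn, entry) in records
        band, mag = split_band(entry)
        get!(wide, gcn, Dict{String, String}())[band] = mag
    end
    return wide
end

# Entries taken from the results shown earlier (apostrophes already removed).
long = [("25569", "J=17.34+/-0.06"), ("25569", "Ks=16.40+/-0.08."),
        ("23814", "g<23.0mag")]
wide = widen(long)
println(wide["25569"])
```

The same splitting function could be used inside `ByRow` in place of the `x[1]`, `x[2:end]` pair.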

Why are you using JSON3? Is there any particular advantage over the code in my 12th post in this discussion? :face_with_diagonal_mouth: :worried:

julia>    js1=JSON3.read(IOBuffer(String(gresp.body)))
JSON3.Object{Base.CodeUnits{UInt8, String}, Vector{UInt64}} with 6 entries:
  :subject    => "GROND observations of GRB 180620B"
  :createdOn  => 1529579936000
  :submitter  => "Patricia Schady at MPE/Swift  <pschady@mpe.mpg.de>"
  :circularId => 22819
  :email      => "pschady@mpe.mpg.de"
  :body       => "Tassilo Schweyer and Patricia Schady (MPE Garching) report:\n\nWe observed the field of GRB 180620B (Swift trigg

It is not essential, but it seemed more convenient to access the “:subject” and “:body” fields to obtain the information to put in the tables (and you could easily add the :submitter info and other fields).
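
For example, once the subject is available as a plain string, the GRB name can be taken from it directly instead of searching the whole body. This is a minimal sketch; `grb_name` is a hypothetical helper and the two subjects are the ones quoted in this thread.

```julia
# Subjects as they appear in the circulars discussed in this thread.
subjects = ["GROND observations of GRB 180620B",
            "GRB 220118A: MITSuME Akeno optical observation"]

# Return the GRB designation found in a subject line, or `nothing`.
function grb_name(subject::AbstractString)
    m = match(r"GRB *\d+\w", subject)
    return m === nothing ? nothing : String(m.match)
end

for s in subjects
    println(grb_name(s))
end
```

Returning `nothing` for subjects without a GRB name lets the caller decide whether to skip the circular or record it as unparsed.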

Can you scrape GCN Circular 31493 (GRB 220118A: MITSuME Akeno optical observation)? In this case both the start and end markers are T0+.

You should write out the resulting table clearly and post it. It is hard to make sense of even in the RAW view.

For GCN 31418 the table should look like this:

T0+[hours]  MID-UT               T-EXP[sec]  FILTER  5-sigma limits
22.0        2022-01-08 18:58:13  11820       g'      20.7
22.0        2022-01-08 18:58:13  11820       Rc      20.9
22.0        2022-01-08 18:58:13  11820       Ic      20.1

julia> urlmessy="https://gcn.nasa.gov/circulars/31493/raw"
"https://gcn.nasa.gov/circulars/31493/raw"

julia> txt=String((HTTP.get(urlmessy)))
"HTTP/1.1 200 OK\r\nContent-Type: text/plain;charset=UTF-8\r\nTransfer-Encoding: chunked\r\nConnection: keep-alive\r\nDate: Sat, 29 Jul 2023 17:08:47 GMT\r\nLink: <https://gcn.nasa.gov/circulars/31493>; rel=\"canonical\"\r\nApigw-Requestid: I1gg8imKoAMEJpg=\r\nContent-Encoding: gzip\r\nVary: Accept-Encoding\r\nX-Cache: Miss from cloudfront\r\nVia: 1.1 c651b6f427de520af17b746abf0c7ee6.cloudfront.net (CloudFront)\r\nX-Amz-Cf-Pop: MXP64-P2\r\nX-Amz-Cf-Id: QOaNnew_Vj-jdlhBHyq" ⋯ 2168 bytes ⋯ ">19.9 |\n-----------------------------------------------------------------------------------------------------------------\nT0+ : Elapsed time after the burst\nT-EXP: Total Exposure time\n\nWe used PS1 catalog for flux calibration.\nThe magnitudes are expressed in the AB system.\nThe images were processed in real-time through the MITSuME GPU\nreduction pipeline (Niwano et al. 2021, PASJ, Vol.73, Issue 1, Pages\n4-24; https://github.com/MNiwano/Eclaire)."

julia>     he=first(findfirst(r"^\nT0+"im,txt))+1
1651

julia>     lr=first(findnext("|\n---",txt,he))
2624

julia>     cltxt=txt[he:lr]
"T0+[sec] MID-UT T-EXP[sec] magnitude(or 5-sigma limits) 5-sigma limits\n-----------------------------------------------------------------------------------------------------------------\n56   | 2022-01-18 18:21:34 | 10   |g'>15.9,         Rc>16.8,\nIc>16.8        | g'>15.9, Rc>16.8, Ic>16.8 |\n84   | 2022-01-18 18:22:02 | 40   |g'=18.36+/-0.70, Rc=18.57+/-0.78,\nIc>17.5        | g'>17.3, Rc>17.9, Ic>17.5 |\n171  | 2022-01-18 18:23:29 | 50   |g'=17.45+/" ⋯ 75 bytes ⋯ " 2022-01-18 18:24:54 | 60   |g'=17.31+/-0.18, Rc=16.87+/-0.09,\nIc=16.64+/-0.11| g'>17.7, Rc>18.5, Ic>18.1 |\n325  | 2022-01-18 18:26:03 | 60   |g'=17.31+/-0.22, Rc=17.31+/-0.13,\nIc=17.08+/-0.15| g'>17.7, Rc>18.4, Ic>18.1 |\n535  | 2022-01-18 18:29:33 | 300  |g'=18.21+/-0.17, Rc=17.80+/-0.10,\nIc=17.47+/-0.10| g'>18.7, Rc>19.5, Ic>19.1 |\n1471 | 2022-01-18 18:45:09 | 1020 |g'=19.82+/-0.33, Rc=19.26+/-0.14,\nIc=18.81+/-0.15| g'>19.5, Rc>20.3, Ic>19.9 |"

julia>     clcltxt=replace(cltxt, r"\n(Ic)"=>s"\1")
"T0+[sec] MID-UT T-EXP[sec] magnitude(or 5-sigma limits) 5-sigma limits\n-----------------------------------------------------------------------------------------------------------------\n56   | 2022-01-18 18:21:34 | 10   |g'>15.9,         Rc>16.8,Ic>16.8        | g'>15.9, Rc>16.8, Ic>16.8 |\n84   | 2022-01-18 18:22:02 | 40   |g'=18.36+/-0.70, Rc=18.57+/-0.78,Ic>17.5        | g'>17.3, Rc>17.9, Ic>17.5 |\n171  | 2022-01-18 18:23:29 | 50   |g'=17.45+/-0" ⋯ 68 bytes ⋯ "6  | 2022-01-18 18:24:54 | 60   |g'=17.31+/-0.18, Rc=16.87+/-0.09,Ic=16.64+/-0.11| g'>17.7, Rc>18.5, Ic>18.1 |\n325  | 2022-01-18 18:26:03 | 60   |g'=17.31+/-0.22, Rc=17.31+/-0.13,Ic=17.08+/-0.15| g'>17.7, Rc>18.4, Ic>18.1 |\n535  | 2022-01-18 18:29:33 | 300  |g'=18.21+/-0.17, Rc=17.80+/-0.10,Ic=17.47+/-0.10| g'>18.7, Rc>19.5, Ic>19.1 |\n1471 | 2022-01-18 18:45:09 | 1020 |g'=19.82+/-0.33, Rc=19.26+/-0.14,Ic=18.81+/-0.15| g'>19.5, Rc>20.3, Ic>19.9 |"

julia>     df=CSV.read(IOBuffer(clcltxt), DataFrame, delim='|',header=false, skipto=3, dateformat="yyyy-mm-dd HH:MM:SS")
7×6 DataFrame
 Row │ Column1  Column2              Column3  Column4                            Column5                      Column6 
     │ Int64    DateTime             Int64    String                             String31                     Missing
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │      56  2022-01-18T18:21:34       10  g'>15.9,         Rc>16.8,Ic>16.8…   g'>15.9, Rc>16.8, Ic>16.8   missing
   2 │      84  2022-01-18T18:22:02       40  g'=18.36+/-0.70, Rc=18.57+/-0.78…   g'>17.3, Rc>17.9, Ic>17.5   missing
   3 │     171  2022-01-18T18:23:29       50  g'=17.45+/-0.20, Rc=16.90+/-0.15…   g'>17.5, Rc>18.1, Ic>17.5   missing
   4 │     256  2022-01-18T18:24:54       60  g'=17.31+/-0.18, Rc=16.87+/-0.09…   g'>17.7, Rc>18.5, Ic>18.1   missing
   5 │     325  2022-01-18T18:26:03       60  g'=17.31+/-0.22, Rc=17.31+/-0.13…   g'>17.7, Rc>18.4, Ic>18.1   missing
   6 │     535  2022-01-18T18:29:33      300  g'=18.21+/-0.17, Rc=17.80+/-0.10…   g'>18.7, Rc>19.5, Ic>19.1   missing
   7 │    1471  2022-01-18T18:45:09     1020  g'=19.82+/-0.33, Rc=19.26+/-0.14…   g'>19.5, Rc>20.3, Ic>19.9   missing

julia>     h=readuntil(IOBuffer(cltxt), '\n')
"T0+[sec] MID-UT T-EXP[sec] magnitude(or 5-sigma limits) 5-sigma limits"

julia>     hd=["T0+[sec]", "MID-UT", "T-EXP[sec]", "magnitude(or 5-sigma limits)", "5-sigma limits","m"]
6-element Vector{String}:
 "T0+[sec]"
 "MID-UT"
 "T-EXP[sec]"
 "magnitude(or 5-sigma limits)"
 "5-sigma limits"
 "m"

julia>     rename!(df,hd)
7×6 DataFrame
 Row │ T0+[sec]  MID-UT               T-EXP[sec]  magnitude(or 5-sigma limits)       5-sigma limits               m       
     │ Int64     DateTime             Int64       String                             String31                     Missing
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       56  2022-01-18T18:21:34          10  g'>15.9,         Rc>16.8,Ic>16.8…   g'>15.9, Rc>16.8, Ic>16.8   missing
   2 │       84  2022-01-18T18:22:02          40  g'=18.36+/-0.70, Rc=18.57+/-0.78…   g'>17.3, Rc>17.9, Ic>17.5   missing
   3 │      171  2022-01-18T18:23:29          50  g'=17.45+/-0.20, Rc=16.90+/-0.15…   g'>17.5, Rc>18.1, Ic>17.5   missing
   4 │      256  2022-01-18T18:24:54          60  g'=17.31+/-0.18, Rc=16.87+/-0.09…   g'>17.7, Rc>18.5, Ic>18.1   missing
   5 │      325  2022-01-18T18:26:03          60  g'=17.31+/-0.22, Rc=17.31+/-0.13…   g'>17.7, Rc>18.4, Ic>18.1   missing
   6 │      535  2022-01-18T18:29:33         300  g'=18.21+/-0.17, Rc=17.80+/-0.10…   g'>18.7, Rc>19.5, Ic>19.1   missing
   7 │     1471  2022-01-18T18:45:09        1020  g'=19.82+/-0.33, Rc=19.26+/-0.14…   g'>19.5, Rc>20.3, Ic>19.9   missing

In the regular expression r"^\nT0+"im you are already using ^ (which matches the start of a line), so why do you need \n? It does not work in my code:

using HTTP, CSV, DataFrames
function doanalysis()
    dfg=nothing
    for x in 31493
        print("\r peeking at GCN $x ")
        try
            url = "https://gcn.nasa.gov/circulars/$x/raw"
            resp = HTTP.get(url) 
            status=resp.status
            print(" ",status," "); 
            if status == 404 ; println("status=",status); continue; end          
            txt = String(resp.body)
            if occursin(r"GRB ?\d{6}([A-G]|(\.\d{2}))?",txt)
                m=match(r"GRB ?\d{6}([A-G]|(\.\d{2}))?",txt)
                print(m.match)
            end

            if occursin("MITSuME", txt)
                println(" MITSuME report")                
                he=first(findfirst(r"^\nT0+"im,txt))+1
                lr=first(findnext("|\n---",txt,he))
                cltxt=replace(txt[he:lr], ","=>" ",r"(�+)"=>" " , ">"=>" ", "-"=>" ")
                df=CSV.read(IOBuffer(cltxt), DataFrame, delim=" ")
                df.GCN=[x for i in 1:nrow(df)]
                df.GRB=[m.match for i in 1:nrow(df)]                  
                if isnothing(dfg) 
                    @show dfg=df
                else
                    @show dfg=vcat(dfg,df)
                end # if x is first
            end # if occursin
        catch e
            println("error at try")                    
        end # trycatch
    end # for loop
end
doanalysis()
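
On the `^` versus `\n` question, here is a small sketch of how the two patterns behave. The sample strings are made up and are not taken from any circular. With the `m` flag, `^` matches at the start of any line, so `^\n` only matches at an empty line, and the pattern needs a blank line directly above the header. Note also that the unescaped `+` in `T0+` is a quantifier (one or more `0`), not a literal plus sign:

```julia
# Made-up text: a blank line sits directly above the header line.
with_blank    = "intro line\n\nT0+[sec] MID-UT\n56 | 10\n"
# The same text without the blank line.
without_blank = "intro line\nT0+[sec] MID-UT\n56 | 10\n"

blank_then_t0  = r"^\nT0+"im   # an empty line, then "T" and one or more "0"
literal_header = r"^T0\+"m     # the literal text "T0+" at the start of any line

r_blank   = findfirst(blank_then_t0, with_blank)      # range covering "\nT0"
r_noblank = findfirst(blank_then_t0, without_blank)   # nothing: no empty line before "T0"
s_blank   = findfirst(literal_header, with_blank)
s_noblank = findfirst(literal_header, without_blank)

println(r_blank, "  ", r_noblank, "  ", s_blank, "  ", s_noblank)
```

If the header in a given circular is not preceded by an empty line, the first pattern finds nothing and `first(findfirst(...))` throws, while the second pattern does not depend on the blank line.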


@rocco_sprmnt21 I want to combine GCNs from different telescopes in a single file, so I don't want to switch to your code that uses JSON3.

@rocco_sprmnt21
The following code cannot scrape the table from the GCN because "J. Bolmer" appears on a line above it, and the line-start pattern matches that leading "J" first.

See code
using HTTP , CSV, DataFrames
function doanalysis()
    dfg=nothing
    for x in 26176
        print("\r peeking at GCN $x ")
        try
            url = "https://gcn.nasa.gov/circulars/$x/raw"
            resp = HTTP.get(url) 
            status=resp.status
            print(" ",status," "); 
            if status == 404 ; println("status=",status); continue; end          
            txt = String(resp.body)
            if occursin(r"GRB ?\d{6}([A-G]|(\.\d{2}))?",txt)
                m=match(r"GRB ?\d{6}([A-G]|(\.\d{2}))?",txt)
                print(m.match)
            end

            if occursin("GROND", txt)
                println(" GROND report")                
                he=first(findfirst(r"^(g'|r'|i'|z'|J|H|K)"m,txt))
                lr=first(findnext(r"^(((?:[\t ]*(?:\r?\n|\r))+)|(The)|(Given))"m,txt,he))-1
                cltxt=replace(txt[he:lr], "mag"=>" ",","=>"|",r" ?(=|>)"=>"|" , "+/-"=>"|","�"=>" ")
                df=CSV.read(IOBuffer(cltxt), DataFrame, delim="|" ,header=0)
                df.GCN=[x for i in 1:nrow(df)]
                df.GRB=[m.match for i in 1:nrow(df)] 
                rename!(df,"Column1" => "Filter","Column2" => "Mag","Column3" => "Mag_err")
                if isnothing(dfg) 
                    @show dfg=df
                else
                    @show dfg=vcat(dfg,df)
                end # if x is first
            end # if occursin
        catch e
            println("error ")                    
        end # trycatch
    end # for loop
end
doanalysis()
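
One possible way around the "J. Bolmer" match, as a sketch only: require the filter name to be followed by an operator (`=`, `<` or `>`), so a name that merely starts with J no longer qualifies. The sample text below is invented to imitate the layout and is not copied from a real circular, and real circulars may format the table differently:

```julia
# Made-up text imitating the layout (author line first, table further down).
txt = "J. Bolmer and others report:\n\ng' = 22.1 +/- 0.1 mag\nr' > 23.0 mag\nJ = 20.5 +/- 0.2 mag\n\nGiven the above"

filter_only    = r"^(g'|r'|i'|z'|J|H|K)"m             # pattern from the post
filter_with_op = r"^(g'|r'|i'|z'|J|H|K)[ \t]*[=<>]"m  # also require an operator next

he_filter_only    = first(findfirst(filter_only, txt))     # lands on the author line
he_filter_with_op = first(findfirst(filter_with_op, txt))  # lands on the first table row

println(txt[he_filter_only:he_filter_only + 8])
println(txt[he_filter_with_op:he_filter_with_op + 8])
```

The stricter pattern still accepts a table row for the J band, because there the `J` is followed by an operator.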

The code above also does not work for GCN 20843.
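
Because the `catch` branch in these functions prints only a fixed string, the actual failure stays hidden. A minimal sketch for surfacing it is shown here; `try_report` is a hypothetical helper, not something from the posts. It also shows one likely failure mode: `findfirst` returns `nothing` when the pattern is absent, and calling `first` on `nothing` throws:

```julia
# Hypothetical helper: run f() and print the real exception instead of a fixed string.
function try_report(f, label)
    try
        return f()
    catch e
        println("error in ", label, ": ", sprint(showerror, e))
        return nothing
    end
end

ok  = try_report(() -> parse(Int, "42"), "parse")
# findfirst finds no match here, so first(nothing) raises a MethodError.
bad = try_report(() -> first(findfirst("GROND", "some other text")), "findfirst")
```

Inside `doanalysis`, replacing `println("error ")` with a similar `sprint(showerror, e)` call would print which step failed for a given circular.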