Scrape table from NASA GCN circulars website

Can you scrape the table from GCN Circulars 31493 (GRB 220118A: MITSuME Akeno optical observation)? In this case both the start and the end anchor of the table are "T0+".

Please format the resulting table nicely and post it; one can't make sense of anything even with the RAW view.

For GCN 31418, the table should look like:

T0+[hours]  MID-UT               T-EXP[sec]  FILTER  5-sigma limits
22.0        2022-01-08 18:58:13  11820       g'      20.7
22.0        2022-01-08 18:58:13  11820       Rc      20.9
22.0        2022-01-08 18:58:13  11820       Ic      20.1
julia> urlmessy="https://gcn.nasa.gov/circulars/31493/raw"
"https://gcn.nasa.gov/circulars/31493/raw"

julia> txt=String((HTTP.get(urlmessy)))
"HTTP/1.1 200 OK\r\nContent-Type: text/plain;charset=UTF-8\r\nTransfer-Encoding: chunked\r\nConnection: keep-alive\r\nDate: Sat, 29 Jul 2023 17:08:47 GMT\r\nLink: <https://gcn.nasa.gov/circulars/31493>; rel=\"canonical\"\r\nApigw-Requestid: I1gg8imKoAMEJpg=\r\nContent-Encoding: gzip\r\nVary: Accept-Encoding\r\nX-Cache: Miss from cloudfront\r\nVia: 1.1 c651b6f427de520af17b746abf0c7ee6.cloudfront.net (CloudFront)\r\nX-Amz-Cf-Pop: MXP64-P2\r\nX-Amz-Cf-Id: QOaNnew_Vj-jdlhBHyq" β‹― 2168 bytes β‹― ">19.9 |\n-----------------------------------------------------------------------------------------------------------------\nT0+ : Elapsed time after the burst\nT-EXP: Total Exposure time\n\nWe used PS1 catalog for flux calibration.\nThe magnitudes are expressed in the AB system.\nThe images were processed in real-time through the MITSuME GPU\nreduction pipeline (Niwano et al. 2021, PASJ, Vol.73, Issue 1, Pages\n4-24; https://github.com/MNiwano/Eclaire)."

julia>     he=first(findfirst(r"^\nT0+"im,txt))+1
1651

julia>     lr=first(findnext("|\n---",txt,he))
2624

julia>     cltxt=txt[he:lr]
"T0+[sec] MID-UT T-EXP[sec] magnitude(or 5-sigma limits) 5-sigma limits\n-----------------------------------------------------------------------------------------------------------------\n56   | 2022-01-18 18:21:34 | 10   |g'>15.9,         Rc>16.8,\nIc>16.8        | g'>15.9, Rc>16.8, Ic>16.8 |\n84   | 2022-01-18 18:22:02 | 40   |g'=18.36+/-0.70, Rc=18.57+/-0.78,\nIc>17.5        | g'>17.3, Rc>17.9, Ic>17.5 |\n171  | 2022-01-18 18:23:29 | 50   |g'=17.45+/" β‹― 75 bytes β‹― " 2022-01-18 18:24:54 | 60   |g'=17.31+/-0.18, Rc=16.87+/-0.09,\nIc=16.64+/-0.11| g'>17.7, Rc>18.5, Ic>18.1 |\n325  | 2022-01-18 18:26:03 | 60   |g'=17.31+/-0.22, Rc=17.31+/-0.13,\nIc=17.08+/-0.15| g'>17.7, Rc>18.4, Ic>18.1 |\n535  | 2022-01-18 18:29:33 | 300  |g'=18.21+/-0.17, Rc=17.80+/-0.10,\nIc=17.47+/-0.10| g'>18.7, Rc>19.5, Ic>19.1 |\n1471 | 2022-01-18 18:45:09 | 1020 |g'=19.82+/-0.33, Rc=19.26+/-0.14,\nIc=18.81+/-0.15| g'>19.5, Rc>20.3, Ic>19.9 |"

julia>     clcltxt=replace(cltxt, r"\n(Ic)"=>s"\1")
"T0+[sec] MID-UT T-EXP[sec] magnitude(or 5-sigma limits) 5-sigma limits\n-----------------------------------------------------------------------------------------------------------------\n56   | 2022-01-18 18:21:34 | 10   |g'>15.9,         Rc>16.8,Ic>16.8        | g'>15.9, Rc>16.8, Ic>16.8 |\n84   | 2022-01-18 18:22:02 | 40   |g'=18.36+/-0.70, Rc=18.57+/-0.78,Ic>17.5        | g'>17.3, Rc>17.9, Ic>17.5 |\n171  | 2022-01-18 18:23:29 | 50   |g'=17.45+/-0" β‹― 68 bytes β‹― "6  | 2022-01-18 18:24:54 | 60   |g'=17.31+/-0.18, Rc=16.87+/-0.09,Ic=16.64+/-0.11| g'>17.7, Rc>18.5, Ic>18.1 |\n325  | 2022-01-18 18:26:03 | 60   |g'=17.31+/-0.22, Rc=17.31+/-0.13,Ic=17.08+/-0.15| g'>17.7, Rc>18.4, Ic>18.1 |\n535  | 2022-01-18 18:29:33 | 300  |g'=18.21+/-0.17, Rc=17.80+/-0.10,Ic=17.47+/-0.10| g'>18.7, Rc>19.5, Ic>19.1 |\n1471 | 2022-01-18 18:45:09 | 1020 |g'=19.82+/-0.33, Rc=19.26+/-0.14,Ic=18.81+/-0.15| g'>19.5, Rc>20.3, Ic>19.9 |"

julia>     df=CSV.read(IOBuffer(clcltxt), DataFrame, delim='|',header=false, skipto=3, dateformat="yyyy-mm-dd HH:MM:SS")
7Γ—6 DataFrame
 Row β”‚ Column1  Column2              Column3  Column4                            Column5                      Column6 
     β”‚ Int64    DateTime             Int64    String                             String31                     Missing
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚      56  2022-01-18T18:21:34       10  g'>15.9,         Rc>16.8,Ic>16.8…   g'>15.9, Rc>16.8, Ic>16.8   missing
   2 β”‚      84  2022-01-18T18:22:02       40  g'=18.36+/-0.70, Rc=18.57+/-0.78…   g'>17.3, Rc>17.9, Ic>17.5   missing
   3 β”‚     171  2022-01-18T18:23:29       50  g'=17.45+/-0.20, Rc=16.90+/-0.15…   g'>17.5, Rc>18.1, Ic>17.5   missing
   4 β”‚     256  2022-01-18T18:24:54       60  g'=17.31+/-0.18, Rc=16.87+/-0.09…   g'>17.7, Rc>18.5, Ic>18.1   missing
   5 β”‚     325  2022-01-18T18:26:03       60  g'=17.31+/-0.22, Rc=17.31+/-0.13…   g'>17.7, Rc>18.4, Ic>18.1   missing
   6 β”‚     535  2022-01-18T18:29:33      300  g'=18.21+/-0.17, Rc=17.80+/-0.10…   g'>18.7, Rc>19.5, Ic>19.1   missing
   7 β”‚    1471  2022-01-18T18:45:09     1020  g'=19.82+/-0.33, Rc=19.26+/-0.14…   g'>19.5, Rc>20.3, Ic>19.9   missing

julia>     h=readuntil(IOBuffer(cltxt), '\n')
"T0+[sec] MID-UT T-EXP[sec] magnitude(or 5-sigma limits) 5-sigma limits"

julia>     hd=["T0+[sec]", "MID-UT", "T-EXP[sec]", "magnitude(or 5-sigma limits)", "5-sigma limits","m"]
6-element Vector{String}:
 "T0+[sec]"
 "MID-UT"
 "T-EXP[sec]"
 "magnitude(or 5-sigma limits)"
 "5-sigma limits"
 "m"

julia>     rename!(df,hd)
7Γ—6 DataFrame
 Row β”‚ T0+[sec]  MID-UT               T-EXP[sec]  magnitude(or 5-sigma limits)       5-sigma limits               m       
     β”‚ Int64     DateTime             Int64       String                             String31                     Missing
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚       56  2022-01-18T18:21:34          10  g'>15.9,         Rc>16.8,Ic>16.8…   g'>15.9, Rc>16.8, Ic>16.8   missing
   2 β”‚       84  2022-01-18T18:22:02          40  g'=18.36+/-0.70, Rc=18.57+/-0.78…   g'>17.3, Rc>17.9, Ic>17.5   missing
   3 β”‚      171  2022-01-18T18:23:29          50  g'=17.45+/-0.20, Rc=16.90+/-0.15…   g'>17.5, Rc>18.1, Ic>17.5   missing
   4 β”‚      256  2022-01-18T18:24:54          60  g'=17.31+/-0.18, Rc=16.87+/-0.09…   g'>17.7, Rc>18.5, Ic>18.1   missing
   5 β”‚      325  2022-01-18T18:26:03          60  g'=17.31+/-0.22, Rc=17.31+/-0.13…   g'>17.7, Rc>18.4, Ic>18.1   missing
   6 β”‚      535  2022-01-18T18:29:33         300  g'=18.21+/-0.17, Rc=17.80+/-0.10…   g'>18.7, Rc>19.5, Ic>19.1   missing
   7 β”‚     1471  2022-01-18T18:45:09        1020  g'=19.82+/-0.33, Rc=19.26+/-0.14…   g'>19.5, Rc>20.3, Ic>19.9   missing
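The magnitude and limit columns above still pack the three filters into one string. A small helper can split such a string into per-filter values; this is my own sketch (the name `split_limits` is illustrative, not part of the session above):

```julia
# Sketch: split a combined limits string like "g'>19.5, Rc>20.3, Ic>19.9"
# into a Dict mapping each filter name to its numeric limit.
function split_limits(s::AbstractString)
    vals = Dict{String,Float64}()
    for part in split(s, ',')
        m = match(r"(\S+)\s*>\s*([\d.]+)", strip(part))
        m === nothing && continue
        vals[m.captures[1]] = parse(Float64, m.captures[2])
    end
    return vals
end

split_limits("g'>19.5, Rc>20.3, Ic>19.9")
```

Applied per row, this turns the `5-sigma limits` string column into tidy per-filter columns.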

In the regular expression r"^\nT0+"im you are already using ^ (which matches the start of a line), so why do you need the \n? In my code it is not working.
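A minimal demonstration of the difference, on my own example strings: with the `m` flag, `^` already matches at the start of every line, so the extra `\n` additionally requires a *blank* line just before `T0+` (which is why the session above needed `first(...)+1` to skip the matched newline). Note also that an unescaped `+` is a quantifier, so `T0+` means "T followed by one or more zeros"; use `T0\+` to match a literal plus sign.

```julia
txt = "header line\nT0+[sec] MID-UT\n56 | ..."

# Requires a blank line before "T0+": no match in this text.
findfirst(r"^\nT0+"m, txt)    # nothing

# Anchors directly at the start of the "T0+" line.
findfirst(r"^T0\+"m, txt)     # matches at index 13
```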

using HTTP, CSV, DataFrames

function doanalysis()
    dfg = nothing
    for x in 31493
        print("\r peeking at GCN $x ")
        try
            url = "https://gcn.nasa.gov/circulars/$x/raw"
            resp = HTTP.get(url; status_exception=false)  # don't throw on 404
            status = resp.status
            print(" ", status, " ")
            if status == 404; println("status=", status); continue; end
            txt = String(resp.body)
            m = match(r"GRB ?\d{6}([A-G]|(\.\d{2}))?", txt)
            m === nothing || print(m.match)

            if occursin("MITSuME", txt)
                println(" MITSuME report")
                # the leading \n expects a blank line before the header; +1 skips it
                he = first(findfirst(r"^\nT0\+"im, txt)) + 1
                lr = first(findnext("|\n---", txt, he))
                cltxt = replace(txt[he:lr], "," => " ", r"(οΏ½+)" => " ", ">" => " ", "-" => " ")
                df = CSV.read(IOBuffer(cltxt), DataFrame, delim=" ")
                df.GCN = fill(x, nrow(df))
                df.GRB = fill(m.match, nrow(df))
                if isnothing(dfg)
                    @show dfg = df
                else
                    @show dfg = vcat(dfg, df)
                end # if x is first
            end # if occursin
        catch e
            println("error at try: ", e)   # show the actual exception
        end # trycatch
    end # for loop
    return dfg
end

doanalysis()
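Since the goal is one file combining circulars from different telescopes, whose tables will not share identical columns, a plain `vcat` will eventually fail on a column mismatch. A sketch of one way around this, using `cols=:union` (the example frames are illustrative):

```julia
using DataFrames

# Collect each circular's table, then concatenate once at the end.
# cols=:union keeps every column that appears anywhere and fills the
# ones a given table lacks with `missing`.
tables = DataFrame[]
push!(tables, DataFrame(GCN=[31493], Filter=["g'"]))
push!(tables, DataFrame(GCN=[31418], Filter=["Rc"], Mag=[20.9]))

dfg = reduce(vcat, tables; cols=:union)   # 2Γ—3; Mag is missing in row 1
```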


@rocco_sprmnt21 I want to combine GCNs from different telescopes in a single file, so I don't want to switch to your JSON3-based code.

@rocco_sprmnt21
The following code is not able to scrape the table from the GCN because "J. Bolmer" occurs on a line above it, and its leading "J" matches the filter pattern.

See code
using HTTP, CSV, DataFrames

function doanalysis()
    dfg = nothing
    for x in 26176
        print("\r peeking at GCN $x ")
        try
            url = "https://gcn.nasa.gov/circulars/$x/raw"
            resp = HTTP.get(url; status_exception=false)  # don't throw on 404
            status = resp.status
            print(" ", status, " ")
            if status == 404; println("status=", status); continue; end
            txt = String(resp.body)
            m = match(r"GRB ?\d{6}([A-G]|(\.\d{2}))?", txt)
            m === nothing || print(m.match)

            if occursin("GROND", txt)
                println(" GROND report")
                # NOTE: "J. Bolmer" in the author list also starts with J,
                # so this anchor matches too early -- the reported failure
                he = first(findfirst(r"^(g'|r'|i'|z'|J|H|K)"m, txt))
                lr = first(findnext(r"^(((?:[\t ]*(?:\r?\n|\r))+)|(The)|(Given))"m, txt, he)) - 1
                cltxt = replace(txt[he:lr], "mag" => " ", "," => "|", r" ?(=|>)" => "|", "+/-" => "|", "οΏ½" => " ")
                df = CSV.read(IOBuffer(cltxt), DataFrame, delim="|", header=0)
                df.GCN = fill(x, nrow(df))
                df.GRB = fill(m.match, nrow(df))
                rename!(df, "Column1" => "Filter", "Column2" => "Mag", "Column3" => "Mag_err")
                if isnothing(dfg)
                    @show dfg = df
                else
                    @show dfg = vcat(dfg, df)
                end # if x is first
            end # if occursin
        catch e
            println("error: ", e)   # show the actual exception
        end # trycatch
    end # for loop
    return dfg
end

doanalysis()

The above code is also not working for GCN 20843.

using HTTP, CSV, DataFrames

function doanalysis()
    dfg = nothing
    for x in 26066
        print("\r peeking at GCN $x ")
        try
            url = "https://gcn.nasa.gov/circulars/$x/raw"
            resp = HTTP.get(url; status_exception=false)  # don't throw on 404
            status = resp.status
            print(" ", status, " ")
            if status == 404; println("status=", status); continue; end
            txt = String(resp.body)
            m = match(r"GRB ?\d{6}([A-G]|(\.\d{2}))?", txt)
            m === nothing || print(m.match)

            if occursin("GROND", txt)
                println(" GROND report")
                # J[^\.] avoids matching the "J." of author initials
                he = first(findfirst(r"^(g'|r'|i'|z'|J[^\.]|H|K)"m, txt))
                lr = first(findnext(r"^(((?:[\t ]*(?:\r?\n|\r))+)|(The)|(Given))"m, txt, he)) - 1
                cltxt = replace(txt[he:lr], "mag" => " ", "," => "|", r" ?(=|>)" => "|", "+/-" => "|", "οΏ½" => " ")
                df = CSV.read(IOBuffer(cltxt), DataFrame, delim="|", header=0)
                df.GCN = fill(x, nrow(df))
                df.GRB = fill(m.match, nrow(df))
                rename!(df, "Column1" => "Filter", "Column2" => "Mag")  # ,"Column3" => "Mag_err")
                if isnothing(dfg)
                    @show dfg = df
                else
                    @show dfg = vcat(dfg, df)
                end # if x is first
            end # if occursin
        catch e
            println("error: ", e)   # show the actual exception
        end # trycatch
    end # for loop
    return dfg
end

doanalysis()
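Besides excluding author initials with `J[^\.]`, another way to make the anchor robust is to require an actual measurement on the matched line, i.e. a `=` or `>` after the filter name. A sketch on illustrative sample text:

```julia
# Anchor only on filter lines that carry a magnitude, so "J. Bolmer"
# in the author list no longer matches. Sample text is my own.
txt = "J. Bolmer reports:\n...\ng' > 23.2 mag\n"

old = r"^(g'|r'|i'|z'|J|H|K)"m          # matches the "J" of "J. Bolmer"
new = r"^(g'|r'|i'|z'|J|H|K)\s*[>=]"m   # skips it, matches the g' line

first(findfirst(old, txt))   # 1  (wrong anchor)
first(findfirst(new, txt))   # 24 (start of the g' line)
```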

About the other page: Page not found.

PS
Excuse me, could you explain the purpose of this work?

Since retrieving this data is so messy, wouldn't it be better to do it by hand?


I have to scrape thousands of web pages.
I want to compile the data so that it can be used in research.
Your code is not working.
The other page is 20843.

If you want to be helped, it is not enough to say the code does not work; you have to at least show where.
I had already advised you to remove all the checks while testing and keep only the main expressions that capture the text.

The page has disappeared!

https://gcn.nasa.gov/circulars/26066/raw

So the pages, besides having a variable format, also hide themselves every now and then?


Please type in English. The GCN web site has removed the raw option; they have now placed a Text icon instead.
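If the /raw link stops resolving, one hedged workaround is to try a list of candidate URLs in order. The `.json` endpoint below is only an assumption based on the JSON3-based approach mentioned earlier in the thread; verify it against the live site before relying on it. The helper names are my own:

```julia
using HTTP

# Hypothetical fallback list: /raw first, then an assumed JSON endpoint.
candidate_urls(x::Integer) = [
    "https://gcn.nasa.gov/circulars/$x/raw",
    "https://gcn.nasa.gov/circulars/$x.json",
]

# Return the body of the first URL that answers 200, or nothing.
function fetch_circular(x::Integer)
    for url in candidate_urls(x)
        resp = HTTP.get(url; status_exception=false)  # don't throw on 404
        resp.status == 200 && return String(resp.body)
    end
    return nothing
end
```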