Scrape table from NASA GCN circulars website

Can you scrape the table from GCN Circulars 31493 (GRB 220118A: MITSuME Akeno optical observation)? In this case both the start and the end anchor of the table are "T0+".

Please format the resulting table nicely and post it; one can't make sense of anything even with the RAW view.

For GCN 31418, the table should look like:

T0+[hours]  MID-UT               T-EXP[sec]  FILTER  5-sigma limits
22.0        2022-01-08 18:58:13  11820       g'      20.7
22.0        2022-01-08 18:58:13  11820       Rc      20.9
22.0        2022-01-08 18:58:13  11820       Ic      20.1
julia> urlmessy="https://gcn.nasa.gov/circulars/31493/raw"
"https://gcn.nasa.gov/circulars/31493/raw"

julia> txt=String((HTTP.get(urlmessy)))
"HTTP/1.1 200 OK\r\nContent-Type: text/plain;charset=UTF-8\r\nTransfer-Encoding: chunked\r\nConnection: keep-alive\r\nDate: Sat, 29 Jul 2023 17:08:47 GMT\r\nLink: <https://gcn.nasa.gov/circulars/31493>; rel=\"canonical\"\r\nApigw-Requestid: I1gg8imKoAMEJpg=\r\nContent-Encoding: gzip\r\nVary: Accept-Encoding\r\nX-Cache: Miss from cloudfront\r\nVia: 1.1 c651b6f427de520af17b746abf0c7ee6.cloudfront.net (CloudFront)\r\nX-Amz-Cf-Pop: MXP64-P2\r\nX-Amz-Cf-Id: QOaNnew_Vj-jdlhBHyq" β‹― 2168 bytes β‹― ">19.9 |\n-----------------------------------------------------------------------------------------------------------------\nT0+ : Elapsed time after the burst\nT-EXP: Total Exposure time\n\nWe used PS1 catalog for flux calibration.\nThe magnitudes are expressed in the AB system.\nThe images were processed in real-time through the MITSuME GPU\nreduction pipeline (Niwano et al. 2021, PASJ, Vol.73, Issue 1, Pages\n4-24; https://github.com/MNiwano/Eclaire)."

julia>     he=first(findfirst(r"^\nT0+"im,txt))+1
1651

julia>     lr=first(findnext("|\n---",txt,he))
2624

julia>     cltxt=txt[he:lr]
"T0+[sec] MID-UT T-EXP[sec] magnitude(or 5-sigma limits) 5-sigma limits\n-----------------------------------------------------------------------------------------------------------------\n56   | 2022-01-18 18:21:34 | 10   |g'>15.9,         Rc>16.8,\nIc>16.8        | g'>15.9, Rc>16.8, Ic>16.8 |\n84   | 2022-01-18 18:22:02 | 40   |g'=18.36+/-0.70, Rc=18.57+/-0.78,\nIc>17.5        | g'>17.3, Rc>17.9, Ic>17.5 |\n171  | 2022-01-18 18:23:29 | 50   |g'=17.45+/" β‹― 75 bytes β‹― " 2022-01-18 18:24:54 | 60   |g'=17.31+/-0.18, Rc=16.87+/-0.09,\nIc=16.64+/-0.11| g'>17.7, Rc>18.5, Ic>18.1 |\n325  | 2022-01-18 18:26:03 | 60   |g'=17.31+/-0.22, Rc=17.31+/-0.13,\nIc=17.08+/-0.15| g'>17.7, Rc>18.4, Ic>18.1 |\n535  | 2022-01-18 18:29:33 | 300  |g'=18.21+/-0.17, Rc=17.80+/-0.10,\nIc=17.47+/-0.10| g'>18.7, Rc>19.5, Ic>19.1 |\n1471 | 2022-01-18 18:45:09 | 1020 |g'=19.82+/-0.33, Rc=19.26+/-0.14,\nIc=18.81+/-0.15| g'>19.5, Rc>20.3, Ic>19.9 |"

julia>     clcltxt=replace(cltxt, r"\n(Ic)"=>s"\1")
"T0+[sec] MID-UT T-EXP[sec] magnitude(or 5-sigma limits) 5-sigma limits\n-----------------------------------------------------------------------------------------------------------------\n56   | 2022-01-18 18:21:34 | 10   |g'>15.9,         Rc>16.8,Ic>16.8        | g'>15.9, Rc>16.8, Ic>16.8 |\n84   | 2022-01-18 18:22:02 | 40   |g'=18.36+/-0.70, Rc=18.57+/-0.78,Ic>17.5        | g'>17.3, Rc>17.9, Ic>17.5 |\n171  | 2022-01-18 18:23:29 | 50   |g'=17.45+/-0" β‹― 68 bytes β‹― "6  | 2022-01-18 18:24:54 | 60   |g'=17.31+/-0.18, Rc=16.87+/-0.09,Ic=16.64+/-0.11| g'>17.7, Rc>18.5, Ic>18.1 |\n325  | 2022-01-18 18:26:03 | 60   |g'=17.31+/-0.22, Rc=17.31+/-0.13,Ic=17.08+/-0.15| g'>17.7, Rc>18.4, Ic>18.1 |\n535  | 2022-01-18 18:29:33 | 300  |g'=18.21+/-0.17, Rc=17.80+/-0.10,Ic=17.47+/-0.10| g'>18.7, Rc>19.5, Ic>19.1 |\n1471 | 2022-01-18 18:45:09 | 1020 |g'=19.82+/-0.33, Rc=19.26+/-0.14,Ic=18.81+/-0.15| g'>19.5, Rc>20.3, Ic>19.9 |"

julia>     df=CSV.read(IOBuffer(clcltxt), DataFrame, delim='|',header=false, skipto=3, dateformat="yyyy-mm-dd HH:MM:SS")
7Γ—6 DataFrame
 Row β”‚ Column1  Column2              Column3  Column4                            Column5                      Column6 
     β”‚ Int64    DateTime             Int64    String                             String31                     Missing
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚      56  2022-01-18T18:21:34       10  g'>15.9,         Rc>16.8,Ic>16.8…   g'>15.9, Rc>16.8, Ic>16.8   missing
   2 β”‚      84  2022-01-18T18:22:02       40  g'=18.36+/-0.70, Rc=18.57+/-0.78…   g'>17.3, Rc>17.9, Ic>17.5   missing
   3 β”‚     171  2022-01-18T18:23:29       50  g'=17.45+/-0.20, Rc=16.90+/-0.15…   g'>17.5, Rc>18.1, Ic>17.5   missing
   4 β”‚     256  2022-01-18T18:24:54       60  g'=17.31+/-0.18, Rc=16.87+/-0.09…   g'>17.7, Rc>18.5, Ic>18.1   missing
   5 β”‚     325  2022-01-18T18:26:03       60  g'=17.31+/-0.22, Rc=17.31+/-0.13…   g'>17.7, Rc>18.4, Ic>18.1   missing
   6 β”‚     535  2022-01-18T18:29:33      300  g'=18.21+/-0.17, Rc=17.80+/-0.10…   g'>18.7, Rc>19.5, Ic>19.1   missing
   7 β”‚    1471  2022-01-18T18:45:09     1020  g'=19.82+/-0.33, Rc=19.26+/-0.14…   g'>19.5, Rc>20.3, Ic>19.9   missing

julia>     h=readuntil(IOBuffer(cltxt), '\n')
"T0+[sec] MID-UT T-EXP[sec] magnitude(or 5-sigma limits) 5-sigma limits"

julia>     hd=["T0+[sec]", "MID-UT", "T-EXP[sec]", "magnitude(or 5-sigma limits)", "5-sigma limits","m"]
6-element Vector{String}:
 "T0+[sec]"
 "MID-UT"
 "T-EXP[sec]"
 "magnitude(or 5-sigma limits)"
 "5-sigma limits"
 "m"

julia>     rename!(df,hd)
7Γ—6 DataFrame
 Row β”‚ T0+[sec]  MID-UT               T-EXP[sec]  magnitude(or 5-sigma limits)       5-sigma limits               m       
     β”‚ Int64     DateTime             Int64       String                             String31                     Missing
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚       56  2022-01-18T18:21:34          10  g'>15.9,         Rc>16.8,Ic>16.8…   g'>15.9, Rc>16.8, Ic>16.8   missing
   2 β”‚       84  2022-01-18T18:22:02          40  g'=18.36+/-0.70, Rc=18.57+/-0.78…   g'>17.3, Rc>17.9, Ic>17.5   missing
   3 β”‚      171  2022-01-18T18:23:29          50  g'=17.45+/-0.20, Rc=16.90+/-0.15…   g'>17.5, Rc>18.1, Ic>17.5   missing
   4 β”‚      256  2022-01-18T18:24:54          60  g'=17.31+/-0.18, Rc=16.87+/-0.09…   g'>17.7, Rc>18.5, Ic>18.1   missing
   5 β”‚      325  2022-01-18T18:26:03          60  g'=17.31+/-0.22, Rc=17.31+/-0.13…   g'>17.7, Rc>18.4, Ic>18.1   missing
   6 β”‚      535  2022-01-18T18:29:33         300  g'=18.21+/-0.17, Rc=17.80+/-0.10…   g'>18.7, Rc>19.5, Ic>19.1   missing
   7 β”‚     1471  2022-01-18T18:45:09        1020  g'=19.82+/-0.33, Rc=19.26+/-0.14…   g'>19.5, Rc>20.3, Ic>19.9   missing
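The magnitude and limit columns above still pack the three filters into one string. A small helper can split such a string into per-filter values; this is my own sketch (the name `split_limits` is illustrative, not part of the session above):

```julia
# Sketch: split a combined limits string like "g'>19.5, Rc>20.3, Ic>19.9"
# into a Dict mapping each filter name to its numeric limit.
function split_limits(s::AbstractString)
    vals = Dict{String,Float64}()
    for part in split(s, ',')
        m = match(r"(\S+)\s*>\s*([\d.]+)", strip(part))
        m === nothing && continue
        vals[m.captures[1]] = parse(Float64, m.captures[2])
    end
    return vals
end

split_limits("g'>19.5, Rc>20.3, Ic>19.9")
```

Applied per row, this turns the `5-sigma limits` string column into tidy per-filter columns.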

In the regular expression r"^\nT0+"im you are already using ^ (which matches the start of a line), so why do you need the \n? In my code it is not working.
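A minimal demonstration of the difference, on my own example strings: with the `m` flag, `^` already matches at the start of every line, so the extra `\n` additionally requires a *blank* line just before `T0+` (which is why the session above needed `first(...)+1` to skip the matched newline). Note also that an unescaped `+` is a quantifier, so `T0+` means "T followed by one or more zeros"; use `T0\+` to match a literal plus sign.

```julia
txt = "header line\nT0+[sec] MID-UT\n56 | ..."

# Requires a blank line before "T0+": no match in this text.
findfirst(r"^\nT0+"m, txt)    # nothing

# Anchors directly at the start of the "T0+" line.
findfirst(r"^T0\+"m, txt)     # matches at index 13
```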

using HTTP, CSV, DataFrames

function doanalysis()
    dfg = nothing
    for x in 31493
        print("\r peeking at GCN $x ")
        try
            url = "https://gcn.nasa.gov/circulars/$x/raw"
            resp = HTTP.get(url; status_exception=false)  # don't throw on 404
            status = resp.status
            print(" ", status, " ")
            if status == 404; println("status=", status); continue; end
            txt = String(resp.body)
            m = match(r"GRB ?\d{6}([A-G]|(\.\d{2}))?", txt)
            m === nothing || print(m.match)

            if occursin("MITSuME", txt)
                println(" MITSuME report")
                # the leading \n expects a blank line before the header; +1 skips it
                he = first(findfirst(r"^\nT0\+"im, txt)) + 1
                lr = first(findnext("|\n---", txt, he))
                cltxt = replace(txt[he:lr], "," => " ", r"(οΏ½+)" => " ", ">" => " ", "-" => " ")
                df = CSV.read(IOBuffer(cltxt), DataFrame, delim=" ")
                df.GCN = fill(x, nrow(df))
                df.GRB = fill(m.match, nrow(df))
                if isnothing(dfg)
                    @show dfg = df
                else
                    @show dfg = vcat(dfg, df)
                end # if x is first
            end # if occursin
        catch e
            println("error at try: ", e)   # show the actual exception
        end # trycatch
    end # for loop
    return dfg
end

doanalysis()
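Since the goal is one file combining circulars from different telescopes, whose tables will not share identical columns, a plain `vcat` will eventually fail on a column mismatch. A sketch of one way around this, using `cols=:union` (the example frames are illustrative):

```julia
using DataFrames

# Collect each circular's table, then concatenate once at the end.
# cols=:union keeps every column that appears anywhere and fills the
# ones a given table lacks with `missing`.
tables = DataFrame[]
push!(tables, DataFrame(GCN=[31493], Filter=["g'"]))
push!(tables, DataFrame(GCN=[31418], Filter=["Rc"], Mag=[20.9]))

dfg = reduce(vcat, tables; cols=:union)   # 2Γ—3; Mag is missing in row 1
```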


@rocco_sprmnt21 I want to combine GCNs from different telescopes in a single file, so I don't want to switch to your JSON3-based code.

@rocco_sprmnt21
The following code is not able to scrape the table from the GCN because "J. Bolmer" occurs on a line above it, and its leading "J" matches the filter pattern.

See code
using HTTP, CSV, DataFrames

function doanalysis()
    dfg = nothing
    for x in 26176
        print("\r peeking at GCN $x ")
        try
            url = "https://gcn.nasa.gov/circulars/$x/raw"
            resp = HTTP.get(url; status_exception=false)  # don't throw on 404
            status = resp.status
            print(" ", status, " ")
            if status == 404; println("status=", status); continue; end
            txt = String(resp.body)
            m = match(r"GRB ?\d{6}([A-G]|(\.\d{2}))?", txt)
            m === nothing || print(m.match)

            if occursin("GROND", txt)
                println(" GROND report")
                # NOTE: "J. Bolmer" in the author list also starts with J,
                # so this anchor matches too early -- the reported failure
                he = first(findfirst(r"^(g'|r'|i'|z'|J|H|K)"m, txt))
                lr = first(findnext(r"^(((?:[\t ]*(?:\r?\n|\r))+)|(The)|(Given))"m, txt, he)) - 1
                cltxt = replace(txt[he:lr], "mag" => " ", "," => "|", r" ?(=|>)" => "|", "+/-" => "|", "οΏ½" => " ")
                df = CSV.read(IOBuffer(cltxt), DataFrame, delim="|", header=0)
                df.GCN = fill(x, nrow(df))
                df.GRB = fill(m.match, nrow(df))
                rename!(df, "Column1" => "Filter", "Column2" => "Mag", "Column3" => "Mag_err")
                if isnothing(dfg)
                    @show dfg = df
                else
                    @show dfg = vcat(dfg, df)
                end # if x is first
            end # if occursin
        catch e
            println("error: ", e)   # show the actual exception
        end # trycatch
    end # for loop
    return dfg
end

doanalysis()

The above code is also not working for GCN 20843.

using HTTP, CSV, DataFrames

function doanalysis()
    dfg = nothing
    for x in 26066
        print("\r peeking at GCN $x ")
        try
            url = "https://gcn.nasa.gov/circulars/$x/raw"
            resp = HTTP.get(url; status_exception=false)  # don't throw on 404
            status = resp.status
            print(" ", status, " ")
            if status == 404; println("status=", status); continue; end
            txt = String(resp.body)
            m = match(r"GRB ?\d{6}([A-G]|(\.\d{2}))?", txt)
            m === nothing || print(m.match)

            if occursin("GROND", txt)
                println(" GROND report")
                # J[^\.] avoids matching the "J." of author initials
                he = first(findfirst(r"^(g'|r'|i'|z'|J[^\.]|H|K)"m, txt))
                lr = first(findnext(r"^(((?:[\t ]*(?:\r?\n|\r))+)|(The)|(Given))"m, txt, he)) - 1
                cltxt = replace(txt[he:lr], "mag" => " ", "," => "|", r" ?(=|>)" => "|", "+/-" => "|", "οΏ½" => " ")
                df = CSV.read(IOBuffer(cltxt), DataFrame, delim="|", header=0)
                df.GCN = fill(x, nrow(df))
                df.GRB = fill(m.match, nrow(df))
                rename!(df, "Column1" => "Filter", "Column2" => "Mag")  # ,"Column3" => "Mag_err")
                if isnothing(dfg)
                    @show dfg = df
                else
                    @show dfg = vcat(dfg, df)
                end # if x is first
            end # if occursin
        catch e
            println("error: ", e)   # show the actual exception
        end # trycatch
    end # for loop
    return dfg
end

doanalysis()
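Besides excluding author initials with `J[^\.]`, another way to make the anchor robust is to require an actual measurement on the matched line, i.e. a `=` or `>` after the filter name. A sketch on illustrative sample text:

```julia
# Anchor only on filter lines that carry a magnitude, so "J. Bolmer"
# in the author list no longer matches. Sample text is my own.
txt = "J. Bolmer reports:\n...\ng' > 23.2 mag\n"

old = r"^(g'|r'|i'|z'|J|H|K)"m          # matches the "J" of "J. Bolmer"
new = r"^(g'|r'|i'|z'|J|H|K)\s*[>=]"m   # skips it, matches the g' line

first(findfirst(old, txt))   # 1  (wrong anchor)
first(findfirst(new, txt))   # 24 (start of the g' line)
```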

About the other page: Page not found.

PS
Excuse me, could you explain the purpose of this work?

Since retrieving this data is so messy, wouldn't it be better to do it by hand?


I have to scrape thousands of web pages.
I want to compile the data so that it can be used in research.
Your code is not working.
The other page is 20843.

If you want to be helped, it is not enough to say the code does not work; you have to at least show where.
I had already advised you to remove all the checks while testing and keep only the main expressions that capture the text.

The page has disappeared!

https://gcn.nasa.gov/circulars/26066/raw

So the pages, besides having a variable format, also hide themselves every now and then?


Please type in English. The GCN web site has removed the raw option; they have now placed a Text icon instead.
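If the /raw link stops resolving, one hedged workaround is to try a list of candidate URLs in order. The `.json` endpoint below is only an assumption based on the JSON3-based approach mentioned earlier in the thread; verify it against the live site before relying on it. The helper names are my own:

```julia
using HTTP

# Hypothetical fallback list: /raw first, then an assumed JSON endpoint.
candidate_urls(x::Integer) = [
    "https://gcn.nasa.gov/circulars/$x/raw",
    "https://gcn.nasa.gov/circulars/$x.json",
]

# Return the body of the first URL that answers 200, or nothing.
function fetch_circular(x::Integer)
    for url in candidate_urls(x)
        resp = HTTP.get(url; status_exception=false)  # don't throw on 404
        resp.status == 200 && return String(resp.body)
    end
    return nothing
end
```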