Get index corresponding to some number in list of outputs

I want to get the indices in lines corresponding to A=0 in the output. :face_with_monocle:
My goal is to get the data between row index 14 and 23 of the lines variable output, expressed in terms of the variable i.

using HTTP, Gumbo, Cascadia, DataFrames, ParserCombinator, Profile
url="https://gcn.nasa.gov/circulars/33733"
r=HTTP.get(url)
h=parsehtml(String(r.body))
body=h.root[2]
Div=eachmatch(Selector(".text-pre-wrap.margin-top-2"), body)
Div[1]
txt=text(Div[1])


lines = split(txt, "\n")

length(lines)

28

for i in 1:length(lines)
	try
	    parse_one(lines[i], Equal("Filter")) == ["Filter"] || continue
	catch e
		isa(e, ParserException) || rethrow()
	else
		print(i)
	end
end

13

i=13


for x in i:length(lines)
	@show A=ncodeunits(lines[x])
end
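To get the indices where A == 0 directly (i.e. the empty lines), findall avoids the loop entirely; a minimal sketch with made-up sample data:

```julia
# Made-up stand-in for the `lines` vector from the post.
lines = ["Filter  T_start(s)", "white  113  271", "", "The next section"]

# Indices of the empty lines (the A = 0 cases):
blanks = findall(isempty, lines)

# The table body runs from the header line up to the next blank line:
hdr  = findfirst(l -> startswith(l, "Filter"), lines)
stop = findnext(isempty, lines, hdr + 1) - 1
table = lines[hdr:stop]
```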

I don’t know if all the packages you invoked are needed. I’d do this (and I’m sure there are even more direct ways to get to the table you’re looking for):

using CSV, DataFrames
lines=split(String(HTTP.get(url)),'\n')

fl=findfirst(l->startswith(l,"Filter"),lines)

ll=findnext(==(""),lines,fl+2)-1

tbl=map(l->replace(l, " +/- "=>"+/-", "&gt;"=>'>'), lines[fl:ll])

data=join(tbl.*'\n')

df=CSV.read(IOBuffer(data), DataFrame, delim=' ', ignorerepeated=true)

julia> df=CSV.read(IOBuffer(data), DataFrame, delim=' ', ignorerepeated=true)  
8×5 DataFrame
 Row │ Filter   T_start(s)  T_stop(s)  Exp(s)  Mag
     │ String7  Int64       Int64      Int64   String15
─────┼──────────────────────────────────────────────────────
   1 │ white           113        271     147  20.19+/-0.24
   2 │ white           614        807      38  >19.8
   3 │ u               334        583     247  >19.9
   4 │ u             10511      11899     738  20.66+/-0.30
   5 │ b               589        783      39  >19.1
   6 │ uvw1            714        734      20  >17.6
   7 │ uvm2            688        708      20  >17.2
   8 │ uvw2            639        658      20  >19.6

You are a wizard. :innocent: How did you reach that level of understanding? Please suggest some resources for me to study. Please also add the HTTP package to your answer. Thank you.

Exaggerated: the real "wizards" on this forum are others.
My sources are essentially this forum, where I randomly follow some topics without any specific interest: just curiosity.
I also bought some books, but I've never read them (too lazy and unmotivated; if I don't have a specific problem to apply myself to, I don't study the theory).
In this case I made several attempts before arriving at the script I posted. Since I've used CSV before, I tried to see if it was applicable in this case too.
In fact, the published idea didn't satisfy me, because it first split the text into various lines and then had to rejoin them for the table part.
I now propose (also adding using HTTP) a different idea that avoids split() and join(). An even more elegant solution could make use of regular expressions to select the part of the text of interest.

using CSV, DataFrames, HTTP
url="https://gcn.nasa.gov/circulars/33733"
txt=String(HTTP.get(url))

hb,he=findfirst("Filter",txt)
# _,fr=findnext("\n\n",txt,fe)
# lr,_=findnext("\n\n",txt,fr)
lr,_=findnext("\n\nThe",txt,he)

cltxt=replace(txt[hb:lr], " +/- "=>"+/-", "&gt;"=>'>')
julia> df=CSV.read(IOBuffer(cltxt), DataFrame, delim=' ', ignorerepeated=true)
8×5 DataFrame
 Row │ Filter   T_start(s)  T_stop(s)  Exp(s)  Mag
     │ String7  Int64       Int64      Int64   String15
─────┼──────────────────────────────────────────────────────
   1 │ white           113        271     147  20.19+/-0.24
   2 │ white           614        807      38  >19.8
   3 │ u               334        583     247  >19.9
   4 │ u             10511      11899     738  20.66+/-0.30
   5 │ b               589        783      39  >19.1
   6 │ uvw1            714        734      20  >17.6
   7 │ uvm2            688        708      20  >17.2
   8 │ uvw2            639        658      20  >19.6
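The regular-expression idea mentioned above could look something like this; the pattern and the sample text are assumptions, tuned to the "Filter ... \n\n" layout of these circulars, not a tested drop-in:

```julia
# `^Filter` with the `m` flag anchors at a line start; `s` lets `.` span
# newlines; the lazy `.*?` stops at the first blank line (lookahead `\n\n`).
txt = "intro\nFilter  Exp(s)\nwhite  113\n\nThe magnitudes follow."
m = match(r"^Filter.*?(?=\n\n)"ms, txt)
tbltxt = m.match
```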

PS
I take the opportunity to ask whether, when doing the split, it is possible to keep the delimiter used to divide the various pieces.
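On the PS question: one possible way to keep the delimiter is to split on a zero-width lookbehind, so the match itself consumes no characters (a sketch; the zero-width-regex behaviour of split is an assumption worth verifying on your Julia version):

```julia
# Split after each '\n' without consuming it: every piece except the
# last keeps its trailing newline.
s = "a\nb\nc"
pieces = split(s, r"(?<=\n)")
```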


For url = GCN - Circulars - 34135: GRB 230628E: Swift/UVOT Detection there is an issue with the space before (FC) in the filter column.

How can I overcome this issue?

Without having Julia at hand at the moment: try setting the delimiter to delim="    " (four spaces), or replace all runs of four spaces with a , and use the comma as the delimiter.

I am already using delim=' '

df=CSV.read(IOBuffer(cltxt), DataFrame, delim=' ', ignorerepeated=true)

Is there any way to ignore things in ( ) so that (FC) can be removed?

The table in the URL in question has fixed column widths. Parsing using separators is less ideal than simply splitting each line at the correct positions.

Finding the correct column ranges for each field automatically is a little trickier. Perhaps scanning the header line and detecting the numeric columns would allow it to be done. Maybe regular expressions are the way to go.

If it is a single type of table to be parsed many times, manually determining the column ranges once and hard-coding them might be a solution.
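The hard-coded-ranges idea could be sketched like this; the column ranges below are hypothetical, measured once by hand from a sample line:

```julia
# Cut one fixed-width row at hand-measured byte positions (ASCII text,
# so byte indices equal character indices here).
cols = [1:10, 11:15, 16:20]
line = "white (FC)  221  371"
fields = [strip(line[r]) for r in cols]   # ["white (FC)", "221", "371"]
tstart = parse(Int, fields[2])            # 221
```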

Ah, that was a typo, I meant "    " (4 spaces), which might not work, since delim seems to accept just one character.

It is not working anyway; you can try the code given below :slightly_smiling_face:

using CSV, DataFrames, HTTP
url="https://gcn.nasa.gov/circulars/34135"
txt=String(HTTP.get(url))

hb,he=findfirst("Filter",txt)
lr,_=findnext("\n\nThe",txt,he)

cltxt=replace(txt[hb:lr], " +/- "=>"+/-", "&gt;"=>'>')
df=CSV.read(IOBuffer(cltxt), DataFrame, delim=' ', ignorerepeated=true)

Is there any way to ignore things in ( ) brackets so that (FC) can be removed?

As I already said: the delimiter is a single character, so that will most probably not work; I proposed replacing four spaces with a comma instead.
Another approach is of course to use a regular expression to remove the brackets and their content, sure.
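That bracket-removal regex could be sketched like this (the pattern is an assumption, not tested against the full circular text):

```julia
# Remove parenthesised tokens such as "(FC)", plus the blank before them.
s = "white (FC) 221"
clean = replace(s, r"\s*\([^)]*\)" => "")   # "white 221"
```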

Otherwise try the approaches by Dan.

But – similar to previous questions you posted – repeating your question does not help.

cltxt=replace(txt[hb:lr], " +/- "=>"+/-", "&gt;"=>'>', r"(\.\d+) +"=>s"\1") # also suppress the blanks after a decimal part like '.24'

df=CSV.read(IOBuffer(cltxt), DataFrame, delim="  ", ignorerepeated=true)  #delim has 2 blanks 

julia> cltxt=replace(txt[hb:lr], " +/- "=>"+/-", r"  +(\w)"=>s"\t\1", r"  +(&gt;)"=>s"\t>");

julia> df=CSV.read(IOBuffer(cltxt), DataFrame, delim='\t')
8×5 DataFrame
 Row │ Filter      T_start(s)  T_stop(s)  Exp(s)  Mag
     │ String15    Int64       Int64      Int64   String31
─────┼────────────────────────────────────────────────────────────────
   1 │ white (FC)         221        371     167  20.10+/-0.20
   2 │ wh                 501       1194     223  20.69+/-0.34
   3 │ v                  379       1243      68  >19.1
   4 │ b                  477       1170      78  >19.9
   5 │ u                  452       1145      78  >19.6
   6 │ w1                 428       1120      78  >19.4
   7 │ m2                1249       1269      19  >18.2
   8 │ w2                 700        720      19  >18.5

This is also working:

cltxt=replace(txt[hb:lr], " +/- "=>"+/-", "&gt;"=>'>', "(FC)"=>"")

df=CSV.read(IOBuffer(cltxt), DataFrame, delim=" ", ignorerepeated=true)

The headers are not shown properly, although (FC) is printed by your code.

Here is the output of the two scripts. The second is able to correctly parse the numeric and string types.

Neither formats the header incorrectly.

julia> cltxt=replace(txt[hb:lr], " +/- "=>"+/-", "&gt;"=>'>', r"(\.\d+) +"=>s"\1")
"Filter         T_start(s)   T_stop(s)      Exp(s)           Mag\n\nwhite (FC)         221          371          167         20.10+/- 0.20\nwh                 501         1194          223         20.69+/- 0.34\nv                  379         1243           68        >19.1\nb                  477         1170           78        >19.9\nu                  452         1145           78        >19.6\nw1                 428         1120           78        >19.4\nm2                1249         1269           19        >18.2\nw2                 700          720           19        >18.5\n"

julia> df=CSV.read(IOBuffer(cltxt), DataFrame, delim="  ", ignorerepeated=true)
8×5 DataFrame
 Row │ Filter       T_start(s)   T_stop(s)  Exp(s)    Mag
     │ String15    String7      String7     String3  String15
─────┼──────────────────────────────────────────────────────────────
   1 │ white (FC)   221         371         167       20.10+/- 0.20
   2 │ wh           501          1194       223       20.69+/- 0.34
   3 │ v           379           1243        68      >19.1
   4 │ b           477           1170        78      >19.9
   5 │ u           452           1145        78      >19.6
   6 │ w1           428          1120        78      >19.4
   7 │ m2          1249          1269        19      >18.2
   8 │ w2           700         720          19      >18.5
julia> cltxt=replace(txt[hb:lr], " +/- "=>"+/-", r"  +(\w)"=>s"\t\1", r"  +(&gt;)"=>s"\t>")
"Filter\tT_start(s)\tT_stop(s)\tExp(s)\tMag\n\nwhite (FC)\t221\t371\t167\t20.10+/-0.20\nwh\t501\t1194\t223\t20.69+/-0.34       \nv\t379\t1243\t68\t>19.1\nb\t477\t1170\t78\t>19.9\nu\t452\t1145\t78\t>19.6\nw1\t428\t1120\t78\t>19.4\nm2\t1249\t1269\t19\t>18.2\nw2\t700\t720\t19\t>18.5\n"

julia> df=CSV.read(IOBuffer(cltxt), DataFrame, delim='\t')
8×5 DataFrame
 Row │ Filter      T_start(s)  T_stop(s)  Exp(s)  Mag
     │ String15    Int64       Int64      Int64   String31
─────┼────────────────────────────────────────────────────────────────
   1 │ white (FC)         221        371     167  20.10+/-0.20
   2 │ wh                 501       1194     223  20.69+/-0.34
   3 │ v                  379       1243      68  >19.1
   4 │ b                  477       1170      78  >19.9
   5 │ u                  452       1145      78  >19.6
   6 │ w1                 428       1120      78  >19.4
   7 │ m2                1249       1269      19  >18.2
   8 │ w2                 700        720      19  >18.5

Please see my new post Make function ignore Case-sensitivity of its input, which concerns a problem with url = GCN - Circulars - 34049: ZTF23aaoohpy (AT2023lcr/ATLAS23msn): Swift/UVOT detection.

This code is not working for url GCN - Circulars - 34049: ZTF23aaoohpy (AT2023lcr/ATLAS23msn): Swift/UVOT detection, as the text contains the word "filter" twice. How can I overcome this issue as well?

using CSV, DataFrames, HTTP
url="https://gcn.nasa.gov/circulars/34049"
txt=String(HTTP.get(url))
hb,he= findfirst(r"filter"i, txt)
lr,_=findnext("\n\nThe",txt,he)
cltxt=replace(txt[hb:lr], " +/- "=>"+/-", r"  +(\w)"=>s"\t\1", r"  +(&gt;)"=>s"\t>")
df=CSV.read(IOBuffer(cltxt), DataFrame, delim='\t')

I read, following the link suggested by @stevengj, that in addition to the 'i' flag there is the 'm' flag, which enables the use of '^' as a marker for the beginning of a line.
In the case you propose, the difference between the two occurrences of the word 'filter' is precisely that the 'filter' you are looking for is placed at the beginning of a line. There is also the fact that after 'filter' there is a space, but that difference is less "robust", in my opinion.

hb,he=findfirst(r"^Filter"im,txt)
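A toy demonstration of why the 'm' flag matters here (the sample string is made up, mimicking the circular layout):

```julia
s = "Swift/UVOT filter analysis\nFilter  Exp(s)\n"

# Without `m`, `^` only anchors at the start of the whole string,
# so the lowercase mid-line "filter" is NOT matched: returns nothing.
findfirst(r"^filter"i, s)

# With `m`, `^` also anchors right after each '\n', so this finds
# the "Filter" that starts the second line.
findfirst(r"^filter"im, s)
```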

PS
But seeing how the table is formatted, I believe you have many other problems to sort out.


Yes, there is an issue due to the word "Detection " in GCN - Circulars - 34049: ZTF23aaoohpy (AT2023lcr/ATLAS23msn): Swift/UVOT detection.

Try this code given below:

using CSV, DataFrames, HTTP
url="https://gcn.nasa.gov/circulars/34049"
txt=String(HTTP.get(url))
hb,he= findfirst(r"^filter"im, txt)
lr,_=findnext("\n\nThe",txt,he)
cltxt=replace(txt[hb:lr], " +/- "=>"+/-", r"  +(\w)"=>s"\t\1", r"  +(&gt;)"=>s"\t>")
df=CSV.read(IOBuffer(cltxt), DataFrame, delim='\t')

Unless I misinterpret the table (view ".\raw"), there are several problems to deal with.
One is the header of the fourth field, which spans two lines; then there are the missing values on lines 1, 2 and 5…
I don't know if there is any package that has convenient methods for such cases.

PS
Notice that the dataframe has only one column. There is work to do to divide the various pieces.
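One way to divide the pieces despite the missing cells is again fixed positions, mapping empty cells to missing; a sketch with hypothetical column ranges and two made-up rows:

```julia
# Cut each line at fixed byte positions; `intersect` guards against
# short lines, and empty cells become `missing`.
rows = ["u       157     20.6 +/- 0.5",
        "b       157"]
cols = [1:8, 9:15, 16:28]
cell(l, r) = (s = strip(l[intersect(r, 1:ncodeunits(l))]); isempty(s) ? missing : String(s))
parsed = [[cell(l, r) for r in cols] for l in rows]
```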

julia> hb,he=findfirst(r"^Filter"im,txt)
10949:10954

julia> lr,_=findnext("\n\nThe",txt,he)
11391:11395

julia> cltxt=replace(txt[hb:lr], "&gt;"=>">")
"FILTER  EXP(s)  MAG           Significance of   Upper Limits (3o)\n                                Detection\nv       157                                      >19.6\nb       157                                      >20.6\nu       157     20.6 ± 0.5     2.1 sigma         >20.1\nw1      315     20.4 ± 0.5     2.1 sigma         >20.0\nm2     1489     20.8 ± 0.3     3.9 sigma\nw2      629     21.0 ± 0.4     2.5 sigma         >20.7\n"

julia> df=CSV.read(IOBuffer(cltxt), DataFrame)
7×1 DataFrame
 Row │ FILTER  EXP(s)  MAG           Significance of   Upper Limits (3o)
     │ String
─────┼───────────────────────────────────────────────────────────────────
   1 │                                 Detection
   2 │ v       157                                      >19.6
   3 │ b       157                                      >20.6
   4 │ u       157     20.6 ± 0.5     2.1 sigma         >20.1
   5 │ w1      315     20.4 ± 0.5     2.1 sigma         >20.0
   6 │ m2     1489     20.8 ± 0.3     3.9 sigma
   7 │ w2      629     21.0 ± 0.4     2.5 sigma         >20.7