Get index corresponding to some number in list of outputs

raman_kumar · July 9, 2023, 3:35am

What is this .\raw ?

rocco_sprmnt21 · July 9, 2023, 6:27am

On the same page as the URL there are links to the JSON view and the raw view.

rocco_sprmnt21 · July 9, 2023, 10:25am

This script could (*) handle a fairly general class of cases where some variables are missing for some observations, at least for tables that appear well formatted in the “raw” view, with columns separated by two or more spaces.
A header that spans two lines can be fixed by hand afterwards.

(*) I have not tested other cases. Further complications could arise from non-ASCII characters, for which textwidth and ncodeunits return different values.
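To illustrate the caveat, here is a small sketch using one of the values from this table. The string `s` is just a sample cell; the point is that byte-based indices (which `ncodeunits` and regex matches report) do not line up with character positions once a character such as ± takes more than one byte.

```julia
s = "20.6 ± 0.5"

length(s)                     # 10 characters
ncodeunits(s)                 # 11 bytes, because ± takes two bytes in UTF-8
textwidth(s) < ncodeunits(s)  # true: display width and byte count differ

# Byte 6 starts the ± character, byte 7 falls inside it,
# so slicing the string at byte 7 would throw a StringIndexError.
isvalid(s, 6)                 # true
isvalid(s, 7)                 # false
```

This is why slicing padded lines by byte ranges is only safe when the cut points never land inside a multi-byte character.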

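using CSV, DataFrames, HTTP
url="https://gcn.nasa.gov/circulars/34049"
# Take only the response body as text (without status line and headers).
txt=String(HTTP.get(url).body)
# The m flag makes ^ and $ match at the start and end of individual lines, as opposed to the whole string; i ignores case.
hb,he=findfirst(r"^Filter"im,txt)
# The table ends just before the blank line that is followed by "The".
lr,_=findnext("\n\nThe",txt,he).-1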

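# Cut out the table and undo the HTML escaping of ">".
cltxt=replace(txt[hb:lr], "&gt;"=>">")
ls=split(cltxt,'\n')
# Pad every line with spaces to a common length.
lls=ncodeunits.(ls)
ml=maximum(lls[2:end])
adjls=rpad.(ls,ml,' ')

# For each line, the ranges matched by a run of two or more whitespace characters (including the next non-space character, or reaching the end of the line).
spl=[findall(r"\s\s+\S|\s\s+$"m, adjls[i]) for i in eachindex(ls)]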

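# Split the n-th range of rng at position m, so that it becomes two adjacent ranges.
function splitrange(rng,n,m)
    s=[first(rng[n]):m-1,m:last(rng[n])]
    n==1 ? [s;rng[n+1:end]] : [rng[1:n-1];s;rng[n+1:end]]
end

# For each gap i, take the leftmost start m of the following gap and split every i-th gap that runs past m.
# A gap that spans a missing value is thus cut into one gap per column. Modifies spl in place.
function clrng(spl)
    mr=maximum(length,spl)
    for i in 1:mr-1
        m=minimum([first(e[i+1]) for e in filter(sp->length(sp)>=i+1,spl)])
        id=findall(>=(m),  [last(e[i]) for e in spl])
        for idn in id
            spl[idn]=splitrange(spl[idn],i,m)
        end
    end
end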

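clrng(spl)

# Intersect the i-th gap of every line: the stretch of blanks that all lines share at that column boundary.
pts=[intersect(vr...) for vr in zip(spl...)]

using IterTools
# Turn the cut points into one range of positions per column.
rr=Base.splat(:).(partition(sort([1;first.(pts); last.(pts); ml]),2))

# Slice each line by those ranges, strip the padding, and rejoin with tabs and newlines.
res=join([join(strip.(getindex.([adjls[i]],rr)),'\t').*'\n' for i in eachindex(adjls)])

julia> df=CSV.read(IOBuffer(res), DataFrame, delim='\t')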
7×5 DataFrame
 Row │ FILTER    EXP(s)   MAG         Significance of  Upper L  
     │ String3?  Int64?   String15?   String15?        String7?
─────┼──────────────────────────────────────────────────────────
   1 │ missing   missing  missing     Detection        missing
   2 │ v             157  missing     missing          >19.6
   3 │ b             157  missing     missing          >20.6
   4 │ u             157  20.6 ± 0.5  2.1 sigma        >20.1
   5 │ w1            315  20.4 ± 0.5  2.1 sigma        >20.0
   6 │ m2           1489  20.8 ± 0.3  3.9 sigma        missing
   7 │ w2            629  21.0 ± 0.4  2.5 sigma        >20.7

Actually, no manual fix is needed: it seems that the CSV.jl package can handle multi-line headers (header=[1,2]), although the resulting column names still need some cleanup.

julia> df=CSV.read(IOBuffer(res), DataFrame, delim='\t', header=[1,2])
6×5 DataFrame
 Row │ FILTER_Column1  EXP(s)_Column2  MAG_Column3  Significance of_Detection  Upper L_Column5 
     │ String3         Int64           String15?    String15?                  String7?
─────┼─────────────────────────────────────────────────────────────────────────────────────────
   1 │ v                          157  missing      missing                    >19.6
   2 │ b                          157  missing      missing                    >20.6
   3 │ u                          157  20.6 ± 0.5   2.1 sigma                  >20.1
   4 │ w1                         315  20.4 ± 0.5   2.1 sigma                  >20.0
   5 │ m2                        1489  20.8 ± 0.3   3.9 sigma                  missing
   6 │ w2                         629  21.0 ± 0.4   2.5 sigma                  >20.7
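One possible cleanup of those names, as a minimal sketch: the toy frame below only copies a few of the column names from the output above, and `cleanname` is a hypothetical helper, not something CSV.jl provides. It drops the `_ColumnN` placeholder that appears when the second header line is empty and turns the remaining `_` joiner into a space.

```julia
using DataFrames

# Toy frame with column names copied from the output above (data shortened).
df = DataFrame("FILTER_Column1" => ["v", "b"],
               "EXP(s)_Column2" => [157, 157],
               "Significance of_Detection" => [missing, missing],
               "Upper L_Column5" => [">19.6", ">20.6"])

# Hypothetical helper: remove a trailing "_ColumnN", then replace the
# remaining "_" (between the two header lines) with a space.
cleanname(n) = replace(replace(n, r"_Column\d+$" => ""), "_" => " ")

rename!(cleanname, df)
names(df)   # ["FILTER", "EXP(s)", "Significance of Detection", "Upper L"]
```

The truncated "Upper L" header comes from the slicing step and would still have to be corrected by hand.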

rocco_sprmnt21 · July 9, 2023, 10:34am

The basic idea is to homogenize the ranges of spaces in each line, so that every line has the same number of gaps.
Then you slice the table vertically, cutting only inside the spaces that are common to all lines for each column boundary.
Finally you strip the leading and trailing spaces from each piece and put it all back together, with a TAB between fields and a newline at the end of each row.
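The same three steps can be sketched on a toy table. This is a simplified variant and not the script above: instead of repairing the per-line ranges, it looks for character columns that are blank in every line. The table, the helper `fieldranges` and its `mingap` parameter are made up for illustration, and the text is assumed to be ASCII only.

```julia
# Toy table (ASCII only, invented for illustration). Columns are separated by
# two or more spaces and the first data row has no MAG value.
lines = ["FILTER  EXP   MAG",
         "v       157",
         "u       157   20.6"]

width  = maximum(length, lines)
padded = rpad.(lines, width)                 # make all lines equally long

# A character column is "blank" if every line has a space there.
blank = [all(l[j] == ' ' for l in padded) for j in 1:width]

# Hypothetical helper: fields are the stretches between runs of at least
# `mingap` blank columns; shorter blank runs stay inside a field.
function fieldranges(blank; mingap=2)
    ranges = UnitRange{Int}[]
    n = length(blank)
    start = 1                                # first column of the current field
    j = 1
    while j <= n
        if blank[j]
            k = j                            # find the last column of this blank run
            while k < n && blank[k+1]
                k += 1
            end
            if k - j + 1 >= mingap           # wide enough to separate two fields
                j > start && push!(ranges, start:j-1)
                start = k + 1
            end
            j = k + 1
        else
            j += 1
        end
    end
    start <= n && push!(ranges, start:n)
    return ranges
end

rs    = fieldranges(blank)                   # [1:6, 9:11, 15:18]
cells = [strip.(getindex.(Ref(l), rs)) for l in padded]
tsv   = join(join.(cells, '\t') .* '\n')
# tsv == "FILTER\tEXP\tMAG\nv\t157\t\nu\t157\t20.6\n"
```

The empty field in the second row is what CSV.read would later turn into `missing`. Requiring at least two blank columns keeps values such as "2.1 sigma" in one field.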


julia> spl=[findall(r"\s\s+\S|\s\s+$"m, adjls[i]) for i in eachindex(ls)]
8-element Vector{Vector{UnitRange{Int64}}}:
 [7:9, 15:17, 20:31, 46:49]
 [1:33, 42:55]
 [2:9, 12:50]
 [2:9, 12:50]
 [2:9, 12:17, 28:33, 42:51]
 [3:9, 12:17, 28:33, 42:51]
 [3:8, 12:17, 28:33, 42:56]
 [3:9, 12:17, 28:33, 42:51]

julia> clrng(spl)

julia> spl
8-element Vector{Vector{UnitRange{Int64}}}:
 [7:9, 15:17, 20:31, 46:49]
 [1:11, 12:19, 20:33, 42:55]
 [2:9, 12:19, 20:41, 42:50]
 [2:9, 12:19, 20:41, 42:50]
 [2:9, 12:17, 28:33, 42:51]
 [3:9, 12:17, 28:33, 42:51]
 [3:8, 12:17, 28:33, 42:56]
 [3:9, 12:17, 28:33, 42:51]

julia> pts=[intersect(vr...) for vr in zip(spl...)]
4-element Vector{UnitRange{Int64}}:
 7:8
 15:17
 28:31
 46:49

julia> rr=Base.splat(:).(partition(sort([1;first.(pts); last.(pts); ml]),2))
5-element Vector{UnitRange{Int64}}:
 1:7
 8:15
 17:28
 31:46
 49:55

raman_kumar · July 10, 2023, 3:43am

The Filter column can never be empty, so we can ignore the rows in which it is empty. See the Swift/UVOT page below.
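Applied to the single-header result shown earlier, this could look like the sketch below. The toy frame only mimics the first rows of that output; `dropmissing` from DataFrames.jl removes every row whose FILTER entry is missing.

```julia
using DataFrames

# Toy frame mimicking the first rows of the single-header result
# (the first row is the leftover second header line).
df = DataFrame("FILTER" => [missing, "v", "b"],
               "EXP(s)" => [missing, 157, 157])

# Keep only the rows where FILTER has a value.
clean = dropmissing(df, :FILTER)
nrow(clean)      # 2
clean.FILTER     # ["v", "b"]
```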
