What is this .\raw?
This script could (*) handle a fairly general range of cases (at least tables that appear well formatted in the "raw" view, with columns separated by spaces), including tables where some variables are missing for some observations.
If the header spans two lines, you could fix it afterwards by hand.
(*) I have not tested other cases. Further complications could arise from non-ASCII characters, for which textwidth and ncodeunits do not return the same value.
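To make the caveat about non-ASCII characters concrete, here is a small sketch using only Base functions. The string is a cell like those in the MAG column; the variable names `s` and `padded` are just for illustration.

```julia
s = "20.6 ± 0.5"        # a cell like those in the MAG column

length(s)               # 10 characters
textwidth(s)            # 10 terminal columns
ncodeunits(s)           # 11 bytes: ± takes two code units in UTF-8

padded = rpad(s, 12)    # padding adds two spaces ...
ncodeunits(padded)      # ... so the padded string is 13 bytes long
```

Since findall on a String returns byte indices, a line containing ± can yield ranges that are shifted by one with respect to pure ASCII lines.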
using CSV, DataFrames, HTTP
url="https://gcn.nasa.gov/circulars/34049"
txt=String(HTTP.get(url).body)   # body of the response as a String
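# The m flag makes ^ and $ match the start and end of individual lines, as opposed to the whole string.
# Start of the table: first line beginning with "Filter" (case-insensitive).
hb,he=findfirst(r"^Filter"im,txt)
# End of the table: the character before the blank line that precedes "The".
lr,_=findnext("\n\nThe",txt,he).-1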
cltxt=replace(txt[hb:lr], "&gt;"=>">")   # decode the HTML entity for >
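ls=split(cltxt,'\n')
lls=ncodeunits.(ls)
# longest line, ignoring the first (header) line
ml=maximum(lls[2:end])
# pad every line to the same width
adjls=rpad.(ls,ml,' ')
# runs of two or more whitespace characters (plus the character that follows, if any) in each line
spl=[findall(r"\s\s+\S|\s\s+$"m, adjls[i]) for i in eachindex(ls)]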
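# Split the n-th range of rng in two at position m: first:m-1 and m:last.
function splitrange(rng,n,m)
    s=[first(rng[n]):m-1,m:last(rng[n])]
    n==1 ? [s;rng[n+1:end]] : [rng[1:n-1];s;rng[n+1:end]]
end
# Homogenize the gap ranges in place: a gap that runs across a missing field
# is split where the next gap starts in the other lines.
function clrng(spl)
    mr=maximum(length,spl)
    for i in 1:mr-1
        # leftmost start of gap i+1 among the lines that have one
        m=minimum([first(e[i+1]) for e in filter(sp->length(sp)>=i+1,spl)])
        # lines whose i-th gap reaches position m or beyond
        id=findall(>=(m), [last(e[i]) for e in spl])
        for idn in id
            spl[idn]=splitrange(spl[idn],i,m)
        end
    end
end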
clrng(spl)
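# positions common to the i-th gap of every line
pts=[intersect(vr...) for vr in zip(spl...)]
using IterTools
# column ranges: from the end of one common gap to the start of the next
rr=Base.splat(:).(partition(sort([1;first.(pts); last.(pts); ml]),2))
# cut every line at those ranges, strip the padding, join fields with TABs and end each row with a newline
res=join([join(strip.(getindex.([adjls[i]],rr)),'\t').*'\n' for i in eachindex(adjls)])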
julia> df=CSV.read(IOBuffer(res), DataFrame, delim='\t')
7×5 DataFrame
 Row │ FILTER    EXP(s)   MAG         Significance of  Upper L
     │ String3?  Int64?   String15?   String15?        String7?
─────┼──────────────────────────────────────────────────────────
   1 │ missing   missing  missing     Detection        missing
   2 │ v             157  missing     missing          >19.6
   3 │ b             157  missing     missing          >20.6
   4 │ u             157  20.6 ± 0.5  2.1 sigma        >20.1
   5 │ w1            315  20.4 ± 0.5  2.1 sigma        >20.0
   6 │ m2           1489  20.8 ± 0.3  3.9 sigma        missing
   7 │ w2            629  21.0 ± 0.4  2.5 sigma        >20.7
But no, that is not necessary: it seems that the CSV.jl package can handle multi-line headers, although the resulting column names still need some fixing.
julia> df=CSV.read(IOBuffer(res), DataFrame, delim='\t', header=[1,2])
6×5 DataFrame
 Row │ FILTER_Column1  EXP(s)_Column2  MAG_Column3  Significance of_Detection  Upper L_Column5
     │ String3         Int64           String15?    String15?                  String7?
─────┼─────────────────────────────────────────────────────────────────────────────────────────
   1 │ v                          157  missing      missing                    >19.6
   2 │ b                          157  missing      missing                    >20.6
   3 │ u                          157  20.6 ± 0.5   2.1 sigma                  >20.1
   4 │ w1                         315  20.4 ± 0.5   2.1 sigma                  >20.0
   5 │ m2                        1489  20.8 ± 0.3   3.9 sigma                  missing
   6 │ w2                         629  21.0 ± 0.4   2.5 sigma                  >20.7
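One possible way to tidy the merged header names, sketched on a tiny stand-in DataFrame rather than on the real result. The helper name `tidy` is made up for this example; it drops the `_ColumnN` suffix that is added when the second header line is empty and turns the remaining underscore into a space. It relies on `rename!` from DataFrames accepting a function as its first argument.

```julia
using DataFrames

# Stand-in for the result above, with two of its column names.
df = DataFrame("FILTER_Column1" => ["v"],
               "Significance of_Detection" => ["2.1 sigma"])

# Hypothetical helper: remove the "_ColumnN" suffix, then replace "_" with a space.
tidy(name) = replace(replace(name, r"_Column\d+$" => ""), "_" => " ")

rename!(tidy, df)
names(df)   # ["FILTER", "Significance of Detection"]
```

Names such as "Upper L" would still be truncated, since that happens before CSV.jl sees the text.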
The basic idea is to homogenize the ranges of spaces in each line, so that every line ends up with the same number of gaps.
Then you slice the table vertically, staying inside the spaces common to all lines for each column.
Finally you strip the leading and trailing spaces and put it all back together, with a TAB between fields and a newline at the end of each row.
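The slicing part of the idea can be sketched on a made-up three-line table. The data and the names `lines`, `gaps`, `cuts`, `fields` are illustrative only, and every row has every field, so the homogenization step is not needed here. It uses Base only, ASCII only.

```julia
# Toy table (hypothetical data), columns separated by two or more spaces.
lines = ["Filter   Exp   Mag",
         "v        157   19.6",
         "w1      1489   20.8"]

width  = maximum(textwidth, lines)
padded = rpad.(lines, width)

# 1. Gaps in each line (same pattern as in the script).
gaps = [findall(r"\s\s+\S|\s\s+$"m, l) for l in padded]

# 2. Positions that lie inside the k-th gap of every line.
cuts = [intersect(g...) for g in zip(gaps...)]        # [7:9, 13:16]

# 3. Field ranges: from the end of one common gap to the start of the next.
bounds = sort([1; first.(cuts); last.(cuts); width])
fields = [bounds[i]:bounds[i+1] for i in 1:2:length(bounds)-1]

# 4. Slice each line, strip the padding, join with TABs and newlines.
rows = [strip.(getindex.(Ref(l), fields)) for l in padded]
tsv  = join(join.(rows, '\t'), '\n')
```

Each field range may start or end on a space, which is why the strip in step 4 is needed.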
julia> spl=[findall(r"\s\s+\S|\s\s+$"m, adjls[i]) for i in eachindex(ls)]
8-element Vector{Vector{UnitRange{Int64}}}:
[7:9, 15:17, 20:31, 46:49]
[1:33, 42:55]
[2:9, 12:50]
[2:9, 12:50]
[2:9, 12:17, 28:33, 42:51]
[3:9, 12:17, 28:33, 42:51]
[3:8, 12:17, 28:33, 42:56]
[3:9, 12:17, 28:33, 42:51]
julia> clrng(spl)
julia> spl
8-element Vector{Vector{UnitRange{Int64}}}:
[7:9, 15:17, 20:31, 46:49]
[1:11, 12:19, 20:33, 42:55]
[2:9, 12:19, 20:41, 42:50]
[2:9, 12:19, 20:41, 42:50]
[2:9, 12:17, 28:33, 42:51]
[3:9, 12:17, 28:33, 42:51]
[3:8, 12:17, 28:33, 42:56]
[3:9, 12:17, 28:33, 42:51]
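What clrng does to the second line can be traced by hand with splitrange alone. The function is copied from the script so the snippet runs on its own. The split positions 12 and 20 are the leftmost starts of the second and third gap in the output above, which is how I read the loop.

```julia
# Copied from the script: split the n-th range of rng at position m.
function splitrange(rng, n, m)
    s = [first(rng[n]):m-1, m:last(rng[n])]
    n == 1 ? [s; rng[n+1:end]] : [rng[1:n-1]; s; rng[n+1:end]]
end

row2  = [1:33, 42:55]                 # second line before clrng
step1 = splitrange(row2, 1, 12)       # [1:11, 12:33, 42:55]
step2 = splitrange(step1, 2, 20)      # [1:11, 12:19, 20:33, 42:55]
```

After the two splits the line has four gaps, like the lines where no value is missing.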
julia> pts=[intersect(vr...) for vr in zip(spl...)]
4-element Vector{UnitRange{Int64}}:
7:8
15:17
28:31
46:49
julia> rr=Base.splat(:).(partition(sort([1;first.(pts); last.(pts); ml]),2))
5-element Vector{UnitRange{Int64}}:
1:7
8:15
17:28
31:46
49:55