How to read a table from url

empet · November 29, 2022, 10:47am

With pandas I can read a table into a dataframe, as follows:

import pandas as pd
dfs = pd.read_html('http://stats.ioinformatics.org/countries/')  # returns a list of tables
df = dfs[0]
df.head()  # inspect the first rows

I searched for a similar method in DataFrames.jl (more precisely, in the forthcoming book Julia for Data Analysis, v_10, which I have pre-ordered), as well as in Julia for Data Science, but found no approach or example for such a task. Following a suggestion given in answer to the same question, posted here two years ago (Any equivalent to Pandas read_html() in DataFrames.jl?), I tried something like this:

using DataFrames, CSV, HTTP
read_remote_csv(url) = DataFrame(CSV.File(HTTP.get(url).body))
df = read_remote_csv("http://stats.ioinformatics.org/countries/")

but it displays the HTML contents, not a dataframe whose columns are the table's columns.

nilshg · November 29, 2022, 12:24pm

My answer was for situations where the remote URL returns a delimited file, which is what CSV.File parses. You are looking to extract a table from HTML, so you need a library that parses HTML rather than delimited files; see the thread on scraping an HTML table from a website.
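
To illustrate the difference, here is a minimal sketch in Python (the language of the pandas snippet at the top of the thread), using only the standard library. The HTML fragment and the `TableParser` class are made up for illustration; they are not the real page and not the API of pandas or of any Julia package. A delimited-file reader simply returns the markup line by line, while an HTML parser walks the tags and collects the cell text:

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical miniature page with a single table (not the real site)
HTML = """<html><body><table>
<tr><th>Country</th><th>G</th><th>S</th></tr>
<tr><td>Albania</td><td>0</td><td>0</td></tr>
<tr><td>Argentina</td><td>3</td><td>9</td></tr>
</table></body></html>"""

# A CSV reader treats the markup as plain text: every line becomes one "row"
csv_rows = list(csv.reader(io.StringIO(HTML)))


class TableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being read, if any
        self._cell = None   # text pieces of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = TableParser()
parser.feed(HTML)
header, *records = parser.rows  # first row holds the column names
```

Here `csv_rows` still contains raw tags, whereas `header` and `records` hold the column names and the cell values as strings. This only handles one simple table; a real page needs a proper HTML library.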

empet · November 29, 2022, 4:02pm

Thank you for the link to Scraping a html table from a url. Unfortunately, I have little knowledge of HTML/CSS. For the moment I will read the tables with pd.read_html(url), save the corresponding dataframe as CSV, and re-read it in Julia.

rafael.guerra · November 29, 2022, 6:57pm

Check also this post on TableScraper.jl

cormullion · November 29, 2022, 7:11pm

Beat me to it…

using TableScraper
using DataFrames

url = "https://stats.ioinformatics.org/countries/"

st = scrape_tables(url)

df = DataFrame(q=[], 
    Country=String[], 
    Host=String[],
    G=[],
    S=[],
    B=[],
    Total=[])

for row in first(st).rows
    push!(df, row)
end

df.G = parse.(Int, df.G)
df.S = parse.(Int, df.S)
df.B = parse.(Int, df.B)
df.Total = parse.(Int, df.Total)

109×7 DataFrame
 Row │ q    Country     Host    G      S      B      Total 
     │ Any  String      String  Int64  Int64  Int64  Int64 
─────┼─────────────────────────────────────────────────────
   1 │      Albania                 0      0      0      0
   2 │ ?    Argentina   1993        3      9     23     35
  ⋮  │  ⋮       ⋮         ⋮       ⋮      ⋮      ⋮      ⋮
 108 │ ?    Yugoslavia              1      3      1      5
 109 │      Zimbabwe                0      0      0      0
                                           105 rows omitted

empet · November 30, 2022, 11:03am

@cormullion
Thanks for your nice solution. I adopted it, but replaced the last four lines of code with:

for name in names(df)[end-3:end]
    df[!, name] = parse.(Int, df[!, name])
end

rafael.guerra · November 30, 2022, 4:41pm

Or without for loops:

using TableScraper, DataFrames
url = "https://stats.ioinformatics.org/countries/"
st = scrape_tables(url)
df = DataFrame(permutedims(reduce(hcat, first(st).rows)), [:q,:Country,:Host,:G,:S,:B,:Total])
df[!,[:G,:S,:B,:Total]] .= parse.(Int, df[!,[:G,:S,:B,:Total]])

empet · November 30, 2022, 5:31pm

Yes, but the symbols :G, :S, :B, :Total, are repeated three times.

rafael.guerra · November 30, 2022, 5:44pm

Could be replaced by 4:7, or assigned to a single variable, for instance.
