How to read a table from url

With pandas I can read a table into a dataframe, as follows:

import pandas as pd
dfs = pd.read_html('http://stats.ioinformatics.org/countries/')#returns a list of tables
df=dfs[0]
df.head()#inspect header

I searched for a similar method in DataFrames.jl (more precisely in the forthcomming book, Julia for Data Analysis, v_10, which I have pre-ordered), as well in the Julia for Data Science, but there is no approach or example of such a task. Following a suggestion given as answer to the same question, posted here, two years ago: [Any equivalent to Pandas read_html() in DataFrames.jl?)(Any equivalent to Pandas read_html() in DataFrames.jl?), I tried something like this:

using DataFrames, CSV, HTTP
read_remote_csv(url) = DataFrame(CSV.File(HTTP.get(url).body))
df = read_remote_csv("http://stats.ioinformatics.org/countries/")

but it displays the html contents, not a dataframe having as columns the table columns.

My answer was for situations where the remote url returns a delimited file, which is what CSV.File parses. You are looking to extract a table from html, so need a library that parses html rather than delimited files, see:

Thank you for the link to Scraping a html table from a url. Unfortunately I have a low level of knowledge and skills in html/CSS. For the moment I will read the tables with pd.read_html(url), save the corresponding dataframe as csv and re-read it in Julia.

Check also this post on TableScraper.jl

3 Likes

Beat me to itโ€ฆ

using TableScraper
using DataFrames

url = "https://stats.ioinformatics.org/countries/"

st = scrape_tables(url)

df = DataFrame(q=[], 
    Country=String[], 
    Host=String[],
    G=[],
    S=[],
    B=[],
    Total=[])

for row in first(st).rows
    push!(df, row)
end

df.G = parse.(Int, df.G)
df.S = parse.(Int, df.S)
df.B = parse.(Int, df.B)
df.Total = parse.(Int, df.Total)
109ร—7 DataFrame
 Row โ”‚ q    Country     Host    G      S      B      Total 
     โ”‚ Any  String      String  Int64  Int64  Int64  Int64 
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚      Albania                 0      0      0      0
   2 โ”‚ ?    Argentina   1993        3      9     23     35
  โ‹ฎ  โ”‚  โ‹ฎ       โ‹ฎ         โ‹ฎ       โ‹ฎ      โ‹ฎ      โ‹ฎ      โ‹ฎ
 108 โ”‚ ?    Yugoslavia              1      3      1      5
 109 โ”‚      Zimbabwe                0      0      0      0
                                           105 rows omitted
4 Likes

@cormullion
Thanks for your nice solution. I adopted it but replaced the last four lines of code by:

for name in  names(df)[end-3:end]
    df[!, name]= parse.(Int, df[!, name])
end

Or without for loops:

using TableScraper, DataFrames
url = "https://stats.ioinformatics.org/countries/"
st = scrape_tables(url)
df = DataFrame(permutedims(reduce(hcat, first(st).rows)), [:q,:Country,:Host,:G,:S,:B,:Total])
df[!,[:G,:S,:B,:Total]] .= parse.(Int, df[!,[:G,:S,:B,:Total]])
1 Like

Yes, but the symbols :G, :S, :B, :Total, are repeated three times.

Could be replaced by 4:7, or assigned to a single variable, for instance.