Reading in Excel Data from GitHub

I am preparing a Minimal Reproducible Example which involves reading in some data contained in an Excel file. I’m trying to figure out the best way to make the data easily accessible for anyone who wants to run my code. I’ve posted the data here: sample_data

In Python for example, I can simply do

import pandas as pd
df = pd.read_excel("https://github.com/MichaelBarmann/ParamEst_MRE/blob/main/sample_data.xlsx")

and the data is placed into the DataFrame df.

In Julia I’ve tried xl_file = XLSX.readxlsx("https://github.com/MichaelBarmann/ParamEst_MRE/blob/main/sample_data.xlsx"), but unfortunately this doesn’t work.

Is there an analogous way of doing this in Julia? (using XLSX, or something similar?)

If you are willing to abandon excel in favor of CSV, CSV.jl can do that

See here.

To read a csv file from a url, use the HTTP.jl package, where the HTTP.Response body can be passed like:

f = CSV.File(HTTP.get(url).body)

Maybe the same strategy works for excel actually?

1 Like

I don’t think this works with XSLX out of the box, as the read methods expect a filepath (string), and HTTP.get returns a vector of UInts. The easiest thing would probably to just download the file to a temp dir and then read it in from there, but if it’s just some standard tabular data I agree with Peter that it would be better to move to a saner format than xlsx like csv, or maybe arrow.

2 Likes

Thanks, this is helpful! I had 3 spreadsheets in my Excel file, so I had to put each into an individual CSV file before attempting to implement your suggestion. Unfortunately, I get a weird error:

using HTTP, CSV
data_url = "https://github.com/MichaelBarmann/ParamEst_MRE/blob/main/sample_data.csv"
my_file = CSV.File(HTTP.get(data_url).body; header = 1)
┌ Warning: thread = 1 warning: parsed expected 1 columns, but didn't reach end of line around data row: 53. Ignoring any extra columns on this row
└ @ CSV C:\Users\Michael\.julia\packages\CSV\CJfFO\src\file.jl:606
┌ Warning: thread = 1 warning: parsed expected 1 columns, but didn't reach end of line around data row: 57. Ignoring any extra columns on this row
└ @ CSV C:\Users\Michael\.julia\packages\CSV\CJfFO\src\file.jl:606
┌ Warning: thread = 1 warning: parsed expected 1 columns, but didn't reach end of line around data row: 149. Ignoring any extra columns on this row
└ @ CSV C:\Users\Michael\.julia\packages\CSV\CJfFO\src\file.jl:606
┌ Warning: thread = 1 warning: parsed expected 1 columns, but didn't reach end of line around data row: 163. Ignoring any extra columns on this row
└ @ CSV C:\Users\Michael\.julia\packages\CSV\CJfFO\src\file.jl:606
┌ Warning: thread = 1 warning: parsed expected 1 columns, but didn't reach end of line around data row: 178. Ignoring any extra columns on this row
└ @ CSV C:\Users\Michael\.julia\packages\CSV\CJfFO\src\file.jl:606
┌ Warning: thread = 1 warning: parsed expected 1 columns, but didn't reach end of line around data row: 179. Ignoring any extra columns on this row
└ @ CSV C:\Users\Michael\.julia\packages\CSV\CJfFO\src\file.jl:606
178-element CSV.File{false}:
 CSV.Row: (<!DOCTYPE html> = "<html lang=\"en\" >",)
 CSV.Row: (<!DOCTYPE html> = "  <head>",)
 CSV.Row: (<!DOCTYPE html> = "    <meta charset=\"utf-8\">",)
 CSV.Row: (<!DOCTYPE html> = "  <link rel=\"dns-prefetch\" href=\"https://github.githubassets.com\">",)
 CSV.Row: (<!DOCTYPE html> = "  <link rel=\"dns-prefetch\" href=\"https://avatars.githubusercontent.com\">",)
 CSV.Row: (<!DOCTYPE html> = "  <link rel=\"dns-prefetch\" href=\"https://github-cloud.s3.amazonaws.com\">",)
 CSV.Row: (<!DOCTYPE html> = "  <link rel=\"dns-prefetch\" href=\"https://user-images.githubusercontent.com/\">",)
 CSV.Row: (<!DOCTYPE html> = "  <link crossorigin=\"anonymous\" media=\"all\" integrity=\"sha512-+D6Fy+EweT791upBwWDwZQxdKdpY58/tEdY1xnOvq7urHPMZe2ZtMhRmzWUL6+Dknj2BwHdlVvxPHDB6ep7wQQ==\" rel=\"stylesheet\" href=\"https://github.githubassets.com/assets/frameworks-f83e85cbe130793efdd6ea41c160f065.css\" />",)
 CSV.Row: (<!DOCTYPE html> = "  <link crossorigin=\"anonymous\" media=\"all\" integrity=\"sha512-B+uagPXLI8XoStGPc9DrQ9xVWfBS0VvAKvYTQopaizdxo8T46TCtAVE9sY15FQ0kseA1LB3V5GKklDGuwjqCHg==\" rel=\"stylesheet\" href=\"https://github.githubassets.com/assets/site-07eb9a80f5cb23c5e84ad18f73d0eb43.css\" />",)
 CSV.Row: (<!DOCTYPE html> = "    <link crossorigin=\"anonymous\" media=\"all\" integrity=\"sha512-BgutnhwyiKQLFDdsLnN+C/pdoX66e9VdVPnjGR1KkI6lMDTbqn8SWQLjp3dqzfGblsErstNfjUq94M/kZ8/1qA==\" rel=\"stylesheet\" href=\"https://github.githubassets.com/assets/behaviors-060bad9e1c3288a40b14376c2e737e0b.css\" />",)
 CSV.Row: (<!DOCTYPE html> = "    ",)
 CSV.Row: (<!DOCTYPE html> = "    ",)
 CSV.Row: (<!DOCTYPE html> = "    ",)
 ⋮
 CSV.Row: (<!DOCTYPE html> = "</div>",)
 CSV.Row: (<!DOCTYPE html> = "  </div>",)
 CSV.Row: (<!DOCTYPE html> = "</div>",)
 CSV.Row: (<!DOCTYPE html> = "    </main>",)
 CSV.Row: (<!DOCTYPE html> = "  </div>",)
 CSV.Row: (<!DOCTYPE html> = "  </div>",)
 CSV.Row: (<!DOCTYPE html> = "          ",)
 CSV.Row: (<!DOCTYPE html> = "<div class=\"footer container-xl width-full p-responsive\" role=\"contentinfo\">",)
 CSV.Row: (<!DOCTYPE html> = "  <div class=\"position-relative d-flex flex-row-reverse flex-lg-row flex-wrap flex-lg-nowrap flex-justify-center flex-lg-justify-between pt-6 pb-2 mt-6 f6 color-text-secondary border-top color-border-secondary \">",)
 CSV.Row: (<!DOCTYPE html> = "    <ul class=\"list-style-none d-flex flex-wrap col-12 col-lg-5 flex-justify-center flex-lg-justify-between mb-2 mb-lg-0\">",)
 CSV.Row: (<!DOCTYPE html> = "      <li class=\"mr-3 mr-lg-0\">&copy; 2021 GitHub",)
 CSV.Row: (<!DOCTYPE html> = "        <li class=\"mr-3 mr-lg-0\"><a href=\"https://docs.github.com/en/github/site-policy/github-terms-of-service\" data-ga-click=\"Footer",)

Any idea what’s going wrong? Do I need to specify some additional keywords for it to property parse the data? (I tried including header = 1, and specifying the column types, but it didn’t change anything…)

@Michael_Barmann, using the github’s link to the raw version of your csv file, it works:

using HTTP, CSV
data_url = "https://raw.githubusercontent.com/MichaelBarmann/ParamEst_MRE/main/sample_data.csv"
my_file = CSV.File(HTTP.get(data_url).body)
df = DataFrame(my_file)

103×3 DataFrame
 Row │ Date        t      Ic    
     │ String      Int64  Int64 
─────┼──────────────────────────
   1 │ 8/24/2020       0     79
   2 │ 8/25/2020       1    168
   3 │ 8/26/2020       2    222
   4 │ 8/27/2020       3    282
   5 │ 8/28/2020       4    347
   6 │ 8/29/2020       5    397
   7 │ 8/30/2020       6    501
   8 │ 8/31/2020       7    731
   9 │ 9/1/2020        8    930
  10 │ 9/2/2020        9   1050
  11 │ 9/3/2020       10   1138
  12 │ 9/4/2020       11   1242
  13 │ 9/5/2020       12   1279
  14 │ 9/6/2020       13   1316
  ⋮  │     ⋮         ⋮      ⋮
  90 │ 11/21/2020     89   3998
  91 │ 11/22/2020     90   4017
  92 │ 11/23/2020     91   4043
  93 │ 11/24/2020     92   4056
  94 │ 11/25/2020     93   4070
  95 │ 11/26/2020     94   4103
  96 │ 11/27/2020     95   4124
  97 │ 11/28/2020     96   4195
  98 │ 11/29/2020     97   4221
  99 │ 11/30/2020     98   4253
 100 │ 12/1/2020      99   4276
 101 │ 12/2/2020     100   4301
 102 │ 12/3/2020     101   4321
 103 │ 12/4/2020     102   4346
                 75 rows omitted
2 Likes

Awesome, thank you! Where can I find the url of the “raw version”? Or is the raw version url always just obtained by replacing '“github” with “raw.githubusercontent”?

Look here:
raw_github

2 Likes

Ah, thanks!!

You can also use Artifacts to manage downloading a GitHub repository at a specific version for you. That way, you can avoid downloading the file every time that you or an user restarts Julia

Docs: Artifacts · The Julia Language