Reading in Excel Data from GitHub

Michael_Barmann · March 26, 2021, 10:59pm

I am preparing a Minimal Reproducible Example which involves reading in some data contained in an Excel file. I’m trying to figure out the best way to make the data easily accessible for anyone who wants to run my code. I’ve posted the data here: sample_data

In Python for example, I can simply do

import pandas as pd
df = pd.read_excel("https://github.com/MichaelBarmann/ParamEst_MRE/blob/main/sample_data.xlsx")

and the data is placed into the DataFrame df.

In Julia I’ve tried xl_file = XLSX.readxlsx("https://github.com/MichaelBarmann/ParamEst_MRE/blob/main/sample_data.xlsx"), but unfortunately this doesn’t work.

Is there an analogous way of doing this in Julia? (using XLSX, or something similar?)

pdeffebach · March 27, 2021, 12:33am

If you are willing to abandon excel in favor of CSV, CSV.jl can do that

See here.

To read a csv file from a url, use the HTTP.jl package, where the HTTP.Response body can be passed like:
f = CSV.File(HTTP.get(url).body)

Maybe the same strategy works for excel actually?

nilshg · March 27, 2021, 6:59am

I don’t think this works with XSLX out of the box, as the read methods expect a filepath (string), and HTTP.get returns a vector of UInts. The easiest thing would probably to just download the file to a temp dir and then read it in from there, but if it’s just some standard tabular data I agree with Peter that it would be better to move to a saner format than xlsx like csv, or maybe arrow.

Michael_Barmann · March 27, 2021, 5:02pm

Thanks, this is helpful! I had 3 spreadsheets in my Excel file, so I had to put each into an individual CSV file before attempting to implement your suggestion. Unfortunately, I get a weird error:

using HTTP, CSV
data_url = "https://github.com/MichaelBarmann/ParamEst_MRE/blob/main/sample_data.csv"
my_file = CSV.File(HTTP.get(data_url).body; header = 1)

┌ Warning: thread = 1 warning: parsed expected 1 columns, but didn't reach end of line around data row: 53. Ignoring any extra columns on this row
└ @ CSV C:\Users\Michael\.julia\packages\CSV\CJfFO\src\file.jl:606
┌ Warning: thread = 1 warning: parsed expected 1 columns, but didn't reach end of line around data row: 57. Ignoring any extra columns on this row
└ @ CSV C:\Users\Michael\.julia\packages\CSV\CJfFO\src\file.jl:606
┌ Warning: thread = 1 warning: parsed expected 1 columns, but didn't reach end of line around data row: 149. Ignoring any extra columns on this row
└ @ CSV C:\Users\Michael\.julia\packages\CSV\CJfFO\src\file.jl:606
┌ Warning: thread = 1 warning: parsed expected 1 columns, but didn't reach end of line around data row: 163. Ignoring any extra columns on this row
└ @ CSV C:\Users\Michael\.julia\packages\CSV\CJfFO\src\file.jl:606
┌ Warning: thread = 1 warning: parsed expected 1 columns, but didn't reach end of line around data row: 178. Ignoring any extra columns on this row
└ @ CSV C:\Users\Michael\.julia\packages\CSV\CJfFO\src\file.jl:606
┌ Warning: thread = 1 warning: parsed expected 1 columns, but didn't reach end of line around data row: 179. Ignoring any extra columns on this row
└ @ CSV C:\Users\Michael\.julia\packages\CSV\CJfFO\src\file.jl:606
178-element CSV.File{false}:
 CSV.Row: (<!DOCTYPE html> = "<html lang=\"en\" >",)
 CSV.Row: (<!DOCTYPE html> = "  <head>",)
 CSV.Row: (<!DOCTYPE html> = "    <meta charset=\"utf-8\">",)
 CSV.Row: (<!DOCTYPE html> = "  <link rel=\"dns-prefetch\" href=\"https://github.githubassets.com\">",)
 CSV.Row: (<!DOCTYPE html> = "  <link rel=\"dns-prefetch\" href=\"https://avatars.githubusercontent.com\">",)
 CSV.Row: (<!DOCTYPE html> = "  <link rel=\"dns-prefetch\" href=\"https://github-cloud.s3.amazonaws.com\">",)
 CSV.Row: (<!DOCTYPE html> = "  <link rel=\"dns-prefetch\" href=\"https://user-images.githubusercontent.com/\">",)
 CSV.Row: (<!DOCTYPE html> = "  <link crossorigin=\"anonymous\" media=\"all\" integrity=\"sha512-+D6Fy+EweT791upBwWDwZQxdKdpY58/tEdY1xnOvq7urHPMZe2ZtMhRmzWUL6+Dknj2BwHdlVvxPHDB6ep7wQQ==\" rel=\"stylesheet\" href=\"https://github.githubassets.com/assets/frameworks-f83e85cbe130793efdd6ea41c160f065.css\" />",)
 CSV.Row: (<!DOCTYPE html> = "  <link crossorigin=\"anonymous\" media=\"all\" integrity=\"sha512-B+uagPXLI8XoStGPc9DrQ9xVWfBS0VvAKvYTQopaizdxo8T46TCtAVE9sY15FQ0kseA1LB3V5GKklDGuwjqCHg==\" rel=\"stylesheet\" href=\"https://github.githubassets.com/assets/site-07eb9a80f5cb23c5e84ad18f73d0eb43.css\" />",)
 CSV.Row: (<!DOCTYPE html> = "    <link crossorigin=\"anonymous\" media=\"all\" integrity=\"sha512-BgutnhwyiKQLFDdsLnN+C/pdoX66e9VdVPnjGR1KkI6lMDTbqn8SWQLjp3dqzfGblsErstNfjUq94M/kZ8/1qA==\" rel=\"stylesheet\" href=\"https://github.githubassets.com/assets/behaviors-060bad9e1c3288a40b14376c2e737e0b.css\" />",)
 CSV.Row: (<!DOCTYPE html> = "    ",)
 CSV.Row: (<!DOCTYPE html> = "    ",)
 CSV.Row: (<!DOCTYPE html> = "    ",)
 ⋮
 CSV.Row: (<!DOCTYPE html> = "</div>",)
 CSV.Row: (<!DOCTYPE html> = "  </div>",)
 CSV.Row: (<!DOCTYPE html> = "</div>",)
 CSV.Row: (<!DOCTYPE html> = "    </main>",)
 CSV.Row: (<!DOCTYPE html> = "  </div>",)
 CSV.Row: (<!DOCTYPE html> = "  </div>",)
 CSV.Row: (<!DOCTYPE html> = "          ",)
 CSV.Row: (<!DOCTYPE html> = "<div class=\"footer container-xl width-full p-responsive\" role=\"contentinfo\">",)
 CSV.Row: (<!DOCTYPE html> = "  <div class=\"position-relative d-flex flex-row-reverse flex-lg-row flex-wrap flex-lg-nowrap flex-justify-center flex-lg-justify-between pt-6 pb-2 mt-6 f6 color-text-secondary border-top color-border-secondary \">",)
 CSV.Row: (<!DOCTYPE html> = "    <ul class=\"list-style-none d-flex flex-wrap col-12 col-lg-5 flex-justify-center flex-lg-justify-between mb-2 mb-lg-0\">",)
 CSV.Row: (<!DOCTYPE html> = "      <li class=\"mr-3 mr-lg-0\">&copy; 2021 GitHub",)
 CSV.Row: (<!DOCTYPE html> = "        <li class=\"mr-3 mr-lg-0\"><a href=\"https://docs.github.com/en/github/site-policy/github-terms-of-service\" data-ga-click=\"Footer",)

Any idea what’s going wrong? Do I need to specify some additional keywords for it to property parse the data? (I tried including header = 1, and specifying the column types, but it didn’t change anything…)

rafael.guerra · March 27, 2021, 5:25pm

@Michael_Barmann, using the github’s link to the raw version of your csv file, it works:

using HTTP, CSV
data_url = "https://raw.githubusercontent.com/MichaelBarmann/ParamEst_MRE/main/sample_data.csv"
my_file = CSV.File(HTTP.get(data_url).body)
df = DataFrame(my_file)

103×3 DataFrame
 Row │ Date        t      Ic    
     │ String      Int64  Int64 
─────┼──────────────────────────
   1 │ 8/24/2020       0     79
   2 │ 8/25/2020       1    168
   3 │ 8/26/2020       2    222
   4 │ 8/27/2020       3    282
   5 │ 8/28/2020       4    347
   6 │ 8/29/2020       5    397
   7 │ 8/30/2020       6    501
   8 │ 8/31/2020       7    731
   9 │ 9/1/2020        8    930
  10 │ 9/2/2020        9   1050
  11 │ 9/3/2020       10   1138
  12 │ 9/4/2020       11   1242
  13 │ 9/5/2020       12   1279
  14 │ 9/6/2020       13   1316
  ⋮  │     ⋮         ⋮      ⋮
  90 │ 11/21/2020     89   3998
  91 │ 11/22/2020     90   4017
  92 │ 11/23/2020     91   4043
  93 │ 11/24/2020     92   4056
  94 │ 11/25/2020     93   4070
  95 │ 11/26/2020     94   4103
  96 │ 11/27/2020     95   4124
  97 │ 11/28/2020     96   4195
  98 │ 11/29/2020     97   4221
  99 │ 11/30/2020     98   4253
 100 │ 12/1/2020      99   4276
 101 │ 12/2/2020     100   4301
 102 │ 12/3/2020     101   4321
 103 │ 12/4/2020     102   4346
                 75 rows omitted

Michael_Barmann · March 27, 2021, 5:43pm

Awesome, thank you! Where can I find the url of the “raw version”? Or is the raw version url always just obtained by replacing '“github” with “raw.githubusercontent”?

rafael.guerra · March 27, 2021, 5:46pm

Look here:
raw_github

Michael_Barmann · March 27, 2021, 5:47pm

Ah, thanks!!

rikh · September 17, 2021, 3:15pm

You can also use Artifacts to manage downloading a GitHub repository at a specific version for you. That way, you can avoid downloading the file every time that you or an user restarts Julia

Docs: Artifacts · The Julia Language

Topic		Replies	Views
[ANN] XLSX.jl reaches v0.1.0 Community package , announcement	3	1075	May 21, 2018
Read a .xlsx file General Usage question , xlsx	15	626	November 19, 2024
Excel files & Julia? General Usage excel	4	778	June 26, 2023
Not sure how to use XLSX General Usage	3	315	March 24, 2020
Converting XLSX to DataFrame General Usage dataframes , xlsx	2	1119	November 8, 2021

Reading in Excel Data from GitHub

Related topics