Loading and reformatting a CSV file

rami · October 10, 2021, 5:24pm

Hello,

I am trying to load a CSV file from a GitHub repo, reformat it, and store it as a DataFrame object. Here is what I tried:

using CSV, HTTP, DataFrames
url = "https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv"
http_response = HTTP.get(url)
file = CSV.File(http_response.body)
df = DataFrame(file)

Here is the output I am getting:

julia> df
514572×7 DataFrame
│ Row    │ date       │ province │ country  │ lat        │ long      │ type      │ cases    │
│        │ Dates.Date │ String63 │ String63 │ String15   │ String15  │ String15  │ String15 │
├────────┼────────────┼──────────┼──────────┼────────────┼───────────┼───────────┼──────────┤
│ 1      │ 2020-01-22 │ Alberta  │ Canada   │ 53.9333    │ -116.5765 │ confirmed │ 0        │
│ 2      │ 2020-01-23 │ Alberta  │ Canada   │ 53.9333    │ -116.5765 │ confirmed │ 0        │
│ 3      │ 2020-01-24 │ Alberta  │ Canada   │ 53.9333    │ -116.5765 │ confirmed │ 0        │
│ 4      │ 2020-01-25 │ Alberta  │ Canada   │ 53.9333    │ -116.5765 │ confirmed │ 0        │
│ 5      │ 2020-01-26 │ Alberta  │ Canada   │ 53.9333    │ -116.5765 │ confirmed │ 0        │
│ 6      │ 2020-01-27 │ Alberta  │ Canada   │ 53.9333    │ -116.5765 │ confirmed │ 0        │
⋮
│ 514566 │ 2021-10-02 │ NA       │ Zimbabwe │ -19.015438 │ 29.154857 │ recovered │ NA       │
│ 514567 │ 2021-10-03 │ NA       │ Zimbabwe │ -19.015438 │ 29.154857 │ recovered │ NA       │
│ 514568 │ 2021-10-04 │ NA       │ Zimbabwe │ -19.015438 │ 29.154857 │ recovered │ NA       │
│ 514569 │ 2021-10-05 │ NA       │ Zimbabwe │ -19.015438 │ 29.154857 │ recovered │ NA       │
│ 514570 │ 2021-10-06 │ NA       │ Zimbabwe │ -19.015438 │ 29.154857 │ recovered │ NA       │
│ 514571 │ 2021-10-07 │ NA       │ Zimbabwe │ -19.015438 │ 29.154857 │ recovered │ NA       │
│ 514572 │ 2021-10-08 │ NA       │ Zimbabwe │ -19.015438 │ 29.154857 │ recovered │ NA       │

How can I convert the cases column from string to integer? It is parsed as a string because its missing values are coded as NA. I tried this:

file = CSV.File(http_response.body, null="NA")

but this argument is not available for the CSV.File function.

Any suggestions?

Also, is there a shorter way to load a CSV file from a URL?

Thanks!

rafael.guerra · October 10, 2021, 6:15pm

You can use download():

using CSV, DataFrames
url = "https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv"
file = CSV.File(download(url))
df = DataFrame(file)

pdeffebach · October 10, 2021, 6:17pm

It seems like you are using some pretty old versions of things. The keyword argument should be missingstring = "NA". But I would suggest updating to the latest versions of packages first.

danielw2904 · October 10, 2021, 7:50pm

You can add the types argument, which takes a Dict(colname => type). But I think once you use missingstring, it'll parse automatically.
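To make the two options concrete, here is a minimal sketch. It assumes a recent CSV.jl and uses a small inline sample (hypothetical data, not the real file) so it runs without a network connection. The names data, df_auto, and df_typed are made up for illustration.

```julia
using CSV, DataFrames

# Tiny inline sample standing in for the real file (hypothetical data)
data = """
date,country,cases
2020-01-22,Canada,0
2020-01-23,Canada,3
2020-01-24,Zimbabwe,NA
"""

# Option 1: only tell the parser that "NA" means missing;
# the column type is then inferred from the remaining values
df_auto = CSV.read(IOBuffer(data), DataFrame; missingstring="NA")

# Option 2: additionally request a type for a specific column via a Dict
df_typed = CSV.read(IOBuffer(data), DataFrame;
                    missingstring="NA",
                    types=Dict(:cases => Int))

# Inspect the resulting element types of the cases column
println(eltype(df_auto.cases))
println(eltype(df_typed.cases))
```

In both cases the NA entry should come back as missing, so the explicit types Dict is mostly useful when inference would otherwise pick a type you do not want.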

danielw2904 · October 10, 2021, 7:51pm

I think the CSV.jl version is new and the argument name is incorrect (since the output shows the new inline string types).

pdeffebach · October 10, 2021, 8:32pm

Ah, good catch. But the DataFrames.jl version is old.

rami · October 10, 2021, 11:10pm

Thanks all for the answers! I added the missingstring = "NA" argument and it works:

file = CSV.File(download(url), missingstring = "NA")
df = DataFrame(file)

df
515394×7 DataFrame
│ Row    │ date       │ province  │ country  │ lat      │ long     │ type      │ cases   │
│        │ Dates.Date │ String63? │ String63 │ Float64? │ Float64? │ String15  │ Int64?  │
├────────┼────────────┼───────────┼──────────┼──────────┼──────────┼───────────┼─────────┤
│ 1      │ 2020-01-22 │ Alberta   │ Canada   │ 53.9333  │ -116.576 │ confirmed │ 0       │
│ 2      │ 2020-01-23 │ Alberta   │ Canada   │ 53.9333  │ -116.576 │ confirmed │ 0       │
│ 3      │ 2020-01-24 │ Alberta   │ Canada   │ 53.9333  │ -116.576 │ confirmed │ 0       │
│ 4      │ 2020-01-25 │ Alberta   │ Canada   │ 53.9333  │ -116.576 │ confirmed │ 0       │
│ 5      │ 2020-01-26 │ Alberta   │ Canada   │ 53.9333  │ -116.576 │ confirmed │ 0       │
│ 6      │ 2020-01-27 │ Alberta   │ Canada   │ 53.9333  │ -116.576 │ confirmed │ 0       │
⋮
│ 515388 │ 2021-10-03 │ missing   │ Zimbabwe │ -19.0154 │ 29.1549  │ recovered │ missing │
│ 515389 │ 2021-10-04 │ missing   │ Zimbabwe │ -19.0154 │ 29.1549  │ recovered │ missing │
│ 515390 │ 2021-10-05 │ missing   │ Zimbabwe │ -19.0154 │ 29.1549  │ recovered │ missing │
│ 515391 │ 2021-10-06 │ missing   │ Zimbabwe │ -19.0154 │ 29.1549  │ recovered │ missing │
│ 515392 │ 2021-10-07 │ missing   │ Zimbabwe │ -19.0154 │ 29.1549  │ recovered │ missing │
│ 515393 │ 2021-10-08 │ missing   │ Zimbabwe │ -19.0154 │ 29.1549  │ recovered │ missing │
│ 515394 │ 2021-10-09 │ missing   │ Zimbabwe │ -19.0154 │ 29.1549  │ recovered │ missing │
