Cumulative death by day by country

Chen · June 25, 2020, 11:47pm

In R, one would use tidyverse to, for example, aggregate daily deaths of COVID-19 by day and by country [0].

How would one do it in Julia? Is there a package that is similar to R’s tidyverse? Or does one write a couple of loops oneself?

Thanks!

[0] https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide-2020-06-24.xlsx

xiaodai · June 26, 2020, 12:32am

DataFrames.jl

dlakelan · June 26, 2020, 1:54am

The Queryverse package supports piping data around through transformations, grouping, etc.

Though for this kind of thing I often like to push the data to SQLite and run a SQL query. I find SQL actually quite expressive.
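A rough sketch of that SQLite route, assuming SQLite.jl is installed and that the SQLite library it ships supports window functions (SQLite 3.25 or newer). The table name `covid`, the output column `cum_deaths`, and the toy values are made up for illustration.

```julia
using SQLite, DataFrames

# Toy data (not real figures); ISO date strings sort correctly as text.
data = DataFrame(day = ["2020-04-01", "2020-04-02", "2020-04-01", "2020-04-02"],
                 country = ["a", "a", "b", "b"],
                 deaths = [10, 20, 30, 40])

db = SQLite.DB()                   # in-memory database
SQLite.load!(data, db, "covid")    # copy the DataFrame into a table

# Running total per country, ordered by day, via a window function.
sql = """
SELECT country, day,
       SUM(deaths) OVER (PARTITION BY country ORDER BY day) AS cum_deaths
FROM covid
ORDER BY country, day
"""
result = DataFrame(SQLite.DBInterface.execute(db, sql))
```

The grouping and ordering live in the `OVER (...)` clause, which is what makes SQL fairly compact for this kind of running total.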

nilshg · June 26, 2020, 9:20am

julia> using Dates, DataFrames

julia> data = DataFrame(day = [Date(2020,4,1), Date(2020,4,2), Date(2020,4,1), Date(2020,4,2)], country = [:a, :a, :b, :b], deaths = [10, 20, 30, 40])
4×3 DataFrame
│ Row │ day        │ country │ deaths │
│     │ Date       │ Symbol  │ Int64  │
├─────┼────────────┼─────────┼────────┤
│ 1   │ 2020-04-01 │ a       │ 10     │
│ 2   │ 2020-04-02 │ a       │ 20     │
│ 3   │ 2020-04-01 │ b       │ 30     │
│ 4   │ 2020-04-02 │ b       │ 40     │

julia> transform(groupby(data, :country), :deaths => cumsum => :deaths)
4×3 DataFrame
│ Row │ country │ day        │ deaths │
│     │ Symbol  │ Date       │ Int64  │
├─────┼─────────┼────────────┼────────┤
│ 1   │ a       │ 2020-04-01 │ 10     │
│ 2   │ a       │ 2020-04-02 │ 30     │
│ 3   │ b       │ 2020-04-01 │ 30     │
│ 4   │ b       │ 2020-04-02 │ 70     │
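One caveat for the real file: `cumsum` accumulates in row order, and the ECDC table shown further down this thread lists the newest date first, with `dateRep` stored as a `String`. A minimal sketch of handling that, assuming DataFrames 0.21 or newer; the column names `dateRep`, `countriesAndTerritories` and `deaths` come from the file, while `date`, `cumulative_deaths` and the values are made up for illustration.

```julia
using DataFrames, Dates

# Tiny stand-in for the ECDC table, newest date first as in the real file.
df = DataFrame(
    dateRep = ["4/2/2020", "4/1/2020", "4/2/2020", "4/1/2020"],
    countriesAndTerritories = ["A", "A", "B", "B"],
    deaths = [20, 10, 40, 30],
)

# Parse the month/day/year strings into Date values.
df.date = Date.(df.dateRep, dateformat"m/d/y")

# Sort oldest-first within each country so the running sum is chronological.
sort!(df, [:countriesAndTerritories, :date])

cumulative = transform(groupby(df, :countriesAndTerritories),
                       :deaths => cumsum => :cumulative_deaths)
```

Writing to a new column instead of overwriting `:deaths` keeps the daily counts around for checking.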

Rudi79 · June 26, 2020, 11:37am

The analogue of tidyverse would be queryverse. See e.g.,

pdeffebach · June 26, 2020, 2:36pm

DataFrames + a few other helper packages will have all you need for this.

The anonymous function syntax is probably the hardest part to understand coming from R. But hopefully we can make easier syntax soon.

I downloaded the Excel file you posted and turned it into a CSV just by opening it in Excel and saving it as a .csv file.

using CSV, DataFrames, Pipe, Statistics

df = CSV.File("COVID-19-geographic-disbtribution-worldwide-2020-06-24.csv") |> DataFrame

df_country_month = @pipe df |>
    groupby(_, ["countriesAndTerritories", "month", "year"]) |>
    combine(_,
        "cases" => (t -> mean(skipmissing(t))) => "cases",
        "deaths" => (t -> mean(skipmissing(t))) => "deaths")

nilshg · June 26, 2020, 3:12pm

Arguably, the nicer syntax already exists through function composition:

:cases => mean ∘ skipmissing => :cases
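To make the equivalence concrete, here is a small sketch comparing the two forms on a toy vector (the names `f_anon` and `f_comp` are just for illustration):

```julia
using Statistics

x = [1, missing, 3]

f_anon = t -> mean(skipmissing(t))   # anonymous-function form
f_comp = mean ∘ skipmissing          # composed form, same behaviour

same = f_anon(x) == f_comp(x)        # both skip the missing value, then average
```

Because `∘` binds more tightly than `=>`, no extra parentheses are needed inside the `source => function => target` pair.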

Chen · June 26, 2020, 11:00pm

Thank you!

xiaodai · June 26, 2020, 11:16pm

How do you type that symbol between mean and skipmissing?

jack_rabbit · June 27, 2020, 12:35am

\circ

dlakelan · June 27, 2020, 12:36am

How do you type the \circ with reasonable ease if you’re writing code in emacs?

magister-ludi · June 27, 2020, 4:36am

You just need to type \circ[TAB] (assuming emacs is in julia-mode). The first time you do that in a session, it may take time to display the character, but it does work.

Chen · June 27, 2020, 7:14am

Thank you.
Do you then store data in an SQL database?

Chen · June 27, 2020, 8:46am

Thank you! Which package contains the transform() function?

I got the following error:

julia> @time transform(groupby(data, :country), :deaths => cumsum => :deaths)
ERROR: UndefVarError: transform not defined
Stacktrace:
 [1] top-level scope at ./util.jl:175

nilshg · June 27, 2020, 8:58am

Ah sorry, I took that as a given since my post was meant to amend Peter’s. It’s in DataFrames, but only as of version 0.21, so make sure you’ve got the latest!
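A quick way to check what is installed, as a sketch (the variable name `has_transform` is just for illustration):

```julia
using DataFrames, Pkg

Pkg.status("DataFrames")   # prints the installed DataFrames version

# true when the loaded DataFrames version defines `transform`
has_transform = isdefined(DataFrames, :transform)
```

If `has_transform` is `false`, the installed version predates 0.21 and needs updating.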

kevbonham · June 27, 2020, 4:49pm

You could, but CSV loading and writing is pretty fast. If your data fits in memory, and you’re using DataFrames to manipulate it, I’d just stick with CSV
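For reference, a minimal sketch of the CSV round trip with CSV.jl; the file name `deaths.csv` and the values are made up for illustration.

```julia
using CSV, DataFrames

df = DataFrame(country = ["a", "b"], deaths = [30, 70])

path = joinpath(mktempdir(), "deaths.csv")   # hypothetical file in a temp dir
CSV.write(path, df)                          # write the table to disk
df2 = CSV.File(path) |> DataFrame            # read it back, as done earlier in the thread
```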

Chen · June 28, 2020, 12:18pm

Thank you, @Rudi79.

My Gentoo Linux amd64 cannot install ParquetFiles.jl, which is required by Queryverse. I opened an issue on GitHub: https://github.com/queryverse/ParquetFiles.jl/issues/33

Otherwise, yeah, Queryverse looks great!

pdeffebach · June 28, 2020, 3:16pm

DataFrames has the transform function.

Please see my code example above for a MWE with your data.

Chen · June 28, 2020, 9:26pm

Thank you!

I think I need to update my DataFrames.jl. However, I do not know what is preventing it from being updated to the latest version. My hunch is that a package that depends on an older version of DataFrames.jl prevents DataFrames.jl from being updated.

julia> @time using CSV, DataFrames, Pipe, Statistics
  0.001125 seconds (1.48 k allocations: 78.219 KiB)

julia> @time df = CSV.File("/home/c/Downloads/COVID-19-geographic-disbtribution-worldwide-2020-06-24.csv") |> DataFrame
  0.145597 seconds (27.87 k allocations: 6.840 MiB, 56.48% gc time)
25517×11 DataFrame. Omitted printing of 5 columns
│ Row   │ dateRep   │ day   │ month │ year  │ cases │ deaths │
│       │ String    │ Int64 │ Int64 │ Int64 │ Int64 │ Int64  │
├───────┼───────────┼───────┼───────┼───────┼───────┼────────┤
│ 1     │ 6/24/2020 │ 24    │ 6     │ 2020  │ 338   │ 20     │
│ 2     │ 6/23/2020 │ 23    │ 6     │ 2020  │ 310   │ 17     │
│ 3     │ 6/22/2020 │ 22    │ 6     │ 2020  │ 409   │ 12     │
│ 4     │ 6/21/2020 │ 21    │ 6     │ 2020  │ 546   │ 21     │
│ 5     │ 6/20/2020 │ 20    │ 6     │ 2020  │ 346   │ 2      │
│ 6     │ 6/19/2020 │ 19    │ 6     │ 2020  │ 658   │ 42     │
│ 7     │ 6/18/2020 │ 18    │ 6     │ 2020  │ 564   │ 13     │
│ 8     │ 6/17/2020 │ 17    │ 6     │ 2020  │ 783   │ 13     │
│ 9     │ 6/16/2020 │ 16    │ 6     │ 2020  │ 761   │ 7      │
│ 10    │ 6/15/2020 │ 15    │ 6     │ 2020  │ 664   │ 20     │
│ 11    │ 6/14/2020 │ 14    │ 6     │ 2020  │ 556   │ 5      │
│ 12    │ 6/13/2020 │ 13    │ 6     │ 2020  │ 656   │ 20     │
│ 13    │ 6/12/2020 │ 12    │ 6     │ 2020  │ 747   │ 21     │
│ 14    │ 6/11/2020 │ 11    │ 6     │ 2020  │ 684   │ 21     │
│ 15    │ 6/10/2020 │ 10    │ 6     │ 2020  │ 542   │ 15     │
│ 16    │ 6/9/2020  │ 9     │ 6     │ 2020  │ 575   │ 12     │
│ 17    │ 6/8/2020  │ 8     │ 6     │ 2020  │ 791   │ 30     │
⋮
│ 25500 │ 4/7/2020  │ 7     │ 4     │ 2020  │ 0     │ 0      │
│ 25501 │ 4/6/2020  │ 6     │ 4     │ 2020  │ 0     │ 0      │
│ 25502 │ 4/5/2020  │ 5     │ 4     │ 2020  │ 0     │ 0      │
│ 25503 │ 4/4/2020  │ 4     │ 4     │ 2020  │ 1     │ 0      │
│ 25504 │ 4/3/2020  │ 3     │ 4     │ 2020  │ 0     │ 0      │
│ 25505 │ 4/2/2020  │ 2     │ 4     │ 2020  │ 0     │ 0      │
│ 25506 │ 4/1/2020  │ 1     │ 4     │ 2020  │ 1     │ 0      │
│ 25507 │ 3/31/2020 │ 31    │ 3     │ 2020  │ 0     │ 0      │
│ 25508 │ 3/30/2020 │ 30    │ 3     │ 2020  │ 0     │ 0      │
│ 25509 │ 3/29/2020 │ 29    │ 3     │ 2020  │ 2     │ 0      │
│ 25510 │ 3/28/2020 │ 28    │ 3     │ 2020  │ 2     │ 0      │
│ 25511 │ 3/27/2020 │ 27    │ 3     │ 2020  │ 0     │ 0      │
│ 25512 │ 3/26/2020 │ 26    │ 3     │ 2020  │ 1     │ 0      │
│ 25513 │ 3/25/2020 │ 25    │ 3     │ 2020  │ 0     │ 0      │
│ 25514 │ 3/24/2020 │ 24    │ 3     │ 2020  │ 0     │ 1      │
│ 25515 │ 3/23/2020 │ 23    │ 3     │ 2020  │ 0     │ 0      │
│ 25516 │ 3/22/2020 │ 22    │ 3     │ 2020  │ 1     │ 0      │
│ 25517 │ 3/21/2020 │ 21    │ 3     │ 2020  │ 1     │ 0      │

julia> @time df_country_month = @pipe df |>
           groupby(_, ["countriesAndTerritories", "month", "year"]) |>
           combine(_, 
            "cases" => (t -> mean(skipmissing(t))) => "cases",
            "deaths" => (t -> mean(skipmissing(t))) => "deaths")
ERROR: ArgumentError: idxs[1] has type String; Only Integer or Symbol values allowed when indexing by vector
Stacktrace:
 [1] getindex at /home/c/.julia/packages/DataFrames/S3ZFo/src/other/index.jl:222 [inlined]
 [2] groupby(::DataFrame, ::Array{String,1}; sort::Bool, skipmissing::Bool) at /home/c/.julia/packages/DataFrames/S3ZFo/src/groupeddataframe/grouping.jl:168
 [3] groupby(::DataFrame, ::Array{String,1}) at /home/c/.julia/packages/DataFrames/S3ZFo/src/groupeddataframe/grouping.jl:167
 [4] top-level scope at util.jl:175

julia> @time Pkg.add("DataFrames")
  Resolving package versions...
   Updating `~/.julia/environments/v1.4/Project.toml`
 [no changes]
   Updating `~/.julia/environments/v1.4/Manifest.toml`
 [no changes]
  3.576830 seconds (2.55 M allocations: 168.654 MiB, 5.42% gc time)

julia> @time using DataFrames
  0.000309 seconds (236 allocations: 12.641 KiB)

julia> @time Pkg.status()
Status `~/.julia/environments/v1.4/Project.toml`
  [c9ce4bd3] ArchGDAL v0.3.2
  [6e4b80f9] BenchmarkTools v0.5.0
  [336ed68f] CSV v0.6.1
  [5ae59095] Colors v0.12.1
  [a93c6f00] DataFrames v0.20.2
  [add2ef01] GDAL v1.1.2
  [28b8d3ca] GR v0.49.1
  [dcc97b0b] GeoStats v0.11.6
  [7073ff75] IJulia v1.21.2
  [86fae568] ImageView v0.10.8
  [98b081ad] Literate v2.5.0
  [961ee093] ModelingToolkit v3.6.4
  [9b87118b] PackageCompiler v1.1.1
  [b98c9c47] Pipe v1.2.0
  [91a5bcdd] Plots v1.3.3
  [612083be] Queryverse v0.5.0
  [295af30f] Revise v2.6.7
  [123dc426] SymEngine v0.8.2
  [24249f21] SymPy v1.0.20
  [44d3d7a6] Weave v0.10.2
  [fdbf4ff8] XLSX v0.7.0
  0.200055 seconds (137.18 k allocations: 8.654 MiB)

julia> @time versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Pentium(R) 4 CPU 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, nocona)
  4.183625 seconds (2.06 M allocations: 93.365 MiB, 3.33% gc time)
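As the error message says, this older DataFrames accepts only Integer or Symbol values when indexing by a vector, so passing the grouping columns as Symbols avoids that particular error. A small sketch on made-up data with the same column names (the later `combine(...)` call with `source => function => target` pairs still assumes DataFrames 0.21 or newer):

```julia
using DataFrames

# Toy rows with the file's column names; values are illustrative only.
toy = DataFrame(countriesAndTerritories = ["A", "A", "B"],
                month = [3, 4, 4],
                year = [2020, 2020, 2020],
                cases = [1, 2, 3])

# Symbols instead of Strings for the grouping columns.
gd = groupby(toy, [:countriesAndTerritories, :month, :year])
ngroups = length(gd)   # one group per distinct (country, month, year)
```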

Chen · June 28, 2020, 9:40pm

Oh. It turns out that I had a dirty registry. I removed ~/.julia/registries/General and was able to update packages normally. I’m going to try @pdeffebach’s MWE again now.
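For anyone hitting the same thing, roughly the same reset can be done from within Julia through the `Pkg.Registry` API instead of deleting the directory by hand. This is a sketch, not exactly what I ran; the function name `reset_general_registry` is made up.

```julia
using Pkg

# Wrapped in a function so nothing is touched until it is called explicitly.
function reset_general_registry()
    Pkg.Registry.rm("General")    # drop the local copy of the registry
    Pkg.Registry.add("General")   # fetch a fresh copy
    Pkg.update()                  # re-resolve and update installed packages
end
```

Calling `reset_general_registry()` needs network access and changes the active environment.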
