Cumulative deaths by day by country

I appreciate all the help in this thread, thank you, everyone.

I think pdeffebach’s MWE was not what I wanted, though.

I think nilshg’s MWE was close to what I want.

Here is the current code, using bits from both MWEs.

The current problem is that the data is in reverse chronological order, so I need to sort it by country and by day.

However, the sort() function does not understand that “1/10/2020” is after “1/9/2020”.

julia> @time using CSV, DataFrames, Dates
  0.000855 seconds (1.14 k allocations: 60.422 KiB)

julia> @time df = CSV.File("/home/c/Downloads/COVID-19-geographic-disbtribution-worldwide-2020-06-24.csv") |> DataFrame
  0.058011 seconds (148.06 k allocations: 7.819 MiB)

julia> @time transform(groupby(df, :countriesAndTerritories), :deaths => cumsum => :deaths)
  0.168699 seconds (211.01 k allocations: 10.525 MiB, 21.53% gc time)
julia> sort!(df, (order(:countriesAndTerritories), order(:dateRep)))
┌ Warning: Passing a tuple (DataFrames.UserColOrdering{Symbol}(:countriesAndTerritories, Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}()), DataFrames.UserColOrdering{Symbol}(:dateRep, Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}())) of column selectors when sorting data frame is deprecated. Pass a vector DataFrames.UserColOrdering{Symbol}[DataFrames.UserColOrdering{Symbol}(:countriesAndTerritories, Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}()), DataFrames.UserColOrdering{Symbol}(:dateRep, Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}())] instead.
│   caller = sort!(::DataFrame, ::Tuple{DataFrames.UserColOrdering{Symbol},DataFrames.UserColOrdering{Symbol}}; alg::Nothing, lt::Function, by::Function, rev::Bool, order::Base.Order.ForwardOrdering) at sort.jl:83
└ @ DataFrames ~/.julia/packages/DataFrames/e3P4n/src/dataframe/sort.jl:83
25517×11 DataFrame. Omitted printing of 1 columns
│ Row   │ dateRep   │ day   │ month │ year  │ cases │ deaths │ countriesAndTerritories │ geoId  │ countryterritoryCode │ popData2019 │
│       │ String    │ Int64 │ Int64 │ Int64 │ Int64 │ Int64  │ String                  │ String │ String?              │ Int64?      │
├───────┼───────────┼───────┼───────┼───────┼───────┼────────┼─────────────────────────┼────────┼──────────────────────┼─────────────┤
│ 1     │ 1/1/2020  │ 1     │ 1     │ 2020  │ 0     │ 0      │ Afghanistan             │ AF     │ AFG                  │ 38041757    │
│ 2     │ 1/10/2020 │ 10    │ 1     │ 2020  │ 0     │ 0      │ Afghanistan             │ AF     │ AFG                  │ 38041757    │
│ 3     │ 1/11/2020 │ 11    │ 1     │ 2020  │ 0     │ 0      │ Afghanistan             │ AF     │ AFG                  │ 38041757    │
│ 4     │ 1/12/2020 │ 12    │ 1     │ 2020  │ 0     │ 0      │ Afghanistan             │ AF     │ AFG                  │ 38041757    │
│ 5     │ 1/13/2020 │ 13    │ 1     │ 2020  │ 0     │ 0      │ Afghanistan             │ AF     │ AFG                  │ 38041757    │
│ 6     │ 1/14/2020 │ 14    │ 1     │ 2020  │ 0     │ 0      │ Afghanistan             │ AF     │ AFG                  │ 38041757    │
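
The row order above is what plain string comparison produces: strings are compared character by character, so "1/10/2020" sorts before "1/9/2020". A minimal sketch of the effect using only Base Julia; the vector `dates` and the helper `ymd_key` are made up for illustration and are not part of the thread's data:

```julia
# Toy input (an assumption for illustration, not the real CSV column)
dates = ["1/9/2020", "1/10/2020", "1/1/2020"]

# Default sort on strings is lexicographic, character by character
lexicographic = sort(dates)

# Hypothetical helper: turn "m/d/y" into a (year, month, day) tuple of integers
function ymd_key(s)
    m, d, y = parse.(Int, split(s, "/"))
    return (y, m, d)
end

# Sorting by the numeric key gives chronological order
chronological = sort(dates, by = ymd_key)

println(lexicographic)
println(chronological)
```

Converting the column to a proper date type, as discussed further down, is the more robust route; the tuple key is only meant to show why the string sort misbehaves.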


So I found the Date() function. But I do not know how to ask Date() to produce a new column of formatted dates:

julia> Date("6/15/2020", "mm/dd/yyyy")
2020-06-15

julia> Date(df.dateRep, "mm/dd/yyyy")
ERROR: MethodError: no method matching Int64(::PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}})
Closest candidates are:
  Int64(::Union{Bool, Int32, Int64, UInt32, UInt64, UInt8, Int128, Int16, Int8, UInt128, UInt16}) at boot.jl:707
  Int64(::Ptr) at boot.jl:717
  Int64(::Float32) at float.jl:707
  ...
Stacktrace:
 [1] Date(::PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}}, ::String, ::Int64) at /var/tmp/portage/dev-lang/julia-1.4.2/work/julia-1.4.2/usr/share/julia/stdlib/v1.4/Dates/src/types.jl:368 (repeats 2 times)
 [2] top-level scope at REPL[16]:1

julia> df.dateRep
25517-element PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}}:
 "1/1/2020"
 "1/10/2020"
 "1/11/2020"
 "1/12/2020"

julia> df.Date = DateTime(df.dateRep, "mm/dd/yyyy")
ERROR: MethodError: no method matching Int64(::PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}})
Closest candidates are:
  Int64(::Union{Bool, Int32, Int64, UInt32, UInt64, UInt8, Int128, Int16, Int8, UInt128, UInt16}) at boot.jl:707
  Int64(::Ptr) at boot.jl:717
  Int64(::Float32) at float.jl:707

You are trying to apply a function to a whole array here:

Date(df.dateRep, "mm/dd/yyyy")

df.dateRep is a column, i.e. a vector of strings, and Date has no method that parses a whole vector at once. In this case you need Julia's broadcasting syntax: append . to the function name so that it is applied to each element, as in

julia> df = DataFrame(dateRep = ["1/1/2020", "3/23/2020", "2/22/2020"])
3×1 DataFrame
│ Row │ dateRep   │
│     │ String    │
├─────┼───────────┤
│ 1   │ 1/1/2020  │
│ 2   │ 3/23/2020 │
│ 3   │ 2/22/2020 │

julia> df[!,:Date] = Date.(df.dateRep, "mm/dd/yyyy")
3-element Array{Date,1}:
 2020-01-01
 2020-03-23
 2020-02-22

julia> df
3×2 DataFrame
│ Row │ dateRep   │ Date       │
│     │ String    │ Date       │
├─────┼───────────┼────────────┤
│ 1   │ 1/1/2020  │ 2020-01-01 │
│ 2   │ 3/23/2020 │ 2020-03-23 │
│ 3   │ 2/22/2020 │ 2020-02-22 │

julia> sort!(df, :Date)
3×2 DataFrame
│ Row │ dateRep   │ Date       │
│     │ String    │ Date       │
├─────┼───────────┼────────────┤
│ 1   │ 1/1/2020  │ 2020-01-01 │
│ 2   │ 2/22/2020 │ 2020-02-22 │
│ 3   │ 3/23/2020 │ 2020-03-23 │
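
Building on the example above, here is a hedged sketch of sorting by country and then by date. It passes a vector of column names to sort!, which is what the deprecation warning earlier in the thread asks for instead of a tuple. The data frame `toy` and its values are invented for illustration, and `fmt` is just a local name for a DateFormat built once and reused:

```julia
using DataFrames, Dates

# Toy data (an assumption; the real file has many more columns)
toy = DataFrame(
    countriesAndTerritories = ["Sweden", "Norway", "Sweden", "Norway"],
    dateRep = ["1/10/2020", "1/10/2020", "1/9/2020", "1/9/2020"],
)

# Build the format object once, then broadcast Date over the column
fmt = DateFormat("mm/dd/yyyy")
toy.Date = Date.(toy.dateRep, fmt)

# Sort by country first, then chronologically within each country
sort!(toy, [:countriesAndTerritories, :Date])

println(toy)
```

Reusing one DateFormat object avoids re-parsing the format string for every row, which may matter on larger files.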

Off-topic, but you can always find out how to type a Unicode symbol in Julia by typing ? at the REPL and pasting the symbol:

help?> ∘
"∘" can be typed by \circ<tab>

Thank you very much!

Here is my code now:

@time using CSV, DataFrames, Dates, StatsPlots

@time df = CSV.File("/home/c/Downloads/COVID-19-geographic-disbtribution-worldwide-2020-06-24.csv") |> DataFrame

@time df[!,:Date] = Date.(df.dateRep, "mm/dd/yyyy")

@time sort!(df, :Date)

@time df = transform(groupby(df, :countriesAndTerritories), :deaths => cumsum => :deathsCum)

@time df = transform(groupby(df, :countriesAndTerritories), :cases => cumsum => :Cummulative_Cases)

@time df

@time df_europe = df[findall(in(["Europe"]), df[:continentExp]), :]

@time @df df_europe plot(:Date, :deathsCum, group = :countriesAndTerritories)

@time df_europe_selected_countries = df_europe[findall(in(["Norway", "Sweden", "Italy", "France", "United_Kingdom", "Finland", "Denmark", "Austria", "Netherlands"]), df_europe[:countriesAndTerritories]), :]

@time @df df_europe_selected_countries plot(:Date, :Cummulative_Cases, group = :countriesAndTerritories, legend=:topleft)
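
One detail worth checking in a script like this: cumsum runs over the rows of each group in their current order, so the sort has to happen before the grouped transform. A minimal sketch with invented numbers; `toy`, `deaths_cum` and `a_cum` are hypothetical names, not columns from the real file:

```julia
using DataFrames, Dates

# Toy data in reverse chronological order (values are made up)
toy = DataFrame(
    country = ["A", "A", "A", "B", "B"],
    Date    = Date.(["1/3/2020", "1/2/2020", "1/1/2020", "1/2/2020", "1/1/2020"], "mm/dd/yyyy"),
    deaths  = [3, 2, 1, 20, 10],
)

# Sort first so that each group's rows are in chronological order
sort!(toy, [:country, :Date])

# Running total of deaths within each country
toy = transform(groupby(toy, :country), :deaths => cumsum => :deaths_cum)

# Pull out one country's running total to inspect it
a_cum = toy[toy.country .== "A", :deaths_cum]
b_cum = toy[toy.country .== "B", :deaths_cum]
println(a_cum)
println(b_cum)
```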

I received a warning when filtering observations, for example when keeping only European countries:

julia> @time df_europe = df[findall(in(["Europe"]), df[:continentExp]), :]
┌ Warning: `getindex(df::DataFrame, col_ind::ColumnIndex)` is deprecated, use `df[!, col_ind]` instead.
│   caller = top-level scope at util.jl:175
└ @ Core ./util.jl:175
  0.009137 seconds (378 allocations: 910.047 KiB)
7738×14 DataFrame. Omitted printing of 5 columns

I see that the code is running fine. May I ask what I should change so that I follow the new recommended syntax? I tried a few alternatives but they did not work, for example,

julia> @time df_europe = df[findall(in(["Europe"]), df[!:continentExp]), :]
ERROR: MethodError: no method matching !(::Symbol)
Closest candidates are:
  !(::Missing) at missing.jl:100
  !(::Bool) at bool.jl:35
  !(::Function) at operators.jl:880
  ...
Stacktrace:
 [1] top-level scope at ./util.jl:175
 [2] eval_user_input(::Any, ::REPL.REPLBackend) at /var/tmp/portage/dev-lang/julia-1.4.2/work/julia-1.4.2/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
 [3] run_backend(::REPL.REPLBackend) at /home/c/.julia/packages/Revise/BqeJF/src/Revise.jl:1184
 [4] top-level scope at REPL[1]:0

The problem is that you’re only indexing one dimension - a DataFrame is two-dimensional, so you should always index as df[row_selection, column_selection] (which I see you’re doing in the “outer” indexing operation in that line). Here you probably want:

df[df.continentExp .== "Europe", :]

If you want to use in because you’re checking against a group of values, you can use it as

df[in(my_list).(df.continentExp), :]
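
To connect this to the error message in the question: df[!:continentExp] is missing the comma between the row selector ! and the column name, so Julia parses it as the negation operator ! applied to the Symbol :continentExp. A small sketch of the working forms, using an invented data frame `toy` and an invented list `wanted`:

```julia
using DataFrames

# Toy data (an assumption for illustration)
toy = DataFrame(
    continentExp = ["Europe", "Asia", "Europe"],
    countriesAndTerritories = ["Norway", "Japan", "Italy"],
)

# Three ways to get a single column; note the comma after ! and :
col_direct = toy[!, :continentExp]   # the stored column itself, no copy
col_copy   = toy[:, :continentExp]   # a copy of the column
col_prop   = toy.continentExp        # same as toy[!, :continentExp]

# Row selection with a Boolean mask, keeping all columns
europe = toy[toy.continentExp .== "Europe", :]

# Membership test against several values
wanted = ["Norway", "Japan"]
selected = toy[in(wanted).(toy.countriesAndTerritories), :]

println(europe)
println(selected)
```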

Thank you!

So far it seems that Julia has the Matlab way of having a . operation, as compared to the R way. Good to know!

I am now trying to filter out everything except a selected set of countries. I tried two methods. One method is to use in(), the other is to use the dumb way from the good old primitive programming days by specifying a number of OR operations.

Neither worked. Any hint, please?

@time df_europe = df[df.continentExp .== "Europe", :]

@time df_europe_selected_countries = df_europe[df_europe.countriesAndTerritories .in(["Norway", "Sweden", "Italy", "France", "United_Kingdom", "Finland", "Denmark", "Austria", "Netherlands"]), :]

@time df_europe_selected_countries = df_europe[df_europe.countriesAndTerritories .== "Norway" || df_europe.countriesAndTerritories .== "Sweden" || df_europe.countriesAndTerritories .== "Italy" || df_europe.countriesAndTerritories .== "France" || df_europe.countriesAndTerritories .== "United_Kingdom" || df_europe.countriesAndTerritories .== "Finland" || df_europe.countriesAndTerritories .== "Denmark" || df_europe.countriesAndTerritories .== "Austria" || df_europe.countriesAndTerritories .== "Netherlands", :]

You may try

df_europe_selected_countries = 
filter(:countriesAndTerritories => x->in(x, ["Norway", "Sweden"]), df_europe)

BTW, you may be interested in having a look at the excellent tutorial of DataFrames.jl by Bogumił Kamiński.
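
As a small sketch of the filter approach, the anonymous function can also be replaced by the one-argument form of in, which returns a function that tests membership in the given collection. The data frame `toy` and the list `keep` below are made up for illustration:

```julia
using DataFrames

# Toy data (an assumption; values are invented)
toy = DataFrame(
    countriesAndTerritories = ["Norway", "Sweden", "Italy", "Norway"],
    deaths = [1, 2, 3, 4],
)

# Countries to keep
keep = ["Norway", "Sweden"]

# filter takes `column => predicate`; in(keep) is the predicate x -> x in keep
subset_df = filter(:countriesAndTerritories => in(keep), toy)

println(subset_df)
```

Note that filter returns a new data frame and leaves `toy` unchanged.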


Two separate issues with what you’re trying to do:

In the first case, your application of the in function seems to be inspired by Python/pandas, following an object-oriented paradigm where the method (in) is bound to the object (df.countries) and accessed by dot notation. This is not what dot notation in Julia is for. In Julia, a dot after a function name (but before the opening parenthesis of the call) denotes broadcasting, i.e. elementwise application of the function to the elements of its argument (e.g. a vector or matrix). So you want:

df_europe[in(country_list).(df_europe.countriesAndTerritories), :]

where country_list is just a vector of all the countries you want to include. Two things are happening here: first, in(country_list) creates a Base.Fix2 function object, that is, a function in which the second argument is fixed (in this case to country_list). in(country_list).(x) then applies this function to x, where the dot indicates elementwise application - that is, for each element of x, check whether it is in country_list. This is equivalent to in.(x, Ref(country_list)), where we pass country_list as a second argument, but need to wrap it in Ref so that broadcasting does not try to also iterate over the elements of country_list. Consider the following examples:

# no broadcasting - doesn't work as it checks whether the whole array [1,5,12] is 
# an element of 1:10, rather than checking 1, 5, and 12 separately
julia> in([1, 5, 12], 1:10) 
false

# same with broadcasting (notice the dot) - doesn't work as this is trying to 
# apply the function to each element of [1,5,12] and 1:10
julia> in.([1, 5, 12], 1:10)
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 3 and 10")
Stacktrace:
 [1] _bcs1 at .\broadcast.jl:490 [inlined]
 [2] _bcs at .\broadcast.jl:484 [inlined]
 [3] broadcast_shape at .\broadcast.jl:478 [inlined]
 [4] combine_axes at .\broadcast.jl:473 [inlined]
 [5] instantiate at .\broadcast.jl:256 [inlined]
 [6] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(in),Tuple{Array{Int64,1},UnitRange{Int64}}}) at .\broadcast.jl:820
 [7] top-level scope at none:0

# same broadcasting but now with Ref() wrapper around 1:10 to prevent
# broadcasting from also trying to broadcast over each element of 1:10
# This is the result we're after!
julia> in.([1, 5, 12], Ref(1:10))
3-element BitArray{1}:
 1
 1
 0

# simpler (imho!) way of writing this without Ref - create a Fix2 function version 
# of in in which the second argument is fixed to 1:10
julia> in(1:10)
(::Base.Fix2{typeof(in),UnitRange{Int64}}) (generic function with 1 method)

# Apply to our vector - this fails again as we're not broadcasting
julia> in(1:10)([1, 5, 12])
false

# With broadcasting we get what we're after - and note that for the Fix2 function 
# we can go without a Ref() wrapper
julia> in(1:10).([1, 5, 12])
3-element BitArray{1}:
 1
 1
 0
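
For completeness, a hedged sketch of a few equivalent spellings applied to strings, using only Base Julia. The vectors `countries` and `wanted` are invented for illustration. The Set variant is an optional extra that is not part of the explanation above; it may help when the list of wanted values is long, since membership tests on a Set do not scan the whole collection:

```julia
# Toy inputs (assumptions for illustration)
countries = ["Norway", "Japan", "Italy", "Brazil"]
wanted = ["Norway", "Italy"]

# Fix2 form: in(wanted) is a one-argument function, broadcast over countries
mask_fix2 = in(wanted).(countries)

# Two-argument form: Ref keeps wanted from being iterated by the broadcast
mask_ref = in.(countries, Ref(wanted))

# Infix form of the same thing (∈ can be typed as \in<tab>)
mask_infix = countries .∈ Ref(wanted)

# Optional: test membership against a Set instead of a Vector
mask_set = in(Set(wanted)).(countries)

println(mask_fix2)
```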

Your second issue is a lot easier: you just need parens around your comparisons, because | binds more tightly than == in Julia's operator precedence:

julia> 1 == 2 | 3 == 3
false

julia> (1 == 2) | (3 == 3)
true

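Also, you again want to broadcast here: || is the short-circuiting operator and only accepts single Bool values, so use the elementwise .| instead and do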

df[(df.col1 .== "val1") .| (df.col2 .== "val2"), :]
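
The same pattern on a plain vector, as a minimal sketch using only Base Julia; the vector `countries` is invented for illustration. The second variant shows that || is still fine inside a function that is applied to one element at a time:

```julia
# Toy input (an assumption for illustration)
countries = ["Norway", "Japan", "Sweden", "Italy"]

# Elementwise OR of two elementwise comparisons; the parentheses are required
mask = (countries .== "Norway") .| (countries .== "Sweden")
selected = countries[mask]

# Scalar || works when each element is handled separately
mask_map = map(c -> c == "Norway" || c == "Sweden", countries)

println(selected)
```

For more than two or three values, a membership test with in is usually easier to read than a chain of .| comparisons.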