A few questions about my workflow for quickly investigating data!

I am taking an upper level stats course taught in r. I have been doing a duplicate of the assignments in Julia, which has been a lot of fun. However, I have come across a few sticking points. I’ll just stick to one in this thread and probably make a new thread for the other questions.

My question is how to easily modify my plots to show a simple linear fit by a specific group id. Below is my current workflow and where I am encountering issues!

These are the packages I am using for most of my problem sets so far:

using DataFrames, HTTP, CSV, Dates, StatsPlots, GLM

I begin by reading in my data from the professors site and modifying date data to be of type Date:

covid = DataFrame(CSV.File(HTTP.get("https://pages.uoregon.edu/dlevin/DATA/Covid.csv").body));
covid[!,1]=Date.(covid[:,1],Dates.DateFormat("dd/mm/yyyy"));

Next I create a quick plot to look at the log of the deaths versus time:

quick_plot=@df covid scatter(:dateRep,log.(:deaths),group= :countriesAndTerritories, legend =:topleft)

I am very pleased with how Julia handles the dates here, but I also notice something unfortunate. If I want to fit a line of best fit through the data I would normally just add smooth=true to my plot function call above, however in this case it does nothing (perhaps because of how the macro is set up?). It is also cool that it successfully filters out the problematic 0’s for the log function!

No worries I can use the GLM package to perform a linear fit, so I take my next step by grouping my dataframe:

gcovid=groupby(covid,:geoId)

#Checking to see what the ID's are
combine(gcovid, :geoId => unique => :geoId)

covid_uk = gcovid[1];
covid_us = gcovid[2];
covid_it = gcovid[3];

Next I perform my linear fit, being careful to filter out zeros from the dataset, because of the log:

fit_uk = lm(@formula(log(deaths) ~ dateRep),filter(:deaths => n->n!=0,gcovid[1]));
fit_us = lm(@formula(log(deaths) ~ dateRep),filter(:deaths => n->n!=0,gcovid[2]));
fit_it = lm(@formula(log(deaths) ~ dateRep),filter(:deaths => n->n!=0,gcovid[3]));

Here I encounter two problems. At first I wanted to use the same dates from my first plot which has identifier dateRep, but when I run this code I get the following error:

DomainError with 0.0:
FDist: the condition ν2 > zero(ν2) is not satisfied.

I can fix this problem by using the numeric value of day instead. However, r’s lm function call has no issues with performing a linear fit on data of type Date, so I was wondering if there was a way to do that in Julia too? Moving on, I successfully get the linear model to run with the following code:

fit_uk = lm(@formula(log(deaths) ~ day),filter(:deaths => n->n!=0,gcovid[1]));
fit_us = lm(@formula(log(deaths) ~ day),filter(:deaths => n->n!=0,gcovid[2]));
fit_it = lm(@formula(log(deaths) ~ day),filter(:deaths => n->n!=0,gcovid[3]));

Great, so now I have the information I need to plot a linear fit to my data above, but now I am running into issues actually adding these liens to the above plot. Here is what I am currently trying:

m_uk = coeftable(fit_uk).cols[1][2]
b_uk = coeftable(fit_uk).cols[1][1]
y(x) = m_uk*x+b_uk

plot!(y(gcovid[1][:,:deaths]))

I am getting index errors here, and even if I try to restrict the domain to a non-zero filter I still get index errors…I fell very close to being able to do what I want, but I am struggling with this last step. Also, I have encountered several smaller questions along the way I would appreciate any insights on :slight_smile: .

1 Like

The problem with smooth=true not working is exactly because you have 0 passed to log so fitting a regression fails - just like your manual try.

Therefore do:

@df filter(:deaths => >(0), covid) scatter(:dateRep,log.(:deaths),group= :countriesAndTerritories, legend =:topleft, smooth=true)

and all will be nice.

If you would not want to drop 0s I typically recommend log(x+1) transform:

@df covid scatter(:dateRep,log.(:deaths .+ 1),group= :countriesAndTerritories, legend =:topleft, smooth=true)

Of course then the plot and the regression changes as you include 0s in the data and the transform is a bit different.

2 Likes

You really shouldn’t filter out zeros here… since they are an important data point. Use log(deaths + 1) if you have to, or a hyperbolic sine transform if you want to be more sophisticated. See this paper.

The problem here is that Julia is encoding the dates as fixed effects. Since you have the same number of date dummies in your data as you do rows, Julia throws an error.

Note: R does the same thing! It doesn’t throw an error but it does give nonsense results.

r$> library(tidyverse); library(lubridate);

r$> covid = read_csv(url("https://pages.uoregon.edu/dlevin/DATA/Covid.c
    sv"));
Rows: 282 Columns: 10
── Column specification ───────────────────────────────────────────────
Delimiter: ","
chr (4): dateRep, countriesAndTerritories, geoId, countryterritoryCode
dbl (6): day, month, year, cases, deaths, popData2018

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

r$> uk = covid %>% filter(geoId == "UK");

r$> lm(deaths ~ dateRep, data = uk) %>% summary

Call:
lm(formula = deaths ~ dateRep, data = uk)

Residuals:
ALL 94 residuals are 0: no residual degrees of freedom!

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)       -5.292e-14        NaN     NaN      NaN
dateRep01/02/2020  5.262e-13        NaN     NaN      NaN
dateRep01/03/2020  5.215e-13        NaN     NaN      NaN
dateRep01/04/2020  3.810e+02        NaN     NaN      NaN
dateRep02/01/2020  1.078e-13        NaN     NaN      NaN

What you might think you want encode your dates as numbers, i.e. 1 for the first date, 2 for the second date, etc. But this also has tons of problems! You need to look into time series analysis and how to understand these kinds of regressions. You can’t just use OLS.

It’s also not obvious why you are using 3 different regressions. Why not country-level fixed effects? You should think more about what you want to show with these regressions and what hypothesis you are trying to test.

3 Likes

Ill try this and get back to you

Very helpful, thank you! This is more about trying to get a workflow down and getting comfortable with the tools, but I think all of your points are important to keep in mind for a more serious analysis!

If you are looking for “general workflow tips”, I would suggest DataFramesMeta.jl to help with data cleaning. It might be a more familiar syntax coming from R (disclaimer, I maintain DataFramesMeta.jl).

However you specifically ask

and the answer is “R can’t do it either”.

2 Likes

I’ll absoultely check it out! If you have R-studio give this a try, it seems to work fine on my end!

covid <- read.csv("https://pages.uoregon.edu/dlevin/DATA/Covid.csv")
covid$cdt <- as.Date(covid$dateRep,format = "%d/%m/%Y")

covid_cntr <- group_split(covid,geoId)
covid_usa <- covid_cntr[[3]]
fit_usa <- lm(log(deaths)~cdt, data=subset(covid_usa, deaths>0))

Ah good catch. In my data set in my example above, I accidentally had the date variable encoded as a string.

What’s happening is that Julia’s GLM.jl treats dates as factor variables – non-numeric. R treats dates as a numeric variable. IMO Julia has the correct behavior here. It’s not obvious how to translate a date into a numeric variable that can be used in a regression. Take the following

r$> fit_usa <- lm(log(deaths)~cdt, data=subset(covid_usa, deaths>0)) %>% print

Call:
lm(formula = log(deaths) ~ cdt, data = subset(covid_usa, deaths >
    0))

Coefficients:
(Intercept)          cdt
  -4177.179        0.228


r$> covid_usa$t = nrow(covid_usa):1;

r$> fit_usa <- lm(log(deaths)~t, data=subset(covid_usa, deaths>0)) %>% print

Call:
lm(formula = log(deaths) ~ t, data = subset(covid_usa, deaths >
    0))

Coefficients:
(Intercept)            t
    -14.592        0.228

What do these numbers mean? What does it mean for the first date to be 1, second date to be 2 etc.

Julia is being more restrictive here, in a good way. If you want to do the R route, then create a numeric variable 1:nrow(covid_usa) in your DataFrame, but be warned that this is a very bad regression and has uninterpretable results.

Yeah that makes sense. I can see why Julia would go that way. I guess I can always do my analysis and then use the Date package to conver back to dates before plotting, if that’s how I want the axis to look. Thanks so much :slight_smile:

I found this function which seems to turn dates into correctly ordered integers, so that’s probably the way to go!

datetime2rata()

Doesn’t just “rank” them but turns them into number of days from some arbitrary starting point

It’s funny that this came up twice in a few days. @pdeffebach already explained what happens, but let me just note that in the next release GLM won’t throw an error anymore, so it will be more obvious that the problem is due to dates being treated as categorical (Fix `coeftable` for saturated linear models by nalimilan · Pull Request #458 · JuliaStats/GLM.jl · GitHub).

1 Like