I am taking an upper-level stats course taught in R. I have been redoing the assignments in Julia, which has been a lot of fun. However, I have come across a few sticking points. I’ll stick to one in this thread and probably start new threads for the other questions.
My question is how to easily modify my plots to show a simple linear fit by a specific group id. Below is my current workflow and where I am encountering issues!
These are the packages I am using for most of my problem sets so far:
using DataFrames, HTTP, CSV, Dates, StatsPlots, GLM
I begin by reading in my data from the professor’s site and converting the date column to type Date:
covid = DataFrame(CSV.File(HTTP.get("https://pages.uoregon.edu/dlevin/DATA/Covid.csv").body));
covid[!,1]=Date.(covid[:,1],Dates.DateFormat("dd/mm/yyyy"));
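As a small standalone illustration of that conversion (the strings below are made up, not taken from the file), broadcasting Date over strings with a dd/mm/yyyy format looks roughly like this:

```julia
using Dates

# Made-up date strings in the same dd/mm/yyyy layout as the data
raw = ["31/12/2019", "01/01/2020", "15/03/2020"]

# Build the format once and broadcast Date over the strings
fmt = DateFormat("dd/mm/yyyy")
dates = Date.(raw, fmt)

println(dates)
```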
Next I create a quick plot to look at the log of the deaths versus time:
quick_plot=@df covid scatter(:dateRep,log.(:deaths),group= :countriesAndTerritories, legend =:topleft)
I am very pleased with how Julia handles the dates here, but I also notice something unfortunate. If I want to draw a line of best fit through the data, I would normally just add smooth=true to my plot function call above; however, in this case it does nothing (perhaps because of how the macro is set up?). It is also cool that it successfully filters out the problematic 0’s for the log function!
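For reference, here is a minimal sketch of why the zeros matter, using made-up counts rather than the real data. I am assuming the plot simply skips the resulting non-finite points; I have not checked how the plotting backend actually handles them.

```julia
# Made-up death counts, including zeros
deaths = [0, 3, 0, 12, 40]

# log(0) evaluates to -Inf rather than throwing an error
logged = log.(deaths)

# Keep only the finite values, e.g. before fitting a model
finite_logged = filter(isfinite, logged)

println(logged)
println(finite_logged)
```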
No worries, I can use the GLM package to perform a linear fit, so my next step is to group my dataframe:
gcovid=groupby(covid,:geoId)
#Checking to see what the ID's are
combine(gcovid, :geoId => unique => :geoId)
covid_uk = gcovid[1];
covid_us = gcovid[2];
covid_it = gcovid[3];
Next I perform my linear fit, being careful to filter out zeros from the dataset, because of the log:
fit_uk = lm(@formula(log(deaths) ~ dateRep),filter(:deaths => n->n!=0,gcovid[1]));
fit_us = lm(@formula(log(deaths) ~ dateRep),filter(:deaths => n->n!=0,gcovid[2]));
fit_it = lm(@formula(log(deaths) ~ dateRep),filter(:deaths => n->n!=0,gcovid[3]));
Here I encounter two problems. At first I wanted to use the same dates as in my first plot, stored in the column dateRep, but when I run this code I get the following error:
DomainError with 0.0:
FDist: the condition ν2 > zero(ν2) is not satisfied.
I can fix this problem by using the numeric value of day instead. However, R’s lm function has no issues performing a linear fit on data of type Date, so I was wondering if there is a way to do that in Julia too? Moving on, I successfully get the linear model to run with the following code:
fit_uk = lm(@formula(log(deaths) ~ day),filter(:deaths => n->n!=0,gcovid[1]));
fit_us = lm(@formula(log(deaths) ~ day),filter(:deaths => n->n!=0,gcovid[2]));
fit_it = lm(@formula(log(deaths) ~ day),filter(:deaths => n->n!=0,gcovid[3]));
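On the Date question, one workaround I can imagine is converting the dates to days elapsed since the first observation and regressing on that number. The sketch below uses made-up values and a plain least-squares solve instead of GLM, so the names dates, logdeaths, days, b and m are only illustrative:

```julia
using Dates
using LinearAlgebra  # stdlib; the backslash solve below does least squares

# Made-up dates and log-death values, only for illustration
dates = Date(2020, 3, 1):Day(1):Date(2020, 3, 10)
logdeaths = [0.5 + 0.2 * i for i in 0:9]

# Days elapsed since the first date, as plain integers
days = Dates.value.(dates .- first(dates))

# Ordinary least squares with an intercept column
X = hcat(ones(length(days)), days)
b, m = X \ logdeaths

println((b, m))
```

The same days vector could presumably be added as a numeric column to the data frame and used on the right-hand side of the formula, though I have not tried that with lm.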
Great, so now I have the information I need to plot a linear fit to my data above, but I am running into issues actually adding these lines to the above plot. Here is what I am currently trying:
m_uk = coeftable(fit_uk).cols[1][2]
b_uk = coeftable(fit_uk).cols[1][1]
y(x) = m_uk*x+b_uk
plot!(y(gcovid[1][:,:deaths]))
I am getting index errors here, and even if I try to restrict the domain with a non-zero filter I still get index errors. I feel very close to being able to do what I want, but I am struggling with this last step. Also, I have encountered several smaller questions along the way that I would appreciate any insights on.
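One thing I noticed while writing this up: a function written as m*x+b expects a scalar x, so applying it to a whole vector needs the broadcasting dot, and the input should presumably be the predictor (days) rather than the deaths column. I have not confirmed that this is the cause of my errors. A minimal sketch with hypothetical coefficients standing in for the fitted ones:

```julia
# Hypothetical intercept and slope standing in for the fitted coefficients
b = 0.5
m = 0.2

# The line as a scalar function of the predictor
line(x) = m * x + b

# x values should be the predictor (days), not the response (deaths)
days = collect(0:9)

# Broadcast with a dot so the scalar function is applied elementwise
fitted = line.(days)

println(fitted)
```

With the plotting package loaded, something like plot!(dates, fitted) would then be the natural next step, assuming dates and fitted have the same length and the existing plot shows log deaths on the y axis.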