Hello, I am trying to train a linear regression model on a dataset in Julia, and then I have to calculate the MSE on a given test dataset. I was wondering which packages are best for this; an example of how to do it with a .csv file would be really helpful.
Also, the .csv file I have doesn't have column names. Instead, numbers appear where the names should be, with Float64 written below them. It looks like the table below; I just copy-pasted some of the .csv data to give an idea, but this is the kind of file I have to use to train the model and calculate the MSE.
This is a pretty broad question. You might want to try some things yourself first, so you can narrow it down to any specific issues you run into.
Starting points:
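using CSV, DataFrames # CSV reads the file; DataFrames provides the table type it is read into
df = CSV.read("myfile.csv", DataFrame) # recent CSV.jl versions require the sink type (here DataFrame) as the second argument
using GLM # package to fit linear models
model = lm(@formula(y ~ x1 + x2 + x3), df) # fit a linear model, assuming your data has columns called y, x1, x2, and x3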
This is the simplest way. Since you talk about “training” a model, you might be looking to fit it through some sort of optimization rather than plain OLS, in which case you should look at ML libraries. I'd probably start with MLJ.jl, which offers a number of models behind a unified interface.
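Since the original question also asks for the test-set MSE, here is a minimal sketch of that step with GLM. The data below is made up so the snippet runs on its own, and the column names x1, x2, and y are assumptions; in practice train and test would come from CSV.read on your own files.

```julia
using DataFrames, GLM, Statistics

# Hypothetical stand-in data; replace with e.g.
# CSV.read("train.csv", DataFrame) and CSV.read("test.csv", DataFrame).
train = DataFrame(x1 = [1.0, 2.0, 3.0, 4.0, 5.0],
                  x2 = [2.0, 1.0, 4.0, 3.0, 6.0],
                  y  = [5.1, 5.9, 11.2, 11.8, 17.1])
test  = DataFrame(x1 = [6.0, 7.0],
                  x2 = [5.0, 8.0],
                  y  = [18.2, 22.9])

# Fit ordinary least squares on the training data only.
model = lm(@formula(y ~ x1 + x2), train)

# Predict on the held-out rows; the test table needs the same predictor columns.
y_pred = predict(model, test)

# Mean squared error: average of the squared residuals on the test set.
mse = mean((test.y .- y_pred) .^ 2)
println("Test MSE: ", mse)
```

The important part is that the model only ever sees train, and the MSE is computed from predictions on test.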
Yes, that certainly helps a lot. I am still new to Julia and this at least tells me I was on the right path. Thank you!
You said “assuming your data has columns called y, x1, x2, and x3”. My data in the .csv file has weird column names. I have pasted a sample below, where the numbers above Float64 are the column names (e.g. here it is 1.321249999999999855e-02) and Float64 is just the column type, and so on.
I guess your CSV doesn't have headers, so the first row of data is being read as the column names? You can do a number of things:
Manually add a header line to the csv (just add a row in Excel or something and type names for columns into it)
Pass the header directly into CSV.read like CSV.read("myfile.csv", DataFrame; header = ["first_column_name", "second_column_name", "third_column_name"])
If you’ve read the file into a DataFrame (let’s call it df) already you can call rename!(df, [:first_column_name, :second_column_name, :third_column_name]) on it
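The last two options can be sketched as follows. The inline data and the names x1, x2, and y are made up for illustration, and an IOBuffer stands in for the file so the snippet is self-contained; with a real file you would pass the path instead.

```julia
using CSV, DataFrames

# Hypothetical headerless data: every line, including the first, is a data row.
raw = """
0.0132,1.5,2.0
0.0250,2.5,3.0
"""

# Option A: supply the column names while reading.
df_named = CSV.read(IOBuffer(raw), DataFrame; header = ["x1", "x2", "y"])

# Option B: tell CSV.jl there is no header row (columns get auto-generated
# names such as Column1), then rename the columns afterwards.
df_auto = CSV.read(IOBuffer(raw), DataFrame; header = false)
rename!(df_auto, [:x1, :x2, :y])

println(names(df_named))
println(names(df_auto))
```

Either way the first line stays in the data instead of being consumed as column names, which is what seems to be happening in your file.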
If you're new to Julia and want to use it for ML, consider following the Julia Academy courses (https://juliaacademy.com/). There's one on machine learning, and its first lectures cover datasets and related basics.