Training dataset and reporting MSE

Hello, I am trying to fit a linear regression model to a training dataset in Julia, and then I have to calculate the MSE on a given test dataset. I was wondering which packages are best for this; an example of how to do it with a .csv file would be really helpful.

Also, the .csv file I have doesn’t have column names. Instead there are numbers where the names should be, with Float64 written below them. It looks like the table below. I just copy-pasted some of the .csv file data to give an idea, but this is the kind of .csv file I have to use to train the model and calculate the MSE.

1.321249999999999855e-02 2.695833333333333501e-03 -2.254166666666667339e-03 3.345833333333333472e-03
Float64 Float64 Float64 Float64
1 -0.0095125 0.00390417 -0.0012125 0.00287083
2 0.0005875 0.0030625 -0.0015625 0.00432083
3 0.0139125 0.0010375 0.000845833 -0.00387917
4 -0.0028625 0.0017125 0.00257917 0.00325417
5 -0.0145625 -0.0005125 0.0013125 0.00390417
6 -0.0052875 0.0023875 0.000320833 -0.000854167
7 0.0083125 0.0030125 0.00204583 -0.0004625
8 0.0012125 -0.00314583 -0.00370417 -0.0012125
9 0.0009875 0.00402917 -7.91667e-5 -0.00345417
10 0.0125125

This is a pretty broad question. You might want to try some things yourself first, to narrow it down to the specific issues you run into.

Starting points:

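using CSV, DataFrames # CSV reads CSV files, DataFrames provides the table type

df = CSV.read("myfile.csv", DataFrame) # recent CSV.jl versions require the sink type (here DataFrame) as the second argument

using GLM # package to fit linear models

model = lm(@formula(y ~ x1 + x2 + x3), df) # fit a linear model, assuming your data has columns called y, x1, x2, and x3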

This is the simplest way. Since you talk about “training” a model, you might be looking to fit a model through some sort of optimization rather than just fitting OLS, in which case you should look at ML libraries. I’d probably start with MLJ.jl, which offers a number of models to fit with a unified interface.
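For the MSE part of the question, here is a minimal sketch of the fit-then-evaluate step with GLM. It uses synthetic data in place of your files, and the column names y, x1, x2, x3, the 75/25 split, and the coefficients are all assumptions made for illustration. With real data, train and test would instead be two DataFrames read from your training and test .csv files.

```julia
using DataFrames, GLM, Statistics, Random

Random.seed!(1)  # make the synthetic data reproducible

# Synthetic stand-in for the real data (assumption: y is linear in x1..x3 plus noise).
n = 200
df = DataFrame(x1 = randn(n), x2 = randn(n), x3 = randn(n))
df.y = 0.5 .+ 2.0 .* df.x1 .- 1.0 .* df.x2 .+ 0.3 .* df.x3 .+ 0.1 .* randn(n)

# Hold out the last 25% of rows as a test set (the ratio is arbitrary).
n_train = round(Int, 0.75 * n)
train = df[1:n_train, :]
test = df[(n_train + 1):end, :]

# Fit ordinary least squares on the training rows only.
model = lm(@formula(y ~ x1 + x2 + x3), train)

# Predict on the held-out rows and average the squared errors.
y_pred = predict(model, test)
mse = mean((test.y .- y_pred) .^ 2)

println("test MSE = ", mse)
```

The important point is that the model never sees the test rows during fitting, so the MSE reflects out-of-sample error rather than training fit.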

Yes, that certainly helps a lot. I am still new to Julia and this at least tells me I was on the right path. Thank you!

You said “assuming your data has columns called y, x1, x2, and x3”. My data in the .csv file has weird column names. I have copy-pasted a bit of the sample data below, where the numbers above Float64 are the column names (e.g. here the first one is 1.321249999999999855e-02) and Float64 is just the column type.

1.321249999999999855e-02 2.695833333333333501e-03 -2.254166666666667339e-03
Float64 Float64 Float64
1 -0.0095125 0.00390417 -0.0012125
2 0.0005875 0.0030625 -0.0015625

I wanted to know if we can change the column names in Julia, or do we have to do this by opening it in another application?

I guess your csv doesn’t have headers? You can do a number of things:

  1. Manually add a header line to the csv (just add a row in Excel or something and type names for columns into it)

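  2. Pass the header directly into CSV.read, like CSV.read("myfile.csv", DataFrame, header = ["first_column_name", "second_column_name", "third_column_name"]). The first line of the file is then read as data instead of being used as column names.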

  3. If you’ve read the file into a DataFrame (let’s call it df) already you can call rename!(df, [:first_column_name, :second_column_name, :third_column_name]) on it
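Options 2 and 3 can be sketched roughly like this. The file contents are inlined through an IOBuffer so the snippet runs on its own, the values are taken from the sample you posted, and the comma delimiter and the names y, x1, x2, x3 are assumptions; substitute your own file path and names.

```julia
using CSV, DataFrames

# Headerless contents standing in for the real file (comma delimiter is an assumption).
raw = """
1.321249999999999855e-02,2.695833333333333501e-03,-2.254166666666667339e-03,3.345833333333333472e-03
-0.0095125,0.00390417,-0.0012125,0.00287083
0.0005875,0.0030625,-0.0015625,0.00432083
"""

col_names = ["y", "x1", "x2", "x3"]  # hypothetical names, pick your own

# Option 2: supply the names at read time; the first line is parsed as data.
df = CSV.read(IOBuffer(raw), DataFrame; header = col_names)

# Option 3: read with auto-generated names (Column1, Column2, ...), then rename in place.
df2 = CSV.read(IOBuffer(raw), DataFrame; header = false)
rename!(df2, Symbol.(col_names))

println(names(df))
println(size(df))
```

Either way, the row that previously showed up as column names ends up as the first data row, so no observation is lost.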

The second solution worked for me perfectly.

Thank you so much for the help, I appreciate it. Now the whole thing makes a lot more sense.

If you’re new to Julia and want to use it for ML, consider following the Julia Academy courses at https://juliaacademy.com/. There’s one on machine learning, and the first lectures are about datasets etc.

I will certainly check it out! Thanks for the input.