Training dataset and reporting MSE

Hello, I am trying to train a linear regression model on a dataset in Julia, and then I have to calculate the MSE on a given test dataset. I was wondering which packages are best for this; an example of how to do it with a .csv file would be really helpful.

Also, the CSV file I have doesn’t have column names. Instead, numbers appear where the names should be, with Float64 written below them. It looks like the table below; I copy-pasted part of the .csv data just to give an idea of the kind of file I have to use to train the model and calculate the MSE.

1.321249999999999855e-02 2.695833333333333501e-03 -2.254166666666667339e-03 3.345833333333333472e-03
Float64 Float64 Float64 Float64
1 -0.0095125 0.00390417 -0.0012125 0.00287083
2 0.0005875 0.0030625 -0.0015625 0.00432083
3 0.0139125 0.0010375 0.000845833 -0.00387917
4 -0.0028625 0.0017125 0.00257917 0.00325417
5 -0.0145625 -0.0005125 0.0013125 0.00390417
6 -0.0052875 0.0023875 0.000320833 -0.000854167
7 0.0083125 0.0030125 0.00204583 -0.0004625
8 0.0012125 -0.00314583 -0.00370417 -0.0012125
9 0.0009875 0.00402917 -7.91667e-5 -0.00345417
10 0.0125125

This is a pretty broad question. You might want to try some things yourself first to narrow it down to the specific issues you run into.

Starting points:

using CSV, DataFrames # CSV reads the file; DataFrames provides the table type

df = CSV.read("myfile.csv", DataFrame)

using GLM # package to fit linear models

lm(@formula(y ~ x1 + x2 + x3), df) # fit a linear model, assuming your data has columns called y, x1, x2, and x3

This is the simplest way. Since you talk about “training” a model, you might be looking to fit a model through some sort of optimization rather than just fitting OLS, in which case you should look at ML libraries. I’d probably start with MLJ.jl, which offers a number of models to fit with a unified interface.
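To cover the MSE part of the question, here is a minimal sketch of fitting on training rows and scoring on held-out rows. It uses synthetic data instead of your file, and the column names (y, x1, x2, x3), the 75/25 split, and the choice of y as the response are all assumptions made for illustration. With a real file you would build df with CSV.read instead.

```julia
using DataFrames, GLM, Random, Statistics

Random.seed!(1)  # make the synthetic data reproducible

# Synthetic stand-in for the CSV contents (an assumption, not your data):
# y depends linearly on x1, x2, x3 plus a little noise.
n = 200
df = DataFrame(x1 = randn(n), x2 = randn(n), x3 = randn(n))
df.y = 1.0 .+ 2.0 .* df.x1 .- 0.5 .* df.x2 .+ 0.3 .* df.x3 .+ 0.1 .* randn(n)

# Hold out the last 25% of rows as the test set (an arbitrary choice).
ntrain = round(Int, 0.75 * n)
train = df[1:ntrain, :]
test = df[ntrain+1:end, :]

# Fit ordinary least squares on the training rows only.
model = lm(@formula(y ~ x1 + x2 + x3), train)

# Predict on the unseen test rows and average the squared errors.
yhat = predict(model, test)
mse = mean((test.y .- yhat) .^ 2)
println("test MSE = ", mse)
```

If you already have separate training and test files, skip the split and read each file into its own DataFrame with the same column names.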


Yes, that certainly helps a lot. I am still new to Julia and this at least tells me I was on the right path. Thank you!

You said “assuming your data has columns called y, x1, x2, and x3”. My data in the .csv file has weird column names. I have copy-pasted a bit of the sample data below, where the numbers above Float64 are the column names (e.g. here the first one is 1.321249999999999855e-02) and Float64 is just the type.

1.321249999999999855e-02 2.695833333333333501e-03 -2.254166666666667339e-03
Float64 Float64 Float64
1 -0.0095125 0.00390417 -0.0012125
2 0.0005875 0.0030625 -0.0015625

I wanted to know if we can change the column names in Julia, or do we have to do this by opening it in another application?

I guess your csv doesn’t have headers? You can do a number of things:

  1. Manually add a header line to the csv (just add a row in Excel or something and type names for columns into it)

  2. Pass the header directly into CSV.read, like CSV.read("myfile.csv", DataFrame; header = ["first_column_name", "second_column_name", "third_column_name"])

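  3. If you’ve already read the file into a DataFrame (let’s call it df), you can call rename!(df, [:first_column_name, :second_column_name, :third_column_name]) on it. Note that the row that was read as the header is then no longer part of the data, so option 2 is the better fit if that first row holds real values.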
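Options 2 and 3 can be sketched as follows. The example writes a tiny headerless file to a temporary directory so it runs on its own; the comma delimiter, the file name, and the names y, x1, x2, x3 are assumptions, so substitute whatever matches your file.

```julia
using CSV, DataFrames

# Write a small headerless file so the example is self-contained.
# (The comma delimiter is an assumption about the real file.)
path = joinpath(mktempdir(), "noheader.csv")
write(path, """
-0.0095125,0.00390417,-0.0012125,0.00287083
0.0005875,0.0030625,-0.0015625,0.00432083
0.0139125,0.0010375,0.000845833,-0.00387917
""")

# Option 2: supply the names while reading; every line is treated as data.
df = CSV.read(path, DataFrame; header = ["y", "x1", "x2", "x3"])

# Option 3: read with auto-generated names, then rename afterwards.
# header = false tells CSV.jl that the file has no header row.
df2 = CSV.read(path, DataFrame; header = false)
rename!(df2, [:y, :x1, :x2, :x3])

println(names(df))
println(size(df))
```

Both routes keep all three rows as data, which is the difference from reading the file with the default settings and renaming later.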


The second solution worked for me perfectly.

Thank you so much for the help, I appreciate it. Now the whole thing makes a lot more sense.

If you’re new to Julia and want to use it for ML, consider following the Julia Academy courses. There’s one on machine learning, and the first lectures are about datasets etc.


I will certainly check it out! Thanks for the input.