# Data Science Posts and Resources

Articles on Data Science

Predicting student grade using Linear Regression

Laxmi K Soni

## Linear Regression

Linear regression is one of the simplest machine learning algorithms, and it is the first one we will look at. It is a supervised learning algorithm, which means that we need both inputs and outputs to train the model.

Suppose we apply this model to school data and try to find a relation between missed hours, study time and the resulting grade. With only these two features we will probably get a less accurate result than with 30. With one feature the model is a straight line, and with two it is a flat surface (a plane). With more features it becomes a hyperplane, which is the equivalent of a straight line in higher dimensions.
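Written as a formula, a linear model with $n$ features looks as follows. The symbols $w_i$ (weights) and $b$ (intercept) are notation introduced here for illustration only:

```latex
\hat{y} = b + w_1 x_1 + w_2 x_2 + \dots + w_n x_n
```

For $n = 1$ this is a straight line, for $n = 2$ a plane, and for larger $n$ a hyperplane. Training the model means choosing $b$ and the $w_i$ so that the predictions $\hat{y}$ are as close as possible to the known outputs.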

To get started with our code, we first need data to work with. Here we use a dataset from the UCI Machine Learning Repository.

This is a dataset which contains a lot of information about student performance. We will use it as sample data for our models.

We download the ZIP-file from the Data Folder and extract the file student-mat.csv from there into the folder in which we code our script.

Now we can start with our code. First of all, we will import the necessary libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

Besides the first three libraries that we already know, we have two imports that are new to us. First, we import the LinearRegression class. This is the class that we will use for creating, training and using our regression model. Additionally, we import the train_test_split function, which we will use to prepare and split our data.

Our first action is to load the data from the CSV file into a Pandas DataFrame. We do this with the function read_csv.

data = pd.read_csv('../data/student/student-mat.csv', sep=';')
print(data.tail())
##     school sex  age address famsize Pstatus  ...  Walc  health absences  G1  G2  G3
## 390     MS   M   20       U     LE3       A  ...     5       4       11   9   9   9
## 391     MS   M   17       U     LE3       T  ...     4       2        3  14  16  16
## 392     MS   M   21       R     GT3       T  ...     3       3        3  10   8   7
## 393     MS   M   18       R     LE3       T  ...     4       5        0  11  12  10
## 394     MS   M   19       U     LE3       T  ...     3       5        5   8   9   9
##
## [5 rows x 33 columns]

It is important that we set the separator to a semicolon, since read_csv expects commas by default and our file is separated by semicolons.
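To see what the separator changes, here is a small sketch that reads the same text twice. The three sample rows are made up for illustration and are not taken from the real file:

```python
import io
import pandas as pd

# A tiny, made-up semicolon-separated sample (not the real dataset).
raw = "school;sex;age\nGP;F;18\nMS;M;17\n"

# With the default separator (comma), each row ends up in one single column.
wrong = pd.read_csv(io.StringIO(raw))

# With sep=';' the three columns are split as intended.
right = pd.read_csv(io.StringIO(raw), sep=';')

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

If the DataFrame has only one column after loading, a wrong separator is the most likely cause.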

In the next step, we think about which features (i.e. columns) are relevant for us, and what exactly we want to predict. A description of all features can be found on the previously mentioned website. In this example, we will limit ourselves to the following columns:

Age, Sex, Studytime, Absences, G1, G2, G3 (label)

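# .copy() gives us an independent DataFrame, so later assignments do not trigger a SettingWithCopyWarning
data = data[['age', 'sex', 'studytime', 'absences', 'G1', 'G2', 'G3']].copy()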

The columns G1, G2 and G3 are the three grades that the students get. Our goal is to predict the third and final grade by looking at the other values like first grade, age, sex and so on.

In short, we keep only these seven of the 33 columns in our DataFrame.

G3 is our label and the rest are our features. Each feature is an axis in the coordinate system and each point is a record, that is, one row in the table. There is one small problem, though: the sex feature is not numeric, but stored as F (for female) or M (for male). To work with it and place it in the coordinate system, we have to convert it into numbers.

data['sex'] = data['sex'].map({'F': 0, 'M': 1})
prediction = 'G3'

We do this by using the map function. Here, we map a dictionary to our feature. Each F becomes a zero and every M becomes a one. Now we can work with it. Finally, we define the column of the desired label as a variable to make it easier to work with.
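One detail of map is worth knowing: values that are missing from the dictionary become NaN. The following sketch uses a made-up series, where 'X' stands for any unexpected entry, to show this:

```python
import pandas as pd

# Made-up values for illustration; 'X' stands for any unexpected entry.
sex = pd.Series(['F', 'M', 'X'])

# Keys found in the dictionary are replaced; everything else becomes NaN.
encoded = sex.map({'F': 0, 'M': 1})

print(encoded.tolist())      # [0.0, 1.0, nan]
print(encoded.isna().sum())  # 1
```

Counting the NaN values after mapping is a cheap check that no category was overlooked.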

## Preparing Data

Our data is now fully loaded and selected. However, in order to use it as training and testing data for our model, we have to reformat it. The sklearn models also accept Pandas data frames, but here we work with plain NumPy arrays to keep features and label clearly separated. That’s why we turn our features into an X array and our label into a Y array.

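# columns=... replaces the positional axis argument, which newer pandas versions no longer accept
X = np.array(data.drop(columns=[prediction]))
Y = np.array(data[prediction])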

The function np.array converts the selected columns into an array. The drop method returns the data frame without the specified column and leaves the original unchanged. Our X array now contains all of our columns except for the final grade. The final grade is in the Y array. In order to train and test our model, we have to split our available data. The first part is used to fit the hyperplane to our data as well as possible. The second part then checks the accuracy of the prediction on previously unseen data.
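The separation into features and label can be tried out on a tiny table first. The rows below are made up, and to_numpy is used here as an alternative to np.array that gives the same result:

```python
import pandas as pd

# Made-up rows with column names like those in the article.
df = pd.DataFrame({
    'age': [18, 17, 19],
    'G1': [10, 14, 8],
    'G2': [11, 15, 9],
    'G3': [11, 16, 9],
})

label = 'G3'
X_demo = df.drop(columns=[label]).to_numpy()  # features: every column but the label
y_demo = df[label].to_numpy()                 # label: the final grade

print(X_demo.shape)  # (3, 3)
print(y_demo.shape)  # (3,)
```

X_demo has one row per record and one column per feature, while y_demo is a flat array with one value per record.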

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1)

With the function train_test_split, we divide our X and Y arrays into four arrays. The order of the returned arrays must be exactly as shown here. The test_size parameter specifies the fraction of records to use for testing. In this case, it is 10%, which leaves about 40 of our 395 records for testing. Shares between 10% and 30% are common choices. We hold this data back to test how accurate our model is on data that it has never seen before.
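Because the split is shuffled randomly, it differs from run to run. If a reproducible split is wanted, train_test_split accepts a random_state parameter. The sketch below uses made-up arrays and an arbitrary seed of 0:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 records with 2 features each.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# random_state fixes the shuffle, so both calls return the same split.
a_train, a_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0)
b_train, b_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0)

print(a_train.shape, a_test.shape)     # (18, 2) (2, 2)
print(np.array_equal(a_test, b_test))  # True
```

Fixing the seed is useful when comparing models, since all of them are then evaluated on the same test records.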

## Training and Testing

Now we can start training and testing our model. For that, we first define our model.

model = LinearRegression()
model.fit(X_train, Y_train)
## LinearRegression()

By using the constructor of the LinearRegression class, we create our model. We then call the fit method and pass our training data. Now our model is already trained. It has adjusted its hyperplane so that it fits the training data as closely as possible, by minimizing the squared differences between predicted and actual grades. In order to test how well our model performs, we can use the score method and pass our testing data.

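accuracy = model.score(X_test, Y_test)
print(accuracy)

Since the splitting of training and test data is always random, we will have slightly different results on each run. A typical result could look like this:

## 0.9130676521162756

For a regression model, score returns the coefficient of determination (R²) rather than a classification accuracy. A value of about 0.91 means that the model explains roughly 91 percent of the variance in the final grade on the test data, which is a good result. Now that we know that our model is somewhat reliable, we can enter new data and predict the final grade.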
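To make the meaning of the score concrete, the following sketch computes R² by hand and compares it with the value returned by score. The data is synthetic and the names X_demo, y_demo and demo_model are made up for this illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 20, size=(100, 1))
y_demo = 0.9 * X_demo[:, 0] + 1.0 + rng.normal(0, 1, size=100)

demo_model = LinearRegression().fit(X_demo, y_demo)
y_pred = demo_model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, demo_model.score(X_demo, y_demo)))  # True
```

A value of 1 would mean perfect predictions, while a value of 0 means the model is no better than always predicting the mean grade.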

X_new = np.array([[18, 1, 3, 40, 15, 16]])
Y_new = model.predict(X_new)
print(Y_new)
## [17.39858933]

Here we define a new NumPy array with values for our features in the same order as in training (age, sex, studytime, absences, G1, G2). Then we use the predict method to calculate the likely final grade for our inputs.
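Under the hood, predict evaluates the fitted hyperplane: it multiplies each feature by its learned weight and adds the intercept. The sketch below checks this on synthetic, noise-free data. The arrays and the name demo_model are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (an assumption for illustration): y = 2*x1 + 3*x2 + 1
X_demo = np.array([[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]], dtype=float)
y_demo = 2 * X_demo[:, 0] + 3 * X_demo[:, 1] + 1

demo_model = LinearRegression().fit(X_demo, y_demo)

x_new = np.array([[6.0, 2.0]])
by_predict = demo_model.predict(x_new)

# The same value computed by hand from the learned weights and intercept.
by_hand = x_new @ demo_model.coef_ + demo_model.intercept_

print(np.allclose(by_predict, by_hand))  # True
```

The attributes coef_ and intercept_ can also be inspected on the grade model to see how strongly each feature influences the predicted grade.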

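In this case, the final grade would probably be 17. The exact prediction (17.40 in the run above) varies slightly between runs, because the split into training and test data is random.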
