Introduction to Linear Regression
- Simple linear regression is useful for finding the relationship between two continuous variables. One is the predictor (or independent variable) and the other is the response (or dependent variable).
- It looks for a statistical relationship, not a deterministic one.
- The following figure illustrates the deterministic relationship between temperature in degrees Celsius and degrees Fahrenheit, where the data points (x, y) fall directly on a line.
- In a statistical relationship, the relationship between the two variables x and y is not perfect, and hence the fitted line does not pass through all the points, as depicted in the figure below.
- Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.
- A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).
- Note that a fitted line is only a useful summary when the correlation coefficient is reasonably close to -1 or +1. When it is close to 0, a line can still be computed, but it explains little of the variation in the response and gives poor predictions.
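To make the equation Y = a + bX and the role of the correlation coefficient concrete, here is a minimal sketch using NumPy only. It uses the deterministic Celsius-to-Fahrenheit relationship (F = 32 + 1.8 C) as sample data. The sample temperatures and variable names are illustrative assumptions, and the slope and intercept are computed with the standard least-squares formulas rather than a library routine.

```python
import numpy as np

# Illustrative sample: temperatures in Celsius and their exact Fahrenheit values
celsius = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
fahrenheit = 32.0 + 1.8 * celsius

# Pearson correlation coefficient between the two variables
r = np.corrcoef(celsius, fahrenheit)[0, 1]

# Least-squares estimates of the slope b and intercept a in Y = a + bX
x_mean = celsius.mean()
y_mean = fahrenheit.mean()
b = np.sum((celsius - x_mean) * (fahrenheit - y_mean)) / np.sum((celsius - x_mean) ** 2)
a = y_mean - b * x_mean

print('correlation:', round(r, 6))
print('slope b:', round(b, 6))
print('intercept a:', round(a, 6))
```

Because these points lie exactly on a line, the correlation is 1 (up to rounding) and the fit recovers the conversion formula. With noisy data the same formulas give the best-fitting line rather than an exact one.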
Performing Linear Regression in Python
- The scikit-learn package can be used with pandas DataFrames or NumPy arrays to perform linear regression analysis. Install it with the command pip install -U scikit-learn (an internet connection is required).
- Our task is to perform linear regression on a dataset consisting of areas of land and their prices (the training dataset). Once the model is fitted, we should be able to predict the price of a piece of land whose area is not in the dataset (the test data).
- To perform linear regression, we first create the dataset used to train our model. (Training here means finding the parameters that make the model best fit the data.)
import numpy as np
area = np.array([2600,3000,3200,3600,4000]).reshape((-1,1))
price = np.array([550000,565000,610000,680000,725000])
print(area)
print(price)
output:
[[2600]
 [3000]
 [3200]
 [3600]
 [4000]]
[550000 565000 610000 680000 725000]
- You can see above that area has been reshaped into a 2D array with one value per row. This is required because the operations that follow need the explanatory variable X to be passed as a 2D array (rows are samples, columns are features) to our training function reg.fit().
- We import linear_model from sklearn, which provides the machine learning routines we need to create our model.
- The following example illustrates how we train our model.
import numpy as np
from sklearn import linear_model
area = np.array([2600,3000,3200,3600,4000]).reshape((-1,1))
price = np.array([550000,565000,610000,680000,725000])
# Instantiate our linear regression model
reg = linear_model.LinearRegression()
# Train the regression model with the training data (area, price)
reg.fit(area,price)
# We can now predict the price for an area (test data) not present in our training dataset
# Observe that the predict function also takes a 2D array
p = reg.predict([[3400]])
print('Price for area 3400:',p)
output:
Price for area 3400: [642294.52054795]
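The 2D-input requirement mentioned above can be seen by inspecting array shapes. The sketch below reuses the same training data. The names flat_area, new_areas, and predictions, and the extra area 5000, are illustrative choices and not part of the original example.

```python
import numpy as np
from sklearn import linear_model

flat_area = np.array([2600, 3000, 3200, 3600, 4000])
area = flat_area.reshape((-1, 1))
price = np.array([550000, 565000, 610000, 680000, 725000])

print(flat_area.shape)  # 1D: a single axis of length 5
print(area.shape)       # 2D: 5 rows (samples), 1 column (feature)

reg = linear_model.LinearRegression()
reg.fit(area, price)

# predict accepts several samples at once, one row per sample
new_areas = np.array([[3400], [5000]])
predictions = reg.predict(new_areas)
print(predictions)
```

The result has one predicted price per input row, so several test areas can be evaluated in a single call.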
- We can also read the slope (b) and intercept (a) of the fitted line from the model, as in the following example:
print('The slope of the curve:',reg.coef_)
print('The intercept of the curve:',reg.intercept_)
output:
The slope of the curve: [135.78767123]
The intercept of the curve: 180616.43835616432
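As a check on how these two numbers are used, the prediction for an area of 3400 can be reproduced by hand from the equation Y = a + bX. This is a small sketch that refits the model on the same training data. The names slope, intercept, manual_price, and model_price are introduced here for illustration.

```python
import numpy as np
from sklearn import linear_model

area = np.array([2600, 3000, 3200, 3600, 4000]).reshape((-1, 1))
price = np.array([550000, 565000, 610000, 680000, 725000])

reg = linear_model.LinearRegression()
reg.fit(area, price)

slope = reg.coef_[0]        # b: coef_ holds one entry per feature
intercept = reg.intercept_  # a

# Apply Y = a + bX directly and compare with reg.predict
manual_price = intercept + slope * 3400
model_price = reg.predict([[3400]])[0]

print('Manual calculation:', manual_price)
print('reg.predict       :', model_price)
```

Both values should agree up to floating-point rounding, since predict evaluates the same linear equation.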
Plotting linear regression graph using matplotlib and numpy
- We overlay the regression line (a line plot) on the scatter plot of (area, price) using matplotlib.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
#Exclusively for jupyter to allow the plots to be inline and not require plt.show()
%matplotlib inline
# Our training dataset
area = np.array([2600,3000,3200,3600,4000]).reshape((-1,1))
price = np.array([550000,565000,610000,680000,725000])
# labels for x and y axis
plt.xlabel('area')
plt.ylabel('price')
# Scatter plot for the dataset in red colour
plt.scatter(area,price, color='red')
# Performing linear regression
reg = linear_model.LinearRegression() # instantiating a linear regression model
reg.fit(area,price) # training the model with data
# Plotting the regression line
plt.plot(area,reg.predict(area), color='blue')
output: (figure: scatter plot of area against price in red, with the fitted regression line in blue)
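The %matplotlib inline line is an IPython/Jupyter magic and is a syntax error in a plain Python script. A minimal sketch of the same plot for a script follows. It assumes a non-interactive run, so it selects the Agg backend and writes the figure to a file. The file name regression.png is an arbitrary choice, and in an interactive session plt.show() could be used instead.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display available, so render to a file
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model

# Same training dataset as above
area = np.array([2600, 3000, 3200, 3600, 4000]).reshape((-1, 1))
price = np.array([550000, 565000, 610000, 680000, 725000])

reg = linear_model.LinearRegression()
reg.fit(area, price)

fig, ax = plt.subplots()
ax.set_xlabel('area')
ax.set_ylabel('price')
ax.scatter(area, price, color='red')            # observed data points
ax.plot(area, reg.predict(area), color='blue')  # fitted regression line
fig.savefig('regression.png')                   # hypothetical output file name
```

Using the explicit fig and ax objects instead of the plt state machine is a design choice that keeps the sketch easy to reuse when several plots are produced in one script.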
Plotting linear regression graph using matplotlib and pandas
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
%matplotlib inline
df = pd.DataFrame([[2600,550000],[3000,565000],[3200,610000],
                   [3600,680000],[4000,725000]],
                  columns=['area','price'])
plt.xlabel('area')
plt.ylabel('price')
plt.scatter(df.area,df.price, color='red')
reg = linear_model.LinearRegression() # instantiating a linear regression model
reg.fit(df[['area']],df.price) # training the model with data
plt.plot(df.area,reg.predict(df[['area']]), color='blue')
output: (figure: scatter plot of area against price in red, with the fitted regression line in blue)
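Since the NumPy and pandas versions describe the same data, they should produce the same fitted line. The following sketch checks this and also shows why the double brackets in df[['area']] matter. The names reg_df and reg_np are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

df = pd.DataFrame(
    [[2600, 550000], [3000, 565000], [3200, 610000],
     [3600, 680000], [4000, 725000]],
    columns=['area', 'price'],
)

# df[['area']] is a one-column DataFrame (2D); df['area'] is a Series (1D)
print(df[['area']].shape, df['area'].shape)

# Fit once on pandas objects and once on the equivalent NumPy arrays
reg_df = linear_model.LinearRegression().fit(df[['area']], df['price'])
reg_np = linear_model.LinearRegression().fit(
    df[['area']].to_numpy(), df['price'].to_numpy()
)

print(np.allclose(reg_df.coef_, reg_np.coef_))
print(np.isclose(reg_df.intercept_, reg_np.intercept_))
```

The one-column DataFrame plays the same role as the reshaped NumPy array: it gives fit the 2D input it expects.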