CBSE- Informatics Practices (IP) : July 2019

Introduction to Linear Regression

Simple linear regression is useful for finding the relationship between two continuous variables. One is the predictor or independent variable and the other is the response or dependent variable.
It looks for a statistical relationship, not a deterministic one.
The following figure illustrates the deterministic relationship between temperature in degrees Celsius and degrees Fahrenheit, where the data points (x, y) fall directly on a line.
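As a small illustration of a deterministic relationship, the sketch below converts a few Celsius readings to Fahrenheit. The function name celsius_to_fahrenheit and the sample temperatures are chosen only for this example and are not part of the original post.

```python
# Deterministic relationship: Fahrenheit is fully determined by Celsius.
def celsius_to_fahrenheit(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

celsius_values = [0, 10, 20, 30, 40]
fahrenheit_values = [celsius_to_fahrenheit(c) for c in celsius_values]

# Every point (x, y) lies exactly on the line y = 32 + 1.8x.
for c, f in zip(celsius_values, fahrenheit_values):
    print(c, f)
```

Because there is no scatter around the line, knowing x gives y exactly; no fitting is needed.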

In a statistical relationship, the relationship between two variables x and y is not perfect, and hence the fitted line does not pass through all the points, as depicted in the figure below.

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.
A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of Y when X = 0).
Note that a regression line can be fitted to any data, but it is useful for prediction only when the correlation coefficient is close to -1 or +1. When the coefficient is close to 0, the fitted line explains very little of the variation in Y.
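One way to check the strength of the linear relationship before fitting is to compute the correlation coefficient first. The following is a minimal sketch with NumPy, assuming the area and price figures from the land-price dataset used in the next section.

```python
import numpy as np

# Area and price values from the land-price dataset used in this post.
area = np.array([2600, 3000, 3200, 3600, 4000])
price = np.array([550000, 565000, 610000, 680000, 725000])

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the coefficient
# between area and price.
r = np.corrcoef(area, price)[0, 1]
print('correlation coefficient:', r)

# A value near +1 or -1 suggests a straight line is a reasonable model.
print('r squared:', r ** 2)
```

For this dataset the coefficient comes out close to +1, so fitting a straight line is reasonable.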

Performing Linear Regression in Python

The scikit-learn package can be used with DataFrames or NumPy arrays to perform linear regression analysis. Install the package with the command pip install -U scikit-learn (an internet connection is required).
Our task is to perform linear regression on a dataset comprising the area of a piece of land and its price (the training dataset). After we perform the linear regression, we should be able to predict the price of a piece of land that is not in the dataset (the test data).

area	price
2600	550000
3000	565000
3200	610000
3600	680000
4000	725000

To perform linear regression we create the dataset to train our model. (Training of the model here means to find the parameters so that the model best fits the data.)

import numpy as np

area = np.array([2600, 3000, 3200, 3600, 4000]).reshape((-1, 1))
price = np.array([550000, 565000, 610000, 680000, 725000])

print(area)
print(price)

output:

[[2600]
 [3000]
 [3200]
 [3600]
 [4000]]
[550000 565000 610000 680000 725000]

As you can see above, area has been reshaped into a 2D array with each value in its own row. This is required because the training function reg.fit() used below expects the explanatory variable X to be passed as a 2D array.
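The effect of the reshape can be seen by comparing array shapes. This is only a sketch; the names area_1d and area_2d are introduced here for illustration.

```python
import numpy as np

# Same area values as in the dataset above.
area_1d = np.array([2600, 3000, 3200, 3600, 4000])
print(area_1d.shape)   # shape of the flat array

# -1 asks NumPy to infer the number of rows; 1 fixes a single column.
area_2d = area_1d.reshape((-1, 1))
print(area_2d.shape)   # shape of the column array

# ravel() flattens the column back to one dimension.
print(area_2d.ravel().shape)
```

The flat array has one dimension with five entries, while the reshaped array has five rows and one column, which is the samples-by-features layout that scikit-learn estimators expect.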
We import the machine learning routines available in linear_model from sklearn to create our model.
The following example shows how we train the model.

import numpy as np
from sklearn import linear_model

area = np.array([2600, 3000, 3200, 3600, 4000]).reshape((-1, 1))
price = np.array([550000, 565000, 610000, 680000, 725000])

# Instantiate our linear regression model
reg = linear_model.LinearRegression()

# Train the regression model with the training data (area, price)
reg.fit(area, price)

# Predict the price for an area (test data) not present in the training dataset.
# Note that predict() also takes a 2D array.
p = reg.predict([[3400]])
print('Price for area 3400:', p)

output:

Price for area 3400: [642294.52054795]

We can also find the slope (b) and the intercept (a) of the fitted line as follows:

print('The slope of the line:', reg.coef_)
print('The intercept of the line:', reg.intercept_)

output:

The slope of the line: [135.78767123]
The intercept of the line: 180616.43835616432
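The same slope and intercept can be cross-checked without scikit-learn. The sketch below uses the standard least-squares formulas for a single explanatory variable (slope = sum of products of deviations divided by the sum of squared deviations of X); the variable names dx, dy, a and b are chosen here to match the equation Y = a + bX.

```python
import numpy as np

area = np.array([2600, 3000, 3200, 3600, 4000], dtype=float)
price = np.array([550000, 565000, 610000, 680000, 725000], dtype=float)

# Deviations of each value from its mean.
dx = area - area.mean()
dy = price - price.mean()

# Least-squares estimates for the line price = a + b * area.
b = np.sum(dx * dy) / np.sum(dx ** 2)   # slope
a = price.mean() - b * area.mean()      # intercept

print('slope b:', b)
print('intercept a:', a)
print('predicted price for area 3400:', a + b * 3400)
```

The values agree with reg.coef_, reg.intercept_ and the prediction for area 3400 shown earlier, up to floating-point rounding.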

Plotting linear regression graph using matplotlib and numpy

We shall superimpose the regression line (a line plot) on the scatter plot of (area, price) using matplotlib.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Jupyter only: render plots inline so that plt.show() is not required
%matplotlib inline

# Our training dataset
area = np.array([2600, 3000, 3200, 3600, 4000]).reshape((-1, 1))
price = np.array([550000, 565000, 610000, 680000, 725000])

# Labels for the x and y axes
plt.xlabel('area')
plt.ylabel('price')

# Scatter plot of the dataset in red
plt.scatter(area, price, color='red')

# Performing linear regression
reg = linear_model.LinearRegression()  # instantiating a linear regression model
reg.fit(area, price)                   # training the model with data

# Plotting the regression line
plt.plot(area, reg.predict(area), color='blue')

output:
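In the plot produced by the code above, the vertical gap between each red point and the blue line is that point's residual. As a rough numerical companion to the picture, the sketch below computes the residuals. It uses np.polyfit with degree 1 in place of scikit-learn, which is a substitution made here for brevity; both fit the same least-squares line.

```python
import numpy as np

area = np.array([2600, 3000, 3200, 3600, 4000], dtype=float)
price = np.array([550000, 565000, 610000, 680000, 725000], dtype=float)

# Degree-1 polynomial fit: returns the slope first, then the intercept.
slope, intercept = np.polyfit(area, price, 1)

# Points on the fitted line, and each observation's vertical distance from it.
fitted = intercept + slope * area
residuals = price - fitted

for x, r in zip(area, residuals):
    print(x, round(r, 2))
```

Positive residuals correspond to points above the line and negative residuals to points below it; for a least-squares line with an intercept they sum to approximately zero.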

Plotting linear regression graph using matplotlib and pandas

import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model

%matplotlib inline

df = pd.DataFrame([[2600, 550000], [3000, 565000], [3200, 610000],
                   [3600, 680000], [4000, 725000]],
                  columns=['area', 'price'])

plt.xlabel('area')
plt.ylabel('price')
plt.scatter(df.area, df.price, color='red')

reg = linear_model.LinearRegression()  # instantiating a linear regression model
reg.fit(df[['area']], df.price)        # training the model with data

plt.plot(df.area, reg.predict(df[['area']]), color='blue')

output:

Introduction to Correlation

Covariance and correlation both primarily assess the relationship between variables.
Covariance measures how two random variables vary together around their expected values. Its sign shows the direction of the relationship, but its magnitude depends on the units of the variables, so it does not indicate the strength of the relationship.
Correlation measures the strength of the relationship between variables. Correlation is the scaled measure of covariance.
The relationship between the two concepts can be expressed using the formula below:

correlation(X, Y) = covariance(X, Y) / (σX × σY), where σX and σY are the standard deviations of X and Y.
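The idea that correlation is covariance scaled by the two standard deviations can be checked numerically. The sketch below is illustrative only; the sample arrays and the names cov_xy, std_x, std_y, r_from_cov and r_numpy are chosen for this example.

```python
import numpy as np

x = np.array([1.2, 2.4, 3.6, 4.8, 7.2])
y = np.array([4, 6, 8, 10, 12])

# np.cov returns a 2x2 matrix; the off-diagonal entry is cov(x, y).
cov_xy = np.cov(x, y)[0, 1]

# Sample standard deviations (ddof=1 matches np.cov's default normalisation).
std_x = np.std(x, ddof=1)
std_y = np.std(y, ddof=1)

# Correlation as covariance divided by the product of standard deviations.
r_from_cov = cov_xy / (std_x * std_y)
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_from_cov)
print(r_numpy)
```

Both values agree up to floating-point rounding, provided the covariance and the standard deviations use the same normalisation (ddof=1 in this sketch).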

The central quantity in correlation is the correlation coefficient. It takes values between -1 and +1. The closer the coefficient is to +1 or -1, the more closely the two variables are related.
If the correlation coefficient is close to 0, there is little or no linear relationship between the variables. If the coefficient is positive, then as one variable gets larger the other gets larger. If the coefficient is negative, then as one gets larger the other gets smaller (often called an "inverse" correlation).
NumPy calculates the linear correlation coefficient using the Pearson method.
The function np.corrcoef(x, y) returns a 2×2 matrix: the diagonal entries are 1 (each variable's correlation with itself) and the two off-diagonal entries are the correlation coefficient between x and y.

import numpy as np

x = np.array([1.2, 2.4, 3.6, 4.8, 7.2])
y = np.array([4, 6, 8, 10, 12])

print(np.corrcoef(x, y))

output:

[[1.         0.98639392]
 [0.98639392 1.        ]]

Depiction of correlation through scatter plots

Let us visualize the correlation between two variables. We shall use random numbers and scatter plots to illustrate the concept.

Positive correlation

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline

matplotlib.style.use('ggplot')

# 1000 random integers from 0 to 49 (the upper bound 50 is excluded)
x = np.random.randint(0, 50, 1000)

# Positive correlation with respect to x
y = x + np.random.randint(0, 10, 1000)

print(np.corrcoef(x, y))

plt.scatter(x, y)
plt.show()

OUTPUT:

Negative correlation

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline

matplotlib.style.use('ggplot')

# 1000 random integers from 0 to 49 (the upper bound 50 is excluded)
x = np.random.randint(0, 50, 1000)

# Negative correlation with respect to x
y = 100 - x + np.random.randint(0, 10, 1000)

print(np.corrcoef(x, y))

plt.scatter(x, y)
plt.show()

OUTPUT:

Weak or no correlation

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline

matplotlib.style.use('ggplot')

# 1000 random integers from 0 to 49 (the upper bound 50 is excluded)
x = np.random.randint(0, 50, 1000)

# No correlation with respect to x: y is generated independently
y = np.random.randint(0, 30, 1000)

print(np.corrcoef(x, y))

plt.scatter(x, y)
plt.show()
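To compare the three cases side by side without plotting, the coefficients can be computed in one run. The sketch below reuses the same random constructions as the examples above; the fixed seed is an arbitrary choice made here so that the numbers repeat between runs, and the names y_positive, y_negative and y_none are introduced for illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary fixed seed so the results are reproducible

x = np.random.randint(0, 50, 1000)

y_positive = x + np.random.randint(0, 10, 1000)        # rises with x
y_negative = 100 - x + np.random.randint(0, 10, 1000)  # falls as x rises
y_none = np.random.randint(0, 30, 1000)                # generated independently of x

# Entry [0, 1] of the matrix is the coefficient between the two inputs.
r_positive = np.corrcoef(x, y_positive)[0, 1]
r_negative = np.corrcoef(x, y_negative)[0, 1]
r_none = np.corrcoef(x, y_none)[0, 1]

print('positive:', r_positive)
print('negative:', r_negative)
print('none:', r_none)
```

The first coefficient should be close to +1, the second close to -1, and the third close to 0, matching the shapes of the three scatter plots.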

Wednesday, July 10, 2019

Linear Regression

Wednesday, July 3, 2019

Correlation
