Introduction to Correlation
- Covariance and correlation both primarily assess the relationship between variables.
- Covariance measures how two random variables vary together around their expected values. Its sign indicates the direction of the relationship, but its magnitude depends on the units of the variables, so it does not by itself indicate the strength of the relationship.
- Correlation measures the strength of the linear relationship between variables. Correlation is a scaled version of covariance.
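- The relationship between the two concepts can be expressed using the formula below, where Cov(X, Y) is the covariance of X and Y, and \sigma_X and \sigma_Y are their standard deviations:
  $$\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$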
- The central quantity in correlation is the correlation coefficient. It takes values from -1 to +1. The closer the coefficient is to +1 or -1, the more closely the two variables are linearly related.
- If the correlation coefficient is close to 0, there is little or no linear relationship between the variables. If the coefficient is positive, then as one variable gets larger the other tends to get larger. If the coefficient is negative, then as one gets larger the other tends to get smaller (often called an "inverse" correlation).
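- NumPy calculates the linear correlation coefficient using the Pearson method.
- The function np.corrcoef(x, y) returns a 2x2 matrix. The diagonal entries are the correlations of x and of y with themselves (always 1), and the two off-diagonal entries are the correlation between x and y: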
import numpy as np
x = np.array([1.2,2.4,3.6,4.8,7.2])
y = np.array([4,6,8,10,12])
print(np.corrcoef(x,y))
output:
[[1.         0.98639392]
 [0.98639392 1.        ]]
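To make the link between covariance and correlation concrete, here is a minimal sketch that recomputes the coefficient by hand from np.cov and the sample standard deviations, using the same x and y as above. It also rescales x by 100 (an arbitrary factor chosen for illustration, standing in for a change of units) to show that covariance changes with scale while correlation does not. The variable names r_manual, x_scaled and so on are introduced here and are not part of the original example.

```python
import numpy as np

x = np.array([1.2, 2.4, 3.6, 4.8, 7.2])
y = np.array([4, 6, 8, 10, 12])

# Sample covariance of x and y: off-diagonal entry of the 2x2 covariance matrix.
cov_xy = np.cov(x, y)[0, 1]

# Sample standard deviations; ddof=1 matches the default normalization of np.cov.
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)

# Rescaling x multiplies the covariance by the same factor
# but leaves the correlation coefficient unchanged.
x_scaled = 100 * x
cov_scaled = np.cov(x_scaled, y)[0, 1]
r_scaled = np.corrcoef(x_scaled, y)[0, 1]
print(cov_xy, cov_scaled)
print(r_numpy, r_scaled)
```

The two values on the first printed line should agree up to floating-point rounding, which is the formula relating covariance and correlation applied directly.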
Depiction of correlation through scatter plots
- Let us visualize the correlation between two variables. We shall use random numbers and scatter plots to illustrate the concept.
Positive correlation
import numpy as np
import matplotlib.pyplot as plt
# Jupyter magic: render plots inline (remove this line when running as a plain script)
%matplotlib inline
plt.style.use('ggplot')
# 1000 random integers from 0 to 49 (the upper bound 50 is exclusive)
x = np.random.randint(0, 50, 1000)
# Positive correlation: y rises with x, plus integer noise from 0 to 9
y = x + np.random.randint(0, 10, 1000)
print(np.corrcoef(x,y))
plt.scatter(x, y)
plt.show()
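OUTPUT: the off-diagonal entries of the printed matrix are close to +1, and the scatter plot shows a band of points rising from lower left to upper right.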
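The strength of the positive correlation in this kind of experiment depends on how large the added noise is relative to the spread of x. The sketch below is one way to see that, under assumptions that are not in the original: it uses a seeded np.random.default_rng generator so the run is repeatable, and the noise ranges 10, 50 and 200 are arbitrary values picked for illustration.

```python
import numpy as np

# Seeded generator (an assumption for repeatability; the original uses np.random.randint).
rng = np.random.default_rng(0)
x = rng.integers(0, 50, size=1000)

results = {}
for noise_max in (10, 50, 200):
    # y still rises with x, but the noise range grows on each pass.
    y = x + rng.integers(0, noise_max, size=1000)
    results[noise_max] = np.corrcoef(x, y)[0, 1]
    print(noise_max, results[noise_max])
```

The coefficient should stay positive throughout but move away from +1 as the noise range widens, which is what "weaker correlation" means in practice.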
Negative correlation
import numpy as np
import matplotlib.pyplot as plt
# Jupyter magic: render plots inline (remove this line when running as a plain script)
%matplotlib inline
plt.style.use('ggplot')
# 1000 random integers from 0 to 49 (the upper bound 50 is exclusive)
x = np.random.randint(0, 50, 1000)
# Negative correlation: y falls as x rises, plus integer noise from 0 to 9
y = 100 - x + np.random.randint(0, 10, 1000)
print(np.corrcoef(x,y))
plt.scatter(x, y)
plt.show()
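OUTPUT: the off-diagonal entries of the printed matrix are close to -1, and the scatter plot shows a band of points falling from upper left to lower right.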
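It may help to check which part of the expression for y produces the negative sign. The following sketch, which assumes a seeded np.random.default_rng generator and a few illustrative variable names not found in the original, compares three cases: y as defined above, the same y without the constant 100, and y with its sign flipped.

```python
import numpy as np

# Seeded generator (an assumption for repeatability).
rng = np.random.default_rng(1)
x = rng.integers(0, 50, size=1000)
noise = rng.integers(0, 10, size=1000)

y = 100 - x + noise
r = np.corrcoef(x, y)[0, 1]

# Dropping the constant offset does not change the coefficient.
r_no_offset = np.corrcoef(x, -x + noise)[0, 1]

# Negating y flips the sign of the coefficient and keeps its magnitude.
r_flipped = np.corrcoef(x, -y)[0, 1]

print(r, r_no_offset, r_flipped)
```

The negative sign comes from the -x term alone: adding a constant shifts the points without changing how they vary together.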
Weak or no correlation
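import numpy as np
import matplotlib.pyplot as plt
# Jupyter magic: render plots inline (remove this line when running as a plain script)
%matplotlib inline
plt.style.use('ggplot')
# 1000 random integers from 0 to 49 (the upper bound 50 is exclusive)
x = np.random.randint(0, 50, 1000)
# No correlation: y is drawn independently of x (integers from 0 to 29)
y = np.random.randint(0, 30, 1000)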
print(np.corrcoef(x,y))
plt.scatter(x, y)
plt.show()
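A coefficient near 0 rules out a linear relationship, not every relationship. As a small illustration (the data here is constructed for this purpose and is not from the original examples), the sketch below makes y completely determined by x through y = x ** 2 on a range of x symmetric around 0, and the Pearson coefficient still comes out at essentially zero.

```python
import numpy as np

# x is symmetric around 0, and y depends on x exactly, but not linearly.
x = np.arange(-10, 11)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(r)
```

This is why a scatter plot is worth drawing alongside the coefficient: the plot would show a clear U shape that the single number hides.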