Saturday, May 4, 2019

XII-IP : Quantile-Percentile-Quartile

Quantiles

  • In statistics and probability quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities.
  • The word “quantile” comes from the word quantity. In simple terms, a quantile is where a sample is divided into equal-sized, adjacent, subgroups
  • The median is a quantile; the median is placed in a probability distribution so that exactly half of the data is lower than the median and half of the data is above the median.
In a given dataset of size n, arrenged in ascending order of their values, we can obtain the position of the value at quantile q by the formula: pos = q(n+1)
Example: For the given data set of size n = 25
2,3,4,5,7,9,11,13,15,16,18,25,28, 30,45,50,65,80,85,88,90,92,95,99,100 Let us calculate .10, .25 .40, .50, .60 and .75 quantile values
pos1 = 0.10*(25+1)= 2.6 The nearest upper value is 3. Hence the quantile value of the given dataset is the element at 3rd position, which is 4.
pos2 = 0.25*(25+1)=6.5 The nearest upper value is 7. Hence the quantile value (1st quartile too) of the given dataset is the element at 7the position, which is 11
pos3 = 0.40*(25+1)=10.4 The nearest upper value is 11. Hence the quantile value of the given dataset is the element at 11th position, which is 18
pos4 = 0.50*(25+1)=13. Hence the quantile value (2nd Quartile or median) of the given dataset is the element at 13th position, which is 28
pos5 = 0.60*(25+1)=15.6 The nearest lower value is 15 Hence the quantile value of the given dataset is the element at 15th position, which is 45
pos6 = 0.75*(25+1)=19.5 The nearest lower value is 19 Hence the quantile value (3rd Quartile) of the given dataset is the element at 20th position, which is 85
Note: With interpolation as nearest, we take the upper integer value for calculating quantiles up to median. Beyond median we take the lower integer value. With other interpolation settings we may get different results

import pandas as pd
s = pd.Series ([2,3,4,5,7,9,11,13,15,16,18,25,28,30,45,50,65,80,85,88,90,92,95,99,100])
q = s.quantile([.1,.25,.4,.5,.6,.75], interpolation='nearest')
display(q)
output: 
0.10     4
0.25    11
0.40    18
0.50    28
0.60    45
0.75    85
dtype: int64

Quartiles

  • Quartiles in statistics are values that divide your data into quarters.
  • Quartiles are also quantiles; they divide the distribution into four equal parts.
  • To calculate quartiles we shall be using the percentile() function provided by numPy
2,3,4,5,7,9, 11, 13,15,16,18,25, 28, 30,45,50,65,80, 85, 88,90,92,95,99, 100
We can see from the examples given above for quantiles the pos2, pos4 and pos6 are actually calculating the three quartiles.

import numpy as np
q = np.percentile(s,[25,50,75])
q
output: 
array([11., 28., 85.])

Percentile

The pth percentile is the value in a dataset at which it can be split into two parts. The lower part contains p percent of the data, and the upper part consists of the remaining data; i.e., 100-p (the total data equates to 100%).

Calculating the pth percentile

We are going to use the following way to calculate percentile. Don't worry about the result. There are methods that may calculate the percentile differently with nearby values.
1) Arrange the data in the ascending order.
2) Calculate an index i (the position of the pth percentile) as follows:
i = (p / 100) * n
Where: p is the percentile and n is the number of values that appear in the data.
If i is not an integer, round it up. The next integer greater than i represents the position of the pth percentile. If i is an integer, the pth percentile is the average of the values in positions i and i + 1.




XII-IP: Re-indexing and Altering lables in Pandas

Reindexing

Reindexing in Pandas can be used to change the index of rows and columns of a DataFrame. To reindex means to conform the data to match a given set of labels 
import pandas as pd
df = pd.DataFrame([[10,5,50,8],[20,10,60,16],[30,15,70,24],[40,20,80,32],[50,25,90,40]],\
index=['A','B','C','D','E'], columns=['COL1','COL2','COL3','COL4'])
display(df)
df1=df.reindex(['C','B','D','E','A'])
display(df1)

output:

  COL1 COL2 COL3 COL4
A 10 5 50 8
B 20 10 60 16
C 30 15 70 24
D 40 20 80 32
E 50 25 90 40
  COL1 COL2 COL3 COL4
C 30 15 70 24
B 20 10 60 16
D 40 20 80 32
E 50 25 90 40
A 10 5 50 8

display(df)
df2=df.reindex(['C','B','Z','E','A'])
display(df2)

  COL1 COL2 COL3 COL4
A 10 5 50 8
B 20 10 60 16
C 30 15 70 24
D 40 20 80 32
E 50 25 90 40
  COL1 COL2 COL3 COL4
C 30.0 15.0 70.0 24.0
B 20.0 10.0 60.0 16.0
Z NaN NaN NaN NaN
E 50.0 25.0 90.0 40.0
A 10.0 5.0 50.0 8.0

We can fill in the missing values by passing a value to the argument fill_value. In the following example the NaNs are replaced with 0.

display(df)
df2=df.reindex(['C','B','Z','E','A'], fill_value=0)
display(df2)

  COL1 COL2 COL3 COL4
A 10 5 50 8
B 20 10 60 16
C 30 15 70 24
D 40 20 80 32
E 50 25 90 40
  COL1 COL2 COL3 COL4
C 30 15 70 24
B 20 10 60 16
Z 0 0 0 0
E 50 25 90 40
A 10 5 50 8
We can use the method parameter with values bfillffill or nearest to fill values from adjacent rows. See the following example: In this example the ffill value has replaced the NaNs in Z row with values from E row.
df2=df.reindex(['C','B','Z','E','A'], method='ffill')
display(df2)
  COL1 COL2 COL3 COL4
C 30 15 70 24
B 20 10 60 16
Z 50 25 90 40
E 50 25 90 40
A 10 5 50 8

Altering Labels in DataFrames

We can also reindex the columns by setting the parameter columns as shown in the following example. Notice that since the reindexed DataFrame has a new column label as COL5 which is not a part of original DataFrame, NaNs are displayed in that column

display(df)
df2=df.reindex(columns=['COL1','COL2','COL3','COL5'])
display(df2)

  COL1 COL2 COL3 COL4
A 10 5 50 8
B 20 10 60 16
C 30 15 70 24
D 40 20 80 32
E 50 25 90 40
  COL1 COL2 COL3 COL5
A 10 5 50 NaN
B 20 10 60 NaN
C 30 15 70 NaN
D 40 20 80 NaN
E 50 25 90 NaN

The same can be achieved by setting the axis parameter to columns as shown below

df2=df.reindex(['COL1','COL2','COL3','COL5'], axis='columns')
display(df2)

  COL1 COL2 COL3 COL5
A 10 5 50 NaN
B 20 10 60 NaN
C 30 15 70 NaN
D 40 20 80 NaN
E 50 25 90 NaN

We can use rename method to rename the columns in a DataFrame. We need to pass a dictionary containing the old and new names to the columns parameter of rename method

df = df.rename(columns={'COL1':'W','COL2':'X','COL3':'Y','COL4':'Z'})
df
  W X Y Z
A 10 5 50 8
B 20 10 60 16
C 30 15 70 24
D 40 20 80 32
E 50 25 90 40

XII-IP : Pipe in Pandas

Pipe() in pandas

pipe() method is used to chain a number of operations on a dataframe in a single compound statement. Its an effective way of reducing codelines and achieve a single long term objective from a data frame. (We may run pipes of series items too) In the given example we a doing the followiing things to achive the goal of getting the mean of the Points column in an extracted group from a dataframe.
  • Creating the group based on year - df.groupby('Year')
  • Filtering in a single group from the set of groups(get_group(2014))
  • Getting the mean of column Points - x['Points'].agg('mean') 

 coding:

import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
x = df.pipe(lambda x: df.groupby('Year')
           .pipe(lambda x: x.get_group(2014))
           .pipe(lambda x: x['Points'].agg('mean')))
x

output:
795.25

Another example of pipe

x = df.pipe(lambda x: df.groupby('Team')
           .pipe(lambda x: x.get_group('kings'))
           .pipe(lambda x: x['Points'].agg('sum')))
x

output:
812


XII-IP : Transform, applymap and apply in Pandas

Transform

Takes one function that is expected to be applied to a column and return a column of equal size. Pandas DataFrame.transform() function call func on self producing a DataFrame with transformed values and that has the same axis length as self.
import pandas as pd
d1 = {'rollno':[101,101,103,102,104], 'maths': [80,70,80,60,70],\
      'physics':[90,40,50,90,65],'chem':[75,80,60,85,60] }
df = pd.DataFrame(d1)
display(df)
r = df.transform(func=lambda x:x+10)
print('--------------Transform----------------')
display(r)

  rollno maths physics chem
0 101 80 90 75
1 101 70 40 80
2 103 80 50 60
3 102 60 90 85
4 104 70 65 60
--------------Transform----------------
  rollno maths physics chem
0 111 90 100 85
1 111 80 50 90
2 113 90 60 70
3 112 70 100 95
4 114 80 75 70


Applymap

This takes a function and returns a new dataframe with the results of that function being applied to the value in each cell and replacing the value of the cell with the result.

import numpy as np
s = df.applymap(np.sqrt)
print('--------------applymap----------------')
display(s)

--------------applymap----------------
  rollno     maths    physics    chem
0 10.049876 8.944272 9.486833 8.660254
1 10.049876 8.366600 6.324555 8.944272
2 10.148892 8.944272 7.071068 7.745967
3 10.099505 7.745967 9.486833 9.219544
4 10.198039 8.366600 8.062258 7.745967

Apply

Takes a function and applies it either row-wise or column-wise depending on the value of axis parameter. Apply may give the same result as transfom() or applymap() with parameter set to non-aggregate function.

print('----------apply behaving as applymap-----------')
x = df.apply(np.sqrt)
display(x)
print('----------apply with axis =0-----------')
t = df.apply(np.mean)
display(t)
print('----------apply with axis =1-----------')
t = df.apply(np.mean, axis=1)
display(t)
output:

----------apply behaving as applymap-----------
     rollno maths     physics chem
0 10.049876 8.944272 9.486833 8.660254
1 10.049876 8.366600 6.324555 8.944272
2 10.148892 8.944272 7.071068 7.745967
3 10.099505 7.745967 9.486833 9.219544
4 10.198039 8.366600 8.062258 7.745967
----------apply with axis =0-----------
rollno     102.2
maths       72.0
physics     67.0
chem        72.0
dtype: float64
----------apply with axis =1-----------
0    86.50
1    72.75
2    73.25
3    84.25
4    74.75
dtype: float64

Another example of applymap and transform
d=df.transform(func=lambda x:x+10)
display(d)
e=df.applymap(lambda x:x+10)
display(e)

output:
  rollno maths physics chem
0 111 90 100 85
1 111 80 50 90
2 113 90 60 70
3 112 70 100 95
4 114 80 75 70
  rollno maths physics chem
0 111 90 100 85
1 111 80 50 90
2 113 90 60 70
3 112 70 100 95
4 114 80 75 70