Quantiles
- In statistics and probability, quantiles are cut points that divide the range of a probability distribution into continuous intervals with equal probabilities.
- The word “quantile” comes from the word “quantity”. In simple terms, quantiles divide a sample into equal-sized, adjacent subgroups.
- The median is a quantile: it sits in a probability distribution so that half of the data lies below it and half lies above it.
For a dataset of size n, arranged in ascending order, the position of the value at quantile q is given by the formula: pos = q(n+1)
Example: Consider the following dataset of size n = 25:
2, 3, 4, 5, 7, 9, 11, 13, 15, 16, 18, 25, 28, 30, 45, 50, 65, 80, 85, 88, 90, 92, 95, 99, 100
Let us calculate the 0.10, 0.25, 0.40, 0.50, 0.60 and 0.75 quantile values.
pos1 = 0.10*(25+1) = 2.6. The nearest upper value is 3. Hence the quantile value of the given dataset is the element at the 3rd position, which is 4.
pos2 = 0.25*(25+1) = 6.5. The nearest upper value is 7. Hence the quantile value (also the 1st quartile) of the given dataset is the element at the 7th position, which is 11.
pos3 = 0.40*(25+1) = 10.4. The nearest upper value is 11. Hence the quantile value of the given dataset is the element at the 11th position, which is 18.
pos4 = 0.50*(25+1) = 13. Hence the quantile value (2nd quartile, or median) of the given dataset is the element at the 13th position, which is 28.
pos5 = 0.60*(25+1) = 15.6. The nearest lower value is 15. Hence the quantile value of the given dataset is the element at the 15th position, which is 45.
pos6 = 0.75*(25+1) = 19.5. The nearest lower value is 19. Hence the quantile value (3rd quartile) of the given dataset is the element at the 19th position, which is 85.
Note: With interpolation set to 'nearest', we round the position up to the next integer for quantiles up to the median, and down to the previous integer for quantiles beyond the median. Other interpolation settings may give different results.
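The position rule above can be sketched in plain Python. The helper name `quantile_by_position` is hypothetical and not part of pandas. It simply follows the formula pos = q(n+1) together with the rounding convention from the note, and it clamps the position to the valid range as an added assumption.

```python
import math

data = [2, 3, 4, 5, 7, 9, 11, 13, 15, 16, 18, 25, 28,
        30, 45, 50, 65, 80, 85, 88, 90, 92, 95, 99, 100]

def quantile_by_position(values, q):
    """Hypothetical helper: return the element at position q * (n + 1)."""
    ordered = sorted(values)
    n = len(ordered)
    pos = q * (n + 1)  # 1-based position, possibly fractional
    # Convention from the note: round up at or below the median, down beyond it.
    pos = math.ceil(pos) if q <= 0.5 else math.floor(pos)
    pos = min(max(pos, 1), n)  # assumption: clamp to a valid position
    return ordered[pos - 1]    # convert the 1-based position to a 0-based index

for q in (0.10, 0.25, 0.40, 0.50, 0.60, 0.75):
    print(q, quantile_by_position(data, q))
```

For the six quantiles in the example this yields 4, 11, 18, 28, 45 and 85.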
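import pandas as pd
# Build a Series from the example dataset
s = pd.Series([2,3,4,5,7,9,11,13,15,16,18,25,28,30,45,50,65,80,85,88,90,92,95,99,100])
# Compute several quantiles at once, picking the nearest actual data point
q = s.quantile([.1,.25,.4,.5,.6,.75], interpolation='nearest')
print(q)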
output:
0.10 4
0.25 11
0.40 18
0.50 28
0.60 45
0.75 85
dtype: int64
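To see how the interpolation setting changes the result, here is a small sketch that evaluates the 0.10 quantile of the same dataset with each option accepted by `Series.quantile`. The dictionary name `results` is only illustrative. For this dataset the 0.10 quantile falls between the values 4 and 5, so the options that interpolate return a value between the two.

```python
import pandas as pd

s = pd.Series([2, 3, 4, 5, 7, 9, 11, 13, 15, 16, 18, 25, 28,
               30, 45, 50, 65, 80, 85, 88, 90, 92, 95, 99, 100])

# Evaluate the same quantile under every interpolation option
results = {}
for method in ("linear", "lower", "higher", "midpoint", "nearest"):
    results[method] = float(s.quantile(0.10, interpolation=method))
    print(method, results[method])
```

Here 'lower' and 'nearest' return 4, 'higher' returns 5, 'midpoint' returns 4.5, and 'linear' returns approximately 4.4.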
Quartiles
- Quartiles in statistics are values that divide your data into quarters.
- Quartiles are also quantiles; they divide the distribution into four equal parts.
- To calculate quartiles, we will use the percentile() function provided by NumPy.
2,3,4,5,7,9, 11, 13,15,16,18,25, 28, 30,45,50,65,80, 85, 88,90,92,95,99, 100
As the quantile examples above show, pos2, pos4 and pos6 correspond to the three quartiles: 11, 28 and 85.
import numpy as np
q = np.percentile(s,[25,50,75])
q
output:
array([11., 28., 85.])
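Since quartiles are simply the 0.25, 0.50 and 0.75 quantiles, the same values can also be obtained with `np.quantile`, which takes fractions instead of percentages. The sketch below rebuilds the dataset as a plain list so that it runs on its own; the variable names are illustrative.

```python
import numpy as np

data = [2, 3, 4, 5, 7, 9, 11, 13, 15, 16, 18, 25, 28,
        30, 45, 50, 65, 80, 85, 88, 90, 92, 95, 99, 100]

# Percentiles are expressed on a 0-100 scale ...
quartiles_from_percentile = np.percentile(data, [25, 50, 75])
# ... while quantiles use a 0-1 scale; both describe the same cut points
quartiles_from_quantile = np.quantile(data, [0.25, 0.50, 0.75])

print(quartiles_from_percentile)
print(quartiles_from_quantile)
```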
Percentile
The pth percentile is the value that splits a dataset into two parts: the lower part contains p percent of the data, and the upper part contains the remaining 100-p percent (the total data equates to 100%).
Calculating the pth percentile
We will use the following method to calculate a percentile. Other methods may calculate the percentile differently and return nearby values.
1) Arrange the data in ascending order.
2) Calculate an index i (the position of the pth percentile) as follows:
i = (p / 100) * n
Where: p is the percentile and n is the number of values that appear in the data.
If i is not an integer, round it up: the next integer greater than i is the position of the pth percentile. If i is an integer, the pth percentile is the average of the values in positions i and i + 1.
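The steps above can be sketched as a short Python function. The name `percentile_by_index` is hypothetical. Restricting p to the open interval (0, 100) is an added assumption, since the rule does not say what to do when position i or i + 1 would fall outside the data. The index is computed as p * n / 100, which is algebraically the same as (p / 100) * n but avoids some floating-point rounding.

```python
import math

def percentile_by_index(values, p):
    """Hypothetical helper implementing the index rule described above."""
    if not 0 < p < 100:
        # Assumption: the rule is only applied strictly between 0 and 100
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)        # step 1: ascending order
    n = len(ordered)
    i = p * n / 100                 # step 2: index, same as (p / 100) * n
    if float(i).is_integer():
        i = int(i)
        # Average of the values at 1-based positions i and i + 1
        return (ordered[i - 1] + ordered[i]) / 2
    # Not an integer: round up and take the value at that 1-based position
    return ordered[math.ceil(i) - 1]

data = [2, 3, 4, 5, 7, 9, 11, 13, 15, 16, 18, 25, 28,
        30, 45, 50, 65, 80, 85, 88, 90, 92, 95, 99, 100]

print(percentile_by_index(data, 20))  # i = 5, so average of the 5th and 6th values
print(percentile_by_index(data, 50))  # i = 12.5, so the 13th value
```

With n = 25, the 20th percentile is (7 + 9) / 2 = 8.0 and the 50th percentile is 28.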