Aggregation in Pandas
An essential component of data analysis is efficient data summerization. This is achieved in Python Pandas using various aggregate functions like sum(), min(), max(), mean(), mode(), median(), variance() and standard deviation. All these functions gives a single value for a large dataset. Aggreation accompanied with groupby (grouping of data) gives further insight of data analysis.
import pandas as pd
from IPython.display import display
d1={'rollno':[101,101,103,102,104], 'name': ['Pat','Sid','Tom','Kim','Ray'],\
'physics':[90,40,50,90,65],'chem':[75,80,60,85,60] }
df = pd.DataFrame(d1)
display(df)
print('--------Basic aggregate functuions min(), max(), sum() and mean()')
print('minimum is:',df['physics'].min())
print('maximum is:',df['physics'].max())
print('sum is:',df['physics'].sum())
print('average is:',df['physics'].mean())
OUTPUT:
rollno name physics chem
0 101 Pat 90 75
1 101 Sid 40 80
2 103 Tom 50 60
3 102 Kim 90 85
4 104 Ray 65 60
--------Basic aggregate functuions min(), max(), sum() and mean()
minimum is: 40
maximum is: 90
sum is: 335
average is: 67.0
Median and mode
Median:
Median is the middle number in a sorted list of numbers. To determine the median value in a sequence of numbers, the numbers must first be arranged in value order from lowest to highest. For the given list 30, 20, 50, 40 we need to arrange the list in ascending order as 20, 30, 40, 50 and then find the middle value, which is 45 in this case. The point to be noted is that the median may not be the part of the dataset
print('Median is:',df['physics'].median())
print('Mode is:',df['physics'].mode())
OUTPUT:
Median is: 65.0
Mode is: 0 90
dtype: int64
Variance and standard deviation
In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers are spread out from their average value. The variance and the closely-related standard deviation are measures of how spread out a distribution is.
print('variance is:',df['physics'].var())
print('Standard Deviation is:',df['physics'].std())
OUTPUT:
variance is: 520.0
Standard Deviation is: 22.80350850198276
Describe method
- Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
- The outut will vary depending on what is provided.
- For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
- For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq.
Agg() function in pandas
agg() function is used to find Aggregate using one or more operations over the specified axis. Let us see the following example:
df.agg(['sum', 'min'])
rollno name physics chem
sum 511 PatSidTomKimRay 335 360
min 101 Kim 40 60
In the given example above you can see that the sum() function has joined all the four names and the min() function has resulted in giving the alphabetically first name in the Data set(Kim). The column name should not be used to take part in the agg() operation. We can get a better result by using the following practice:
df.agg({'physics' : ['sum', 'min'], 'chem' : ['min', 'max']})
physics chem
max NaN 85.0
min 40.0 60.0
sum 335.0 NaN
The agg() function can also be applied column wise by changing the default axis from rows to columns. See the following example:
df = pd.DataFrame([[1, 2, 3],\
[4, 5, 6],\
[7, 8, 9]],columns=['A', 'B', 'C'])
df.agg("mean", axis="columns")
0 2.0
1 5.0
2 8.0
dtype: float64
Group By (split-apply-combine):
By “group by” we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
In many situations we may wish to split the data set into groups and do something with those groups. In the apply step, we might wish do to one of the following:
- Aggregation:compute a summary statistic (or statistics) for each group. (mean, sum)
- Transformation:perform some group-specific computations and return a like-indexed object.
- Filteration:discard some groups, according to a group-wise computation that evaluates True or False.
Splitting an object into groups
pandas objects can be split on any of their axes. There are multiple ways to split an object like −
- obj.groupby('key')
- obj.groupby(['key1','key2'])
- obj.groupby(key,axis=1)
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
df
output
Team Rank Year Points
0 Riders 1 2014 876
1 Riders 2 2015 789
2 Devils 2 2014 863
3 Devils 3 2015 673
4 Kings 3 2014 741
5 kings 4 2015 812
6 Kings 1 2016 756
7 Kings 1 2017 788
8 Riders 2 2016 694
9 Royals 4 2014 701
10 Royals 1 2015 804
11 Riders 2 2017 690
Let us group the above dataframe on the column year and try to see one of the groups. We are going to use the functions groupby() to create the groups and get_group() to extract one of the groups.
grpyear = df.groupby('Year')
g1 = grpyear.get_group(2014)
g1
Team Rank Year Points
0 Riders 1 2014 876
2 Devils 2 2014 863
4 Kings 3 2014 741
9 Royals 4 2014 701
Next we are going to apply an aggregate function on the the group displayed above.
g1['Points'].agg('mean')
795.25
Now we are going to apply multiple aggregate functions to the created group
g1['Points'].agg(['mean','max','var'])
mean 795.250000
max 876.000000
var 7645.583333
Name: Points, dtype: float64
If we want to see all the groups created based on a given key, we need to iterate over the grouped dataset:
for key, data in grpyear:
display(key)
display(data)
output:
2014
Team Rank Year Points
0 Riders 1 2014 876
2 Devils 2 2014 863
4 Kings 3 2014 741
9 Royals 4 2014 701
2015
Team Rank Year Points
1 Riders 2 2015 789
3 Devils 3 2015 673
5 kings 4 2015 812
10 Royals 1 2015 804
2016
Team Rank Year Points
6 Kings 1 2016 756
8 Riders 2 2016 694
2017
Team Rank Year Points
7 Kings 1 2017 788
11 Riders 2 2017 690