Aggregation in Pandas

An essential component of data analysis is efficient data summerization. This is achieved in Python Pandas using various aggregate functions like sum(), min(), max(), mean(), mode(), median(), variance() and standard deviation. All these functions gives a single value for a large dataset. Aggreation accompanied with groupby (grouping of data) gives further insight of data analysis.

import pandas as pd

from IPython.display import display

d1={'rollno':[101,101,103,102,104], 'name': ['Pat','Sid','Tom','Kim','Ray'],\

'physics':[90,40,50,90,65],'chem':[75,80,60,85,60] }

df = pd.DataFrame(d1)

display(df)

print('--------Basic aggregate functuions min(), max(), sum() and mean()')

print('minimum is:',df['physics'].min())

print('maximum is:',df['physics'].max())

print('sum is:',df['physics'].sum())

print('average is:',df['physics'].mean())

OUTPUT:

rollno name physics chem

0 101 Pat 90 75

1 101 Sid 40 80

2 103 Tom 50 60

3 102 Kim 90 85

4 104 Ray 65 60

--------Basic aggregate functuions min(), max(), sum() and mean()

minimum is: 40

maximum is: 90

sum is: 335

average is: 67.0

Median and mode

Median:

Median is the middle number in a sorted list of numbers. To determine the median value in a sequence of numbers, the numbers must first be arranged in value order from lowest to highest. For the given list 30, 20, 50, 40 we need to arrange the list in ascending order as 20, 30, 40, 50 and then find the middle value, which is 45 in this case. The point to be noted is that the median may not be the part of the dataset

print('Median is:',df['physics'].median())

print('Mode is:',df['physics'].mode())

OUTPUT:

Median is: 65.0 Mode is: 0 90 dtype: int64

Variance and standard deviation

In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers are spread out from their average value. The variance and the closely-related standard deviation are measures of how spread out a distribution is.

print('variance is:',df['physics'].var())

print('Standard Deviation is:',df['physics'].std())

OUTPUT:

variance is: 520.0 Standard Deviation is: 22.80350850198276

Describe method

Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
The outut will vary depending on what is provided.
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq.

Agg() function in pandas

agg() function is used to find Aggregate using one or more operations over the specified axis. Let us see the following example:

df.agg(['sum', 'min'])

rollno name physics chem

sum 511 PatSidTomKimRay 335 360

min 101 Kim 40 60

In the given example above you can see that the sum() function has joined all the four names and the min() function has resulted in giving the alphabetically first name in the Data set(Kim). The column name should not be used to take part in the agg() operation. We can get a better result by using the following practice:

df.agg({'physics' : ['sum', 'min'], 'chem' : ['min', 'max']})

physics chem

max NaN 85.0

min 40.0 60.0

sum 335.0 NaN

The agg() function can also be applied column wise by changing the default axis from rows to columns. See the following example:

df = pd.DataFrame([[1, 2, 3],\

[4, 5, 6],\

[7, 8, 9]],columns=['A', 'B', 'C'])

df.agg("mean", axis="columns")

0 2.0 1 5.0 2 8.0 dtype: float64

Group By (split-apply-combine):

By “group by” we are referring to a process involving one or more of the following steps:

Splitting the data into groups based on some criteria.
Applying a function to each group independently.
Combining the results into a data structure.

In many situations we may wish to split the data set into groups and do something with those groups. In the apply step, we might wish do to one of the following:

Aggregation:compute a summary statistic (or statistics) for each group. (mean, sum)
Transformation:perform some group-specific computations and return a like-indexed object.
Filteration:discard some groups, according to a group-wise computation that evaluates True or False.

Splitting an object into groups

pandas objects can be split on any of their axes. There are multiple ways to split an object like −

obj.groupby('key')
obj.groupby(['key1','key2'])
obj.groupby(key,axis=1)

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',

'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],

'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],

'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],

'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}

df = pd.DataFrame(ipl_data)

output

  Team Rank Year Points

0 Riders 1 2014 876

1 Riders 2 2015 789

2 Devils 2 2014 863

3 Devils 3 2015 673

4 Kings 3 2014 741

5 kings 4 2015 812

6 Kings 1 2016 756

7 Kings 1 2017 788

8 Riders 2 2016 694

9 Royals 4 2014 701

10 Royals 1 2015 804

11 Riders 2 2017 690

Let us group the above dataframe on the column year and try to see one of the groups. We are going to use the functions groupby() to create the groups and get_group() to extract one of the groups.

grpyear = df.groupby('Year')

g1 = grpyear.get_group(2014)

g1

  Team Rank Year Points

0 Riders 1 2014 876

2 Devils 2 2014 863

4 Kings 3 2014 741

9 Royals 4 2014 701

Next we are going to apply an aggregate function on the the group displayed above.

g1['Points'].agg('mean')

795.25

Now we are going to apply multiple aggregate functions to the created group

g1['Points'].agg(['mean','max','var'])

mean 795.250000

max 876.000000

var 7645.583333

Name: Points, dtype: float64

If we want to see all the groups created based on a given key, we need to iterate over the grouped dataset:

for key, data in grpyear:

display(key)

display(data)

output:

2014

Team Rank Year Points

0 Riders 1 2014 876

2 Devils 2 2014 863

4 Kings 3 2014 741

9 Royals 4 2014 701

2015

Team Rank Year Points

1 Riders 2 2015 789

3 Devils 3 2015 673

5 kings 4 2015 812

10 Royals 1 2015 804

2016

Team Rank Year Points

6 Kings 1 2016 756

8 Riders 2 2016 694

2017

Team Rank Year Points

7 Kings 1 2017 788

11 Riders 2 2017 690

CBSE- Informatics Practices (IP)

Monday, April 29, 2019

XII-IP : Aggregation in Pandas

Aggregation in Pandas

Median and mode

Median:

Variance and standard deviation

Describe method

Agg() function in pandas

Group By (split-apply-combine):

Splitting an object into groups

No comments:

Post a Comment