Monday, April 29, 2019

XII-IP : Aggregation in Pandas

Aggregation in Pandas

An essential component of data analysis is efficient data summarization. This is achieved in Python Pandas using various aggregate functions like sum(), min(), max(), mean(), mode(), median(), var() (variance) and std() (standard deviation). Each of these functions returns a single value that summarizes a large dataset. Aggregation combined with groupby (grouping of data) gives further insight during data analysis.
import pandas as pd
from IPython.display import display
d1={'rollno':[101,101,103,102,104], 'name': ['Pat','Sid','Tom','Kim','Ray'],\
      'physics':[90,40,50,90,65],'chem':[75,80,60,85,60] }
df = pd.DataFrame(d1)
display(df)
print('--------Basic aggregate functions min(), max(), sum() and mean()')
print('minimum is:',df['physics'].min())
print('maximum is:',df['physics'].max())
print('sum is:',df['physics'].sum())
print('average is:',df['physics'].mean())

OUTPUT:
   rollno name  physics  chem
0     101  Pat       90    75
1     101  Sid       40    80
2     103  Tom       50    60
3     102  Kim       90    85
4     104  Ray       65    60
--------Basic aggregate functions min(), max(), sum() and mean()
minimum is: 40
maximum is: 90
sum is: 335
average is: 67.0

Median and mode

Median:

Median is the middle number in a sorted list of numbers. To determine the median value in a sequence of numbers, the numbers must first be arranged in order from lowest to highest. For the given list 30, 20, 50, 40 we arrange the list in ascending order as 20, 30, 40, 50 and then find the middle value. Since the list has an even number of items, the median is the mean of the two middle values, (30 + 40) / 2 = 35. The point to be noted is that the median may not itself be a member of the dataset.

print('Median is:',df['physics'].median())
print('Mode is:',df['physics'].mode())
OUTPUT:
Median is: 65.0
Mode is: 0    90
dtype: int64

Variance and standard deviation

In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers is spread out from its average value. The variance and the closely related standard deviation are measures of how spread out a distribution is.

print('variance is:',df['physics'].var())
print('Standard Deviation is:',df['physics'].std())

OUTPUT:
variance is: 520.0
Standard Deviation is: 22.80350850198276

Describe method


  • Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
  • The output will vary depending on the type of data provided.
  • For numeric data, the result’s index will include count, mean, std, min and max, as well as the lower, 50% and upper percentiles. By default the lower percentile is 25% and the upper percentile is 75%. The 50% percentile is the same as the median.
  • For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq.
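As a sketch, applying describe() to the marks DataFrame used earlier in this post would summarize its numeric columns:

```python
import pandas as pd

df = pd.DataFrame({'rollno': [101, 101, 103, 102, 104],
                   'name': ['Pat', 'Sid', 'Tom', 'Kim', 'Ray'],
                   'physics': [90, 40, 50, 90, 65],
                   'chem': [75, 80, 60, 85, 60]})

# describe() summarizes only the numeric columns by default;
# the result's index holds count, mean, std, min, 25%, 50%, 75%, max
stats = df.describe()
print(stats)
```

Passing include='object' (or include='all') would bring in the count/unique/top/freq summary for the name column as well.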

Agg() function in pandas

agg() function is used to find Aggregate using one or more operations over the specified axis. Let us see the following example:
df.agg(['sum', 'min'])
        rollno             name  physics  chem
sum        511  PatSidTomKimRay      335   360
min        101              Kim       40    60
In the example above you can see that sum() has concatenated all five names and min() has returned the alphabetically first name in the dataset (Kim). The name column should not take part in numeric aggregation; we get a more meaningful result by specifying the operations per column:

df.agg({'physics' : ['sum', 'min'], 'chem' : ['min', 'max']})
     physics  chem
max      NaN  85.0
min     40.0  60.0
sum    335.0   NaN

The agg() function can also be applied column wise by changing the default axis from rows to columns. See the following example:
df = pd.DataFrame([[1, 2, 3],\
                  [4, 5, 6],\
                  [7, 8, 9]],columns=['A', 'B', 'C'])
df.agg("mean", axis="columns")
0    2.0
1    5.0
2    8.0
dtype: float64

Group By (split-apply-combine):

By “group by” we are referring to a process involving one or more of the following steps:
  • Splitting the data into groups based on some criteria.
  • Applying a function to each group independently.
  • Combining the results into a data structure.
In many situations we may wish to split the data set into groups and do something with those groups. In the apply step, we might wish to do one of the following:
  • Aggregation:compute a summary statistic (or statistics) for each group. (mean, sum)
  • Transformation:perform some group-specific computations and return a like-indexed object.
  • Filtration: discard some groups, according to a group-wise computation that evaluates to True or False.

Splitting an object into groups

pandas objects can be split on any of their axes. There are multiple ways to split an object, for example:
  • obj.groupby('key')
  • obj.groupby(['key1','key2'])
  • obj.groupby(key,axis=1)


ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
   'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
   'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
   'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
   'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
df

output

      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
2   Devils     2  2014     863
3   Devils     3  2015     673
4    Kings     3  2014     741
5    kings     4  2015     812
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
9   Royals     4  2014     701
10  Royals     1  2015     804
11  Riders     2  2017     690

Let us group the above dataframe on the Year column and try to see one of the groups. We are going to use the function groupby() to create the groups and get_group() to extract one of them.

grpyear = df.groupby('Year')
g1 = grpyear.get_group(2014)
g1

     Team  Rank  Year  Points
0  Riders     1  2014     876
2  Devils     2  2014     863
4   Kings     3  2014     741
9  Royals     4  2014     701

Next we are going to apply an aggregate function on the group displayed above.

g1['Points'].agg('mean')
795.25

Now we are going to apply multiple aggregate functions to the created group

g1['Points'].agg(['mean','max','var'])


mean     795.250000
max      876.000000
var     7645.583333
Name: Points, dtype: float64

If we want to see all the groups created based on a given key, we need to iterate over the grouped dataset:
for key, data in grpyear:
  display(key)
  display(data)

output:


2014
      Team  Rank  Year  Points
0   Riders     1  2014     876
2   Devils     2  2014     863
4    Kings     3  2014     741
9   Royals     4  2014     701

2015
      Team  Rank  Year  Points
1   Riders     2  2015     789
3   Devils     3  2015     673
5    kings     4  2015     812
10  Royals     1  2015     804

2016
     Team  Rank  Year  Points
6   Kings     1  2016     756
8  Riders     2  2016     694

2017
      Team  Rank  Year  Points
7    Kings     1  2017     788
11  Riders     2  2017     690



Wednesday, April 17, 2019

XII-IP : CHAPTER- 1 ADVANCED OPERATIONS ON DATA-FRAMES


SESSION - 2019-20

CHAPTER- 1  ADVANCED OPERATIONS ON DATA-FRAMES

    Pivoting a Data-Frame :

Data-Frame - It is a 2-dimensional data structure with columns of potentially different types. It is similar to a spreadsheet, an SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.

pivot()

The pivot function is used to create a new derived table out of a given table. Pivot takes three arguments with the following names: index, columns, and values. For each of these parameters we need to specify a column name from the original table. The pivot function will create a new table whose row and column indices are the unique values of the respective parameters. The cell values of the new table are taken from the column given as the values parameter. Let us understand the concept using the following example:


As you see in the resultant reshaped DataFrame on the right, the index has changed to Player and the column headers have changed to Club. Please observe that the non-matching values are replaced with NaN (Not a Number), thus changing the dtype of the resultant DataFrame to float64.
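The original illustration is an image and is not reproduced here; a minimal sketch with assumed sample data (the Player, Club and Contract column names follow the example described in the text) looks like this:

```python
import pandas as pd

# Assumed illustrative data; the names and figures are not from the original image
df = pd.DataFrame({'Player': ['Dhoni', 'Kohli', 'Yuvi'],
                   'Club': ['CSK', 'RCB', 'MI'],
                   'Contract': [150, 170, 90]})

# Reshape: unique Player values become the index, unique Club values the columns
p = df.pivot(index='Player', columns='Club', values='Contract')
print(p)
# Non-matching Player/Club combinations become NaN, so the dtype is float64
```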

pivot() with multiple values
We can use a list of column headers as values to give further detail to the reshaped DataFrame. In the following illustration Contract and Bonus are passed as a list to the values parameter.




Note: Observe that all indices and columns are automatically sorted alphabetically
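Since the illustration is an image, here is a sketch of the same idea with the assumed data from before plus a Bonus column:

```python
import pandas as pd

# Assumed illustrative data, extended with a Bonus column
df = pd.DataFrame({'Player': ['Dhoni', 'Kohli', 'Yuvi'],
                   'Club': ['CSK', 'RCB', 'MI'],
                   'Contract': [150, 170, 90],
                   'Bonus': [20, 25, 10]})

# Passing a list to values produces hierarchical (MultiIndex) columns,
# one block per value column
p = df.pivot(index='Player', columns='Club', values=['Contract', 'Bonus'])
print(p)
```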

When the pivot() function fails

We have seen above that the pivot function takes at least two column names as parameters, for index and columns. What will happen if there is more than one row with the same values for these columns? How will the pivot function determine the value of the corresponding cell in the pivoted table? The following diagram depicts the problem:


In this example we have two rows with the same values (“Yuvi” and “MI”) for the Player and Club columns. The pivot method cannot know what the value of the corresponding cell in the pivoted table should be. Thus, it throws an exception with the following message:
ValueError: Index contains duplicate entries, cannot reshape
Hence, before calling pivot we need to ensure that our data does not have rows with duplicate values for the specified columns. If we cannot ensure this, we may have to use the pivot_table() function instead.
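The failure can be sketched with two assumed duplicate rows; pivot() raises the ValueError quoted above:

```python
import pandas as pd

# Two rows share the same ('Yuvi', 'MI') pair, so pivot() cannot reshape
df = pd.DataFrame({'Player': ['Yuvi', 'Yuvi'],
                   'Club': ['MI', 'MI'],
                   'Contract': [90, 95]})

try:
    df.pivot(index='Player', columns='Club', values='Contract')
except ValueError as e:
    print(e)  # the message mentions duplicate entries
```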

pivot_table()

The pivot_table() function removes the ambiguity problem in pivot() by aggregating the values for rows with duplicate entries. This is depicted in the following example.



It can be seen that the corresponding values from a set of two matching rows are aggregated to their mean value in the reshaped table, and thus the ambiguity is resolved.
We can modify the default mean behaviour of the pivot_table() method by using the parameter aggfunc to achieve some other aggregation like sum, min, max, median etc. The following example illustrates the use of the aggfunc parameter with the sum option. In order to use np.sum we need to import the numpy module into our project.
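As the illustration is an image, a sketch with assumed duplicate rows shows how aggfunc changes the aggregation (the string 'sum' is equivalent to passing np.sum):

```python
import pandas as pd

# Assumed data with a duplicate (Yuvi, MI) pair
df = pd.DataFrame({'Player': ['Yuvi', 'Yuvi', 'Dhoni'],
                   'Club': ['MI', 'MI', 'CSK'],
                   'Contract': [90, 100, 150]})

# aggfunc='sum' replaces the default mean aggregation,
# so the duplicate pair is summed instead of averaged
p = df.pivot_table(index='Player', columns='Club',
                   values='Contract', aggfunc='sum')
print(p)
```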



Now we are going to work with an actual dataFrame to demonstrate various features of pivot_table( ) method.
  • Using the fill_value parameter to remove NaN from the reshaped table: the NaNs are replaced with 0s, and the dtype also changes back to integer.
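A sketch of the fill_value behaviour, again with assumed data:

```python
import pandas as pd

# Assumed illustrative data: each player belongs to only one club
df = pd.DataFrame({'Player': ['Yuvi', 'Dhoni'],
                   'Club': ['MI', 'CSK'],
                   'Contract': [90, 150]})

# fill_value=0 replaces the NaNs produced for non-matching Player/Club pairs
p = df.pivot_table(index='Player', columns='Club',
                   values='Contract', aggfunc='sum', fill_value=0)
print(p)
```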

  • Next we are going to create a reshaped table with two aggregate functions on one value:
import numpy as np
p = df.pivot_table(index='Player', columns='Club',\
       values='Contract', aggfunc=[np.sum, np.mean], fill_value=0)
display(p)

Using stack() and unstack() methods
       stack() and unstack() both flip the layout of a DataFrame: they move column levels into row levels and vice versa. Stacking a DataFrame means moving the innermost column index to the innermost row index, and the opposite action is known as unstacking.
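A minimal sketch of the two methods, using assumed marks data:

```python
import pandas as pd

df = pd.DataFrame([[90, 75], [40, 80]],
                  index=['Pat', 'Sid'],
                  columns=['physics', 'chem'])

# stack(): the innermost column level (the subject names) moves into the row index,
# producing a Series with a two-level (name, subject) index
s = df.stack()
print(s)

# unstack() reverses the operation, restoring the original layout
print(s.unstack())
```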




Sorting of DataFrames
The data of a DataFrame can be sorted by the values in its rows and columns.
By default, sorting is done on row labels in ascending order.
Pandas DataFrames have two useful sort functions:
sort_values(): sorts the data of a given column in ascending or descending order.
sort_index(): sorts by row labels (axis=0) or column labels (axis=1).
Their syntax is as follows:
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False)
DataFrame.sort_index(axis=0, ascending=True, inplace=False)
Here
by: the column (or list of columns) to sort by.
axis: passing 0 means sorting will be done row-wise and 1 means column-wise.
ascending: True by default.
inplace: False by default; set it to True if you do not want a new DataFrame returned.
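The two functions can be sketched on a small assumed DataFrame with shuffled row labels:

```python
import pandas as pd

# Assumed data; the index is deliberately out of order
df = pd.DataFrame({'name': ['Tom', 'Kim', 'Ray'],
                   'physics': [50, 90, 65]}, index=[2, 0, 1])

# sort_values(): order the rows by the marks in the physics column
print(df.sort_values(by='physics', ascending=False))

# sort_index(): order the rows by the row labels instead of the data
print(df.sort_index())
```

Neither call modifies df itself; each returns a new sorted DataFrame unless inplace=True is passed.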