Wednesday, April 17, 2019

XII-IP : CHAPTER- 1 ADVANCE OPERATIONS ON DATA-FRAMES


SESSION - 2019-20

CHAPTER- 1  ADVANCE OPERATIONS ON DATA-FRAMES

    Pivoting - Data-frame : 

Data-Frame - It is a 2-dimensional data structure with  columns of different types. It is just similar to spreadsheet or SQL table, or a dict of Series objects. It  is generally the most commonly used pandas object.

pivot()

The pivot function is used to create a new derived table out of a given table. Pivot takes 3 arguments with the following names: indexcolumns, and values. For each of these parameters we need to specify a column name from the original table. The pivot function will create a new table, whose row and column indices are the unique values of the respective parameters. The cell values of the new table are taken from column given as the values parameter. Let us understand the concept using the following example:


As you see in the resultant reshaped DataFrame on the right, the index has changed to Player and the Column headers have changed to Club. Please observe that the non-matching values are replaced with NaN(not a number), thus changing the dtype of the resultant DataFrame to float64

### pivot() with multiple values
We can use a list of column headers as values to give further details to the reshaped DataFrame. See in the following illustration the Contract and Bonus are passed as a list to the value parameter.




Note: Observe that all indices and columns are automatically sorted alphabetically

### When the pivot() function fails

We have seen above that pivot function takes at least 2 column names as parameters for index and columns . What will happen if there are more than one row with the same values for these columns? How will the pivot function determine the value of the corresponding cell in the pivoted table? The following diagram depicts the problem:


In this example we have two rows with the same values (“Yuvi” and “MI”) for the Player and Club columns. The pivot method can not know what should be the value of the corresponding value in the pivoted table. Thus, it throws an exception with the following message:
ValueError: Index contains duplicate entries, cannot reshape
Hence, before calling pivot we need to ensure that our data does not have rows with duplicate values for the specified columns. If we can’t ensure this we may have to use the pivot_table() function instead.

pivot_table()

The pivot_table() function removes the ambiguity problem in pivot() by aggregating the values for rows with duplicate entries. This is dipicted in the following example.



It can be seen that the corresponding values from a set of two matching columns are aggregated to mean value in the reshaped table and thus the ambiguity is resolved.
We can modify the default mean behaviour of the pivot_table() method by using the parameter aggfunc to achieve some other aggregation like sum, min, max, median etc. The following example illustrates the use of aggfunc parameter with sum option. In order to use np.sum we need to import numpy module in our project



Now we are going to work with an actual dataFrame to demonstrate various features of pivot_table( ) method.
  • Using the fill_value parameter to remove NaN from the reshaped table. You can see that the NaNs have been replaced with 0s and the dtype also changes back to integer.























  • Next we are going a to create a reshaped table with two aggregate functions on one value
p = df.pivot_table(index='Player',columns='Club',\
       values='Contract', aggfunc=[np.sum,np.mean], fill_value=0)
display(p)



































Using stack( ) and unstack( ) methods
       stack( ) and unstack( ) methods both flip the layout of DataFrame,  means these flips the levels of columns into row and flips levels of rows  into columns. DataFrame stacking means moving the innermost column  index to innermost row index and the opposite action is know as  unstacking




Sorting of DataFrames
Data of DataFrame can be sort according to values of row and  column.
By default sorting is done on row labels in ascending order.
Pandas DataFrames has two useful sort functions  :
sort_values( ): it sorts the data of given column to the function in ascending  or descending order.
sort_index( ):  this function sorts rows (axis=0) or columns (axis=1).
Its syntax is as follows:
DataFrame.sort_values(by = None, axis=0, ascending = True, inplace = False)
DataFrame.sort_index(by = None, axis=0, ascending = True, inplace = False)
Here
by: column to be sorted.
axis:  here passing 0 means sorting will be done row wise and 1 means column  wise
ascending: by default ascending is true
inplace: default is false if you don’t want a new dataframe then set it true.



No comments:

Post a Comment

Note: Only a member of this blog may post a comment.