SESSION - 2019-20
CHAPTER- 1 ADVANCE OPERATIONS ON DATA-FRAMES
Pivoting
- Data-frame :
Data-Frame - It is a 2-dimensional data structure with columns of different types. It is
just similar to a spreadsheet or
SQL table, or a dict of
Series objects. It is generally the most commonly used pandas object.
pivot()
The pivot function is used to create a new derived table out of a given table. Pivot takes 3 arguments with the following names: index, columns, and values. For each of these parameters we need to specify a column name from the original table. The pivot function will create a new table, whose row and column indices are the unique values of the respective parameters. The cell values of the new table are taken from column given as the values parameter. Let us understand the concept using the following example:
As you see in the resultant reshaped DataFrame on the right, the index has changed to Player and the Column headers have changed to Club. Please observe that the non-matching values are replaced with NaN(not a number), thus changing the dtype of the resultant DataFrame to float64
As you see in the resultant reshaped DataFrame on the right, the index has changed to Player and the Column headers have changed to Club. Please observe that the non-matching values are replaced with NaN(not a number), thus changing the dtype of the resultant DataFrame to float64
### pivot() with multiple values
We can use a list of column headers as values to give further details to the reshaped DataFrame. See in the following illustration the Contract and Bonus are passed as a list to the value parameter.
Note: Observe that all indices and columns are automatically sorted alphabetically
### When the pivot() function fails
We have seen above that pivot function takes at least 2 column names as parameters for index and columns . What will happen if there are more than one row with the same values for these columns? How will the pivot function determine the value of the corresponding cell in the pivoted table? The following diagram depicts the problem:
In this example we have two rows with the same values (“Yuvi” and “MI”) for the Player and Club columns. The pivot method can not know what should be the value of the corresponding value in the pivoted table. Thus, it throws an exception with the following message:
In this example we have two rows with the same values (“Yuvi” and “MI”) for the Player and Club columns. The pivot method can not know what should be the value of the corresponding value in the pivoted table. Thus, it throws an exception with the following message:
ValueError: Index contains duplicate entries, cannot reshape
Hence, before calling pivot we need to ensure that our data does not have rows with duplicate values for the specified columns. If we can’t ensure this we may have to use the pivot_table() function instead.
pivot_table()
The pivot_table() function removes the ambiguity problem in pivot() by aggregating the values for rows with duplicate entries. This is dipicted in the following example.
It can be seen that the corresponding values from a set of two matching columns are aggregated to mean value in the reshaped table and thus the ambiguity is resolved.
We can modify the default mean behaviour of the pivot_table() method by using the parameter aggfunc to achieve some other aggregation like sum, min, max, median etc. The following example illustrates the use of aggfunc parameter with sum option. In order to use np.sum we need to import numpy module in our project
Now we are going to work with an actual dataFrame to demonstrate various features of pivot_table( ) method.
Now we are going to work with an actual dataFrame to demonstrate various features of pivot_table( ) method.
- Using the fill_value parameter to remove NaN from the reshaped table. You can see that the NaNs have been replaced with 0s and the dtype also changes back to integer.
- Next we are going a to create a reshaped table with two aggregate functions on one value
p = df.pivot_table(index='Player',columns='Club',\
values='Contract', aggfunc=[np.sum,np.mean], fill_value=0)
display(p)
Using stack( ) and unstack( ) methods
stack(
) and
unstack( ) methods
both
flip
the layout
of DataFrame, means
these flips the levels of
columns into
row
and
flips
levels
of
rows into columns.
DataFrame stacking
means
moving
the innermost column index to innermost
row index
and the opposite
action is know
as unstacking
Sorting
of DataFrames
•Data of DataFrame
can
be sort
according to
values
of row
and column.
•By
default sorting
is done on row labels in ascending
order.
•Pandas DataFrames
has
two useful
sort
functions :
–sort_values(
): it
sorts the data of given column
to the function in ascending or descending
order.
–sort_index( ): this
function sorts rows (axis=0) or columns (axis=1).
•Its syntax
is as
follows:
•DataFrame.sort_values(by
= None,
axis=0,
ascending
= True,
inplace
= False)
•DataFrame.sort_index(by
= None,
axis=0,
ascending
= True,
inplace
= False)
•Here
–
•by: column
to be
sorted.
•axis: here
passing 0 means
sorting will be done row wise and
1 means
column wise
•ascending: by
default ascending
is
true
•inplace: default
is false
if
you don’t
want
a new
dataframe
then set it
true.
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.