Data handling using Pandas – II
Descriptive Statistics
Descriptive statistics are used to describe and summarize large amounts of data in ways that are meaningful and useful. They are the "must knows" with any set of data, giving us a general idea of the trends in our data, including:
•The mean, mode, median and range.
•Variance, standard deviation and quartiles.
•Sum, count, maximum and minimum.
Descriptive statistics are useful because they allow us to take decisions based on the data.
For example,
suppose we have data on the incomes of one million people. No one is going to want to read a million pieces of data; even if they did, they would not be able to get any useful information from it. On the other hand, if we summarize it, it becomes useful: an average wage, or a median income, is much easier to understand than reams of data.
Steps to get the descriptive statistics
Step 1: Collect the data, either from a data file or from the user.
Step 2: Create the DataFrame (a pandas object) from the data.
Step 3: Get the descriptive statistics from the pandas DataFrame as per requirement, like mean, mode, max, sum etc.
Note: the DataFrame object is best suited for descriptive statistics as it can hold a large amount of data and provides the relevant methods.
A pandas DataFrame object comes with methods to calculate max, min, count, sum, mean, median, mode, quartiles, standard deviation and variance.
Mean
Mean is an average of all the numbers. The steps required to calculate a mean are:
- sum up all the values of a target variable in the data-set.
- divide the sum by the number of values.
Median
Median is the middle value of a sorted list of numbers. The steps required to get the median from a list of numbers are (a short sketch follows this list):
- sort the numbers from smallest to largest.
- if the list has an odd number of values, the value in the middle position is the median.
- if the list has an even number of values, the average of the two values in the middle will be the median.
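For instance, here is a minimal sketch (the values are just for illustration) that works out the mean and the median by hand and compares them with the pandas Series methods:
import pandas as pd
values = [7, 3, 9, 5]                      # an illustrative data set with an even count
mean_by_hand = sum(values) / len(values)   # 24 / 4 = 6.0
s = sorted(values)                         # [3, 5, 7, 9]
median_by_hand = (s[1] + s[2]) / 2         # even count: average of the two middle values = 6.0
ser = pd.Series(values)
print(mean_by_hand, ser.mean())            # 6.0 6.0
print(median_by_hand, ser.median())        # 6.0 6.0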
Mode
To find the mode, or modal value, it is best to put the numbers in order. Then count how many times each number appears. The number that appears most often is the mode.
e.g. {19,8,29,35,19,28,15}.
Arrange them in order:{8,15,19,19,28,29,35}.
19 appears twice, all the rest appear only once, so 19 is the mode.
Having two modes is called "bi-modal".
Having more than two modes is called "multi-modal".
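As a quick check of the worked example above, a small pandas sketch gives the same answer:
import pandas as pd
nums = pd.Series([19, 8, 29, 35, 19, 28, 15])
print(nums.mode())     # prints 19, the value that appears most often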
Example:
import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
'Age':pd.Series([17,18,19,18,19]),
'Score':pd.Series([9,8,9,8,9])}
df=pd.DataFrame(d)
print("Data Frame is :")
print (df)
print("No. of records in above DF are:")
print(df.count())
print("Sum of Score",df[['Score']].sum())
print("Minimum Age",df[['Age']].min())
print("Maxmum Age",df[['Age']].max())
print("Mean Score",df[['Score']].mean())
print("Mode Score",df[['Score']].mode())
print("Median Score",df[['Score']].median())
OUTPUT:
Data Frame is :
Name Age Score
0 Disha 17 9
1 Rohit 18 8
2 Manas 19 9
3 Krish 18 8
4 Vishnu 19 9
No. of records in above DF are:
Name 5
Age 5
Score 5
dtype: int64
Sum of Score
Score 43
dtype: int64
Minimum Age
Age 17
dtype: int64
Maximum Age
Age 19
dtype: int64
Mean Score
Score 8.6
dtype: float64
Mode Score
Score
0 9
Median Score
Score 9.0
dtype: float64
Use of describe function
#describe() method returns count, mean, standard deviation, min, max and percentile values
Example-
#program for data aggregation /descriptive statistics using describe method
from pandas import DataFrame
Cars={'Brand':['Maruti','Ford','Tata','Toyota','Audi'], #step-1
'Price':[22000,27000,25000,29000,35000],
'Year':[2014,2015,2016,2017,2018] }
df=DataFrame(Cars, columns=['Brand','Price','Year']) #step-2
ds=df['Price'].describe().astype(int) #step-3
print (df)
print(ds)
OUTPUT:
Brand Price Year
0 Maruti 22000 2014
1 Ford 27000 2015
2 Tata 25000 2016
3 Toyota 29000 2017
4 Audi 35000 2018
count 5
mean 27600
std 4878
min 22000
25% 25000
50% 27000
75% 29000
max 35000
Name: Price, dtype: int32
Note: the describe() function is used to generate descriptive statistics.
For numeric data, the result's index will include count, mean, std, min and max, as well as the lower, 50% and upper percentiles. By default the lower percentile is 25% and the upper percentile is 75%. The 50% percentile is the same as the median.
How to Calculate Percentile in Descriptive Statistics
Example:
import pandas as pd
df1 = pd.DataFrame({'One':[4,6,8,10]}, columns=['One'])
print(df1.describe())
OUTPUT:
One
count 4.000000
mean 7.000000
std 2.581989
min 4.000000
25% 5.500000
50% 7.000000
75% 8.500000
max 10.000000
Mathematical formula to check these percentile values:
min + (max - min) * percentile
(This simple formula gives the right answers here because the values 4, 6, 8, 10 are evenly spaced; in general, pandas interpolates linearly between the sorted values, as shown in the sketch below.)
In the above example, as per the formula we get:
for 25% :  4 + (10 - 4) * 0.25 = 4 + 6 * 0.25 = 4 + 1.5 = 5.5
for 50% :  there is an even number of values, so it is the average of the two middle values of (4, 6, 8, 10): (6 + 8) / 2 = 7.0
OR, using the above formula, 4 + (10 - 4) * 0.50 = 4 + 6 * 0.50 = 4 + 3 = 7.0
for 75% :  4 + (10 - 4) * 0.75 = 4 + 6 * 0.75 = 4 + 4.5 = 8.5
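To see why the values agree, here is a small sketch of the linear-interpolation rule that pandas applies by default (the helper function name is our own), checked against numpy.percentile on the same data:
import numpy as np
data = sorted([4, 6, 8, 10])
def percentile_linear(values, p):            # p is a fraction between 0 and 1
    pos = p * (len(values) - 1)              # fractional position in the sorted list
    lower = int(pos)                         # index of the value just below that position
    frac = pos - lower                       # how far we are between the two neighbours
    if lower + 1 < len(values):
        return values[lower] + frac * (values[lower + 1] - values[lower])
    return values[lower]
for p in (0.25, 0.50, 0.75):
    print(p, percentile_linear(data, p), np.percentile(data, p * 100))
# 0.25 -> 5.5, 0.50 -> 7.0, 0.75 -> 8.5 from both calculations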
*********************************************************************************
QUARTILES
The word "quartile" comes from the word "quantile". A quantile is a point at which a sample is divided into equal-sized subgroups (that is why it is sometimes called a "fractile"). It can also refer to dividing a probability distribution into areas of equal probability.
The median is a kind of quantile; it is placed at the centre of a probability distribution so that exactly half of the data is below the median and half is above it. Because the median cuts a distribution into two equal parts, it is sometimes called the 2-quantile.
In other words, the word "quartile" is taken from the word "quantile", and the word "quantile" is taken from "quantity". Let us understand this by taking an example:
The 0.35 quantile states that 35% of the observations in the dataset are below a given value. It also states that the remaining 65% of the observations are above that value.
Method to find Quartiles
Suppose we have the numbers 1, 3, 4, 7, 8, 8, 9.
Step 1: Arrange the data in ascending order (already in ascending order)
Step 2: Count the total number of observations (n = 7)
Step 3: Find the first quartile, i.e. Q1 (25%), known as the 25th percentile (0.25)
Step 4: Now calculate Q1 = round(0.25 * (n + 1))
i.e. round(0.25 * (7 + 1))
= round(0.25 * 8) = 2.0, i.e. the 2nd observation, which is 3
Step 5: Calculate the second quartile, i.e. Q2 (50%), the 50th percentile
= round(0.50 * (7 + 1)) = 4.0, i.e. the 4th observation, which is 7
Step 6: Calculate the third quartile, i.e. Q3 (75%), the 75th percentile
= round(0.75 * (7 + 1)) = 6.0, i.e. the 6th observation, which is 8
Python Program:
import pandas as pd
import numpy as np
s=pd.Series([1,3,4,7,8,8,9])
q=s.quantile([0.25,0.50,0.75]).astype(int)
print(q)
Output:
0.25 3
0.50 7
0.75 8
dtype: int32
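Note that without .astype(int) the same call returns the interpolated quartiles; the integer cast simply truncates them, which is why the result matches the hand method above:
import pandas as pd
s = pd.Series([1, 3, 4, 7, 8, 8, 9])
print(s.quantile([0.25, 0.50, 0.75]))
# 0.25    3.5
# 0.50    7.0
# 0.75    8.0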
=================================================
Variance
var() – the variance function in Python pandas is used to calculate the variance of a given set of numbers: the variance of a data frame, the variance of a column and the variance of rows. Let us see an example of each.
In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers is spread out from its average value.
Example:
We have a dataset "Score"= [90,40,50,90,65] now let us know how to calculate Variance:
Step 1: Read Data-set (n) [90,40,50,90,65]
Step 2 : Square of each number given in Data-Set= [8100,1600,2500,8100,4225]
Step 3: Sum of all the numbers in the data set: 90 + 40 + 50 + 90 + 65 = 335
Step 4: Square of the sum (x) = 335 * 335 = 112225
Step 5: Sum of the squares of each number (m) = 8100 + 1600 + 2500 + 8100 + 4225 = 24525
Step 6: k = x / count of the data set = 112225 / 5 = 22445
Step 7: m - k = 24525 - 22445 = 2080
Step 8: Variance = (m - k) / (count - 1) = 2080 / (5 - 1) = 520
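The same steps can be written as a short plain-Python sketch to verify the result before using pandas:
scores = [90, 40, 50, 90, 65]
n = len(scores)
sum_of_squares = sum(x * x for x in scores)        # 24525
square_of_sum = sum(scores) ** 2                   # 335 * 335 = 112225
variance = (sum_of_squares - square_of_sum / n) / (n - 1)
print(variance)                                    # 520.0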
Python Program to find out Variance of a Data-Set
import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
'Score':pd.Series([90,40,50,90,65])}
df=pd.DataFrame(d)
print("Data Frame is :")
print (df)
print ("Variance of Score")
print (df.var())
OUTPUT:
Data Frame is :
Name Score
0 Disha 90
1 Rohit 40
2 Manas 50
3 Krish 90
4 Vishnu 65
Variance of Score
Score 520.0
dtype: float64
We can also use the variance function in the following ways:-
print(df.loc[:,"Score"].var()) # used for variance of specific column
print (df.var(axis=0)) # column wise variance
print (df.var(axis=1)) # row wise variance
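Row-wise variance only makes sense for numeric columns, so a DataFrame like the one above (which has a Name column) should be restricted to its numeric columns first. A minimal sketch with an assumed numeric-only DataFrame:
import pandas as pd
marks = pd.DataFrame({'Test1': [90, 40, 50], 'Test2': [80, 60, 70]})   # illustrative data
print(marks.var(axis=0))   # variance of each column
print(marks.var(axis=1))   # variance of each row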
Standard Deviation
Standard deviation measures the amount of variation or dispersion of a set of values. A low standard deviation means the values tend to be close to the mean of the set, and a high standard deviation means the values are spread out over a wider range.
Standard deviation is one of the most important concepts as far as finance is concerned. Finance and banking are all about measuring risk, and standard deviation measures risk. Standard deviation is used by portfolio managers to measure and track risk.
Steps to calculate the standard deviation:
- Work out the mean (the simple average of the numbers).
- Then, for each number, subtract the mean and square the result.
- Then work out the sum of those squared differences and divide by (N-1).
- Take the square root of that and we are done!
Example :
To find out Standard Deviation for a dataset "Score" (n)= [90,40,50,90,65]
Step 1: Work Out the Mean :- (90+40+50+90+65)/5= 335/5=67
Step 2: Then for each number: Subtract the Mean and Square the Result:-
(90-67)= 23 *23 = 529
(40-67)= -27*-27= 729
(50-67)= -17*-17= 289
(90-67)= 23 *23 = 529
(65-67)= -2 *-2 = 4
Step 3: Then work out the sum of those squared differences and divide by (N-1):-
Sum (529+729+289+529+4) = 2080; 2080/(n-1) = 2080/4 = 520 (this value is also known as the Variance)
Step 4: Take the square root of the value found at Step 3 (the Variance)
The square root of 520 is 22.80, i.e. the Standard Deviation
Or
Note: we can also apply the steps for finding the Variance and, after getting the Variance, find its square root to calculate the Standard Deviation.
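A quick sketch of this shortcut:
import math
import pandas as pd
score = pd.Series([90, 40, 50, 90, 65])
print(score.var())                 # 520.0
print(math.sqrt(score.var()))      # 22.8035... the standard deviation
print(score.std())                 # same value, computed directly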
Python Program to find out Standard Deviation of a Data-Set
import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
'Score':pd.Series([90,40,50,90,65])}
df=pd.DataFrame(d)
print("Data Frame is :")
print (df)
print ("Variance of Score")
print (df.var())
print ("Standard Deviation of Score")
print (df.std())
OUTPUT:
Data Frame is :
Name Score
0 Disha 90
1 Rohit 40
2 Manas 50
3 Krish 90
4 Vishnu 65
Variance of Score
Score 520.0
dtype: float64
Standard Deviation of Score
Score 22.803509
dtype: float64
DataFrame operations
Aggregation
Aggregation is the process of turning the values of a dataset (or a subset of it) into one single value. In other words, data aggregation is a multi-valued function: it requires multiple values and returns a single value as a result. A number of aggregations are possible, such as count, sum, min, max, median, quartile etc. These (count, sum etc.) are descriptive statistics and other related operations on a DataFrame. Let us make this clear! If we have a DataFrame like
Data Frame is :
Name Score
0 Disha 90
1 Rohit 40
2 Manas 50
3 Krish 90
4 Vishnu 65
…then a simple aggregation method is to calculate the sum of the Score, which is 90+40+50+90+65=335. A different aggregation method would be to count the number of Name values, which is 5.
Program to show use of aggregate functions in Python DF
import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
'Score':pd.Series([90,40,50,90,65])}
df=pd.DataFrame(d)
print("Data Frame is :")
print (df)
print ('---Basic Aggregate functions min(),max(),sum(),Avg()---')
print('Minimum Score is:',df['Score'].min())
print('Maximum Score is:',df['Score'].max())
print('Sum of Score is:',df['Score'].sum())
print('Average of Score is:',df['Score'].mean())
OUTPUT:
Data Frame is :
Name Score
0 Disha 90
1 Rohit 40
2 Manas 50
3 Krish 90
4 Vishnu 65
---Basic Aggregate functions min(),max(),sum(),Avg()---
Minimum Score is: 40
Maximum Score is: 90
Sum of Score is: 335
Average of Score is: 67.0
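As an alternative, the agg() method can apply several aggregate functions in one call; a small sketch on the same Score values:
import pandas as pd
score = pd.Series([90, 40, 50, 90, 65])
print(score.agg(['min', 'max', 'sum', 'mean']))
# min 40.0, max 90.0, sum 335.0, mean 67.0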
Group by
A group by operation involves the following steps on the data frame:-
- Splitting the data frame
- Applying a function (usually an aggregate function)
- Combining the result.
# PROGRAM TO GROUP THE DATA CITY-WISE AND FIND OUT THE MAXIMUM AND MINIMUM TEMPERATURE FOR EACH CITY.
import pandas as pd
data={
'Date' : ['1-1-2020','1-1-2020','1-2-2020','1-2-2020','1-3-2020','1-3-2020'],
'City' : ['Jaipur','Jaipur','Jodhpur','Jodhpur','Jaisalmer','Jaisalmer'],
'Temp': [28,30,22,24,32,34],
'Humidity': [60,55,80,70,90,85]
}
df=pd.DataFrame(data)
print (df)
print('\n result after group operation')
print (df.groupby('City').max())
print (df.groupby('City').min())
OUTPUT:
Date City Temp Humidity
0 1-1-2020 Jaipur 28 60
1 1-1-2020 Jaipur 30 55
2 1-2-2020 Jodhpur 22 80
3 1-2-2020 Jodhpur 24 70
4 1-3-2020 Jaisalmer 32 90
5 1-3-2020 Jaisalmer 34 85
result after group operation
Date Temp Humidity
City
Jaipur 1-1-2020 30 60
Jaisalmer 1-3-2020 34 90
Jodhpur 1-2-2020 24 80
Date Temp Humidity
City
Jaipur 1-1-2020 28 55
Jaisalmer 1-3-2020 32 85
Jodhpur 1-2-2020 22 70
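If we only need one column, agg() can return several group-wise aggregates in a single call; a small sketch with an abbreviated version of the same data:
import pandas as pd
df2 = pd.DataFrame({'City': ['Jaipur', 'Jaipur', 'Jodhpur', 'Jodhpur'],
                    'Temp': [28, 30, 22, 24]})
print(df2.groupby('City')['Temp'].agg(['min', 'max']))
# Jaipur   28  30
# Jodhpur  22  24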
SORTING
1. By value (column)
2. By index
- By value - Sorting a dataframe by the elements of one or more columns is supported by the sort_values() method. We will cover here three aspects of sorting the values of a dataframe:
- Sort a pandas dataframe in python by Ascending and Descending
- Sort a python pandas dataframe by single column
- Sort a pandas dataframe by multiple columns.
Sort the python pandas Dataframe by single column –Ascending order (Default)
import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
'Age':pd.Series([16,17,15,14,19]),
'Score':pd.Series([87,89,67,55,47])}
df=pd.DataFrame(d)
print("Data Frame before sorting is :")
print (df)
df=df.sort_values(by='Score')
print("Data Frame after sorting")
print(df)
OUTPUT
Data Frame before sorting is :
Name Age Score
0 Disha 16 87
1 Rohit 17 89
2 Manas 15 67
3 Krish 14 55
4 Vishnu 19 47
Data Frame after sorting
Name Age Score
4 Vishnu 19 47
3 Krish 14 55
2 Manas 15 67
0 Disha 16 87
1 Rohit 17 89
# In the above example a dictionary object is used to create the dataframe. Elements of the dataframe object df are sorted by the sort_values() method. As argument we pass only the value 'Score' for the by parameter; by default it sorts in ascending order.
Sort the python pandas Data frame by single column –Descending order
import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
'Age':pd.Series([16,17,15,14,19]),
'Score':pd.Series([87,89,67,55,47])}
df=pd.DataFrame(d)
print("Data Frame before sorting is :")
print (df)
df=df.sort_values(by='Score',ascending=False)
print("Data Frame after sorting")
print(df)
OUTPUT
Data Frame before sorting is :
Name Age Score
0 Disha 16 87
1 Rohit 17 89
2 Manas 15 67
3 Krish 14 55
4 Vishnu 19 47
Data Frame after sorting
Name Age Score
1 Rohit 17 89
0 Disha 16 87
2 Manas 15 67
3 Krish 14 55
4 Vishnu 19 47
# In the above example a dictionary object is used to create the data frame. Elements of the data frame object df are sorted by the sort_values() method. We pass ascending=False (0 also works), which sorts the data in descending order of Score.
Sort the pandas Data frame by Multiple Columns
import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
'Age':pd.Series([16,17,15,14,19]),
'Score':pd.Series([87,89,67,55,47])}
df=pd.DataFrame(d)
print("Data Frame before sorting is :")
print (df)
df=df.sort_values(by=['Age','Score'],ascending=[True,False])
print("Data Frame after sorting")
print(df)
OUTPUT:
Data Frame before sorting is :
Name Age Score
0 Disha 16 87
1 Rohit 17 89
2 Manas 15 67
3 Krish 14 55
4 Vishnu 19 47
Data Frame after sorting
Name Age Score
3 Krish 14 55
2 Manas 15 67
0 Disha 16 87
1 Rohit 17 89
4 Vishnu 19 47
2. By index
Sorting a data frame by its index is supported by the sort_index() method.
import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
'Age':pd.Series([16,17,15,14,19]),
'Score':pd.Series([87,89,67,55,47])}
df=pd.DataFrame(d)
df=df.reindex([1,4,3,2,0])
print("Data Frame before sorting is :")
print (df)
df1=df.sort_index()
print("Data Frame after sorting IN ASCENDING")
print(df1)
OUTPUT:
Data Frame before sorting is :
Name Age Score
1 Rohit 17 89
4 Vishnu 19 47
3 Krish 14 55
2 Manas 15 67
0 Disha 16 87
Data Frame after sorting IN ASCENDING
Name Age Score
0 Disha 16 87
1 Rohit 17 89
2 Manas 15 67
3 Krish 14 55
4 Vishnu 19 47
Sorting pandas data frame by index in descending order
import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
'Age':pd.Series([16,17,15,14,19]),
'Score':pd.Series([87,89,67,55,47])}
df=pd.DataFrame(d)
df=df.reindex([1,4,3,2,0])
print("Data Frame before sorting is :")
print (df)
df1=df.sort_index(ascending=False)
print("Data Frame after sorting descending order")
print(df1)
OUTPUT
Data Frame before sorting is :
Name Age Score
1 Rohit 17 89
4 Vishnu 19 47
3 Krish 14 55
2 Manas 15 67
0 Disha 16 87
Data Frame after sorting descending order
Name Age Score
4 Vishnu 19 47
3 Krish 14 55
2 Manas 15 67
1 Rohit 17 89
0 Disha 16 87
RENAMING INDEX
The rename() method is used to rename the indexes (row labels) in a dataframe.
Syntax- df.rename(index={old_label: new_label, ...}, inplace (optional))
Example:
import pandas as pd
L=[{'First_Name': 'Manas', 'last_Name':'Purohit'},
{'First_Name': 'Rohit', 'last_Name':'Soni'},
{'First_Name': 'Disha', 'last_Name':'Bucha'},
{'First_Name': 'Vishnu', 'last_Name':'Nair'},
{'First_Name': 'Shubham', 'last_Name':'Parihar'}
]
df1=pd.DataFrame(L)
print (df1)
df=df1.rename(index={0:'Row1',1:'Row2',2:'Row3',3:'Row4',4:'Row5'})
print("Data Frame after renaming the indexes")
print(df)
OUTPUT:
First_Name last_Name
0 Manas Purohit
1 Rohit Soni
2 Disha Bucha
3 Vishnu Nair
4 Shubham Parihar
Data Frame after renaming the indexes
First_Name last_Name
Row1 Manas Purohit
Row2 Rohit Soni
Row3 Disha Bucha
Row4 Vishnu Nair
Row5 Shubham Parihar
DELETING INDEX
The drop() method is used to delete rows from a dataframe by their index labels; reset_index() can be applied first so that the original index is kept as a column while the selected rows are removed.
Syntax- df.reset_index().drop(index=[label, ...], inplace (optional))
Example:
import pandas as pd
L=[{'First_Name': 'Manas', 'last_Name':'Purohit'},
{'First_Name': 'Rohit', 'last_Name':'Soni'},
{'First_Name': 'Disha', 'last_Name':'Bucha'},
{'First_Name': 'Vishnu', 'last_Name':'Nair'},
{'First_Name': 'Shubham', 'last_Name':'Parihar'}
]
df1=pd.DataFrame(L)
print (df1)
df=df1.reset_index().drop(index=[2,3])
print("Data Frame after deleting the indexes")
print(df)
OUTPUT:
First_Name last_Name
0 Manas Purohit
1 Rohit Soni
2 Disha Bucha
3 Vishnu Nair
4 Shubham Parihar
Data Frame after deleting the indexes
index First_Name last_Name
0 0 Manas Purohit
1 1 Rohit Soni
4 4 Shubham Parihar
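For comparison, drop() can also be used directly without reset_index(); the listed row labels are simply removed and the remaining labels are kept as they are. A short sketch on the same data:
import pandas as pd
df1 = pd.DataFrame({'First_Name': ['Manas', 'Rohit', 'Disha', 'Vishnu', 'Shubham'],
                    'last_Name': ['Purohit', 'Soni', 'Bucha', 'Nair', 'Parihar']})
print(df1.drop(index=[2, 3]))      # rows labelled 2 and 3 are removed; labels 0, 1, 4 remain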
Pivoting - Data-frame
Pivoting: - Pivoting is one of the important aspects of data analysis. It is used to summarize a large amount of data and permits us to access important records from a large dataset.
Python Pandas provide two functions for pivoting:
1. pivot()
2. pivot_table()
pivot() allows us to transform or reshape the data frame based on the column values, according to our perspective. It takes three arguments: index, columns and values.
Example:
Program:
import pandas as pd
data={
'Year': ['2018','2019','2018','2019','2018','2019'],
'Team': ['MI','MI','RCB','RCB','CSK','CSK'],
'Runs': ['2500','2650','2200','2400','2300','2700']
}
df=pd.DataFrame(data)
print (df)
pv=pd.pivot(df,index='Year',columns='Team',values='Runs')
print(pv)
OUTPUT:
Year Team Runs
0 2018 MI 2500
1 2019 MI 2650
2 2018 RCB 2200
3 2019 RCB 2400
4 2018 CSK 2300
5 2019 CSK 2700
Team CSK MI RCB
Year
2018 2300 2500 2200
2019 2700 2650 2400
pivot_table():- we know that the pivot() method takes at least two column names as parameters - index and columns. What will happen if we have multiple rows with the same values for these columns?
The pivot_table() method solves this problem. It works like pivot(), but it aggregates the values from rows with duplicate entries for the specified columns (i.e. it applies an aggregate function chosen by us).
By default pivot_table() applies mean() to aggregate the values from rows with duplicate entries for the specified columns.
For Example:-
Program:
import pandas as pd
data={
'Date': ['1-1-2019','1-1-2019','1-2-2019','1-2-2019','1-3-2019','1-3-2019'],
'City':['Delhi','Delhi','Mumbai','Mumbai','Chennai','Chennai'],
'Temp': [28,30,22,24,32,34],
'Humidity': [60,55,80,70,90,85]
}
df=pd.DataFrame(data)
print (df)
pv=pd.pivot_table(df,index='City',values='Temp')
print (pv)
OUTPUT:
Date City Temp Humidity
0 1-1-2019 Delhi 28 60
1 1-1-2019 Delhi 30 55
2 1-2-2019 Mumbai 22 80
3 1-2-2019 Mumbai 24 70
4 1-3-2019 Chennai 32 90
5 1-3-2019 Chennai 34 85
Temp
City
Chennai 33.0
Delhi 29.0
Mumbai 23.0
Note: in the above program pivot_table() applies the mean() function by default.
PROGRAM TO FIND OUT CITYWISE MAXIMUM TEMPERATURE:
import pandas as pd
data={
'Date': ['1-1-2019','1-1-2019','1-2-2019','1-2-2019','1-3-2019','1-3-2019'],
'City':['Delhi','Delhi','Mumbai','Mumbai','Chennai','Chennai'],
'Temp': [28,30,22,24,32,34],
'Humidity': [60,55,80,70,90,85]
}
df=pd.DataFrame(data)
print (df)
pv=pd.pivot_table(df,index='City',values='Temp',aggfunc='max')
print (pv)
Output:
Date City Temp Humidity
0 1-1-2019 Delhi 28 60
1 1-1-2019 Delhi 30 55
2 1-2-2019 Mumbai 22 80
3 1-2-2019 Mumbai 24 70
4 1-3-2019 Chennai 32 90
5 1-3-2019 Chennai 34 85
Temp
City
Chennai 34
Delhi 30
Mumbai 24
PROGRAM TO PRINT DATA FRAME ON DATE INDEX AND CITY COLUMN
import pandas as pd
data={
'Date': ['1-1-2019','1-1-2019','1-2-2019','1-2-2019','1-3-2019','1-3-2019'],
'City':['Delhi','Delhi','Mumbai','Mumbai','Chennai','Chennai'],
'Temp': [28,30,22,24,32,34],
'Humidity': [60,55,80,70,90,85]
}
df=pd.DataFrame(data)
print (df)
pv=pd.pivot_table(df,index='Date',columns='City')
print (pv)
OUTPUT
Date City Temp Humidity
0 1-1-2019 Delhi 28 60
1 1-1-2019 Delhi 30 55
2 1-2-2019 Mumbai 22 80
3 1-2-2019 Mumbai 24 70
4 1-3-2019 Chennai 32 90
5 1-3-2019 Chennai 34 85
Humidity Temp
City Chennai Delhi Mumbai Chennai Delhi Mumbai
Date
1-1-2019 NaN 57.5 NaN NaN 29.0 NaN
1-2-2019 NaN NaN 75.0 NaN NaN 23.0
1-3-2019 87.5 NaN NaN 33.0 NaN NaN
HANDLING MISSING VALUES
Dropping and filling
Sometimes, the data that we receive from various sources may not be perfect; that is, there may be some missing data. For example, in the following program the student name is missing in one row and the score is missing in another row:
import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas',np.nan,'Vishnu']),
'Age':pd.Series([16,17,15,14,19]),
'Score':pd.Series([87,89,67,55,np.nan])}
df=pd.DataFrame(d)
print(df)
OUTPUT:
0 Disha 16 87.0
1 Rohit 17 89.0
2 Manas 15 67.0
3 NaN 14 55.0
4 Vishnu 19 NaN
When we convert the data into a DataFrame, the missing data is represented by NaN (Not a Number). NaN is the default marker for missing values.
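Before filling or dropping anything, it is often handy to count how many values are missing in each column; a small sketch on the same data:
import pandas as pd
import numpy as np
d = {'Name': pd.Series(['Disha', 'Rohit', 'Manas', np.nan, 'Vishnu']),
     'Age': pd.Series([16, 17, 15, 14, 19]),
     'Score': pd.Series([87, 89, 67, 55, np.nan])}
df = pd.DataFrame(d)
print(df.isnull().sum())    # Name 1, Age 0, Score 1 missing values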
Considering the DataFrame above, we can use the fillna() method to replace a NaN or NA value with a specified value.
# to fill the NaN value by 0
import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas',np.nan,'Vishnu']),
'Age':pd.Series([16,17,15,14,19]),
'Score':pd.Series([87,89,67,55,np.nan])}
df=pd.DataFrame(d)
df=df.fillna(0)
print(df)
OUTPUT:
0 Disha 16 87.0
1 Rohit 17 89.0
2 Manas 15 67.0
3 0 14 55.0
4 Vishnu 19 0.0
But this is not very useful, as it fills every type of column with 0. We can fill each column with a different value by passing the column name and the value to be used to fill that column.
For example, to fill the student Name with 'XXXX' and the Score with '**', we supply these values as a dictionary inside the fillna() method.
import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas',np.nan,'Vishnu']),
'Age':pd.Series([16,17,15,14,19]),
'Score':pd.Series([87,89,67,55,np.nan])}
df=pd.DataFrame(d)
df=df.fillna({'Name':'XXXX','Score':'**'})
print(df)
OUTPUT:
0 Disha 16 87
1 Rohit 17 89
2 Manas 15 67
3 XXXX 14 55
4 Vishnu 19 **
If we do not want any missing data and want to remove the rows having NA or NaN values, we can use the dropna() method.
example:
import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas',np.nan,'Vishnu']),
'Age':pd.Series([16,17,15,14,19]),
'Score':pd.Series([87,89,67,55,np.nan])}
df=pd.DataFrame(d)
df=df.dropna()
print(df)
Output:
0 Disha 16 87.0
1 Rohit 17 89.0
2 Manas 15 67.0
*****************************************************
Assignments
Descriptive Statistics
Quartiles
Standard Deviation
Aggregation and Sorting
Pivoting
Handling missing values