Wednesday, April 22, 2020

Pandas-II


 
Data handling using Pandas – II

Descriptive Statistics


Descriptive statistics are used to describe and summarize large amounts of data in ways that are meaningful and useful. They are the "must knows" for any set of data, giving us a general idea of the trends in our data, including:

• The mean, mode, median and range.

• Variance, standard deviation and quartiles.

• Sum, count, maximum and minimum.

Descriptive statistics are useful because they allow us to take decisions.
For example,
let's say we have data on the incomes of one million people. No one is going to read a million pieces of data, and even if they did, they would not be able to extract any useful information from it. On the other hand, if we summarize it, it becomes useful: an average wage or a median income is much easier to understand than reams of data.


Steps to Get the descriptive statistics

Step 1: Collect the data, either from a data file or from the user.

Step 2: Create a DataFrame from the collected data (a pandas object).

Step 3: Get the descriptive statistics from the pandas DataFrame as per requirement, e.g. mean, mode, max, sum etc.

Note: The DataFrame object is best suited for descriptive statistics, as it can hold a large amount of data and provides the relevant functions.

The pandas DataFrame object provides methods to calculate max, min, count, sum, mean, median, mode, quantile, standard deviation and variance.



Mean



Mean is the average of all the numbers. The steps required to calculate a mean are listed below (a small sketch follows the list):

  • sum up all the values of a target variable in the data-set.
  • divide the sum by the number of values.
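As a quick illustration of those two steps, here is a minimal sketch (the values are arbitrary examples, not taken from the dataset used later):

import pandas as pd

values = pd.Series([10, 20, 30, 40])
total = values.sum()              # step 1: sum of all the values -> 100
mean = total / values.count()     # step 2: divide by the number of values -> 25.0
print(mean, values.mean())        # pandas' built-in mean() gives the same result
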
Median

Median is the middle value of a sorted list of numbers. The steps required to get the median from a list of numbers are listed below (a small sketch follows the list):



  • sort the numbers from smallest to highest.
  • if the list has an odd number of values, the value in the middle position is the median.
  • if the list has an even number of values, the average of the two values in the middle will be the median.
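A small sketch of both cases, using arbitrary example numbers:

import pandas as pd

odd = pd.Series([7, 1, 5])        # sorted: 1 5 7 -> the middle value is 5
even = pd.Series([7, 1, 5, 3])    # sorted: 1 3 5 7 -> average of 3 and 5
print(odd.median())               # 5.0
print(even.median())              # 4.0
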
Mode

To find the mode, or modal value, it is best to put the numbers in order and then count how many of each number there are. The number that appears most often is the mode.
e.g. {19, 8, 29, 35, 19, 28, 15}.
Arrange them in order: {8, 15, 19, 19, 28, 29, 35}.
19 appears twice and all the rest appear only once, so 19 is the mode.
Having two modes is called "bi-modal".
Having more than two modes is called "multi-modal".
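The example above can be checked with pandas; a second Series is added only to show that mode() returns every value that ties for the highest count (a bi-modal case). A minimal sketch:

import pandas as pd

s = pd.Series([19, 8, 29, 35, 19, 28, 15])
print(s.mode())           # 19 appears twice, so 19 is the (single) mode

bimodal = pd.Series([2, 2, 5, 5, 9])
print(bimodal.mode())     # both 2 and 5 appear twice -> two modes (bi-modal)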

Example:

import pandas as pd
import numpy as np

# Step 1 & 2: collect the data and create a DataFrame from a dictionary of Series
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
   'Age':pd.Series([17,18,19,18,19]),
   'Score':pd.Series([9,8,9,8,9])}
df=pd.DataFrame(d)
print("Data Frame is :")
print (df)

# Step 3: get the descriptive statistics
print("No. of records in above DF are:")
print(df.count())
print("Sum of Score",df[['Score']].sum())
print("Minimum Age",df[['Age']].min())
print("Maximum Age",df[['Age']].max())
print("Mean Score",df[['Score']].mean())
print("Mode Score",df[['Score']].mode())
print("Median Score",df[['Score']].median())


OUTPUT:

Data Frame is :
    Name   Age   Score
0   Disha   17      9
1   Rohit   18      8
2   Manas   19      9
3   Krish   18      8
4  Vishnu   19      9

No. of records in above DF are:
Name     5
Age      5
Score    5
dtype: int64

Sum of Score 
Score    43
dtype: int64

Minimum Age 
Age    17
dtype: int64

Maximum Age 
Age    19
dtype: int64

Mean Score 
Score    8.6
dtype: float64

Mode Score    
     Score
0      9

Median Score 
Score    9.0
dtype: float64


Use of the describe() function
# the describe() method returns count, mean, standard deviation, min, max and percentile values

Example-

#program for data aggregation /descriptive statistics using describe method

from pandas import DataFrame
Cars={'Brand':['Maruti','Ford','Tata','Toyota','Audi'],   #step-1
      'Price':[22000,27000,25000,29000,35000],
      'Year':[2014,2015,2016,2017,2018] }
df=DataFrame(Cars, columns=['Brand','Price','Year'])       #step-2
ds=df['Price'].describe().astype(int)                      #step-3
print (df)
print(ds)

OUTPUT:

   Brand  Price  Year
0  Maruti  22000  2014
1    Ford  27000  2015
2    Tata  25000  2016
3  Toyota  29000  2017
4    Audi  35000  2018
count        5
mean     27600
std       4878
min      22000
25%      25000
50%      27000
75%      29000
max      35000

Name: Price, dtype: int32


Note: The describe() function is used to generate descriptive statistics.
For numeric data, the result's index will include count, mean, std, min and max, as well as the lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
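If other cut points are needed, describe() also accepts a percentiles argument. A brief sketch (the 0.10 and 0.90 percentiles and the Price values are arbitrary choices for illustration):

import pandas as pd

prices = pd.Series([22000, 27000, 25000, 29000, 35000], name='Price')
# ask for the 10th and 90th percentiles instead of the default 25th/75th;
# the 50th percentile (the median) is always included
print(prices.describe(percentiles=[0.10, 0.90]))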



How to Calculate Percentiles in Descriptive Statistics

Example:

import pandas as pd
df1 = pd.DataFrame({'One':[4,6,8,10]},columns=['One'])
print(df1.describe())

OUTPUT:

       One 
count 4.000000 
mean  7.000000 
std   2.581989 
min   4.000000 
25%   5.500000 
50%   7.000000 
75%   8.500000 
max  10.000000

The mathematical formula used above to calculate a percentile is:
min + (max - min) * percentile

(This simple formula matches pandas here because the values are evenly spaced; in general, pandas interpolates linearly between the two nearest values of the sorted data.)

In the above example, as per the formula, we get for 25%:
4 + (10 - 4) * 0.25 = 4 + 6 * 0.25 = 4 + 1.5 = 5.5

For 50%: the list has an even number of values, so the median is the average of the two middle values of (4, 6, 8, 10), i.e. (6 + 8) / 2 = 7.0
OR
Using the above formula: 4 + (10 - 4) * 0.50 = 4 + 6 * 0.50 = 4 + 3 = 7.0

For 75%: 4 + (10 - 4) * 0.75 = 4 + 6 * 0.75 = 4 + 4.5 = 8.5
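The hand calculations above can be cross-checked with pandas' quantile() method, which interpolates linearly and therefore gives the same values for this data. A small sketch:

import pandas as pd

s = pd.Series([4, 6, 8, 10])
print(s.quantile(0.25))    # 5.5
print(s.quantile(0.50))    # 7.0
print(s.quantile(0.75))    # 8.5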
*********************************************************************************

QUARTILES


The word "quartile" comes from "quantile", which in turn comes from "quantity". A quantile is a point at which a sample is divided into equal-sized subgroups (that is why it is sometimes called a "fractile"). It can also refer to dividing a probability distribution into areas of equal probability.
The median is a kind of quantile: it sits at the centre of the distribution so that exactly half of the data lies below it and half lies above it. Because the median cuts a distribution into two equal parts, it is sometimes called the 2-quantile.

Let us understand quantiles with an example:

The 0.35 quantile states that 35% of the observations in the dataset are below a given value, and that the remaining 65% of the observations are above it.


Method to find Quartiles

Suppose we have the numbers 1, 3, 4, 7, 8, 8, 9.

Step 1: Arrange the data in ascending order (already in ascending order).

Step 2: Count the total number of observations (n = 7).

Step 3: Find the first quartile Q1 (25%), also known as the 25th percentile (0.25).

Step 4: Calculate Q1 = round(0.25*(n+1)) = round(0.25*(7+1)) = round(0.25*8) = 2,
        i.e. the 2nd observation, which is 3.

Step 5: Calculate the second quartile Q2 (50%), the 0.50 or 50th percentile:
        round(0.50*(7+1)) = 4, i.e. the 4th observation, which is 7.

Step 6: Calculate the third quartile Q3 (75%), the 0.75 or 75th percentile:
        round(0.75*(7+1)) = 6, i.e. the 6th observation, which is 8.


Python Program:

import pandas as pd
s=pd.Series([1,3,4,7,8,8,9])
# pandas interpolates between values, so the exact 0.25 quantile is 3.5;
# astype(int) truncates it to 3, which matches the hand method above
q=s.quantile([0.25,0.50,0.75]).astype(int)
print(q)

Output:

0.25    3
0.50    7
0.75    8
dtype: int32

=================================================
Variance

var() – The variance function in pandas is used to calculate the variance of a given set of numbers: the variance of a data frame, of a column, or of rows. Let us see an example of each.
In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers is spread out from its average value.

Example:
We have a dataset "Score" = [90, 40, 50, 90, 65]. Let us see how to calculate its (sample) variance:

Step 1: Read the data set: [90, 40, 50, 90, 65]; the count n = 5.

Step 2: Square each number in the data set: [8100, 1600, 2500, 8100, 4225].

Step 3: Find the sum of the numbers in the data set: 90 + 40 + 50 + 90 + 65 = 335.

Step 4: Square that sum: 335 * 335 = 112225.

Step 5: Find the sum of the squares from Step 2 (m): 8100 + 1600 + 2500 + 8100 + 4225 = 24525.

Step 6: Divide the squared sum from Step 4 by the count (call the result k): 112225 / 5 = 22445.

Step 7: m - k = 24525 - 22445 = 2080.

Step 8: Variance = (m - k) / (n - 1) = 2080 / (5 - 1) = 520.
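The arithmetic above can be verified with a few lines of plain Python; a minimal sketch using the standard library's statistics module, which computes the same sample variance:

import statistics

score = [90, 40, 50, 90, 65]
n = len(score)                          # count = 5
m = sum(x * x for x in score)           # sum of squares = 24525
k = sum(score) ** 2 / n                 # squared sum / count = 112225 / 5 = 22445
print((m - k) / (n - 1))                # (24525 - 22445) / 4 = 520.0
print(statistics.variance(score))       # 520 -- the sample variance, same result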


Python Program to find out Variance of a Data-Set


import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
   'Score':pd.Series([90,40,50,90,65])}
df=pd.DataFrame(d)
print("Data Frame is :")
print (df)
print ("Variance of Score")
print (df.var(numeric_only=True))    # numeric_only=True skips the non-numeric 'Name' column


OUTPUT:


Data Frame is :
     Name  Score
0   Disha     90
1   Rohit     40
2   Manas     50
3   Krish     90
4  Vishnu     65
Variance of Score
Score    520.0
dtype: float64

We can also use the variance function in the following ways:

print(df.loc[:,"Score"].var())              # variance of a specific column
print(df.var(axis=0, numeric_only=True))    # column-wise variance (numeric columns only)
print(df.var(axis=1, numeric_only=True))    # row-wise variance (numeric columns only)




Standard Deviation


Standard deviation measures the amount of variation or dispersion in a set of values. A low standard deviation means the values tend to be close to the mean of the set, while a high standard deviation means the values are spread out over a wider range.


Standard deviation is one of the most important concepts in finance. Finance and banking are largely about measuring risk, and standard deviation is a common measure of risk; it is used by portfolio managers to measure and track risk.

Steps to calculate the standard deviation:

  • Work out the Mean(the simple average of the numbers)
  • Then for each number:subtract the Mean and square the result
  • Then work out the sum of those squared differences and divide by (N-1)
  • Take the square root of that and we are done!


Example : 

To find the standard deviation of the dataset "Score" = [90, 40, 50, 90, 65] (n = 5):

Step 1: Work out the mean: (90 + 40 + 50 + 90 + 65) / 5 = 335 / 5 = 67

Step 2: For each number, subtract the mean and square the result:
       (90 - 67) =  23,   23 * 23  = 529
       (40 - 67) = -27,  -27 * -27 = 729
       (50 - 67) = -17,  -17 * -17 = 289
       (90 - 67) =  23,   23 * 23  = 529
       (65 - 67) =  -2,   -2 * -2  = 4
Step 3: Work out the sum of those squared differences and divide by (n - 1):
   (529 + 729 + 289 + 529 + 4) = 2080, and 2080 / (n - 1) = 2080 / 4 = 520 (this value is also known as the variance)

Step 4: Take the square root of the value found at Step 3 (the variance):

          The square root of 520 is about 22.80, i.e. the standard deviation.

Or

Note: We can also apply the steps for finding the variance and, after getting the variance, take its square root to obtain the standard deviation.
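A self-contained sketch of that shortcut, using the same Score values as in the example:

import math
import pandas as pd

score = pd.Series([90, 40, 50, 90, 65])
print(score.var())               # 520.0 -- sample variance (ddof=1)
print(math.sqrt(score.var()))    # square root of the variance...
print(score.std())               # ...equals the standard deviation (about 22.8035)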


              Python Program to find out Standard Deviation of a Data-Set


import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
   'Score':pd.Series([90,40,50,90,65])}
df=pd.DataFrame(d)
print("Data Frame is :")
print (df)
print ("Variance of Score")
print (df.var(numeric_only=True))    # numeric_only=True skips the non-numeric 'Name' column
print ("Standard Deviation of Score")
print (df.std(numeric_only=True))

OUTPUT:

Data Frame is :
     Name  Score
0   Disha     90
1   Rohit     40
2   Manas     50
3   Krish     90
4  Vishnu     65

Variance of Score
Score    520.0
dtype: float64

Standard Deviation of Score
Score    22.803509
dtype: float64



                                   DataFrame operations



Aggregation

Aggregation is the process of turning the values of a dataset (or a subset of it) into one single value. In other words, data aggregation is a multi-valued function: it requires multiple values and returns a single value as the result. There are a number of possible aggregations, such as count, sum, min, max, median, quantile etc. These (count, sum etc.) are the descriptive statistics and related operations on a DataFrame. Let us make this clear: if we have a DataFrame like

Data Frame is :
     Name  Score
0   Disha     90
1   Rohit     40
2   Manas     50
3   Krish     90
4  Vishnu     65


…then a simple aggregation would be to calculate the sum of Score, which is 90 + 40 + 50 + 90 + 65 = 335. A different aggregation would be to count the number of Name values, which is 5.


Program to show use of aggregate functions in Python DF

import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
   'Score':pd.Series([90,40,50,90,65])}
df=pd.DataFrame(d)
print("Data Frame is :")
print (df)
print ('---Basic Aggregate functions min(),max(),sum(),Avg()---')
print('Minimum Score is:',df['Score'].min())
print('Maximum  Score is:',df['Score'].max())
print('Sum of Score is:',df['Score'].sum())
print('Average of Score is:',df['Score'].mean())

OUTPUT:

Data Frame is :
     Name  Score
0   Disha     90
1   Rohit     40
2   Manas     50
3   Krish     90
4  Vishnu     65
---Basic Aggregate functions min(),max(),sum(),Avg()---
Minimum Score is: 40
Maximum  Score is: 90
Sum of Score is: 335
Average of Score is: 67.0

Group by

A group by operation involves one or more of the following steps on the data frame:

  • Splitting the data frame
  • Applying a function (usually an aggregate function)
  • Combining the result.
# PROGRAM TO GROUP THE DATA CITY-WISE AND FIND THE MAXIMUM AND MINIMUM TEMPERATURE FOR EACH CITY


import pandas as pd
data={
      'Date': ['1-1-2020','1-1-2020','1-2-2020','1-2-2020','1-3-2020','1-3-2020'],
      'City': ['Jaipur','Jaipur','Jodhpur','Jodhpur','Jaisalmer','Jaisalmer'],
      'Temp': [28,30,22,24,32,34],
      'Humidity': [60,55,80,70,90,85]
     }
df=pd.DataFrame(data)
print (df)
print('\n result after group operation')
print (df.groupby('City').max())
print (df.groupby('City').min())

OUTPUT:

     Date       City     Temp  Humidity
0  1-1-2020     Jaipur    28        60
1  1-1-2020     Jaipur    30        55
2  1-2-2020    Jodhpur    22        80
3  1-2-2020    Jodhpur    24        70
4  1-3-2020  Jaisalmer    32        90
5  1-3-2020  Jaisalmer    34        85

 result after group operation

               Date  Temp  Humidity
City                               
Jaipur     1-1-2020    30        60
Jaisalmer  1-3-2020    34        90
Jodhpur    1-2-2020    24        80
              
               Date  Temp  Humidity
City                               
Jaipur     1-1-2020    28        55
Jaisalmer  1-3-2020    32        85
Jodhpur    1-2-2020    22        70
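Several aggregates can also be computed per group in a single call with the agg() method of the groupby object; a brief sketch reusing the same weather data:

import pandas as pd

data={
      'Date': ['1-1-2020','1-1-2020','1-2-2020','1-2-2020','1-3-2020','1-3-2020'],
      'City': ['Jaipur','Jaipur','Jodhpur','Jodhpur','Jaisalmer','Jaisalmer'],
      'Temp': [28,30,22,24,32,34],
      'Humidity': [60,55,80,70,90,85]
     }
df=pd.DataFrame(data)
# minimum, maximum and mean temperature for every city in one call
print(df.groupby('City')['Temp'].agg(['min','max','mean']))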


SORTING


Sorting means arranging the contents in ascending or descending order. There are two kinds of sorting available in pandas (DataFrame):
1. By value (column)
2. By index
  • By value - Sorting a dataframe on the values of one or more columns is supported by the sort_values() method. We will cover three aspects of sorting the values of a dataframe:
  • Sort a pandas dataframe in ascending and descending order
  • Sort a pandas dataframe by a single column
  • Sort a pandas dataframe by multiple columns.
Sort the python pandas Dataframe by single column –Ascending order (Default)

import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
   'Age':pd.Series([16,17,15,14,19]),
   'Score':pd.Series([87,89,67,55,47])}
df=pd.DataFrame(d)
print("Data Frame before sorting is :")
print (df)
df=df.sort_values(by='Score')
print("Data Frame after sorting")
print(df)

OUTPUT


Data Frame before sorting is :
     Name  Age  Score
0   Disha   16     87
1   Rohit   17     89
2   Manas   15     67
3   Krish   14     55
4  Vishnu   19     47
Data Frame after sorting
     Name  Age  Score
4  Vishnu   19     47
3   Krish   14     55
2   Manas   15     67
0   Disha   16     87
1   Rohit   17     89


# In the above example a dictionary object is used to create the dataframe. The rows of df are sorted by the sort_values() method. As an argument we pass only 'Score' for the by parameter; by default the sort is in ascending order.

Sort the python pandas Data frame by single column –Descending order

import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
   'Age':pd.Series([16,17,15,14,19]),
   'Score':pd.Series([87,89,67,55,47])}
df=pd.DataFrame(d)
print("Data Frame before sorting is :")
print (df)
df=df.sort_values(by='Score',ascending=False)
print("Data Frame after sorting")

print(df)

OUTPUT

Data Frame before sorting is :
     Name  Age  Score
0   Disha   16     87
1   Rohit   17     89
2   Manas   15     67
3   Krish   14     55
4  Vishnu   19     47
Data Frame after sorting
     Name  Age  Score
1   Rohit   17     89
0   Disha   16     87
2   Manas   15     67
3   Krish   14     55
4  Vishnu   19     47


# In the above example a dictionary object is used to create the data frame. The rows of df are sorted by the sort_values() method. We pass ascending=False, which sorts the data in descending order of Score.


Sort the pandas Data frame by Multiple Columns

import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
   'Age':pd.Series([16,17,15,14,19]),
   'Score':pd.Series([87,89,67,55,47])}
df=pd.DataFrame(d)
print("Data Frame before sorting is :")
print (df)
df=df.sort_values(by=['Age','Score'],ascending=[True,False])
print("Data Frame after sorting")

print(df)


OUTPUT:

Data Frame before sorting is :
     Name  Age  Score
0   Disha   16     87
1   Rohit   17     89
2   Manas   15     67
3   Krish   14     55
4  Vishnu   19     47
Data Frame after sorting
     Name  Age  Score
3   Krish   14     55
2   Manas   15     67
0   Disha   16     87
1   Rohit   17     89
4  Vishnu   19     47


2. By index 

Sorting a data frame on its index is supported by the sort_index() method.

import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
   'Age':pd.Series([16,17,15,14,19]),
   'Score':pd.Series([87,89,67,55,47])}
df=pd.DataFrame(d)
df=df.reindex([1,4,3,2,0])
print("Data Frame before sorting is :")
print (df)
df1=df.sort_index()
print("Data Frame after sorting IN ASCENDING")

print(df1)

OUTPUT:

Data Frame before sorting is :
     Name  Age  Score
1   Rohit   17     89
4  Vishnu   19     47
3   Krish   14     55
2   Manas   15     67
0   Disha   16     87
Data Frame after sorting IN ASCENDING
     Name  Age  Score
0   Disha   16     87
1   Rohit   17     89
2   Manas   15     67
3   Krish   14     55
4  Vishnu   19     47


Sorting pandas data frame by index in descending order

import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
   'Age':pd.Series([16,17,15,14,19]),
   'Score':pd.Series([87,89,67,55,47])}
df=pd.DataFrame(d)
df=df.reindex([1,4,3,2,0])
print("Data Frame before sorting is :")
print (df)
df1=df.sort_index(ascending=False)
print("Data Frame after sorting descending order")

print(df1)

OUTPUT

Data Frame before sorting is :
     Name  Age  Score
1   Rohit   17     89
4  Vishnu   19     47
3   Krish   14     55
2   Manas   15     67
0   Disha   16     87
Data Frame after sorting descending order
     Name  Age  Score
4  Vishnu   19     47
3   Krish   14     55
2   Manas   15     67
1   Rohit   17     89
0   Disha   16     87



RENAMING INDEX

rename() method is used to rename the indexes (row labels) in a dataframe.

Syntax: df.rename(index=<dict of old:new labels>, inplace=<True/False, optional>)

Example: 

import pandas as pd
L=[{'First_Name': 'Manas', 'last_Name':'Purohit'},
   {'First_Name': 'Rohit', 'last_Name':'Soni'},
   {'First_Name': 'Disha', 'last_Name':'Bucha'},
   {'First_Name': 'Vishnu', 'last_Name':'Nair'},
   {'First_Name': 'Shubham', 'last_Name':'Parihar'}
   ]
df1=pd.DataFrame(L)
print (df1)
df=df1.rename(index={0:'Row1',1:'Row2',2:'Row3',3:'Row4',4:'Row5'})
print("Data Frame after renaming the indexes")

print(df)


OUTPUT:

      First_Name last_Name
0      Manas   Purohit
1      Rohit      Soni
2      Disha     Bucha
3     Vishnu      Nair
4    Shubham   Parihar

Data Frame after renaming the indexes

     First_Name last_Name
Row1      Manas   Purohit
Row2      Rohit      Soni
Row3      Disha     Bucha
Row4     Vishnu      Nair
Row5    Shubham   Parihar



DELETING  INDEX

The drop() method, chained after reset_index(), is used to delete rows (indexes) from a dataframe: reset_index() first turns the existing index into an ordinary column (and creates a fresh default index), and drop() then removes the rows with the given index labels.

Syntax: df.reset_index().drop(index=<list of labels>, inplace=<True/False, optional>)


Example: 

import pandas as pd
L=[{'First_Name': 'Manas', 'last_Name':'Purohit'},
   {'First_Name': 'Rohit', 'last_Name':'Soni'},
   {'First_Name': 'Disha', 'last_Name':'Bucha'},
   {'First_Name': 'Vishnu', 'last_Name':'Nair'},
   {'First_Name': 'Shubham', 'last_Name':'Parihar'}
   ]
df1=pd.DataFrame(L)
print (df1)
df=df1.reset_index().drop(index=[2,3])
print("Data Frame after deleting the indexes")

print(df)

OUTPUT:
      First_Name last_Name
0      Manas   Purohit
1      Rohit      Soni
2      Disha     Bucha
3     Vishnu      Nair
4    Shubham   Parihar

Data Frame after deleting the indexes

   index First_Name last_Name
0      0      Manas   Purohit
1      1      Rohit      Soni
4      4    Shubham   Parihar


                         



Pivoting - Data-frame


Pivoting: Pivoting is one of the important aspects of data analysis. It is used to summarize a large amount of data and lets us access important records from a large dataset.

Python Pandas provide two functions for pivoting:
1. pivot()
2. pivot_table()


pivot() allows us to transform or reshape a data frame based on column values, according to our perspective. It takes three arguments: index, columns and values.

Example: 
Program:

import pandas as pd
data={
        'Year': ['2018','2019','2018','2019','2018','2019'],
        'Team': ['MI','MI','RCB','RCB','CSK','CSK'],
        'Runs': ['2500','2650','2200','2400','2300','2700']
     }
df=pd.DataFrame(data)
print (df)
pv=pd.pivot(df,index='Year',columns='Team',values='Runs')

print(pv)

OUTPUT:

   Year Team  Runs
0  2018   MI  2500
1  2019   MI  2650
2  2018  RCB  2200
3  2019  RCB  2400
4  2018  CSK  2300
5  2019  CSK  2700

Team   CSK    MI   RCB
Year                  
2018  2300  2500  2200
2019  2700  2650  2400



pivot_table(): We know that the pivot() method takes at least two column names as parameters: index and columns. What happens if we have multiple rows with the same values for these columns? pivot() cannot handle duplicate entries and raises an error.

The pivot_table() method solves this problem. It works like pivot(), but it aggregates the values from rows with duplicate entries for the specified columns (i.e. it applies an aggregate function chosen by us).

By default, pivot_table() applies mean() to aggregate the values from rows with duplicate entries for the specified columns.

For Example:-

Program:

import pandas as pd
data={
      'Date': ['1-1-2019','1-1-2019','1-2-2019','1-2-2019','1-3-2019','1-3-2019'],
      'City': ['Delhi','Delhi','Mumbai','Mumbai','Chennai','Chennai'],
      'Temp': [28,30,22,24,32,34],
      'Humidity': [60,55,80,70,90,85]
     }
df=pd.DataFrame(data)
print (df)
pv=pd.pivot_table(df,index='City',values='Temp')

print (pv)

OUTPUT:

       Date     City  Temp  Humidity
0  1-1-2019    Delhi    28        60
1  1-1-2019    Delhi    30        55
2  1-2-2019   Mumbai    22        80
3  1-2-2019   Mumbai    24        70
4  1-3-2019  Chennai    32        90
5  1-3-2019  Chennai    34        85

         Temp
City         
Chennai    33
Delhi      29
Mumbai     23


Note: In the above program, pivot_table() applies the mean() function by default.



PROGRAM TO FIND OUT CITYWISE MAXIMUM TEMPERATURE:

import pandas as pd
data={
      'Date': ['1-1-2019','1-1-2019','1-2-2019','1-2-2019','1-3-2019','1-3-2019'],
      'City': ['Delhi','Delhi','Mumbai','Mumbai','Chennai','Chennai'],
      'Temp': [28,30,22,24,32,34],
      'Humidity': [60,55,80,70,90,85]
     }
df=pd.DataFrame(data)
print (df)
pv=pd.pivot_table(df,index='City',values='Temp',aggfunc='max')
print (pv)


Output:

       Date     City  Temp  Humidity
0  1-1-2019    Delhi    28        60
1  1-1-2019    Delhi    30        55
2  1-2-2019   Mumbai    22        80
3  1-2-2019   Mumbai    24        70
4  1-3-2019  Chennai    32        90
5  1-3-2019  Chennai    34        85

         Temp
City         
Chennai    34
Delhi      30
Mumbai     24



PROGRAM TO PIVOT THE DATA FRAME WITH DATE AS THE INDEX AND CITY AS THE COLUMNS

import pandas as pd
data={
       'Date': ['1-1-2019','1-1-2019','1-2-2019','1-2-2019','1-3-2019','1-3-2019'],
       'City':['Delhi','Delhi','Mumbai','Mumbai','Chennai','Chennai'],
       'Temp': [28,30,22,24,32,34],
       'Humidity': [60,55,80,70,90,85]
       }
df=pd.DataFrame(data)
print (df)
pv=pd.pivot_table(df,index='Date',columns='City')
print (pv)


OUTPUT


       Date     City  Temp  Humidity
0  1-1-2019    Delhi    28        60
1  1-1-2019    Delhi    30        55
2  1-2-2019   Mumbai    22        80
3  1-2-2019   Mumbai    24        70
4  1-3-2019  Chennai    32        90
5  1-3-2019  Chennai    34        85

               Humidity                 Temp             
City      Chennai Delhi Mumbai Chennai Delhi Mumbai
Date                                               
1-1-2019      NaN  57.5    NaN     NaN  29.0    NaN
1-2-2019      NaN   NaN   75.0     NaN   NaN   23.0
1-3-2019     87.5   NaN    NaN    33.0   NaN    NaN


HANDLING MISSING VALUES

Dropping and filling

Sometimes the data that we receive from various sources may not be perfect, which means some data may be missing. For example, in the following program the student name is missing in one row and the score is missing in another row:

import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas',np.nan,'Vishnu']),
   'Age':pd.Series([16,17,15,14,19]),
   'Score':pd.Series([87,89,67,55,np.nan])}
df=pd.DataFrame(d)

print(df)

OUTPUT:

     Name  Age  Score
0   Disha   16   87.0
1   Rohit   17   89.0
2   Manas   15   67.0
3     NaN   14   55.0
4  Vishnu   19    NaN


When we convert the data into a dataframe, the missing data is represented by NaN (Not a Number). NaN is the default marker for missing values.
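Before filling or dropping anything, it is often useful to count how many values are missing in each column; a small self-contained sketch using the same data as above:

import pandas as pd
import numpy as np

d={'Name':pd.Series(['Disha','Rohit','Manas',np.nan,'Vishnu']),
   'Age':pd.Series([16,17,15,14,19]),
   'Score':pd.Series([87,89,67,55,np.nan])}
df=pd.DataFrame(d)
# isnull() marks each missing cell as True; sum() counts them column-wise
print(df.isnull().sum())    # Name 1, Age 0, Score 1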


We can use the fillna() method to replace NaN or NA values in the above dataframe with a specified value.

# to fill the NaN values with 0

import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas',np.nan,'Vishnu']),
   'Age':pd.Series([16,17,15,14,19]),
   'Score':pd.Series([87,89,67,55,np.nan])}
df=pd.DataFrame(d)
df=df.fillna(0)

print(df)

OUTPUT:

     Name  Age  Score
0   Disha   16   87.0
1   Rohit   17   89.0
2   Manas   15   67.0
3       0   14   55.0
4  Vishnu   19    0.0


But this is not very useful, as it fills every type of column with 0. We can fill each column with a different value by passing the column name and the value to be used to fill that column.

For example, to fill the missing Name with 'XXXX' and the missing Score with '**', we supply these values as a dictionary to the fillna() method.

import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas',np.nan,'Vishnu']),
   'Age':pd.Series([16,17,15,14,19]),
   'Score':pd.Series([87,89,67,55,np.nan])}
df=pd.DataFrame(d)
df=df.fillna({'Name':'XXXX','Score':'**'})

print(df)

OUTPUT:

     Name  Age Score
0   Disha   16    87
1   Rohit   17    89
2   Manas   15    67
3    XXXX   14    55
4  Vishnu   19    **


If we do not want any missing data and want to remove the rows having NA or NaN values, we can use the dropna() method.

example:
import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas',np.nan,'Vishnu']),
   'Age':pd.Series([16,17,15,14,19]),
   'Score':pd.Series([87,89,67,55,np.nan])}
df=pd.DataFrame(d)
df=df.dropna()

print(df)


Output:

    Name  Age  Score
0  Disha   16   87.0
1  Rohit   17   89.0
2  Manas   15   67.0

*****************************************************

Assignments

Descriptive Statistics 

Quartiles

Standard Deviation

aggregation and sorting

Pivoting

Handling missing values

