Data Handling using Pandas -1
Python
Library – Pandas
It
is a most famous Python package for data science, which offers powerful and flexible
data structures that make data analysis and
manipulation easy.
Pandas makes data importing and data analyzing much
easier. Pandas builds on packages like NumPy and matplotlib
to
give us a single & convenient place for data analysis and
visualization work.
Pandas or Python Pandas is a library of Python which is used for data analysis.
•The term Pandas is derived from “Panel data system” , which is an ecometric term for multidimentioal, structured data set ecometrics.
•Now a days, Pandas has become a popular option for Data Analysis.
•Pandas provides various tools for data analysis in simpler form.
•Pandas is an Open Source, BSD library built for Python Programming language. BSD stands for Berkeley Software Distributions and BSD licensed code is freely-distributable with copyright and liability disclaimer
•Pandas offers high performance, easy to use data structure and data analysis tools.
•The main author of Pandas is Wes McKinney.
Pandas – Installation/Environment Setup
Pandas installation can be done in Standard Python distribution,using following steps:
There must be service pack installed on our computer if we are using windows.
If it is not installed then we will not be able to install pandas in existing Standard Python(which is already installed).So install it first(google it).
We can check it through properties option of my computer icon
Now move to script folder of python distribution in command prompt (through cmd command of windows).Execute following commands in command prompt serially.
>pip install numpy
>pip install six
>pip install pandas
Wait after each command for installation
Now we will be able to use pandas in standard python distribution.
Type import pandas as pd in python (IDLE) shell.
If it executed without error(it means pandas is installed on your system)Data Structures in Pandas
Two important data structures of pandas are–Series, DataFrame
1. Series
Series is like a one-dimensional array like structure with homogeneous data. For example, the following series is a collection of integers.
7 4 1 5 8
Basic feature of series are:-
- Homogeneous data
- Size Immutable
- Values of Data Mutable
2. DataFrame
DataFrame is like two dimensional Array with heterogeneous data.
ex-
1 2 3
4 5 6
Pandas Series
It is like one-dimensional array capable of holding data of any type (integer, string, float, python objects, etc.). Series can be created using constructor.
Syntax :- pandas.Series( data, index, dtype, copy) Creation of Series is also possible from – ndarray, dictionary, scalar value.
Series can be created using
1.Array
2.Dict
3.Scalar value or constant
4 5 6
Pandas Series
It is like one-dimensional array capable of holding data of any type (integer, string, float, python objects, etc.). Series can be created using constructor.
Syntax :- pandas.Series( data, index, dtype, copy) Creation of Series is also possible from – ndarray, dictionary, scalar value.
Series can be created using
1.Array
2.Dict
3.Scalar value or constant
Pandas Series
1. Create an Empty Series:-
e.g.
import pandas as pseries
s = pseries.Series()
print(s)
Output:
Series([], dtype: float64
e.g.
import pandas as pd1
import numpy as np1
data = np1.array(['a','b','c','d'])
s = pd1.Series(data) print(s)
Output
0 a
1 b
2 c
3 d
dtype: object
Note : default index is starting from 0
data = np1.array(['a','b','c','d'])
s = pd1.Series(data) print(s)
Output
0 a
1 b
2 c
3 d
dtype: object
Note : default index is starting from 0
With index position
e.g.
import pandas as p1
import numpy as np1
data = np1.array(['a','b','c','d'])
s = p1.Series(data,index=[100,101,102,103])
print(s)
Output
data = np1.array(['a','b','c','d'])
s = p1.Series(data,index=[100,101,102,103])
print(s)
Output
100 a
101 b
102 c
103 d
dtype: object
Note : index is starting from 100
3. Create a Series from dictionary
Example -1 (without index)
import pandas as pd1
import numpy as np1
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd1.Series(data)
print(s)
Output:
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd1.Series(data)
print(s)
Output:
a 0.0
b 1.0
c 2.0
dtype: float64
b 1.0
c 2.0
dtype: float64
Example 2 (with index)
import pandas as pd1
import numpy as np1
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd1.Series(data,index=['b','c','d','a'])
print(s)
Output
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd1.Series(data,index=['b','c','d','a'])
print(s)
Output
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
c 2.0
d NaN
a 0.0
dtype: float64
4. Create a Series from Scalar
e.g
import pandas as pd1
import pandas as pd1
import numpy as np1
s = pd1.Series(5, index=[0, 1, 2, 3]) print(s)
Output
s = pd1.Series(5, index=[0, 1, 2, 3]) print(s)
Output
0 5
1 5
2 5
3 5
dtype: int64 Note :- here 5 is repeated for 4 times (as per no of index)
1 5
2 5
3 5
dtype: int64 Note :- here 5 is repeated for 4 times (as per no of index)
5. Mathemetical Operation with Series
e.g.
import pandas as pd1
s = pd1.Series([1,2,3])
t = pd1.Series([1,2,4])
u=s+t #addition operation
t = pd1.Series([1,2,4])
u=s+t #addition operation
print (u)
u=s*t # multiplication operation
print (u)
Output # (addition)
0 2
1 4
2 7
dtype: int64
Output # (Multiplication)
0 1
1 4
2 12
dtype: int64
Head function
e.g
import pandas as pd1
s = pd1.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
import pandas as pd1
s = pd1.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print (s.head(3))
Output
a 1
b. 2
c. 3
dtype: int64
#Return first 3 elements
Output
a 1
b. 2
c. 3
dtype: int64
#Return first 3 elements
Tail function
e.g
import pandas as pd1
s = pd1.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
import pandas as pd1
s = pd1.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print (s.tail(3))
Output
Output
c 3
d. 4
e. 5
dtype: int64
#Return last 3 elements
d. 4
e. 5
dtype: int64
#Return last 3 elements
Accessing Data from Series with indexing and slicing
import pandas as pd1
s = pd1.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print (s[0]) # for 0 index position
print (s[:3]) #for first 3 index values
print (s[-3:]) # slicing for last 3 index values
Output
1
a.1
b.2
c.3
dtype: int64
c.3
d.4
e.5
dtype: int64
Retrieve Data Using Label as (Index)
import pandas as pd1
s = pd1.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print (s[['c','d']])
Output
c 3
d 4
dtype: int64
Retrieve Data from selection
There are three methods for data selection:
- loc gets rows (or columns) with particular labels from the index.
- iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
- ix usually tries to behave like loc but falls back to behaving like iloc if a label is not present in the index. ix is deprecated and the use of loc and iloc is encouraged instead
Retrieve Data from selection
e.g.
import pandas as pd
>>> s = pd.Series(np.nan, index=[49,48,47,46,45, 1, 2, 3, 4, 5])
>>> s.iloc[:3] # slice the first three rows
output:
49 NaN
48 NaN
47 NaN
>>> s.loc[:3] # slice up to and including label 3
output:
49 NaN
48 NaN
47 NaN
46 NaN
45 NaN
1 NaN
2 NaN
3 NaN
>>> s.ix[:3] # the integer is in the index so s.ix[:3] works like loc
output:
49 NaN
48 NaN
47 NaN
46 NaN
45 NaN
1 NaN
2 NaN
3 NaN
Pandas DataFrame
It is a two-dimensional data structure, just like any table (with rows &columns).
Basic Features of DataFrame
- Columns may be of different types
- Size can be changed(Mutable)
- Labeled axes (rows / columns)
- Arithmetic operations on rows and columns
Create DataFrame
It can be created with followings
- Lists
- dict
- Series
- Numpy ndarrays
- Another DataFrame
Create an Empty DataFrame
import pandas as pd1
df1 = pd1.DataFrame()
print(df1)
Output
Empty Data Frame
Columns: [ ]
Index: [ ]
Create a DataFrame from Lists
example-1
import pandas as pd1
data1 = [1,2,3,4,5]
df1 = pd1.DataFrame(data1)
print (df1)
Output
0 1
1 2
2 3
3 4
4 5
example-2
import pandas as pd1
data1 = [['Disha',10],['Rohit',12],['Vishnu',13]]
df1 = pd1.DataFrame(data1,columns=['Name','Age'])
print (df1)
Output
Name Age
1 Disha 10
2 Rohit 12
3 Vishnu 13
Create a DataFrame from Dict of ndarrays / Lists
e.g.1
import pandas as pd1
data1 = {'Name':['Disha','Manas'],'Age':[9,10]}
data1 = {'Name':['Disha','Manas'],'Age':[9,10]}
df1 = pd1.DataFrame(data1)
print (df1)
Output
Name Age
1 Disha 9
2 Manas 10
Create a DataFrame from List of Dicts
import pandas as pd1
data1 = [{'x': 1, 'y': 2},{'x': 5, 'y': 4, 'z': 5}]
df1 = pd1.DataFrame(data1)
print (df1)
Output
x y z
0 1 2 NaN
1 5 4 5.0
Create a DataFrame from Dict of Series
e.g.1
import pandas as pd1
d1 = {'one' : pd1.Series([1, 2, 3], index=['a', 'b', 'c']), 'two' : pd1.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df1 = pd1.DataFrame(d1) print (df1)
Output
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Pandas DataFrame Column addition
print (df)
c = [7,8,9]
df[‘C'] = c # adding a new column named as 'C'
print (df)
Output: (Before adding column C)
A B
0 1 4
1 2 5
2 3 6
Output: (After adding column C)
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Column Deletion
we can delete column using three functions:-
1- del function is used with [ ] bracket without dot(.) operator.
2- pop function is used with ( ) bracket with dot(.) operator.
3- drop function is used with () bracket with axes.
Note: example associated with previous created dataframe df.
del df['C'] # Deleting the last column using DEL function
print (df)
Output:
A B
0 1 4
1 2 5
2 3 6
df.pop('B') #Deleting another column using POP function
Output:
A
0 1
1 2
2 3
import pandas as pd
df = pd.DataFrame({"A":[1, 2, 3],"B":[4, 5, 6]})
print (df)
Output:
A B
0 1 4
1 2 5
2 3 6
df.drop('A',axis=1)
Output:
B
0 4
1 5
2 6
Note#Deleting another column using drop function axis=1 means to delete data column-wise if we use axis=0 means to delete data row-wise with given index no.
Rename columns
>>> df.rename(columns={"A": "a", "B": "c"})
Output:
a c
0 1 4
1 2 5
2 3 6
Row Selection, Addition, and Deletion
import pandas as pd1 d1 = {'one' : pd1.Series([1, 2, 3],
index=['a', 'b', 'c']), 'two' : pd1.Series([1, 2, 3, 4],
index=['a', 'b', 'c', 'd'])}
df1 = pd1.DataFrame(d1)
print (df1.loc['b'])
Output
one 2.0
two 2.0
Name: b, dtype: float64
#Selection by integer location
import pandas as pd1
d1 = {'one':pd1.Series([1,2,3],index=['a','b','c']),
'two':pd1.Series([1,2,3,4],index=['a','b','c','d'])}
df1 = pd1.DataFrame(d1)
print (df1.iloc[2])
Output
one 3.0
two 3.0
Name: c, dtype: float64
Slice Rows : Multiple rows can be selected using ":" operator.
print (df1[2:4])
Addition of Rows
df1 = pd1.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd1.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df1 = df1.append(df2)
print (df1)
Output
a b
0 1 2
1 3 4
0 5 6
1 7 8
Deletion of Rows
df1 = df1.drop(0)
print (df1)
#output after deletion of first row of both colomn (0 label)
a b
1 3 4
1 7 8
Iteration on Rows and Columns
If we want to access record or data from a data frame row wise or column wise then iteration is used. Pandas provide 2 functions to perform iterations-
1.iterrows ()
2.iteritems ()
iterrows()
It is used to access the data row wise.
Example-
import pandas as pd
L=[{'First_Name': 'Birbal', 'last_Name':'Jat'},
{'First_Name': 'Vihaan', 'last_Name':'Chodhary'}]
df1=pd.DataFrame(L)
print (df1)
for (row_index,row_value) in df1.iterrows():
print('\n Row index is ::',row_index)
print('Row value is ::')
print(row_value)
First_Name last_Name
0 Birbal Jat
1 Vihaan Chodhary
Row index is :: 0
Row value is ::
First_Name Birbal
last_Name Jat
Name: 0, dtype: object
Row index is :: 1
Row value is ::
First_Name Vihaan
last_Name Chodhary
Name: 1, dtype: object
iteritems()
It is used to access the data column wise.
Example-
import pandas as pd
L=[{'First_Name': 'Birbal', 'last_Name':'Jat'},
{'First_Name': 'Vihaan', 'last_Name':'Chodhary'}]
df1=pd.DataFrame(L)
print (df1)
for (col_name,col_value) in df1.iteritems():
print('\n')
print('\n column Name is ::',col_name)
print('Column Value are ::')
print(col_value)
Output:
First_Name last_Name
0 Birbal Jat
1 Vihaan Chodhary
column Name is :: First_Name
Column Value are ::
0 Birbal
1 Vihaan
Name: First_Name, dtype: object
column Name is :: last_Name
Column Value are ::
0 Jat
1 Chodhary
Name: last_Name, dtype: object
Head & Tail
head() returns the first n rows (observe the index values). The default number of elements to display is 5, but you may pass a custom number. tail() returns the last n rows .
import pandas as pd
import numpy as np
d={'Name':pd.Series(['Disha','Rohit','Manas','Krish','Vishnu']),
'Age':pd.Series([17,18,19,18,19]),'Score':pd.Series([9,8,9,8,9])}
df=pd.DataFrame(d)
print("Data Frame is:")
print (df)
print("The first two rows of the DataFrame is (use of head function )")
print (df.head(2))
print("The last two rows of the DataFrame is (use of tail function )")
print (df.tail(2))
print("Result of showing 2nd to 4th row ")
print (df[2:5])
OUTPUT:
Data Frame is:
Name Age Score
0 Disha 17 9
1 Rohit 18 8
2 Manas 19 9
3 Krish 18 8
4 Vishnu 19 9
The first two rows of the DataFrame is (use of head function )
Name Age Score
0 Disha 17 9
1 Rohit 18 8
The last two rows of the DataFrame is (use of tail function )
Name Age Score
3 Krish 18 8
4 Vishnu 19 9
Result of showing 2nd to 4th row
Name Age Score
2 Manas 19 9
3 Krish 18 8
4 Vishnu 19 9
Boolean Indexing in Data Frame
In order to access a dataframe with a boolean index, we have to create a dataframe in which index of dataframe contains a boolean value that is “True” or “False”.
import pandas as pd
import numpy as np
d={'Name':['Disha','Rohit','Manas','Krish','Vishnu'],
'Age':[17,18,19,18,19],'Score':[9,8,9,8,9]}
df=pd.DataFrame(d,index=[True,False,True,False,True])
print("Data Frame is:")
print (df)
print("Accessing a DataFrame using .loc[] with index True ")
print (df.loc[True])
print("Accessing a DataFrame using .loc[] with index False ")
print (df.loc[False])
OUTPUT:
Data Frame is:
Name Age Score
True Disha 17 9
False Rohit 18 8
True Manas 19 9
False Krish 18 8
True Vishnu 19 9
Accessing a DataFrame using .loc[] with index True
Name Age Score
True Disha 17 9
True Manas 19 9
True Vishnu 19 9
Accessing a DataFrame using .loc[] with index False
Name Age Score
False Rohit 18 8
False Krish 18 8
Concat operation in data frame
Pandas
provides various facilities for easily
combining together
Series, DataFrame.
pd.concat(objs,
axis=0, join='inner',
join_axes=None,ignore_index=False
·objs
−
This
is
a
sequence
or
mapping
of
Series,
DataFrame,
or Panel
objects.
·axis
−
{0, 1, ...},
default 0. This is the axis to
concatenate
along.
·join −
{‘inner’,
‘outer’},
.
How
to
handle
indexes
on other
axis(es). Outer for union and inner for
intersection.
By default, the Pandas merge operation acts with an “inner” merge. An inner merge, (or inner join) keeps only the common values in both the left and right dataframes for the result.
By default, the Pandas merge operation acts with an “inner” merge. An inner merge, (or inner join) keeps only the common values in both the left and right dataframes for the result.
·ignore_index
−
boolean, default False. If True, do not use the index
values on the
concatenation
axis. The resulting axis will
be labeled
0, ..., n -
1.
·join_axes
−
This is the list of
Index objects.
Specific indexes to use
for the other (n-1)
axes instead
of performing inner/outer set
logic.
The
Concat() performs concatenation operations along an
axis
example 1:
import pandas as pd
d1={'RollNo':['101','102','103','104'],
'Name':['Disha','Manas','Rohit','Sarthak'],
'Score':['95','92','93','91']}
d2={'RollNo':['105','106','107','108'],
'Name':['Krishan','Gautam','Shubham','Vishnu'],
'Score':['93','92','91','95']}
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
df3=pd.concat([df1,df2])
print (df3)
OUTPUT:
RollNo Name Score
0 101 Disha 95
1 102 Manas 92
2 103 Rohit 93
3 104 Sarthak 91
0 105 Krishan 93
1 106 Gautam 92
2 107 Shubham 91
3 108 Vishnu 95
If you want the row labels to adjust automatically according to the join, you will have to set the argument ignore_index as True while calling the concat() function:
Example-2
import pandas as pd
d1={'RollNo':['101','102','103','104'],
'Name':['Disha','Manas','Rohit','Sarthak'],
'Score':['95','92','93','91']}
d2={'RollNo':['105','106','107','108'],
'Name':['Krishan','Gautam','Shubham','Vishnu'],
'Score':['93','92','91','95']}
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
df3=pd.concat([df1,df2],ignore_index=True)
print (df3)
OUTPUT:
RollNo Name Score
0 101 Disha 95
1 102 Manas 92
2 103 Rohit 93
3 104 Sarthak 91
4 105 Krishan 93
5 106 Gautam 92
6 107 Shubham 91
7 108 Vishnu 95
pandas also
provides you with
an option
to label
|
|
the DataFrames, after
the
concatenation, with
|
|
a key so that you may know which data came
|
|
from which DataFrame.
|
EXAMPLE-3
import pandas as pd
d1={'RollNo':['101','102','103','104'],
'Name':['Disha','Manas','Rohit','Sarthak'],
'Score':['95','92','93','91']}
d2={'RollNo':['105','106','107','108'],
'Name':['Krishan','Gautam','Shubham','Vishnu'],
'Score':['93','92','91','95']}
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
merge={'Data1':df1,'Data2':df2}
df3=pd.concat(merge)
print (df3)
OUTPUT:
RollNo Name Score
Data1 0 101 Disha 95
1 102 Manas 92
2 103 Rohit 93
3 104 Sarthak 91
Data2 0 105 Krishan 93
1 106 Gautam 92
2 107 Shubham 91
3 108 Vishnu 95
To concatenate DataFrames along column, you can specify
the axis parameter as 1.
EXAMPLE-4
import pandas as pd
d1={'RollNo':['101','102','103','104'],
'Name':['Disha','Manas','Rohit','Sarthak'],
'Score':['95','92','93','91']}
d2={'RollNo':['105','106','107','108'],
'Name':['Krishan','Gautam','Shubham','Vishnu'],
'Score':['93','92','91','95']}
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
df3=pd.concat([df1,df2],axis=1)
print (df3)
RollNo Name Score RollNo Name Score
0 101 Disha 95 105 Krishan 93
1 102 Manas 92 106 Gautam 92
2 103 Rohit 93 107 Shubham 91
3 104 Sarthak 91 108 Vishnu 95
Merge operation in data
frame
Two Data Frames might hold different kinds of information about the same entity and linked by some common feature/column. To join these Data Frames, pandas provides multiple functions like merge(), join() etc.
Example-1
import pandas as pd
d1={'RollNo':['101','102','103','104'],
'Name':['Disha','Manas','Rohit','Sarthak'],
'Score':['95','92','93','91']}
d2={'RollNo':['105','106','107','108'],
'Name':['Krishan','Gautam','Shubham','Vishnu'],
'Score':['93','92','91','95']}
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
df3=pd.merge(df1,df2,on='Score')
print (df3)
OUTPUT:
RollNo_x Name_x Score RollNo_y Name_y
0 101 Disha 95 108 Vishnu
1 102 Manas 92 106 Gautam
2 103 Rohit 93 105 Krishan
3 104 Sarthak 91 107 Shubham
Join operation in data frame
1.Full
Outer
Join:-
|
The
full outer join combines the
results of both
|
|
the
left and the
right outer joins. The joined data frame will
contain all
|
||
records from both the data frames and fill in NaNs for
missing
|
||
matches
on either side. You can
perform
a full outer join by specifying
|
||
the
how argument as outer in merge() function.
|
import pandas as pd
d1={'RollNo':['101','102','103','104'],
'Name':['Disha','Manas','Rohit','Sarthak'],
'Score':['95','92','93','91']}
d2={'RollNo':['105','106','107','108'],
'Name':['Krishan','Gautam','Shubham','Vishnu'],
'Score':['93','92','91','95']}
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
df3=pd.merge(df1,df2,on='RollNo',how='outer')
print (df3)
Output:
RollNo Name_x Score_x Name_y Score_y
0 101 Disha 95 NaN NaN
1 102 Manas 92 NaN NaN
2 103 Rohit 93 NaN NaN
3 104 Sarthak 91 NaN NaN
4 105 NaN NaN Krishan 93
5 106 NaN NaN Gautam 92
6 107 NaN NaN Shubham 91
7 108 NaN NaN Vishnu 95
2.Inner Join: The inner
join produce only those records
that match
in
both
the
data
frame.
You
have
to
pass
inner
in
how
argument inside
merge()
function
import pandas as pd
d1={'RollNo':['101','102','103','104'],
'Name':['Disha','Manas','Rohit','Sarthak'],
'Score':['95','92','93','91']}
d2={'RollNo':['105','106','107','108'],
'Name':['Krishan','Gautam','Shubham','Vishnu'],
'Score':['93','92','91','95']}
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
df3=pd.merge(df1,df2,on='Score',how='inner')
print (df3)
OUTPUT;
RollNo_x Name_x Score RollNo_y Name_y0 101 Disha 95 108 Vishnu
1 102 Manas 92 106 Gautam
2 103 Rohit 93 105 Krishan
3 104 Sarthak 91 107 Shubham
3. Right
Join
:-
|
The right join produce a
complete
set of records
|
|
from
data frame B(Right side Data Frame) with the matching records
|
||
(where available) in data frame
A( Left
side
data frame).
If there
is no
|
||
match
right
side
will
contain
null.
You have
to
pass
right in how
|
||
argument inside merge() function.
|
import pandas as pd
d1={'RollNo':['101','102','103','104'],
'Name':['Disha','Manas','Rohit','Sarthak'],
'Score':['95','92','93','91']}
d2={'RollNo':['105','106','107','108'],
'Name':['Krishan','Gautam','Shubham','Vishnu'],
'Score':['93','92','91','95']}
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
df3=pd.merge(df1,df2,on='RollNo',how='right')
print (df3)
OUTPUT:
RollNo Name_x Score_x Name_y Score_y
0 105 NaN NaN Krishan 93
1 106 NaN NaN Gautam 92
2 107 NaN NaN Shubham 91
3 108 NaN NaN Vishnu 95
4. Left Join :-
The right join produce a complete set of records from data frame A(Left side Data Frame) with the matching records (where available) in data frame B ( Right side data frame). If there is no match
left side will contain null. You have to pass left in how argument inside merge() function.
import pandas as pd
d1={'RollNo':['101','102','103','104'],
'Name':['Disha','Manas','Rohit','Sarthak'],
'Score':['95','92','93','91']}
d2={'RollNo':['105','106','107','108'],
'Name':['Krishan','Gautam','Shubham','Vishnu'],
'Score':['93','92','91','95']}
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
df3=pd.merge(df1,df2,on='RollNo',how='left')
print (df3)
OUTPUT:
RollNo Name_x Score_x Name_y Score_y
0 101 Disha 95 NaN NaN
1 102 Manas 92 NaN NaN
2 103 Rohit 93 NaN NaN
3 104 Sarthak 91 NaN NaN
import pandas as pd
d1={'RollNo':['101','102','103','104'],
'Name':['Disha','Manas','Rohit','Sarthak'],
'Score':['95','92','93','91']}
d2={'RollNo':['105','106','107','108'],
'Name':['Krishan','Gautam','Shubham','Vishnu'],
'Score':['93','92','91','95']}
df1=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
df3=pd.merge(df1,df2,right_index=True, left_index=True)
print (df3)
OUTPUT:
RollNo_x Name_x Score_x RollNo_y Name_y Score_y
0 101 Disha 95 105 Krishan 93
1 102 Manas 92 106 Gautam 92
2 103 Rohit 93 107 Shubham 91
3 104 Sarthak 91 108 Vishnu 95
Importing/Exporting Data between CSV files and Data Frames.
CSV File: Introduction
A
CSV is a comma separated values file, which allows data
to be
saved in a tabular format. CSV is a simple file
such as a
spreadsheet or database.
Files in the csv format can be imported
and exported from programs that store data
in tables, such
as Microsoft excel or Open
Office.
CSV
files data
fields are
most often separated, or delimited
by a comma. Here
the data
in each
row
are delimited by comma and individual
rows are separated by
newline.
To
create a
csv file, first choose your favorite
text editor such as- Notepad
and open a
new file. Then enter the
text data
you want
the file
to contain, separating each value
with a
comma and each
row with
a new
line.
Save the file
with the
extension.csv. You can open the file
using MS
Excel
or another spread sheet program. It
will create the table of similar
data.
Advantages of CSV format-
- A simple, compact and ubiquitous format for data storage.
- A common format for data interchange.
- It can be opened in popular spredsheet packages like MS-EXCEL etc.
- Nearly all spredsheets and databases support import/export to csv format.
Reading from a CSV file to DataFrame
import pandas as pd
Open an MS-Excel file and enter some data in it and save <file_name>.csv (with .csv extension). give correct path in Python Program:
Now we will create a DataFrame like this:
import pandas as pd
df=pd.read_csv("e:\student.csv")
print (df)
RollNo Name Class Marks
0 101 Disha 12-A 95
1 102 Manas 12-A 94
2 103 Rohit 12-A 94
3 104 Gautam 12-B 95
4 105 Krishna 12-B 94
USE OF HEADER as None
import pandas as pd
df=pd.read_csv("e:\student.csv", header=None)
print (df)
OUTPUT:
0 1 2 3
0 RollNo Name Class Marks
1 101 Disha 12-A 95
2 102 Manas 12-A 94
3 103 Rohit 12-A 94
4 104 Gautam 12-B 95
5 105 Krishna 12-B 94
Reading selected lines from CSV file Use of nrows ()
import pandas as pd
df=pd.read_csv("e:\student.csv",nrows=3)
print (df)
OUTPUT:
RollNo Name Class Marks
0 101 Disha 12-A 95
1 102 Manas 12-A 94
2 103 Rohit 12-A 94
Exporting data from data frame to CSV File
To export a data frame into a csv file first of all, we create a dataframe say df1 and use dataframe.to_csv ( E:\student1.csv ’ ) method to export data -frame df1 into csv file student1.csv.
import pandas as pd
d={'Name':['Disha','Rohit','Manas','Krish','Vishnu'],
'Age':[17,18,19,18,19],
'Score':[9,8,9,8,9]}
df1=pd.DataFrame(d)
df1.to_csv("e:\student1.csv")
now a MS-Excel file will be created at given path i.e. in E drive of computer system.
OUTPUT:
We can change comma separator with other character using "sep"
import pandas as pd
d={'Name':['Disha','Rohit','Manas','Krish','Vishnu'],
'Age':[17,18,19,18,19],
'Score':[9,8,9,8,9]}
df1=pd.DataFrame(d)
df1.to_csv("e:\student1.csv",sep='#')
TO see the output open "student1.csv" file with Notepad
OUTPUT
#Name#Age#Score
0#Disha#17#9
1#Rohit#18#8
2#Manas#19#9
3#Krish#18#8
4#Vishnu#19#9
assignment section for Revision Tour
Assignment section for Data Handling with Pandas-I
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.