Pandas :Quick Reference

Pandas is a powerful open-source library in Python specifically designed for data analysis and manipulation. It excels at working with tabular data, similar to what you might see in spreadsheets or SQL tables.

Pandas: Installation, import and Check Version

install pand


c:\>pip install pandas

import & check version


import pandas as pd
print(pd.__version__)

Pandas Series

A Pandas Series is like a column in a table. It is a one-dimensional array holding data of any type.


import pandas as pd

data = [10,20,30]
mydata = pd.Series(data)
print(mydata)

Accessing value from series by index


import pandas as pd

data = [10,20,30]
mydata = pd.Series(data)
print(mydata[0]) #return 10

Accessing value from series by labels

Instead of index, you can create your labels


import pandas as pd

data = [10,20,30]
mydata = pd.Series(data, index=['x','y','z'])
print(mydata['x']) #return 10

Pandas series from dictionary

Creating pandas series from dictionary.


import pandas as pd

calories = {"day1":340,"day2":310, "day3":300}

x = pd.Series(calories)

print(x["day1"]) # print 340

x = pd.Series(calories, index =["day1","day2"])
#print(x["day3"]) #error
print(x["day2"]) #print 310

Pandas DataFrame

A Pandas DataFrame is a like a table with rows and columns. It is a 2 dimensional data structure like a 2 dimensional array.

DataFrame using Pandas Series

Creating a dataframe using pandas series


import pandas as pd

data = {
    "calories":[349,203,203],
    "duration":[40,20,10]
}

df = pd.DataFrame(data)
print(df)

Fetch row

To fetch row from Pandas Dataframe, you have to use "loc" attribute to return 1 or more rows.


import pandas as pd

data = {
    "calories":[349,203,203],
    "duration":[40,20,10]
}

df = pd.DataFrame(data)

#fetch row
print(df.loc[0]) 
#return 349 , 40 first item each series

#list of indexes
print(df.loc[[0,1]]) 
#return as dataframe values of index 0 and 1

Named Indexes

Below code add a list of names to give each row a name.


import pandas as pd

data = {
    "calories":[349,203,203],
    "duration":[40,20,10]
}

df = pd.DataFrame(data, index=["day1","day2","day3"])
print(df)        

Fetch Locate Indexes

Using "loc" attribute to return specified rows.


import pandas as pd

data = {
    "calories":[349,203,203],
    "duration":[40,20,10]
}

df = pd.DataFrame(data, index=["day1","day2","day3"])
print(df.loc["day1"]) #return first row  

Read CSV file

Use pandas read_csv function to load CSV file into DataFrame


import pandas as pd

data = pd.read_csv("data.csv")

print(data)

DataFrame Max Rows

Pandas will return only first 5 rows and last 5 rows.

To check the system maximum rows, use max_rows statement.

In order to increase the maximum rows to display the all Dataframe, you can set the value of max_rows .


import pandas as pd

data = pd.read_csv("data.csv")

print(data)

print(pd.options.display.max_rows)

pd.options.display.max_rows = 9999

print(data)

print(pd.options.display.max_rows)

Read JSON files

Use pandas read_json function to load JSON file into DataFrame


import pandas as pd

pd.options.display.max_rows = 9999
data = pd.read_json("data.json")

print(data)

Loading Python Dictionary as JSON into DataFrame


import pandas as pd

data_dict = {
        "roll" :{"0":101,"1":102,"2":103,"3":104},
        "math" :{"0":34,"1":23,"2":44,"3":32},
        "english":{"0":34,"1":23,"2":45,"3":33}        
}

df = pd.DataFrame(data_dict)
print(df)

head method

The head() method returns the headers and specified number of rows from top.


import pandas as pd

data = pd.read_json("data.json")

df = pd.DataFrame(data)

print(df.head(10))

tail method

The tail() method returns the headers and specified number of rows from bottom.


import pandas as pd

data = pd.read_json("data.json")

df = pd.DataFrame(data)

print(df.tail(10))

info method

info method give information about the data set. like


import pandas as pd

data = pd.read_json("data.json")

df = pd.DataFrame(data)

print(df.info())

Cleaning Data

Fixing bad data like, empty cells, data in wrong format, wrong data & duplicates.

Removing Rows - Empty Cells

use "dropna" method to remove the empty cells. It return a new DataFrame by default. use parameter inplace="True" to change in original dataframe.


import pandas as pd

data = pd.read_json("data1.json")

df = pd.DataFrame(data)

new_df = df.dropna() # return new dataframe

print(new_df)

df.dropna(inplace=True) #remove from orignal df

print(df)

Replace Empty Values

"fillna" method allows us to replace the empty cells with values.


import pandas as pd

data = pd.read_json("data1.json")

df = pd.DataFrame(data)

df.fillna(120, inplace=True)

print(df)

# replace only for specified row

df['marks'].fillna(50,inplace=True)

print(df)

Replace using mean(), median() and mode()

calculate mean and replace the empty value with it.


import pandas as pd

df = pd.read_json("data1.json")

x = df['marks'].mean()

df['marks'].fillna(x, inplace=True)

print(df)

x = df['marks'].median()

df['marks'].fillna(x,inplace=True)

print(df)

x= df['marks'].mode()[0]
df['marks'].fillna(x,inplace=True)
print(df)

convert dataframe column datatype

Below is the example of converting data type of column to datetime


import pandas as pd

df = pd.read_json("data1.json")

df['Date'] = pd.to_datetime(df['Date'])

print(df)

removing rows with null value of specified column


import pandas as pd

df = pd.read_json("data1.json")

df.dropna(subset=['Date'], inplace=True) #remove empty cell Date rows

print(df)

Replacing values

You can change the value of cell by specifying row index and column name.


import pandas as pd

df = pd.read_json("data1.json")

df.loc[7,"marks"] = 50

print(df)

Iterating DataFrame column


import pandas as pd

df = pd.read_json("data1.json")

for r in df.index
    if df.loc[x,"marks"] > 50:
        df.loc[x,"marks"] = 50

# removing rows

for r in df.index
    if df.loc[x, "marks"] > 50:
        df.drop(x, inplace = True)

Removing Duplicates

duplicated() method return True for every row that is duplicate

drop_duplicates() method remove the duplicates rows


import pandas as pd

df = pd.read_json("data1.json")

print(df.duplicated())

print(df.drop_duplicates(inplace = True))

Correlations

The corr() method calculates the relationship between each column in your data set.


import pandas as pd

df = pd.read_json("data1.json")

df.corr()

Quick Reference


abs() --> Return a DataFrame with the absolute value of each value
add() --> Adds the values of a DataFrame with the specified value(s)
add_prefix() --> Prefix all labels
add_suffix() --> Suffix all labels
agg() --> Apply a function or a function name to one of the axis of the DataFrame
aggregate() --> Apply a function or a function name to one of the axis of the DataFrame
align() --> Aligns two DataFrames with a specified join method
all() --> Return True if all values in the DataFrame are True, otherwise False
any() --> Returns True if any of the values in the DataFrame are True, otherwise False
append() --> Append new columns
applymap() --> Execute a function for each element in the DataFrame
apply() --> Apply a function to one of the axis of the DataFrame
assign() --> Assign new columns
astype() --> Convert the DataFrame into a specified dtype
at --> Get or set the value of the item with the specified label
axes --> Returns the labels of the rows and the columns of the DataFrame
bfill() --> Replaces NULL values with the value from the next row
bool() --> Returns the Boolean value of the DataFrame
columns --> Returns the column labels of the DataFrame
combine() --> Compare the values in two DataFrames, and let a function decide which values to keep
combine_first() --> Compare two DataFrames, and if the first DataFrame has a NULL value, it will be filled with the respective value from the second DataFrame
compare() --> Compare two DataFrames and return the differences
convert_dtypes() --> Converts the columns in the DataFrame into new dtypes
corr() --> Find the correlation (relationship) between each column
count() --> Returns the number of not empty cells for each column/row
cov() --> Find the covariance of the columns
copy() --> Returns a copy of the DataFrame
cummax() --> Calculate the cumulative maximum values of the DataFrame
cummin() --> Calculate the cumulative minmum values of the DataFrame
cumprod() --> Calculate the cumulative product over the DataFrame
cumsum() --> Calculate the cumulative sum over the DataFrame
describe() --> Returns a description summary for each column in the DataFrame
diff() --> Calculate the difference between a value and the value of the same column in the previous row
div() --> Divides the values of a DataFrame with the specified value(s)
dot() --> Multiplies the values of a DataFrame with values from another array-like object, and add the result
drop() --> Drops the specified rows/columns from the DataFrame
drop_duplicates() --> Drops duplicate values from the DataFrame
droplevel() --> Drops the specified index/column(s)
dropna() --> Drops all rows that contains NULL values
dtypes --> Returns the dtypes of the columns of the DataFrame
duplicated() --> Returns True for duplicated rows, otherwise False
empty --> Returns True if the DataFrame is empty, otherwise False
eq() --> Returns True for values that are equal to the specified value(s), otherwise False
equals() --> Returns True if two DataFrames are equal, otherwise False
eval --> Evaluate a specified string
explode() --> Converts each element into a row
ffill() --> Replaces NULL values with the value from the previous row
fillna() --> Replaces NULL values with the specified value
filter() --> Filter the DataFrame according to the specified filter
first() --> Returns the first rows of a specified date selection
floordiv() --> Divides the values of a DataFrame with the specified value(s), and floor the values
ge() --> Returns True for values greater than, or equal to the specified value(s), otherwise False
get() --> Returns the item of the specified key
groupby() --> Groups the rows/columns into specified groups
gt() --> Returns True for values greater than the specified value(s), otherwise False
head() --> Returns the header row and the first 10 rows, or the specified number of rows
iat --> Get or set the value of the item in the specified position
idxmax() --> Returns the label of the max value in the specified axis
idxmin() --> Returns the label of the min value in the specified axis
iloc --> Get or set the values of a group of elements in the specified positions
index --> Returns the row labels of the DataFrame
infer_objects() --> Change the dtype of the columns in the DataFrame
info() --> Prints information about the DataFrame
insert() --> Insert a column in the DataFrame
interpolate() --> Replaces not-a-number values with the interpolated method
isin() --> Returns True if each elements in the DataFrame is in the specified value
isna() --> Finds not-a-number values
isnull() --> Finds NULL values
items() --> Iterate over the columns of the DataFrame
iteritems() --> Iterate over the columns of the DataFrame
iterrows() --> Iterate over the rows of the DataFrame
itertuples() --> Iterate over the rows as named tuples
join() --> Join columns of another DataFrame
last() --> Returns the last rows of a specified date selection
le() --> Returns True for values less than, or equal to the specified value(s), otherwise False
loc --> Get or set the value of a group of elements specified using their labels
lt() --> Returns True for values less than the specified value(s), otherwise False
keys() --> Returns the keys of the info axis
kurtosis() --> Returns the kurtosis of the values in the specified axis
mask() --> Replace all values where the specified condition is True
max() --> Return the max of the values in the specified axis
mean() --> Return the mean of the values in the specified axis
median() --> Return the median of the values in the specified axis
melt() --> Reshape the DataFrame from a wide table to a long table
memory_usage() --> Returns the memory usage of each column
merge() --> Merge DataFrame objects
min() --> Returns the min of the values in the specified axis
mod() --> Modules (find the remainder) of the values of a DataFrame
mode() --> Returns the mode of the values in the specified axis
mul() --> Multiplies the values of a DataFrame with the specified value(s)
ndim --> Returns the number of dimensions of the DataFrame
ne() --> Returns True for values that are not equal to the specified value(s), otherwise False
nlargest() --> Sort the DataFrame by the specified columns, descending, and return the specified number of rows
notna() --> Finds values that are not not-a-number
notnull() --> Finds values that are not NULL
nsmallest() --> Sort the DataFrame by the specified columns, ascending, and return the specified number of rows
nunique() --> Returns the number of unique values in the specified axis
pct_change() --> Returns the percentage change between the previous and the current value
pipe() --> Apply a function to the DataFrame
pivot() --> Re-shape the DataFrame
pivot_table() --> Create a spreadsheet pivot table as a DataFrame
pop() --> Removes an element from the DataFrame
pow() --> Raise the values of one DataFrame to the values of another DataFrame
prod() --> Returns the product of all values in the specified axis
product() --> Returns the product of the values in the specified axis
quantile() --> Returns the values at the specified quantile of the specified axis
query() --> Query the DataFrame
radd() --> Reverse-adds the values of one DataFrame with the values of another DataFrame
rdiv() --> Reverse-divides the values of one DataFrame with the values of another DataFrame
reindex() --> Change the labels of the DataFrame
reindex_like() --> ??
rename() --> Change the labels of the axes
rename_axis() --> Change the name of the axis
reorder_levels() --> Re-order the index levels
replace() --> Replace the specified values
reset_index() --> Reset the index
rfloordiv() --> Reverse-divides the values of one DataFrame with the values of another DataFrame
rmod() --> Reverse-modules the values of one DataFrame to the values of another DataFrame
rmul() --> Reverse-multiplies the values of one DataFrame with the values of another DataFrame
round() --> Returns a DataFrame with all values rounded into the specified format
rpow() --> Reverse-raises the values of one DataFrame up to the values of another DataFrame
rsub() --> Reverse-subtracts the values of one DataFrame to the values of another DataFrame
rtruediv() --> Reverse-divides the values of one DataFrame with the values of another DataFrame
sample() --> Returns a random selection elements
sem() --> Returns the standard error of the mean in the specified axis
select_dtypes() --> Returns a DataFrame with columns of selected data types
shape --> Returns the number of rows and columns of the DataFrame
set_axis() --> Sets the index of the specified axis
set_flags() --> Returns a new DataFrame with the specified flags
set_index() --> Set the Index of the DataFrame
size --> Returns the number of elements in the DataFrame
skew() --> Returns the skew of the values in the specified axis
sort_index() --> Sorts the DataFrame according to the labels
sort_values() --> Sorts the DataFrame according to the values
squeeze() --> Converts a single column DataFrame into a Series
stack() --> Reshape the DataFrame from a wide table to a long table
std() --> Returns the standard deviation of the values in the specified axis
sum() --> Returns the sum of the values in the specified axis
sub() --> Subtracts the values of a DataFrame with the specified value(s)
swaplevel() --> Swaps the two specified levels
T --> Turns rows into columns and columns into rows
tail() --> Returns the headers and the last rows
take() --> Returns the specified elements
to_xarray() --> Returns an xarray object
transform() --> Execute a function for each value in the DataFrame
transpose() --> Turns rows into columns and columns into rows
truediv() --> Divides the values of a DataFrame with the specified value(s)
truncate() --> Removes elements outside of a specified set of values
update() --> Update one DataFrame with the values from another DataFrame
value_counts() --> Returns the number of unique rows
values --> Returns the DataFrame as a NumPy array
var() --> Returns the variance of the values in the specified axis
where() --> Replace all values where the specified condition is False
xs() --> Returns the cross-section of the DataFrame
__iter__() --> Returns an iterator of the info axes