Pandas is a powerful open-source library in Python specifically designed for data analysis and manipulation. It excels at working with tabular data, similar to what you might see in spreadsheets or SQL tables.
install pand
c:\>pip install pandas
import & check version
import pandas as pd
print(pd.__version__)
A Pandas Series is like a column in a table. It is a one-dimensional array holding data of any type.
import pandas as pd
data = [10,20,30]
mydata = pd.Series(data)
print(mydata)
import pandas as pd
data = [10,20,30]
mydata = pd.Series(data)
print(mydata[0]) #return 10
Instead of index, you can create your labels
import pandas as pd
data = [10,20,30]
mydata = pd.Series(data, index=['x','y','z'])
print(mydata['x']) #return 10
Creating pandas series from dictionary.
import pandas as pd
calories = {"day1":340,"day2":310, "day3":300}
x = pd.Series(calories)
print(x["day1"]) # print 340
x = pd.Series(calories, index =["day1","day2"])
#print(x["day3"]) #error
print(x["day2"]) #print 310
A Pandas DataFrame is a like a table with rows and columns. It is a 2 dimensional data structure like a 2 dimensional array.
Creating a dataframe using pandas series
import pandas as pd
data = {
"calories":[349,203,203],
"duration":[40,20,10]
}
df = pd.DataFrame(data)
print(df)
To fetch row from Pandas Dataframe, you have to use "loc" attribute to return 1 or more rows.
import pandas as pd
data = {
"calories":[349,203,203],
"duration":[40,20,10]
}
df = pd.DataFrame(data)
#fetch row
print(df.loc[0])
#return 349 , 40 first item each series
#list of indexes
print(df.loc[[0,1]])
#return as dataframe values of index 0 and 1
Below code add a list of names to give each row a name.
import pandas as pd
data = {
"calories":[349,203,203],
"duration":[40,20,10]
}
df = pd.DataFrame(data, index=["day1","day2","day3"])
print(df)
Using "loc" attribute to return specified rows.
import pandas as pd
data = {
"calories":[349,203,203],
"duration":[40,20,10]
}
df = pd.DataFrame(data, index=["day1","day2","day3"])
print(df.loc["day1"]) #return first row
Use pandas read_csv function to load CSV file into DataFrame
import pandas as pd
data = pd.read_csv("data.csv")
print(data)
Pandas will return only first 5 rows and last 5 rows.
To check the system maximum rows, use max_rows statement.
In order to increase the maximum rows to display the all Dataframe, you can set the value of max_rows .
import pandas as pd
data = pd.read_csv("data.csv")
print(data)
print(pd.options.display.max_rows)
pd.options.display.max_rows = 9999
print(data)
print(pd.options.display.max_rows)
Use pandas read_json function to load JSON file into DataFrame
import pandas as pd
pd.options.display.max_rows = 9999
data = pd.read_json("data.json")
print(data)
import pandas as pd
data_dict = {
"roll" :{"0":101,"1":102,"2":103,"3":104},
"math" :{"0":34,"1":23,"2":44,"3":32},
"english":{"0":34,"1":23,"2":45,"3":33}
}
df = pd.DataFrame(data_dict)
print(df)
The head() method returns the headers and specified number of rows from top.
import pandas as pd
data = pd.read_json("data.json")
df = pd.DataFrame(data)
print(df.head(10))
The tail() method returns the headers and specified number of rows from bottom.
import pandas as pd
data = pd.read_json("data.json")
df = pd.DataFrame(data)
print(df.tail(10))
info method give information about the data set. like
import pandas as pd
data = pd.read_json("data.json")
df = pd.DataFrame(data)
print(df.info())
Fixing bad data like, empty cells, data in wrong format, wrong data & duplicates.
use "dropna" method to remove the empty cells. It return a new DataFrame by default. use parameter inplace="True" to change in original dataframe.
import pandas as pd
data = pd.read_json("data1.json")
df = pd.DataFrame(data)
new_df = df.dropna() # return new dataframe
print(new_df)
df.dropna(inplace=True) #remove from orignal df
print(df)
"fillna" method allows us to replace the empty cells with values.
import pandas as pd
data = pd.read_json("data1.json")
df = pd.DataFrame(data)
df.fillna(120, inplace=True)
print(df)
# replace only for specified row
df['marks'].fillna(50,inplace=True)
print(df)
calculate mean and replace the empty value with it.
import pandas as pd
df = pd.read_json("data1.json")
x = df['marks'].mean()
df['marks'].fillna(x, inplace=True)
print(df)
x = df['marks'].median()
df['marks'].fillna(x,inplace=True)
print(df)
x= df['marks'].mode()[0]
df['marks'].fillna(x,inplace=True)
print(df)
Below is the example of converting data type of column to datetime
import pandas as pd
df = pd.read_json("data1.json")
df['Date'] = pd.to_datetime(df['Date'])
print(df)
import pandas as pd
df = pd.read_json("data1.json")
df.dropna(subset=['Date'], inplace=True) #remove empty cell Date rows
print(df)
You can change the value of cell by specifying row index and column name.
import pandas as pd
df = pd.read_json("data1.json")
df.loc[7,"marks"] = 50
print(df)
import pandas as pd
df = pd.read_json("data1.json")
for r in df.index
if df.loc[x,"marks"] > 50:
df.loc[x,"marks"] = 50
# removing rows
for r in df.index
if df.loc[x, "marks"] > 50:
df.drop(x, inplace = True)
duplicated() method return True for every row that is duplicate
drop_duplicates() method remove the duplicates rows
import pandas as pd
df = pd.read_json("data1.json")
print(df.duplicated())
print(df.drop_duplicates(inplace = True))
The corr() method calculates the relationship between each column in your data set.
import pandas as pd
df = pd.read_json("data1.json")
df.corr()
abs() --> Return a DataFrame with the absolute value of each value
add() --> Adds the values of a DataFrame with the specified value(s)
add_prefix() --> Prefix all labels
add_suffix() --> Suffix all labels
agg() --> Apply a function or a function name to one of the axis of the DataFrame
aggregate() --> Apply a function or a function name to one of the axis of the DataFrame
align() --> Aligns two DataFrames with a specified join method
all() --> Return True if all values in the DataFrame are True, otherwise False
any() --> Returns True if any of the values in the DataFrame are True, otherwise False
append() --> Append new columns
applymap() --> Execute a function for each element in the DataFrame
apply() --> Apply a function to one of the axis of the DataFrame
assign() --> Assign new columns
astype() --> Convert the DataFrame into a specified dtype
at --> Get or set the value of the item with the specified label
axes --> Returns the labels of the rows and the columns of the DataFrame
bfill() --> Replaces NULL values with the value from the next row
bool() --> Returns the Boolean value of the DataFrame
columns --> Returns the column labels of the DataFrame
combine() --> Compare the values in two DataFrames, and let a function decide which values to keep
combine_first() --> Compare two DataFrames, and if the first DataFrame has a NULL value, it will be filled with the respective value from the second DataFrame
compare() --> Compare two DataFrames and return the differences
convert_dtypes() --> Converts the columns in the DataFrame into new dtypes
corr() --> Find the correlation (relationship) between each column
count() --> Returns the number of not empty cells for each column/row
cov() --> Find the covariance of the columns
copy() --> Returns a copy of the DataFrame
cummax() --> Calculate the cumulative maximum values of the DataFrame
cummin() --> Calculate the cumulative minmum values of the DataFrame
cumprod() --> Calculate the cumulative product over the DataFrame
cumsum() --> Calculate the cumulative sum over the DataFrame
describe() --> Returns a description summary for each column in the DataFrame
diff() --> Calculate the difference between a value and the value of the same column in the previous row
div() --> Divides the values of a DataFrame with the specified value(s)
dot() --> Multiplies the values of a DataFrame with values from another array-like object, and add the result
drop() --> Drops the specified rows/columns from the DataFrame
drop_duplicates() --> Drops duplicate values from the DataFrame
droplevel() --> Drops the specified index/column(s)
dropna() --> Drops all rows that contains NULL values
dtypes --> Returns the dtypes of the columns of the DataFrame
duplicated() --> Returns True for duplicated rows, otherwise False
empty --> Returns True if the DataFrame is empty, otherwise False
eq() --> Returns True for values that are equal to the specified value(s), otherwise False
equals() --> Returns True if two DataFrames are equal, otherwise False
eval --> Evaluate a specified string
explode() --> Converts each element into a row
ffill() --> Replaces NULL values with the value from the previous row
fillna() --> Replaces NULL values with the specified value
filter() --> Filter the DataFrame according to the specified filter
first() --> Returns the first rows of a specified date selection
floordiv() --> Divides the values of a DataFrame with the specified value(s), and floor the values
ge() --> Returns True for values greater than, or equal to the specified value(s), otherwise False
get() --> Returns the item of the specified key
groupby() --> Groups the rows/columns into specified groups
gt() --> Returns True for values greater than the specified value(s), otherwise False
head() --> Returns the header row and the first 10 rows, or the specified number of rows
iat --> Get or set the value of the item in the specified position
idxmax() --> Returns the label of the max value in the specified axis
idxmin() --> Returns the label of the min value in the specified axis
iloc --> Get or set the values of a group of elements in the specified positions
index --> Returns the row labels of the DataFrame
infer_objects() --> Change the dtype of the columns in the DataFrame
info() --> Prints information about the DataFrame
insert() --> Insert a column in the DataFrame
interpolate() --> Replaces not-a-number values with the interpolated method
isin() --> Returns True if each elements in the DataFrame is in the specified value
isna() --> Finds not-a-number values
isnull() --> Finds NULL values
items() --> Iterate over the columns of the DataFrame
iteritems() --> Iterate over the columns of the DataFrame
iterrows() --> Iterate over the rows of the DataFrame
itertuples() --> Iterate over the rows as named tuples
join() --> Join columns of another DataFrame
last() --> Returns the last rows of a specified date selection
le() --> Returns True for values less than, or equal to the specified value(s), otherwise False
loc --> Get or set the value of a group of elements specified using their labels
lt() --> Returns True for values less than the specified value(s), otherwise False
keys() --> Returns the keys of the info axis
kurtosis() --> Returns the kurtosis of the values in the specified axis
mask() --> Replace all values where the specified condition is True
max() --> Return the max of the values in the specified axis
mean() --> Return the mean of the values in the specified axis
median() --> Return the median of the values in the specified axis
melt() --> Reshape the DataFrame from a wide table to a long table
memory_usage() --> Returns the memory usage of each column
merge() --> Merge DataFrame objects
min() --> Returns the min of the values in the specified axis
mod() --> Modules (find the remainder) of the values of a DataFrame
mode() --> Returns the mode of the values in the specified axis
mul() --> Multiplies the values of a DataFrame with the specified value(s)
ndim --> Returns the number of dimensions of the DataFrame
ne() --> Returns True for values that are not equal to the specified value(s), otherwise False
nlargest() --> Sort the DataFrame by the specified columns, descending, and return the specified number of rows
notna() --> Finds values that are not not-a-number
notnull() --> Finds values that are not NULL
nsmallest() --> Sort the DataFrame by the specified columns, ascending, and return the specified number of rows
nunique() --> Returns the number of unique values in the specified axis
pct_change() --> Returns the percentage change between the previous and the current value
pipe() --> Apply a function to the DataFrame
pivot() --> Re-shape the DataFrame
pivot_table() --> Create a spreadsheet pivot table as a DataFrame
pop() --> Removes an element from the DataFrame
pow() --> Raise the values of one DataFrame to the values of another DataFrame
prod() --> Returns the product of all values in the specified axis
product() --> Returns the product of the values in the specified axis
quantile() --> Returns the values at the specified quantile of the specified axis
query() --> Query the DataFrame
radd() --> Reverse-adds the values of one DataFrame with the values of another DataFrame
rdiv() --> Reverse-divides the values of one DataFrame with the values of another DataFrame
reindex() --> Change the labels of the DataFrame
reindex_like() --> ??
rename() --> Change the labels of the axes
rename_axis() --> Change the name of the axis
reorder_levels() --> Re-order the index levels
replace() --> Replace the specified values
reset_index() --> Reset the index
rfloordiv() --> Reverse-divides the values of one DataFrame with the values of another DataFrame
rmod() --> Reverse-modules the values of one DataFrame to the values of another DataFrame
rmul() --> Reverse-multiplies the values of one DataFrame with the values of another DataFrame
round() --> Returns a DataFrame with all values rounded into the specified format
rpow() --> Reverse-raises the values of one DataFrame up to the values of another DataFrame
rsub() --> Reverse-subtracts the values of one DataFrame to the values of another DataFrame
rtruediv() --> Reverse-divides the values of one DataFrame with the values of another DataFrame
sample() --> Returns a random selection elements
sem() --> Returns the standard error of the mean in the specified axis
select_dtypes() --> Returns a DataFrame with columns of selected data types
shape --> Returns the number of rows and columns of the DataFrame
set_axis() --> Sets the index of the specified axis
set_flags() --> Returns a new DataFrame with the specified flags
set_index() --> Set the Index of the DataFrame
size --> Returns the number of elements in the DataFrame
skew() --> Returns the skew of the values in the specified axis
sort_index() --> Sorts the DataFrame according to the labels
sort_values() --> Sorts the DataFrame according to the values
squeeze() --> Converts a single column DataFrame into a Series
stack() --> Reshape the DataFrame from a wide table to a long table
std() --> Returns the standard deviation of the values in the specified axis
sum() --> Returns the sum of the values in the specified axis
sub() --> Subtracts the values of a DataFrame with the specified value(s)
swaplevel() --> Swaps the two specified levels
T --> Turns rows into columns and columns into rows
tail() --> Returns the headers and the last rows
take() --> Returns the specified elements
to_xarray() --> Returns an xarray object
transform() --> Execute a function for each value in the DataFrame
transpose() --> Turns rows into columns and columns into rows
truediv() --> Divides the values of a DataFrame with the specified value(s)
truncate() --> Removes elements outside of a specified set of values
update() --> Update one DataFrame with the values from another DataFrame
value_counts() --> Returns the number of unique rows
values --> Returns the DataFrame as a NumPy array
var() --> Returns the variance of the values in the specified axis
where() --> Replace all values where the specified condition is False
xs() --> Returns the cross-section of the DataFrame
__iter__() --> Returns an iterator of the info axes