Exploring the Dataset
# df is the dataframe
df.head() # First rows of the dataset (5 by default)
df.tail() # Last rows of the dataset (5 by default)
df.sample() # Returns one random row by default; pass the frac parameter
# to sample a proportion of the dataset
df.info() # Provides a brief description about the dataframe, including the
# column names, data types in each column, and the memory usage of the dataset
df.info(verbose=False) # Setting verbose to False shows a shortened summary
df.describe() # Returns summary descriptive statistics for the numeric columns
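A minimal sketch of the exploration calls above, using a small made-up dataframe (the column names and values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cara", "Dan"],
    "score": [88, 72, 95, 61],
})

first_rows = df.head(2)                             # first 2 rows
last_rows = df.tail(2)                              # last 2 rows
random_half = df.sample(frac=0.5, random_state=0)   # 50% of rows at random
stats = df.describe()                               # stats for numeric columns
```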
Components of a Dataframe
#returns a 2d array of the dataframe values
df.values
#returns the list of column names
df.columns
#returns the index of the rows
df.index
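A quick sketch of the three components on a toy dataframe (values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

vals = df.values          # 2-D NumPy array of the cell values
cols = list(df.columns)   # column labels
idx = list(df.index)      # row labels (default RangeIndex 0, 1, ...)
```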
I/O Dataframe (Reading and Writing Dataframe from External Source)
# The most common file type to read for data analysis is CSV, but pandas can
# also read other formats such as Excel, HTML, JSON, etc.
# Reading a dataframe (note: read_csv is a pandas-level function, called on
# the pandas module, not on an existing dataframe)
df = pd.read_csv("file_path")
# Saving a dataframe to a CSV file path; as with reading, it can also be saved
# to other file formats, check the pandas documentation for more information
df.to_csv("file_path", index=False)
# index=False means the row index is not included in the saved file
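A round-trip sketch of the read/write calls above; the file path is a temporary file created just for the demonstration:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Lima"], "pop": [700000, 9700000]})

# Write without the row index, then read the file back with pd.read_csv
path = os.path.join(tempfile.mkdtemp(), "cities.csv")
df.to_csv(path, index=False)
df_back = pd.read_csv(path)
```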
Sorting a Dataframe
df.sort_values("column_name") # default sorting of values by ascending order
df.sort_values("column_name", ascending=False) # returns a descending order
df.sort_values(["column_1","column_2"]) # Sorting by multiple column via list
#Sorting multiple columns with the 'ascending' attribute
df.sort_values(["column_1","column_2"], ascending=[True,False])
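The sorting variants above, sketched on a made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({"team": ["B", "A", "A", "B"], "points": [10, 30, 20, 40]})

by_points = df.sort_values("points")                        # ascending (default)
by_points_desc = df.sort_values("points", ascending=False)  # descending
# Multiple columns: team ascending, then points descending within each team
by_both = df.sort_values(["team", "points"], ascending=[True, False])
```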
Subsetting a Dataframe
df["column_name"] # Subsetting a single column
df[["column_1","column_2"]] # Subsetting multiple columns via list
# If the list is too long, we can first store the column names in a list object
# and pass that variable inside the outer square brackets
df["target_column"] > 50 # Subsetting rows by passing a logical operation
# Returns a boolean array
df[df["target_column"] > 50] # Subsetting rows by passing this boolean array
# inside the dataframe's square brackets, which returns the matching rows
df[df["categorical_column"] == "categorical_value"] # Subsetting based on string data
df[df["date_column"] >= "2018-10-10"] # Subsetting rows by date
# Subsetting rows based on multiple logical operations (conditions)
category = df["categorical_column"] == "categorical_value"
numerical = df["numerical_column"] >= 1000
df[category & numerical]
# Shorthand method for multiple logical operation
df[(df["categorical_column"] == "value") & (df["numerical_column"] >= 1000 )]
# Subsetting via .isin() method by passing a list of categorical variables
df[df["categorical_column"].isin(["value_1","value_2"])]
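The boolean-mask subsetting patterns above, sketched on an invented dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["cat", "dog", "cat", "bird"],
    "weight": [4, 30, 5, 1],
})

heavy = df[df["weight"] > 3]                     # numeric condition
cats = df[df["animal"] == "cat"]                 # string condition
heavy_cats = df[(df["animal"] == "cat") & (df["weight"] > 3)]  # combined
pets = df[df["animal"].isin(["cat", "dog"])]     # membership in a list
```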
Subsetting by List Comprehension
df[[col for col in df.columns if "word_in_column" in col]]
# This helps you filter for columns whose names contain a keyword; say you
# want all columns with "time" in their names, replace "word_in_column"
# with "time"
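A sketch of the keyword-based column filter, assuming a dataframe with some "time" columns:

```python
import pandas as pd

df = pd.DataFrame({
    "start_time": [1, 2],
    "end_time": [3, 4],
    "label": ["a", "b"],
})

# Keep only the columns whose name contains "time"
time_cols = df[[col for col in df.columns if "time" in col]]
```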
Subsetting by Query Method
df.query('column_name > 1000') # Query with one condition
df.query('(column_name > 1000) and (categorical_column == "value")')
# Query with multiple conditions; for a categorical column, the value must be
# quoted inside the query string (double quotes here, since the query itself
# is wrapped in single quotes)
external_variable = 100
df.query('column_name > @external_variable')
# In case we want to filter using a variable declared outside the .query() method,
# prefixing the variable name with the @ symbol lets the query reference
# that variable's value
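The query patterns above on a toy dataframe (column names and values are assumptions for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "price": [500, 1500, 2500],
    "kind": ["basic", "pro", "pro"],
})

expensive = df.query('price > 1000')                            # one condition
pro_expensive = df.query('(price > 1000) and (kind == "pro")')  # multiple

threshold = 2000
above = df.query('price > @threshold')  # @ references an outside variable
```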
Adding New Columns
df["new_column"] = df["original_column"] / 100 # Adding a new column via a
# simple arithmetic operation on an existing column
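A minimal sketch of deriving a new column, with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"cents": [150, 275]})
df["dollars"] = df["cents"] / 100  # new column from arithmetic on an existing one
```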