Chapter 1: Dataframe


Exploring the Dataset

# df is the dataframe

df.head()   # Upper part of the dataset (first 5 rows by default)

df.tail()   # Lower part of the dataset (last 5 rows by default)

df.sample()   # Returns a random sample of rows (one row by default); we can pass
# the frac parameter to set the fraction of the dataset to return
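
For instance, a minimal sketch, assuming we want 5 random rows or roughly 10% of the dataset (the 0.1 fraction and random_state value are illustrative):

df.sample(n=5)                          # 5 random rows
df.sample(frac=0.1, random_state=42)    # ~10% of the rows; random_state makes it reproducible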

df.info()  # Prints a concise summary of the dataframe, including the
# column names, non-null counts, data types, and memory usage of the dataset

df.info(verbose=False)  # Setting verbose to False prints
# a shorter summary without the per-column detail

df.describe()  # Summary descriptive statistics of the numeric columns
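
If we also want non-numeric columns in the summary, describe() accepts an include parameter; a short sketch (assuming the dataframe has at least one string column):

df.describe(include="all")       # summarize every column, numeric and non-numeric
df.describe(include="object")    # summarize only the object (string) columns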

Components of a Dataframe

# Returns a 2D NumPy array of the dataframe values
df.values 

# Returns the column labels (an Index object)
df.columns 

# Returns the row index (row labels)
df.index 
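
As a small self-contained sketch, the toy dataframe below (its column names and values are made up for illustration) shows what each component looks like:

import pandas as pd

toy = pd.DataFrame({"name": ["a", "b"], "score": [10, 20]})
print(toy.values)    # 2D array: [['a' 10] ['b' 20]]
print(toy.columns)   # Index(['name', 'score'], dtype='object')
print(toy.index)     # RangeIndex(start=0, stop=2, step=1)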

I/O Dataframe (Reading and Writing Dataframe from External Source)

# The most common file format to read for data analysis is CSV, but pandas can
# also read other formats such as Excel, HTML, JSON, etc.

# Reading a dataframe (read_csv is a pandas function, not a dataframe method)
import pandas as pd
df = pd.read_csv("file_path")

# Saving a dataframe to a CSV file path; as with reading, it can also be saved to
# other file formats (check the pandas documentation for more information)
df.to_csv("file_path", index=False)
# index=False means the row index is not included in the saved file
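
A hedged sketch of a few other formats (the file paths are placeholders, and reading/writing .xlsx files requires an Excel engine such as openpyxl to be installed):

df_xl = pd.read_excel("file_path.xlsx")    # read an Excel sheet into a dataframe
df_js = pd.read_json("file_path.json")     # read a JSON file into a dataframe

df.to_excel("output.xlsx", index=False)    # write the dataframe to Excel
df.to_json("output.json")                  # write the dataframe to JSON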

Sorting a Dataframe

df.sort_values("column_name")  # default sorting of values by ascending order

df.sort_values("column_name", ascending=False)  # returns a descending order

df.sort_values(["column_1","column_2"]) # Sorting by multiple column via list

#Sorting multiple columns with the 'ascending' attribute
df.sort_values(["column_1","column_2"], ascending=[True,False])

Subsetting a Dataframe

df["column_name"]   # Subsetting a single column

df[["column_1","column_2"]]  # Subsetting multiple columns via list
# If the list is too long, we can store first a list object and pass it in the
# outer square bracket
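
For example, assuming hypothetical column names, the list can be built separately and then passed inside the outer brackets:

cols_to_keep = ["column_1", "column_2", "column_3"]   # hypothetical column names
df[cols_to_keep]                                      # same as df[["column_1", "column_2", "column_3"]]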

df["target_column"] > 50 # Subsetting rows by passing a logical operation
# Returns a boolean array

df[df["target_column"] > 50]  # Subsetting rows in the dataframe by wrapping this
# logical operation by the dataframe itself to return the dataframe subset

df[df["categorical_column" == "categorical_value"]] # Subsetting based on string data

df[df["date_column" >= "2018-10-10"]] # Subsetting rows by date

# Subsetting rows based on multiple logical operations (conditions)
category = df["categorical_column"] == "categorical_value"
numerical = df["numerical_column"] >= 1000
df[category & numerical]

# Shorthand for multiple logical operations (wrap each condition in parentheses)
df[(df["categorical_column"] == "value") & (df["numerical_column"] >= 1000)]

# Subsetting via the .isin() method by passing a list of categorical values
df[df["categorical_column"].isin(["value_1","value_2"])]

Subsetting by List Comprehension

df[[col for col in df.columns if "word_in_column" in col]]

# This helps when filtering columns whose names contain a given key word.
# Say you want all columns with "time" in their names: replace "word_in_column"
# with "time", as in the sketch below

Subsetting by Query Method

df.query('column_name > 1000')  # Query with one condition

df.query('(column_name > 1000) and (categorical_column == "value")')
# Query with multiple conditions; since the query string is in single quotes,
# string values for categorical columns go inside double quotes

external_variable = 100
df.query('column_name > @external_variable')
# To filter using a variable declared outside the .query() call,
# prefix it with the @ symbol to reference its value inside the query string

Adding New Columns

df["new_column"] = df["original_column"] /100  # Adding a new column by 
# simple arithmetic operation
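
A small sketch with hypothetical column names, deriving new columns from existing ones:

df["revenue_per_unit"] = df["revenue"] / df["units_sold"]   # hypothetical columns
df["is_large_order"] = df["units_sold"] > 100               # Boolean column from a comparison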

Chapter 2: Summary Statistics