Banks Table - BS - CSV - SQLite3 - Log

This example demonstrates an ETL pipeline built around scraping a webpage that displays bank market values. The process uses BeautifulSoup to extract the required table, transforms the data, loads it into a CSV file and a SQLite3 database, runs exploratory queries, and records progress in a log file.

Case Study


  • We are asked to write code that compiles a list of the top 10 largest banks in the world, ranked by market capitalization in billion USD.
  • The data needs to be transformed and stored in GBP, EUR, and INR as well, in accordance with the exchange rate information made available in a CSV file
  • The processed information table is to be saved locally in CSV format and as a database table
  • Run queries on the db
  • Log all function calls in a progress log file

Your job is to create an automated system to generate this information so that the same can be executed in every financial quarter to prepare the report.

Setup


  • Let’s start by loading the libraries and
  • Setting the global variables
library(reticulate)
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import sqlite3
import requests
from datetime import datetime
# Global Vars
url = 'https://web.archive.org/web/20230908091635%20/https://en.wikipedia.org/wiki/List_of_largest_banks'
exchange_file = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-PY0221EN-Coursera/labs/v2/exchange_rate.csv'
sql_connection = sqlite3.connect('D:/data/BANK/Banks.db')
transformed_file = 'D:/data/BANK/Largest_banks_data.csv'
table_name = 'Largest_banks'
attribute_list = ['Name','MC_USD_Billion']
log_file = 'D:/data/BANK/code_log.txt'

Task 1: Logging function


Write the function to log the progress of the code, log_progress().

  • The function accepts the message to be logged and enters it to a text file code_log.txt
  • The log file will save the timestamp and the message generated by every process done
  • Here is a list of the messages we need to save upon execution
Task                          Log message on completion
---------------------------   -----------------------------------------------------------
Declaring known values        Preliminaries complete. Initiating ETL process
Call extract() function       Data extraction complete. Initiating Transformation process
Call transform() function     Data transformation complete. Initiating Loading process
Call load_to_csv()            Data saved to CSV file
Initiate SQLite3 connection   SQL Connection initiated
Call load_to_db()             Data loaded to Database as a table, Executing queries
Call run_query()              Process Complete
Close SQLite3 connection      Server Connection closed

Code

# Logging Function
def log_progress(message):
        timestamp_format = '%Y-%b-%d-%H:%M:%S' # Year-Monthname-Day-Hour:Minute:Second (%b is the portable abbreviated-month code; %h is a glibc extension)
        now = datetime.now() # get current timestamp 
        timestamp = now.strftime(timestamp_format) 
        with open(log_file,"a") as f: 
                f.write(timestamp + ',' + message + '\n')
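As a quick sanity check of the log format, here is a sketch using a variant of log_progress() that takes the file path as a parameter, so it can be pointed at a temporary file instead of the global log_file path (the parameterized signature is my own, for testability):

```python
# Variant of log_progress() that takes the path as a parameter so it can
# be exercised against a temporary file (an assumption for illustration).
import os
import tempfile
from datetime import datetime

def log_progress(message, log_file):
    timestamp_format = '%Y-%b-%d-%H:%M:%S'  # Year-Monthname-Day-Hour:Minute:Second
    timestamp = datetime.now().strftime(timestamp_format)
    with open(log_file, 'a') as f:
        f.write(timestamp + ',' + message + '\n')

tmp_log = os.path.join(tempfile.mkdtemp(), 'code_log.txt')
log_progress('Preliminaries complete. Initiating ETL process', tmp_log)
log_progress('Data extraction complete. Initiating Transformation process', tmp_log)

with open(tmp_log) as f:
    lines = f.read().splitlines()
print(len(lines))                 # 2
print(lines[0].split(',', 1)[1])  # Preliminaries complete. Initiating ETL process
```

Each call appends one comma-separated line, so the log grows by one entry per pipeline step.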

Task 2: Extraction of data


Analyze the webpage on the given URL:

  • Identify the position of the required table under the heading By market capitalization.
  • Write the function extract() to retrieve the information of the table to a Pandas data frame.
  • Note: Remember to remove the trailing character (the newline '\n') from the Market Cap column contents and typecast the value to float.
  • Write a function call for extract() and print the returning data frame.
  • Make the relevant log entry.

HTML Structure Inspection

  • I’ll first do a visual inspection of the webpage using the browser’s inspect option to get a general idea of the layout
  • It appears that the table we need is the first of three on the page, but to be sure I’ll create a BeautifulSoup object and review it after parsing
  • I’ll print out part of the HTML structure
data = requests.get(url).text
page = BeautifulSoup(data, 'html5lib')
tables = page.find_all('table')
table = tables[0]
rows = table.find_all('tr')
print(table.prettify())

  • As you can see above, it is the correct table, so let’s drill into the rows 'tr'
  • Below we find all the rows, iterate through them, and target the columns 'td'
  • I’ll print out the contents of each column in each row so we can identify the data we need
  • As you can see, we are interested in col[1] and col[2]
  • In col[1] we need to focus on the contents of the <a> </a> tags
  • Looking at the printed output, you can see there are multiple tags in the second column value
  • So let’s list them all for each row and check that the table is consistent; that way we’ll know which title to extract for our study
  • From the results below we can tell that the table consistently displays the bank name as the second title in the row
data = requests.get(url).text
page = BeautifulSoup(data, 'html5lib')
tables = page.find_all('table')
table = tables[0]
rows = table.find_all('tr')
data_dict = {}
df = pd.DataFrame(columns = attribute_list)
for x, row in enumerate(rows):
        col = row.find_all('td')
        if len(col) != 0:
                print(f"We are in row {x}")
                print(f"content 0 is: {col[0].contents}")
                print(f"content 1 is: {col[1].contents}")
                print(f"content 2 is: {col[2].contents}")
"""
content 0 is: ['1\n']

content 1 is: [<span class="flagicon"><span class="mw-image-border" typeof="mw:File"><a href="/web/20230908091635/https://en.wikipedia.org/wiki/United_States" title="United States"><img alt="United States" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//web.archive.org/web/20230908091635im_/https://upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//web.archive.org/web/20230908091635im_/https://upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //web.archive.org/web/20230908091635im_/https://upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></a></span></span>, ' ', <a href="/web/20230908091635/https://en.wikipedia.org/wiki/JPMorgan_Chase" title="JPMorgan Chase">JPMorgan Chase</a>, '\n']

content 2 is: ['432.92\n']
"""
  • You should also notice that the Bank Names are within the 'a' tags
  • So let’s edit our code to target the ‘a’ tag and the title for the Bank Name
  • And contents for the Capitalization value
  • All we need to do is skip the first ‘title’ value and target the second
  • So let’s edit the code and save the values into a dict first then into a df
for x, row in enumerate(rows):
        col = row.find_all('td')
        if len(col) != 0:
                print(f"We are in row {x}")
                if col[1].find_all('a'): # find_all() returns a list, never None; an empty list is falsy
                        for a_tag in col[1].find_all('a'):
                                print(f"content 1 titles are: {a_tag.get('title')}")
                        print(f"content 2 is: {col[2].contents[0]}")
                        
"""
We are in row 1
content 1 titles are: United States
content 1 titles are: JPMorgan Chase
content 2 is: 432.92

We are in row 2
content 1 titles are: United States
content 1 titles are: Bank of America
content 2 is: 231.52

We are in row 3
content 1 titles are: China
content 1 titles are: Industrial and Commercial Bank of China
content 2 is: 194.56

We are in row 4
content 1 titles are: China
content 1 titles are: Agricultural Bank of China
"""
  • So let’s revise the code to capture the second title: a_count[1]['title']
  • For the Market Capitalization value we need col[2].contents[0], but notice the '\n' at the end, so we drop the last character: col[2].contents[0][:-1]
  • We then save the data in a dict which is saved in df1
  • We have already created a df and initialized it with the column names using the attribute_list
  • We concatenate df with df1 into df which will contain all the columns and rows we extracted from the webpage
  • The function will return the df to the caller

Code

# Extract Function
def extract(url,attribute_list):
        data = requests.get(url).text
        page = BeautifulSoup(data, 'html5lib')
        #print(page.prettify())
        tables = page.find_all('table')
        table = tables[0]
        rows = table.find_all('tr')
        data_dict = {}
        df = pd.DataFrame(columns = attribute_list)
        for row in rows:
                col = row.find_all('td')
                if len(col) != 0:
                        a_count = col[1].find_all('a')
                        if len(a_count) != 0: # skip rows whose second column has no anchor tags
                                data_dict = {'Name' : a_count[1]['title'] ,
                                'MC_USD_Billion' :  col[2].contents[0][:-1]}
                                df1 = pd.DataFrame(data_dict, index = [0])
                                df = pd.concat([df, df1], ignore_index = True)
        return df
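Since extract() depends on the live archive page, here is a minimal offline sketch of the same parsing logic run against a hypothetical HTML fragment that mimics the target table's structure (the fragment and its values are made up for illustration):

```python
# Offline sketch: the extract() parsing logic applied to a hypothetical
# HTML fragment shaped like the Wikipedia table (values are made up).
from bs4 import BeautifulSoup
import pandas as pd

html = """<table>
<tr><th>Rank</th><th>Bank name</th><th>Market cap (US$ billion)</th></tr>
<tr><td>1</td>
<td><a title="United States">flag</a> <a title="JPMorgan Chase">JPMorgan Chase</a></td>
<td>432.92
</td></tr>
</table>"""

df = pd.DataFrame(columns=['Name', 'MC_USD_Billion'])
for row in BeautifulSoup(html, 'html.parser').find_all('tr'):
    col = row.find_all('td')
    if len(col) != 0:
        a_count = col[1].find_all('a')
        if len(a_count) != 0:
            data_dict = {'Name': a_count[1]['title'],                # second title = bank name
                         'MC_USD_Billion': col[2].contents[0][:-1]}  # drop trailing '\n'
            df = pd.concat([df, pd.DataFrame(data_dict, index=[0])], ignore_index=True)

print(df.loc[0, 'Name'])            # JPMorgan Chase
print(df.loc[0, 'MC_USD_Billion'])  # 432.92
```

The header row contains only 'th' cells, so find_all('td') returns an empty list there and the row is skipped, exactly as in the function above.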

Task 3: Transformation of data


The Transform function needs to perform the following tasks:

  • Read the exchange rate CSV file
  • Convert the contents to a dictionary so that the contents of the first column are the keys to the dictionary and the contents of the second column are the corresponding values
  • Add 3 additional columns for different exchanges

Exchange Rate Function

  • Here we’ll import the exchange rate factors
  • Save into a df
  • Convert df to a dictionary
  • Return dict to transform() function for computation

reference: loop through dict

  • This is for reference only while I’m here.
  • I’ll include this part in another part of the pandas section
# This is one way to loop through a dict to see all the keys
for key in rate:
        print(key, rate[key])
        
You can also iterate over the keys explicitly with rate.keys()
# another way to populate the values from the df
x = 0
for key in rate:
        rate[key] = exchange_df.iloc[x,1] # column 1 holds the rate values
        x = x + 1

rate

df to dict - method 1

# this is another example of how to transform 2 df cols to dict
def get_exchange_rate(filename):
        exchange_df = pd.read_csv(filename)
        df_dict = exchange_df.set_index('Currency')['Rate'].to_dict()
        return df_dict
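To convince ourselves the two conversions agree, here is a tiny check on an in-memory frame (the Rate figures are placeholders, not the real exchange_rate.csv values):

```python
# Equivalence check of the two df-to-dict conversions on a small
# in-memory frame; the Rate values are placeholders.
import pandas as pd

exchange_df = pd.DataFrame({'Currency': ['EUR', 'GBP', 'INR'],
                            'Rate': [0.93, 0.8, 82.95]})

# method 1: set the Currency column as the index, then convert the Rate Series
method_1 = exchange_df.set_index('Currency')['Rate'].to_dict()
# method 2: zip the two columns into key/value pairs
method_2 = dict(zip(exchange_df['Currency'], exchange_df['Rate']))

print(method_1 == method_2)  # True
print(method_1['GBP'])       # 0.8
```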

Exchange Rate method 2

  • Instead of using a dictionary as we see above and below (the two methods)
  • We can use the df directly like this
# Use this if you wish to use a df as the rate exchange multiplier source

def transform(df, multiplier):
        df['MC_USD_Billion'] = df['MC_USD_Billion'].astype(float)
        # the iloc row positions assume the CSV order: EUR in row 0, GBP in row 1, INR in row 2
        df['MC_GBP_Billion'] = [np.round(x * multiplier.iloc[1,1],2) for x in df['MC_USD_Billion']]
        df['MC_EUR_Billion'] = [np.round(x * multiplier.iloc[0,1],2) for x in df['MC_USD_Billion']]
        df['MC_INR_Billion'] = [np.round(x * multiplier.iloc[2,1],2) for x in df['MC_USD_Billion']]
        return df

# use these two lines in the Call Function code to call the transform function using the df instead

df = transform(df, exchange_df)

Code

df to dict - method 2

# Import and convert exchange rate
def get_exchange_rate(filename):
        exchange_df = pd.read_csv(filename)
        df_dict = dict(zip(exchange_df['Currency'],exchange_df['Rate']))
        return df_dict
# Call exchange rate function and use values for transformation
def transform(df):
        exchange_rate = get_exchange_rate(exchange_file)
        df['MC_USD_Billion'] = df['MC_USD_Billion'].astype(float)
        df['MC_GBP_Billion'] = [np.round(x * exchange_rate['GBP'],2) for x in df['MC_USD_Billion']]
        df['MC_EUR_Billion'] = [np.round(x * exchange_rate['EUR'],2) for x in df['MC_USD_Billion']]
        df['MC_INR_Billion'] = [np.round(x * exchange_rate['INR'],2) for x in df['MC_USD_Billion']]
        return df
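The list comprehensions can also be replaced with vectorized pandas arithmetic. Here is a sketch; the rates used (GBP 0.8, EUR 0.93, INR 82.95) are assumptions inferred from the sample output shown later on this page, not values read from the CSV:

```python
# Vectorized variant of transform(): Series arithmetic replaces the
# list comprehensions. The rates are assumed values for illustration.
import pandas as pd

def transform_vectorized(df, exchange_rate):
    df['MC_USD_Billion'] = df['MC_USD_Billion'].astype(float)
    for currency in ('GBP', 'EUR', 'INR'):
        # multiply the whole column at once, then round to 2 decimals
        df[f'MC_{currency}_Billion'] = (df['MC_USD_Billion'] * exchange_rate[currency]).round(2)
    return df

banks = pd.DataFrame({'Name': ['JPMorgan Chase'], 'MC_USD_Billion': ['432.92']})
rates = {'GBP': 0.8, 'EUR': 0.93, 'INR': 82.95}
out = transform_vectorized(banks, rates)
print(out['MC_GBP_Billion'][0])  # 346.34
```

Vectorized column arithmetic is generally faster and reads closer to the intent than iterating value by value.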

Task 4: Loading to CSV


  • Write the function to load the transformed data frame to a CSV file, like load_to_csv(), in the path mentioned in the project scenario
  • Make the relevant log entry

Code

def load_to_csv(df, csv_path):
        df.to_csv(csv_path)
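A quick round-trip check of the CSV load, writing a small frame to a temporary path and reading it back (the temporary path, sample data, and the index=False variant are assumptions for illustration):

```python
# Round-trip sanity check for the CSV load, against a temporary path.
# index=False is a variant choice that keeps the row index out of the file.
import os
import tempfile
import pandas as pd

def load_to_csv(df, csv_path):
    df.to_csv(csv_path, index=False)

banks = pd.DataFrame({'Name': ['JPMorgan Chase', 'Bank of America'],
                      'MC_USD_Billion': [432.92, 231.52]})
csv_path = os.path.join(tempfile.mkdtemp(), 'Largest_banks_data.csv')
load_to_csv(banks, csv_path)

df_back = pd.read_csv(csv_path)
print(df_back.equals(banks))  # True
```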

Task 5: Loading to Database


Write the function to load the transformed data frame to an SQL database, like, load_to_db().

  • Before calling this function, initiate the connection to the SQLite3 database server
  • Pass this connection object, along with the required table name Largest_banks and the transformed data frame, to the load_to_db() function in the function call

Code

def load_to_db(df, sql_connection, table_name):
        df.to_sql(table_name, sql_connection, if_exists = 'replace', index =False)
        print('Table is loaded with data')

Task 6: Run queries


  • Write the function run_query() that accepts the query statement and the SQLite3 connection object, and generates the output of the query.
  • The query statement should be printed along with the query output.

Code

def run_query(query_statement, sql_connection):
    print(query_statement)
    query_output = pd.read_sql(query_statement, sql_connection)
    print(query_output)
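The database pieces can be exercised end to end without touching disk by using an in-memory SQLite connection. A sketch follows; the sample data and the return value added to run_query() are my own, so the result can be inspected:

```python
# End-to-end sketch of load_to_db() and run_query() against an
# in-memory SQLite database (':memory:'), so nothing is written to disk.
import sqlite3
import pandas as pd

def load_to_db(df, sql_connection, table_name):
    df.to_sql(table_name, sql_connection, if_exists='replace', index=False)

def run_query(query_statement, sql_connection):
    print(query_statement)
    query_output = pd.read_sql(query_statement, sql_connection)
    print(query_output)
    return query_output  # returned here (a variant) so the caller can inspect it

conn = sqlite3.connect(':memory:')
banks = pd.DataFrame({'Name': ['JPMorgan Chase', 'Bank of America'],
                      'MC_GBP_Billion': [346.34, 185.22]})
load_to_db(banks, conn, 'Largest_banks')
avg = run_query('SELECT AVG(MC_GBP_Billion) FROM Largest_banks', conn)
conn.close()
print(round(avg.iloc[0, 0], 2))  # 265.78
```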

Task 7: Run ETL


Finally we get to

  • Call all the functions
  • Run our queries
  • Log the progress
# Logging Function
log_progress('Preliminaries complete. Initiating ETL process')
df = extract(url, attribute_list)

log_progress('Data extraction complete. Initiating Transformation process')
df = transform(df)

log_progress('Data transformation complete. Initiating Loading process')

load_to_csv(df, transformed_file)

log_progress('Data saved to CSV file')

log_progress('SQL Connection initiated.')

load_to_db(df, sql_connection, table_name)
Table is loaded with data
log_progress('Data loaded to Database as table. Executing queries')

# ----------- Run these Queries ------------------
query_statement = f"SELECT * from {table_name}"
run_query(query_statement, sql_connection)
SELECT * from Largest_banks
                                      Name  ...  MC_INR_Billion
0                           JPMorgan Chase  ...        35910.71
1                          Bank of America  ...        19204.58
2  Industrial and Commercial Bank of China  ...        16138.75
3               Agricultural Bank of China  ...        13328.41
4                                HDFC Bank  ...        13098.63
5                              Wells Fargo  ...        12929.42
6                                     HSBC  ...        12351.26
7                           Morgan Stanley  ...        11681.85
8                  China Construction Bank  ...        11598.07
9                            Bank of China  ...        11348.39

[10 rows x 5 columns]
query_statement = f"SELECT AVG(MC_GBP_Billion) from {table_name}"
run_query(query_statement, sql_connection)
SELECT AVG(MC_GBP_Billion) from Largest_banks
   AVG(MC_GBP_Billion)
0              151.987
query_statement = f"SELECT Name from {table_name} LIMIT 5"
run_query(query_statement, sql_connection)
SELECT Name from Largest_banks LIMIT 5
                                      Name
0                           JPMorgan Chase
1                          Bank of America
2  Industrial and Commercial Bank of China
3               Agricultural Bank of China
4                                HDFC Bank

log_progress('Process Complete.')

sql_connection.close()
log_progress('Server Connection Closed')

Task 8: Review Output


transform() - Output

df = extract(url, attribute_list)
df = transform(df)
df['MC_EUR_Billion'][4]
# OUTPUT 146.86
   Name                                     MC_USD_Billion  MC_GBP_Billion  MC_EUR_Billion  MC_INR_Billion
1  JPMorgan Chase                                   432.92          346.34          402.62        35910.71
2  Bank of America                                  231.52          185.22          215.31        19204.58
3  Industrial and Commercial Bank of China          194.56          155.65          180.94        16138.75
4  Agricultural Bank of China                       160.68          128.54          149.43        13328.41
5  HDFC Bank                                        157.91          126.33          146.86        13098.63
6  Wells Fargo                                      155.87          124.70          144.96        12929.42
7  HSBC                                             148.90          119.12          138.48        12351.26
8  Morgan Stanley                                   140.83          112.66          130.97        11681.85
9  China Construction Bank                          139.82          111.86          130.03        11598.07
10 Bank of China                                    136.81          109.45          127.23        11348.39

csv file - Output

with open(transformed_file, 'r') as f:
        material = f.read()
        print(material)

run_query() - Output

Contents of the code_log.txt file

with open(log_file, 'r') as f:
        material = f.read()
        print(material)