GDP Table - ETL - SQLite3 - Pandas

A different version of this script, which uses BeautifulSoup instead of pandas' read_html, is detailed in GDP Table - ETL - SQLite3 - BS

Scenario


  • You are tasked with creating an automated script that extracts the list of all countries ordered by their GDPs in billion USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF).
  • The required information must be made accessible as a CSV file Countries_by_GDP.csv, and as a table Countries_by_GDP in a database file World_Economies.db, with attributes Country and GDP_USD_billion.
  • Run a query on the database table to display only the entries with an economy of more than 100 billion USD.
  • Log the entire process of execution in a file named etl_project_log.txt.

Tasks

  • Write a data extraction function to retrieve the relevant information from the provided URL.
  • Transform the available GDP information into ‘Billion USD’ from ‘Million USD’.
  • Load the transformed information to the required CSV file Countries_by_GDP.csv
  • Save the data in a SQLite3 database World_Economies.db in table: Countries_by_GDP
  • Run the required query on the database
  • Log the progress of the code with appropriate timestamps in etl_project_log.txt

Setup


import pandas as pd
import numpy as np
import sqlite3
import requests
from datetime import datetime

Extract


Global Vars

url = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
sql_connection = sqlite3.connect('D:/data/GDP/World_Economies.db')
transformed_file = 'D:/data/GDP/Countries_by_GDP.csv'
table_name = 'Countries_by_GDP'
attribute_list = ['Country', 'GDP_USD_millions']
log_file = 'D:/data/GDP/etl_project_log.txt'

Extract Function

  • extract reads all the tables from the web page with pd.read_html
  • Assign the result to tables and extract tables[3], the one we want
  • You’ll notice that the column names are multi-indexed: they span two rows
  • So rename the columns by their index number: data.shape is (214, 8) (rows, cols), so range(df1.shape[1]) is range(8), which gives column names 0–7
  • Filter the df down to the 1st and 3rd columns, i.e. locations [[0, 2]], and take a .copy() to avoid SettingWithCopyWarning
  • Drop the first row, which holds the GDP for the entire world, by slicing from 1: to the end
  • Assign the new column names
  • return the df
def extract(url, attribute_list):
        tables = pd.read_html(url)
        data = tables[3]
        df1 = pd.DataFrame(data)
        df1.columns = range(df1.shape[1])
        df = df1[[0,2]]
        df = df.iloc[1:214,:].copy()
        df.columns = attribute_list
        return df
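To see the header flattening and slicing in isolation, here is a sketch on a small, made-up HTML table (passed via StringIO, since feeding a live URL isn't needed for the mechanics; read_html requires an HTML parser such as lxml to be installed):

```python
import pandas as pd
from io import StringIO

# Made-up table mimicking the two-row, multi-indexed Wikipedia header
html = """
<table>
  <tr><th rowspan="2">Country</th><th colspan="2">GDP (US$ million)</th></tr>
  <tr><th>Estimate</th><th>Year</th></tr>
  <tr><td>World</td><td>105568776</td><td>2023</td></tr>
  <tr><td>United States</td><td>26854599</td><td>2023</td></tr>
</table>
"""
df1 = pd.read_html(StringIO(html))[0]
df1.columns = range(df1.shape[1])   # flatten the two-row header to 0..n-1
df = df1[[0, 1]].iloc[1:].copy()    # keep country + estimate, drop the World row
df.columns = ['Country', 'GDP_USD_millions']
print(df)
```

The same three moves (flatten columns, select by position, slice off the aggregate row) are what extract applies to the real page.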

Transform Function

  • Convert types to float
  • Convert from Millions to Billions
  • Round values to 2 decimals
def transform(df):
        #df['GDP_USD_millions'] = df['GDP_USD_millions'].astype(float)
        df['GDP_USD_millions'] = pd.to_numeric(df['GDP_USD_millions'], errors='coerce')
        df[['GDP_USD_millions']] = df[['GDP_USD_millions']]/1000                      
        df[['GDP_USD_millions']] = np.round(df[['GDP_USD_millions']], 2)
        df.columns = ['Country','GDP_USD_billions']
        return df
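The conversion can be checked on a toy frame with made-up values. Note that read_html normally strips thousands separators for you, so the real column arrives numeric; with raw strings you would remove the commas first, as sketched here:

```python
import pandas as pd
import numpy as np

# Made-up countries and values, purely for illustration
toy = pd.DataFrame({'Country': ['Aland', 'Borduria'],
                    'GDP_USD_millions': ['1,234,567', '89,012']})
toy['GDP_USD_millions'] = pd.to_numeric(
    toy['GDP_USD_millions'].str.replace(',', ''), errors='coerce')
toy['GDP_USD_millions'] = np.round(toy['GDP_USD_millions'] / 1000, 2)
toy.columns = ['Country', 'GDP_USD_billions']
print(toy['GDP_USD_billions'].tolist())  # → [1234.57, 89.01]
```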

Load df to CSV

  • Save the transformed data to Countries_by_GDP.csv
def load_to_csv(df, csv_path):
        df.to_csv(csv_path, index=False) # index=False keeps the row index out of the CSV

Load to DB

  • Load the df to the table: table_name
def load_to_db(df, sql_connection, table_name):
        df.to_sql(table_name, sql_connection, if_exists='replace', index=False)
        print('Table is loaded with data')

Run Queries

def run_query(query_statement, sql_connection):
    print(query_statement)
    query_output = pd.read_sql(query_statement, sql_connection)
    print(query_output)
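The load and query steps can be exercised end to end against an in-memory database, so nothing touches the D:/data paths (countries and figures below are made up):

```python
import pandas as pd
import sqlite3

# In-memory database: a throwaway stand-in for World_Economies.db
conn = sqlite3.connect(':memory:')
demo = pd.DataFrame({'Country': ['Aland', 'Borduria'],
                     'GDP_USD_billions': [150.25, 90.10]})
demo.to_sql('Countries_by_GDP', conn, if_exists='replace', index=False)

out = pd.read_sql('SELECT * FROM Countries_by_GDP WHERE GDP_USD_billions >= 100', conn)
print(out)  # only Aland clears the 100-billion cutoff
conn.close()
```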

Log Progress

def log_progress(message):
        timestamp_format = '%Y-%b-%d-%H:%M:%S' # Year-Monthname-Day-Hour:Minute:Second (%b is the portable abbreviated-month directive)
        now = datetime.now() # get current timestamp
        timestamp = now.strftime(timestamp_format)
        with open(log_file, "a") as f:
                f.write(timestamp + ',' + message + '\n')
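A quick check of the timestamp format on a fixed datetime, so the output is reproducible (%b is the portable equivalent of glibc's %h alias for the abbreviated month name):

```python
from datetime import datetime

# Fixed datetime so the formatted string is deterministic
ts = datetime(2024, 9, 8, 10, 36, 9).strftime('%Y-%b-%d-%H:%M:%S')
print(ts)  # → 2024-Sep-08-10:36:09
```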

Query First 15 Rows

# View 15 rows from Table
query_statement = f"SELECT * FROM {table_name} LIMIT 15"
run_query(query_statement, sql_connection)
SELECT * FROM Countries_by_GDP LIMIT 15
           Country  GDP_USD_billions
0    United States          26854.60
1            China          19373.59
2            Japan           4409.74
3          Germany           4308.85
4            India           3736.88
5   United Kingdom           3158.94
6           France           2923.49
7            Italy           2169.74
8           Canada           2089.67
9           Brazil           2081.24
10          Russia           2062.65
11     South Korea           1721.91
12       Australia           1707.55
13          Mexico           1663.16
14           Spain           1492.43

Query Count of Rows

# Count the number of Countries in the table
query_statement =  f"SELECT COUNT(*) FROM {table_name}"
run_query(query_statement, sql_connection)
SELECT COUNT(*) FROM Countries_by_GDP
   COUNT(*)
0       213

Function Calls


Here is the sequence of function calls:

Task                          Log message on completion
Declaring known values        Preliminaries complete. Initiating ETL process.
Call extract() function       Data extraction complete. Initiating Transformation process.
Call transform() function     Data transformation complete. Initiating loading process.
Call load_to_csv()            Data saved to CSV file.
Initiate SQLite3 connection   SQL Connection initiated.
Call load_to_db()             Data loaded to Database as table. Running the query.
Call run_query()              Process Complete.
Close SQLite3 connection      -
log_progress('Preliminaries complete. Initiating ETL process')

df = extract(url, attribute_list)

log_progress('Data extraction complete. Initiating Transformation process')

df = transform(df)

log_progress('Data transformation complete. Initiating loading process')

load_to_csv(df, transformed_file)

log_progress('Data saved to CSV file')

log_progress('SQL Connection initiated.')

load_to_db(df, sql_connection, table_name)
Table is loaded with data
log_progress('Data loaded to Database as table. Running the query')

query_statement = f"SELECT * FROM {table_name} WHERE GDP_USD_billions >= 100"
run_query(query_statement, sql_connection)
SELECT * FROM Countries_by_GDP WHERE GDP_USD_billions >= 100
          Country  GDP_USD_billions
0   United States          26854.60
1           China          19373.59
2           Japan           4409.74
3         Germany           4308.85
4           India           3736.88
..            ...               ...
64          Kenya            118.13
65         Angola            117.88
66           Oman            104.90
67      Guatemala            102.31
68       Bulgaria            100.64

[69 rows x 2 columns]

log_progress('Process Complete.')

sql_connection.close()

Complete Script


Here is the script all together in one place; the individual steps are described above.

import pandas as pd
import numpy as np
import sqlite3
import requests
from datetime import datetime 

# Global vars
url = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
sql_connection = sqlite3.connect('D:/data/GDP/World_Economies.db')
transformed_file = 'D:/data/GDP/Countries_by_GDP.csv'
table_name = 'Countries_by_GDP'
attribute_list = ['Country', 'GDP_USD_millions']
log_file = 'D:/data/GDP/etl_project_log.txt'

def extract(url, attribute_list):
        tables = pd.read_html(url)
        data = tables[3]
        df1 = pd.DataFrame(data)
        df1.columns = range(df1.shape[1])
        df = df1[[0,2]]
        df = df.iloc[1:214,:].copy()
        #df = df1.iloc[1:,[0,2]].copy()
        df.columns = attribute_list
        return df

def transform(df):
        #df['GDP_USD_millions'] = df['GDP_USD_millions'].astype(float)
        df['GDP_USD_millions'] = pd.to_numeric(df['GDP_USD_millions'], errors='coerce')
        df[['GDP_USD_millions']] = df[['GDP_USD_millions']]/1000                      
        df[['GDP_USD_millions']] = np.round(df[['GDP_USD_millions']], 2)
        df.columns = ['Country','GDP_USD_billions']
        return df


def load_to_csv(df, csv_path):
        df.to_csv(csv_path, index=False) # index=False keeps the row index out of the CSV
        
        
def load_to_db(df, sql_connection, table_name):
        df.to_sql(table_name, sql_connection, if_exists='replace', index=False)
        print('Table is loaded with data')
        

def run_query(query_statement, sql_connection):
    print(query_statement)
    query_output = pd.read_sql(query_statement, sql_connection)
    print(query_output)
    

def log_progress(message):
        timestamp_format = '%Y-%b-%d-%H:%M:%S' # Year-Monthname-Day-Hour:Minute:Second (%b is the portable abbreviated-month directive)
        now = datetime.now() # get current timestamp
        timestamp = now.strftime(timestamp_format)
        with open(log_file, "a") as f:
                f.write(timestamp + ',' + message + '\n')
                
                
# Call functions
log_progress('Preliminaries complete. Initiating ETL process')

df = extract(url, attribute_list)

log_progress('Data extraction complete. Initiating Transformation process')

df = transform(df)

log_progress('Data transformation complete. Initiating loading process')

load_to_csv(df, transformed_file)

log_progress('Data saved to CSV file')

log_progress('SQL Connection initiated.')

load_to_db(df, sql_connection, table_name)

log_progress('Data loaded to Database as table. Running the query')

query_statement = f"SELECT * FROM {table_name} WHERE GDP_USD_billions >= 100"
run_query(query_statement, sql_connection)

log_progress('Process Complete.')

sql_connection.close()

with open(log_file, 'r') as log:
        progress = log.read()
        print(progress)
2024-Sep-08-10:36:09,Preliminaries complete. Initiating ETL process
2024-Sep-08-10:36:11,Data extraction complete. Initiating Transformation process
2024-Sep-08-10:36:11,Data transformation complete. Initiating loading process
2024-Sep-08-10:36:11,Data saved to CSV file
2024-Sep-08-10:36:11,SQL Connection initiated.
2024-Sep-08-10:36:11,Data loaded to Database as table. Running the query
2024-Sep-08-10:36:12,Process Complete.