GDP Table - ETL - SQLite3 - Pandas
A different version of this script, which uses BeautifulSoup instead of pandas' read_html, is detailed in GDP Table - ETL - SQLite3 - BS.
Scenario
- You are tasked with creating an automated script that can extract the list of all countries in order of their GDPs in billion USDs (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF).
- The required information needs to be made accessible as a CSV file Countries_by_GDP.csv
- as well as a table Countries_by_GDP in a database file World_Economies.db with attributes Country and GDP_USD_billion
- Run a query on the database table to display only the entries with economies of more than 100 billion USD.
- Log the entire process of execution in a file named etl_project_log.txt.
Tasks
- Write a data extraction function to retrieve the relevant information from the provided URL.
- Transform the available GDP information into ‘Billion USD’ from ‘Million USD’.
- Load the transformed information to the required CSV file Countries_by_GDP.csv
- Save the data in a SQLite3 database World_Economies.db in the table Countries_by_GDP
- Run the required query on the database
- Log the progress of the code with appropriate timestamps in etl_project_log.txt
Setup
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import sqlite3
import requests
from datetime import datetime
Extract
Global Vars
url = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
sql_connection = sqlite3.connect('D:/data/GDP/World_Economies.db')
transformed_file = 'D:/data/GDP/Countries_by_GDP.csv'
table_name = 'Countries_by_GDP'
attribute_list = ['Country', 'GDP_USD_millions']
log_file = 'D:/data/GDP/etl_project_log.txt'
Extract Function
extract will read all the tables from the web page
- Assign the data to tables
- Extract tables[3], the one we want
- You'll notice that the table's column names are multi-indexed: they span two rows
- So let's rename the columns with their column index numbers: data.shape returns (214, 8) (rows, cols), so shape[1] is 8, and range(df1.shape[1]) is range(8), which gives us column names 0-7
- Filter the df down to columns 1 and 3; we can use .copy() to avoid chained-assignment warnings, or simply drop all columns except the 1st and 3rd, which are at positions [[0, 2]]
- Drop the 1st row because it contains the GDP for the entire world, so we slice from 1: to the end
- Assign new column names
- return the df
def extract(url, attribute_list):
    tables = pd.read_html(url)
    data = tables[3]
    df1 = pd.DataFrame(data)
    df1.columns = range(df1.shape[1])
    df = df1[[0,2]]
    df = df.iloc[1:214,:].copy()
    df.columns = attribute_list
    return df
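The column-flattening trick above can be seen on a toy frame with two-level headers. This is only an illustration: the real Wikipedia table has 8 columns and 214 rows, and the header names shown here are abbreviated guesses at its layout.

```python
import pandas as pd

# Toy frame with two-level column headers, mimicking the Wikipedia
# table layout (3 columns shown here instead of the real 8)
cols = pd.MultiIndex.from_tuples([
    ('Country/Territory', 'Country/Territory'),
    ('UN region', 'UN region'),
    ('IMF', 'Estimate'),
])
df1 = pd.DataFrame(
    [['World', '-', 105568776],
     ['United States', 'Americas', 26854599]],
    columns=cols)

# Flatten the two-row header to positional names 0..n-1, then keep
# columns 0 and 2 and drop the aggregate "World" row
df1.columns = range(df1.shape[1])
df = df1[[0, 2]].iloc[1:, :].copy()
df.columns = ['Country', 'GDP_USD_millions']
print(df)
```

Assigning `range(df1.shape[1])` replaces the MultiIndex wholesale, so plain integer labels can then be used for column selection.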
Transform Function
- Convert types to float
- Convert from Millions to Billions
- Round values to 2 decimals
def transform(df):
    #df['GDP_USD_millions'] = df['GDP_USD_millions'].astype(float)
    df['GDP_USD_millions'] = pd.to_numeric(df['GDP_USD_millions'], errors='coerce')
    df[['GDP_USD_millions']] = df[['GDP_USD_millions']]/1000
    df[['GDP_USD_millions']] = np.round(df[['GDP_USD_millions']], 2)
    df.columns = ['Country','GDP_USD_billions']
    return df
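Why pd.to_numeric with errors='coerce' rather than the commented-out .astype(float)? Scraped GDP cells can contain non-numeric placeholders for territories without an IMF estimate; coercion turns those into NaN instead of raising. A minimal sketch (the placeholder values here are made up):

```python
import pandas as pd

# A column as it might come off the page: numbers plus a placeholder.
# .astype(float) would raise on the dash; errors='coerce' yields NaN.
s = pd.Series(['26854599', '19373586', '—'])
out = pd.to_numeric(s, errors='coerce')
print(out)
```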
Load df to CSV
- Save the transformed data to Countries_by_GDP.csv
def load_to_csv(df, csv_path):
    df.to_csv(csv_path)
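Note that to_csv writes the DataFrame's row index as an unnamed first column by default; pass index=False if you want only the two data columns in the file. A small in-memory sketch (the sample row is taken from the query output below):

```python
import io
import pandas as pd

df = pd.DataFrame({'Country': ['United States'],
                   'GDP_USD_billions': [26854.60]})

buf = io.StringIO()
df.to_csv(buf, index=False)  # omit the row index column
print(buf.getvalue())
```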
Load to DB
- Load the df to the table: table_name
def load_to_db(df, sql_connection, table_name):
    df.to_sql(table_name, sql_connection, if_exists='replace', index=False)
    print('Table is loaded with data')
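The if_exists='replace' choice matters for re-runs: 'replace' drops and recreates the table each time, 'append' would add the same rows again, and 'fail' (the default) raises if the table already exists. A quick demonstration against a throwaway in-memory database:

```python
import sqlite3
import pandas as pd

# In-memory database purely for illustration
conn = sqlite3.connect(':memory:')
df = pd.DataFrame({'Country': ['United States'],
                   'GDP_USD_billions': [26854.60]})

# Loading twice with 'replace' does not duplicate rows,
# because the second call drops and recreates the table
df.to_sql('Countries_by_GDP', conn, if_exists='replace', index=False)
df.to_sql('Countries_by_GDP', conn, if_exists='replace', index=False)
count = conn.execute('SELECT COUNT(*) FROM Countries_by_GDP').fetchone()[0]
print(count)  # still 1, not 2
conn.close()
```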
Run Queries
def run_query(query_statement, sql_connection):
    print(query_statement)
    query_output = pd.read_sql(query_statement, sql_connection)
    print(query_output)
Log Progress
def log_progress(message):
    timestamp_format = '%Y-%h-%d-%H:%M:%S' # Year-Monthname-Day-Hour:Minute:Second
    now = datetime.now() # get current timestamp
    timestamp = now.strftime(timestamp_format)
    with open(log_file, "a") as f:
        f.write(timestamp + ',' + message + '\n')
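A note on the format string: '%h' is a C-library extension equivalent to '%b' (abbreviated month name). It works on most platforms, but '%b' is the portable spelling. With a fixed datetime the format produces exactly the shape seen in the log file below:

```python
from datetime import datetime

# '%b' (portable equivalent of '%h') gives the abbreviated month name
ts = datetime(2024, 9, 8, 10, 36, 9).strftime('%Y-%b-%d-%H:%M:%S')
print(ts)  # 2024-Sep-08-10:36:09
```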
Query First 15 Rows
# View 15 rows from Table
query_statement = f"SELECT * FROM {table_name} LIMIT 15"
run_query(query_statement, sql_connection)
SELECT * FROM Countries_by_GDP LIMIT 15
Country GDP_USD_billions
0 United States 26854.60
1 China 19373.59
2 Japan 4409.74
3 Germany 4308.85
4 India 3736.88
5 United Kingdom 3158.94
6 France 2923.49
7 Italy 2169.74
8 Canada 2089.67
9 Brazil 2081.24
10 Russia 2062.65
11 South Korea 1721.91
12 Australia 1707.55
13 Mexico 1663.16
14 Spain 1492.43
Query Count of Rows
# Count the number of Countries in the table
query_statement = f"SELECT COUNT(*) FROM {table_name}"
run_query(query_statement, sql_connection)
SELECT COUNT(*) FROM Countries_by_GDP
COUNT(*)
0 213
Function Calls
Here is the sequence of function calls:
Task | Log message on completion |
---|---|
Declaring known values | Preliminaries complete. Initiating ETL process. |
Call extract() function | Data extraction complete. Initiating Transformation process. |
Call transform() function | Data transformation complete. Initiating loading process. |
Call load_to_csv() | Data saved to CSV file. |
Initiate SQLite3 connection | SQL Connection initiated. |
Call load_to_db() | Data loaded to Database as table. Running the query. |
Call run_query() * | Process Complete. |
Close SQLite3 connection | - |
log_progress('Preliminaries complete. Initiating ETL process')

df = extract(url, attribute_list)
log_progress('Data extraction complete. Initiating Transformation process')

df = transform(df)
log_progress('Data transformation complete. Initiating loading process')

load_to_csv(df, transformed_file)
log_progress('Data saved to CSV file')

log_progress('SQL Connection initiated.')
load_to_db(df, sql_connection, table_name)
Table is loaded with data
log_progress('Data loaded to Database as table. Running the query')

query_statement = f"SELECT * from {table_name} WHERE GDP_USD_billions >= 100"
run_query(query_statement, sql_connection)
SELECT * from Countries_by_GDP WHERE GDP_USD_billions >= 100
Country GDP_USD_billions
0 United States 26854.60
1 China 19373.59
2 Japan 4409.74
3 Germany 4308.85
4 India 3736.88
.. ... ...
64 Kenya 118.13
65 Angola 117.88
66 Oman 104.90
67 Guatemala 102.31
68 Bulgaria 100.64
[69 rows x 2 columns]
log_progress('Process Complete.')
sql_connection.close()
Complete Script
Here is the script all together in one place. Each step is described above.
import pandas as pd
import numpy as np
import sqlite3
import requests
from datetime import datetime
# Global vars
url = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'
sql_connection = sqlite3.connect('D:/data/GDP/World_Economies.db')
transformed_file = 'D:/data/GDP/Countries_by_GDP.csv'
table_name = 'Countries_by_GDP'
attribute_list = ['Country', 'GDP_USD_millions']
log_file = 'D:/data/GDP/etl_project_log.txt'

def extract(url, attribute_list):
    tables = pd.read_html(url)
    data = tables[3]
    df1 = pd.DataFrame(data)
    df1.columns = range(df1.shape[1])
    df = df1[[0,2]]
    df = df.iloc[1:214,:].copy()
    #df = df1.iloc[1:,[0,2]].copy()
    df.columns = attribute_list
    return df

def transform(df):
    #df['GDP_USD_millions'] = df['GDP_USD_millions'].astype(float)
    df['GDP_USD_millions'] = pd.to_numeric(df['GDP_USD_millions'], errors='coerce')
    df[['GDP_USD_millions']] = df[['GDP_USD_millions']]/1000
    df[['GDP_USD_millions']] = np.round(df[['GDP_USD_millions']], 2)
    df.columns = ['Country','GDP_USD_billions']
    return df

def load_to_csv(df, csv_path):
    df.to_csv(csv_path)

def load_to_db(df, sql_connection, table_name):
    df.to_sql(table_name, sql_connection, if_exists='replace', index=False)
    print('Table is loaded with data')

def run_query(query_statement, sql_connection):
    print(query_statement)
    query_output = pd.read_sql(query_statement, sql_connection)
    print(query_output)

def log_progress(message):
    timestamp_format = '%Y-%h-%d-%H:%M:%S' # Year-Monthname-Day-Hour:Minute:Second
    now = datetime.now() # get current timestamp
    timestamp = now.strftime(timestamp_format)
    with open(log_file, "a") as f:
        f.write(timestamp + ',' + message + '\n')

# Call functions
log_progress('Preliminaries complete. Initiating ETL process')

df = extract(url, attribute_list)
log_progress('Data extraction complete. Initiating Transformation process')

df = transform(df)
log_progress('Data transformation complete. Initiating loading process')

load_to_csv(df, transformed_file)
log_progress('Data saved to CSV file')

log_progress('SQL Connection initiated.')
load_to_db(df, sql_connection, table_name)
log_progress('Data loaded to Database as table. Running the query')

query_statement = f"SELECT * from {table_name} WHERE GDP_USD_billions >= 100"
run_query(query_statement, sql_connection)

log_progress('Process Complete.')
sql_connection.close()

with open(log_file, 'r') as log:
    progress = log.read()
print(progress)
2024-Sep-08-10:36:09,Preliminaries complete. Initiating ETL process
2024-Sep-08-10:36:11,Data extraction complete. Initiating Transformation process
2024-Sep-08-10:36:11,Data transformation complete. Initiating loading process
2024-Sep-08-10:36:11,Data saved to CSV file
2024-Sep-08-10:36:11,SQL Connection initiated.
2024-Sep-08-10:36:11,Data loaded to Database as table. Running the query
2024-Sep-08-10:36:12,Process Complete.
2024-Sep-08-10:54:15,Preliminaries complete. Initiating ETL process
2024-Sep-08-10:54:18,Data extraction complete. Initiating Transformation process
2024-Sep-08-10:54:18,Data transformation complete. Initiating loading process
2024-Sep-08-10:54:18,Data saved to CSV file
2024-Sep-08-10:54:18,SQL Connection initiated.
2024-Sep-08-10:54:18,Data loaded to Database as table. Running the query
2024-Sep-08-10:54:18,Process Complete.
2024-Sep-08-10:56:04,Preliminaries complete. Initiating ETL process
2024-Sep-08-10:56:06,Data extraction complete. Initiating Transformation process
2024-Sep-08-10:56:06,Data transformation complete. Initiating loading process
2024-Sep-08-10:56:06,Data saved to CSV file
2024-Sep-08-10:56:06,SQL Connection initiated.
2024-Sep-08-10:56:06,Data loaded to Database as table. Running the query
2024-Sep-08-10:56:07,Process Complete.
2024-Sep-08-10:59:13,Preliminaries complete. Initiating ETL process
2024-Sep-08-10:59:16,Data extraction complete. Initiating Transformation process
2024-Sep-08-10:59:16,Data transformation complete. Initiating loading process
2024-Sep-08-10:59:16,Data saved to CSV file
2024-Sep-08-10:59:16,SQL Connection initiated.
2024-Sep-08-10:59:16,Data loaded to Database as table. Running the query
2024-Sep-08-10:59:16,Process Complete.
2024-Sep-08-11:11:59,Preliminaries complete. Initiating ETL process
2024-Sep-08-11:12:03,Data extraction complete. Initiating Transformation process
2024-Sep-08-11:12:04,Data transformation complete. Initiating loading process
2024-Sep-08-11:12:04,Data saved to CSV file
2024-Sep-08-11:12:04,SQL Connection initiated.
2024-Sep-08-11:12:04,Data loaded to Database as table. Running the query
2024-Sep-08-11:12:04,Process Complete.
2024-Sep-08-14:13:34,Preliminaries complete. Initiating ETL process
2024-Sep-08-15:08:10,Preliminaries complete. Initiating ETL process
2024-Sep-08-15:08:14,Data extraction complete. Initiating Transformation process
2024-Sep-08-15:08:14,Data transformation complete. Initiating loading process
2024-Sep-08-15:08:14,Data saved to CSV file
2024-Sep-08-15:08:14,SQL Connection initiated.
2024-Sep-08-15:08:14,Data loaded to Database as table. Running the query
2024-Sep-08-15:08:14,Process Complete.
2024-Sep-08-16:24:38,Preliminaries complete. Initiating ETL process
2024-Sep-08-16:24:43,Data extraction complete. Initiating Transformation process
2024-Sep-08-16:24:43,Data transformation complete. Initiating loading process
2024-Sep-08-16:24:43,Data saved to CSV file
2024-Sep-08-16:24:43,SQL Connection initiated.
2024-Sep-08-16:24:43,Data loaded to Database as table. Running the query
2024-Sep-08-16:24:43,Process Complete.
2024-Sep-08-16:24:59,Preliminaries complete. Initiating ETL process
2024-Sep-08-16:25:00,Data extraction complete. Initiating Transformation process
2024-Sep-08-16:25:00,Data transformation complete. Initiating loading process
2024-Sep-08-16:25:00,Data saved to CSV file
2024-Sep-08-16:25:00,SQL Connection initiated.
2024-Sep-08-16:25:00,Data loaded to Database as table. Running the query
2024-Sep-08-16:25:00,Process Complete.
2024-Sep-08-16:35:36,Preliminaries complete. Initiating ETL process
2024-Sep-08-16:35:40,Data extraction complete. Initiating Transformation process
2024-Sep-08-16:35:40,Data transformation complete. Initiating loading process
2024-Sep-08-16:35:40,Data saved to CSV file
2024-Sep-08-16:35:40,SQL Connection initiated.
2024-Sep-08-16:35:40,Data loaded to Database as table. Running the query
2024-Sep-08-16:35:40,Process Complete.
2024-Sep-09-13:32:02,Preliminaries complete. Initiating ETL process
2024-Sep-09-13:32:43,Preliminaries complete. Initiating ETL process
2024-Sep-09-13:32:58,Preliminaries complete. Initiating ETL process
2024-Sep-09-13:34:05,Preliminaries complete. Initiating ETL process
2024-Sep-09-13:40:25,Preliminaries complete. Initiating ETL process
2024-Sep-09-13:40:27,Data extraction complete. Initiating Transformation process
2024-Sep-09-13:40:27,Data transformation complete. Initiating loading process
2024-Sep-09-13:40:27,Data saved to CSV file
2024-Sep-09-13:40:27,SQL Connection initiated.
2024-Sep-09-13:40:27,Data loaded to Database as table. Running the query
2024-Sep-09-13:40:27,Process Complete.
2024-Sep-09-18:12:28,Preliminaries complete. Initiating ETL process
2024-Sep-09-18:12:29,Data extraction complete. Initiating Transformation process
2024-Sep-09-18:12:29,Data transformation complete. Initiating loading process
2024-Sep-09-18:12:29,Data saved to CSV file
2024-Sep-09-18:12:29,SQL Connection initiated.
2024-Sep-09-18:12:29,Data loaded to Database as table. Running the query
2024-Sep-09-18:12:29,Process Complete.
2024-Sep-10-17:46:20,Preliminaries complete. Initiating ETL process
2024-Sep-10-17:46:21,Data extraction complete. Initiating Transformation process
2024-Sep-10-17:46:21,Data transformation complete. Initiating loading process
2024-Sep-10-17:46:21,Data saved to CSV file
2024-Sep-10-17:46:21,SQL Connection initiated.
2024-Sep-10-17:46:21,Data loaded to Database as table. Running the query
2024-Sep-10-17:46:21,Process Complete.
2024-Sep-10-19:11:55,Preliminaries complete. Initiating ETL process
2024-Sep-10-19:11:55,Data extraction complete. Initiating Transformation process
2024-Sep-10-19:11:55,Data transformation complete. Initiating loading process
2024-Sep-10-19:11:55,Data saved to CSV file
2024-Sep-10-19:11:55,SQL Connection initiated.
2024-Sep-10-19:11:55,Data loaded to Database as table. Running the query
2024-Sep-10-19:11:56,Process Complete.
2024-Sep-11-11:01:30,Preliminaries complete. Initiating ETL process
2024-Sep-11-11:01:34,Data extraction complete. Initiating Transformation process
2024-Sep-11-11:01:34,Data transformation complete. Initiating loading process
2024-Sep-11-11:01:34,Data saved to CSV file
2024-Sep-11-11:01:34,SQL Connection initiated.
2024-Sep-11-11:01:34,Data loaded to Database as table. Running the query
2024-Sep-11-11:01:34,Process Complete.
2024-Sep-11-11:42:17,Preliminaries complete. Initiating ETL process
2024-Sep-11-11:42:21,Data extraction complete. Initiating Transformation process
2024-Sep-11-11:42:21,Data transformation complete. Initiating loading process
2024-Sep-11-11:42:21,Data saved to CSV file
2024-Sep-11-11:42:21,SQL Connection initiated.
2024-Sep-11-11:42:21,Data loaded to Database as table. Running the query
2024-Sep-11-11:42:21,Process Complete.
2024-Sep-11-11:48:43,Preliminaries complete. Initiating ETL process
2024-Sep-11-11:48:44,Data extraction complete. Initiating Transformation process
2024-Sep-11-11:48:44,Data transformation complete. Initiating loading process
2024-Sep-11-11:48:44,Data saved to CSV file
2024-Sep-11-11:48:44,SQL Connection initiated.
2024-Sep-11-11:48:44,Data loaded to Database as table. Running the query
2024-Sep-11-11:48:44,Process Complete.
2024-Sep-11-12:00:04,Preliminaries complete. Initiating ETL process
2024-Sep-11-12:00:05,Data extraction complete. Initiating Transformation process
2024-Sep-11-12:00:05,Data transformation complete. Initiating loading process
2024-Sep-11-12:00:05,Data saved to CSV file
2024-Sep-11-12:00:05,SQL Connection initiated.
2024-Sep-11-12:00:05,Data loaded to Database as table. Running the query
2024-Sep-11-12:00:05,Process Complete.
2024-Sep-15-14:12:00,Preliminaries complete. Initiating ETL process
2024-Sep-15-14:12:00,Data extraction complete. Initiating Transformation process
2024-Sep-15-14:12:00,Data transformation complete. Initiating loading process
2024-Sep-15-14:12:00,Data saved to CSV file
2024-Sep-15-14:12:00,SQL Connection initiated.
2024-Sep-15-14:12:00,Data loaded to Database as table. Running the query
2024-Sep-15-14:12:00,Process Complete.
2024-Sep-17-15:49:54,Preliminaries complete. Initiating ETL process
2024-Sep-17-15:49:56,Data extraction complete. Initiating Transformation process
2024-Sep-17-15:49:56,Data transformation complete. Initiating loading process
2024-Sep-17-15:49:56,Data saved to CSV file
2024-Sep-17-15:49:56,SQL Connection initiated.
2024-Sep-17-15:49:56,Data loaded to Database as table. Running the query
2024-Sep-17-15:49:56,Process Complete.
2024-Sep-21-19:42:30,Preliminaries complete. Initiating ETL process
2024-Sep-21-19:42:32,Data extraction complete. Initiating Transformation process
2024-Sep-21-19:42:32,Data transformation complete. Initiating loading process
2024-Sep-21-19:42:32,Data saved to CSV file
2024-Sep-21-19:42:32,SQL Connection initiated.
2024-Sep-21-19:42:33,Data loaded to Database as table. Running the query
2024-Sep-21-19:42:33,Process Complete.
2024-Sep-27-10:30:00,Preliminaries complete. Initiating ETL process
2024-Sep-27-10:30:01,Data extraction complete. Initiating Transformation process
2024-Sep-27-10:30:01,Data transformation complete. Initiating loading process
2024-Sep-27-10:30:01,Data saved to CSV file
2024-Sep-27-10:30:01,SQL Connection initiated.
2024-Sep-27-10:30:01,Data loaded to Database as table. Running the query
2024-Sep-27-10:30:01,Process Complete.
2024-Sep-29-10:29:54,Preliminaries complete. Initiating ETL process
2024-Sep-29-10:29:56,Data extraction complete. Initiating Transformation process
2024-Sep-29-10:29:56,Data transformation complete. Initiating loading process
2024-Sep-29-10:29:56,Data saved to CSV file
2024-Sep-29-10:29:56,SQL Connection initiated.
2024-Sep-29-10:29:56,Data loaded to Database as table. Running the query
2024-Sep-29-10:29:56,Process Complete.
2024-Oct-07-11:18:47,Preliminaries complete. Initiating ETL process
2024-Oct-07-11:18:49,Data extraction complete. Initiating Transformation process
2024-Oct-07-11:18:49,Data transformation complete. Initiating loading process
2024-Oct-07-11:18:49,Data saved to CSV file
2024-Oct-07-11:18:49,SQL Connection initiated.
2024-Oct-07-11:18:49,Data loaded to Database as table. Running the query
2024-Oct-07-11:18:49,Process Complete.
2024-Oct-15-10:11:40,Preliminaries complete. Initiating ETL process
2024-Oct-15-10:11:41,Data extraction complete. Initiating Transformation process
2024-Oct-15-10:11:41,Data transformation complete. Initiating loading process
2024-Oct-15-10:11:41,Data saved to CSV file
2024-Oct-15-10:11:41,SQL Connection initiated.
2024-Oct-15-10:11:41,Data loaded to Database as table. Running the query
2024-Oct-15-10:11:41,Process Complete.
2024-Oct-30-17:54:41,Preliminaries complete. Initiating ETL process
2024-Oct-30-17:54:42,Data extraction complete. Initiating Transformation process
2024-Oct-30-17:54:42,Data transformation complete. Initiating loading process
2024-Oct-30-17:54:42,Data saved to CSV file
2024-Oct-30-17:54:42,SQL Connection initiated.
2024-Oct-30-17:54:42,Data loaded to Database as table. Running the query
2024-Oct-30-17:54:42,Process Complete.
Practice Problems
Write a function to extract the tabular information from the given URL under the heading By Market Capitalization, and save it to a data frame.
Write a function to transform the data frame by adding columns for Market Capitalization in GBP, EUR, and INR, rounded to 2 decimal places, based on the exchange rate information shared as a CSV file.
Write a function to load the transformed data frame to an output CSV file.
Write a function to load the transformed data frame to an SQL database server as a table.
Write a function to run queries on the database table.
Run the following queries on the database table:
- Extract the information for the London office, that is Name and MC_GBP_Billion
- Extract the information for the Berlin office, that is Name and MC_EUR_Billion
- Extract the information for New Delhi office, that is Name and MC_INR_Billion
Write a function to log the progress of the code.
While executing the data initialization commands and function calls, maintain appropriate log entries.
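The exchange-rate transform in the practice tasks above could be sketched like this. Everything here is an assumption, not given by the task: the column names MC_USD_Billion, Currency, and Rate, the CSV layout, and the example rates and market-cap figure are all made up for illustration.

```python
import io
import pandas as pd

def transform_mc(df, exchange_rate_csv):
    """Add GBP/EUR/INR market-cap columns from an exchange-rate CSV."""
    # The CSV is assumed to have columns 'Currency' and 'Rate'
    rates = pd.read_csv(exchange_rate_csv).set_index('Currency')['Rate']
    for cur in ('GBP', 'EUR', 'INR'):
        df[f'MC_{cur}_Billion'] = (df['MC_USD_Billion'] * rates[cur]).round(2)
    return df

# Tiny in-memory example (rates and figures are made up)
rates_csv = io.StringIO('Currency,Rate\nGBP,0.8\nEUR,0.93\nINR,82.95\n')
banks = pd.DataFrame({'Name': ['JPMorgan Chase'], 'MC_USD_Billion': [432.92]})
banks = transform_mc(banks, rates_csv)
print(banks[['Name', 'MC_GBP_Billion']])
```

The same load_to_csv, load_to_db, run_query, and log_progress patterns from the GDP script then apply unchanged.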