library(reticulate)
Scrape GDP Table Pandas
Case Study
We’ll use pandas to directly scrape a data table from a webpage
- Create a script to scrape a table from an online site
- Need to extract the list of the top 10 largest economies of the world
- In descending order of their GDPs in Billion USD to 2 decimal places
- Data need to be supplied by the IMF (International Monetary Fund)
- Save the dataframe into a CSV file
- Site can be found here: https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29
Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
Data
read_html
Remember: read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content
- The table we need is the one on the bottom so let’s read the page and save to tables
= "https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"
url = pd.read_html(url)
tables
# Since we know there are 3 tables on the page we can just choose the one we want
= tables[3] df
# As you can see here df1 is the world map data
= tables[2]
df1 10) df1.head(
0 ... 2
0 > $20 trillion $10–20 trillion $5–10 trillion ... ... $50–100 billion $25–50 billion $5–25 billion <...
[1 rows x 3 columns]
# Here is a quick look at our data to make sure it is the correct table
10) df.head(
Country/Territory UN region ... United Nations[15]
Country/Territory UN region ... Estimate Year
0 World — ... 96698005 2021
1 United States Americas ... 23315081 2021
2 China Asia ... 17734131 [n 1]2021
3 Japan Asia ... 4940878 2021
4 Germany Europe ... 4259935 2021
5 India Asia ... 3201471 2021
6 United Kingdom Europe ... 3131378 2021
7 France Europe ... 2957880 2021
8 Italy Europe ... 2107703 2021
9 Canada Americas ... 1988336 2021
[10 rows x 8 columns]
Shape
# let's see the shape of the table: a tuple(214 rows with 8 columns)
df.shape
(214, 8)
# so we know the range of the columns, the second element of the tuple is 0:8
range(df.shape[1])
range(0, 8)
Columns
- If you look at the image, you can see that once you get past column 2 each pair of columns is grouped together
- You can see that in the list below: IMF appears twice, so does World Bank, United Nations
df.columns
MultiIndex([( 'Country/Territory', 'Country/Territory'),
( 'UN region', 'UN region'),
( 'IMF[1][13]', 'Estimate'),
( 'IMF[1][13]', 'Year'),
( 'World Bank[14]', 'Estimate'),
( 'World Bank[14]', 'Year'),
('United Nations[15]', 'Estimate'),
('United Nations[15]', 'Year')],
)
Convert Columnnames
- To avoid confusion let’s just rename all the colums by their position in the range 0:8
= range(df.shape[1])
df.columns
# Let's look at the head once again to make sure it's right
5) df.head(
0 1 2 ... 5 6 7
0 World — 105568776 ... 2022 96698005 2021
1 United States Americas 26854599 ... 2022 23315081 2021
2 China Asia 19373586 ... [n 3]2022 17734131 [n 1]2021
3 Japan Asia 4409738 ... 2022 4940878 2021
4 Germany Europe 4308854 ... 2022 4259935 2021
[5 rows x 8 columns]
Extract Columns
- As requested we are only interested in the name of the country column [0]
- And the GDP for that country column [2]
- So let’s extract them
= df[[0,2]]
df 5) df.head(
0 2
0 World 105568776
1 United States 26854599
2 China 19373586
3 Japan 4409738
4 Germany 4308854
Sort Descending
- It appears that column 2 is sorted in descending order but you always need to check and
- sort it yourself
= df.sort_index(axis=1)
df 11) df.head(
0 2
0 World 105568776
1 United States 26854599
2 China 19373586
3 Japan 4409738
4 Germany 4308854
5 India 3736882
6 United Kingdom 3158938
7 France 2923489
8 Italy 2169745
9 Canada 2089672
10 Brazil 2081235
Extract Top 10
- Remember we wanted to top 10 countries
- If you look at the df you’ll notice the first row is the entire World’s GDP
- So we need to extract rows 1:11
= df.iloc[1:11,:] df
Rename Columns
= ["Country", "GDP (Millions USD)"]
df.columns df
Country GDP (Millions USD)
1 United States 26854599
2 China 19373586
3 Japan 4409738
4 Germany 4308854
5 India 3736882
6 United Kingdom 3158938
7 France 2923489
8 Italy 2169745
9 Canada 2089672
10 Brazil 2081235
Type
# Let's check the type of data in the df
df.dtypes
Country object
GDP (Millions USD) object
dtype: object
Convert Type
- We are supposed to have the GDP in Billions of USD not Millions
- In order to do that we need to convert the data type to int
- Come to think of it we should’ve done that before we renamed the columns !!!
'GDP (Millions USD)'] = df['GDP (Millions USD)'].astype(float) df[
df.dtypes
Country object
GDP (Millions USD) float64
dtype: object
Round Column
- Let’s convert to Billions
- Round to 2 decimals
'GDP (Millions USD)'] = df['GDP (Millions USD)']/1000
df[
# note the [[ ]]
'GDP (Millions USD)']] = df[['GDP (Millions USD)']]/1000
df[[
# set decimal points
'GDP (Millions USD)']] = np.round(df[['GDP (Millions USD)']],2) df[[
# rename column to Billions
= {'GDP (Millions USD)' : 'GDP (Billions USD)'}) df.rename(columns
Country GDP (Billions USD)
1 United States 26.85
2 China 19.37
3 Japan 4.41
4 Germany 4.31
5 India 3.74
6 United Kingdom 3.16
7 France 2.92
8 Italy 2.17
9 Canada 2.09
10 Brazil 2.08
Save to CSV
'./filename') df.to_csv(