Scrape GDP Table Pandas

Case Study


We’ll use pandas to directly scrape a data table from a webpage

Setup


library(reticulate)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

Data


read_html

Remember: read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content

  • The table we need is the one on the bottom so let’s read the page and save to tables
url = "https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"
tables = pd.read_html(url)

# Since we know there are 3 tables on the page we can just choose the one we want
df = tables[3]
# As you can see here df1 is the world map data
df1 = tables[2]
df1.head(10)
                                                   0  ...                                                  2
0  > $20 trillion $10–20 trillion $5–10 trillion ...  ...  $50–100 billion $25–50 billion $5–25 billion <...

[1 rows x 3 columns]
# Here is a quick look at our data to make sure it is the correct table
df.head(10)
  Country/Territory UN region  ... United Nations[15]           
  Country/Territory UN region  ...           Estimate       Year
0             World         —  ...           96698005       2021
1     United States  Americas  ...           23315081       2021
2             China      Asia  ...           17734131  [n 1]2021
3             Japan      Asia  ...            4940878       2021
4           Germany    Europe  ...            4259935       2021
5             India      Asia  ...            3201471       2021
6    United Kingdom    Europe  ...            3131378       2021
7            France    Europe  ...            2957880       2021
8             Italy    Europe  ...            2107703       2021
9            Canada  Americas  ...            1988336       2021

[10 rows x 8 columns]

Shape

# let's see the shape of the table: a tuple(214 rows with 8 columns)
df.shape
(214, 8)
# so we know the range of the columns, the second element of the tuple is 0:8
range(df.shape[1])
range(0, 8)

Columns

  • If you look at the image, you can see that once you get past column 2 each pair of columns is grouped together
  • You can see that in the list below: IMF appears twice, so does World Bank, United Nations
df.columns
MultiIndex([( 'Country/Territory', 'Country/Territory'),
            (         'UN region',         'UN region'),
            (        'IMF[1][13]',          'Estimate'),
            (        'IMF[1][13]',              'Year'),
            (    'World Bank[14]',          'Estimate'),
            (    'World Bank[14]',              'Year'),
            ('United Nations[15]',          'Estimate'),
            ('United Nations[15]',              'Year')],
           )

Convert Columnnames

  • To avoid confusion let’s just rename all the colums by their position in the range 0:8
df.columns = range(df.shape[1])

# Let's look at the head once again to make sure it's right
df.head(5)
               0         1          2  ...          5         6          7
0          World         —  105568776  ...       2022  96698005       2021
1  United States  Americas   26854599  ...       2022  23315081       2021
2          China      Asia   19373586  ...  [n 3]2022  17734131  [n 1]2021
3          Japan      Asia    4409738  ...       2022   4940878       2021
4        Germany    Europe    4308854  ...       2022   4259935       2021

[5 rows x 8 columns]

Extract Columns

  • As requested we are only interested in the name of the country column [0]
  • And the GDP for that country column [2]
  • So let’s extract them
df = df[[0,2]]
df.head(5)
               0          2
0          World  105568776
1  United States   26854599
2          China   19373586
3          Japan    4409738
4        Germany    4308854

Sort Descending

  • It appears that column 2 is sorted in descending order but you always need to check and
  • sort it yourself
df = df.sort_index(axis=1)
df.head(11)
                 0          2
0            World  105568776
1    United States   26854599
2            China   19373586
3            Japan    4409738
4          Germany    4308854
5            India    3736882
6   United Kingdom    3158938
7           France    2923489
8            Italy    2169745
9           Canada    2089672
10          Brazil    2081235

Extract Top 10

  • Remember we wanted to top 10 countries
  • If you look at the df you’ll notice the first row is the entire World’s GDP
  • So we need to extract rows 1:11
df = df.iloc[1:11,:]

Rename Columns

df.columns = ["Country", "GDP (Millions USD)"]
df
           Country GDP (Millions USD)
1    United States           26854599
2            China           19373586
3            Japan            4409738
4          Germany            4308854
5            India            3736882
6   United Kingdom            3158938
7           France            2923489
8            Italy            2169745
9           Canada            2089672
10          Brazil            2081235

Type

# Let's check the type of data in the df
df.dtypes
Country               object
GDP (Millions USD)    object
dtype: object

Convert Type

  • We are supposed to have the GDP in Billions of USD not Millions
  • In order to do that we need to convert the data type to int
  • Come to think of it we should’ve done that before we renamed the columns !!!
df['GDP (Millions USD)'] = df['GDP (Millions USD)'].astype(float)
df.dtypes
Country                object
GDP (Millions USD)    float64
dtype: object

Round Column

  • Let’s convert to Billions
  • Round to 2 decimals
df['GDP (Millions USD)'] = df['GDP (Millions USD)']/1000

# note the [[ ]]
df[['GDP (Millions USD)']] = df[['GDP (Millions USD)']]/1000

# set decimal points
df[['GDP (Millions USD)']] = np.round(df[['GDP (Millions USD)']],2)
# rename column to Billions
df.rename(columns = {'GDP (Millions USD)' : 'GDP (Billions USD)'})
           Country  GDP (Billions USD)
1    United States               26.85
2            China               19.37
3            Japan                4.41
4          Germany                4.31
5            India                3.74
6   United Kingdom                3.16
7           France                2.92
8            Italy                2.17
9           Canada                2.09
10          Brazil                2.08

Save to CSV


df.to_csv('./filename')