Import CSV - Create db

Objective


Create a database using Python

Load the data from a CSV file as a table to the database

Run basic “queries” on the database to access the information

Scenario


  • Consider a dataset of employee records that is available with an HR team in a CSV file.
  • You are required to create the database called STAFF and
  • load the contents of the CSV file as a table called INSTRUCTORS.
  • The headers of the available data are :
Header Description
ID Employee ID
FNAME First Name
LNAME Last Name
CITY City of residence
CCODE Country code (2 letters)

Import Data


Using the shell

  • From cmd
    • wget the csv file
    • install pandas
wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-PY0221EN-Coursera/labs/v2/INSTRUCTOR.csv

# now the file is stored locally
python3.11 -m pip install pandas
  • Import libraries
import pandas as pd
import sqlite3
  • Data is found here
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-PY0221EN-Coursera/labs/v2/INSTRUCTOR.csv"
destination_file = "D:/Data/INSTRUCTOR.csv"
  • Skip this part if you already used the shell to import the data file
  • There are many ways to read the csv file with pandas: df = pd.read_csv('url')
  • But I will use the with open() as shown below to read the file and copy it locally
library(reticulate)
import requests
import sqlite3
import pandas as pd

With open

with open(url, "r") as source_file:
        with open(destination_file, 'w') as local_file:
                for line in source_file:
                        destination_file.write(line)

Create SQLite3 db


To create a table in the database, you first need to have the attributes of the required table. Attributes are columns of the table. Along with their names, the knowledge of their data types are also required. 

  • We first assign the conn to the name of database
  • Name the table
  • Assign the column names
conn = sqlite3.connect('D:/data/STAFF.db')
file_path = 'D:/data/INSTRUCTOR.csv'
table_name = 'INSTRUCTOR'
attribute_list = ['ID', 'FNAME', 'LNAME', 'CITY', 'CCODE']

Read csv File


  • Now, to read the CSV using Pandas, you use the read_csv() function.
  • Since this CSV does not contain headers
  • Use the keys of the attribute_dict dictionary as a list to assign headers to the data with names=()
df = pd.read_csv(file_path, names = attribute_list)

Load Data to Table


The pandas library provides easy loading of its dataframes directly to the database. For this, you may use the to_sql() method of the dataframe object.

  • If a table with the same name already exists in the database. If so, and it isn’t required anymore, the tables should be replaced with the one you are loading here.
  • You may also need to append some information to an existing table. For this purpose, to_sql() function uses the argument if_exists. The possible usage of if_exists is tabulated below.
Argument usage Description
if_exists = 'fail' Default. The command doesn’t work if a table with the same name exists in the database.
if_exists = 'replace' The command replaces the existing table in the database with the same name.
if_exists = 'append' The command appends the new data to the existing table with the same name.
  • As you need to create a fresh table upon execution, add the following commands to the code.
  • The print command is optional, but helps identify the completion of the steps of code until this point.
df.to_sql(table_name, conn, if_exists = 'replace', index =False)
14
print('Table is ready')
Table is ready

Run Queries


Now that the data is uploaded to the table in the database, anyone with access to the database can retrieve this data by executing SQL queries.

  • Some basic SQL queries to test this data are SELECT queries for viewing data
  • COUNT query to count the number of entries

SQL queries can be executed on the data using the read_sql function in pandas.

  • Now, run the following tasks for data retrieval on the created database.

View Table

query_statement = f"SELECT * FROM {table_name}"
query_output = pd.read_sql(query_statement, conn)
print(query_statement)
SELECT * FROM INSTRUCTOR
print(query_output)
    ID    FNAME      LNAME      CITY CCODE
0    1      Rav      Ahuja   TORONTO    CA
1    2     Raul      Chong   Markham    CA
2    3     Hima  Vasudevan   Chicago    US
3    4     John     Thomas  Illinois    US
4    5    Alice      James  Illinois    US
5    6    Steve      Wells  Illinois    US
6    7  Santosh      Kumar  Illinois    US
7    8    Ahmed    Hussain  Illinois    US
8    9    Nancy      Allen  Illinois    US
9   10     Mary     Thomas  Illinois    US
10  11  Bharath      Gupta  Illinois    US
11  12   Andrea      Jones  Illinois    US
12  13      Ann      Jacob  Illinois    US
13  14     Amit      Kumar  NewDelhi    IN

View 1 column

query_statement = f"SELECT FNAME FROM {table_name}"
query_output = pd.read_sql(query_statement, conn)
print(query_statement)
SELECT FNAME FROM INSTRUCTOR
print(query_output)
      FNAME
0       Rav
1      Raul
2      Hima
3      John
4     Alice
5     Steve
6   Santosh
7     Ahmed
8     Nancy
9      Mary
10  Bharath
11   Andrea
12      Ann
13     Amit

Count Rows

query_statement = f"SELECT COUNT(*) FROM {table_name}"
query_output = pd.read_sql(query_statement, conn)
print(query_statement)
SELECT COUNT(*) FROM INSTRUCTOR
print(query_output)
   COUNT(*)
0        14

Append to Table

Now try appending some data to the table. Consider the following.
a. Assume the ID is 100.
b. Assume the first name, FNAME, is John.
c. Assume the last name as LNAMEDoe.
d. Assume the city of residence, CITY is Paris.
e. Assume the country code, CCODE is FR.

Create a new df to append

data_dict = {'ID' : [100],
            'FNAME' : ['John'],
            'LNAME' : ['Doe'],
            'CITY' : ['Paris'],
            'CCODE' : ['FR']}
data_append = pd.DataFrame(data_dict)

Append df to db


  • Now use the following statement to append the data to the INSTRUCTOR table.
data_append.to_sql(table_name, conn, if_exists = 'append', index =False)
1
print('Data appended successfully')
Data appended successfully

Count Again

  • To verify the new information, recount the rows
query_statement = f"SELECT COUNT(*) FROM {table_name}"
query_output = pd.read_sql(query_statement, conn)
print(query_statement)
SELECT COUNT(*) FROM INSTRUCTOR
print(query_output)
   COUNT(*)
0        15

Close Connection

conn.close()

Run in Shell

  • If you are using shell to run the python file, go to the directory where the script file is located and run
python3.11 db_code.py