Retrieve Daily Weather - ETL

You’ve been tasked by your team to create an automated Extract, Transform, Load (ETL) process to extract daily weather forecast and observed weather data and load it into a live report to be used for further analysis by the analytics team. As part of a larger prediction modelling project, the team wants to use the report to monitor and measure the historical accuracy of temperature forecasts by source and station.

As a proof-of-concept (POC), you are only required to do this for a single station and one source to begin with. For each day at noon (local time), you will gather both the actual temperature and the temperature forecasted for noon on the following day for Casablanca, Morocco.

At a later stage, the team anticipates extending the report to include lists of locations, different forecasting sources, different update frequencies, and other weather metrics such as wind speed and direction, precipitation, and visibility.

Data

We’ll use the weather data package provided by the open source project wttr.in, a web service that provides weather forecast information in a simple and text-based format. For further information, you can read more about the service on its GitHub Repo.

  • First, you’ll use the curl command to scrape weather data via the wttr.in website. For example, to get data for Casablanca
  • You can actually go to their website wttr.in and if you don’t specify a location it will provide your local weather forecast, in the provided link I specifically targeted Casablanca
curl wttr.in/casablanca
  • which prints the following to stdout

Tasks

  • Download raw weather data
  • Extract data of interest from the raw data
  • Transform the data as required
  • Load the data into a log file using a tabular format
  • Schedule the entire process to run automatically at a set time daily

Weather reporting tasks

You must extract and store the following data every day at noon, local time, for Casablanca, Morocco:

  • The actual temperature (in degrees Celsius)
  • The forecasted temperature (in degrees Celsius) for the following day at noon

Here is an example of what the the resulting weather report should look like:

year month day obs_tmp fc_temp
2023 1 1 10 11
2023 1 2 11 12
2023 1 3 12 10
2023 1 4 13 13
2023 1 5 10 9
2023 1 6 11 10

Setup Output File


Create text file

This will be your POC weather report log file, simply a text file which contains a growing history of the daily weather data you will scrape.

Each entry in the log file corresponds to a row as in Table 1.

$ touch rx_poc.log

Add Header to Report

Your header should consist of the column names from Table 1, delimited by tabs.

  • Write the header to your weather report.
  • Using variables in such case makes for much cleaner code, which is easier to understand and safer to modify by others or even yourself at a later date.
  • echo/print the header to the file
$ header=$(echo -e "year\tmonth\tday\thour\tobs_tmp\tfc_temp")
$ echo $header>rx_poc.log

# test the file header
$ cat rx_poc.log
year    month   day     hour    obs_tmp fc_temp

# check permissions
$ ls -l rx_poc.log
-rw-r--r-- 1 theia users 36 Sep 26 10:36 rx_poc.log
  • Can also use a redirection
echo -e "year\tmonth\tday\thour\tobs_tmp\tfc_temp">rx_poc.log

Write Script File


Create a BASH Script file

  • Make this an executable filecalled rx_poc.sh
$ touch rx_poc.sh
# check permissions
$ ls -l rx_poc.sh
-rw-r--r-- 1 theia users 0 Sep 26 10:39 rx_poc.sh

Set Permissions

chmod u+x rx_poc.sh
# check permissions
$ ls -l rx_poc.sh
-rwxr--r-- 1 theia users 0 Sep 26 10:39 rx_poc.sh
  • make it a bash script (with the .sh)
#! /bin/bash

Download Weather Data


Tip: It’s good practice to keep a copy of the raw data you are using to extract the data you need.

  • By appending a date or time stamp to the file name, you can ensure it’s name is unique.
  • This builds a history of the weather forecasts which you can revisit at any time to recover from errors or expand the scope of your reports
  • Using the prescribed date format ensures that when you sort the files, they will be sorted chronologically. It also enables searching for the report for any given date.
  • If needed, you can compress and archive the files periodically. You can even automate this process by scheduling it.

Setup File

Download and save your report as a datestamped file named raw_data_<YYYMMDD>

# Setup file to download weather data for today into raw_data_<timestamp>
today=$(date +%Y%m%d)
# gives today the value of
$ echo $today
20240926

weather_report=raw_data_$today
# this sets the vale at
$ echo $weather_report
raw_data_20240926
# So this is our filename for today

Download Casablanca Data

  • Download the wttr.in weather report for Casablanca and save it to raw_data_<timestamp> above
# Download Casablanca data from wttr.in into the raw_data_<timestamp> file
city=Casablanca
curl wttr.in/$city --output $weather_report

# CONFIRMATION
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--   100  9087  100  9087    0     0   8500      0  0:00:01  0:00:01 --:--:--  8100  9087  100  9087    0     0   8500      0  0:00:01  0:00:01 --:--:--  8500

Extract Data


  • Before extracting the data let’s look at it
    • remember our file name above
$ cat raw_data_20240926

  • Extract the required data from the raw file and assign them to variables obs_tmp and fc_temp the two columns in the table
  • Extracting the required data is a process that will take some trial and error until you get it right.
    Study the weather report you downloaded, and determine what you need to extract. Look for patterns.
    You must find a way to ‘chip away’ at the weather report:
    • Use shell commands to extract only the data you need (the signal)
    • Filter everything else out (the noise)
    • Combine your filters into a pipeline (recall the use of pipes to chain filters together)

tr

Trim repeated characters to a single character.

$ echo "There are    too    many spaces in this    sentence." | tr -s " "
There are too many spaces in this sentence.

xargs

xargs can be used to trim leading and trailing spaces from a string

$ echo " Never start or end a sentence with a space. " | xargs
Never start or end a sentence with a space.

rev

reverse the order of characters on a line of text

$ echo ".sdrawkcab saw ecnetnes sihT" | rev
This sentence was backwards.

cut

rev, and xargs are very useful when combined with the cut comand

# print the last field of the string
$ echo "three two one" | rev | cut -d " " -f 1 | rev
one

# Unfortunately, this prints the last field of the string, which is empty:
$ echo "three two one " | rev | cut -d " " -f 1 | rev

# But if you trim the trailing space first, you get the expected result:
$ echo "three two one " | xargs | rev | cut -d " " -f 1 | rev
one

Extract Data

  • Extract only those lines that contain temperatures from the weather report, and write your result to file.
  • Use grep and redirect the result to file.
  • Extract current temp and store it in a shell variable obs_tmp
    • which line is the current temp on
    • Is there a character you can use as a delimiter to appropriately parse the line into fields?
    • How about any leading or trailing white spaces?
  • Extract tomorrow’s forecast for noon, and store it in sell variable fc_tmp
    • Follow the pipeline we just created for temp
# Extract the data from the raw data file
grep °C $weather_report > temperatures.txt

# Extract current temp and store it in sell var obs_tmp
obs_tmp=$(head -1 temperatures.txt | tr -s " " | xargs | rev | cut -d " " -f2 | rev)

# Extract tomorrow's forecast for noon and store in fc_tmp
fc_temp=$(head -3 temperatures.txt | tail -1 | tr -s " " | xargs | cut -d "C" -f2 | rev | cut -d " " -f2 | rev)

Store Timestamp

  • Store the current hour, day, month, and year in variable
  • Don’t set the time to 12 for noon because it is not the local time for the city we are extracting data for
  • Use command substitution and the date command with the correct formatting options.
    The time zone for Casablanca happens to be UTC+1.
  • To get the local time for Casablanca, you can set the time-zone environment variable, TZ, as follows:
  • TZ='Morocco/Casablanca'
# Store timestamp in its variable
hour=$(TZ='Morocco/Casablanca' date -u +%H) 
day=$(TZ='Morocco/Casablanca' date -u +%d) 
month=$(TZ='Morocco/Casablanca' date +%m)
year=$(TZ='Morocco/Casablanca' date +%Y)

Merge Fields into Tab Record

  • Merge the fields into a tab-delimited record, corresponding to a single row in Table 1.
  • Append the resulting record as a row of data to your weather log file.
  • create a tab-delimited record
  • recall the header was created as follows:
    • header=$(echo -e “year\tmonth\tday\thour_UTC\tobs_tmp\tfc_temp”)
    • echo $header>rx_poc.log
# Append the resulting record as a row into our log file
record=$(echo -e "$year\t$month\t$day\t$hour\t$obs_tmp\t$fc_temp")
echo $record>>rx_poc.log

# make sure you save the script file - shown at end of page

Schedule Task


  • Remember that we wanted to load the data at noon local time for subject city
  • Check the time difference between your system’s default time and UTC
  • Calculate the difference
# If date is my system time then date -u will be my date - UTC and we see a difference ofUTC+5, 5 hours ahead of mine
$ date
Mon Feb 13 11:28:12 EST 2023
$ date -u
Mon Feb 13 16:28:16 UTC 2023

# we know from info given that Casablanca is UTC+1 so that makes it 5-1=4 hours before me

Edit Crontab

crontab -e
0 8 * * * /home/project/rx_poc.sh

# save the file and exit editor

Script File


#! /bin/bash


# Create file to download weather data for today into raw_data_<timestamp>
today=$(date +%Y%m%d)
weather_report=raw_data_$today

# Download Casablanca data from wttr.in into the raw_data_<timestamp> file
city=Casablanca
curl wttr.in/$city --output $weather_report

# Extract the data from the raw data file
grep °C $weather_report > temperatures.txt

# Extract current temp and store it in sell var obs_tmp
obs_tmp=$(head -1 temperatures.txt | tr -s " " | xargs | rev | cut -d " " -f2 | rev)

# Extract tomorrow's forecast for noon and store in fc_tmp
fc_temp=$(head -3 temperatures.txt | tail -1 | tr -s " " | xargs | cut -d "C" -f2 | rev | cut -d " " -f2 | rev)

# Store timestamp in its variable
hour=$(TZ='Morocco/Casablanca' date -u +%H) 
day=$(TZ='Morocco/Casablanca' date -u +%d) 
month=$(TZ='Morocco/Casablanca' date +%m)
year=$(TZ='Morocco/Casablanca' date +%Y)

# Append the resulting record as a row into our log file
record=$(echo -e "$year\t$month\t$day\t$hour\t$obs_tmp\t$fc_temp")
echo $record>>rx_poc.log

Local Drive Script


This section will go through the same process but done on the local drive instead of the cloud as explained above

Change Directories

# ___ change directories
PS C:\..> cd D:\data

# ___ create directory
PS D:\data> mkdir Linux_projects
PS D:\data> cd Linux_projects

# ___ create file to contain output
PS D:\data\Linux_projects> New-Item rx_poc.log
# ___ add header to file (column names)