library(tidyverse)
EPA - EDA 3
Packages
Data
We’ll be using another version of the EPA PM2.5 data used in the the previous EPA - EDA pages. This version has been modified and saved locally in several csv files.
- pm0 contains data from 1999
- pm1 contains data from2012
- cnames.txt is a variable containing list of characters for column names
- wcol_df is a df holds the indices of the 5 columns we
- site0 and site1 are for sensors in the NY state area for 1999 and 2012 respectively described as County.Code and Site.ID concatenated together with “.” as seperator
Save csv & txt files
List of the files and how they were saved
if(!file.exists("D:/yourdataiq/dataiq/datasets/pm0.csv"))
write_csv(pm0,"D:/yourdataiq/dataiq/datasets/pm0.csv")}
{if(!file.exists("D:/yourdataiq/dataiq/datasets/pm1.csv"))
write_csv(pm1,"D:/yourdataiq/dataiq/datasets/pm1.csv")}
{if(!file.exists("D:/yourdataiq/dataiq/datasets/cnames.csv"))
writeLines(cnames,"D:/yourdataiq/dataiq/datasets/cnames.txt")}
{if(!file.exists("D:/yourdataiq/dataiq/datasets/wcol.csv"))
write_csv(wcol_df,"D:/yourdataiq/dataiq/datasets/wcol.csv")}
{if(!file.exists("D:/yourdataiq/dataiq/datasets/site0.csv"))
writeLines(site0,"D:/yourdataiq/dataiq/datasets/site0.txt")}
{if(!file.exists("D:/yourdataiq/dataiq/datasets/site1.csv"))
writeLines(site1,"D:/yourdataiq/dataiq/datasets/site1.txt")} {
Load
<- read.csv("D:/yourdataiq/dataiq/datasets/pm0.csv")
pm0 <- read.csv("D:/yourdataiq/dataiq/datasets/pm1.csv")
pm1 <- readLines("D:/yourdataiq/dataiq/datasets/cnames.txt")
cnames <- read.csv("D:/yourdataiq/dataiq/datasets/wcol.csv")
wcol_df <- readLines("D:/yourdataiq/dataiq/datasets/site0.txt")
site0_ <- readLines("D:/yourdataiq/dataiq/datasets/site1.txt") site1_
Tidy
Case Study
We were given an index of the columns we will use in wcol
We were given a complete list of column names in cnames
So we need to extract the names from cnames corresponding to the wcol indices and assign to the column names of pm0
Let’s look at some of the data and cnames
dim(pm0)
[1] 117421 5
head(pm0)
V1 V2 V3 V4 V5
1 1 27 1 19990103 NA
2 1 27 1 19990106 NA
3 1 27 1 19990109 NA
4 1 27 1 19990112 8.841
5 1 27 1 19990115 14.920
6 1 27 1 19990118 3.878
make.names
- We were given a df of the columns we will use wcol
- We were given column names in cnames
- So we need to extract the names from cnames for only the wcol columns and assign to the names of the df pm0
# is a list of 1 : chr cnames
[1] "# RD|Action Code|State Code|County Code|Site ID|Parameter|POC|Sample Duration|Unit|Method|Date|Start Time|Sample Value|Null Data Code|Sampling Frequency|Monitor Protocol (MP) ID|Qualifier - 1|Qualifier - 2|Qualifier - 3|Qualifier - 4|Qualifier - 5|Qualifier - 6|Qualifier - 7|Qualifier - 8|Qualifier - 9|Qualifier - 10|Alternate Method Detectable Limit|Uncertainty"
strsplit
- It appears the column names are separated by |, so let’s split the vector by | and get the column names only. With
fixed=TRUE
we are asking to match | exactly. - Since we had to convert cnames to df to save it, we can no longer use the following code because cnames is now a df.
- So in the next line of code we convert the df to string then apply strsplit to it
<- strsplit(cnames,"|", fixed = TRUE)
cnames_stripped cnames_stripped
[[1]]
[1] "# RD" "Action Code"
[3] "State Code" "County Code"
[5] "Site ID" "Parameter"
[7] "POC" "Sample Duration"
[9] "Unit" "Method"
[11] "Date" "Start Time"
[13] "Sample Value" "Null Data Code"
[15] "Sampling Frequency" "Monitor Protocol (MP) ID"
[17] "Qualifier - 1" "Qualifier - 2"
[19] "Qualifier - 3" "Qualifier - 4"
[21] "Qualifier - 5" "Qualifier - 6"
[23] "Qualifier - 7" "Qualifier - 8"
[25] "Qualifier - 9" "Qualifier - 10"
[27] "Alternate Method Detectable Limit" "Uncertainty"
Now that we have the column names as we want. Let’s look at which columns we need, so let’s look at wcol
wcol_df
wcol
1 3
2 4
3 5
4 11
5 13
convert wcol_df to vector
<- wcol_df[[1]]
wcol_vector wcol_vector
[1] 3 4 5 11 13
make.names 1
Make syntactically valid names out of character vectors.
Syntax
make.names(names, unique = FALSE, allow_ = TRUE)
names |
character vector to be coerced to syntactically valid names. This is coerced to character if necessary. |
unique |
logical; if TRUE , the resulting elements are unique. This may be desired for, e.g., column names. |
allow_ |
logical. For compatibility with R prior to 1.9.0. |
Let’s build this in two pieces:
- The code below will take the list of cnames_stripped and extract the wcol positions which are (3,4,5,11,13) and leaves us with a list of names for the desired columns
- The second block of code assigns the list to the colnames of the df: pm0_w_names
- So let’s copy pm0 to pm0_wnames first, to preserve the original df, and then perform our code
<- pm0
pm0_wnames head(pm0_wnames)
V1 V2 V3 V4 V5
1 1 27 1 19990103 NA
2 1 27 1 19990106 NA
3 1 27 1 19990109 NA
4 1 27 1 19990112 8.841
5 1 27 1 19990115 14.920
6 1 27 1 19990118 3.878
1]][wcol_vector] cnames_stripped[[
[1] "State Code" "County Code" "Site ID" "Date" "Sample Value"
make.names
names(pm0_wnames) <- make.names(cnames_stripped[[1]][wcol_vector])
head(pm0_wnames)
State.Code County.Code Site.ID Date Sample.Value
1 1 27 1 19990103 NA
2 1 27 1 19990106 NA
3 1 27 1 19990109 NA
4 1 27 1 19990112 8.841
5 1 27 1 19990115 14.920
6 1 27 1 19990118 3.878
make names 2
Let’s do it without using make.names.
Let’s add colnames to the pm0 and save it in named_pm0
<- pm0
named_pm0 names(named_pm0) <- cnames_stripped[[1]][wcol_vector]
head(named_pm0)
State Code County Code Site ID Date Sample Value
1 1 27 1 19990103 NA
2 1 27 1 19990106 NA
3 1 27 1 19990109 NA
4 1 27 1 19990112 8.841
5 1 27 1 19990115 14.920
6 1 27 1 19990118 3.878
NOTE
Do you notice the difference between named_pm0 and pm0_wnames? The latter was created with make.names which creates Systactically Valid Names (meaning: A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as “.2way” are not valid, and neither are the reserved words.)
So notice the “.” in the column names.
make.names pm1
Let’s repeat the steps we used for pm0 but for pm1 instead and we end up with pm1_wnames
<- pm1
pm1_wnames head(pm1_wnames)
V1 V2 V3 V4 V5
1 1 3 10 20120101 6.7
2 1 3 10 20120104 9.0
3 1 3 10 20120107 6.5
4 1 3 10 20120110 7.0
5 1 3 10 20120113 5.8
6 1 3 10 20120116 8.0
names(pm1_wnames) <- make.names(cnames_stripped[[1]][wcol_vector])
head(pm1_wnames)
State.Code County.Code Site.ID Date Sample.Value
1 1 3 10 20120101 6.7
2 1 3 10 20120104 9.0
3 1 3 10 20120107 6.5
4 1 3 10 20120110 7.0
5 1 3 10 20120113 5.8
6 1 3 10 20120116 8.0
EDA
2008 data
General Research
Column to vector
Assign Sample.Value column to a numeric vector with 117421 length and with NAs
<- pm0_wnames$Sample.Value
x0 str(x0)
num [1:117421] NA NA NA 8.84 14.92 ...
Percent NAs
mean(is.na(x0)) # a little over 11%
[1] 0.1125608
2012 data
General Research
Column to vector
<- pm1_wnames$Sample.Value
x1 str(x1)
num [1:1304287] 6.7 9 6.5 7 5.8 8 7.9 8 6 9.6 ...
Percent NAs
mean(is.na(x1)) # a little over 5%
[1] 0.05607125
Summary
summary(x0)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 7.20 11.50 13.74 17.90 157.10 13217
summary(x1)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-10.00 4.00 7.63 9.14 12.00 908.97 73133
Observations:
- Both mean and median decreased
- NA’s increased tremendously
- Negative values appeared in the data?
Negative values
- Assign negative values to a vector
- Count the occurences out of the 1.3+Million observations
- Of course we can calculate the % but let R do it using the mean once again: 2.1%
<- x1<0
negative sum(negative, na.rm = TRUE)
[1] 26474
mean(negative, na.rm = TRUE)
[1] 0.0215034
head(negative)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
- We can ignore the negative values, but let’s dig into it a bit more, maybe there are certain months the negative values occur?
- Let’s look at the Date column
Integer to date
str(pm1_wnames$Date)
int [1:1304287] 20120101 20120104 20120107 20120110 20120113 20120116 20120119 20120122 20120125 20120128 ...
$Date <- as.Date(as.character(pm0_wnames$Date), "%Y%m%d")
pm0_wnames$Date <- as.Date(as.character(pm1_wnames$Date), "%Y%m%d")
pm1_wnamesstr(pm0_wnames$Date)
Date[1:117421], format: "1999-01-03" "1999-01-06" "1999-01-09" "1999-01-12" "1999-01-15" ...
str(pm1_wnames$Date)
Date[1:1304287], format: "2012-01-01" "2012-01-04" "2012-01-07" "2012-01-10" "2012-01-13" ...
histogram of negatives
hist(pm1_wnames$Date[negative],"day",main="Negative readings per day")
hist(pm1_wnames$Date[negative],"month",main="Negative readings per month")
Common sensors
Let’s find common sensors between 1999 and 2012 and compare the values to see what we can find. - As noted from the Data section: site0 is for 1999 NY state sensors - site1 is for 2012 NY state sensors - Let’s look at the makeup of the data with str()
str(site0_)
chr [1:33] "1.5" "1.12" "5.73" "5.80" "5.83" "5.110" "13.11" "27.1004" ...
str(site1_)
chr [1:18] "1.5" "1.12" "5.80" "5.133" "13.11" "29.5" "31.3" "47.122" ...
Intersect
- Let’s find the intersect of both 1999 and 2012
- Data shows 10 monitors are still in use in 2012
<- intersect(site0_,site1_)
both both
[1] "1.5" "1.12" "5.80" "13.11" "29.5" "31.3" "63.2008"
[8] "67.1015" "85.55" "101.3"
Paste two columns
Let’s concatenate both County.Code and Site.ID columns into one: we call it county.site, with the values being separated by “.” just as the column name is
library(dplyr)
<- pm0_wnames |>
pm0_wnames mutate(county.site = paste(County.Code,Site.ID,sep = "."))
<- pm1_wnames |>
pm1_wnames mutate(county.site = paste(County.Code,Site.ID,sep = "."))
Extract values
Let’s extract the observations for those sensors that are in common for both time dfs - remember State.code for NY is == 36 and - the common sensors are in both
<- subset(pm0_wnames, State.Code == 36 & county.site %in% both)
cnt0 <- subset(pm1_wnames, State.Code == 36 & county.site %in% both)
cnt1 head(cnt0)
State.Code County.Code Site.ID Date Sample.Value county.site
65873 36 1 5 1999-07-02 NA 1.5
65874 36 1 5 1999-07-05 NA 1.5
65875 36 1 5 1999-07-08 NA 1.5
65876 36 1 5 1999-07-11 NA 1.5
65877 36 1 5 1999-07-14 11.8 1.5
65878 36 1 5 1999-07-17 49.4 1.5
head(cnt1)
State.Code County.Code Site.ID Date Sample.Value county.site
835337 36 1 5 2012-01-01 7.04 1.5
835338 36 1 5 2012-01-02 2.08 1.5
835339 36 1 5 2012-01-03 3.58 1.5
835340 36 1 5 2012-01-04 6.45 1.5
835341 36 1 5 2012-01-07 17.29 1.5
835342 36 1 5 2012-01-10 NA 1.5
Compare readings
sapply to group and count
Let’s use sapply to - split the data by groups of county.site - count the number of rows/observations at each sensor site
sapply(split(cnt0,cnt0$county.site), nrow)
1.12 1.5 101.3 13.11 29.5 31.3 5.80 63.2008 67.1015 85.55
61 122 152 61 61 183 61 122 122 7
sapply(split(cnt1,cnt1$county.site), nrow)
1.12 1.5 101.3 13.11 29.5 31.3 5.80 63.2008 67.1015 85.55
31 64 31 31 33 15 31 30 31 31
Let’s choose sensor sites with a large number of observations so our comparison would have more merit. from what we see above we could choose county.site 63.2008
Extract one sensor site
Let’s subset for that sensor location from both 1999 and 2012
<- subset(cnt0, county.site == "63.2008")
pmosub <- subset(cnt1, county.site == "63.2008") pm1sub
Plot side by side
- Note the ranges of the plots are not equivalent.
par(mfrow=c(1,2), max=c(4,4,2,1))
Warning in par(mfrow = c(1, 2), max = c(4, 4, 2, 1)): "max" is not a graphical
parameter
with(pmosub,plot(Date, Sample.Value, pch=20))
with(pmosub,abline(h=median(Sample.Value, na.rm=TRUE), lwd=2))
# repeat for the second plot of 2012 data
with(pm1sub,plot(Date, Sample.Value, pch=20))
with(pm1sub,abline(h=median(Sample.Value, na.rm=TRUE), lwd=2))
Calculate the ranges
As you can see from below that range0 includes range1 so it’s obvious the total range will be equal to range0
<- range(pmosub$Sample.Value,na.rm=TRUE)
range0 range0
[1] 3.0 40.1
<- range(pm1sub$Sample.Value,na.rm = TRUE)
range1 range1
[1] 3.25 14.00
<- range(range0,range1)
range range
[1] 3.0 40.1
Set ylim
Let’s set the range to the ylim so we can have a more even comparison.
par(mfrow=c(1,2), max=c(4,4,2,1))
Warning in par(mfrow = c(1, 2), max = c(4, 4, 2, 1)): "max" is not a graphical
parameter
with(pmosub,plot(Date, Sample.Value, pch=20, ylim=range))
with(pmosub,abline(h=median(Sample.Value, na.rm=TRUE), lwd=2))
# repeat for the second plot of 2012 data
with(pm1sub,plot(Date, Sample.Value, pch=20, ylim=range))
with(pm1sub,abline(h=median(Sample.Value, na.rm=TRUE), lwd=2))
Average by State
We need to calculate the mean/average pollution levels by state, this could be done with group_by, aggregate… we’ll do it with tapply
- We’ll tapply a function “mean” to
- The Sample.Value for each state separately, basically we are calculating the mean of Sample.Value for each state as a group
- So that the output will be a list of numbers for each state’s mean, and a list of “char” for the names of each state
tapply mean/state
<- with(pm0_wnames,tapply(Sample.Value,State.Code, mean, na.rm=TRUE))
mn0 str(mn0)
num [1:53(1d)] 19.96 6.67 10.8 15.68 17.66 ...
- attr(*, "dimnames")=List of 1
..$ : chr [1:53] "1" "2" "4" "5" ...
head(mn0)
1 2 4 5 6 8
19.956391 6.665929 10.795547 15.676067 17.655412 7.533304
<- with(pm1_wnames,tapply(Sample.Value,State.Code, mean, na.rm=TRUE))
mn1 str(mn1)
num [1:52(1d)] 10.13 4.75 8.61 10.56 9.28 ...
- attr(*, "dimnames")=List of 1
..$ : chr [1:52] "1" "2" "4" "5" ...
head(mn1)
1 2 4 5 6 8
10.126190 4.750389 8.609956 10.563636 9.277373 4.117144
Summary
We see from the summaries that all 6 columns decreased since 1999
summary(mn0)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.862 9.519 12.315 12.406 15.640 19.956
summary(mn1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.006 7.355 8.729 8.759 10.613 11.992
Create data.frame
Let’s separate the “char” from mn0 into the names of the states, and the integer list into their means and assign each to a new column in a new dataframe for both years. d1 has 52 entries, and d0 has 53
<- data.frame(state=names(mn0),mean=mn0)
d0 <- data.frame(state=names(mn1), mean=mn1)
d1 head(d1)
state mean
1 1 10.126190
2 2 4.750389
4 4 8.609956
5 5 10.563636
6 6 9.277373
8 8 4.117144
Merge dfs
Let’s merge the two by State
<- merge(d0,d1, by="state")
mrg dim(mrg)
[1] 52 3
head(mrg)
state mean.x mean.y
1 1 19.956391 10.126190
2 10 14.492895 11.236059
3 11 15.786507 11.991697
4 12 11.137139 8.239690
5 13 19.943240 11.321364
6 15 4.861821 8.749336
Same plot
We’ll be plotting both on the same plot so we’ll use plot() once and follow it with points()
Plot with rep()
- The first argument is rep(1,52). This tells the plot routine that the x coordinates for all 52 points are 1.
- The second argument are the Y coordinates which is the second column of mrg or mrg[,2] which holds the 1999 data.
- The third argument is the range of x values we want, namely xlim set to c(.5,2.5).
- This works since we’ll be plotting 2 columns of points, one at x=1 and the # other at x=2.
- Plot the second plot with minor changes one being: we start with rep(2,52)
- And the data will be the third column for the year 2012 mrg[, 3]
- No need to repeat xlim since it is set already
with(mrg, plot(rep(1,52),mrg[ ,2], xlim= c(.5,2.5)))
with(mrg, points(rep(2,52),mrg[ ,3]))
Connect Plots
Segments
We want to connect each state point from the 1999 plot with its corresponding point in the 2012 plot
We’ll use the function segments() with these arguments: - First 2 arguments are for the x and y of the left plot - Second pair of arguments is for the x and y of the right plot - As you can see the majority have a down sloping line except for a couple visible ones
with(mrg, plot(rep(1,52),mrg[ ,2], xlim= c(.5,2.5)))
with(mrg, points(rep(2,52),mrg[ ,3]))
segments(rep(1,52), mrg[ ,2], rep(2,52), mrg[ ,3])
Extract higher means
The easiest way to extract the means that were higher in 2012 (or up sloping lines in plot), and we find that only 4 states had an increase in average emissions.
<- mrg[mrg$mean.x < mrg$mean.y, ]
wentup wentup
state mean.x mean.y
6 15 4.861821 8.749336
23 31 9.167770 9.207489
27 35 6.511285 8.089755
33 40 10.657617 10.849870
Information
The sections below were not used but included as reference and additional information:
Vector to df
frame and wcol are two vectors, in order to save them with write_csv I will convert to df first then save
<- data.frame(wcol) wcol_df
df to vector of strings
<- toString(cnames) cnames_st cnames_st
convert wcol_df to string as well
This is not needed for this process, included for information only.
<- toString(wcol_df) wcol_st wcol_st
convert df to string
This is not needed for this process, included for information only.
<- strsplit(as.character(cnames_st),"|", fixed=TRUE) cnames_st cnames_st