Exploratory data analysis (EDA) is an approach to analyzing data sets that gives an overview of their main characteristics.
EDA is often visual, and visual methods are our focus today.
EDA is also useful for identifying outlying observations as part of the data cleaning process.
“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey
Data visualization is the creation and study of the visual representation of data.
R is one of many tools for visualizing data, and many approaches/systems exist within R for making data visualizations.
We will use the NC birth data to examine whether the response, birth weight, is related to the explanatory variables, gestational age and biological sex.
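# Read in birth data, treating the codes 99 and 9999 as missing
o_data <- read.csv("~/Documents/TEACHING/vitalstats/Yr1116Birth.csv",
                   na.strings = c("99", "9999"))
# Keep only rows with no missing values
birth_data <- na.omit(o_data)
# install.packages("tidyverse")  # run once if the package is not yet installed
library(tidyverse)
glimpse(birth_data)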
## Observations: 715,549
## Variables: 14
## $ YOB <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2...
## $ SEX <int> 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2...
## $ CORES <int> 78, 70, 11, 54, 25, 60, 13, 79, 10, 4, 49, 12, 1, 25, 2...
## $ CIGPN <int> 0, 0, 5, 0, 10, 0, 0, 0, 0, 0, 4, 0, 20, 20, 0, 0, 0, 0...
## $ CIGFN <int> 0, 0, 5, 0, 10, 0, 0, 0, 0, 0, 4, 0, 10, 0, 0, 0, 0, 0,...
## $ CIGSN <int> 0, 0, 5, 0, 10, 0, 0, 0, 0, 0, 4, 0, 10, 0, 0, 0, 0, 0,...
## $ CIGLN <int> 0, 0, 5, 0, 10, 0, 0, 0, 0, 0, 4, 0, 10, 0, 0, 0, 0, 0,...
## $ BWTG <int> 3062, 2977, 2549, 4309, 2835, 2837, 4032, 3590, 3090, 3...
## $ GEST <int> 37, 39, 38, 40, 38, 38, 39, 39, 41, 39, 38, 41, 39, 41,...
## $ PLUR <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ MAGE <int> 20, 23, 21, 36, 22, 20, 26, 25, 27, 28, 22, 23, 28, 20,...
## $ MRACER <int> 2, 2, 1, 0, 1, 1, 1, 1, 1, 1, 2, 8, 1, 1, 2, 1, 1, 1, 1...
## $ MHISP <fct> N, N, N, M, N, N, N, N, N, N, N, M, N, N, N, N, N, N, N...
## $ PARITY <int> 3, 1, 1, 3, 5, 1, 1, 2, 6, 6, 4, 3, 3, 1, 2, 1, 1, 3, 1...
table(birth_data$SEX)
##
## 1 2 9
## 365673 349866 10
table(birth_data$MRACER)
##
## 0 1 2 3 4 5 6 7 8
## 83807 418182 174188 10117 3329 576 93 2114 23143
Ok, I can make a very good guess at the coding for the sex variable based on knowledge of the birth ratio in the US, but maternal race is trickier. Ideally we would have more informative labels for these values in our data.
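The guess about the sex coding can be checked with a line of arithmetic. The sketch below re-enters the counts from the table above by hand (the vector name `sex_counts` is ours, not part of the data) and computes the ratio of code 1 to code 2. A ratio slightly above 1 is consistent with code 1 being male, since slightly more boys than girls are born in the US.

```r
# Counts copied from table(birth_data$SEX) above; code 9 is left out
sex_counts <- c("1" = 365673, "2" = 349866)

# Ratio of code 1 births to code 2 births
sex_ratio <- unname(sex_counts["1"] / sex_counts["2"])
round(sex_ratio, 3)

# Share of each code among births with a recorded sex
round(prop.table(sex_counts), 3)
```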
Value labels are useful to help us remember whether 1=boys and 2=girls or the opposite. Value labels are not necessary for continuous or count variables but can be quite helpful for keeping track of categorical data. Let’s add labels to sex, maternal race, and maternal Hispanic ethnicity (county is another good candidate, but NC has 100 counties, so I’ll save the typing for you!), using the vital records data dictionary as a key.
birth_data$SEX <- factor(birth_data$SEX, levels = c(1, 2, 9),
                         labels = c("Male", "Female", "Unspecified"))
birth_data$MRACER <- factor(birth_data$MRACER, levels = 0:8,
                            labels = c("Other", "White", "Black",
                                       "Ind. Amer",
                                       "Chinese", "Japanese", "Nat. HI",
                                       "Filipino", "Other As"))
birth_data$MHISP <- factor(birth_data$MHISP,
                           levels = c("C", "M", "N", "O", "P", "S", "U"),
                           labels = c("Cuban", "Mexican", "Non-Hispanic", "Other Hispanic",
                                      "Puerto Rican", "Central/South American", "Unknown"))
table(birth_data$SEX)
##
## Male Female Unspecified
## 365673 349866 10
table(birth_data$MRACER)
##
## Other White Black Ind. Amer Chinese Japanese Nat. HI
## 83807 418182 174188 10117 3329 576 93
## Filipino Other As
## 2114 23143
Much better!
ggplot +
geom_xxx
or, more precisely
ggplot(data = [dataset], mapping = aes(x = [x-variable], y = [y-variable])) +
geom_xxx() +
other options
ggplot(data = birth_data, mapping = aes(x = BWTG)) +
geom_histogram(binwidth = 200) + xlab("Birth weight (g)")
Make a prediction: What relationship do you expect to see between gestational age and birth weight?
ggplot(data = birth_data, mapping = aes(x = GEST, y = BWTG)) +
geom_point() + xlab("Gestational age (weeks)") + ylab("Birth weight (g)") +
ggtitle("NC Births, 2011-2016")
Because it will complicate all our plotting, we’re going to set the gestational age of the baby that is presumably NOT an elephant given its weight (elephants have the longest gestation of all mammals) to NA. Because this is a file of live births, we’re going to do the same to gestational periods less than 20 weeks and birth weights below 500 g. Ideally, we would use a criterion to flag unreasonable combinations of birth weight and gestational age as well (but we will save that for you, if you’d like!).
birth_data$GEST_C <- birth_data$GEST
birth_data$BWTG_C <- birth_data$BWTG
birth_data$GEST_C[birth_data$GEST_C > 50] <- NA
birth_data$GEST_C[birth_data$GEST_C < 20] <- NA
birth_data$BWTG_C[birth_data$BWTG_C < 500] <- NA
ggplot(data = birth_data, mapping = aes(x = GEST_C, y = BWTG_C)) +
geom_point() + xlab("Gestational age (weeks)") + ylab("Birth weight (g)") +
ggtitle("NC Births, 2011-2016")
## Warning: Removed 1703 rows containing missing values (geom_point).
Much better, though I still don’t believe them all!
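The row count in the warning is the number of rows where either cleaned variable is missing. A minimal sketch on made-up data shows how the recoding and that count fit together. The data frame `toy` and its values are invented for illustration; only the cutoffs of 50 weeks, 20 weeks and 500 g come from the code above.

```r
# Invented example rows, not taken from the birth file
toy <- data.frame(GEST = c(39, 83, 18, 40, 22),
                  BWTG = c(3200, 3100, 2900, 450, 600))

# Apply the same cutoffs as above, keeping the original columns intact
toy$GEST_C <- ifelse(toy$GEST > 50 | toy$GEST < 20, NA, toy$GEST)
toy$BWTG_C <- ifelse(toy$BWTG < 500, NA, toy$BWTG)

# Missing values per cleaned variable
colSums(is.na(toy[, c("GEST_C", "BWTG_C")]))

# Rows that a scatterplot of GEST_C against BWTG_C would drop
n_dropped <- sum(is.na(toy$GEST_C) | is.na(toy$BWTG_C))
n_dropped
```

Keeping `GEST` and `BWTG` alongside the cleaned copies makes it easy to go back and inspect the rows that were set to NA.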
Additional variables can be displayed with
aesthetics (like shape, colour, size), or
faceting (small multiples displaying different subsets)
Visual characteristics of plotting characters that can be mapped to data are
color
size
shape
alpha (transparency)
ggplot(data = birth_data, mapping = aes(x = GEST_C, y = BWTG_C, color=SEX)) +
geom_point() + xlab("Gestational age (weeks)") + ylab("Birth weight (g)") +
ggtitle("NC Births, 2011-2016")
That’s a bit hard to see with all the data points!
Smaller plots that display different subsets of the data
Useful for exploring conditional relationships and large data
ggplot(data = birth_data, mapping = aes(x = GEST_C, y = BWTG_C)) +
facet_grid(. ~ SEX) +
geom_point() + xlab("Gestational age (weeks)") + ylab("Birth weight (g)") +
ggtitle("NC Births, 2011-2016")
ggplot(data = birth_data, mapping = aes(x = GEST_C, y = BWTG_C)) +
geom_point() + geom_smooth() + xlab("Gestational age (weeks)") + ylab("Birth weight (g)") +
ggtitle("NC Births, 2011-2016")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Box plots are a good way to compare distributions across groups.
ggplot(data = birth_data, mapping = aes(y = BWTG_C, x = SEX)) +
geom_boxplot() + ylab("Birth weight (g)")
## Warning: Removed 1649 rows containing non-finite values (stat_boxplot).
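A box plot draws the median and quartiles of each group, and the same numbers can be tabulated directly. The sketch below uses simulated weights to show one way of doing this with base R's `tapply()`. The data frame `sim`, its group means and its standard deviation are invented and are not estimates from the NC data.

```r
set.seed(1)  # make the simulated values reproducible

# Simulated birth weights in grams for two groups; all numbers are invented
sim <- data.frame(
  SEX  = factor(rep(c("Male", "Female"), each = 200),
                levels = c("Male", "Female")),
  BWTG = c(rnorm(200, mean = 3350, sd = 500),
           rnorm(200, mean = 3250, sd = 500))
)

# Quartiles by group: the values a box plot draws as the box and its midline
quartiles <- tapply(sim$BWTG, sim$SEX, quantile,
                    probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
quartiles
```

With the real data, the same call on `birth_data$BWTG_C` and `birth_data$SEX` would give the numbers behind the plot above; `na.rm = TRUE` plays the role of the rows the warning reports as removed.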
Bar plots are a nice way to compare relative group sizes.
ggplot(data = birth_data, mapping = aes(x = MRACER)) +
geom_bar() + xlab("Maternal race")
ggplot(data = birth_data, mapping = aes(x = MRACER)) +
geom_bar() + coord_flip() + xlab("Maternal race")
ggplot(data = birth_data, mapping = aes(x = MRACER, fill = MHISP)) +
geom_bar() + xlab("Maternal race")
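The bar heights here are raw counts. When relative sizes are the point, proportions can be easier to read. As a sketch, the counts from the maternal race table above are re-entered by hand below (the vector name `race_counts` is ours) and converted with `prop.table()`:

```r
# Counts copied from table(birth_data$MRACER) above
race_counts <- c(Other = 83807, White = 418182, Black = 174188,
                 "Ind. Amer" = 10117, Chinese = 3329, Japanese = 576,
                 "Nat. HI" = 93, Filipino = 2114, "Other As" = 23143)

# Share of all births in each category, as percentages
race_pct <- 100 * prop.table(race_counts)
round(sort(race_pct, decreasing = TRUE), 1)
```

With the full data frame loaded, `prop.table(table(birth_data$MRACER))` gives the same shares without retyping the counts.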
# Keep births from 2016
birth2016 <- subset(birth_data, YOB == 2016)
# The CORES code for Durham County is 32
durhamb16 <- subset(birth2016, CORES == 32)
+ geom_point(color = "blue")  # try other colors!
+ geom_point(alpha = 1/2)     # try fractions down to 1/10: helpful?
+ geom_point(shape = 3)       # try options!
+ geom_point(size = 2)        # try options, fractions too
Explore various combinations to improve this graphic, also recalling the Weight by age by gender slide (with the color option in the mapping argument).
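As one possible starting point, the sketch below combines a color mapping with fixed transparency and point size. It runs on simulated data so that it is self-contained: the data frame `sim` and the numbers that generate it are invented. For the Durham plot, `durhamb16` would take the place of `sim`.

```r
library(ggplot2)

set.seed(2)  # make the simulated values reproducible

# Simulated gestational ages and weights; the generating numbers are invented
n <- 300
sim <- data.frame(
  GEST_C = round(runif(n, min = 30, max = 42)),
  SEX    = factor(sample(c("Male", "Female"), n, replace = TRUE))
)
sim$BWTG_C <- 3300 + 150 * (sim$GEST_C - 39) + rnorm(n, sd = 400)

# color is mapped to data inside aes(); alpha and size are fixed constants
# set outside aes(), so they apply to every point
p <- ggplot(data = sim, mapping = aes(x = GEST_C, y = BWTG_C, color = SEX)) +
  geom_point(alpha = 1/3, size = 1.5) +
  xlab("Gestational age (weeks)") + ylab("Birth weight (g)") +
  ggtitle("Simulated births")
p
```

The distinction to keep in mind is that anything inside `aes()` varies with the data, while arguments passed directly to `geom_point()` are constants.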