Birth Weight Exploratory Analysis

What is EDA?

Exploratory data analysis (EDA) is an approach to analyzing data sets to provide an overview of data characteristics
Often, EDA is visual, which will be our focus today
EDA can also be useful in identifying outlying observations as part ofo the data cleaning process

Data visualization

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey

Data visualization is the creation and study of the visual representation of data.
R is one of many tools for visualizing data , and many approaches/systems exist within R for making data visualizations

Read birth weight data into R

We will use the NC birth data, and measure whether there is a relationship between the response, birth weight, and the explanatory variables, gestational age and biological sex.

Check for NAs in data - appear as 99, 9999
For now, remove the rows with missing values
Note these data have not been carefully cleaned but represent values on file in NC birth registry

#Read in birth data
o_data <- read.csv("~/Documents/TEACHING/vitalstats/Yr1116Birth.csv", 
  na.strings=c("99", "9999"))

birth_data <- na.omit(o_data)

Load Tidyverse

Many R functions have been made much easier to use by the Tidyverse package, which we will now load. If this is your first time using the package, type

install.packages("tidyverse")

library(tidyverse)

Peek at the Data

glimpse(birth_data)

## Observations: 715,549
## Variables: 14
## $ YOB    <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2...
## $ SEX    <int> 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2...
## $ CORES  <int> 78, 70, 11, 54, 25, 60, 13, 79, 10, 4, 49, 12, 1, 25, 2...
## $ CIGPN  <int> 0, 0, 5, 0, 10, 0, 0, 0, 0, 0, 4, 0, 20, 20, 0, 0, 0, 0...
## $ CIGFN  <int> 0, 0, 5, 0, 10, 0, 0, 0, 0, 0, 4, 0, 10, 0, 0, 0, 0, 0,...
## $ CIGSN  <int> 0, 0, 5, 0, 10, 0, 0, 0, 0, 0, 4, 0, 10, 0, 0, 0, 0, 0,...
## $ CIGLN  <int> 0, 0, 5, 0, 10, 0, 0, 0, 0, 0, 4, 0, 10, 0, 0, 0, 0, 0,...
## $ BWTG   <int> 3062, 2977, 2549, 4309, 2835, 2837, 4032, 3590, 3090, 3...
## $ GEST   <int> 37, 39, 38, 40, 38, 38, 39, 39, 41, 39, 38, 41, 39, 41,...
## $ PLUR   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ MAGE   <int> 20, 23, 21, 36, 22, 20, 26, 25, 27, 28, 22, 23, 28, 20,...
## $ MRACER <int> 2, 2, 1, 0, 1, 1, 1, 1, 1, 1, 2, 8, 1, 1, 2, 1, 1, 1, 1...
## $ MHISP  <fct> N, N, N, M, N, N, N, N, N, N, N, M, N, N, N, N, N, N, N...
## $ PARITY <int> 3, 1, 1, 3, 5, 1, 1, 2, 6, 6, 4, 3, 3, 1, 2, 1, 1, 3, 1...

How many birth records are in the data?
How many variables accompany each record?
How are the variables coded?

Coding Woes

table(birth_data$SEX)

## 
##      1      2      9 
## 365673 349866     10

table(birth_data$MRACER)

## 
##      0      1      2      3      4      5      6      7      8 
##  83807 418182 174188  10117   3329    576     93   2114  23143

Ok, I can make a very good guess at the coding for the sex variable based on knowledge of the birth ratio in the US, but maternal race is trickier. Ideally we would have more informative labels for these values in our data.

Value Labels

Value labels are useful to help us remember whether 1=boys and 2=girls or the opposite. Value labels are not necessary for continuous or count variables but can be quite helpful for keeping track of categorical data. Let’s add labels to sex, maternal race, and maternal Hispanic ethnicity (county is another good candidate, but NC has 100 counties, so I’ll save the typing for you!), using the vital records data dictionary as a key.

birth_data$SEX=factor(birth_data$SEX, levels=c(1,2,9),
                      labels=c("Male","Female","Unspecified"))
birth_data$MRACER=factor(birth_data$MRACER, levels=0:8, 
                         labels=c("Other","White","Black",
                                  "Ind. Amer",
                                  "Chinese","Japanese","Nat. HI",
                                  "Filipino","Other As"))
birth_data$MHISP=factor(birth_data$MHISP, levels=c("C","M","N","O","P","S","U"), 
                        labels=c("Cuban","Mexican","Non-Hispanic","Other Hispanic",
                                "Puerto Rican","Central/South American","Unknown"))

Value labels

table(birth_data$SEX)

## 
##        Male      Female Unspecified 
##      365673      349866          10

table(birth_data$MRACER)

## 
##     Other     White     Black Ind. Amer   Chinese  Japanese   Nat. HI 
##     83807    418182    174188     10117      3329       576        93 
##  Filipino  Other As 
##      2114     23143

Much better!

ggplot2

ggplot2 is a data visualization package that is part of the tidyverse suite of packages

ggplot2

In ggplot2 the structure of the code for plots can often be summarized as

ggplot + 
  geom_xxx

or, more precisely

ggplot(data = [dataset], mapping = aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   other options

Geoms, short for geometric objects, describe the type of plot you will produce
Don’t worry, we’ll go through several examples!

Exploring Birth Weight Distribution

ggplot(data = birth_data, mapping = aes(x = BWTG)) +
  geom_histogram(binwidth = 200) + xlab("Birth weight (g)")

Exploring the birth weight data

Make a prediction: What relationship do you expect to see between gestational age and birth weight?

Weight as a function of gestational age

ggplot(data = birth_data, mapping = aes(x = GEST, y = BWTG)) +
  geom_point() + xlab("Gestational age (weeks)") + ylab("Birth weight (g)") + 
  ggtitle("NC Births, 2011-2016")

Time for data cleaning! (Unless an Asian elephant snuck in….)
Why does the plot have a barred structure?

Data Cleaning

Because it will complicate all our plotting, we’re going to set the gestational age of the baby that is presumably NOT an elephant given its weight (elephants have the longest gestation of all mammals) to NA. Because this is a file of live births, we’re going to do the same to gestational periods less than 20 weeks and birth weights below 500 g. Ideally, we would use a criterion to flag unreasonable combinations of birth weight and gestational age as well (but we will save that for you, if you’d like!).

birth_data$GEST_C=birth_data$GEST; birth_data$BWTG_C=birth_data$BWTG
birth_data$GEST_C[birth_data$GEST_C>50]=NA
birth_data$GEST_C[birth_data$GEST_C<20]=NA
birth_data$BWTG_C[birth_data$BWTG_C<500]=NA

Birth weight by Gestational Age (Cleaned)

ggplot(data = birth_data, mapping = aes(x = GEST_C, y = BWTG_C)) +
  geom_point() + xlab("Gestational age (weeks)") + ylab("Birth weight (g)") + 
  ggtitle("NC Births, 2011-2016")

## Warning: Removed 1703 rows containing missing values (geom_point).

Much better, though I still don’t believe them all!

Extending the Plots

Can display additional variables with

aesthetics (like shape, colour, size), or
faceting (small multiples displaying different subsets)

Aesthetics options

Visual characteristics of plotting characters that can be mapped to data are

color
size
shape
alpha (transparency)

Weight by age by gender

ggplot(data = birth_data, mapping = aes(x = GEST_C, y = BWTG_C, color=SEX)) +
  geom_point() + xlab("Gestational age (weeks)") + ylab("Birth weight (g)") + 
  ggtitle("NC Births, 2011-2016")

That’s a bit hard to see with all the data points!

Faceting options

Smaller plots that display different subsets of the data
Useful for exploring conditional relationships and large data

Weight by age by gender

ggplot(data = birth_data, mapping = aes(x = GEST_C, y = BWTG_C)) +
  facet_grid(. ~ SEX) +
  geom_point() + xlab("Gestational age (weeks)") + ylab("Birth weight (g)") + 
  ggtitle("NC Births, 2011-2016")

Adding a Scatterplot Smoother

ggplot(data = birth_data, mapping = aes(x = GEST_C, y = BWTG_C)) +
  geom_smooth() + geom_point() + xlab("Gestational age (weeks)") + ylab("Birth weight (g)") + 
  ggtitle("NC Births, 2011-2016")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Boxplots

Box plots are a good way to compare distributions across groups.

ggplot(data = birth_data, mapping = aes(y = BWTG_C, x = SEX)) +
  geom_boxplot() + ylab("Birth weight (g)")

## Warning: Removed 1649 rows containing non-finite values (stat_boxplot).

Bar plots

Bar plots are a nice way to compare relative group sizes.

ggplot(data = birth_data, mapping = aes(x = MRACER)) +
  geom_bar() + xlab("Maternal race")

Flip plot

ggplot(data = birth_data, mapping = aes(x = MRACER)) +
  geom_bar() + coord_flip() + xlab("Maternal race")

Segmented bar plots, counts

ggplot(data = birth_data, mapping = aes(x = MRACER, fill = MHISP)) +
  geom_bar() + xlab("Maternal race")

In-class activities

Subset data to 2016 births in Durham county

birth2016=subset(birth_data,birth_data$YOB=='2016')
#CORES code for Durham Co is 32
durhamb16=subset(birth2016,birth2016$CORES=='32')

Create a scatterplot of the Durham data with birth weight on the y-axis and gestational age on the x-axis. Adapt the plot by exploring options in the geom_point() call. For example, you can try the following.

+geom_point(color='blue') #try other colors!
+geom_point(alpha=1/2) # try fractions down to 1/10 -- helpful?
+ geom_point(shape=3) #try options!
+ geom_point(size=2) #try options - fractions too

Explore various combinations to improve this graphic, also recalling the Weight by age by gender slide (with the color option in the mapping argument).

Explore box plots of birth weight by race and Hispanic ethnicity in Durham