Infant Mortality Around the World

CIA Factbook Data

Using data relevant to countries around the world, we will explore data science tools to tell complex stories through data visualization. We will consider data on infant mortality in class, and for the report you are welcome to use any data that can be plotted by country.

Additional Case Study Data

Case Study Goals

Using data from the World Bank or a complex data source of your own choosing,

  • create a Tableau dashboard interactive visual presentation of our world in data
  • explain your figures to class in an oral presentation by the group

Case for Class: Infant Mortality

Infant death is defined as the death of a live-born infant during the first year of life. The infant mortality rate is the number of infant deaths per 1000 live births. The infant mortality rate, which is relatively easily measured even in resource-poor settings, is often used as a marker of the overall health of a society because of the overlap in causes of infant mortality and factors that are likely to influence the health of an overall population, such as general living conditions, poverty and socioeconomic status, quality of medical care, and rates of illness. Infant mortality rates can be compared across different areas or across time to evaluate the impact of public health program or to determine where new investments in health are needed.

Changes in US Infant Mortality over Time

Infant Mortality in US, 1915-2007 (NCHS)
Infant Mortality in US, 1915-2007 (NCHS)

What do you think about the way the authors chose to display time in the plot?

Disparities in Infant Mortality Rate

While overall infant mortality rates in the US have declined over time, major disparities in infant mortality rates exist, shown here as a function of race and ethnicity. Infant Mortality in US by Race/Ethnicity, 2005-2007

Infant Mortality Worldwide

Worldmapper created a visualization of infant mortality worldwide in 2002, rescaling each country by its contribution to infant deaths worldwide.

2002 Contributions to Worldwide Infant Mortality
2002 Contributions to Worldwide Infant Mortality

Mapping Infant Mortality

In order to create our map of infant mortality around the world, we will need to identify the following data.

  • Infant mortality data for time period of interest by country
  • Country mapping information and link to infant mortality data

Extracting Latitude and Longitude of Country Centroids

First, we will obtain latitude and longitude information for the center of each country. We can then link that to infant mortality rates for each country.

latlonDF <-read.csv("http://dev.maxmind.com/static/csv/codes/country_latlon.csv")

Population & Infant Mortality Data

The data files for population and infant mortality in 2014 are available on the course website.

load("data/popDF.RData")
load("data/infMortDF.RData")

Merging the Demographic Data

First we’ll combine the demographic data on infant mortality and population size because these come from the same source and consequently should have fewer problems with matching.

IMPop = merge(infMortDF, popDF, by = "ctry", all = FALSE)

Missed Merges

In the merge of infMortDF and popDF, some rows were excluded from the resulting data frame. Determine which rows in one data frame did not find a match in the other.

Check to see if all countries in the infant mortality dataset are also in the population size dataset:

which((!infMortDF$ctry %in% popDF$ctry))
## integer(0)

Missed Merges

Which countries are in the population data set but not the infant mortality data set?

popDF$ctry[which((!popDF$ctry %in% infMortDF$ctry))]
##  [1] AX KT CK UC DX FK VT KV MJ NE NF PC TB RN SV TL
## 272 Levels:  AA AC AE AF AG AJ AL AM AN AO AQ AR AS AT AU AV AX AY ... ZI

Appendix D of the CIA World Factbook contains country codes: AX (Akrotiri, a special UK overseas territory near Cyprus), KT (Christmas Island), CK (Cocos/Keeling Islands), UC (Curacao), DX (Dhekelia, see AX), FK (Falkland Islands), VT (Vatican City) , KV (Kosovo), MJ (Montenegro), NE (Niue), NF (Norfolk Island), PC (Pitcairn Islands), TB (St. Barthelemy), RN (St. Martin), SV (Svalbard), TL (Tokelau)

Merging the geographic and demographic data sets

latlonDF$code <- latlonDF$iso.3166.country
allCtryData = merge(IMPop, latlonDF, by.x = "ctry", by.y = "code", all = FALSE)

Initial Plot - The Colors

First, we’ll begin by selecting the colors for our plot. We’ll select only 5 colors, and consequently 5 levels of mortality, because it’s difficult for our eyes to distinguish among more than 5 to 7 colors.

library(RColorBrewer)
#display.brewer.all()
cols = brewer.pal(9, "YlOrRd")[c(1, 2, 4, 6, 7)]

Initial Plot - Categorizing Infant Mortality

Next, we’ll need to categorize infant mortality rate so that it can be connected to a color.

# Equally Divide the Range by Width 
newInfMort = cut(allCtryData$infMort, breaks = 5)
summary(newInfMort)
## (1.69,24.9]   (24.9,48]   (48,71.1] (71.1,94.1]  (94.1,117] 
##         102          19          17           7           3

Hmm, the good news is that the right tail is not very fat. However, this may not be a great choice for displaying the data.

Categorizing Mortality Rate

We want our colors to be more evenly distributed than they are based on the default. One option is to put 20% of the data in each color group, though that may lead to cutpoints at odd locations.

quantile(allCtryData$infMort, probs = seq(0, 1, by = 0.2))
##      0%     20%     40%     60%     80%    100% 
##   1.810   5.238  11.362  20.388  45.366 117.230

A drawback of quintiles is that we may not want to focus on countries with lower rates, instead isolating the very high risk countries.

Categorizing Mortality Rate

Another option is to look at a histogram and pick cutpoints based on easy references that further isolate the right tail.

hist(allCtryData$infMort, breaks = 20, main = "",
     xlab = "Infant Mortality per 1000 Live Births")

Categorizing Mortality Rate

InfMortDiscrete = cut(allCtryData$infMort,
                      breaks = c(0, 10, 25, 50, 75, 125))
summary(InfMortDiscrete)
##   (0,10]  (10,25]  (25,50]  (50,75] (75,125] 
##       51       51       19       21        6

Affected Lives

The infant mortality rate is not the only factor in decision-making. We are more likely to intervene when more lives are at risk. Let’s examine population size.

hist(allCtryData$pop, breaks = 20, main = "",
     xlab = "Population Size")

Large variance – let’s try a scale transformation.

Population Size

hist(sqrt(allCtryData$pop), breaks = 20, main = "",
     xlab = "sqrt(Population Size)")

That’s much better. We’ll work with some function of square root of population size (further scalng to be sure the radius of each point is not too large for our plot).

Initial Plot

Now we are ready to plot the mortality rates around the world. Size of the points will be a function of population size, while color will be a function of the mortality rate (darker=higher).

# Create map
library(maps)
world = map(database = "world", fill = TRUE, col="light grey")

# Add discs, circles sets radius of discs, fg outlines discs and bg fills them in
symbols(allCtryData$longitude, allCtryData$latitude, add = TRUE,
        circles= sqrt(allCtryData$pop)/4000, inches = FALSE,
        fg = cols[InfMortDiscrete], bg = cols[InfMortDiscrete])

# Add legend
legend(x = -150, y = 0, title = "Infant Mortality",
       legend = levels(InfMortDiscrete), fill = cols, cex = 0.8)

Infant Mortality Worldwide

Hmm, what’s wrong with this picture?

Correcting Radius Size

The small countries have symbols that are so tiny that we can barely see them, if at all, and we certainly can’t see their colors (check out the Caribbean). We will fix this by having a minimum radius, so any country with a population below a certain value is still visible.

# Set minimum radius size
rads = pmax(sqrt(allCtryData$pop)/4000, 1)

# Fix Discs
world = map(database = "world", fill = TRUE, col="light grey")
symbols(allCtryData$longitude, allCtryData$latitude, add = TRUE,
        circles= rads, inches = FALSE,
        fg = cols[InfMortDiscrete], bg = cols[InfMortDiscrete])
legend(x = -150, y = 0, title = "Infant Mortality",
       legend = levels(InfMortDiscrete), fill = cols, cex = 0.8)

Correcting Radius Size

Further Corrections

Looking at our plot we can see that something is still very wrong! Some European country has a huge radius. China’s population size is also incorrect, and does Russia really have no infant mortality?

We probably made an error in the merge step somehow. Here was our code.

latlonDF$code <- latlonDF$iso.3166.country
allCtryData = merge(IMPop, latlonDF, by.x = "ctry", by.y = "code", all = FALSE)

For the latitude/longitude files, we used ISO 3166 country codes, set by the International Organization for Standardization. Did the CIA Factbook use the same codes?

Merging by Country Name

allCtryData[ allCtryData$ctry %in% c("CH","GB","RS","NI"), ]
##     ctry infMort        pop iso.3166.country latitude longitude
## 29    CH   14.79 1355692576               CH       47         8
## 50    GB   47.03    1672597               GB       54        -2
## 99    NI   74.09  177155754               NI       13       -85
## 114   RS    7.08  142470272               RS       44        21

Corrected Data Merging

The R data frame codeMapDF (extracted from the CIA factbook) contains the crosswalk between the CIA and ISO country coding and can be used to conduct a corrected merge of the data.

# Fix Data Merging
load("data/codeMapDF.RData")
latlonCtry = merge(latlonDF, codeMapDF, by.x = "code", by.y = "iso", all = FALSE)
IMpopCtry = merge(IMPop, codeMapDF, by.x="ctry", by.y = "cia", all=FALSE)
allCtryData <- merge(latlonCtry, IMpopCtry, by="name")
names(allCtryData)[1] <- "Country"
save(allCtryData, file="data/allCtryData.RData") # save for Tableau tutorial

# Check Data
allCtryData[allCtryData$ctry %in% c("CH", "SZ", "GB", "UK", "NU","NI","RI","RS"), 
            c(1,6,10,4,5,8,9)]
##            Country cia iso latitude longitude infMort        pop
## 43           China  CH  CN       35    105.00   14.79 1355692576
## 70           Gabon  GB  GA       -1     11.75   47.03    1672597
## 141      Nicaragua  NU  NI       13    -85.00   20.36    5848641
## 143        Nigeria  NI  NG       10      8.00   74.09  177155754
## 159         Russia  RS  RU       60    100.00    7.08  142470272
## 171         Serbia  RI  RS       44     21.00    6.16    7209764
## 186    Switzerland  SZ  CH       47      8.00    3.73    8061516
## 203 United Kingdom  UK  GB       54     -2.00    4.44   63742977

Fixing Map

InfMortDiscrete = cut(allCtryData$infMort,
                      breaks = c(0, 10, 25, 50, 75, 150))
# Redo Map
world = map(database = "world", fill = TRUE, col="light grey")
rads = pmax(sqrt(allCtryData$pop)/4000, 1)
symbols(allCtryData$longitude, allCtryData$latitude, add = TRUE,
        circles= rads, inches = FALSE,
        fg = cols[InfMortDiscrete], bg = cols[InfMortDiscrete])
legend(x = -150, y = 0, title = "Infant Mortality",
       legend = levels(InfMortDiscrete), fill = cols, cex = 0.8)

Fixing Map

In case you’re curious!

This map includes the countries for which there is latitude and longitude, even if there is no demographic information.

# Remerge the data
allCtryData <- merge(latlonCtry, IMpopCtry, by="name", all.x=TRUE, all.y =FALSE)

# Update the colors & add NA as a level
cols = brewer.pal(9, "YlOrRd")[c(1, 2, 4, 6, 7, 9)]
InfMortDiscrete = cut(allCtryData$infMort,
                      breaks = c(0, 10, 25, 50, 75, 150))
InfMortDiscrete <- addNA(InfMortDiscrete) 

# Add radius size for NA
rads = pmax(sqrt(allCtryData$pop)/4000, 1)
for(i in 1:length(rads)){
  if(is.na(rads[i])){rads[i] <- 1}
}

Revised Map

world = map(database = "world", fill = TRUE, col="light grey")
symbols(allCtryData$longitude, allCtryData$latitude, add = TRUE,
        circles= rads, inches = FALSE,
        fg = cols[InfMortDiscrete], bg = cols[InfMortDiscrete])
legend(x = -150, y = 0, title = "Infant Mortality",
       legend = levels(InfMortDiscrete), fill = cols, cex = 0.8)