Using data relevant to countries around the world, we will explore data science tools to tell complex stories through data visualization. We will consider data on infant mortality in class, and for the report you are welcome to use any data that can be plotted by country.
Using data from the World Bank or a complex data source of your own choosing,
Infant death is defined as the death of a live-born infant during the first year of life. The infant mortality rate is the number of infant deaths per 1000 live births. The infant mortality rate, which is relatively easily measured even in resource-poor settings, is often used as a marker of the overall health of a society because of the overlap in causes of infant mortality and factors that are likely to influence the health of an overall population, such as general living conditions, poverty and socioeconomic status, quality of medical care, and rates of illness. Infant mortality rates can be compared across different areas or across time to evaluate the impact of public health program or to determine where new investments in health are needed.
What do you think about the way the authors chose to display time in the plot?
While overall infant mortality rates in the US have declined over time, major disparities in infant mortality rates exist, shown here as a function of race and ethnicity.
Worldmapper created a visualization of infant mortality worldwide in 2002, rescaling each country by its contribution to infant deaths worldwide.
In order to create our map of infant mortality around the world, we will need to identify the following data.
First, we will obtain latitude and longitude information for the center of each country. We can then link that to infant mortality rates for each country.
latlonDF <-read.csv("http://dev.maxmind.com/static/csv/codes/country_latlon.csv")
The data files for population and infant mortality in 2014 are available on the course website.
load("data/popDF.RData")
load("data/infMortDF.RData")
First we’ll combine the demographic data on infant mortality and population size because these come from the same source and consequently should have fewer problems with matching.
IMPop = merge(infMortDF, popDF, by = "ctry", all = FALSE)
In the merge of infMortDF and popDF, some rows were excluded from the resulting data frame. Determine which rows in one data frame did not find a match in the other.
Check to see if all countries in the infant mortality dataset are also in the population size dataset:
which((!infMortDF$ctry %in% popDF$ctry))
## integer(0)
Which countries are in the population data set but not the infant mortality data set?
popDF$ctry[which((!popDF$ctry %in% infMortDF$ctry))]
## [1] AX KT CK UC DX FK VT KV MJ NE NF PC TB RN SV TL
## 272 Levels: AA AC AE AF AG AJ AL AM AN AO AQ AR AS AT AU AV AX AY ... ZI
Appendix D of the CIA World Factbook contains country codes: AX (Akrotiri, a special UK overseas territory near Cyprus), KT (Christmas Island), CK (Cocos/Keeling Islands), UC (Curacao), DX (Dhekelia, see AX), FK (Falkland Islands), VT (Vatican City) , KV (Kosovo), MJ (Montenegro), NE (Niue), NF (Norfolk Island), PC (Pitcairn Islands), TB (St. Barthelemy), RN (St. Martin), SV (Svalbard), TL (Tokelau)
latlonDF$code <- latlonDF$iso.3166.country
allCtryData = merge(IMPop, latlonDF, by.x = "ctry", by.y = "code", all = FALSE)
First, we’ll begin by selecting the colors for our plot. We’ll select only 5 colors, and consequently 5 levels of mortality, because it’s difficult for our eyes to distinguish among more than 5 to 7 colors.
library(RColorBrewer)
#display.brewer.all()
cols = brewer.pal(9, "YlOrRd")[c(1, 2, 4, 6, 7)]
Next, we’ll need to categorize infant mortality rate so that it can be connected to a color.
# Equally Divide the Range by Width
newInfMort = cut(allCtryData$infMort, breaks = 5)
summary(newInfMort)
## (1.69,24.9] (24.9,48] (48,71.1] (71.1,94.1] (94.1,117]
## 102 19 17 7 3
Hmm, the good news is that the right tail is not very fat. However, this may not be a great choice for displaying the data.
We want our colors to be more evenly distributed than they are based on the default. One option is to put 20% of the data in each color group, though that may lead to cutpoints at odd locations.
quantile(allCtryData$infMort, probs = seq(0, 1, by = 0.2))
## 0% 20% 40% 60% 80% 100%
## 1.810 5.238 11.362 20.388 45.366 117.230
A drawback of quintiles is that we may not want to focus on countries with lower rates, instead isolating the very high risk countries.
Another option is to look at a histogram and pick cutpoints based on easy references that further isolate the right tail.
hist(allCtryData$infMort, breaks = 20, main = "",
xlab = "Infant Mortality per 1000 Live Births")
InfMortDiscrete = cut(allCtryData$infMort,
breaks = c(0, 10, 25, 50, 75, 125))
summary(InfMortDiscrete)
## (0,10] (10,25] (25,50] (50,75] (75,125]
## 51 51 19 21 6
The infant mortality rate is not the only factor in decision-making. We are more likely to intervene when more lives are at risk. Let’s examine population size.
hist(allCtryData$pop, breaks = 20, main = "",
xlab = "Population Size")
Large variance – let’s try a scale transformation.
hist(sqrt(allCtryData$pop), breaks = 20, main = "",
xlab = "sqrt(Population Size)")
That’s much better. We’ll work with some function of square root of population size (further scalng to be sure the radius of each point is not too large for our plot).
Now we are ready to plot the mortality rates around the world. Size of the points will be a function of population size, while color will be a function of the mortality rate (darker=higher).
# Create map
library(maps)
world = map(database = "world", fill = TRUE, col="light grey")
# Add discs, circles sets radius of discs, fg outlines discs and bg fills them in
symbols(allCtryData$longitude, allCtryData$latitude, add = TRUE,
circles= sqrt(allCtryData$pop)/4000, inches = FALSE,
fg = cols[InfMortDiscrete], bg = cols[InfMortDiscrete])
# Add legend
legend(x = -150, y = 0, title = "Infant Mortality",
legend = levels(InfMortDiscrete), fill = cols, cex = 0.8)
Hmm, what’s wrong with this picture?
The small countries have symbols that are so tiny that we can barely see them, if at all, and we certainly can’t see their colors (check out the Caribbean). We will fix this by having a minimum radius, so any country with a population below a certain value is still visible.
# Set minimum radius size
rads = pmax(sqrt(allCtryData$pop)/4000, 1)
# Fix Discs
world = map(database = "world", fill = TRUE, col="light grey")
symbols(allCtryData$longitude, allCtryData$latitude, add = TRUE,
circles= rads, inches = FALSE,
fg = cols[InfMortDiscrete], bg = cols[InfMortDiscrete])
legend(x = -150, y = 0, title = "Infant Mortality",
legend = levels(InfMortDiscrete), fill = cols, cex = 0.8)
Looking at our plot we can see that something is still very wrong! Some European country has a huge radius. China’s population size is also incorrect, and does Russia really have no infant mortality?
We probably made an error in the merge step somehow. Here was our code.
latlonDF$code <- latlonDF$iso.3166.country
allCtryData = merge(IMPop, latlonDF, by.x = "ctry", by.y = "code", all = FALSE)
For the latitude/longitude files, we used ISO 3166 country codes, set by the International Organization for Standardization. Did the CIA Factbook use the same codes?
allCtryData[ allCtryData$ctry %in% c("CH","GB","RS","NI"), ]
## ctry infMort pop iso.3166.country latitude longitude
## 29 CH 14.79 1355692576 CH 47 8
## 50 GB 47.03 1672597 GB 54 -2
## 99 NI 74.09 177155754 NI 13 -85
## 114 RS 7.08 142470272 RS 44 21
The R data frame codeMapDF (extracted from the CIA factbook) contains the crosswalk between the CIA and ISO country coding and can be used to conduct a corrected merge of the data.
# Fix Data Merging
load("data/codeMapDF.RData")
latlonCtry = merge(latlonDF, codeMapDF, by.x = "code", by.y = "iso", all = FALSE)
IMpopCtry = merge(IMPop, codeMapDF, by.x="ctry", by.y = "cia", all=FALSE)
allCtryData <- merge(latlonCtry, IMpopCtry, by="name")
names(allCtryData)[1] <- "Country"
save(allCtryData, file="data/allCtryData.RData") # save for Tableau tutorial
# Check Data
allCtryData[allCtryData$ctry %in% c("CH", "SZ", "GB", "UK", "NU","NI","RI","RS"),
c(1,6,10,4,5,8,9)]
## Country cia iso latitude longitude infMort pop
## 43 China CH CN 35 105.00 14.79 1355692576
## 70 Gabon GB GA -1 11.75 47.03 1672597
## 141 Nicaragua NU NI 13 -85.00 20.36 5848641
## 143 Nigeria NI NG 10 8.00 74.09 177155754
## 159 Russia RS RU 60 100.00 7.08 142470272
## 171 Serbia RI RS 44 21.00 6.16 7209764
## 186 Switzerland SZ CH 47 8.00 3.73 8061516
## 203 United Kingdom UK GB 54 -2.00 4.44 63742977
InfMortDiscrete = cut(allCtryData$infMort,
breaks = c(0, 10, 25, 50, 75, 150))
# Redo Map
world = map(database = "world", fill = TRUE, col="light grey")
rads = pmax(sqrt(allCtryData$pop)/4000, 1)
symbols(allCtryData$longitude, allCtryData$latitude, add = TRUE,
circles= rads, inches = FALSE,
fg = cols[InfMortDiscrete], bg = cols[InfMortDiscrete])
legend(x = -150, y = 0, title = "Infant Mortality",
legend = levels(InfMortDiscrete), fill = cols, cex = 0.8)
This map includes the countries for which there is latitude and longitude, even if there is no demographic information.
# Remerge the data
allCtryData <- merge(latlonCtry, IMpopCtry, by="name", all.x=TRUE, all.y =FALSE)
# Update the colors & add NA as a level
cols = brewer.pal(9, "YlOrRd")[c(1, 2, 4, 6, 7, 9)]
InfMortDiscrete = cut(allCtryData$infMort,
breaks = c(0, 10, 25, 50, 75, 150))
InfMortDiscrete <- addNA(InfMortDiscrete)
# Add radius size for NA
rads = pmax(sqrt(allCtryData$pop)/4000, 1)
for(i in 1:length(rads)){
if(is.na(rads[i])){rads[i] <- 1}
}
world = map(database = "world", fill = TRUE, col="light grey")
symbols(allCtryData$longitude, allCtryData$latitude, add = TRUE,
circles= rads, inches = FALSE,
fg = cols[InfMortDiscrete], bg = cols[InfMortDiscrete])
legend(x = -150, y = 0, title = "Infant Mortality",
legend = levels(InfMortDiscrete), fill = cols, cex = 0.8)