Saturday, 14 January 2012

R: ggplot2 for social scientists

I've been tied up with other work, so things have been a bit slow on the mixed methods development. I hope to do another post on the analysis in about a week, but today I would like to do an introduction to ggplot2 and demonstrate some of its usefulness to social scientists when one is dealing with grouped data, e.g. in a regression.

If you remember from the last post, I got as far as fitting a linear mixed effects model to the data using the lme4 package. That gave a mixed effects model, that was linear in both the fixed (age and day of intake) and random (subject) effects.

This is a multivariate model, so one method to examine the model appropriateness, given that it is fitted in multidimensional space, is to look at a plot of the residuals against the fitted values. I showed the plot in the previous post, but for ease of comparison, it is reproduced here:

We look at residuals plots to see if the model assumptions are not met. The types of assumptions that can be tested using a residuals plot are:

  1. The residuals should have a mean of 0. This is demonstrated by equal scatter of residuals above and below the horizontal line at y=0.
  2. The residuals should be independent, e.g. there should be no autocorrelation. This can be an issue with time series data, where there is (for example) a repeating seasonal effect.
  3. The residuals should show constant variance over the fitted values. The graph does not appear to show a  "fanning out" of the residuals, which would suggest heteroscedasticity (the technical name for non-constant variance).
  4. The residuals should be normally distributed. This does appear to be an issue from the graph, as the left hand, middle, and right hand sides of the residuals graph do not appear to be similarly distributed.
ggplot2 can help with our examination of the residuals plot. This sets up the data that I use in ggplot2 and creates the first graph:
Male.Graph <-
names(Male.Graph)[names(Male.Graph)=="fitted(Male.lme1)"] <- "fits"
Male.Graph$resids <- c(residuals(Male.lme1))
Male.Graph$AgeFactor <- Male.Data$AgeFactor
Male.Plot <- qplot(fits,resids, data=Male.Graph, geom=c("point","smooth"))
This is a much more attractive version of the first graph above, and the fitted line clearly shows something not quite right with the residuals. A normal probability plot is needed, and this can be done in base R. The output confirms the suspicion that the residuals are not normally distributed. The second line of code below plots the line so that it is easier to see where deviations from normality occur.
Now we can go back to having some fun with ggplot2. There are four different age groups of males in the data, how do the groups affect the residuals? ggplot2 allows for easy group identification in plots:
Male.Plot2 <- qplot(fits,resids, data=Male.Graph, colour=AgeFactor)

That looks interesting, and the values are as expected: as the children get older, generally their energy intake increases. But are the trend lines for the residuals for each group the same? Again, ggplot2 can help, and I have set quite a wide trend bar and shrunk the size of the points so the the bar is more evident:
Male.PlotBasic <- ggplot(Male.Graph, aes(fits, resids)) +
    geom_point(size=0.01) +
    geom_smooth(size=1.5, se=F)
Male.PlotBasic + aes(colour=factor(AgeFactor))

The lines are clearly not parallel, so there are differences in model fit between the age groups. 

There are a lot more options in the ggplot2 package, which makes these types of plots much easier and faster to produce compared to using base R.

A small plug for the author: I have used the book ggplot2: Elegant Graphics for Data Analysis to skill myself up on this package. I purchased the Kindle version as I am trying to get my technical books into e-book format and Hadley Wickham, who is the author of the package as well as the author of the book, was very helpful answering my questions about the book before I purchased it.