Tuesday 27 March 2012

R: ifelse statements with multiple variables and NAs

ifelse statements in R are the bread and butter of recoding variables. Normally these are pretty easy to do, particularly when we are recoding off one variable, and that variable contains no missing values. There are lots of examples on how to do this simple coding already available, so I will simply redirect you to the posts here and here, which are excellent summaries of how to do simple ifelse coding.

Things get more complicated when:

  • the recoding is defined off more than one variable, and
  • the variables contain missing values (NA in R speak)
I've been working on this type of coding for the past day, and I thought I would share my tips with you as it took me quite a while to figure out how to do this, with numerous internet searches. Hopefully I can save someone else the time!

My data frame is a survey of the general population, looking at consumption of a particular type of food (we'll call it Widgets), because there was insufficient information about Widget-eaters to be able to cut a better sample frame. The end result is that I have a data frame where only around 10% of respondents say they have eaten Widgets. This means that I have lots of information about the relevant 10% of the population, and a much smaller amount of information about the other 90% (age, location, and sex only). Currently I am holding, and working on, all the data in one data frame without subsetting, although I may subset later. The upshot is that I have a data frame where 90% of the data is NA.

The current problem I am faced with is that there were a number of open-ended response options to questions because we had "Other, please specify" type options. These other responses were selected by only a subset of the Widget eaters (who are 10% of my respondents). These "Other" response options, which are answers to questions like "Where did you last buy Widgets", "Where did you obtain information about the benefits of eating Widgets" were coded into updated categories by subject matter experts and provided to me as Excel files. My role here is to merge the updated coding into the existing response option coding for each question where there is an open-ended response.

Normally, that's been pretty easy. Most of the questions have required a "pick one only" type response so it's just been a matter of updating an existing variable, from a response of "Other" to a better coded response. That's simply replacement coding, where an existing variable level is overwritten by a new one.

However, I have just been recoding response options to a "choose all that apply" question. And this is where my ifelse has smacked up against the NA problem. Here, I am recoding "Other" responses to the question  "Where did you obtain information about the benefits of eating Widgets?" 

There were two recodes I had to make:
  • update an existing variable to reflect that the open-ended response should have been coded to an existing response option (e.g. because of post-hoc decision to expand a response option, e.g. so that the response option "TV" now also includes "radio")
  • to construct new response option categories because the existing ones are not sufficient to capture the "other" response and there are a sufficiently large number of similar response options to justify creating a new response option.
There are 5001 respondents in the data frame. Of the 460 Widget eaters, 73 provided "other" responses to the question. The data came to me with 21 response options precoded, and the 73 responses to the "other" option were split off into Excel for recoding by a subject matter expert. I received back the recodes in the Excel file. I saved a copy of the Excel file to csv format, read that into R, and created a new variable "Other.Source" to capture all of the numeric recodes.  (I have created a data dictionary that shows what each numeric code stands for.) Inside the master R data frame ("foo"), the 20 non-"Other" responses are coded as separate variables (because this is an "all that apply" question), from C1a to C1t, with C1r as the "Other" category. The codes are:
  • 1 if the Widget eater gave that response
  • 0 if the Widget eater did not give that response
  • NA if the respondent is not a Widget eater
The Other.Source value equals the response option number for question C1 in the survey. For example, where the response should have been coded to response option 1 (i.e. variable C1a in foo), the value in Other.Source has been set to 1.

I have 4 recodes to apply to existing response option categories, and 4 new response option categories that need to be constructed as new variables in the data frame (taking the range C1u to C1x to keep with the extant metadata framework). I have had to use a slightly different approach to recode, depending on whether I am updating an existing variable, or creating a new one. I show both approaches below.

Other.Source has been merged into foo, ahead of the work below. This means that all respondents who did not have an "other" response for the question, regardless of whether or not they are Widget eaters, are NA for Other.Source in foo.

First, updating an existing variable, based on values to two variables. This shows how I updated the 7th response option, which was coded to variable C1g in foo. I had been constructing a temporary variable to take the recodes (foo$temp) because I was having so many issues getting my code correct)
table(foo$C1g)
#code below is complicated because a missing value screws up the counts, so need to exclude missing values
#explicitly, hence all the !is.na() syntax = ensuring that the value is NOT missing
foo$temp <- ifelse(!is.na(foo$C1g) & (foo$C1g==1) | (!is.na(foo$Other.Source) & foo$Other.Source==7),1,
                      ifelse(foo$C1g==0 & !is.na(foo$Other.Source) & foo$Other.Source!=7,0,
                             ifelse(foo$C1g==0 & is.na(foo$Other.Source),0,NA)))
foo$C1g <- foo$temp

Walking through this code:
  • I produce a table of the original responses coded to C1g so I can make sure that the update means that I have at least the same number of  "1" responses after recoding. If I have less, something has gone wrong.
  • Some comments to remind myself of why I have such complicated code: it's all down to the NAs!
  • The first ifelse is giving "1" codes to temp where C1g contains 1 or where Other.Source contains 7 (the value indicating the "other" response should have been coded to the 7th response option, i.e. C1g). Because both C1g and Other.Source contain NAs, !is.na() needs to be added as an extra condition for both variable checks, otherwise NA values interfere with getting the correct test result. Note that this contains an OR (|) condition: if either C1g OR Other.Source indicates that the 7th response option is valid for the respondent, set temp to "1". 
  • The second ifelse puts temp to 0 if C1g is 0 (i.e. the respondent is a Widget eater and they were not coded to the 7th response option originally), and that they gave an "other" response that was not related to the 7th response option. Note how the !is.na() condition has been included for Other.Source, again it is included because only using  foo$Other.Source!=7 without the !is.na would give incorrect results.
  • The third ifelse is there because only a subset of Widget eaters gave "other" responses, so most Widget eaters will be NA for Other.Source, and we want to set them to "0" (did not give the 7th response option) and not NA (were not Widget eaters).
  • The NA result will therefore only be assigned where the respondent is not a Widget eater.
  • The last line replaces the original (and incorrect) C1g variable with the updated (and correct) results stored in the temp variable.
What about creating a new variable? Well, as I mentioned above I had to do this four times. The code is a little different because we don't have an extant response option variable to match off. What I do have is a UseMentioned variable that indicates whether the respondent is a Widget eater (value="Yes") or not (value="No"). So there are no NAs in the UseMentioned variable, which is part of foo. The code to do the new variable construction is below. We are constructing the 24th variable, which is named C1x*:
foo$temp <- ifelse(!is.na(foo$Other.Source) & foo$Other.Source==25,1,
                      ifelse((is.na(foo$Other.Source) | foo$Other.Source!=25) & foo$UseMentioned=="Yes",0,NA))
foo$C1x <- Aussie$temp

Walking through this code:
  • The first ifelse gives the value 1 to the new variable if the respondent picked this response option, with the !is.na() preventing problems arising from NA values in Other.Source.
  • The second ifelse gives the value 0 to the new variable if the respondent is a Widget eater (UseMentioned=="Yes") and the respondent did not provide this response option. Note the use of is.na() and not the !is.na()that was all that was used in the previous recoding example. While it would appear that all that is required is the  UseMentioned=="Yes" criterion, the presence of NAs in the Other.Source variable causes issues so such a simple criterion cannot be used.
* The recode value is 25 and not 24, because of some complexity in the metadata where I have one variable that takes different values depending on which of two countries the respondent lives in, but the letter of the alphabet used for the variable has been made the same. So the variable letters are one out.

I hope you find this useful.


Saturday 14 January 2012

R: ggplot2 for social scientists

I've been tied up with other work, so things have been a bit slow on the mixed methods development. I hope to do another post on the analysis in about a week, but today I would like to do an introduction to ggplot2 and demonstrate some of its usefulness to social scientists when one is dealing with grouped data, e.g. in a regression.

If you remember from the last post, I got as far as fitting a linear mixed effects model to the data using the lme4 package. That gave a mixed effects model, that was linear in both the fixed (age and day of intake) and random (subject) effects.

This is a multivariate model, so one method to examine the model appropriateness, given that it is fitted in multidimensional space, is to look at a plot of the residuals against the fitted values. I showed the plot in the previous post, but for ease of comparison, it is reproduced here:

We look at residuals plots to see if the model assumptions are not met. The types of assumptions that can be tested using a residuals plot are:

  1. The residuals should have a mean of 0. This is demonstrated by equal scatter of residuals above and below the horizontal line at y=0.
  2. The residuals should be independent, e.g. there should be no autocorrelation. This can be an issue with time series data, where there is (for example) a repeating seasonal effect.
  3. The residuals should show constant variance over the fitted values. The graph does not appear to show a  "fanning out" of the residuals, which would suggest heteroscedasticity (the technical name for non-constant variance).
  4. The residuals should be normally distributed. This does appear to be an issue from the graph, as the left hand, middle, and right hand sides of the residuals graph do not appear to be similarly distributed.
ggplot2 can help with our examination of the residuals plot. This sets up the data that I use in ggplot2 and creates the first graph:
library(ggplot2)
Male.Graph <- as.data.frame(fitted(Male.lme1))
names(Male.Graph)[names(Male.Graph)=="fitted(Male.lme1)"] <- "fits"
Male.Graph$resids <- c(residuals(Male.lme1))
Male.Graph$AgeFactor <- Male.Data$AgeFactor
Male.Plot <- qplot(fits,resids, data=Male.Graph, geom=c("point","smooth"))
Male.Plot
This is a much more attractive version of the first graph above, and the fitted line clearly shows something not quite right with the residuals. A normal probability plot is needed, and this can be done in base R. The output confirms the suspicion that the residuals are not normally distributed. The second line of code below plots the line so that it is easier to see where deviations from normality occur.
qqnorm(Male.Graph$resids)
qqline(Male.Graph$resids)
Now we can go back to having some fun with ggplot2. There are four different age groups of males in the data, how do the groups affect the residuals? ggplot2 allows for easy group identification in plots:
Male.Plot2 <- qplot(fits,resids, data=Male.Graph, colour=AgeFactor)
Male.Plot2


That looks interesting, and the values are as expected: as the children get older, generally their energy intake increases. But are the trend lines for the residuals for each group the same? Again, ggplot2 can help, and I have set quite a wide trend bar and shrunk the size of the points so the the bar is more evident:
Male.PlotBasic <- ggplot(Male.Graph, aes(fits, resids)) +
    geom_point(size=0.01) +
    geom_smooth(size=1.5, se=F)
Male.PlotBasic + aes(colour=factor(AgeFactor))

The lines are clearly not parallel, so there are differences in model fit between the age groups. 

There are a lot more options in the ggplot2 package, which makes these types of plots much easier and faster to produce compared to using base R.

A small plug for the author: I have used the book ggplot2: Elegant Graphics for Data Analysis to skill myself up on this package. I purchased the Kindle version as I am trying to get my technical books into e-book format and Hadley Wickham, who is the author of the package as well as the author of the book, was very helpful answering my questions about the book before I purchased it.