Tuesday 14 May 2013

Easier confidence interval estimation with matrices and similar arrays in R

When dealing with survey data in particular, social scientists are often wanting to produce proportions from the data, and associated confidence intervals. The prop.test command in R can be used to generate the desired results. When dealing with small tables, such as 2x2, prop.test is easily applied to the output of an xtabs command, e.g.
load(MASS)
test <- xtabs(~low+smoke, data=birthwt)
test
#get 95% CI estimate of the proportion of low birth weight babies who had mothers who smoked during pregnancy
prop.test(test[4], test[2]+test[4], 0.05)

We can see that the xtab output can be referenced in the prop.test command by putting in the cell reference information for the data. The cell references are column-by-column, so go down each row in the first column, before moving onto the second column. We can also see that this type of prop.test command is quite short when the xtabs output is relatively small.

Issues arise when there are multiple rows or columns in the xtabs output. For example, if we wanted to examine number of physician's visits in the first trimester by mother's age:
test2 <- xtabs(~age+ftv, data=birthwt) 
test2 

Now if we want to know the proportion and associated CIs for each age of the mother, for each number of physician visits, there are 24 cells to sum for each physician visit count. This is way too long to enter into our prop.test; while prop.test can handle summing all these numbers, the R syntax will be hard to understand.

This is where colSums is a handy command. The column sums for test2 are simply calculated using:
colSums(test2)

 This command can be easily used in prop.test. For example, to get the proportion of mothers who saw no physical, who were 19 years of age, the prop.test command is:
prop.test(test2[6], colSums(test2)[1], 0.05)

Because the count we want is cell number 6 in test2, and the colSum we want is for the first column. We can instruct R to conduct the colSums test within prop.test, or alternatively we could assign the colSums result to an object, and then reference the cell in the object.
 
 

Friday 4 January 2013

Shortening code in R: "with" and "within" are your friends

If you're anything like me, you use words to name both data frames and variables. I use metadata documentation to keep track of new variables I have constructed, and the logic I have used to construct them (and why I bothered!) but I find many variable names (such as var1) or using cutdown phrases (such as  F1519Inc for an income variable for females aged 15 to 19 years) to describe a variable become relatively noninformative and confusing as time passes since I constructed my variables.

The main issue with using longer dataframe and variable names is that they take up so much space in the code. You can end up spending more time trying to lay out your code so it is readable, and less time thinking and doing analyses. But R comes with two handy tools for shortening code:

  • with
  • within
I had been using R for a few years before I found out about these two handy commands, and now I use them all the time.

Let's work through these with an example. Because R comes with the package MASS present, I'll use data supplied with MASS. We're going to use "minn38" which is data relating to  Minnesota high school graduates of 1938. I'm using social science data as the example as this is the type of data I use most often.

First we need to get MASS into our workspace. Once we load the package in, the MASS datasets become available for us to use:
library("MASS")

and we check we can see the minn38 package:
str(minn38)

Now let's construct a new data frame off this one with longer, more informative names:
names(Minn.High.School.Grad.1938)[names(Minn.High.School.Grad.1938)=="hs"]<-"High.School.Rank"

names(Minn.High.School.Grad.1938)[names(Minn.High.School.Grad.1938)=="phs"]<-"Post.High.School.Status"
names(Minn.High.School.Grad.1938)[names(Minn.High.School.Grad.1938)=="fol"]<-"Father.Occup.Level"
names(Minn.High.School.Grad.1938)[names(Minn.High.School.Grad.1938)=="f"]<-"Count"

if you run the str command on the new dataframe, you will see that it contains the updated variable names.

Now, just for fun, let's create a new variable that interacts Sex and High School Rank. We can use the ifelse command. Because the names are so long in this example, I've only coded for two levels and then cheated by setting every other level to "Other".
Minn.High.School.Grad.1938$Rank.by.Sex <- ifelse(Minn.High.School.Grad.1938$High.School.Rank=="L" & Minn.High.School.Grad.1938$sex=="F", "Low Rank Female",

                                                 ifelse(Minn.High.School.Grad.1938$High.School.Rank=="L" & Minn.High.School.Grad.1938$sex=="M", "Low Rank Male", "Other"))

That's some loooooooooooong code lines. So, the pro of using long data frame and variable names is that you can easily see what data frame and what variable you should use. The con is that it makes your code so much harder to lay out.

Let's shorten the code by using with. The R description of with is available here. We can remove the repeated call to the data frame name in our ifelse statements, because the with command is telling R that we are using that single data frame for all variable calls:
Minn.High.School.Grad.1938$Rank.by.Sex2 <- with(Minn.High.School.Grad.1938, {ifelse(High.School.Rank=="L" & 

  sex=="F", "Low Rank Female",
                                                 ifelse(High.School.Rank=="L" & sex=="M", "Low Rank Male",
                                                        "Other"))})

Alternatively, the within command can be used to the same end:

Minn.High.School.Grad.1938 <- within (Minn.High.School.Grad.1938, {
  Rank.by.Sex2 <- ifelse(High.School.Rank=="L" & sex=="F", "Low Rank Female",
                     ifelse(High.School.Rank=="L" & sex=="M", "Low Rank Male", "Other"))})

The within method is preferred if you have a lot of variables to recode simultaneously, as you're only specifying the data frame at the start. All the other variables can be inserted ahead of the closing curly brackets.

Key points to note:
  • there must be a comma following the data frame name
  • both commands uses curly brackets, so remember to change your bracket type
  • the with command can be used in situations other than data preparation, e.g. in the formula for a regression, see the R examples
  • the two commands are only functionally equivalent in this example because the with command is being used to construct a permanent variable, there are examples (e.g. in regressions) where with has a transient effect