Thursday 26 March 2015

Getting variable labels in R, from SPSS

When using R, we may need find our data has been saved in a different statistics package. While there are some export functions in other statistical software that will export to a different filetype, or we may simply use a .csv file, R can import some datasets from their native filetype.

SPSS is one of those filetypes. SPSS datafiles have a .sav extension, and we can import these into R using the foreign package. This package is installed by default as part of the R core installation.

Ensure the foreign library is attached:
library(foreign)

There is a nifty trick to getting the filepath for the SPSS datafile you wish to import, use:
file.choose()

Copy and paste the filepath into this code:
dataset = read.spss("[filepath including filename goes here]", to.data.frame=TRUE)
The option at the end creates the R file as a dataframe, which is the type of data object I want in R.

Note: I am using dataset as my dataset name in this example. Use whatever name is best for you, and remember to change all instances of dataset to your actual dataset name in later code.

Unfortunately, if your SPSS datafile had variable labels (e.g. "Sex of respondent"), these aren't shown in the R dataframe, only the variable names are shown (e.g. Sex). While the name is often clear for variables such as sex, you may find that the names are less clear for other options (e.g. for a survey containing multiple "select all that apply" type questions/responses). It is therefore very useful to have the list of variable names and their associated labels.

You can simply print the concordance to the console by using:
attr(dataset, "variable.labels")

I didn't find this helpful, for two reasons:
  • I have a lot of variables, so it takes up a sizeable amount of console space
  • I am going to keep referring to the labels when I need to do analyses, and retaining the information in the console is not helpful if I have to keep scrolling back, or reissuing the command
I found it much more efficient to output the variable names and their labels to a separate dataframe that I can use:
dataset.labels <- as.data.frame(attr(dataset, "variable.labels"))

Voila, 5 lines of code to get my SPSS data and variable labels into R.

Tuesday 14 May 2013

Easier confidence interval estimation with matrices and similar arrays in R

When dealing with survey data in particular, social scientists are often wanting to produce proportions from the data, and associated confidence intervals. The prop.test command in R can be used to generate the desired results. When dealing with small tables, such as 2x2, prop.test is easily applied to the output of an xtabs command, e.g.
load(MASS)
test <- xtabs(~low+smoke, data=birthwt)
test
#get 95% CI estimate of the proportion of low birth weight babies who had mothers who smoked during pregnancy
prop.test(test[4], test[2]+test[4], 0.05)

We can see that the xtab output can be referenced in the prop.test command by putting in the cell reference information for the data. The cell references are column-by-column, so go down each row in the first column, before moving onto the second column. We can also see that this type of prop.test command is quite short when the xtabs output is relatively small.

Issues arise when there are multiple rows or columns in the xtabs output. For example, if we wanted to examine number of physician's visits in the first trimester by mother's age:
test2 <- xtabs(~age+ftv, data=birthwt) 
test2 

Now if we want to know the proportion and associated CIs for each age of the mother, for each number of physician visits, there are 24 cells to sum for each physician visit count. This is way too long to enter into our prop.test; while prop.test can handle summing all these numbers, the R syntax will be hard to understand.

This is where colSums is a handy command. The column sums for test2 are simply calculated using:
colSums(test2)

 This command can be easily used in prop.test. For example, to get the proportion of mothers who saw no physical, who were 19 years of age, the prop.test command is:
prop.test(test2[6], colSums(test2)[1], 0.05)

Because the count we want is cell number 6 in test2, and the colSum we want is for the first column. We can instruct R to conduct the colSums test within prop.test, or alternatively we could assign the colSums result to an object, and then reference the cell in the object.
 
 

Friday 4 January 2013

Shortening code in R: "with" and "within" are your friends

If you're anything like me, you use words to name both data frames and variables. I use metadata documentation to keep track of new variables I have constructed, and the logic I have used to construct them (and why I bothered!) but I find many variable names (such as var1) or using cutdown phrases (such as  F1519Inc for an income variable for females aged 15 to 19 years) to describe a variable become relatively noninformative and confusing as time passes since I constructed my variables.

The main issue with using longer dataframe and variable names is that they take up so much space in the code. You can end up spending more time trying to lay out your code so it is readable, and less time thinking and doing analyses. But R comes with two handy tools for shortening code:

  • with
  • within
I had been using R for a few years before I found out about these two handy commands, and now I use them all the time.

Let's work through these with an example. Because R comes with the package MASS present, I'll use data supplied with MASS. We're going to use "minn38" which is data relating to  Minnesota high school graduates of 1938. I'm using social science data as the example as this is the type of data I use most often.

First we need to get MASS into our workspace. Once we load the package in, the MASS datasets become available for us to use:
library("MASS")

and we check we can see the minn38 package:
str(minn38)

Now let's construct a new data frame off this one with longer, more informative names:
names(Minn.High.School.Grad.1938)[names(Minn.High.School.Grad.1938)=="hs"]<-"High.School.Rank"

names(Minn.High.School.Grad.1938)[names(Minn.High.School.Grad.1938)=="phs"]<-"Post.High.School.Status"
names(Minn.High.School.Grad.1938)[names(Minn.High.School.Grad.1938)=="fol"]<-"Father.Occup.Level"
names(Minn.High.School.Grad.1938)[names(Minn.High.School.Grad.1938)=="f"]<-"Count"

if you run the str command on the new dataframe, you will see that it contains the updated variable names.

Now, just for fun, let's create a new variable that interacts Sex and High School Rank. We can use the ifelse command. Because the names are so long in this example, I've only coded for two levels and then cheated by setting every other level to "Other".
Minn.High.School.Grad.1938$Rank.by.Sex <- ifelse(Minn.High.School.Grad.1938$High.School.Rank=="L" & Minn.High.School.Grad.1938$sex=="F", "Low Rank Female",

                                                 ifelse(Minn.High.School.Grad.1938$High.School.Rank=="L" & Minn.High.School.Grad.1938$sex=="M", "Low Rank Male", "Other"))

That's some loooooooooooong code lines. So, the pro of using long data frame and variable names is that you can easily see what data frame and what variable you should use. The con is that it makes your code so much harder to lay out.

Let's shorten the code by using with. The R description of with is available here. We can remove the repeated call to the data frame name in our ifelse statements, because the with command is telling R that we are using that single data frame for all variable calls:
Minn.High.School.Grad.1938$Rank.by.Sex2 <- with(Minn.High.School.Grad.1938, {ifelse(High.School.Rank=="L" & 

  sex=="F", "Low Rank Female",
                                                 ifelse(High.School.Rank=="L" & sex=="M", "Low Rank Male",
                                                        "Other"))})

Alternatively, the within command can be used to the same end:

Minn.High.School.Grad.1938 <- within (Minn.High.School.Grad.1938, {
  Rank.by.Sex2 <- ifelse(High.School.Rank=="L" & sex=="F", "Low Rank Female",
                     ifelse(High.School.Rank=="L" & sex=="M", "Low Rank Male", "Other"))})

The within method is preferred if you have a lot of variables to recode simultaneously, as you're only specifying the data frame at the start. All the other variables can be inserted ahead of the closing curly brackets.

Key points to note:
  • there must be a comma following the data frame name
  • both commands uses curly brackets, so remember to change your bracket type
  • the with command can be used in situations other than data preparation, e.g. in the formula for a regression, see the R examples
  • the two commands are only functionally equivalent in this example because the with command is being used to construct a permanent variable, there are examples (e.g. in regressions) where with has a transient effect