The main issue with using longer dataframe and variable names is that they take up so much space in the code. You can end up spending more time trying to lay out your code so it is readable, and less time thinking and doing analyses. But R comes with two handy tools for shortening code:
- with
- within
I had been using R for a few years before I found out about these two handy commands, and now I use them all the time.
Let's work through these with an example. Because R comes with the package MASS present, I'll use data supplied with MASS. We're going to use "minn38" which is data relating to Minnesota high school graduates of 1938. I'm using social science data as the example as this is the type of data I use most often.
First we need to get MASS into our workspace. Once we load the package in, the MASS datasets become available for us to use:
library("MASS")
and we check we can see the minn38 dataset:
str(minn38)
Now let's construct a new data frame off this one with longer, more informative names:
Minn.High.School.Grad.1938 <- minn38
names(Minn.High.School.Grad.1938)[names(Minn.High.School.Grad.1938)=="hs"]<-"High.School.Rank"
names(Minn.High.School.Grad.1938)[names(Minn.High.School.Grad.1938)=="phs"]<-"Post.High.School.Status"
names(Minn.High.School.Grad.1938)[names(Minn.High.School.Grad.1938)=="fol"]<-"Father.Occup.Level"
names(Minn.High.School.Grad.1938)[names(Minn.High.School.Grad.1938)=="f"]<-"Count"
If you run the str command on the new data frame, you will see that it contains the updated variable names.
Now, just for fun, let's create a new variable that interacts Sex and High School Rank. We can use the ifelse command. Because the names are so long in this example, I've only coded for two levels and then cheated by setting every other level to "Other".
Minn.High.School.Grad.1938$Rank.by.Sex <- ifelse(Minn.High.School.Grad.1938$High.School.Rank=="L" & Minn.High.School.Grad.1938$sex=="F", "Low Rank Female",
ifelse(Minn.High.School.Grad.1938$High.School.Rank=="L" & Minn.High.School.Grad.1938$sex=="M", "Low Rank Male", "Other"))
That's some loooooooooooong code lines. So, the pro of using long data frame and variable names is that you can easily see what data frame and what variable you should use. The con is that it makes your code so much harder to lay out.
Let's shorten the code by using with. The R description of with is available by typing ?with at the console. We can remove the repeated references to the data frame name in our ifelse statements, because the with command tells R to look up every variable name in that single data frame:
Minn.High.School.Grad.1938$Rank.by.Sex2 <- with(Minn.High.School.Grad.1938, {ifelse(High.School.Rank=="L" &
sex=="F", "Low Rank Female",
ifelse(High.School.Rank=="L" & sex=="M", "Low Rank Male",
"Other"))})
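As a quick sanity check, here is a minimal sketch comparing the long-hand and with versions on the original minn38 column names. The short name grads and the two result vectors are made up for illustration; the sketch also shows that the curly brackets can be dropped when with is given a single expression.

```r
library(MASS)  # supplies the minn38 data

# 'grads' is a hypothetical short name used only for this sketch
grads <- minn38

# Long-hand: the data frame name is repeated for every variable
long.hand <- ifelse(grads$hs == "L" & grads$sex == "F", "Low Rank Female",
             ifelse(grads$hs == "L" & grads$sex == "M", "Low Rank Male", "Other"))

# Same logic inside with(); variable names are looked up in grads
short.hand <- with(grads,
  ifelse(hs == "L" & sex == "F", "Low Rank Female",
  ifelse(hs == "L" & sex == "M", "Low Rank Male", "Other")))

# Compare the two results and tabulate the new categories
identical(long.hand, short.hand)
table(short.hand)
```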
Alternatively, the within command can be used to the same end:
Minn.High.School.Grad.1938 <- within (Minn.High.School.Grad.1938, {
Rank.by.Sex2 <- ifelse(High.School.Rank=="L" & sex=="F", "Low Rank Female",
ifelse(High.School.Rank=="L" & sex=="M", "Low Rank Male", "Other"))})
The within method is preferred if you have a lot of variables to recode simultaneously, as you're only specifying the data frame at the start. All the other variables can be inserted ahead of the closing curly brackets.
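To illustrate recoding several variables in one go, here is a small sketch using the original minn38 column names. The data frame name grads and the three new column names are invented for this example and are not part of the MASS data.

```r
library(MASS)  # supplies the minn38 data

# 'grads' is a hypothetical short name; minn38 itself is left untouched
grads <- minn38

grads <- within(grads, {
  # Each assignment inside the braces becomes a column of the result
  low.rank <- hs == "L"
  female   <- sex == "F"
  # Variables created earlier in the block can be used later in the same block
  low.rank.female <- low.rank & female
})

# The three new columns now sit alongside the original ones
str(grads)
```

The data frame is named once, at the top, and every recode after that uses bare variable names.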
Key points to note:
- there must be a comma following the data frame name
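- both commands use curly brackets in these examples, so remember to change your bracket type (the curly brackets are only strictly needed when the body contains more than one expression)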
- the with command can be used in situations other than data preparation, e.g. in the formula for a regression, see the R examples
- the two commands are functionally equivalent in this example only because the result of with is assigned back to the data frame as a permanent variable. On its own, with just returns the value of the expression and leaves the data frame unchanged, so in other uses (e.g. in regressions) its effect is transient
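The last two points can be sketched as follows, again on the original minn38 column names. The object names (total.grads, rank.by.sex, fit, minn38.extra) and the share column are hypothetical, and the Poisson model is only there to show with wrapped around a regression call, not as a recommended analysis of these data.

```r
library(MASS)  # supplies the minn38 data

# with() evaluates the expression and returns its value; minn38 is not modified
total.grads <- with(minn38, sum(f))

# Total counts cross-tabulated by sex (rows) and high school rank (columns)
rank.by.sex <- with(minn38, tapply(f, list(sex, hs), sum))
rank.by.sex

# with() around a regression: the formula's variables are found in minn38
fit <- with(minn38, glm(f ~ sex + hs, family = poisson))

# within() instead returns a modified copy of the data frame
minn38.extra <- within(minn38, share <- f / sum(f))

names(minn38)        # unchanged by any of the with() calls
names(minn38.extra)  # has the extra 'share' column
```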