Wednesday 28 December 2011

Nutrient intake data, finalising the data in R

I run plain R in the standard GUI under Windows 7, which means no bells and whistles, and I find that GUI somewhat awkward to program in. Thanks to advice I received a number of years ago, I use Notepad++ as my programming environment. It has line numbering, and when you set the programming language through the menu (Language > R), you get colour-coded syntax. It also has the nice feature of highlighting the bracket pair you are currently working in, which makes it very easy to see whether you have remembered to close all your brackets - it counts backward from the last open bracket.

We're finally in R. :) The code below sets up the data sets for the nutrient intake analysis, which will be the subject of my next posts. If you're following along in the SAS macro, this code is the R substitute for the data preparation in the starting macro called "example1_amount_mixtran_distrib.sas". That file is in the Example 1 zip file, which is downloadable from this webpage if you don't want to download the zip immediately.

SAS syntax files, which are identifiable by the .sas file extension, can be viewed with any text editor, and I use Notepad++ for that as well. If you open that SAS syntax file, the code below covers the part that prepares the data for analysis in the mixtran macro, basically down to line 114.

You'll notice that I comment my code a lot, probably more than most. That is because I have had numerous experiences of coming back to code I wrote six months or a couple of years earlier and needing to revise it. I have found that what is obvious at the time of programming may not be so obvious once time has passed and other programming projects have been completed.

You'll see the use of the reshape2 package. The data is basically a repeated measures design, as there are two 24-hour recall periods for nutrient intake per person. The data coming in from the .csv file constructed earlier has one row per person, with the two nutrient intakes stored as two variables. For repeated measures, the data analysis later requires one row per intake (i.e. two rows per person). As this is the main data preparation stage, it makes sense to reshape the data frame now.
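To make the wide-to-long step concrete, here is a minimal sketch on a toy data frame. The names Toy.Data and Toy.Long are made up for illustration, the toy has only two of the six id columns, and the two rows are copied from the head() output shown at the end of this post. It assumes the reshape2 package is installed.

```r
library(reshape2)

# Toy wide data: one row per person, one column per recall day
Toy.Data <- data.frame(
  RespondentID = c(100013, 100020),
  Age          = c(15, 12),
  Day1Intake   = c(8591.535, 12145.852),
  Day2Intake   = c(8747.908, 13495.798)
)

# Melt to long format: one row per intake, i.e. two rows per person
Toy.Long <- melt(Toy.Data,
                 id.vars       = c("RespondentID", "Age"),
                 measure.vars  = c("Day1Intake", "Day2Intake"),
                 variable.name = "IntakeDay",
                 value.name    = "IntakeAmt")

# Sort so that each person's two intakes sit next to each other
Toy.Long <- Toy.Long[order(Toy.Long$RespondentID), ]
Toy.Long
```

The id columns are repeated on each row, IntakeDay records which recall day the row came from, and IntakeAmt holds the intake itself.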

While I cannot supply the data at this point, I will post the head() result from before the melt so you can see the type of data in the data frame.

#This section of code duplicates the SAS code from example1_amount_mixtran_distrib.sas from line 1 through line 114
#Read in the Australian energy data
Imported.Data <- read.csv("foo.csv",header=TRUE)
#check that headers have imported fine
names(Imported.Data)
length(Imported.Data)
nrow(Imported.Data)
#sort the data frame by subject
Imported.Data <- Imported.Data[order(Imported.Data$RespondentID),]
#check sort worked, look at first few observations
head(Imported.Data)
#melt data frame so that each repeated measure (intake) is one row, and
#create factor to indicate whether it's a day1 or day2 intake.
#remember that reshape2 package must be installed at this point
library(reshape2)
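#columns 1 to 6 are the id variables, the two intake columns are the measured variables
Long.Data <- melt(Imported.Data, id.vars=1:6, variable.name="IntakeDay",
measure.vars=c("Day1Intake", "Day2Intake"))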
names(Long.Data)[names(Long.Data)=="value"]<-"IntakeAmt"
#construct age group factors, lowest age group number = youngest age group
#age groups for analysis are set here (latest edition): http://www.nhmrc.gov.au/guidelines/publications/n35-n36-n37
#ASSUMPTION: no children <1 year old in data
#construct one variable that contains all the age factors
#evaluate from lowest to highest age, evaluation stops when condition is first met
#final fallback is NA rather than "" so that AgeF stays numeric
Long.Data$AgeF <- ifelse(Long.Data$Age<=3,1, ifelse(Long.Data$Age<=8,2, ifelse(Long.Data$Age<=13,3,
    ifelse(Long.Data$Age<=18,4, ifelse(Long.Data$Age<=30,5, ifelse(Long.Data$Age<=50,6,
    ifelse(Long.Data$Age<=70,7, ifelse(Long.Data$Age>70,8,NA))))))))
#set levels and labels together so the labels stay aligned even if an age group is missing from the data
Long.Data$AgeFactor <- factor(Long.Data$AgeF, levels=1:8,
    labels=c("1to3","4to8","9to13","14to18","19to30","31to50","51to70","71Plus"))
#check that each age group number lines up with the right label
table(Long.Data$AgeF, Long.Data$AgeFactor)
#Delete AgeF and any unused AgeFactor levels
Long.Data$AgeF <- NULL
Long.Data$AgeFactor <- Long.Data$AgeFactor[,drop=TRUE]
#Make RespondentID into a factor, it should not be treated as numeric
Long.Data$RespondentID <- as.factor(Long.Data$RespondentID)
#males and females are analysed separately, do not need to be specified as factors,
#construct different data frames for each - the code will duplicate the analysis for the second gender
#ASSUMPTION: males = 1 and females = 2
Male.Data <- subset(Long.Data, Gender==1)
Female.Data <- subset(Long.Data, Gender==2)
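
The age grouping above can also be sketched more compactly with cut() from base R. This is an alternative to the nested ifelse() calls rather than the code I actually ran. Toy.Age and Toy.AgeFactor are made-up names for illustration, and the sketch assumes ages are recorded in whole years.

```r
# Toy ages chosen to sit on and around the group boundaries
Toy.Age <- c(2, 4, 7, 12, 15, 70, 71)

# Upper bounds of each age group; right=TRUE makes each interval (lower, upper],
# which matches the "<=" tests in the nested ifelse() version
Toy.AgeFactor <- cut(Toy.Age,
                     breaks = c(-Inf, 3, 8, 13, 18, 30, 50, 70, Inf),
                     labels = c("1to3", "4to8", "9to13", "14to18",
                                "19to30", "31to50", "51to70", "71Plus"),
                     right  = TRUE)

# Cross-check the ages against the groups they were assigned to
table(Toy.Age, Toy.AgeFactor)
```

Because the breaks and labels are given together, a label cannot drift onto the wrong group when an age group happens to be empty. Unused levels can still be dropped afterwards in the same way as above.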

The result from head(Imported.Data), before the melt, is:
  NutrientID RespondentID Gender Age BodyWeight SampleWeight Day1Intake Day2Intake
1        267       100013      2  15       59.4    0.3335521   8591.535   8747.908
2        267       100020      1  12       51.6    0.4952835  12145.852  13495.798
3        267       100050      2  15       62.1    0.3335521  14202.496  13724.582
4        267       100100      2   4       18.5    0.3563699   8621.690   6218.391
5        267       100128      2   2       13.2    0.1666111   5140.690   6427.673
6        267       100370      2   7       24.9    0.3563699   7418.029  13620.542