Thursday, 29 December 2011

R: Nutrient intake data, mixed methods analysis for males

To recap:
  • we've cleaned the data (see earlier posts)
  • we're up to constructing the output data that will be used to estimate the population distribution values for the particular nutrient we are examining (energy)
The code below is my implementation of the mixtran sas macro for amount, where the nutrient is consumed daily. This means that a bunch of the macro code is irrelevant for the current analysis, as we don't need to estimate daily consumption probabilities, or apply them in the analysis, because probability of consumption = 1.

A boxcox analysis is performed to identify the best transformation to normality prior to undertaking the mixed methods analysis. The data contains age as single year, so I am interested in the effect of analysing age as both
  1. age bands used for nutrient reference values - there are standard age bands used for the reference values, so it makes sense to create age factors using these bands and then analyse age as a factor based on these bands, and
  2. a continuous variable, although because the method below is linear it assumes a linear effect of age on nutrient consumption and this is probably an unrealistic assumption
The R code is below. Code lines mentioned relate to the line in the mixtran macro linked above. Note that this method is using a linear mixed effects model on the data, the SAS macro uses a nonlinear mixed effects model, but I am currently unsuccessful in duplicating that exact model into R. The data that I have is much less complex from a covariates perspective compared to the data that was analysed in SAS, so at this point I am not too worried about the change in this part of the method.
#The data equivalency with the SAS macro data:
#SAS macro variable        SAS                R
#subject                seqn            RespondentID
#response                add_sug            IntakeAmt
#repeat                    drddaycd        IntakeDay
#covars_amt                {Age groups}    AgeFactor        Note: no dummies as AgeFactor is an ordered factor
#replicate_var            rndw1            SampleWeight    Note: used as a frequency variable ("replicate" in SAS), used as weight in R
#seq                    seq2            IntakeDay        Note: no dummy as IntakeDay is an ordered factor
#                        indwt            SampleWeight

#Delete any unused RespondentID levels that are related to the female subjects
Male.Data$RespondentID <- Male.Data$RespondentID[,drop=TRUE]
#row 419 reduce the data to one record per person to calculate weights
Male.Subset <- subset(Male.Data,!duplicated(RespondentID), select=c(RespondentID, SampleWeight, IntakeAmt, AgeFactor))
#row 446 add the total weights  to the data frame
Male.Subset$TotWts <- sum(Male.Subset$SampleWeight)
#row 455 create the subgroup weights, these are all by age only
Male.Grpwt <- aggregate(Male.Subset$SampleWeight,by= data.frame(Male.Subset$AgeFactor),sum)
names(Male.Grpwt)[names(Male.Grpwt)=="Male.Subset.AgeFactor"] <- "AgeFactor"
names(Male.Grpwt)[names(Male.Grpwt)=="x"] <- "GrpWts"
Male.Subset <- merge(Male.Subset,Male.Grpwt, by=c("AgeFactor"))
#row 472 get the number of persons in total and by subgroup if given
#the SAS macro produces a dataset of 1 row per agegroup, with agegrp, num subjects in age group, total num subjects output
count <- function(x) {
Numsub <- aggregate(Male.Subset$RespondentID,by=data.frame(Male.Subset$AgeFactor),count)
names(Numsub)[names(Numsub)=="Male.Subset.AgeFactor"] <- "AgeFactor"
names(Numsub)[names(Numsub)=="x"] <- "NumSubjects"
Numsub$TotSubjects <- sum(Numsub$NumSubjects)
#row 492 merge numsub to _persons
Male.Subset <- merge(Male.Subset,Numsub, by=c("AgeFactor"))
Male.Subset <- Male.Subset[order(Male.Subset$RespondentID),]
#convert proc transreg, from row 803
#fitting a linear model with a BoxCox transformation, on the data that includes replicates
Male.Boxcox <-, {
    boxcox(IntakeAmt ~ AgeFactor + IntakeDay,
    lambda= seq(0.05, 1, 0.05), plotit=FALSE)
#locate the row containing the largest log likelihood
Lambda.Value <- Male.Boxcox[Male.Boxcox$y == max(Male.Boxcox$y),1 ]
#row 818 add boxcox values to Male.Data
Male.Data$BoxCoxXY <- (Male.Data$IntakeAmt^Lambda.Value-1)/Lambda.Value
#clean up the R workspace by deleting data frames no longer required
rm(Imported.Data, Long.Data, Male.Boxcox, Male.Grpwt, Numsub)
#row 840 through nlmixed starting at row 996, do all the data preparation in one hit
#working through nlmixed lines now, prework on variable construction
#not sure why subject is fit as a nonlinear effect, fit using a linear method
#use a model with an intercept for fixed effects
Male.lme1 <- lmer(BoxCoxXY ~ AgeFactor + IntakeDay + (1|RespondentID),
            data = Male.Data,
            weights = SampleWeight)
plot(fitted(Male.lme1), residuals(Male.lme1))
Male.Data$lmefits <- fitted(Male.lme1)
#Compare with using age as a continuous variable
Male.Boxcox.Age <-, {
    boxcox(IntakeAmt ~ Age + IntakeDay,
    lambda= seq(0.05, 1, 0.05), plotit=FALSE)
Lambda.Value.Age <- Male.Boxcox.Age[Male.Boxcox.Age$y == max(Male.Boxcox.Age$y),1 ]
Male.Data.Age$BoxCoxXY <- (Male.Data.Age$IntakeAmt^Lambda.Value.Age-1)/Lambda.Value.Age
#Age as continuous fixed effect
Male.lme2 <- lmer(BoxCoxXY ~ Age + IntakeDay + (1|RespondentID),
            data = Male.Data,
            weights = SampleWeight)
plot(fitted(Male.lme2), residuals(Male.lme2))
Male.Data$Agefits <- fitted(Male.lme2)

Some highlights for me:
  1. The same transformation was identified as the best for both analyses.
  2. An ANOVA comparison of the two models showed no significant improvement from using age as a continuous variable, so the age band method is retained for subsequent analyses:
Data: Male.Data
Male.lme2: BoxCoxXY ~ Age + IntakeDay + (1 | RespondentID)
Male.lme1: BoxCoxXY ~ AgeFactor + IntakeDay + (1 | RespondentID)
          Df    AIC     BIC  logLik Chisq Chi Df Pr(>Chisq)
Male.lme2  5 9894.4  9926.4 -4942.2                       
Male.lme1  7 9966.2 10011.1 -4976.1     0      2          1

 The residuals plot for the age band analysis is reasonably nicely behaved given there are >4000 data points, although there could be a slight upwards trend:
The output from the age band lmer analysis is:

Linear mixed model fit by REML
Formula: BoxCoxXY ~ AgeFactor + IntakeDay + (1 | RespondentID)
   Data: Male.Data
  AIC   BIC logLik deviance REMLdev
 9994 10039  -4990     9952    9980
Random effects:
 Groups       Name        Variance Std.Dev.
 RespondentID (Intercept) 0.19408  0.44055
 Residual                 0.37491  0.61230
Number of obs: 4498, groups: RespondentID, 2249

Fixed effects:
                    Estimate Std. Error t value
(Intercept)         13.98016    0.03405   410.6
AgeFactor4to8        0.50572    0.04084    12.4
AgeFactor9to13       0.94329    0.04159    22.7
AgeFactor14to18      1.30654    0.04312    30.3
IntakeDayDay2Intake -0.13871    0.01809    -7.7

Correlation of Fixed Effects:
            (Intr) AgFc48 AgF913 AF1418
AgeFactr4t8 -0.775                    
AgeFctr9t13 -0.761  0.634             
AgFctr14t18 -0.734  0.612  0.601      
IntkDyDy2In -0.266  0.000  0.000  0.000