Wednesday, 28 December 2011

Nutrients consumed daily, R analysis story board

For any relatively complicated programming task, I prefer to start with a story board that lays out the steps I will take to complete it. This may end up being a heavily edited post if my plans change part way through.

The point of this analysis is to produce a cleaned distribution of nutrient intakes for the Australian population, using two 24-hour intake recall periods. The general method being followed is the NCI method, from here, where it has been implemented in SAS. I was a SAS programmer for around 12 years, but never had much to do with macros as all my programming tasks were one-offs. As a result, understanding the SAS code has taken me a reasonable period of time.

The approach will use the data sets constructed in the previous blog post, which I cleaned in Excel using VBA.

The main steps are:
  1. Identify a test data set (in this case, energy (kJ) intake),
  2. Reprogram the SAS code in R, hard-coding the variable names from the cleaned data set, as these will be standard across the analyses of the various nutrients,
  3. Compare the output distribution to that obtained by the previous method of analysis, then, if all goes well,
  4. Output the cleaned distribution to a .csv file for subsequent use by the client, and then finally
  5. Automate the process for the client, e.g. include GUI features for the client to select the input .csv data set, so the user does not need to change any of the R code.
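
Steps 4 and 5 could look something like the sketch below. This is only an illustration: `choose_input()` and `run_analysis()` are hypothetical names, the default output file name is made up, and the cleaning step is a placeholder that passes the data straight through. It relies on base R's `file.choose()` to open a file dialog when no path is given.

```r
# Hypothetical helper: return the path of the input .csv file.
# If no path is supplied and R is running interactively, open a file dialog.
choose_input <- function(path = NULL) {
  if (is.null(path)) {
    if (!interactive()) stop("Supply 'path' when running non-interactively.")
    path <- file.choose()
  }
  if (!file.exists(path)) stop("File not found: ", path)
  path
}

# Hypothetical wrapper: read the chosen file, clean it, write the result.
run_analysis <- function(path = NULL, out_path = "cleaned_distribution.csv") {
  intake <- read.csv(choose_input(path), stringsAsFactors = FALSE)

  # Placeholder only: the real cleaning and modelling steps would go here.
  cleaned <- intake

  # row.names = FALSE keeps R's row numbers out of the client's file.
  write.csv(cleaned, out_path, row.names = FALSE)
  invisible(out_path)
}
```

The client would then only need to call `run_analysis()` and pick a file, without editing any of the R code.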
The method will use a number of R packages in addition to the standard installation. For example,
  • the MASS package is used to Box-Cox transform the data towards normality,
  • the reshape2 package is required to melt the data so that repeated measures are separate observations, not separate variables - this will double the length of the data set as all observations have two 24-hour recall periods, and
  • mixed effects analysis is being undertaken using lme4.
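
As a rough sketch of how these three packages might fit together, the code below runs on simulated data. It is not the NCI method itself, only the melt, transform and fit sequence. The column names (`id`, `day1`, `day2`, `energy_kj`), the simulated values and the grid search for lambda are all assumptions made for illustration.

```r
library(MASS)      # boxcox()
library(reshape2)  # melt()
library(lme4)      # lmer()

set.seed(1)
n <- 200

# Simulated wide data: one row per person, one column per 24-hour recall.
# These names and values are hypothetical, not the client's variables.
person <- rnorm(n, mean = 0, sd = 0.25)
wide <- data.frame(
  id   = factor(seq_len(n)),
  day1 = exp(9 + person + rnorm(n, sd = 0.3)),
  day2 = exp(9 + person + rnorm(n, sd = 0.3))
)

# Melt so that each recall is its own row; this doubles the number of rows.
long <- melt(wide, id.vars = "id",
             variable.name = "recall", value.name = "energy_kj")

# Profile the Box-Cox log-likelihood over a grid and take the best lambda.
bc <- boxcox(energy_kj ~ 1, data = long,
             lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Apply the transformation (log when lambda is effectively zero).
long$energy_bc <- if (abs(lambda) < 1e-8) {
  log(long$energy_kj)
} else {
  (long$energy_kj^lambda - 1) / lambda
}

# Random intercept per person separates between- and within-person variation.
fit <- lmer(energy_bc ~ recall + (1 | id), data = long)
print(VarCorr(fit))
```

The variance components printed at the end are the between-person and within-person (residual) variances on the transformed scale.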
I have been working with two of the SAS macros for the past couple of months, and the R code will be dramatically shorter than the SAS code as no interim data sets will be output. Because this process only addresses nutrients consumed daily (rather than nutrients consumed episodically, e.g. alcohol, or foods rather than nutrients), the R version is also simpler than the SAS code: it does not have to generate probability distributions for intake. Once this R analysis has been implemented, I will rework it for the episodic case. To ensure that the correct R program is used for a given data set, I will incorporate a "missing value check". For the "consumed daily" nutrients, the data set contains only observations with consumption, so by definition there is no missing data, but it is a good idea to check all assumptions.
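
A minimal sketch of such a check in base R is below. The function name `check_consumed_daily()` and the column names in the usage example are hypothetical. Treating zero or negative intakes as a failure is my own assumption about how non-consumption might be coded, so adjust that to suit the data.

```r
# Hypothetical check: stop if the data do not look like a "consumed daily"
# nutrient, i.e. if any intake value is missing.
# Assumption: zero or negative intakes are also treated as non-consumption.
check_consumed_daily <- function(data, intake_cols) {
  not_found <- setdiff(intake_cols, names(data))
  if (length(not_found) > 0) {
    stop("Columns not found: ", paste(not_found, collapse = ", "))
  }

  values <- unlist(data[intake_cols], use.names = FALSE)
  if (!is.numeric(values)) stop("Intake columns must be numeric.")

  n_missing  <- sum(is.na(values))
  n_non_pos  <- sum(values <= 0, na.rm = TRUE)
  if (n_missing > 0 || n_non_pos > 0) {
    stop(sprintf(
      "%d missing and %d non-positive intakes found: use the episodic program.",
      n_missing, n_non_pos
    ))
  }
  invisible(TRUE)
}

# Usage with made-up values:
recalls <- data.frame(day1 = c(8000, 9000), day2 = c(8500, 7000))
check_consumed_daily(recalls, c("day1", "day2"))
```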

Along the way, I will be doing some additional testing. The NCI method linked above uses a number of covariates, such as age group. The client has been analysing the data (not using the NCI method) separately by age group. I will be testing the effect on the overall intake distribution of:
  1. analysing each age group separately (3 age groups), versus
  2. using age group as a fixed-effect covariate, versus
  3. using age in years as a continuous fixed-effect variable instead of the age group covariate (there are >4000 observations, so there are multiple observations per year of age).
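
The three specifications could be set up in lme4 roughly as follows. This is a sketch on simulated data: the age range, the age-group cut points, the variable names and the response `y` (standing in for a transformed intake) are all invented for illustration, and a random intercept per person is assumed.

```r
library(lme4)

set.seed(2)
n <- 300

# Simulated long data: two recalls per person. All values are made up.
age    <- sample(2:16, n, replace = TRUE)
person <- rnorm(n, sd = 0.5)
long <- data.frame(
  id     = factor(rep(seq_len(n), times = 2)),
  recall = factor(rep(c("day1", "day2"), each = n)),
  age    = rep(age, times = 2)
)
# Hypothetical age bands; the client's actual groups may differ.
long$age_group <- cut(long$age, breaks = c(1, 6, 11, 16),
                      labels = c("2-6", "7-11", "12-16"))
long$y <- 10 + 0.1 * long$age + rep(person, times = 2) +
  rnorm(2 * n, sd = 0.7)

# 1. Separate model for each age group.
fits_separate <- lapply(split(long, long$age_group), function(d) {
  lmer(y ~ recall + (1 | id), data = droplevels(d))
})

# 2. Age group as a fixed-effect covariate.
fit_group <- lmer(y ~ recall + age_group + (1 | id), data = long)

# 3. Age in years as a continuous fixed effect.
fit_years <- lmer(y ~ recall + age + (1 | id), data = long)

# Between-person variance under each pooled specification.
between_var <- function(fit) as.data.frame(VarCorr(fit))$vcov[1]
print(c(group = between_var(fit_group), years = between_var(fit_years)))
```

Comparing the estimated intake distributions themselves would need the back-transformation step as well, which is not shown here.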