Tuesday 27 December 2011

Data preparation for analysis, use VBA

Data on daily nutrient intake information is stored within a database, and the users are extracting the relevant data on a nutrient-by-nutrient basis through an export procedure. The exported data is being stored in Excel 2010 data sets, one for each nutrient. There is a set of variables being saved in each export, as well as an optional variable that does not need passing to R, but - when it is present - is situated between subsets of relevant variables. The names of the variables are not exactly the same between the Excel data sets, nor is the starting row for the variable names, and a couple of variable names are saved across two rows instead of one. This creates a bit of a headache for creating a .csv file to import into R.

For one file, the starting rows look like:

Energy, including dietary fibre. 2007 ANCNPAS. Including supps.












nutrient_code Respondent ID Gender Age Body weight samplewt Intake- day1 Intake- day2 Units
267 100013 2 15 59.4 0.333552084 8591.5354 8747.9084 KJ
267 100020 1 12 51.6 0.495283471 12145.8524 13495.798 KJ

For another file, the starting rows look like:

iodine









with supplements















day1 day 2

nut_code id sex age Body weight seifa samplewt Intake Intake Units Upper safe level (UL)
315 100013 2 15 59.4 0.33355208 103.7881 128.07576 ug 900
315 100020 1 12 51.6 0.49528347 140.46202 218.31528 ug 600
315 100050 2 15 62.1 0.33355208 118.21546 184.33722 ug 900

Because there are multiple data sets that will need conversion to R, the best method is to construct some VBA code in an Excel workbook to handle the data cleaning and preparation. The saving grace is that the relative order of the variables appears to be the same across the different data sets.

The code below cleans both files using a button click event in Excel 2010. The cleaned data is saved as a .csv file, using the same file name as the input .xlsx file and into the same directory. The Respondent IDs are integer, but must be dimmed as Long because they exceed the maximum value that can be stored in an Integer type. The variables are defined as arrays, and then redimmed to the correct length just before they have values allocated, so the method will work with data sets of varying numbers of observations. For those unfamiliar with VBA, the ' is used to comment code.

The Open file has been restricted to only Excel 2010 workbooks - this should save the user from scrolling through irrelevant non-Excel files. Because we have constructed the .csv filename to have the correct file extension, we just need to put in the file name at the end but we do need to amend the type.

Sub Button1_Click()
    Dim FilePath As String
    Dim FileName As String
    Dim CSVFile As String
    Dim WB As Workbook
    Dim StartRow As Range
    Dim Count As Integer                    'Used to store the size of the array for each variable, can only be a whole number
    Dim i As Integer                        'Count in the values to each array
    Dim NutrientCode() As Integer           'Numeric code for nutrient, same value for all rows in the same sheet
    Dim Respondent() As Long                'Respondent IDs are larger than 32K so need to make these long rather than integer
    Dim Gender() As Integer                 'Sex, codes are 1 or 2
    Dim Age() As Integer
    Dim BodyWeight() As Double
    Dim SampleWeight() As Double
    Dim Day1Intake() As Double
    Dim Day2Intake() As Double
   
    On Error GoTo Err_Clr

    'From http://msdn.microsoft.com/en-us/library/ff834966.aspx
    'Get workbook name
    FilePath = Application _
        .GetOpenFilename("Excel Files (*.xlsx), *.xlsx")
        If FilePath <> False Then
        'MsgBox "Open " & FilePath
        End If
       
    'Save filename and path for output to csv
    FileName = Dir(FilePath)
    'MsgBox "The filename is " & FileName
    CSVFile = Left(FileName, Len(FileName) - 4) & "csv"
    'MsgBox "The filename is " & CSVFile
               
    'Open workbook
    Set WB = Workbooks.Open(FilePath)
    WB.Activate
   
    'Locate variable name row
    Range("A1").Select
    Do Until (ActiveCell.Value = "nutrient_code") Or (ActiveCell.Value = "nut_code")
        ActiveCell.Offset(1, 0).Select
        Loop
    Set StartRow = ActiveCell
    'MsgBox "Startrow is with cell " & StartRow
    'Get Count value
    Count = 0
    Do While ActiveCell.Value <> ""
        ActiveCell.Offset(1, 0).Select
        Count = Count + 1
        Loop
    'Count is 1 too high due to looping to empty row, reduce by 1
    Count = Count - 1
    'MsgBox "The number of observations is " & Count
    'Go back to variable name row
    StartRow.Select
    'Nutrient intakes
    ReDim NutrientCode(Count)
    For i = 1 To Count
        ActiveCell.Offset(1, 0).Select
        NutrientCode(i) = ActiveCell.Value
        Next
    'Respondent IDs
    StartRow.Select
    ActiveCell.Offset(0, 1).Select
    ReDim Respondent(Count)
    For i = 1 To Count
        ActiveCell.Offset(1, 0).Select
        Respondent(i) = ActiveCell.Value
        Next
    'Gender
    StartRow.Select
    ActiveCell.Offset(0, 2).Select
    ReDim Gender(Count)
    For i = 1 To Count
        ActiveCell.Offset(1, 0).Select
        Gender(i) = ActiveCell.Value
        Next
    'Age
    StartRow.Select
    ActiveCell.Offset(0, 3).Select
    ReDim Age(Count)
    For i = 1 To Count
        ActiveCell.Offset(1, 0).Select
        Age(i) = ActiveCell.Value
        Next
    'Body Weight
    StartRow.Select
    ActiveCell.Offset(0, 4).Select
    ReDim BodyWeight(Count)
    For i = 1 To Count
        ActiveCell.Offset(1, 0).Select
        BodyWeight(i) = ActiveCell.Value
        Next
    'need to check for seifa column at this point
    'Sample Weight
    StartRow.Select
    ActiveCell.Offset(0, 5).Select
    If ActiveCell.Value = "seifa" Then ActiveCell.Offset(0, 1).Select
    ReDim SampleWeight(Count)
    For i = 1 To Count
        ActiveCell.Offset(1, 0).Select
        SampleWeight(i) = ActiveCell.Value
        Next
    'Day 1 intake
    StartRow.Select
    ActiveCell.Offset(0, 5).Select
    If ActiveCell.Value = "seifa" Then
        ActiveCell.Offset(0, 2).Select
        Else
            ActiveCell.Offset(0, 1).Select
        End If
    ReDim Day1Intake(Count)
    For i = 1 To Count
        ActiveCell.Offset(1, 0).Select
        Day1Intake(i) = ActiveCell.Value
        Next
    'Day 2 intake
    StartRow.Select
    ActiveCell.Offset(0, 5).Select
    If ActiveCell.Value = "seifa" Then
        ActiveCell.Offset(0, 3).Select
        Else
            ActiveCell.Offset(0, 2).Select
        End If
    ReDim Day2Intake(Count)
    For i = 1 To Count
        ActiveCell.Offset(1, 0).Select
        Day2Intake(i) = ActiveCell.Value
        Next
       
    'Save all the array values to a new workbook
    Workbooks.Add
    'Output Nutrient ID
    ActiveCell.Value = "NutrientID"
    For i = 1 To Count
        ActiveCell.Offset(1, 0).Select
        ActiveCell.Value = NutrientCode(i)
        Next
    'Output Respondent
    Range("B1").Select
    ActiveCell.Value = "RespondentID"
    For i = 1 To Count
        ActiveCell.Offset(1, 0).Select
        ActiveCell.Value = Respondent(i)
        Next
    'Output Gender
    Range("C1").Select
    ActiveCell.Value = "Gender"
    For i = 1 To Count
        ActiveCell.Offset(1, 0).Select
        ActiveCell.Value = Gender(i)
        Next
    'Output Age
    Range("D1").Select
    ActiveCell.Value = "Age"
    For i = 1 To Count
        ActiveCell.Offset(1, 0).Select
        ActiveCell.Value = Age(i)
        Next
   'Output Body Weight
    Range("E1").Select
    ActiveCell.Value = "BodyWeight"
    For i = 1 To Count
        ActiveCell.Offset(1, 0).Select
        ActiveCell.Value = BodyWeight(i)
        Next
    'Output SampleWeight
    Range("F1").Select
    ActiveCell.Value = "SampleWeight"
    For i = 1 To Count
        ActiveCell.Offset(1, 0).Select
        ActiveCell.Value = SampleWeight(i)
        Next
    'Output Day 1 Intake
    Range("G1").Select
    ActiveCell.Value = "Day1Intake"
    For i = 1 To Count
        ActiveCell.Offset(1, 0).Select
        ActiveCell.Value = Day1Intake(i)
        Next
    'Output Day 2 Intake
    Range("H1").Select
    ActiveCell.Value = "Day2Intake"
    For i = 1 To Count
        ActiveCell.Offset(1, 0).Select
        ActiveCell.Value = Day2Intake(i)
        Next
       
    'Save the worksheet as a csv file for R
    ActiveWorkbook.SaveAs FileName:=CSVFile, FileFormat:=xlCSV


Err_Clr:
    If Err <> 0 Then
        Err.Clear
        Resume Next
        End If
   
   
End Sub