Wednesday, January 15, 2014

Tips & Tricks 1: Averaging Data By Group

I thought it would be helpful to new users of Geomorph and R to discuss some tips and tricks for manipulating morphometric datasets in R. So I am starting a series called "Tips & Tricks". These will be published as I think of them, and assume you have already inputted your data into R using geomorph functions, such as readland.nts, readmulti.nts, readland.tps, read.morphologika, or simple read.csv. Remember that the read... functions in Geomorph turn the morphometric data into a 3D array (like a stack of filing cards; each card representing an individual and having on it a p landmarks by k dimensions matrix of coordinates). 

Exercise 1 - How to average a dataset of morphometric data (shape data) by group.

Lets start with the input objects. We have a 3D array of 30 specimens shape data (p by k coordinate data) called data, and a grouping variable group, which is of class factor, with m levels. We want to average the morphometric data by these m levels.

[Here you should familiarise yourself with the factor functions, such as  as.factor() and levels(), see here for more details].

The first thing to do is to change the 3D array data into a 2D array (n by p x k) with geomorph function two.d.array()

  x <- two.d.array(data)

I will explain a long-winded way of doing things, which is intuitive but cumbersome. Then I will show a simple one line version that utilises two useful base functions in R, which make life much simpler!

1) Set up the output matrix, which will be filled by a for loop:

  p <- dim(data)[1] # the number of landmarks
  k <- dim(data)[2] # the dimensions of the coordinates
  Y <- array(NA, dim=c(p,k,length(levels(group))) #new empty array to fill
  dimnames(Y)[[3]] <- levels(group)# set group levels as new names

Now write a for loop that subsets the 2D array by group and calculates the mean shape using geomorph function mshape():

for (i in 1: length(levels(group))){
grp <- x[which(group==levels(group)[i]),]# makes a subset 2D array with only specimens of the ith group by using which().
foo <- arrayspecs(grp ,p,k, byLand=F) # turn back into an 3D array
    Y[,,i] <- mshape(foo) # place into the new 3D array
Since mshape() requires a 3D array, the switching between 2D and #D is necessary, but cumbersome. While this procedure is very long-winded, it does allow you to get used to manipulating arrays, and learn how to do procedures by levels of a grouping factor.

But if you look under the hood, you see that mshape is simply taking the means of the rows and columns of the matrix. This leads us to:

2) This for loop can in fact be replaced with one line:
means <- rowsum(x, group)/as.vector(table(group))# rowsum() is a simple base function to get the sum of the rows, while table() is a great function for getting group sizes (group n).
Y <- arrayspecs(means ,dim(data)[1],dim(data)[2], byLand=F)# then all you have to do is put the averaged data back into an 3D array.

Both ways produce the same result. As is often the case with R, 'there are many ways to skin a cat'! The first example works with functions you likely already know, and allows you to become more proficient with writing your own for loops and searching or subsetting with which().  The second example shows you that looking deeper into R base functions will uncover little gem functions, which make our life much easier!

Enjoy averaging by group!



  1. Thanks Emma!

    There is another way to find means (or other statistics) by groups

    means <- aggregate(x ~ group, FUN=mean)

    It will append the group levels as the first column, but that can be fixed by

    means <- (aggregate(x ~ group, FUN=mean))[,-1]

    What's that you say, why would this method be better to use?

    1) You can use any function. E.g., FUN = mean, sd, var, etc. Even make a special one as a separate function, if you like.

    2) It can aggregate over multiple factors. For example,

    means <- aggregate(x ~ A + B, FUN=mean)

    will produce group means for all levels of B within all levels of A. This is the same as doing this

    group <- factor(paste(A, B))
    means <- aggregate(x ~ group, FUN=mean)

    So, it is a bit simpler (appended columns for factor levels, notwithstanding)

  2. Hello there!

    I am a newcomer in both R and geomorph, and here I am struggling with a dataset of my own, in order to establish comparisons on the shape of five different species of fish. I have used ImageJ for landmarks and MorphoJ for Procrustes fit and outliers visualization... Resulting in a 63 rows x 49 columns data matrix, where each pair of column consist in the general XnYn coordinates combination, being the last column the fish IDs. As all examples I have read on geomorph are in a completely different array of data, I am now wondering if I can use the package for establishing the comparisons I need... Please, give me your thoughts!

    Regards, Bowie

    1. Hi Bowie, thanks for your message. I've actually written a blog post showing how to read in those type of data matrices. Please see:

  3. Hi,

    This may be a dumb question, but do you use Procrustes coordinates or raw landmark coordinates for this? If I understand, using raw landmarks could seriously skew the mean because of rotation and such?

    1. The Procrustes coordinates should only be used in this instance, yes.

    2. Thanks for the info. Would you then want to do a second gpa alignment on the mean data? Also can this be modified for centroid size?

    3. One GPA is all that's needed, on the raw specimen coordinate data. The group means after GPA are ready for further analysis, like a PCA using plotTangentSpace, or in a phylogenetic analysis.

      And yes, for centroid size, CS, you do: CS.means <- (aggregate(CS ~ group, mean))[,-1]