Sunday, July 27, 2014

If you're handing out advice, you'd better not be talking out of your ass

This is a story of ignorance and liberal phrasing. Sounds complicated, but bear with me, it'll be a fun ride.
So, this lovely primer found its way into the world:
http://www.cell.com/cell/abstract/S0092-8674(14)00864-2

It teaches all us boys and girls how to set up metagenomic studies and analyze the data that may come out of them. Actually, it only focuses on 16S and does a rather bad job of discussing anything relevant, but hey, what do you expect? If you want a power analysis or to know how many samples you need, ask your statistician! These people clearly aren't the ones to answer any of those questions anyway.
None of this would be a problem, had they not actually gotten some things completely ass backwards. Thus, this unpleasant rant.
First, there's this:
"Lauber and colleagues recently showed that storage for 2 weeks at temperatures ranging from −80°C to 20°C did not significantly affect patterns of between-sample diversity or the abundance of major taxa (Lauber et al. 2010)". So basically, they're saying you can keep the samples in your pocket for a week and it will all be fine. Really? I work on stool samples and in my experience they are rather sensitive to thawing. But ok, I have been wrong before (once!). What does the original publication say? Well, ignoring the appalling setup and analysis, here's a nice quote: "One sub-sample was omitted from the data set (Fecal 1 Day 14, 20°C replicate 2) due to visible fungal growth prior to DNA extraction". Are you seriously expecting me to believe that this won't change the composition of your sample? You must be out of your mind. I extend a challenge to Dr. Lauber: that he eat his lunch after having left it at 20°C for 2 weeks; what could go wrong? Sure, you won't see the difference if you compare to just one other sample (yes, they have replicates of TWO samples) as the differences between them are huge. But to take this and conclude that the shifts in community composition are "minor" is idiotic. You might even say it's the result of eating two week old lunch...
I've tried getting my hands on their data but the SRA number doesn't actually exist. I'll get back to this once I have it in my hands.

And then there's this:
"The number of sequences obtained in a sequencing run can vary across samples for technical rather than biological reasons, and these sequencing depth artifacts can affect diversity estimates. One approach to account for variable sequencing depth is to use frequencies of OTUs (operational taxonomic units, described below) within samples (i.e., to normalize by total sample sequence count). We recommend against this approach, as we have found that it is subject to statistical pitfalls and can lead to samples clustering by sequencing depth (Friedman and Alm, 2012; C. Lozupone, J.G.C., and R.K., unpublished data)."
It goes on to propose rarefaction and then to cite a paper that explains quite plainly why rarefaction is generally a stupid thing to do (McMurdie and Holmes, 2013). But that's not my problem at all. You're free to do rarefaction and throw 50% of your data out the window. What do I care? It's not my money. Just don't tell your funding body.
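To put a number on the "out the window" part, here is a back-of-the-envelope sketch. The sample names and read counts are made up for illustration, and it assumes you rarefy to the depth of the shallowest sample, which is a common choice but not the only one.

```python
# Hypothetical read counts per sample (invented numbers, not from any study).
depths = {"s1": 12000, "s2": 30000, "s3": 18000, "s4": 60000}

# Assumption: rarefy every sample down to the shallowest one.
rarefy_to = min(depths.values())

kept = rarefy_to * len(depths)       # reads that survive rarefaction
total = sum(depths.values())         # reads you actually paid for
discarded_fraction = 1 - kept / total

print(f"rarefying to {rarefy_to} reads discards {discarded_fraction:.0%} of the data")
```

The more uneven your sequencing depths, the more you throw away.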
No, my problem is with the assertion that abundances "[are] subject to statistical pitfalls and can lead to samples clustering by sequencing depth", which is simply not true. They cite a lovely paper by Friedman and Alm, which I doubt they have read. If they had, they would know just how silly their point is.
I do recommend reading the Friedman paper, but I'll put their point plainly here: You cannot use correlation analysis on compositional data, because it's going to be crap. This is because each measurement is by definition dependent on all the others, and this will break the correlation. Pearson (yeah, that one) had figured this out in 1897. Friedman and Alm then clearly show a way of getting around this, by employing a log-ratio transformation from good old Aitchison. So, there's no problem with abundance values, you should just not use them wrongly. As to the "clustering by sequencing depth", I'll dismiss that out of hand since they can't be bothered to show any data for it.
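For the curious, here is a minimal sketch of one of Aitchison's log-ratio transformations, the centered log-ratio (CLR). To be clear, this is not the procedure from the Friedman and Alm paper, just the basic idea of working with log-ratios instead of raw proportions. The function name, the OTU counts and the pseudocount for zeros are all my own assumptions.

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log of each part minus the mean log of all parts.

    The pseudocount (an assumption, not a recommendation) keeps log() defined
    when a count is zero.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

sample = [80, 120, 40, 16]            # hypothetical OTU counts, 256 reads total
deeper = [c * 10 for c in sample]     # same community, sequenced 10x deeper

# No zeros here, so the pseudocount can be switched off for the comparison.
clr_sample = clr(sample, pseudocount=0)
clr_deeper = clr(deeper, pseudocount=0)

print([round(v, 3) for v in clr_sample])
print([round(v, 3) for v in clr_deeper])
```

Both lines print the same values: without a pseudocount the CLR only depends on the ratios between parts, so sequencing depth drops out.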
But it gets better. The compositional problem doesn't arise from total sum scaling (the one that gives you abundances that sum to 100%), as the authors of the primer would have you believe (and I'm sure they believe it themselves). It comes from the way the measurement is done. Let me put it this way: any value that you measure is only valid in the context of the number of measurements you've done. So, if you have 80 measurements that reflect the presence of Bacteria A and you don't tell me how many times you've measured, then 80 doesn't mean anything! It only becomes a coherent measure when you're saying 80 out of 256 are Bacteria A. And here, the compositional issue is already present. Because if in your "community" Bacteria A grows and is peppy and everything else stays the same, then in the next measurement it might be 170 out of 256. And thus, because of the way you measure, the values of Bacteria A and "the rest" will have a perfect negative correlation. They will simply have to.
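If you don't believe me, here is the same argument as a tiny sketch. The 80 and 170 out of 256 are the numbers from above; the two intermediate time points and the helper function are invented for illustration.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

DEPTH = 256  # number of measurements (reads) per sample, fixed by the method

# Bacteria A grows over four hypothetical time points; nothing else changes
# in the community, but only DEPTH reads are ever observed.
reads_a = [80, 110, 140, 170]
reads_rest = [DEPTH - a for a in reads_a]   # "the rest" is forced downwards

r = pearson(reads_a, reads_rest)
print(reads_rest, round(r, 6))
```

"The rest" appears to shrink even though it did nothing, and the correlation comes out at minus one. No scaling to 100% was involved anywhere.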
One last thing: Friedman and Alm also nicely make the point that the compositional effect is going to be stronger the less diverse your community is. Rarefaction does exactly that! It minimizes the diversity in your sample, thus exacerbating the compositional effect. And this is how you get things ass backwards!
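Here is a rough sketch of what rarefaction does to the number of OTUs you observe. The community (two dominant OTUs plus a tail of rare ones), the OTU names, the rarefaction depth and the seed are all invented, and `rarefy` is my own toy helper, not a function from any real package.

```python
import random

def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from a dict of OTU counts."""
    reads = [otu for otu, n in counts.items() for _ in range(n)]
    picked = rng.sample(reads, depth)
    out = {}
    for otu in picked:
        out[otu] = out.get(otu, 0) + 1
    return out

# Hypothetical sample: 1000 reads, two dominant OTUs and 50 rare ones.
counts = {"otu_a": 600, "otu_b": 300}
counts.update({f"rare_{i}": 2 for i in range(50)})

rng = random.Random(0)   # fixed seed so the sketch is repeatable
rarefied = rarefy(counts, depth=100, rng=rng)

richness_before = len(counts)     # OTUs seen in the full sample
richness_after = len(rarefied)    # OTUs that survive rarefaction
print(richness_before, richness_after)
```

Subsampling can never add an OTU, it can only lose them, and the rare ones go first. What is left is a less diverse sample, which is exactly where the compositional effect bites hardest.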