Wednesday, February 15, 2017

Open data my a**

Reading data sharing policies in most major scientific journals, one would think “open data” is almost a given considering all of the standards and sharing platforms that are in place. Moreover, in the “reproducibility” crisis that the web and journal editorials are abuzz with, one would furthermore be inclined to think that there is a constant and fervent enforcing of this kind of sharing: if you publish something, you must share the data on which your conclusions are based. End of story.

For the most part, in the field of metagenomics at least, one would be wrong. Over the past couple of years, whenever I have tried to download a dataset (be it full shotgun sequencing or just 16S data), I have almost always run into unnecessary and uncomfortable obstruction. One prominent exception is the human microbiome project (HMP). Hats off to all involved for providing a great example of open data sharing for an entire community. All the shotgun and 16S data generated are a couple of clicks away. The subject of metadata...well, that is a horse of a different color.

Now to the “bad”. I am not going to start listing all the culprits that have shown how it should not be done, rather I prefer to present the latest ordeal of a kafkaesque nature when trying to request published data. The paper in question is as follows: “Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity”. Sorry for those who cannot access this paper, you will have to take my word for it that it is an interesting paper that has been built on a fascinating dataset (1135 gut metagenomes, shotgun sequencing as well as 16S data). I was especially interested in the raw reads, so when I scrolled to the end of the publication to find that “The raw sequence data for both MGS and 16S rRNA gene sequencing data sets, and age and gender information per sample are available from the European genome-phenome archive at accession number EGAS00001001704” I was excited to see an accession number (sometimes you don’t even have this). My excitement was short lived when I followed the link and realized the data was not available because the data is not actually “public”. It is under some sort of “embargo”, with the rights to it controlled by “Lifelines-DEEP” according to the EBI’s website.

This was immediately starting to reek of something foul. I had not even thought about trying to access the metadata and I am already required to jump through hoops just to get the raw reads. But fine, a couple of emails asking for data access, getting some sort of username and password and then getting the data is not the end of the world (but just to be clear this is not the way these things should work).

A couple of polite emails later (both to the authors and the journal editors), I am asked to fill in the “data request” form. Take a couple of minutes to scroll through that carefully… What does that read like to you? Cause to me, this is not a trivial data request form, especially not for data that I should be able to download without having to let the authors even know about it. Some questions come to mind: what binds the recipients of this proposal to confidentiality? (am I just letting the “competition” know exactly what I am up to?) and more importantly what if I get the data and then do something else with it? What is this proposal, a binding contract? Who will enforce its terms?

So, I decided to dig a bit deeper into Lifelines-DEEP and found their data access policy. This contains a lot of standard stuff, but also some obvious “gems”. On a positive note they do assure confidentiality. However they also state that “After approval of the proposal, the applicant receives a financial agreement and a Data/Material Transfer Agreement (DMTA). The DMTA specifies the conditions for the use of the LifeLines data and/or biomaterials, Intellectual Property (IP) and warranties. The access fee for data and biomaterials supports the handling and service costs of LifeLines and includes a small contribution to the ongoing data and sample maintenance. After signing the offer and DMTA the researcher is granted access (but not any ownership rights) to use the data and/or samples to conduct the approved research project for a particular period of time.” Given that the EBI already provides data maintenance at no cost for the whole scientific community, it seems that the only “added value” that LifeLines is providing is this extra level of control. Effectively, the fee serves to finance the bureaucracy that charges the fee.

At this point, it was obvious to me that getting access to the raw data will not be trivial and once I had that data I could only make restricted use of it. How restricted? Well, according to the same document: “Prior to submission, researchers are asked to send their abstracts and manuscripts to the Lifelines Research Office for a general check on correct reference to LifeLines, to check whether the content of the manuscript fits the initial approved research proposal, and to identify possible privacy risks.”

In the meantime, I had been in contact with the editor and had received a final decision that they “established that the institutional conditions for access to these data are not unreasonable”. Additionally, a correction had been added to the manuscript: “A statement about the informed consent regulations for data access to the Lifelines population cohort was inadvertently omitted from the acknowledgments.”

Since I only wanted to “play around” with this data and try some things (including checking some specific aspects of the original analysis which I found troubling), I do not actually qualify for its access. And here we are, months later, with no data and the very real prospect of writing a “request of reproducing results” proposal that will most likely be rejected. Maybe I should just let it go and forget about this dataset. But in the past month, I found another one, again with data produced by some “third party” which is not willing to share. This is exactly why we should fight this: it sets a very troubling precedent. I am sympathetic to the privacy issues involved, and I understand that they are often not initially obvious; however, this case has nothing to do with such issues. There are thousands of human microbiome samples readily available and there should be no exceptions to making new samples public. Subject metadata gets a bit more complicated and maybe I will write about that at some later stage.

Wednesday, August 20, 2014

Stop being so god damn nice all the time

Sure, ad hominem is a terrible fallacy. Sure, calling people names is not nice. In today's world there seems to be a consensus that we need to be nicer to each other and open to new ideas. I hold that these two things are incompatible. I will argue in the following that scientific discourse needs to get less nice, more bellicose, funny and vitriolic and that will result in a more friendly environment as well as much better output.

Let's start off easy, by focusing on our friends. I have a couple of very good friends, for whom I would literally do anything (I really mean anything. Would even make out with a woman for them). It is with them that I have had all the conversations that have ever changed my mind. It is to them that I go with an open mind to discuss issues that I am genuinely undecided on. It is them that I call names all the time. It is also them that continuously take the piss out of me. And it bloody works! Here I'll assert that even if you have the nicest of friends (maybe boring Mormons or something) you're still more likely to be loose-tongued around them than in any other circumstance. And that's because we're much more relaxed around our friends and we can really be ourselves. Here's my point: if you take your boring politically correct hat off and really become yourself again, you're a cursing, judgmental, well poisoning, ad-hominem throwing little twerp. If you don't buy that from my brilliant friends analysis, then just fucking go online and see what people sound like behind the veil of anonymity that the internet provides.
The same holds for scientists (though they're not really actual people). We read a paper (not even one that disagrees directly with our work) and a lot of the time go: "how did this piece of shit ever get published?". You know what happens then? Nothing! We ridicule it with a colleague, point out its terrible flaws and then shred it (I don't really have a shredder but now I know what I want for Christmas). This is unacceptable. What should happen is we should make fun of whoever published it and make the record clear that that one paper is crap. But we're going to be nice about it. And even if we go as far as publicly disapproving of it, we'll do it in such a high brow, nice way, that it won't really be clear to everyone that the paper should be considered a joke. So what, you'll ask. They live to research another day and the paper will probably be discarded in the long run anyway. Well, true, but with two attached costs:
1. Some people will base novel research on that publication. They, pardon my french, will get a surprise fisting. So will their funding body.
2. The above isn't that big of a problem. Scientists spend loads of time and money chasing crap. It's all business as usual. The big problem is that the bar gets lowered. If shit can so easily make it through the review process, then why bother doing things properly? Even worse than that, the reasoning usually also considers the level at which your competition may be satisfied publishing at and scooping you. And because you already have a rather low opinion of your competition, that level will be quite low. And then, when it's out, mum's the word.

Trying to be nice in scientific discourse is stupid and counterproductive! The reason we do it is that we think it leads to a more focused discussion. Oh, if only we phrase things in the language of constructive criticism. What a load of bollocks. What the hell is that? Have you ever received criticism that didn't make you cringe and want to punch someone? No, you haven't! (Unless it was from your friends, and they usually called you a moron in the process of outlining their position). Off the bat, there is so much investment in anything you say as a scientist that criticism cannot but seem a siege on everything you hold dear. This problem has not and will not be solved by being nice. The only thing being nice does is stifle genuine conversation and debate, because we fear we'll hurt someone's feelings.
Thus, I propose the solution to be a forum where the two opposing sides are free to throw shit at each other. Basically, British parliamentary debate. One other rule though: they have to grab a drink together afterwards.
So come on idiots, let's do this shit together! Call your collaborator a moron, for science.

Sunday, July 27, 2014

If you're handing out advice, you'd better not be talking out of your ass

This is a story of ignorance and liberal phrasing. Sounds complicated, but bear with me, it'll be a fun ride.
So, this lovely primer found its way into the world:

It teaches all us boys and girls how to set up meta-genomic studies and analyze the data that may come out of them. Actually, it only focuses on 16S and does a rather bad job discussing anything relevant, but hey, what do you expect? You want a power analysis or to know how many samples you need, ask your statistician! These people clearly aren't the ones to answer any of those questions anyway.
None of this would be a problem, had they not actually gotten some things completely ass backwards. Thus, this unpleasant rant.
First, there's this:
"Lauber and colleagues recently showed that storage for 2 weeks at temperatures ranging from −80°C to 20°C did not significantly affect patterns of between-sample diversity or the abundance of major taxa (Lauber et al. 2010)". So basically, they're saying you can keep the samples in your pocket for a week and it will all be fine. Really? I work on stool samples and in my experience they are rather sensitive to thawing. But ok, I have been wrong before (once!). What does the original publication say? Well, ignoring the appalling setup and analysis, here's a nice quote: "One sub-sample was omitted from the data set (Fecal 1 Day 14, 20°C replicate 2) due to visible fungal growth prior to DNA extraction". Are you seriously expecting me to believe that this won't change the composition of your sample? You must be out of your mind. I extend a challenge to Dr. Lauber: that he eat his lunch after having left it at 20°C for 2 weeks; what could go wrong? Sure, you won't see the difference if you compare to just one other sample (yes, they have replicates of TWO samples) as the differences between them are huge. But to take this and conclude that the shifts in community composition are "minor" is idiotic. You might even say it's the result of eating two week old lunch...
I've tried getting my hands on their data but the SRA number doesn't actually exist. I'll get back to this once I have it in my hands.

And then there's this:
"The number of sequences obtained in a sequencing run can vary across samples for technical rather than biological reasons, and these sequencing depth artifacts can affect diversity estimates. One approach to account for variable sequencing depth is to use frequencies of OTUs (operational taxonomic units, described below) within samples (i.e., to normalize by total sample sequence count). We recommend against this approach, as we have found that it is subject to statistical pitfalls and can lead to samples clustering by sequencing depth (Friedman and Alm, 2012; C. Lozupone, J.G.C., and R.K., unpublished data)."
It goes on to propose rarefaction and then to cite a paper that explains quite plainly why rarefaction is generally a stupid thing to do (McMurdie and Holmes, 2013). But that's not my problem at all. You're free to do rarefaction and throw 50% of your data out the window. What do i care? It's not my money. Just don't tell your funding body.
No, my problem is with the assertion that abundances "[are] subject to statistical pitfalls and can lead to samples clustering by sequencing depth", which is simply not true. They cite a lovely paper by Friedman and Alm, which I doubt they have read. Because if they had, they would know just how silly their point is.
I do recommend reading the Friedman paper, but i'll put their point plainly here: You cannot use correlation analysis on compositional data, because it's going to be crap. This is because each measurement is by definition dependent on all others and this will break the correlation. Pearson (yeah, that one) had figured this out in 1897. Then they clearly show a way of getting around this, by employing a log-ratio transformation from the good old Aitchison. So, there's no problem with abundance values, you should just not use them wrongly. As to the "clustering by sequencing depth" i'll repudiate that off the bat since they can't be bothered to show any data for it.
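To make the log-ratio idea a bit more concrete, here is a minimal sketch of the centred log-ratio (CLR), one of Aitchison's transformations. This is an illustration only and not necessarily the exact procedure Friedman and Alm use; the function name `clr` and the toy counts are made up for the example, and zero counts are assumed to have been dealt with beforehand.

```python
import math

def clr(counts):
    """Centred log-ratio: log of each part minus the mean log over all parts.

    Assumes every count is > 0 (math.log raises ValueError on zero);
    how to handle zeros is a separate decision not covered here.
    """
    logs = [math.log(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical sample, and the same community sequenced ten times deeper.
shallow = [80, 100, 76]
deep = [800, 1000, 760]

clr_shallow = clr(shallow)
clr_deep = clr(deep)

# The transformed values only depend on the ratios between taxa,
# so the sequencing depth drops out.
print(clr_shallow)
print(clr_deep)
```

The two printed vectors should agree up to floating point error, which is the point: the transform works on ratios between taxa, so the total number of reads no longer matters.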
But it gets better. The compositional problem doesn't arise from a total sum scaling (the one that gives you abundances that sum to 100%), as the authors of the primer would have you believe (and i'm sure they believe it themselves). It comes from the way the measurement is done. Let me put it this way: any value that you measure is only valid in the context of the number of measurements you've done. So, if you have 80 measurements that reflect the presence of Bacteria A and you don't tell me how many times you've measured, then 80 doesn't mean anything! It only becomes a coherent measure when you're saying 80 out of 256 are Bacteria A. And here, the compositional issue is already present. Because if in your "community" Bacteria A grows and is peppy and everything else stays the same, then in the next measurement it might be 170 out of 256. And thus, because of the way you measure, the values of Bacteria A and "the rest" will have a perfect negative correlation. They will simply have to.
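The 80-out-of-256 example can be played through in a few lines. This is a toy sketch under stated assumptions: the "true" cell numbers are invented, "the rest" is held constant at 176, and every measurement is assumed to yield exactly 256 reads split in proportion to the true abundances (no sampling noise).

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

DEPTH = 256                       # reads per measurement (fixed)
TRUE_REST = 176                   # invented: "everything else stays the same"
true_a = [80, 120, 200, 348, 500] # invented: Bacteria A grows over time

# Reads assigned to A are proportional to its share of the community;
# whatever is left of the 256 reads goes to "the rest".
reads_a = [round(DEPTH * a / (a + TRUE_REST)) for a in true_a]
reads_rest = [DEPTH - r for r in reads_a]

r = pearson(reads_a, reads_rest)
print(reads_a, reads_rest, r)
```

In the true community "the rest" never changes, yet its read count drops every time Bacteria A goes up, and the correlation between the two measured values comes out at -1 (up to floating point). No normalization step was involved; the fixed number of reads is enough.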
One last thing: Friedman and Alm also nicely make the point that the compositional effect is going to be stronger the less diverse your community is. Rarefaction does exactly that! It minimizes the diversity in your sample, thus exacerbating the compositional effect. And this is how you get things ass backwards!
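For anyone who has not seen what rarefaction does in practice, here is a rough sketch. The two samples, their taxa and their counts are invented, and `rarefy` is a hypothetical helper, not the routine of any particular pipeline: it simply draws reads without replacement down to the depth of the shallowest sample.

```python
import random

def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from a taxon -> count dict."""
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    picked = rng.sample(reads, depth)
    out = {}
    for taxon in picked:
        out[taxon] = out.get(taxon, 0) + 1
    return out

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Invented samples: one shallow, one deep with a few rare taxa.
samples = {
    "shallow": {"A": 600, "B": 350, "C": 50},
    "deep": {"A": 3000, "B": 1500, "C": 400, "D": 80, "E": 15, "F": 5},
}

# Rarefy everything to the depth of the shallowest sample.
depth = min(sum(c.values()) for c in samples.values())
rarefied = {name: rarefy(c, depth, rng) for name, c in samples.items()}

reads_before = sum(sum(c.values()) for c in samples.values())
reads_after = depth * len(samples)
discarded = 1 - reads_after / reads_before

print(f"kept {reads_after} of {reads_before} reads, discarded {discarded:.0%}")
print("taxa seen in deep sample:", len(samples["deep"]), "->", len(rarefied["deep"]))
```

With these made-up numbers two thirds of the reads are thrown away, and the rare taxa in the deep sample can only stay or vanish after subsampling; they never become easier to see.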

Tuesday, November 12, 2013

Really Nature, really?

This is going to be a quick one cause i need to catch a bus in 10 minutes.
A nice paper came out in PNAS on how messed up the p-value cutoff of 5% is. Nature (the journal, not the other nature) decided to report on it here:
Somewhere in the middle of that you will find this little gem:
"he found that a P value of 0.05 or less — commonly considered evidence in support of a hypothesis in fields such as social science, in which non-reproducibility has become a serious issue — corresponds to Bayes factors of between 3 and 5, which are considered weak evidence to support a finding."
Now, does that sound weird to anyone?
Nature, that's a dick move! Pretending this is mainly a problem in "social science" is silly. The amount of money bio-medical research wastes because of building on shitty statistics is staggering (obviously, i don't know the number). And you (again, nature) as a publisher of some of that drivel should be highlighting that, not pretending it's not a problem.


Friday, November 8, 2013

Ethical, legal and social implications of "go to hell"

I was feeling quite unproductive the other day so i decided to waste some time in a "Science and Society" conference that happened to be taking place around. Saw a couple of cute talks and then "ethics management" happened.
Let me preface this by saying that i have a modicum of respect for research in the area of ethics and policy making (really, just a smidgen of respect). Let me also say that i do not give a flying toss for research on how to manage research in ethics. That's a level of meta too many for me.
I had also made a strategic mistake in sitting in the middle of a row which made it hard to get out mid talk. Thus, i had to sit there through the entire ordeal. The torturer's name is Jane Kaye. I do not know her outside this one talk so i'm going to infer everything about her from it (sorry Jane). She is a professor at Oxford. Sounded promising. Oh, was i wrong!
There was not one piece of actual information in a 45 minute talk. Nothing! She went on and on about accelerating ethics research and making it more adaptable and dynamic. Did she ever tell us how that would happen? Of course she did: ethics research should be "grounded in a commitment to the shared values of mutual respect, trust, and active collaboration". Yeah, that sounds like a plan. Really, read that again and try to come up with what that actually means. It means nothing. It's fluff and nothing more. Actually, it's not only fluff. It's a good excuse to get a lot of people to travel all over the world (she showed pictures from at least 3 cities she visited in the past couple of years for the purpose of "international collaboration" or some junk catch phrase like that) and sit around doing close to nothing.
I wanted to ask some horribly sarcastic question at the end of the talk but i didn't want to be that guy (i'd rather rant about it in a damn blog).
Then i did some googling and it turns out these people managed to get a Science paper out:
Now, Science, really? You have to read this paper. It's not too long and it says nothing. It's all words i understand but they don't come together to form content.
One of these days people will wake up and realize there's a class of managers (i'm using this as a pejorative) that have infiltrated science and are ruining it for all of us. I'm sick and tired of having to put up with these guys, who are either:
- evil -> they know they're not adding any value.
- stupid -> they think they're somehow useful and important.
I haven't decided which one is worse.

Ahh, this feels better!

Sunday, January 23, 2011


I think i've had this blog "reserved" for a couple of years now, but it's been hard to get myself psyched enough to put up the first entry.
Not that i'm psyched now, but I am a bit pissed off.
Following the American debate on public and political rhetoric, it got me thinking. Not so much in the US context as in the one back home. So, i started watching home grown news again. A habit i had given up on because it seemed really unhealthy. (Got my bp spiking and i think really messed with my cholesterol).
This morning, i watched an hour long conversation allegedly about the increase in food prices. Ranging from potatoes to cheese. May sound like not such a relevant conversation to the political rhetoric but it presented itself as a surprising summation of the genre. Basically, it contained no information about the possible causes of the increase, a series of percentages of increase per item and a lot (and i mean a LOT) of off topic shouting. The best part however was when an extra phone guest joins the conversation. Between his introduction and him starting to speak his mind, in that 5 seconds gap, you can hear two of the four studio guests going: "Why the f**k are they asking this guy".
This I believe is the essence of political dialogue lately. Politicians and also average people have forgotten how to listen to each other. It has stopped being a dialogue and turned into a series of monologues. Crappy monologues at that! Everyone is ranting uncontrollably seeming to forget why they are there in the first place.

Message to take away from this:
1. Listen to what your opponent has to say before you stuff your boot in his mouth.
2. Don't use prefabricated arguments. Let the information sink in before you retort.

Let's start dialoging about stuff. Or even better, gymnologize.