According to the new "discoveries" of the ENCODE project, the gene is not the basic unit; the "transcript" is, and that is the work of RNA, not DNA. The double helix loses its superstar status and cedes the throne to the evolutionarily far older RNA. So-called junk genes (which were thought to have no important function) do not exist; instead, all of them play a role in the overall complex system of tasks. The genetic sequence more closely resembles a 3-D picture of a city, where 1.5 percent of genes take part in making proteins and all the rest do "municipal" jobs: they supervise, copy, clean, organise, check...
A major controversy followed.
Ed Yong: ENCODE: the rough guide to the human genome
Back in 2001, the Human Genome Project gave us a nigh-complete readout of our DNA. Somehow, those As, Gs, Cs, and Ts contained the full instructions for making one of us, but they were hardly a simple blueprint or recipe book. The genome was there, but we had little idea about how it was used, controlled or organised, much less how it led to a living, breathing human.
That gap has just got a little smaller. A massive international project called ENCODE – the Encyclopedia Of DNA Elements – has moved us from “Here’s the genome” towards “Here’s what the genome does”. Over the last 10 years, an international team of 442 scientists have assailed 147 different types of cells with 24 types of experiments. Their goal: catalogue every letter (nucleotide) within the genome that does something. The results are published today in 30 papers across three different journals, and more.
For years, we’ve known that only 1.5 percent of the genome actually contains instructions for making proteins, the molecular workhorses of our cells. But ENCODE has shown that the rest of the genome – the non-coding majority – is still rife with “functional elements”. That is, it’s doing something.
It contains docking sites where proteins can stick and switch genes on or off. Or it is read and ‘transcribed’ into molecules of RNA. Or it controls whether nearby genes are transcribed (promoters; more than 70,000 of these). Or it influences the activity of other genes, sometimes across great distances (enhancers; more than 400,000 of these). Or it affects how DNA is folded and packaged. Something.
According to ENCODE’s analysis, 80 percent of the genome has a “biochemical function”. More on exactly what this means later, but the key point is: It’s not “junk”. Scientists have long recognised that some non-coding DNA has a function, and more and more solid examples have come to light [edited for clarity - Ed]. But, many maintained that much of these sequences were, indeed, junk. ENCODE says otherwise. “Almost every nucleotide is associated with a function of some sort or another, and we now know where they are, what binds to them, what their associations are, and more,” says Tom Gingeras, one of the study’s many senior scientists.
And what’s in the remaining 20 percent? Possibly not junk either, according to Ewan Birney, the project’s Lead Analysis Coordinator and self-described “cat-herder-in-chief”. He explains that ENCODE only (!) looked at 147 types of cells, and the human body has a few thousand. A given part of the genome might control a gene in one cell type, but not others. If every cell is included, functions may emerge for the phantom proportion. “It’s likely that 80 percent will go to 100 percent,” says Birney. “We don’t really have any large chunks of redundant DNA. This metaphor of junk isn’t that useful.”
That the genome is complex will come as no surprise to scientists, but ENCODE does two fresh things: it catalogues the DNA elements for scientists to pore over; and it reveals just how many there are. “The genome is no longer an empty vastness – it is densely packed with peaks and wiggles of biochemical activity,” says Shyam Prabhakar from the Genome Institute of Singapore. “There are nuggets for everyone here. No matter which piece of the genome we happen to be studying in any particular project, we will benefit from looking up the corresponding ENCODE tracks.”
There are many implications, from redefining what a “gene” is, to providing new clues about diseases, to piecing together how the genome works in three dimensions. “It has fundamentally changed my view of our genome. It’s like a jungle in there. It’s full of things doing stuff,” says Birney. “You look at it and go: ‘What is going on? Does one really need to make all these pieces of RNA?’ It feels verdant with activity but one struggles to find the logic for it.”
Think of the human genome as a city. The basic layout, tallest buildings and most famous sights are visible from a distance. That’s where we got to in 2001. Now, we’ve zoomed in. We can see the players that make the city tick: the cleaners and security guards who maintain the buildings, the sewers and power lines connecting distant parts, the police and politicians who oversee the rest. That’s where we are now: a comprehensive 3-D portrait of a dynamic, changing entity, rather than a static, 2-D map.
And just as London is not New York, different types of cells rely on different DNA elements. For example, of the roughly 3 million locations where proteins stick to DNA, just 3,700 are commonly used in every cell examined. Liver cells, skin cells, neurons, embryonic stem cells… all of them use different suites of switches to control their lives. Again, we knew this would be so. Again, it’s the scale and the comprehensiveness that matter.
“This is an important milestone,” says George Church, a geneticist at the Harvard Medical School. His only gripe is that ENCODE’s cell lines came from different people, so it’s hard to say if differences between cells are consistent differences, or simply reflect the genetics of their owners. Birney explains that in other studies, the differences between cells were greater than the differences between people, but Church still wants to see ENCODE’s analyses repeated with several types of cell from a small group of people, healthy and diseased. That should be possible since “the cost of some of these [tests] has dropped a million-fold,” he says.
The next phase is to find out how these players interact with one another. What does the 80 percent do (if, genuinely, anything)? If it does something, does it do something important? Does it change something tangible, like a part of our body, or our risk of disease? If it changes, does evolution care?
[Update 07/09 23:00] Indeed, to many scientists, these are the questions that matter, and ones that ENCODE has dodged through a liberal definition of “functional”. That, say the critics, critically weakens its claims of having found a genome rife with activity. Most of the ENCODE’s “functional elements” are little more than sequences being transcribed to RNA, with little heed to their physiological or evolutionary importance. These include repetitive remains of genetic parasites that have copied themselves ad infinitum, the corpses of dead and once-useful genes, and more.
To include all such sequences within the bracket of “functional” sets a very low bar. Michael Eisen from the Howard Hughes Medical Institute described ENCODE’s definition as a “meaningless measure of functional significance” and Leonid Kruglyak from Princeton University noted that it’s “barely more interesting” than saying that a sequence gets copied (which all of them are). To put it more simply: our genomic city’s got lots of new players in it, but they may largely be bums.
This debate is unlikely to quieten any time soon, although some of the heaviest critics of ENCODE’s “junk” DNA conclusions have still praised its nature as a genomic parts list. For example, T. Ryan Gregory from Guelph University contrasts their discussions on junk DNA to a classic paper from 1972, and concludes that they are “far less sophisticated than what was found in the literature decades ago.” But he also says that ENCODE provides “the most detailed overview of genome elements we’ve ever seen and will surely lead to a flood of interesting research for many years to come.” And Michael White from the Washington University in St. Louis said that the project had achieved “an impressive level of consistency and quality for such a large consortium.” He added, “Whatever else you might want to say about the idea of ENCODE, you cannot say that ENCODE was poorly executed.”
Where will it lead us? It’s easy to get carried away, and ENCODE’s scientists seem wary of the hype-and-backlash cycle that befell the Human Genome Project. Much was promised at its unveiling, by both the media and the scientists involved, including medical breakthroughs and a clearer understanding of our humanity. The ENCODE team is being more cautious. “This idea that it will lead to new treatments for cancer or provide answers that were previously unknown is at least partially true,” says Gingeras, “but the degree to which it will successfully address those issues is unknown.”
“We are the most complex things we know about. It’s not surprising that the manual is huge,” says Birney. “I think it’s going to take this century to fill in all the details. That full reconciliation is going to be this century’s science.”
So… how much is “functional” again?
So, that 80 percent figure… Let’s build up to it.
We know that 1.5 percent of the genome codes for proteins. That much is clearly functional and we’ve known that for a while. ENCODE also looked for places in the genome where proteins stick to DNA – sites where, most likely, the proteins are switching a gene on or off. They found 4 million such switches, which together account for 8.5 percent of the genome.* (Birney: “You can’t move for switches.”) That’s already higher than anyone was expecting, and it sets a pretty conservative lower bound for the part of the genome that definitively does something.
In fact, because ENCODE hasn’t looked at every possible type of cell or every possible protein that sticks to DNA, this figure is almost certainly too low. Birney’s estimate is that it’s out by half. This means that the total proportion of the genome that either creates a protein or sticks to one, is around 20 percent.
To get from 20 to 80 percent, we include all the other elements that ENCODE looked for – not just the sequences that have proteins latched onto them, but those that affect how DNA is packaged and those that are transcribed at all. Birney says, “[That figure] best conveys the difference between a genome made mostly of dead wood and one that is alive with activity.” Update (5/9/12 23:00): For Birney’s own, very measured, take on this, check out his post.
That 80 percent covers many classes of sequence that were thought to be essentially functionless. These include introns – the parts of a gene that are cut out at the RNA stage, and don’t contribute to a protein’s manufacture. “The idea that introns are definitely deadweight isn’t true,” says Birney. The same could be said for our many repetitive sequences: small chunks of DNA that have the ability to copy themselves, and are found in large, recurring chains. These are typically viewed as parasites, which duplicate themselves at the expense of the rest of the genome. Or are they?
The youngest of these sequences – those that have copied themselves only recently in our history – still pose a problem for ENCODE. But many of the older ones, the genomic veterans, fall within the “functional” category. Some contain sequences where proteins can bind, and influence the activity of nearby genes. Perhaps their spread across the genome represents not the invasion of a parasite, but a way of spreading control. “These parasites can be subverted sometimes,” says Birney.
He expects that many skeptics will argue about the 80 percent figure, and the definition of “functional”. But he says, “No matter how you cut it, we’ve got to get used to the fact that there’s a lot more going on with the genome than we knew.”
[Update 07/09 23:00] Birney was right about the scepticism. Gregory says, “80 percent is the figure only if your definition is so loose as to be all but meaningless.” Larry Moran from the University of Toronto adds, “‘Functional’ simply means a little bit of DNA that’s been identified in an assay of some sort or another. That’s a remarkably silly definition of function and if you’re using it to discount junk DNA it’s downright disingenuous.”
This is the main criticism of ENCODE thus far, repeated across many blogs and touched on in the opening section of this post. There are other concerns. For example, White notes that many DNA-binding proteins recognise short sequences that crop up all over the genome just by chance. The upshot is that you’d expect many of the elements that ENCODE identified if you just wrote out a random string of As, Gs, Cs, and Ts. “I’ve spent the summer testing a lot of random DNA,” he tweeted. “It’s not hard to make it do something biochemically interesting.”
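White's point about short sequences cropping up by chance can be checked with a back-of-the-envelope calculation. The sketch below assumes a uniformly random genome, in which every letter is equally likely and independent of its neighbours (real genomes are not like this), and a length of three billion letters. The motif lengths are illustrative choices, not figures from ENCODE.

```python
def expected_chance_matches(genome_length: int, motif_length: int) -> float:
    """Expected number of positions at which one specific motif appears
    on one strand of a uniformly random DNA sequence."""
    positions = genome_length - motif_length + 1   # possible start positions
    return positions / (4 ** motif_length)         # each start matches with probability 1/4^k

GENOME_LENGTH = 3_000_000_000  # roughly three billion letters (rounded for illustration)

for motif_length in (6, 8, 10, 12):
    matches = expected_chance_matches(GENOME_LENGTH, motif_length)
    print(f"{motif_length}-letter motif: about {round(matches):,} chance matches")
```

Under these assumptions, a specific six-letter motif would be expected more than 700,000 times by chance alone, which is why the mere presence of a binding sequence says little about whether it matters.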
Gregory asks why, if ENCODE is right and our genome is full of functional elements, does an onion have around five times as much non-coding DNA as we do? Or why pufferfishes can get by with just a tenth as much? Birney says the onion test is silly. While many genomes have a tight grip upon their repetitive jumping DNA, many plants seem to have relaxed that control. Consequently, their genomes have bloated in size (bolstered by the occasional mass doubling). “It’s almost as if the genome throws in the towel and goes: Oh sod it, just replicate everywhere.” Conversely, the pufferfish has maintained an incredibly tight rein upon its jumping sequences. “Its genome management is pretty much perfect,” says Birney. Hence: the smaller genome.
But Gregory thinks that these answers are a dodge. “I would still like Birney to answer the question. How is it that humans “need” 100% of their non-coding DNA, but a pufferfish does fine with 1/10 as much [and] a salamander has at least 4 times as much?” [I think Birney is writing a post on this, so expect more updates as they happen, and this post to balloon to onion proportions].
Update (07/09/12 11:00): The ENCODE reactions have come thick and fast, and Brendan Maher has written the best summary of them. I’m not going to duplicate his sterling efforts. Head over to Nature’s blog for more.
* (A cool aside: John Stamatoyannopoulos from the University of Washington mapped these protein-DNA contacts by looking for “footprints” where the presence of a protein shields the underlying DNA from a “DNase” enzyme that would otherwise slice through it. The resolution is incredible! Stamatoyannopoulos could “see” every nucleotide that’s touched by a protein – not just a footprint, but each of its toes too. Joe Ecker from the Salk Institute thinks we should be eventually able to “dynamically footprint a cellular response”. That is, expose a cell to something—maybe a hormone or a toxin—and check its footprints over time. You can cross-reference those sites to the ENCODE database, and reconstruct what’s going on in the cell just by “watching” the shadows of proteins as they descend and lift off.)
Redefining the gene
The simplistic view of a gene is that it’s a stretch of DNA that is transcribed to make a protein. But each gene can be transcribed in different ways, and the transcripts overlap with one another. They’re like choose-your-own-adventure books: you can read them in different orders, start and finish at different points, and leave out chunks altogether.
Fair enough: We can say that the “gene” starts at the start of the first transcript, and ends at the end of the final transcript. But ENCODE’s data complicates this definition. There are a lot of transcripts, probably more than anyone had realised, and some connect two previously unconnected genes. The boundaries for those genes widen, and the gaps between them shrink or disappear.
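To see how a single extra transcript can redraw gene boundaries, here is a minimal sketch that applies the rule above: a "gene" runs from the start of its first transcript to the end of its last. The coordinates and the names `gene_x`, `gene_y` and `bridging` are invented for illustration and do not come from ENCODE data.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on one chromosome, end inclusive

def merge_into_loci(transcripts: List[Interval]) -> List[Interval]:
    """Merge overlapping transcripts into loci that span from the
    earliest start to the latest end of each overlapping group."""
    loci: List[Interval] = []
    for start, end in sorted(transcripts):
        if loci and start <= loci[-1][1]:
            # Overlaps the current locus: extend its end if needed.
            loci[-1] = (loci[-1][0], max(loci[-1][1], end))
        else:
            # No overlap: this transcript opens a new locus.
            loci.append((start, end))
    return loci

# Hypothetical coordinates, invented purely for illustration.
gene_x = [(100, 400), (150, 500)]
gene_y = [(800, 1200), (900, 1300)]
bridging = [(450, 850)]  # a newly found transcript spanning the gap

before = merge_into_loci(gene_x + gene_y)
after = merge_into_loci(gene_x + gene_y + bridging)
print("before:", before)  # two separate loci with a gap between them
print("after:", after)    # one locus, the gap has disappeared
```

With the bridging transcript added, the two loci collapse into one and the space between them vanishes.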
Gingeras says that this “intergenic” space has shrunk by a factor of four. “A region that was once called Gene X is now melded to Gene Y.” Imagine discovering that every book in the library has a secret appendix, that’s also the foreword of the book next to it.
These bleeding boundaries seem familiar. Bacteria have them: Their genes are cramped together in a miracle of effective organisation, packing in as much information as possible into a tiny genome. Viruses epitomise such genetic economy even better. I suggested that comparison to Gingeras. “Exactly!” he said. “Nature never relinquished that strategy.”
Bacteria and viruses can get away with smooshing their protein-encoding genes together. But not only do we have more proteins, we also need a vast array of sequences to control when, where and how they are deployed. Those elements need space too. Ignore them, and it looks like we have a flabby genome with sequence to spare. Understand them, and our own brand of economical packaging becomes clear. (However, Birney adds, “In bacteria and viruses, it’s all elegant and efficient. At the moment, our genome just seems really, really messy. There’s this much higher density of stuff, but for me, emotionally it doesn’t have that elegance that we see in a bacterial genome.”)
Given these blurred boundaries, Gingeras thinks that it no longer makes sense to think of a gene as a specific point in the genome, or as its basic unit. Instead, that honour falls to the transcript, made of RNA rather than DNA. “The atom of the genome is the transcript,” says Gingeras. “They are the basic unit that’s affected by mutation and selection.” A “gene” then becomes a collection of transcripts, united by some common factor.
There’s something poetic about this. Our view of the genome has long been focused on DNA. It’s the thing the genome project was deciphering. It is converted into RNA, giving it a more fundamental flavour. But out of those two molecules, RNA arrived on the planet first. It was copying itself and evolving long before DNA came on the scene. “These studies are pointing us back in that direction,” says Gingeras. They recognise RNA’s role, not as simply an intermediary between DNA and proteins, but something more primary.
What about diseases?
For the last decade, geneticists have run a seemingly endless stream of “genome-wide association studies” (GWAS), attempting to understand the genetic basis of disease. They have thrown up a long list of SNPs – variants at specific DNA letters – that correlate with the risk of different conditions.
The ENCODE team have mapped all of these to their data. They found that just 12 percent of the SNPs lie within protein-coding areas. They also showed that compared to random SNPs, the disease-associated ones are 60 percent more likely to lie within functional, non-coding regions, especially in promoters and enhancers. This suggests that many of these variants are controlling the activity of different genes, and provides many fresh leads for understanding how they affect our risk of disease. “It was one of those too good to be true moments,” says Birney. “Literally, I was in the room [when they got the result] and I went: Yes!”
Imagine a massive table. Down the left side are all the diseases that people have done GWAS studies for. Across the top are all the possible cell types and transcription factors (proteins that control how genes are activated) in the ENCODE study. Are there hotspots? Are there SNPs that correspond to both? Yes. Lots, and many of them are new.
Take Crohn’s disease, a type of bowel disorder. The team found five SNPs that increase the risk of Crohn’s, and that are recognised by a group of transcription factors called GATA2. “That wasn’t something that the Crohn’s disease biologists had on their radar,” says Birney. “Suddenly we’ve made an unbiased association between a disease and a piece of basic biology.” In other words, it’s a new lead to follow up on.
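The "massive table" can be pictured as a simple count of SNPs per pairing of disease and transcription factor. The sketch below is only an illustration of that idea: apart from the five Crohn's SNPs recognised by GATA2 mentioned above, every record, label and the `min_snps` threshold is a made-up placeholder, and this is not how the ENCODE team ran its analysis.

```python
from collections import Counter
from typing import List, Tuple

# (SNP label, associated disease, transcription factor whose site it falls in).
# All SNP labels are placeholders; "disease_A", "factor_1" etc. are hypothetical.
Record = Tuple[str, str, str]
records: List[Record] = [
    ("snp_01", "Crohn's disease", "GATA2"),
    ("snp_02", "Crohn's disease", "GATA2"),
    ("snp_03", "Crohn's disease", "GATA2"),
    ("snp_04", "Crohn's disease", "GATA2"),
    ("snp_05", "Crohn's disease", "GATA2"),
    ("snp_06", "disease_A", "factor_1"),
    ("snp_07", "disease_A", "factor_2"),
    ("snp_08", "disease_B", "factor_1"),
]

def build_table(rows: List[Record]) -> Counter:
    """Count SNPs for each (disease, transcription factor) cell of the table."""
    return Counter((disease, factor) for _, disease, factor in rows)

def hotspots(table: Counter, min_snps: int) -> List[Tuple[str, str]]:
    """Return the cells holding at least min_snps SNPs, in sorted order."""
    return sorted(cell for cell, count in table.items() if count >= min_snps)

table = build_table(records)
print(hotspots(table, min_snps=3))
```

A real analysis would also have to ask whether a cell holds more SNPs than chance would predict; a raw count is only the first step.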
“We’re now working with lots of different disease biologists looking at their data sets,” says Birney. “In some sense, ENCODE is working from the genome out, while GWAS studies are working from disease in.” Where they meet, there is interest. So far, the team have identified 400 such hotspots that are worth looking into. Of these, between 50 and 100 were predictable. Some of the rest make intuitive sense. Others are head-scratchers.
The 3-D genome
Writing the genome out as a string of letters invites a common fallacy: that it’s a two-dimensional, linear entity. It’s anything but. DNA is wrapped around proteins called histones like beads on a string. These are then twisted, folded and looped in an intricate three-dimensional way. The upshot is that parts of the genome that look distant when you write the sequences out can actually be physical neighbours. And this means that some switches can affect the activity of far away genes.
Job Dekker from the University of Massachusetts Medical School has now used ENCODE data to map these long-range interactions across just 1 percent of the genome in three different types of cell. He discovered more than 1,000 of them, where switches in one part of the genome were physically reaching over and controlling the activity of a distant gene. “I like to say that nothing in the genome makes sense, except in 3D,” says Dekker. “It’s really a teaser for the future of genome science.”
Gingeras agrees. He thinks that understanding these 3-D interactions will add another layer of complexity to modern genetics, and extending this work to the rest of the genome, and other cell types, is a “next clear logical step”.
How will scientists actually make sense of all of this?
ENCODE is vast. The results of this second phase have been published in 30 central papers in Nature, Genome Biology and Genome Research, along with a slew of secondary articles in Science, Cell and others. And all of it is freely available to the public.
The pages of printed journals are a poor repository for such a vast trove of data, so the ENCODE team have devised a new publishing model. In the ENCODE portal site, readers can pick one of 13 topics of interest, and follow them in special “threads” that link all the papers. Say you want to know about enhancer sequences. The enhancer thread pulls out all the relevant paragraphs from the 30 papers across the three journals. “Rather than people having to skim read all 30 papers, and working out which ones they want to read, we pull out that thread for you,” says Birney.
And yes, there’s an app for that.
Transparency is a big issue too. “With these really intensive science projects, there has to be a huge amount of trust that data analysts have done things correctly,” says Birney. But you don’t have to trust. At least half the ENCODE figures are interactive, and the data behind them can be downloaded. The team have also built a “Virtual Machine” – a downloadable package of the almost-raw data and all the code in the ENCODE analyses. Think of it as the most complete Methods section ever. With the virtual machine, “you can absolutely replay step by step what we did to get to the figure,” says Birney. “I think it should be the standard for the future.”
Compilation of other ENCODE coverage
- ENCODE: the human encyclopedia, a long-read feature by Brendan Maher
- Cataloguing the controlled chaos of the human genome, by John Timmer
- The best summary of the reactions yet; Fighting about ENCODE and junk by Brendan Maher. Graceful, measured stuff.
- This 100,000 word post on the ENCODE media bonanza will cure cancer, by Michael Eisen, heavily critical of the PR, and some of the claims (but see also: Michael Eisen’s take on ENCODE by T. Ryan Gregory)
- Mike White at The Finch and Pea is one of the few to delve into the papers for interesting angles not covered by the media coverage. Some interesting stuff here.
- Michael Eisen again on the neutral theory of molecular function.
- The ENCODE project: lessons for scientific publication, by Daniel Macarthur
- T. Ryan Gregory’s various pieces, including this one comparing ENCODE’s claims to a 1972 paper, and two others.
- Larry Moran’s various pieces, where he is heavily critical of ENCODE’s scientists, journalists in general, me specifically, and others.
- Chris Gunter’s informal Twitter poll on public understanding of junk DNA
On Wednesday, a handful of journals, including this one, released more than 30 papers describing results from the second phase of ENCODE: a consortium-driven project tasked with building the ‘ENCyclopedia Of DNA Elements’, a manual of sorts that defines and describes all the functional bits of the genome. Many reactions to the slew of papers, their web and iPad app presentations and the news coverage that accompanied the release were favourable. But several critics have challenged some of the most prominently reported claims in the papers, the way their publication was handled and the indelicate use of the word ‘junk’ on some material promoting the research.
First up was a scientific critique that the authors had engaged in hyperbole. In the main ENCODE summary paper, published in Nature, the authors prominently claim that the ENCODE project has thus far assigned “biochemical functions for 80% of the genome”. I had long and thorough discussions with Ewan Birney about this figure and what it actually meant, and it was clear that he was conflicted about reporting it in the paper’s abstract.
It’s a big number, to be sure. The protein-encoding portion of the genome — that which has historically been considered the most important part— represents a little more than 1%, and to imply that they found similarly important and interesting functions for another 79% is an extraordinary claim. Birney had said to me and reiterates in a Q&A-style blog post that it is also a loose interpretation of the word ‘functional’ that encompassed many categories of biochemical activity, from the very broad — such as actively producing or ‘transcribing’ RNA — to being attached to some sort of transcription-factor protein, all the way down to that narrow range of protein-encoding DNA within the 1%.
But hold on, said a number of genome experts: most of that activity isn’t particularly specific or interesting and may not have an impact on what makes a human a human (or what makes one human different from another). A blog post by Ed Yong discusses some of these critiques. It was already known, for example, that vast portions of the genome are transcribed into RNA. A small amount of that RNA encodes protein, and some serves a regulatory role, but the rest of it is chock-full of seemingly nonsensical repeats, remnants of past viruses and other weird little bits that shouldn’t serve a purpose.
The paper does drill down somewhat into what the authors mean by functional elements. And Birney does the same in his blog. Excluding all but the sites where there is very probable active binding by a regulatory protein, “we see a cumulative occupation of 8% of the genome,” he writes. Add to that the 1% of protein-encoding DNA and you get 9%.
Birney and his colleagues have estimated how complete their sampling is, and suspect that they will find another 11% of the genome with this kind of regulatory activity. That gets them to 20%. So, perhaps the main conclusion should have been that 20% of the genome in some situation can directly influence gene expression and phenotype of at least one human cell type. It’s a far cry from 80%, but a substantial increase from 1%.
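The arithmetic behind that 20% is short enough to write down. The sketch below uses only the rounded percentages quoted above; the variable names are mine, and the final line simply shows how much of the 80% headline rests on the looser categories of activity.

```python
# Percentages of the genome, as quoted in the text above (rounded figures).
protein_coding = 1.0        # protein-encoding DNA
regulatory_observed = 8.0   # very probable binding by a regulatory protein
regulatory_expected = 11.0  # further regulatory DNA the consortium suspects it will find
headline = 80.0             # "biochemical functions" claimed in the summary paper

observed_total = protein_coding + regulatory_observed
projected_total = observed_total + regulatory_expected
looser_categories = headline - projected_total

print(f"observed so far:   {observed_total:.0f}%")
print(f"projected:         {projected_total:.0f}%")
print(f"looser categories: {looser_categories:.0f} percentage points")
```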
Some suggest that a majority of the genome does have an active role in biological functions. John Mattick, director of the Garvan Institute of Medical Research in Sydney, Australia, who I spoke to in the run up to the publication of these papers, argued that the ENCODE authors were being far too conservative in their claims about the significance of all that transcription. “We have misunderstood the nature of genetic programming for the past 50 years,” he told me. Having long argued that non-coding RNA has a crucial role in cell regulatory functions, his gentle criticism is that “they’ve reported the elephant in the room then chosen to otherwise ignore it”.
The 80% number may not have been ideal, but it did provide a headline figure that was impressive to the mainstream media. This is at the core of a related critique against the ENCODE researchers and the journals that published their papers. By bandying about this big number, press releases on the project touted the idea that ENCODE had demolished some long-standing notion that much of the genome is ‘junk’. Michael Eisen, an evolutionary biologist at the University of California, Berkeley, said in a blog post that this pushed “a narrative about their results that is, at best, misleading.”
That narrative goes something like this: scientists long thought the genome was littered with junk, evolutionary remnants that serve no purpose, but ENCODE has shown that 80% of the genome (and possibly more to come) does serve a purpose. That narrative appeared in many media reports on the publication. Many on Twitter and in online conversations bemoaned the rehashing of a junk-DNA debate that they considered imaginary or at least long-settled. Eisen, perhaps rightfully, puts the blame on press releases that touted the supposed paradigm shift: the one from Nature Publishing Group started thus: “Far from being junk, the vast majority of our DNA participates in at least one biochemical event in at least one cell type.” Eisen says that “the authors undoubtedly know, nobody actually thinks that non-coding DNA is ‘junk’ anymore. It’s an idea that pretty much only appears in the popular press, and then only when someone announces that they have debunked it.”
It is an old argument, but it’s not clear that it is a dead argument. Several researchers took issue with ENCODE’s suggestion that its wobbly 80% number in any way disproves that some DNA is junk. Larry Moran, a biochemist at the University of Toronto in Ontario, argued on his blog that claims about disproving the existence of junk give ammunition to creationists who like a tidy view of every letter in the genome having some sort of divine purpose. “This is going to make my life very complicated,” he writes.
Indeed, the papers have caught the attention of at least some creationists, and of just about everyone else. This was in part designed by the project leaders and editors, who organized a simultaneous release of the publications to maximize their impact. This was a major, time-consuming event that occupied a great deal of time from the scientists involved and from the editors at their respective journals. And the delay that this coordination caused has led to another complaint. Casey Bergman, a genome biologist at the University of Manchester, UK, tried to tally the cost of this delay on the scientific community.
Each paper sat for an average of 3.7 months after being accepted before it was published. He estimates a maximum total of 112 months — nearly 10 years — during which the scientific community was deprived of insights from these papers. “To the extent that these papers are crucial for understanding the human genome, and the consequences this knowledge has for human health, this decade lost to humanity is clearly unacceptable,” writes Bergman. Granted, the ENCODE data have been released regularly and consistently throughout the project, and anyone can access and use the data to publish, but some observers noted that not everyone was aware of ENCODE’s progress. It would have been far better, Bergman and others argue, for the papers to be released as they were accepted. A review article, perhaps along with some of the other web and mobile bells and whistles, could have rounded them up at some set point, but reserving all the papers for one big publication push was detrimental, he claims.
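Bergman's tally is easy to sanity-check. The sketch below assumes a count of 30 papers, the figure used elsewhere in this coverage, and multiplies it by the 3.7-month average. That gives roughly 111 months, slightly under his 112-month maximum; the small gap presumably comes from rounding in the average, since the per-paper delays are not given here.

```python
MONTHS_PER_YEAR = 12

def total_delay_months(paper_count: int, mean_delay_months: float) -> float:
    """Summed waiting time across all papers, in months."""
    return paper_count * mean_delay_months

def months_to_years(months: float) -> float:
    """Convert a span in months to years."""
    return months / MONTHS_PER_YEAR

PAPER_COUNT = 30         # assumption: the number of main papers
MEAN_DELAY_MONTHS = 3.7  # average wait after acceptance, from the text
QUOTED_MAXIMUM = 112     # Bergman's maximum total, from the text

estimate = total_delay_months(PAPER_COUNT, MEAN_DELAY_MONTHS)
print(f"estimate from the average: {estimate:.0f} months")
print(f"quoted maximum in years:   {months_to_years(QUOTED_MAXIMUM):.1f}")
```

Either way, the total comes to a little over nine years, which is the basis for the "nearly 10 years" in the post.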
ENCODE was conceived of and practised as a resource-building exercise. In general, such projects have a huge potential impact on the scientific community, but they don’t get much attention in the media. The journal editors and authors at ENCODE collaborated over many months to make the biggest splash possible and capture the attention of not only the research community but also of the public at large. Similar efforts went into the coordinated publication of the first drafts of the human genome, another resource-building project, more than a decade ago. Although complaints and quibbles will probably linger for some time, the real test is whether scientists will use the data and prove ENCODE’s worth.
The ENCODE Data Dump and the Responsibility of Scientists
A few hours ago I criticized science journalists for getting suckered by the hype surrounding the publication of 30 papers from the ENCODE Consortium on the function of the human genome [The ENCODE Data Dump and the Responsibility of Science Journalists].
They got their information from supposedly reputable scientists but that's not an excuse. It is the duty and responsibility of science journalists to be skeptical of what scientists say about their own work. In this particular case, the scientists are saying the same things that were thoroughly criticized in 2007 when the preliminary results were published.
I'm not letting the science journalists off the hook, but I reserve my harshest criticism for the scientists, especially Ewan Birney, who is the lead analysis coordinator for the project and who has taken on the role of spokesperson for the consortium. Unless other members of the consortium speak out, I'll assume they agree with Ewan Birney. They bear the same responsibility for what has happened.
Ewan Birney is listed as the corresponding author for the main summary paper in Nature: An integrated encyclopedia of DNA elements in the human genome. Here's the opening paragraph:
The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.
I've highlighted the main take-home message.
The papers show no such thing, as Ewan Birney admits on his own blog [ENCODE: My own thoughts].
It’s clear that 80% of the genome has a specific biochemical activity – whatever that might be. This question hinges on the word “functional” so let’s try to tackle this first. Like many English language words, “functional” is a very useful but context-dependent word. Does a “functional element” in the genome mean something that changes a biochemical property of the cell (i.e., if the sequence was not here, the biochemistry would be different) or is it something that changes a phenotypically observable trait that affects the whole organism? At their limits (considering all the biochemical activities being a phenotype), these two definitions merge. Having spent a long time thinking about and discussing this, not a single definition of “functional” works for all conversations. We have to be precise about the context. Pragmatically, in ENCODE we define our criteria as “specific biochemical activity” – for example, an assay that identifies a series of bases. This is not the entire genome (so, for example, things like “having a phosphodiester bond” would not qualify). We then subset this into different classes of assay; in decreasing order of coverage these are: RNA, “broad” histone modifications, “narrow” histone modifications, DNaseI hypersensitive sites, Transcription Factor ChIP-seq peaks, DNaseI Footprints, Transcription Factor bound motifs, and finally Exons.
In other words, "functional" simply means a little bit of DNA that's been identified in an assay of some sort or another.
For someone who claims to have "spent a long time thinking about and discussing this," that's a remarkably silly definition of function, and if you're using it to discount junk DNA, it's downright disingenuous. Did Birney really not anticipate all the hype about refuting junk DNA? Come on, he couldn't have been that stupid, could he?
Here's the video he prepared with Magdalena Skipper, Senior Editor at Nature. Check out what she says at 2:28:
The striking overall result that the ENCODE project reports is that they can assign a function, a biochemical function, to 80% of the human genome. The reason why this is striking is because, not such a long time ago, we still considered that the vast proportion of the human genome was simply junk because we know that it's only 3% that encodes proteins.
Where did she get that idea if not from Ewan Birney? Watch Birney's performance to see if he challenges this interpretation or supports the concept that most of the human genome is involved in a vast network of complex controls.
Scientists have a responsibility to be scrupulously accurate when they present their own work to the general public. That means they should recognize the difference between what the data actually says and their own interpretation of the data. When scientists know that there are other ways to interpret the data, they are obliged to mention that. That's the mark of a good scientist. In this case Birney is well aware of the controversy over interpreting pervasive transcription and the possible insignificance of a DNA binding site. He knows that because the ENCODE Consortium was challenged in 2007 when it presented the results of the pilot project (see The ENCODE Data Dump and the Responsibility of Science Journalists).
This is, unfortunately, another case of a scientist acting irresponsibly by distorting the importance and the significance of the data. It's getting to be a serious problem and it makes it hard to convey real science to the general public. The public now believes that the concept of junk DNA has been rejected by scientists and that our huge genome really is full of wonderful sophisticated control elements regulating the expression of every gene.
It's going to take a lot of effort to undo the damage caused by scientists like Ewan Birney. - Laurence A. Moran
- Nature 489, 57–74 (06 September 2012)