In Parts one and two, I talked about the scholarly practice of Open Access publishing, and about how the central concept of "openness", or knowledge as a public good, is being incorporated into other aspects of science. I suggested that the overall practice (or philosophy, or movement) might be called Open Science, by which I mean the process of discovery at the intersection of Open Access (publishing), Open Data, Open Source (software), Open Standards (semantic markup) and Open Licensing.
Here I want to move from ideas to applications, and take a look at what kinds of Open Science are already happening and where such efforts might lead. Open Science is very much in its infancy at the moment; we don't know precisely what its maturity will look like, but we have good reason to think we'll like it.
By way of analogy, think about what the Web has made possible, and ask yourself: how much of that could you have predicted in, say, 1991, when Sir Tim wrote the first browser? Actually, since "infancy" is a generous term for the developmental state of Open Science, a better analogy probably reaches further back: how much of what the internet has made possible could anyone have predicted when ARPANET first met NSFnet? Given that last link, for instance, would you have seen Wikipedia coming? How about eBay, Amazon.com, RSS, blogs, YouTube, Google Maps, or insert-your-own-favorite amazing web site/service/application?
The potential is immense, and from our current perspective we cannot predict more than a fraction of the ways in which openness will transform the culture and practice of science. Nonetheless, there are signs pointing in possible directions.
early examples: sequence data
Sequence data (such as mRNA, genomic DNA and protein sequences) have long been the leading edge of large-scale collaborative science, largely because early competition among public and private organizations resulted in a series of groundbreaking agreements on public data sharing. (For a quick tour of the relevant history, see this article.) Among the online tools that have been developed around openly accessible sequence databases such as GenBank or Swiss-Prot, the flagship effort is probably the NCBI's online gateway Entrez. From Entrez I can search for information on a sequence of interest across almost thirty different interlinked databases. I can:
- find related nucleotide and protein sequences, and make detailed comparisons between them
- map a sequence of interest onto whole chromosomes or genomes, and compare those maps across ten or twenty different species
- access expert-curated information on any connection between a query molecule and human genetic disease or heritable disorders in other species
- look for known motifs or functional sequence modules in a query molecule, or use similar sequences to build 3D models of its likely shape and structure
- compare a sequence of interest across wide taxonomies, and formulate useful questions about its evolutionary history
- look for array data regarding expression of a query sequence in different developmental, disease-related and other contexts
- access genetic mapping data with which to map a query sequence in organisms for which little or no sequence data is yet available
There's much more -- that was a very brief and incomplete overview of what Entrez can do -- but you get the point. All of this analysis is only possible because the underlying sequence data is available on Open terms (and largely machine-readable due to semantic markup), and it forms a ready-made infrastructure in which further Open information can readily find a place -- as soon as it becomes available.
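All of this can be done programmatically as well as through the web interface. By way of illustration, here is a minimal sketch using Biopython's Bio.Entrez wrapper around the NCBI E-utilities; the query, the contact address and the follow-up link are placeholders I've invented, not anything drawn from the examples above.

```python
# A minimal sketch of programmatic Entrez access using Biopython's Bio.Entrez
# wrapper around NCBI's E-utilities. The query and email address are
# illustrative placeholders.
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI asks for a contact address

# Find nucleotide records matching a free-text query
handle = Entrez.esearch(db="nucleotide",
                        term="BRCA1[Gene] AND human[Organism]", retmax=5)
record = Entrez.read(handle)
handle.close()
print("Matching GenBank IDs:", record["IdList"])

# Follow the cross-links from one of those records into another Entrez database
if record["IdList"]:
    handle = Entrez.elink(dbfrom="nucleotide", db="pubmed",
                          id=record["IdList"][0])
    links = Entrez.read(handle)
    handle.close()
    print(links)
```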
data and text mining
In part 2 I talked about a range of efforts to make databases of other information, including text, similarly interoperable and available for mining. Paul Ginsparg, in a recent essay, used the interface between PubMed Central and various sequence databases as an early example of what becomes possible when databases can be read by computers as well as by humans (emphasis mine):
GenBank accession numbers are recognized in articles referring to sequence data and linked directly to the relevant records in the genomic databases. Protein names are recognized, and their appearances in articles are linked automatically to the protein and protein interaction databases. Names of organisms are recognized and linked directly to the taxonomic databases, which are then used to compute a minimal spanning tree of all of the organisms contained in a given document. In yet another view, technical terms are recognized and linked directly to the glossary items in the relevant standard biology or biochemistry textbook in the books database. The enormously powerful sorts of data mining and number crunching that are already taken for granted as applied to the open-access genomics databases can be applied to the full text of the entirety of the biology and life sciences literature and will have just as great a transformative effect on the research done with it.
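The kind of entity recognition Ginsparg describes starts from very simple pattern-matching. Here is a toy sketch that spots GenBank-style accession numbers in a block of text and turns them into Entrez links; it assumes only the common one-letter-plus-five-digit and two-letter-plus-six-digit formats, and the real PubMed Central pipeline is of course far more sophisticated.

```python
import re

# Toy recognizer for GenBank-style accession numbers in article text.
# Assumes the common formats (e.g. U12345, AF123456); purely illustrative.
ACCESSION = re.compile(r"\b([A-Z]\d{5}|[A-Z]{2}\d{6})\b")

def link_accessions(text):
    """Return (accession, Entrez URL) pairs for candidate accessions in text."""
    base = "https://www.ncbi.nlm.nih.gov/nuccore/"
    return [(acc, base + acc) for acc in ACCESSION.findall(text)]

sample = "The cDNA (GenBank accession AF123456) aligns with clone U12345."
for acc, url in link_accessions(sample):
    print(acc, "->", url)
```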
Donat Agosti recently pointed to three related projects: Biotext, which builds text mining tools; EBIMed, which analyses Medline search results and presents associations between gene names and several other databases; and the Arrowsmith Project, which allows semantic comparison between two search-defined sets of PubMed articles. The latter also maintains a list of free online text mining tools, which currently includes several dozen sites offering tools for a variety of purposes, although the majority are still focused on Medline and/or sequence databases.
These sorts of tools are not only useful, they are likely to become essential. Even now, I can hardly imagine trying to navigate the existing sequence data without Entrez, or the research literature without PubMed. GenBank contains more than 40 billion bases and is growing exponentially, doubling every 12-15 months. PubMed contains nearly 17 million records as I write this, and is adding well over half a million every year. The 2007 Nucleic Acids Research database issue lists nearly 1000 separate biological databases, up more than 10% from last year. As Matthew Cockerill of BioMed Central has pointed out, simple text searching is not enough to keep a researcher afloat in this onrushing sea of information.
bibliometrics
Data and text mining methods stand to come into their own as discovery tools once they have a fully Open and machine-readable body of published research on which to work. Similarly, the utility of bibliometrics, the quantitative analysis of text-based information, can be dramatically enhanced by Open Access. In particular, measures of research impact can be made much more powerful, direct and reliable.
Research impact is the degree to which a piece or body of work has been taken up and built upon by other researchers and put to practical use in education, technology, medicine and so on. Governments and other funding bodies want to be able to measure research impact in order to provide accountability and ensure maximal return on investment, and researchers and research administrators want the same measurements in order to assess the quality of their research and to plan future directions ("how are we doing? how can we do better?").
The most important measure of research impact currently available is citation analysis, a proxy measurement based on acknowledged use by later published work; the predominant citation-based metric in modern research assessment is the Impact Factor (IF). If a journal has a 2004 IF of 5, then papers published in that journal in 2002-2003 were cited, on average, 5 times each in 2004. This number is probably the most widely misunderstood and misused metric in all of science, and comes with a number of serious built-in flaws, not the least of which is that the underlying database is the property of the for-profit publishing company Thomson Scientific.
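To make the arithmetic concrete, here is a back-of-envelope version of that calculation; the article and citation counts are invented.

```python
# Back-of-envelope impact factor arithmetic; all numbers are invented.
citable_items = {2002: 180, 2003: 220}        # articles the journal published
citations_in_2004 = {2002: 950, 2003: 1050}   # 2004 citations to those articles

impact_factor_2004 = sum(citations_in_2004.values()) / sum(citable_items.values())
print(round(impact_factor_2004, 1))  # 5.0
```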
Despite these flaws and considerable high-profile criticism, it is difficult to overstate the influence that the Impact Factor has had, and continues to have, on all efforts to evaluate scientists and their work. Researchers obsess over journal choice: you don't want a rejection, which forces you to re-submit elsewhere and wastes time, but you need to get that paper into the "best" (that is, highest IF) journal you can so as to appeal to hiring, funding and tenure committees. And that's not unrealistic, since quite frankly the bottom line for most such committees is "who has published the most papers in high-IF journals". Other factors are usually considered, but the IF dominates. It's a clumsy, inaccurate and unscientific way to go about evaluating research impact and researcher talent.
Happily, there is a better way just over the Open Access horizon. Once a majority of published research is available in machine-readable OA databases, the community can get out from under Thomson's thumb and improve scientific bibliometrics in a host of different ways. Shadbolt et al. list more than two dozen improvements that OA will make possible, including:
- A CiteRank analog of Google’s PageRank algorithm will allow hits to be rank-ordered by weighted citation counts instead of just ordinary links (not all citations are equal; see the sketch after this list)
- In addition to ranking hits by author/article/topic citation counts, it will also be possible to rank them by author/article/topic download counts
- Correlations between earlier download counts and later citation counts will be available online, and usable for extrapolation, prediction and eventually even evaluation
- Searching, analysis, prediction and evaluation will also be augmented by cocitation analysis (who/what co-cited or was co-cited by whom/what?), coauthorship analysis, and eventually also co-download analysis
- Time-based (chronometric) analyses will be used to extrapolate early download, citation, co-download and co-citation trends, as well as correlations between downloads and citations, to predict research impact, research direction and research influences.
- Authors, articles, journals, institutions and topics will also have "endogamy/exogamy" scores: how much do they cite themselves? in-cite within the same "family" cluster? out-cite across an entire field? across multiple fields? across disciplines?
- "Hub/authority" analysis will make it easier to do literature reviews, identifying review articles citing many articles (hubs) or key articles/authors (authorities) cited by many articles.
Existing metrics (which basically means Thomson's proprietary data) are simply not rich enough to support such analyses. There are already efforts underway to mine the available body of text for better ways to evaluate research. Hirsch's h-index, an alternative way of using citation counts to rank authors according to their influence, can be calculated online using Google Scholar. Bollen et al. have proposed a method for using Google's PageRank as an alternative to the Impact Factor, as well as their own Y-factor which is a composite of the two measures. The Open Citation Project built Citebase, an online citation tracker which has been used to show that downloads (which are measured in real-time from the moment of upload) can predict citations (for which data one must wait years). Authoratory is a text-mining tool based on PubMed, and is capable of co-author analysis, authority ranking and more.
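The h-index, for instance, is simple enough to compute in a few lines once per-paper citation counts are openly available; the counts below are invented.

```python
# h-index: the largest h such that the author has h papers with at least h
# citations each. The citation counts below are invented for illustration.
def h_index(citation_counts):
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

print(h_index([24, 18, 12, 9, 7, 7, 3, 1, 0]))  # 6
```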
As the body of OA literature expands, these and similar tools will provide a far more reliable and equitable means of comparing researchers and research groups with their peers than is currently available, and will also facilitate the identification of trends and gaps in research focus. The downstream effects of increased efficiency in managing and carrying out research will be profound.
commentary and community
Andrew Dayton recently described another feature of the coming Open Science world, which he calls Open Discourse:
The internet is expanding the realm of scientific publishing to include free and open public debate of published papers. [...] How often have you asked yourself how a certain study was published unchallenged, without the results of a key control? How often have you wondered whether a paper's authors performed a specific procedure correctly? How often have you had the opportunity to question authors about previously published or opposing results they failed to cite, or discuss the difficulties of reproducing certain results? How often have you had the opportunity to command a discussion of an internal contradiction the referees seemed to have missed?
Stevan Harnad has referred to a similar idea as peer commentary, calling it a "powerful and important supplement to peer review". It's important to note that a number of journals, such as Current Anthropology or Psycoloquy, offer "open peer commentary" which is not actually open to public contribution. Similarly, the phrase "open peer review" is typically used to indicate that reviewers are not anonymous, rather than that review is open to the public. Neither of these pseudo-open concepts relies on "openness" in the Open Access/Open Science sense, whereas Open Discourse as Dayton means it is, of course, utterly dependent on such openness for its subject matter.
There are a number of venues which enable fully Open Discourse as Dayton means it. OA publisher BioMed Central offers a public comment button on every article, and Cell allows public comments on selected articles. BMC also publishes Biology Direct, which offers both an alternative model of peer review and public commentary, and PLoS has just launched PLoS One, offering standard peer review followed by public commentary, annotation and rating. Philica will publish anything, and provides public commentary which can also serve as a form of peer review through an authentication process for professional researchers. JournalReview.org is set up as an online public journal club, and Naboj is a forum for public review of articles posted to arxiv.org. BioWizard is somewhat similar, but is limited to articles accessible via PubMed and offers a number of other tools, such as a blogging platform and a rating mechanism designed to identify popular papers. Both JournalReview and BioWizard notify corresponding authors so that they can participate in the discussion. The British Medical Journal offers a rapid response mechanism which, having posted over 50,000 public responses to published work, sounds a cautionary note for more recent arrivals on the public commentary scene: in 2005, the journal was forced to impose a length limit and active moderation in order to avoid losing the desired signal in a flood of uninformed, obsessive noise.
Speaking of floods of uninformed, obsessive noise -- what about blogs?
Of course, I'm kidding. I actually have high hopes for the future of blogs in science, centered on three themes: commentary, community and data. Blogs are an excellent medium for commenting on anything, and with web feeds and a good aggregator it's pretty easy to keep track of a selected group of blogs. If Technorati worked, it might allow interesting views of the science blogosphere; fortunately, we have Postgenomic, which indexes nearly 700 science blogs and then "does useful and interesting things" with the data. For instance, you can see which papers and/or books are getting attention from science bloggers; there's even a Greasemonkey script that will flag Postgenomic-indexed papers in Connotea, Nature.com's social bookmark manager for scientists, another for PubMed and yet another for journal websites. A new Digg-like "community commentary" site, The Scientific Debate, allows trackbacks and so can interact with regular blogs. The discussion above about text mining applies, of course, to blogs, since they are typically openly accessible and friendly to text mining software. For instance, Biology Direct or PLoS One could interact with the blogosphere using linkbacks, or by pulling relevant posts from Postgenomic.
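The linkback mechanism is itself trivially simple: in the classic TrackBack version, the citing blog sends a small form-encoded POST to the journal's or aggregator's TrackBack URL. A minimal sketch, with an invented endpoint and invented post details:

```python
# Minimal TrackBack ping (per the Six Apart TrackBack specification): a
# form-encoded POST announcing that our blog post discusses the target item.
# The endpoint URL and post details are invented placeholders.
import urllib.parse
import urllib.request

ping = urllib.parse.urlencode({
    "title": "Comments on an interesting PLoS One paper",
    "url": "https://example.org/lablog/2007/01/comments-on-paper",
    "excerpt": "We tried to reproduce Figure 2 and ...",
    "blog_name": "Example Lablog",
}).encode("utf-8")

req = urllib.request.Request(
    "https://journal.example.org/trackback/article-id",  # invented endpoint
    data=ping,
    headers={"Content-Type": "application/x-www-form-urlencoded; charset=utf-8"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))  # the spec returns a small XML <response>
```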
Blogs also tend to create virtual communities, such as the one that centers on Seed's ScienceBlogs collection of, well, science blogs. This group of about 50 blogs is rapidly becoming a hub of the science blogosphere, and even gave rise to a recent meatspace conference that bids fair to become an annual event. Such self-selected communities foster a sense of camaraderie and strongly encourage co-operation over competition, which can only favor the advance of Open Science. (It's not just blogs, of course, that can take advantage of community building. The Synaptic Leap, the Tropical Disease Initiative, OpenWetWare and BioForge all provide infrastructures that enable collaborative communities to do Open Science.)
Finally, blogs (and wikis) have immense potential as a scientific publishing medium. They are, to begin with, the perfect place for things like negative results, odd observations and small side-projects -- research results for which the risk of having an idea stolen is greatly outweighed by both the possibility of picking up a collaboration and the importance of having made available to the research community information which would never surface in a traditional journal. Most research communities are relatively small; it would not be difficult for most researchers to keep up with the lab weblogs (lablogs?) of the groups doing work most closely related to their own. I know of a few blog posts in this category. This and this from Bora Zivkovic are, I think, the first instances of original data on a blog. This series from Sandra Porter is earlier but involves bioinformatic analysis (that is, original experimentation, but no original data), as do this and this from Pedro Beltrao. Egon Willighagen blogs working software/scripts for cheminformatics, and Rosie Redfield and her students blog hypotheses, thinking-out-loud and even data. Blogs are also good for sharing protocols, like the syntheses posted by the anonymous proprietor of Org Prep Daily.
Beyond that, it's possible to do fully Open Science, publishing day-to-day results (including all raw data) in an online lab notebook. I know it's possible because Jean-Claude Bradley is doing it; he calls it Open Notebook Science. His lab's shared notebook is the UsefulChem wiki, which is supplemented by the UsefulChem blog for project discussion and the UsefulChem Molecules blog, a database of molecules related to their work. There is nothing to prevent Jean-Claude from publishing traditional articles whenever he has the kind of "story" that is required for that format, but in the meantime all of his research output is captured and made available to the world. Importantly, this includes information which would never otherwise have been published -- negative results, inconclusive results, things which simply don't fit into the narrative of any manuscript he prepares, and so on. Being on a third-party hosted wiki, the notebook entries have time and date stamps which can establish priority if that should be necessary; version tracking provides another layer of authentication.
At the moment the Bradley lab is the only group I know of that is doing Open Notebook Science, but of all the glimpses of an Open Science world I have tried to provide in this entry, Jean-Claude's model is, I think, the clearest and most hopeful. Only when that level of transparency and immediacy is the norm in scientific communication will the research community be able to realize its full potential.
that's all, folks
I promise, no more obsessive posting about Open Science here on 3QD. If I've managed to pique anyone's interest, I recommend reading Peter Suber's Open Access News and anything else that takes your fancy from the "open access/open science" section of my blogroll. And as always, if I've missed anything or got anything wrong, let me know in comments.
....
This work is licensed under a Creative Commons Attribution 3.0 License.
Excellent article, one nitpicking detail: Tim Berners-Lee more than anyone else invented the web, but the browser was mostly invented at the U. of Illinois as Mosaic (one very tall undergraduate claimed main responsibility, although he was just there).
Posted by: John Garrett | Monday, January 22, 2007 at 02:22 PM
Bill,
Brilliantly informative and interesting! And this is a great close to your excellent trilogy of articles. Thanks very much.
Posted by: Abbas Raza | Monday, January 22, 2007 at 04:38 PM
Bill,
What a finale!!! I can't agree with you more. The key for "scientific discovery" will be in the open publishing space. I think it is a given that biological data is going to be in the public domain. The same is not yet true of publications, and more importantly of linking publications with openly available data. PLoS One is the start, and as more structure is built into publications and data is distributed in ways other than the ones currently known, the field might just explode. Hopefully, it won't be a case of 100 different data types and formats. That will end up being rather counterproductive.
Posted by: Deepak | Tuesday, January 23, 2007 at 12:18 AM
John: Mosaic is certainly the first browser I ever saw, but every reference I checked (see, e.g., the link I gave) says explicitly that TB-L "wrote the first web browser". I think it was called WWW, and was pretty rudimentary. If I have this wrong, please point me to a better reference and I'll change the link.
Deepak: it's always good to hear from another Open Science enthusiast. I worry that biological data won't be automatically Pub Dom, so I hope the sequence data model prevails. PLoS One is great; they don't yet link to raw data as I would like them to, but they are open to suggestions and I know they have already heard from several people about Open Data and are thinking in terms of the semantic web.
(Abbas, you're too kind. Other readers should know that this piece was late, and is too long, and still Abbas compliments me. All writers should have such editors.)
Posted by: Bill | Tuesday, January 23, 2007 at 11:32 AM
The Open Notebook thing is fascinating (if much better organized than my lab notebook). I've always joked that there should be a Journal of Negative Results; maybe this is the answer to it.
Impact factors surely need work, but I remain skeptical that the big, profitable journals ('Nature and its unholy spawn') will ever go fully OA, though I live in hope they'll go to a 6-month embargo. The small journal for which I edit just went OA, largely to be more searchable; it's surely a huge benefit to researchers, but not, perhaps, to publishers, alas.
Posted by: Jenny F Scientist | Tuesday, January 23, 2007 at 02:13 PM
Hats off, Bill - I was in the middle of it from the beginning (at CNRI with Bob Kahn and Vint Cerf), but didn't know that Tim wrote a browser, since nobody ever used it (including Tim) once Mosaic arrived. If Tim's here, would be interesting to have his comments.
Posted by: John Garrett | Tuesday, January 23, 2007 at 05:58 PM
Update/edit: fixed a goof in the first paragraph (Open Licensing has nothing to do with the semantic web!).
Posted by: Bill | Wednesday, January 24, 2007 at 12:51 AM
Bill,
Very thorough indeed and valuable for the Open Science movement. This should help with getting more people involved.
Posted by: Jean-Claude Bradley | Wednesday, January 24, 2007 at 04:24 AM
Hi Bill,
thanks for that very nice and comprehensive write-up. I just want to add another tool in the open access / open science world.
http://www.scientificcommons.org
We try to make all open access publications available in a lightweight interface and are working on ideas for the impact factor.
/lars
Posted by: Lars Kirchhoff | Wednesday, January 24, 2007 at 04:27 PM
Great article, Bill. I'm wondering, however, what you think of Nature's reports (doi:10.1038/445347a) that some of the big science publishers have hired PR Guy Eric Dezenhall (who worked for Enron among others) to combat this Open Science movement?
Not to mention the irony of Nature complaining about this.
--Simon
Posted by: Simon Greenhill | Thursday, January 25, 2007 at 05:57 PM
Lars, thanks, that's great. (I'm thinking of turning these three articles into a wiki or something update-able, so that I can keep adding new resources like SC.org.)
I have a couple of questions:
1. according to the front page, SC.org contains over 13.6 million publications. Since PubMed contains about 17 million, I guess you must index *all* sciences, not just biomed?
2. Any chance you could add a function? It's often useful to know where an article was published, and SC.org doesn't seem to display that information with retrieved articles (this also distinguishes pre- from post-prints).
3. Do you have a blog? I'd love to keep up-to-date with what you are doing with SC.org, your ideas on impact factors, and so on.
Posted by: Bill | Thursday, January 25, 2007 at 10:47 PM
Simon: there has been quite a lot of blog comment about the PR flack story; the general theme seems to be that, if the AAP is hiring this bottom-feeder, they must be running scared. That's pretty much my view. My favorite expression of it, to date, is Dorothea's.
Posted by: Bill | Thursday, January 25, 2007 at 10:53 PM
Oh, and in re: Nature, I know they haven't been exactly at the forefront of OA implementation, but they do have some neat stuff -- Connotea, their blogs, the recent dabbling in open peer review. I think Macmillan publishing might be old-school, but I know that there are some people at Nature who are very keen on OA/OS, and given time they'll get their way.
Posted by: Bill | Thursday, January 25, 2007 at 11:03 PM
Bill,
Yes I think that putting all of your research on Open Science on a wiki to update it is a great idea. Things change so quickly that it is really the only way to stay current. Also it would make it easier for others to contribute examples.
Posted by: Jean-Claude Bradley | Saturday, January 27, 2007 at 05:08 AM
Thanks for mentioning Naboj on your blog. The limitations of Naboj are mostly due to limited resources with which I have had to contend during its development, mostly in terms of time. I would be very glad to get some more developers working with me on it, and hopefully as the word spreads about it I can come in contact with people who would like to collaborate on what I think is a very important project for the future of scientific publishing.
Posted by: Bojan Tunguz | Tuesday, December 01, 2009 at 12:13 PM