Stanford Genome Technology Center
ADDITIONAL CONSIDERATIONS CONCERNING OUR DATA RELEASE POLICY
We have received requests asking us to amplify upon our data release policy.
In response, we are posting the following letter to the editor of Science, published in the 2 February 2001 issue (volume 291, page 827).

Sequence Data: Posted vs. Published

Large-scale DNA sequencing projects take a considerable amount of time to complete, including 2 to 3 years for the final or "finishing" stage. This fact is not always appreciated by those who are not directly involved in such efforts, as appears to be the case with Elaine Bell in her letter (1 Dec., p. 1696) about the Policy Forum by Lee Rowen et al., "Publication rights in the era of open data release policies" (15 Sept., p. 1881). In her letter, Bell, as editor of Immunology Today, discusses what factors were considered in the decision to publish two articles that contained information from publicly available sequence data that had not been previously published. A major factor, according to her, was the length of time that the primary sequence had been available in the public domain. But the time referred to, about a year, is not adequate for such projects given the nature of the work involved.

Large-scale sequencing projects can be divided into three unequal stages: (i) random (shotgun) sequencing (a relatively fast process); (ii) assembly of the shotgun data, done many times during the course of the project; and (iii) finishing. During finishing, physical gaps in the sequence are closed, ambiguities in the sequence are resolved, contaminating sequences are removed, and errors in the sequence are identified and corrected. Finishing is a slow process, often taking 2 to 3 years for large sequencing projects. Thus, the almost complete sequence will be available for an extended length of time while the sequence is completed and published.

Posted sequence (from stages two and three), as well as sequence found in the GenBank database, is easily distinguished from published sequence. The posted sequence is often incomplete, might contain errors and contamination, and has not gone through peer review. In fact, the high-throughput genome sequence section of GenBank was established precisely to contain sequences not yet sufficiently complete and secure to be published. Thus, posted sequences are public, but they are not thereby automatically in the public domain.

Unpublished sequence should be treated as are all other unpublished scientific data. Therefore, a third party who wants to publish an analysis of other scientists' unpublished sequence should obtain the written consent of those other scientists. Absent that consent, that third party would be committing a "misappropriation of data" as defined by the National Institutes of Health (NIH) (http://ori.dhhs.gov/html/misconduct/regulation.asp). As such, misappropriation of data is one of the NIH definitions of plagiarism: "As a general working definition, [Office of Research Integrity] considers plagiarism to include both the theft or misappropriation of intellectual property and the substantial unattributed textual copying of another's work." Plagiarism is one definition of fraud in science.

Richard W. Hyman

Stanford Genome Technology Center (Department of Biochemistry), Stanford University, 855 California Avenue, Palo Alto, CA 94304, USA. E-mail: rhyman@stanford.edu

