Select Committee on Science and Technology Written Evidence


Memorandum by Graham Cameron and Michael Ashburner, writing as individuals

  1.  The activities of the European Bioinformatics Institute (http://www.ebi.ac.uk) are marginal to the specific questions asked by this Inquiry. However, because the EBI is the European source of all nucleic acid sequence data, both human and otherwise, the Inquiry needs to understand the role of the EBI, and of its sister institutes in Japan and the USA, in distributing human genetic sequence data.

2.  GENETIC SEQUENCE DATA: THE ROLE OF THE DATA LIBRARIES

  2.1  As the Committee will be aware, techniques for sequencing DNA became available in the late 1970s and have become increasingly facile since then. In the very early 1980s it became apparent to researchers at the EMBL, Heidelberg that DNA sequence data required archiving in a computer readable form. These data are, on the one hand, not particularly suited to conventional scientific publication and, on the other hand, can best be analysed by computer programs. It seemed, therefore, very sensible to build a database in which all available sequence data could be made freely available to all, deposition of sequences in this database obviating the need for them to be published in scientific papers. This was the genesis of the EMBL Nucleotide Sequence Data Library, first released in June 1982 with 585,433 base pairs of sequence (568 database entries); in the last week of September 2000 the Data Library (now EMBL-Bank) passed 10 billion base pairs (>8,766,800 entries).

  2.2  Soon after the foundation of the EMBL Nucleotide Sequence Data Library a similar initiative was launched by the US National Institutes of Health (leading to the foundation of Genbank) and, in 1984, a third was launched in Japan, leading to the foundation of the DNA Data Bank of Japan. Since the mid-1980s these three data libraries have been very closely integrated. Scientists may submit new data to any one of the three and these data are then exchanged with the other two partners every day. At present the International Nucleic Acid Sequence Data Library includes 10,061,977,000 base-pairs of sequence, from >50,000 different organisms (from small viruses to human) submitted by over 100,000 different scientists. For publication of new scientific information in the literature which references new DNA sequence data it is now essentially mandatory to have submitted the sequences themselves to the Data Library.

  2.3  All of the sequence data in the Data Library are open and freely available to all without let or hindrance. The data are neither secret nor copyrighted; some data may well be covered by patent, since data appearing in the patent literature are included in the Data Library.

  2.4  All of the data from the "Human Genome Project" are included in the Data Library. In the UK, for example, there is, in effect, a direct pipe from the major sequencing centre, the Sanger Centre, into the Data Library at the EBI next door. At the time of writing, some 62 per cent of the 10 billion base pairs is human sequence.

  2.5  When submitting new sequence data to the Data Library the scientists will "annotate" the sequence in some way. At the very least this annotation must indicate the source of the new sequence, for example the species. In the majority of cases the annotation puts the sequence in its scientific context. Sequences may well be derived from human patients (for example there are over 40,000 different patient derived HIV sequences in the Data Library). As providers of the Data Library neither the EBI, nor its partners, have established standards for what patient data may or may not be included in such submissions. Our role is weakly analogous to that of the University Librarian—those who submit the sequences (that is, the writers of the books) are legally and ethically responsible for their content. Of course, should it be brought to our attention that a submission includes information that is unethical or libellous, then our duty would be to inform the submitter and withdraw the sequence from the Data Library. (In fact, this has never happened, though sequences have been withdrawn on scientific grounds, eg that they were simply wrong.)

  2.6  Without the primary nucleic acid sequence data library modern biological research would be impossible. The data in the data library belong to the scientific community and it is the determination of the EBI, and of its sister institutes, that these data should continue to be available to all, without constraint or restraint. We see no reason whatsoever for these data to be subject to regulation, and every reason for the current policies to continue. We point out that this policy of complete openness can, and perhaps should, be exploited in the context of public attitudes to human genetic data.

3.  FUNDING THE DATA LIBRARIES

  3.1  The three international institutes that, collaboratively, collect and distribute the primary nucleic acid sequence data are all publicly funded: the US Genbank project by the US National Institutes of Health, the Japanese DDBJ by the Ministry of Education, Science, Sports and Culture and the European Bioinformatics Institute by the 16 member governments of the European Molecular Biology Laboratory (in the case of the UK, this is through the Medical Research Council budget).

4.  ANNOTATED HUMAN GENETIC SEQUENCE DATA

  4.1  The primary sequence data from the Human Genome Project are barely annotated. This is typical of the information in the primary "working draft" record of a 175,000 base pair sequence from the Sanger Centre:

    Feature  /organism="Homo sapiens"

    Feature  /clone="RP11-575B7"

    Feature  /clone_lib="RPCI-11.2"
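
The qualifiers quoted above follow a simple /name="value" pattern. A minimal sketch of how a program might read such lines is given here, assuming each line carries exactly one qualifier in that form; the names RECORD_LINES and parse_qualifiers are hypothetical and are not part of any Data Library software:

```python
import re

# The three qualifier lines quoted in paragraph 4.1.
RECORD_LINES = [
    'Feature  /organism="Homo sapiens"',
    'Feature  /clone="RP11-575B7"',
    'Feature  /clone_lib="RPCI-11.2"',
]

# Assumed pattern: a slash, a qualifier name, an equals sign, a quoted value.
QUALIFIER = re.compile(r'/(?P<key>\w+)="(?P<value>[^"]*)"')


def parse_qualifiers(lines):
    """Return a dict mapping qualifier names to their quoted values."""
    qualifiers = {}
    for line in lines:
        match = QUALIFIER.search(line)
        if match:  # lines without a qualifier are skipped
            qualifiers[match.group("key")] = match.group("value")
    return qualifiers


annotation = parse_qualifiers(RECORD_LINES)
for name, value in annotation.items():
    print(f"{name}: {value}")
```

The result holds only the source organism and the clone identifiers, which illustrates how little biological context the primary record carries.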

  4.2  "Annotation" is the task of putting this sequence into its biological context. In its best form annotation will be the product of both computational and expert human analysis. An "annotated" sequence will have its context with its neighbouring sequences established and will have been analysed with respect to regions which may (at least) code for proteins and other features.

  4.3  There are a few efforts world-wide to automatically annotate the emerging human genome sequence. The first and best established of these is called Ensembl, a joint project between the EBI and the Sanger Centre at Hinxton, largely funded by the Wellcome Trust.

  4.4  Ensembl (http://www.ensembl.org) tracks all of the primary sequence data from the Human Genome Project. Viewing these data as a large jigsaw puzzle Ensembl is developing computer programs to assemble the individual sequences into a larger whole and to analyse these for features of biological interest.

  4.5  Ensembl is a part of a larger international effort (see http://www.ensembl.org/genome/central/) to bring annotations on the Human Genome Sequence to the public. Ensembl is, like the nucleic acid sequence data library, open and public to all. Neither the data nor the computer software are subject to any restriction.

  4.6  Those developing and supporting Ensembl, and similar projects in the USA, are confident that they can provide information concerning the human genome that is at least as good, and probably better, than that being offered by commercial companies, such as Celera, Incyte and DoubleTwist in the USA. We are also convinced that such information must never become the property of any single institution, be that institution public or private.

5.  HUMAN MUTATION DATABASES

  5.1  The EBI is heavily involved in an international collaboration to make public data concerning the genetic basis of human disease and variation. There are, internationally, nearly 100 different databases that include information concerning the specific genetic basis of human disease. Typically, each database is specific for one disease eg the Haemophilia B Mutation Database at Guy's Hospital. Each of these databases includes patient data, since they represent the actual nucleic acid sequence of an individual patient. For example, in the Haemophilia B Mutation Database it is recorded that patient "UK 232" has a particular nucleotide base pair change in his (or her) Haemophilia B gene. The way in which patient anonymity is respected is, in these databases, a matter for their curators; all, however, fully understand the need for such protection; indeed we see no problems in this area over and above the well understood needs to protect the privacy of patients and their relatives.

  5.2  In the end it may well be possible for a very determined person to break anonymity if contextual data for patients are public. The dilemma facing scientists, of course, is that, completely stripped of any context, such data may lose much of their value. For example, it is clearly important that scientists can analyse different mutations in the same gene in the context of the particular phenotypes of those patients carrying the mutations. Perhaps some classes of mutation have a much more severe phenotype (or worse prognosis) than others. In the case of very rare genetic diseases, the inclusion of data describing a clinical condition could well allow the particular patient to be identified.

  5.3  The EBI is both a member of the world-wide HUGO administered "Mutation Database Initiative" and the producer of an integrated resource, the Sequence Variation Database (http://www.ebi.ac.uk/mutations). This project is funded from EMBL sources.

  5.4  There is a real danger that such information, though usually obtained by scientists funded by public monies, may end up in the private domain and then be subject to licence. For example the Human Gene Mutation Database in Cardiff (http://www.uwcm.ac.uk/uwcm/mg/hgmd0.html) has recently signed an agreement with Celera that limits access to the data it includes (see Bioinform 4(19), 18 September 2000). This trend is very regrettable as it means access to data, the great majority of which are discovered by publicly funded scientists, is preferentially available to commercial interests, some of which may (in effect) be monopolistic.

  5.5  Human mutation data typically include data associated with particular disease. A new class of human genetic data is now being collected—the variation in nucleotide sequence that affects every individual in the world (other than identical twins); two individuals chosen at random will differ in the base pair at about three million different positions in their genomes (about 0.1 per cent). These data (so-called SNPs, Single Nucleotide Polymorphisms) are being collected both in the public and private domains. In the public domain the "SNP Consortium" (http://snp.cshl.org) and many others are providing data on a large number of human polymorphisms. In the private domain Celera have recently announced the release of data on 2.8 million SNPs (see http://www.pecorporation.com/press/prccorp091300.html) (of which 400,000 are from public data).
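
The proportions quoted in paragraph 5.5 can be checked with simple arithmetic. The sketch below assumes a human genome of roughly three billion base pairs, a figure that is not stated in this memorandum; the constant and function names are illustrative only:

```python
# Assumption (not stated in the memorandum): the human genome is
# roughly three billion base pairs long.
GENOME_SIZE_BP = 3_000_000_000

# Figures taken from paragraph 5.5.
DIFFERING_POSITIONS = 3_000_000   # positions at which two individuals differ
CELERA_SNPS = 2_800_000           # SNPs in the announced Celera release
CELERA_SNPS_FROM_PUBLIC = 400_000  # of which derived from public data


def percentage(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


variation = percentage(DIFFERING_POSITIONS, GENOME_SIZE_BP)
public_share = percentage(CELERA_SNPS_FROM_PUBLIC, CELERA_SNPS)

print(f"Positions differing between two individuals: {variation:.1f} per cent")
print(f"Celera SNPs drawn from public data: {public_share:.0f} per cent")
```

Under that assumption, three million differing positions correspond to about 0.1 per cent of the genome, and roughly one in seven of the SNPs in the Celera release comes from public data.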

  5.6  Scientists have handled data on genetic polymorphisms in humans for many decades. There is, for example, an enormous amount of information available concerning the frequency and distribution of different alleles that code for the blood group substances, as well as for many other human polymorphisms (see http://human.stanford.edu/). These data have been of extraordinary interest and importance to human biologists. There is the hope that these classical studies will be followed by studies of SNPs (see http://satori.stanford.edu/institute.html), although proposals for surveys have yet to gain universal approval. Typically such data, if to be used for the study of human genetic and cultural diversity, need not be attributable to an individual; but they do need to be attributable to a community or population. The danger that these data will be used to discriminate between populations is real, but no different in principle from the danger posed by protein polymorphism data. The danger that the sampling of populations to obtain such data will be used to exploit genetic diversity for commercial or even public benefit is also real, but should be obviated by well written ethical agreements.

  5.7  The main justification for determining a very large number of human SNPs is their potential for the analysis of complex human diseases, diseases that might have a multifactorial basis (see Nature 407:516, 28 September 2000). It is here that there is a direct conflict between commercial and public interests. We see neither reason nor method to prevent commercial interests from obtaining and exploiting such data; what is, however, vital is that the public domain is funded to compete with these interests at a realistic level. It is only then that the public good will be best served by the exploitation of human genetic data, since they promise enormous benefits to human health.

Graham Cameron
Joint Head, European Molecular Biology Laboratory—European Bioinformatics Institute

Michael Ashburner FRS
Joint Head, European Molecular Biology Laboratory—European Bioinformatics Institute and Professor of Biology, Department of Genetics, University of Cambridge

2 October 2000


 

© Parliamentary copyright 2000