Letter from the Sanger Centre
I am replying to the letter of 20 July addressed
to Dr John Sulston regarding the House of Lords Enquiry into Human
Genetic Databases. I am head of the Informatics Division at the
Sanger Centre, and hence responsible for databases. The Sanger
Centre is a genome research institute, funded primarily by a charity,
the Wellcome Trust. The Sanger Centre is itself operated by a
charitable limited company, Genome Research Limited, which is
legally seen as being controlled by the Wellcome Trust.
1. What current projects involve collecting
genetic information on people in the UK? What other projects are
about to start? Are there collections of material (eg tissue samples)
that could be used to generate databases of DNA profiles?
I am confining my answer to this question to
databases built or primarily maintained at the Sanger Centre.
Our major activity has been to contribute to
the public domain reference sequence of the human genome, of
which we are committed to finishing a third. The sequence is almost
all obtained from bacterial clones (BACs, PACs and cosmids) made
elsewhere, which each contain of the order of 100,000 bp of human
DNA that has been obtained from anonymised donors.
We have also been part of a consortium to identify
SNPs (Single Nucleotide Polymorphisms) which are points in the
genome where people differ genetically. In the near future we
will extend these studies to linkage disequilibrium analyses,
which look at the covariation of nearby SNPs, ie the extent to
which differences that are close in the genome occur together
in individuals. To support this we hold DNA from cell lines from
Another large-scale project that has started
during the last year is to identify somatic mutations that are
involved in cancer (somatic mutations are ones that are not inherited
from your parents, but which occur in the person getting the cancer),
by looking at material from tumours. We therefore hold collections
of tumour tissue and derived cell lines to support this work.
We also have a number of more minor projects,
for example to study the human MHC locus (Major Histocompatibility
Complex), which is one of the most variable regions of the human
genome, involved in the immune response and self-recognition.
Any sample of human DNA or tissue could be used
to generate a DNA profile. As described in (4) below, our primary
approach to ensuring privacy is to ensure that the sample is anonymised
so that any results cannot be traced back to the original donor.
2. Why are these genetic databases being assembled?
How are these activities funded? What practical considerations
will constrain developments? Are there alternative ways of fulfilling
The reference sequence of the human genome is
being collected as a fundamental reference material for research
on human molecular biology and genetics. It will underpin much
of the future biomedical research worldwide. Our contribution
to the sequence is funded by grants, primarily from the Wellcome
Trust with a relatively minor contribution from the Medical Research
Council. By making the sequence freely available without any attached
IPR restrictions we aim to avoid constraints on future use. The
human genome sequence is so fundamental that reading it, and its
variations across the human population, will be important to a
very wide range of applications, many of which cannot now be envisaged.
Given this, alternative approaches such as relying on a private
provider who places constraints on use are not appropriate, since
it would not be desirable for any one body to control the resource.
The work on SNPs and linkage disequilibrium
will help provide a foundation for a detailed understanding of
human population structure, and contribute to future genetic disease
mapping. This is funded via a consortium of pharmaceutical companies
with the Wellcome Trust, which is committed to making the results
publicly available without constraints.
The work on cancer variation will contribute
to our understanding of cancer, and thus hopefully to its cure.
In particular it may help stratify cancer types that by other
criteria appear identical. It is funded by the Wellcome Trust.
Like the other projects, this work will aid fundamental biomedical
science, which is a primary aim of the Wellcome Trust.
3. What is the genetic information that is
being collected? How is it being stored and protected?
For the reference sequence, the primary information
being collected is the sequence of nucleotides ("A"s,
"G"s, "C"s and "T"s) which is copied
to multiple places all over the world. The physical clones from
which the sequence is obtained come from widely distributed libraries,
for which individual clones of interest can be obtained from a
variety of sources.
For SNP, linkage disequilibrium and cancer studies
the information that is collected is the position and nature of
variations from the reference sequence in individual (anonymised)
samples. This information is again stored in computer files and
databases, which at the appropriate time will be made publicly
available.
4. How do the organisations involved see their
responsibilities regarding privacy; consent; future use; public
accountability; and intellectual property rights?
We believe that the correct model for as many
resources as possible that are used for broad research purposes
is to anonymise the underlying samples so that they cannot be
referred back to the original donors, and to obtain very broad
consent for future use on that basis. This approach ensures privacy,
and does not impede future research. Since the Sanger Centre is
involved primarily in basic research, this is the approach we
prefer to take. Our human sequencing and human variation projects
have so far always used anonymised sources.
Of course, full anonymisation is not appropriate
where studies are undertaken that might involve going back to
the donor, for example to change the care of a patient involved
in a trial, or where further information may be required later
in a study (such as prospective long-term studies). We have less
experience with such cases, but cannot rule them out for the future.
In this case, specific fully informed consent is appropriate,
with further consent for any use that arises not covered by the
initial consent. Also, anonymisation should be used as far as possible,
so that access to information concerning the patient is minimised.
Finally, all data that might lead to identification (on computers
or paper) must be kept secure.
As an institute funded by grants mainly from
charity or public agencies we believe that we should be accountable
for our activities and indeed we produce reports on the progress
of our grant-funded activities. Furthermore, the results of our
research are made available in a timely fashion by publication
and/or distribution across the Internet.
Concerning intellectual property rights (IPR),
we believe that genomic sequence is a fundamental resource that
is precompetitive and should be made available without IPR attached.
However, we recognise that IPR is important when materials or
data are close to being converted into products such as pharmaceuticals,
where protection is needed for the substantial investment in product
We expect that for some programmes at the Sanger
Centre where results tell us more about function and are closer
to therapeutic application we will take out IPR. But we also expect
that most licensing terms would be non-exclusive, with exclusive
licences only for material clearly close to creating a product
that required substantial resources to bring to market.
Our approach to the patenting of genes is that
this is reasonable where a significant function has been directly
established, in which case the patent should cover the application
of that function and natural extensions, rather than all possible
uses of the gene, including those not yet envisaged or merely
speculative, as granted in
composition of matter patents. The reason for this position is
that we are in a great state of ignorance. Almost nothing is known
about almost all genes. Granting rights over all possible applications
of a gene, as has happened, creates a disincentive to research
by others both in the industrial and, to an increasing extent,
Unfortunately there has been a "land grab"
for rights to genes based on sequence in the absence of clear
functional assignment. An example of how this can lead to surprising
results is the CCR5 gene. Human Genome Sciences (HGS) filed a
speculative composition of matter patent application on CCR5 based
on its sequence and similarity to a broad class of receptors.
Subsequently researchers at NIH established independently that
CCR5 was a coreceptor for HIV, the AIDS virus, and an important
therapeutic target. HGS updated their patent application and were
awarded a patent with complete rights to all uses including for
AIDS research. However, almost all the investment of time and money,
and the inventiveness, was made by the NIH researchers. This type
of situation has a negative effect on research into gene function.
We therefore encourage this Committee, and the
Government, to argue for a narrow interpretation of patent rights
over genes. This is currently a matter of debate for the EPO (European
Patent Office). The main counter-argument to this is that the
USPTO (US Patent and Trademark Office) is still giving broad rights,
although they have tightened criteria somewhat recently. We must
of course maintain a competitive position with respect to the
United States. Any pressure that can be brought to bear on the
USPTO would be very positive.
Finally, the IPR focus in genetic research is
shifting from materials that can be patented to information in
databases. Much of this information is secondary, derived by analysis
and annotation that combines multiple sources. In this case we
believe that there is a creative element, and copyright-type rights
are appropriate to potentially allow recovery of costs of forming
the data collection. These should not extend to give rights over
others who reach similar conclusions independently. There is a
new European database copyright which may be interesting in this
regard, but its effectiveness and use have not yet been fully
established. Notwithstanding this position, for publicly or charitably funded
research we believe in most cases the most effective way to disseminate
results is to write off the cost of producing the data, and place
the resulting database in the public domain without IPR constraints.
This allows maximally effective use of the results without complications,
and is the approach taken by us for all the databases we generate
at the Sanger Centre.
5. How do they see their activities in the
area of genetic databases developing in the future? What advances
in sequencing, screening and database technology are they anticipating?
We see primary genomic DNA sequence as being of very
high value to biological research.
Even though the human genome is mostly sequenced,
we expect demand to continue to grow in the medium term (five
years), primarily for sequencing other organisms of research or
We expect the demand for genetic information
on human sequence variation to grow enormously on the same time
scale, partly to study the contribution of inherited genetic variation
to disease, and partly to study the somatic variation (variation
within an individual's body) that underlies cancer. We are directly
involved in studies of both types, which we expect to grow. Similar
studies will be required to analyse genetic traits in organisms
of economic importance, which will help traditional breeding as
well as provide potential information for genetic modification,
and in model organisms, which will help understanding of basic
biology that will inform medical research.
Under the influence of these forces, automation
and efficiency will continue to improve, reducing costs progressively.
We do not foresee a dramatic reduction in costs per unit of information
on the five year timescale, but internationally we expect capacity
to continue to increase.
Demands on data management will grow dramatically,
however. First, the amount of primary data is increasing faster
than Moore's law for computer efficiency growth (two-fold increase
in 18 months, valid for the last 30 years). Second, much of the
information of interest will be secondary, derived from the primary
information, and in many cases increasing in quantity faster than
linearly with respect to the primary data; ie if the amount of
sequence doubles, the amount of secondary information more than
doubles.
6. What lessons should be learnt from genetic
database initiatives in other countries?
For the reasons given at the end of the last
section, it is now even more important to give adequate support to
central resources for the management and presentation of genetic
information to the research and development community. We should
learn from the USA, which has funded a central national facility,
NCBI (National Center for Biotechnology Information), which has
become a world leader for management of biological information.
However, there is a serious economic and political risk in allowing
one country to take sole charge of such an important resource.
It is also important for there to be serious competition to NCBI
to maintain quality and responsiveness in a changing field. In
our view UK interests are best served by strongly supporting the
EBI (European Bioinformatics Institute, sited in England at Hinxton,
next door to the Sanger Centre) as the natural partner/competitor
to NCBI. We encourage full support by the British Government for
substantially increased funding of the EBI through both EC and
EMBL (European Molecular Biology Laboratory) channels (the NCBI
budget is currently around $30 million per year, more than twice
that of EBI). In addition to this action on a European scale (but
sited in Britain), there is also a need for a good national computing
network and infrastructure to deliver the relevant information
to biologists' desktops.
Head of Informatics
3 October 2000