Computational techniques analyze data from the Human Genome Project


A new discipline has emerged at the intersection of computer science and biotechnology, bringing the power of advanced computational techniques to bear on complex problems in molecular biology. Called bioinformatics or computational biology, this new field is providing essential tools for scientists on the leading edge of research in genetics and other fundamental areas of biology.

Gene-sequencing efforts such as the Human Genome Project, combined with new techniques for studying the activity of genes in living cells, are generating enormous amounts of raw data. These data are accumulating at a rapidly accelerating pace in a variety of public computer databases, such as those maintained by the National Center for Biotechnology Information at the National Institutes of Health.

"The driving force behind bioinformatics is the availability of these large databases and the need to come up with sophisticated computer models for extracting useful information from them," said David Haussler, professor of computer science at the University of California, Santa Cruz.

Haussler discussed the use of computational techniques to analyze genetic data in a talk Saturday (February 19) at the annual meeting of the American Association for the Advancement of Science in Washington, D.C.

Haussler, who directs UCSC's Center for Biomolecular Science and Engineering, recently joined the Human Genome Project's bioinformatics team. Bioinformatics is playing an increasingly important role in the project, an international effort to identify and understand all of the roughly 100,000 human genes.

"Computer analysis will be an integral part of identifying genes and understanding their functions," Haussler said.

The set of genetic instructions for making an organism--its genome--is contained in long, threadlike DNA molecules neatly packaged into chromosomes within the nucleus of every cell. The sequence of chemical units in the DNA is a kind of code that specifies the structures of protein molecules, which carry out most of the functions of living cells.
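
To make the idea of DNA as a code concrete, here is a minimal Python sketch of how a coding sequence is read three letters (one codon) at a time and translated into amino acids. The table is deliberately partial and the function name `translate` is an illustrative choice; the full genetic code has 64 codons, and any codon missing from this small table is reported as "X".

```python
# Partial codon table for illustration only; the full genetic code has 64 codons.
CODON_TABLE = {
    "ATG": "M",  # methionine, also the usual start signal
    "GCT": "A",  # alanine
    "AAA": "K",  # lysine
    "TGG": "W",  # tryptophan
    "TTT": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "TAA": "*", "TAG": "*", "TGA": "*",  # stop codons
}


def translate(dna):
    """Translate a DNA coding sequence into one-letter amino acid codes."""
    protein = []
    # Step through the sequence one complete codon (three bases) at a time.
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3].upper(), "X")
        if amino_acid == "*":  # a stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCTAAATGGTAA"))  # MAKW
```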

The complete DNA sequence of the human genome, if compiled in books, would fill 200 volumes the size of the Manhattan telephone book. Human Genome Project scientists are close to having a rough draft of this sequence, but that will only be a first step. Buried within the genome sequence are the genes--DNA sequences that encode specific proteins--which ultimately determine all the inherited characteristics of humans.

Locating genes within genomic DNA sequences is one of the first tasks for which scientists have turned to bioinformatics. Protein-coding gene sequences are thought to make up less than 10 percent of the human genome. Interspersed with the genes are control sequences, which regulate gene activity, and other "noncoding regions" whose functions are obscure.

Haussler and his coworkers at UC Santa Cruz have developed some of the most effective computational techniques for finding genes in DNA sequences. They introduced a now widely used statistical method called hidden Markov modeling to attack this problem.
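
The general idea behind hidden Markov modeling can be sketched in a few lines of Python. The toy model below is an assumption made purely for illustration, not the model used by Haussler's group: it has only two hidden states, "C" (coding) and "N" (noncoding), with made-up probabilities in which coding regions favor G and C and noncoding regions favor A and T. The Viterbi algorithm then finds the most probable sequence of hidden states for a given DNA string. Practical gene finders rely on far richer models whose parameters are estimated from known genes.

```python
import math

# Hypothetical two-state model: N = noncoding, C = coding.
# All probabilities are invented for illustration, not estimated from real data.
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def viterbi(seq):
    """Return the most probable state path (a string of N/C labels) for seq."""
    if not seq:
        return ""
    # Log-probability of the best path ending in each state at the first base.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state to recover the whole path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("ATATATATGCGCGCGCGC"))  # NNNNNNNNCCCCCCCCCC
```

In this toy run, the model labels the A/T-rich stretch as noncoding and the G/C-rich stretch as coding, switching states exactly once because each switch carries a probability penalty.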

To analyze the rough draft of the human genome sequence, Haussler is working closely with researchers at the Massachusetts Institute of Technology's Whitehead Institute. The Whitehead Institute is one of five major sequencing sites involved in the Human Genome Project.

Working with the rough draft, however, will be a monumentally difficult task, Haussler said. "The problem is that the rough draft does not provide a continuous DNA sequence across each chromosome--many regions of the genome are covered only by small pieces," he said.

The first task Haussler and the Whitehead group are tackling is to line up all of the segments of the human genome sequenced so far in their proper order and orientation along the chromosomes. The next step will be to locate genes within the genome sequence. This will be done in collaboration with Neomorphic, a Berkeley-based genomics company, using a computer program called Genie.

Genie was initially developed by Haussler's group and researchers at the Lawrence Berkeley National Laboratory (LBNL). It was exclusively licensed and further developed by Neomorphic, which was founded by a group of scientists from LBNL, UC Berkeley, and UCSC. Genie was recently used to identify genes in the genome of the fruit fly, Drosophila melanogaster, which was sequenced last year. Neomorphic is now developing a new version of Genie optimized for the rough draft of the human genome sequence.

Research on the genetics of organisms such as Drosophila, yeast, and the roundworm Caenorhabditis elegans has helped lay the groundwork for studying the much more complex genome of humans. Many human genes are closely related to genes found in these simpler organisms, which are widely used as model systems for research in genetics and molecular biology. Studies of these model organisms have already yielded many valuable insights into gene functions, normal gene regulation, genetic diseases, and evolutionary processes.

According to Haussler, the role of bioinformatics in this type of research is steadily increasing as experimental methods become more sophisticated. DNA microarrays or "gene chips," for example, provide valuable information about gene expression--when, where, and to what extent specific genes are active. This information is critical to understanding a gene's biological function. But gene chips, like genomic-sequencing technology, produce enormous amounts of data that can only be analyzed and understood using sophisticated computational approaches.

"There is a lot of information pertaining to gene function that is becoming available as a result of large-scale experiments using gene chips and other methods, which generate massive datasets relating to the functions of thousands of genes," Haussler said.

To analyze these complex datasets, Haussler is pioneering the use of a new statistical method based on the theory of support vector machines (SVMs). SVMs can handle high-dimensional datasets, in which each data point has many features or attributes.

"It's hard to visualize because we live in a three-dimensional world, and we're talking about analyzing datasets in ten thousand or more dimensions. But we're finding SVMs extremely useful for gene chip data," Haussler said.
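
As a rough illustration of how an SVM separates two classes of data points, the Python sketch below trains a simple linear SVM by subgradient descent on the regularized hinge loss. Everything in it is an assumption made for the example: the four-dimensional "expression profiles" in `PROFILES`, their labels, and the learning-rate and regularization settings are invented, and the code is not the method used by Haussler's group. Real gene chip datasets have thousands of dimensions, but the procedure is the same in any number of dimensions.

```python
# Invented expression profiles: each row is one gene measured in four experiments.
PROFILES = [
    [2, 2, 0, 0], [3, 2, 0, 1], [2, 3, 1, 0],  # hypothetical class +1
    [0, 0, 2, 2], [0, 1, 3, 2], [1, 0, 2, 3],  # hypothetical class -1
]
LABELS = [1, 1, 1, -1, -1, -1]


def train_linear_svm(samples, labels, lr=0.01, lam=0.01, epochs=200):
    """Fit weights w and bias b by subgradient descent on the hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point is inside the margin: move the boundary toward
                # classifying it correctly, with a small shrinkage term.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is safely classified: apply only the regularization.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


w, b = train_linear_svm(PROFILES, LABELS)
print([predict(w, b, x) for x in PROFILES])
```

The hinge loss penalizes only points that fall inside the margin, which is why the learned boundary ends up depending on the points nearest to it rather than on every data point equally.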

Genomic sequencing and gene chips represent what Haussler calls "high-throughput genomic technologies," powerful new techniques for understanding molecular biology. The use of these techniques is increasing, and all of them present significant computational challenges. One of Haussler's goals is to develop new statistical and algorithmic methods for integrating these diverse types of genomic data.

For the moment, analyzing the rough draft of the human genome sequence is the focus of Haussler's efforts. But in the long run, he foresees a happy and prosperous future for the marriage of computer science and molecular biology. The application of human genomics to areas such as drug discovery and clinical diagnostics, for example, will undoubtedly require new computational methodologies, he said.

"Our vision for bioinformatics spans a broad spectrum, from basic molecular biology all the way up to clinical diagnostics," Haussler said.