Tech Creates Self-Training Gene Prediction Program

Researchers at the Georgia Institute of Technology have developed the first-ever computer program capable of training itself to predict genes in genomic DNA sequences of eukaryotic organisms such as animals, plants and fungi. The software program, GeneMark.hmm-ES, may help researchers save a year or more in a genome sequencing and interpretation project. The program is a new addition to the family of GeneMark gene prediction programs developed at Georgia Tech and is freely available to academic researchers.

Currently, there are more than 600 ongoing genome sequencing projects for eukaryotes, organisms whose cells carry nuclei. Decoding the DNA sequences that come out of even a single genome project is an enormous task. Still, unraveling the genetic code of living creatures allows scientists to understand the details of the cellular machinery. This knowledge helps generate ideas for a variety of future research directions. Understanding the specific features of individual genomes may lead to the development of personalized medicine, while comparing the genomes of related species can help scientists trace their evolution.

"The genomic sequence is a foundation and blueprint of molecular cellular networks and processes which dynamics need to be reconstructed to understand how the cell works. These networks are specific for each organism, so once you know the list of the genes, you start to assemble all the parts into a picture," said Mark Borodovsky, Regents' professor in the School of Biology and the Department of Biomedical Engineering, and director of the Center for Bioinformatics and Computational Genomics at Georgia Tech.

Borodovsky developed the first version of GeneMark in 1993. In 1995, this program was used by Craig Venter and his Institute for Genomic Research to find genes in the first-ever completely sequenced genomes of organisms representing the two prokaryotic domains of life, bacteria and archaea.

A self-training version of the gene-finding program for prokaryotic genomes was created by Borodovsky's group in 2001. Since 1998, GeneMark programs have also been frequently used for gene finding in eukaryotes, particularly in plant genomes such as rice. To date, the use of GeneMark programs by researchers around the globe has been recorded in the discovery of more than 400,000 genes in various genomes, from viruses and bacteria to rice and humans.

Now Borodovsky and his team at Georgia Tech have taken a leap forward and built a program that can train itself to make accurate gene predictions in the numerous newly sequenced genomes of eukaryotes. The program uses established general principles of genetic code organization - adjusted to the general compositional features of a particular genome - to help identify at least a few regions of the anonymous genome that contain protein-coding sequences. Once it has these initial predictions, the program separates the coding and non-coding sequences. This clustering allows machine-learning techniques to refine the parameters of the recognition algorithm to the specific patterns found in the newly identified protein-coding sequences. The prediction and training steps are then repeated, each time detecting a larger set of true coding sequences that are used to further improve the model employed in statistical pattern recognition. The last run, in which the prediction step yields nothing new, produces the final set of predicted genes.
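The predict-and-retrain loop described above can be illustrated with a deliberately simplified sketch. This is not the GeneMark.hmm-ES algorithm: the seed rule (no in-frame stop codon), the codon-frequency models, and every function name below are assumptions made only for illustration. The sketch shows the general shape of the idea: make a rough first guess, train coding and non-coding models on that guess, relabel, and stop when a round changes nothing.

```python
import math
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}

def codons(seq):
    """Split a sequence into frame-0 codons, ignoring any trailing bases."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def initial_guess(seq):
    """Hypothetical seed rule: call a window coding if it has no in-frame stop."""
    return not any(c in STOP_CODONS for c in codons(seq))

def train(seqs, pseudocount=1.0):
    """Estimate smoothed codon log-probabilities from a set of sequences."""
    counts = Counter(c for s in seqs for c in codons(s))
    total = sum(counts.values()) + pseudocount * 64  # 64 possible codons
    return lambda codon: math.log((counts[codon] + pseudocount) / total)

def score(seq, model):
    """Log-likelihood of a sequence under a codon model."""
    return sum(model(c) for c in codons(seq))

def self_train(windows, max_rounds=20):
    """Alternate training and prediction until the labels stop changing."""
    labels = [initial_guess(w) for w in windows]
    for _ in range(max_rounds):
        coding = [w for w, is_coding in zip(windows, labels) if is_coding]
        noncoding = [w for w, is_coding in zip(windows, labels) if not is_coding]
        if not coding or not noncoding:
            break  # cannot train both models from a one-sided split
        coding_model, noncoding_model = train(coding), train(noncoding)
        new_labels = [score(w, coding_model) > score(w, noncoding_model)
                      for w in windows]
        if new_labels == labels:
            break  # nothing new at the prediction step: converged
        labels = new_labels
    return labels

# Toy data (made up): two gene-like windows, two with stops, one ambiguous.
windows = [
    "ATGGCTGCTGCTGCT",
    "ATGGCTGCCGCTGCT",
    "TTTTAATTTTAGTTT",
    "TAATTTTTTTGATTT",
    "TTTTTTTTTTTTTTT",
]
seed_labels = [initial_guess(w) for w in windows]
final_labels = self_train(windows)
print(seed_labels)
print(final_labels)
```

In this toy run the last window has no stop codon, so the seed rule marks it as coding. After one round of training, its composition matches the non-coding windows better and it is relabeled, and the next round changes nothing, so the loop stops.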

Because the self-training method uses established general principles of eukaryotic gene organization to reconstruct the species-specific nucleotide sequence patterns, it speeds things up: scientists don't have to wait for an outside expert to assemble a set of sequences large enough to use as a training set. That can shave a year or more off a sequencing project. With the self-training method, the program does the work itself.

Details on the new program can be found in Nucleic Acids Research, volume 33, number 20, pages 6494-6506.