Introduction
With the advent of the genomic age and the recent completion of genome screening projects of many eukaryotic as well as prokaryotic organisms, molecular biologists are now in a better position than ever to exploit the available genetic information in order to identify the evolutionary origins of many different species.
Modern bioinformatic techniques can also be applied to the data yielded by genome screening projects in order to identify individual genes and the functions of their gene products. The wealth of information surrounding genome sequences allows the technique of comparative genomics to be used to investigate gene function and evolution.
Much of the analysis of genome data involves the comparison of sequence information within the same species or between different species in order to identify similarities and differences in DNA and genes that can in turn provide insights into the function of genes and the evolution of the organisms concerned. Such studies are known as synteny analysis or comparative genomics.
Synteny means “same thread” and can be described as the conservation of genetic loci at the same position of the same chromosome within a particular species. Genes can be described as syntenic if they are found on the same chromosome, whether or not they are genetically linked. They are said to be in microsynteny when short sequences are considered and in macrosynteny when the sequences of whole chromosomes are considered (reviewed by Passarge et al., 1999).
The term synteny is often used to describe the concept of paralogous or orthologous genes, but this has been suggested to be incorrect usage of the term (Passarge et al., 1999). Paralogy refers to homologous genes from sequences that have arisen in the same species due to a duplication event in evolution, while orthology refers to genes in different species that have evolved from a single gene in a common ancestor of the two or more species concerned.
Other researchers, however, treat synteny as a blanket term and consider synteny and homology to be interchangeable (reviewed by Hardison, 2003). For the purposes of this essay, synteny will be taken to cover all of the above terms, i.e. “true” synteny, paralogy and orthology.
Much of the sequence data that allows synteny analysis to be performed has been generated recently by genome screening projects. The methods and techniques involved in such projects for the screening of genomes in a number of species will be discussed here. Comparative genomics is the study of genomic sequence and structure across different species and therefore can be considered to be the method of investigating synteny. Methods in comparative genomics for the identification of gene function and evolution will be discussed here.
Genome screening
Many genome screening projects were developed in the 1990s, based on the pioneering methods of Sanger and Coulson (1975). Such projects were designed to screen the genomes of medically and economically important eukaryotic and prokaryotic species. The best known of these is the Human Genome Project, whose results, describing the first full sequence of the Homo sapiens genome, were first published in 2001 (Venter et al., 2001).
Many more prokaryotic than eukaryotic genomes have been sequenced, mainly because prokaryotic genomes are less complex and smaller, both in chromosome number and in sequence length measured in nucleotide bases (reviewed by Haubold and Wiehe, 2004). Prokaryotic organisms are also important as human pathogens, so knowledge of their genomes is valuable in that respect (reviewed by Haubold and Wiehe, 2004).
Recent advances in molecular biological techniques for the identification of large sequences have allowed the much larger genomes of eukaryotes to be screened and the nature of such genomes to be investigated. In addition to the Human Genome Project, genome sequencing has been performed in other eukaryotic species, particularly those that are established biological models. These include the mouse Mus musculus, the fruit fly Drosophila melanogaster, the nematode Caenorhabditis elegans, the yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe, the major plant model thale cress Arabidopsis thaliana, and some medically important species of Plasmodium that are responsible for malaria, a disease that has disastrous effects on the global human population (reviewed by Haubold and Wiehe, 2004).
Genome screening techniques
There have been a number of recent advances in molecular biology techniques for sequencing large pieces of DNA that have allowed the screening and sequencing of large eukaryotic genomes. Such techniques include chromosome walking, chromosome jumping, chromosome landing and shotgun sequencing.
Short pieces of DNA of between approximately 100 and 1000 nucleotides, the length typically handled in laboratory molecular biology experiments, are sequenced using the chain termination method (Sanger and Coulson, 1975), which forms the basis of the sequencing technique (reviewed by Ziebolz and Droege, 2007). The chain termination method can be used to investigate the genomes of prokaryotes, which are simple and short in length. It is not viable on its own, however, for larger stretches of DNA such as whole eukaryotic chromosomes, which can be millions of bases in length, or entire genomes (the human genome is approximately 3 billion nucleotide bases in length; Venter et al., 2001). Further sequencing techniques have therefore been developed for genome screening projects.
One such method for sequencing large portions of DNA is chromosome walking, one of the pioneering methods developed for the sequencing of large DNA fragments, which makes use of restriction fragment length polymorphisms (RFLPs) present in the genome as markers (reviewed by Hauge et al., 1993). Chromosome walking was used in the early 1990s to investigate the genome of the malarial eukaryote Plasmodium falciparum (Walker-Jonah et al., 1992), and has been used more recently to screen the genome of rice (Huang et al., 2006; Sim et al., 2005).
RFLPs serve as markers from which RFLP maps are generated by chromosome walking. Chromosome walking divides long sequences of genomic DNA into consecutive short fragments for sequencing: primers for the polymerase chain reaction (PCR) are designed at the end of each sequenced fragment and used to amplify the next fragment of the genome (Stubbs, 1992; Rosenthal, 1992). In addition to RFLPs, other markers known as RAPD (random amplified polymorphic DNA) and AFLP (amplified fragment length polymorphism) have been investigated and used to provide a more accurate map of the genome (Saliba-Colombani et al., 2000).
Additional techniques include chromosome jumping and chromosome landing.
Chromosome walking cannot be performed in all plant species, because their genomes are larger and more complex than those of prokaryotes and even of many eukaryotes with smaller genomes. Another technique, known as chromosome landing, has therefore been developed for genome screening in such plants. Chromosome landing employs DNA markers to isolate specific genes from a clone library (Tanksley et al., 1995).
Repetitive sequences cannot always be sequenced successfully by chromosome walking or chromosome landing. In such instances a technique known as chromosome jumping is used, in which genomic DNA is digested with restriction enzymes into shorter sequences that can be analysed individually (reviewed by Zabarovsky et al., 1991; Zabarovsky et al., 1996).
Modern genome sequencing projects, including the Human Genome Project (Venter et al., 2001), use an alternative method known as shotgun sequencing to determine the sequences of the large genomes of eukaryotic organisms such as rice (Tyagi et al., 2004), maize (Martienssen et al., 2004) and chicken (Dodgson, 2003). Rather than working through consecutive short fragments, shotgun sequencing sequences random fragments, which are then reassembled from their overlaps; it is named after the random firing pattern of a shotgun (reviewed by Bankier, 2001).
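The reassembly step that follows shotgun sequencing can be illustrated with a toy example. The sketch below is a minimal greedy overlap assembler written in Python; the function names, the minimum overlap length and the example reads are assumptions made for illustration, and real assemblers must additionally cope with sequencing errors, repeats and both DNA strands.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the largest overlap."""
    # Discard duplicate reads and reads wholly contained in another read.
    contigs = sorted(
        r for r in set(reads) if not any(r != s and r in s for s in reads)
    )
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            # No remaining overlaps: return whatever contigs are left.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [
            c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)
        ]
        contigs.append(merged)


# Hypothetical reads sampled in random order from a 20-base toy sequence.
reads = ["TGAAGTTC", "ATGGCGTA", "GTTCGA", "GTACCTGA"]
contigs = greedy_assemble(reads)
```

The greedy strategy is chosen here only because it is short and easy to follow; it is not the method used by any particular genome project.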
Comparative genomics
Knowledge of the DNA sequence of the genome of a species is not sufficient on its own to identify individual genes or their functions, nor does it provide information about the phenotypes that such genes confer upon the individual (reviewed by Hardison, 2003).
Once the genome sequences of many eukaryotic species have been identified, comparative genomics provides a powerful method for investigating the function and evolution of genes by, as the term suggests, comparing the genomes of different species or the sequences of individuals of the same species.
Techniques of comparative genomics
There are three main experimental techniques in comparative genomics: alignment, phylogenetic analysis and coalescent theory, reviewed by Haubold and Wiehe, (2004). Knowledge of entire genome sequences has recently allowed more widespread use of these techniques to compare the genomes of different species.
The distinction between homology and analogy is a classic tool in biology, in which similar physiological features of organisms are compared in order to determine their origins. Alignment is used in a similar fashion to assess the evolutionary relationship between species by comparing genetic sequences (reviewed by Haubold and Wiehe, 2004).
In alignment, two or more sequences are placed side by side so that similarities or differences in their sequences can be analysed.
Mathematical algorithms are required to allow for gaps in the sequence resulting from insertions or deletions, and for point mutations (reviewed by Haubold and Wiehe, 2004). Today, regions of homology between sequences are identified by computer software such as BLAST (Altschul et al., 1990) or FASTA (Pearson and Lipman, 1988), but the underlying alignment algorithms were devised before such software was available (Needleman and Wunsch, 1970; Smith and Waterman, 1981), reviewed by Haubold and Wiehe (2004).
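To make the handling of gaps and point mutations concrete, the following is a minimal Python sketch of global alignment by dynamic programming in the style of Needleman and Wunsch (1970). The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, not parameters taken from the cited work or from BLAST or FASTA.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Score for aligning a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # match or point mutation
                score[i - 1][j] + gap,            # gap in b (deletion)
                score[i][j - 1] + gap,            # gap in a (insertion)
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("ACGT", "AGT")
```

With these assumed scores, aligning ACGT with AGT places a single gap opposite the C, giving the pair ACGT / A-GT.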
Even distantly related species with a low level of nucleotide homology can now be compared using a recently developed technique known as Syntenator analysis, which identifies homologous regions rather than the homology of individual nucleotide bases (Rodelsperger and Dieterich, 2008).
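The idea of comparing regions rather than individual bases can be sketched as a search for runs of genes that occur in the same, or exactly reversed, order in two genomes. The Python example below is a simplified illustration only and is not the Syntenator algorithm; it assumes that each gene carries a unique identifier shared between one-to-one orthologues, and the function name and gene labels are hypothetical.

```python
def collinear_blocks(order_a: list[str], order_b: list[str],
                     min_genes: int = 2) -> list[list[str]]:
    """Find runs of genes adjacent in genome A that are also adjacent,
    in the same or reversed order, in genome B."""
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks: list[list[str]] = []
    current: list[str] = []
    step = 0  # +1 or -1 once the direction of the block is known

    def close_block() -> None:
        if len(current) >= min_genes:
            blocks.append(list(current))

    for gene in order_a:
        if gene not in pos_b:
            # No orthologue in genome B: the current block ends here.
            close_block()
            current, step = [], 0
            continue
        if current:
            delta = pos_b[gene] - pos_b[current[-1]]
            if delta in (1, -1) and step in (0, delta):
                current.append(gene)
                step = delta
                continue
            close_block()
        current, step = [gene], 0
    close_block()
    return blocks


# Hypothetical gene orders: g1-g3 are inverted in genome B, g4 has moved.
genome_a = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
genome_b = ["g3", "g2", "g1", "x", "g5", "g6", "g7", "g4"]
blocks = collinear_blocks(genome_a, genome_b)
```

Because only gene order is compared, blocks of this kind can be found even when the nucleotide sequences of the genes themselves have diverged considerably.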
The resulting alignment data can be used to investigate the evolutionary relationship between two or more species by phylogenetic analysis, a technique originally developed in the 1960s (Edwards and Cavalli-Sforza, 1964; reviewed by Haubold and Wiehe, 2004). This type of analysis arranges the species concerned into an evolutionary, or phylogenetic, tree based on the alignment data. It comprises three methods: distance methods (used to construct a tree), and parsimony and maximum likelihood (used to score a tree in terms of the number of mutations or the probability of the sequence data, respectively) (reviewed by Haubold and Wiehe, 2004).
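A distance method can be sketched in a few lines of Python. The example below computes the proportion of differing sites between pre-aligned sequences and then builds a tree by average-linkage clustering (UPGMA). UPGMA is used here as one simple representative of the distance methods and is an assumption of this sketch, as are the function names and the toy sequences; branch lengths are omitted for brevity.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(seqs: dict[str, str]):
    """Cluster aligned sequences into a nested-tuple tree (topology only)."""
    # Map each subtree (a name or a nested tuple) to the leaves it contains.
    clusters = {name: [name] for name in seqs}
    dist = {
        frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(seqs, 2)
    }

    def cluster_distance(t1, t2) -> float:
        # Average of all leaf-to-leaf distances between the two clusters.
        leaves1, leaves2 = clusters[t1], clusters[t2]
        total = sum(dist[frozenset((x, y))] for x in leaves1 for y in leaves2)
        return total / (len(leaves1) * len(leaves2))

    while len(clusters) > 1:
        # Join the two closest clusters into a new internal node.
        t1, t2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        clusters[(t1, t2)] = clusters.pop(t1) + clusters.pop(t2)
    return next(iter(clusters))


# Hypothetical pre-aligned sequences for four taxa labelled A to D.
alignment = {
    "A": "AAAAAAAAAA",
    "B": "AAAAAAAAAT",
    "C": "AAAAAATTTT",
    "D": "CCCCCAAAAA",
}
tree = upgma(alignment)
```

In this toy data set A and B differ at a single site and are joined first, C is then joined to that pair, and D, the most divergent sequence, is joined last.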
Depending on the evolutionary distance between species, that is to say the evolutionary time since a single ancestral species diverged into two modern species, different questions about the nature of their genes can be investigated using the data obtained from alignment and phylogenetic studies (reviewed by Hardison, 2003). In species separated by a small evolutionary distance, such as humans and chimpanzees, only the genomic sequences responsible for the unique species phenotype can be identified. Species separated by greater evolutionary distances can be compared to investigate which genes have been conserved throughout evolution and are therefore likely to be functional, while comparing humans with the most primitive eukaryotes, across vast evolutionary distances, can reveal the genes and proteins that characterise eukaryotic life or that are required by all metazoans (Hardison, 2003).
Coalescent theory involves the comparison of genome data from members of the same species, such as the investigation of single nucleotide polymorphisms (SNPs), changes in individual bases that characterise the differences between individuals of the same species, reviewed by (Haubold and Wiehe, 2004). The major difference between phylogenetic analysis and coalescent theory is that they are based on known and simulated data sets respectively, reviewed by Haubold and Wiehe, (2004).
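The role of simulated data in coalescent theory can be illustrated by simulating the time back to the most recent common ancestor of a sample. The Python sketch below assumes the standard neutral coalescent with constant population size, in which k lineages coalesce at rate k(k-1)/2 when time is measured in coalescent units; the function names, sample size and number of replicates are assumptions made for illustration.

```python
import random


def simulate_tmrca(n: int, rng: random.Random) -> float:
    """Simulate the time to the most recent common ancestor of n lineages."""
    t = 0.0
    k = n
    while k > 1:
        # Waiting time until any two of the k lineages coalesce.
        rate = k * (k - 1) / 2
        t += rng.expovariate(rate)
        k -= 1
    return t


def mean_tmrca(n: int, replicates: int, seed: int = 0) -> float:
    """Average simulated time to the common ancestor over many replicates."""
    rng = random.Random(seed)
    return sum(simulate_tmrca(n, rng) for _ in range(replicates)) / replicates


# Theoretical expectation under these assumptions is 2 * (1 - 1/n).
n = 10
estimate = mean_tmrca(n, replicates=20000, seed=1)
expected = 2 * (1 - 1 / n)
```

Comparing summaries of observed within-species variation, such as SNP counts, with values simulated in this way is the general spirit of the approach; the sketch does not place mutations on the genealogy.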
Comparative genomics in the analysis of genome sequences
Some species of yeast, such as S. pombe and S. cerevisiae, are used in research as established models for the study of the cell cycle and other processes, whereas other yeasts and fungi, such as Candida albicans, are relevant from a medical point of view. The genomes of fungi are small in comparison with those of other eukaryotic organisms and are therefore relatively simple to sequence (reviewed by Piskur and Langkjaer, 2004; Wolfe, 2006). As a result, the genomes of around thirty yeast species across different fungal groups have been sequenced, providing an insight into the evolutionary origins and structure of yeast genomes (reviewed by Wolfe, 2006).
Syntenic analysis was performed on two species of fungi, S. cerevisiae and C. albicans, to determine whether the complete S. cerevisiae genome could be used to “fill in the blanks” of the incomplete C. albicans genome sequence. Unfortunately, the genomes of these two species proved too divergent for this to be possible (Chibana et al., 2005). The approach can, however, be applied to more closely related species in order to identify the genome sequences of species without screening their entire genomes.
Genome screening can also be used to identify the diversity between different strains of the same species, a procedure that has been undertaken in S. cerevisiae (Carreto et al., 2008). Genome screening has allowed the identification of human disease genes, such as those underlying many inherited developmental disorders and cancers, many of which were mapped before full genome sequencing had been completed (McKusick and Amberger, 1994; reviewed by Moreno et al., 2008). It has also provided further data that can be used in the treatment of conditions such as cardiovascular disease (Pollex and Hegele, 2007).
Sequencing of the chimpanzee genome and its subsequent comparison with the human genome have provided insights not only into the evolutionary relationship between chimpanzees and humans, but also into the genetic basis of human disease (reviewed by Olson and Varki, 2003). Perhaps surprisingly, even a comparison of the Drosophila genome with the human map has identified homologous genes that are involved in human disease (Fortini et al., 2000).
From an evolutionary perspective, the genomes of the chicken and the distantly related great reed warbler Acrocephalus arundinaceus were compared. The order of the genes on the chromosomes was shown to be conserved between the two species, but the sequences had recombined differently, indicating that while the chicken genome is a useful tool in determining the genomes of other birds, it cannot be used alone (Dawson et al., 2007). This highlights the importance of screening the genome sequences of further species to reduce the extent of such problems.
Synteny and comparative genomic analyses have been employed to identify important resistance genes in agriculturally and economically important crop plants such as tomatoes and potatoes (Huang et al., 2003; Huang et al., 2005). The genome screening of species such as barley (Kilian et al., 1995), wheat (Blaszczyk et al., 2004), rice (Kilian et al., 1995), Arabidopsis and maize (Brendel et al., 2002) has allowed comparative genomic techniques to be employed to locate and investigate specific genes, particularly those that confer resistance to pests and environmental conditions in crop plants (reviewed by Brendel et al., 2002; Caicedo and Purugganan, 2005; Gale and Devos, 1998; Lyons and Freeling, 2008).
Summary
The genomic age has provided a great advance in our understanding of genes, their origins and their functions, not only through genome screening projects but also through the techniques developed to interpret the data from individual species and to compare data between individuals of the same species or between different species.
Once they have been sequenced by large-scale chromosome walking or shotgun sequencing techniques, genomes can be compared using the methods of comparative genomics described above to identify important genes by their conservation over evolutionary time.
The sequencing of further eukaryotic genomes will yield additional data towards a comprehensive overview of the evolutionary origins and functions of all genes. From a medical perspective, the identification of genes and their polymorphisms is already becoming a useful tool in screening individuals for heritable genetic conditions.
Following the continued success of genome projects, a wealth of current research is directed at elucidating the proteome: identifying the many hundreds of thousands of proteins, in the case of humans, that are encoded by the genome but are subject to alternative splicing and to post-translational modifications, such as phosphorylation, that confer their specific functions.