An Introduction to Biocomputing - Sequence Analysis
David Steffen, Ph.D.
President, Biomedical Computing, Inc.
6626 Westchester
Houston, Texas 77005
USA

Biological Background for Sequence Analysis

The fundamental building blocks of life are proteins. Enzymes, the molecular machines responsible for virtually all of the chemical transformations that cells are capable of, are proteins. In addition, much of the structure of a cell is made up of proteins, and the part of the structure that is not made up of proteins is produced by enzymes, which are proteins. A human contains on the order of 100,000 different proteins. It is the properties of, and the interactions between, these 100,000 proteins that make us what we are.

Proteins are variable-length[1], linear, mixed polymers of 20 different amino acids[2]. Other terms used more or less interchangeably for amino acid polymers are peptides and polypeptides[3]. These topologically linear polymers fold upon themselves to generate a shape characteristic of each different protein, and this shape, along with the different chemical properties of the 20 amino acids, determines the function of the protein. One of the most important concepts in modern biology is that the functional properties of proteins are determined largely by the sequence of the 20 amino acids in the linear polypeptide chain; that is, in many cases proteins are largely self-folding. Thus, in theory, knowing the sequence of a protein (the order in which the amino acids occur), one could infer its function.

What determines the order of amino acids in a protein? The Central Dogma of Molecular Biology describes how the genetic information we inherit from our parents is stored in DNA, how that information is used to make identical copies of that DNA, and how it is transferred from DNA to RNA to protein.
DNA is a linear polymer of 4 nucleotides[4]: deoxyAdenosine monophosphate (abbreviated A), deoxyThymidine monophosphate (abbreviated T), deoxyGuanosine monophosphate (abbreviated G) and deoxyCytidine monophosphate (abbreviated C). RNA is a very similar polymer of Adenosine monophosphate, Guanosine monophosphate, Cytidine monophosphate, and Uridine monophosphate. Uridine monophosphate, abbreviated U, is a nucleotide functionally equivalent to Thymidine monophosphate. A property of both DNA and RNA is that the linear polymers can pair with one another, and this pairing is sequence specific. In such double polymers (referred to as a "double helix" because of the shape they assume), G pairs with C and A pairs with T or U. All possible combinations of DNA and RNA double helices occur.

One strand of DNA can serve as a template for the construction of a complementary strand, and this complementary strand can be used to recreate the original strand. This is the basis of DNA replication and thus of all of genetics. Similar templating results in an RNA copy of a DNA sequence.

Conversion of that RNA sequence into a protein sequence is more complex. It occurs by translation of a code in which three nucleotides specify one amino acid, a process accomplished by cellular machinery including tRNA and ribosomes. Four different nucleotides taken three at a time give 4 x 4 x 4 = 64 possible triplet codes (codons), more than enough to encode 20 amino acids. These 64 codons are mapped onto 20 amino acids in two ways: first, one amino acid may be encoded by 1 to 6 different codons, and second, 3 of the 64 codons, called stop codons, specify "end of peptide sequence". Where multiple codons specify the same amino acid, the different codons are used with unequal frequency, and this distribution of frequencies is referred to as "codon usage". Codon usage varies between species.
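The mapping just described is small enough to write out in full. The sketch below uses Python (chosen here for illustration; it is not a language used in the original course). It builds the standard genetic code from a compact 64-letter string, counts how many codons specify each amino acid, and translates a short sequence. The names CODON_TABLE and translate are illustrative, and the example sequence is made up.

```python
from collections import Counter
from itertools import product

# Standard genetic code, listed with the bases in the order T, C, A, G at each
# of the three codon positions. "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate a DNA or RNA coding sequence, read from its first nucleotide."""
    dna = sequence.upper().replace("U", "T")  # treat U as equivalent to T
    # Step through complete triplets only; trailing nucleotides are ignored.
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# How many codons encode each amino acid (or stop)?
codons_per_residue = Counter(CODON_TABLE.values())

if __name__ == "__main__":
    print(len(CODON_TABLE))                 # number of possible triplets
    print(codons_per_residue["*"])          # number of stop codons
    print(codons_per_residue["L"], codons_per_residue["M"])
    print(translate("AUGGCCUAA"))           # made-up example sequence
```

In this table leucine, serine and arginine each have six codons, methionine and tryptophan have one each, and three codons signal a stop. Note that the table says nothing about codon usage, which would require counting codons in real genes from a particular species.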
The fact that DNA nucleotides need to be read three at a time to specify a protein sequence implies that a DNA sequence has three different reading frames, determined by whether you start at nucleotide one, two, or three. (Nucleotide four will be in the same frame as nucleotide one, and so on.) Both strands of DNA can be copied into RNA (for translation into protein). Thus, a DNA sequence with its (inferred) complementary strand can specify six different reading frames.

It is possible to chemically determine the sequence of amino acids in a protein and of nucleotides in RNA or DNA. However, it is vastly easier at present to determine the sequence of DNA than that of RNA or protein. Since the sequence of a protein can be determined from the DNA sequence which encodes it, most protein sequences are in fact inferred from DNA sequences. Conversion of RNA to a DNA copy (cDNA) is a simple laboratory procedure, so RNA molecules are themselves sequenced as cDNA copies.

Sequence analysis is the process of making biological inferences from the known sequence of monomers in protein, DNA and RNA polymers.

Sequence Analysis is Difficult

Although it is possibly true in theory that given a protein sequence one can infer its properties, the current state of the art in biology falls far short of being able to do this in practice. Current sequence analysis is a painful compromise between what is desired and what is possible. Some of the many factors which make sequence analysis difficult are discussed in this section.

As noted above, the difficulty of sequencing proteins means that most protein sequences are determined from the DNA sequences encoding them. Unfortunately, the cellular pathway from DNA to RNA to protein includes some features that complicate inference of a protein sequence from a DNA sequence. Many proteins are encoded on each piece of DNA, so when confronted with a DNA sequence, a biologist needs to figure out where the code for a protein starts and stops.
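To make the six reading frames concrete, and to show why finding where a protein starts and stops takes more than mechanical translation, the following Python sketch translates a DNA sequence in all six frames and reports the longest stretch in each frame that is free of stop codons. It is a minimal illustration under simple assumptions (the standard genetic code, only the letters A, C, G and T, a made-up example sequence), and the function names are hypothetical rather than taken from any existing program.

```python
from itertools import product

# Standard genetic code (bases ordered T, C, A, G); "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the inferred complementary strand, read in the opposite direction."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete triplets starting at the first nucleotide."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to the peptide each would encode."""
    dna = dna.upper()
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


def longest_stop_free_stretch(peptide):
    """Longest run of residues not interrupted by a stop codon ('*')."""
    return max(peptide.split("*"), key=len)


if __name__ == "__main__":
    example = "ATGGCCTAA"  # made-up sequence
    for label, peptide in six_frame_translations(example).items():
        print(label, peptide, longest_stop_free_stretch(peptide))
```

A long stop-free stretch is only a hint that a frame might encode protein. As the rest of this section explains, it does not by itself establish where a coding region begins or ends.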
The problem of locating coding regions is even more difficult because the human genome contains much more DNA than is needed to encode proteins; the sequence of a random piece of DNA is likely to encode no protein whatsoever. In addition, the DNA which encodes a protein is frequently not continuous, but is scattered in separate blocks called exons. Many of these problems can be reduced by sequencing RNA (via cDNA) rather than DNA itself, because the cDNA contains much less extraneous material, and because the separate exons have been joined in one continuous stretch in the RNA (cDNA). There are situations, however, where analysis of RNA is not possible and the DNA itself needs to be analyzed.

Although a much greater fraction of RNA encodes protein than does DNA, it is certainly not the case that all RNA encodes protein. First, there can be RNA upstream and downstream of the coding region. These non-coding regions can be quite large, in some cases dwarfing the coding region. Second, some RNAs encode no protein at all. Ribosomal RNA (rRNA), transfer RNA (tRNA), and the structural RNA of small nuclear ribonucleoproteins (snRNA) are all examples of non-coding RNA.

By and large, global, complete solutions are not available for determining an encoded protein sequence from a DNA sequence. However, by combining a variety of computational approaches with some laboratory biology, people have been fairly successful at accomplishing this in many specific cases. Nonetheless, this problem is currently considered one of the most important in computational biology.

Once you have obtained a protein sequence, inferring its structure and function are vastly greater problems. As noted above, the structure of a protein is produced by the folding of a peptide chain back on itself and, in some cases, by the association of multiple peptide chains. This folding is possible because rotation can occur both around bonds within the constituent amino acids and around the bonds that join the amino acids to one another.
Unfortunately (or fortunately, as life depends on this fact), the number of possible folding patterns is effectively infinite. To help cope with this daunting problem, biologists have divided the structural features of proteins into levels. The first level of structure, termed primary structure, refers just to the sequence of amino acids in the protein; this is what we know.

The second level is secondary structure. Decades ago, it was found that polypeptide chains can sometimes fold into regular structures; that is, structures which have the same shape in different polypeptides. One such shape is helical and is referred to as an alpha helix. In another, the polypeptide chain folds back and forth, producing a sheet-like surface; this structure is referred to as a beta sheet. There are additional secondary structural types into which a polypeptide might fold, and some peptides do not fold into any of these regular structures at all. In fact, most long polypeptide chains (including virtually all real biological proteins) fold into different secondary structures along different portions of their length.

The secondary structures described above are all very simple and regular: the round and round of an alpha helix or the back and forth of a beta sheet. Other structures that are found over and over in different proteins are more complex than this. One example is the helix-loop-helix motif found in many transcription factors[5]. These features are referred to as super-secondary structure.

When you look at an actual polypeptide chain, the final shape is made up of secondary structural features, perhaps super-secondary structural features, and some apparently random conformations. This overall structure is referred to as the tertiary structure. Finally, many biological proteins are constructed of multiple polypeptide chains. The way these chains fit together is referred to as the quaternary structure of the protein.
The reason that this complex nomenclature for protein structure has developed is that the problem of understanding protein structure is so important and so difficult. The importance of understanding protein structure comes from two factors working together. The first is that the function of a protein is absolutely dependent on its structure. In fact, one of the most common ways for proteins to lose their function is to have their structure disrupted, for example by heat or mechanical stress (e.g. beating an egg white); only completely and properly folded proteins "work". The second factor is that it is extremely difficult to determine the structure of a protein experimentally[6]. To date, the primary structure of many proteins has been determined (about 30,000). In contrast, the tertiary structure of far fewer (about 500) has been determined. Obviously, then, it would be of great value if tertiary structure could be determined from primary structure. It is not an exaggeration to state that the ability to predict protein structures exactly and, from them, protein function would revolutionize medicine, pharmacology, chemistry and ecology.

Current research on tertiary structure prediction has used two basic approaches: homology-based and ab initio. Homology-based approaches attempt to determine the tertiary structure of a protein by comparing its primary sequence to that of a related protein whose structure is known. This is a laborious but fairly successful approach. Unfortunately, it requires the existence of one or more similar proteins with known structure, something that is not always available. Ab initio approaches try to determine the structure which minimizes free energy. This is done using either Monte Carlo methods or neural network software.

Finally, even when the tertiary structure of a protein has been determined, techniques have not yet been developed for inferring the functional properties of the protein from its structure.
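To give a feel for what minimizing an energy by Monte Carlo methods means, here is a deliberately tiny Python sketch of the Metropolis procedure: propose a small random change, always accept it if the energy falls, and accept it only occasionally if the energy rises. Everything specific in it is an assumption made for illustration. The "conformation" is just a short list of angles, and toy_energy is an invented function, not a physical model of a protein or a method used by any real structure prediction program.

```python
import math
import random


def toy_energy(angles):
    """Invented energy: each angle prefers 0 radians and neighbours prefer to agree."""
    local = sum(1.0 - math.cos(a) for a in angles)
    coupling = sum(1.0 - math.cos(a - b) for a, b in zip(angles, angles[1:]))
    return local + coupling


def metropolis_minimize(angles, steps=5000, temperature=0.5, step_size=0.5, seed=0):
    """Random search with the Metropolis acceptance rule; returns the best state seen."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    current = list(angles)
    current_energy = toy_energy(current)
    best, best_energy = list(current), current_energy
    for _ in range(steps):
        # Propose a small random change to one randomly chosen angle.
        trial = list(current)
        k = rng.randrange(len(trial))
        trial[k] += rng.uniform(-step_size, step_size)
        delta = toy_energy(trial) - current_energy
        # Downhill moves are always accepted; uphill moves are accepted with
        # probability exp(-delta / temperature), which lets the search escape
        # shallow local minima.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_energy = trial, current_energy + delta
            if current_energy < best_energy:
                best, best_energy = list(current), current_energy
    return best, best_energy


if __name__ == "__main__":
    start = [2.0, -2.5, 1.0, 3.0]  # arbitrary starting angles
    best, energy = metropolis_minimize(start)
    print(toy_energy(start), energy)
```

Even in this toy the difficulty described above is visible: the search only samples conformations, so it can report the best state it happened to visit but cannot prove that no lower-energy state exists. With a real protein the number of angles runs into the hundreds or thousands.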
What Can Be Done Now with Sequence Analysis?

Given the pessimistic view of sequence analysis presented in the previous section, why do we even bother with it? In the first place, the attempt to find methods for successful sequence analysis is a research goal in its own right, one whose potential rewards are so vast as to make it of the first importance. In the second place, although there are many things that sequence analysis cannot yet do, there are many very worthwhile things that can currently be done with it, and these are summarized in this section.

Identification of Protein Primary Sequence from DNA Sequence.

The computer programs used to infer protein sequence from DNA sequence do not solve the problem outright, but they provide information that helps approach a solution. For example, if you are trying to find out where in a DNA sequence a protein is encoded, it is very useful to know what peptides would be encoded by all six reading frames. A stretch containing many stop codons is a poor candidate for encoding a protein. This will not tell you with certainty where the protein sequence starts and stops, but it will help you guess where that might occur. Programs exist for doing this. In fact, there are many factors you can use to guess where in a DNA sequence a protein sequence might reside: use of the expected codon bias, presence of characteristic sequences representing regulatory signals in the DNA, and so forth. One family of programs integrates a variety of these approaches and, using either explicit algorithms or trained neural nets[7], makes a prediction.

Searching of databases for sequences similar to a new sequence.

If you have just determined the sequence of an interesting bit of DNA, one of the first questions you are likely to ask yourself is "has anybody else seen anything like this?" Fortunately, there has been a very successful international effort to collect all the sequences people have determined in one place so they can be searched.
For DNA sequences, three groups have cooperated in this effort, one in Japan, one in Europe, and one in the United States, to produce DDBJ, EMBL and GenBank, respectively. These databases are frequently reconciled with each other, so that searching any one is virtually the same as searching all three. The problem is that these databases are huge, so you need an efficient way to compare your sequence with this vast number of other sequences. A number of programs have been written to rapidly search a database for a query sequence; two of them, BLAST and FASTA, will be discussed in this course. The techniques these programs use to make searching rapid result in some loss of rigor in the comparison. It is possible (although, as it turns out, unlikely) that a weak but relevant similarity could be missed by these programs. In addition, these programs will often flag a sequence as being similar to your query sequence when the similarity is not significant. Thus, these programs should be seen as tools for identifying a small subset of sequences from the database for retrieval and further analysis, rather than as ends in themselves. Databases of protein sequences, including SwissProt and PIR, also exist and can be searched in the same way.

Which program should you use to search a database, FASTA or BLAST? This question is about as controversial as the choice of computers (Mac vs. PC) or religions. In fact, as you enter the world of sequence analysis, you will find religious wars between proponents of different programs over and over. Worse, new programs are constantly appearing. In addition, even after having selected a program, you will frequently have to select values for "parameters" and will always have to interpret the output. There are no magic answers to help you do these things. What you will acquire in this course is the background you need to make reasonable decisions on these issues.
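The trade-off between speed and rigor mentioned above can be illustrated with a short Python sketch of one common shortcut: index every short "word" of k consecutive letters in the database, then look only at database sequences that share at least one exact word with the query. This is a simplified illustration of the general word-lookup idea, not the actual BLAST or FASTA algorithm, and the sequence names, sequences and word length below are all invented for the example.

```python
from collections import defaultdict


def build_word_index(database, k):
    """Map every k-letter word to the names of the sequences that contain it."""
    index = defaultdict(set)
    for name, sequence in database.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(name)
    return index


def candidate_hits(query, index, k):
    """Rank database sequences by how many distinct query words they contain."""
    counts = defaultdict(int)
    query_words = {query[i:i + k] for i in range(len(query) - k + 1)}
    for word in query_words:
        for name in index.get(word, ()):
            counts[name] += 1
    # Most shared words first; ties broken alphabetically by name.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Invented miniature "database".
DATABASE = {
    "close": "CCGATTACAGATTACACC",    # contains the query exactly
    "diverged": "GATCACACATTGCA",     # differs from the query at 3 of 14 positions
    "unrelated": "CCCCGGGGCCCCGGGG",
}

if __name__ == "__main__":
    k = 4
    index = build_word_index(DATABASE, k)
    print(candidate_hits("GATTACAGATTACA", index, k))
```

Here only the "close" sequence is reported. The "diverged" sequence matches the query at 11 of 14 positions, but its differences are spaced so that no four-letter word survives intact, so the lookup never sees it. This is the kind of loss of rigor the text describes, and it is also why the choice of parameters such as word length matters.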
Calculation of sequence alignments for evolutionary inferences and to aid in structural and functional analysis.

Alignments are worth calculating for three reasons. First, although it is not possible to completely predict the function or shape (structure) of a protein from its sequence de novo, some useful inferences about structure and function can be drawn, especially by comparing the sequence of a protein of unknown structure and function to sequences of proteins with known structure and function. Second, if the goal of structure/function prediction is to be reached in the future, it will be because of partial analyses done in the present. Third, by comparing the sequences of equivalent proteins from different species of animals (such equivalent proteins are called "homologues"), one can draw inferences about the evolution of these species from their common ancestors.

One of the most useful things people do with sequences is to compare them to other sequences. However, such comparisons are not as easy to make as one might first think. One complicating factor is that the sequences biologists need to compare are usually not identical, but only similar. In addition to a small number of substitutions (e.g. a Guanine for an Adenine at one position in a DNA sequence), there will be insertions and deletions in one sequence relative to the other. Also, how you do the comparison depends on what you are comparing and what you want to learn from it. For these reasons, many different kinds of programs have been written to compare sequences.

Practical Sequence Analysis

Where can you find these programs, and what do you need to have and to do to run them? There is no one place where all possible sequence analysis programs reside, and there is no one way to run them. You might buy a commercial sequence analysis package such as DNA* or MacVector to run on the PC or Macintosh sitting on your desktop.
You might go out on the network and download the source code for a program, which you then compile and run on your computer. Your institution may have a computer center where various programs, both commercial and free, have been installed. You might write your own program, based either on an algorithm you read about in a journal or even on an algorithm you derive yourself. (This happens rarely, and I don't recommend it if you are interested in biology rather than bio-computing.) Finally, there are now places on the network where programs run that you can connect to and use. This last possibility is covered in the next chapter. Most likely, because none of these approaches is perfect, you will decide to use some combination of them.
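For a sense of what writing a program from a published algorithm involves, here is a minimal Python sketch of global alignment by dynamic programming (the Needleman-Wunsch method). It handles the substitutions, insertions and deletions that make sequence comparison harder than it first appears. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions chosen for illustration, the function name is hypothetical, and real programs use carefully derived scoring schemes and far more efficient implementations.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment.

    Returns (score, aligned_a, aligned_b), where "-" marks a gap.
    """
    rows, cols = len(a) + 1, len(b) + 1

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align two letters
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    # Made-up sequences: the second lacks one letter relative to the first.
    total, top, bottom = global_align("GATTACA", "GATACA")
    print(total)
    print(top)
    print(bottom)
```

Changing the three scoring values changes which alignment is reported as best. That is a small-scale version of the parameter choices mentioned earlier, and one reason different programs can give different answers for the same pair of sequences.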