Jeremy Scheff

Retrospective bioinformatics: the feasibility of overlapping genetic codes


This post was chosen as an Editor's Selection for ResearchBlogging.org

In 1957, we knew what DNA was. We were pretty sure that proteins were determined by sequences of DNA. But we didn’t know exactly how this happened. In other words, the genetic code was still a mystery back then. This was a particularly perplexing problem, because a very simple question could be stated with no obvious answer: How does a language (DNA sequences) with four letters (the nucleotides A, C, G, and T) get translated into a language (protein sequences) with twenty letters (amino acids)… and furthermore, is there some higher purpose to having these two different alphabets?

Lacking direct experimental results at the time, scientists proposed numerous fascinating hypotheses that all turned out to be completely wrong. The history of these hypotheses, and of how the real genetic code was eventually discovered, is summarized in this excellent article by Brian Hayes. It is one example of the depressing reality of molecular biology over the past 50 years: things just become more and more complicated the closer we investigate them. However, the disproved hypotheses are still quite interesting.

One critical question most of the hypothesized codes tried to answer was the alphabet problem. If your “word size” (the number of nucleotides that code for one amino acid) was 1, you could represent only 4 different amino acids (one for each of the 4 nucleotides). For 2 letters, you have 4^2=16 possible words. That’s not quite enough to represent 20 amino acids. But for 3 letters, you wind up with 4^3=64 possible combinations, which is a lot more than 20. That seems very inefficient, doesn’t it? So, many scientists assumed there must be some deep underlying reason that explains this discrepancy. Personally, I find Crick’s comma-free code to be a particularly elegant hypothesis along these lines, but I’m going to focus on another class of explanations.
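The word-size arithmetic is easy to check in a few lines of Perl. This is only a minimal sketch; the function names word_count and min_word_size are made up for this illustration.

```perl
use strict;
use warnings;

# Number of distinct words of a given length over an alphabet of a given size.
sub word_count {
    my ($alphabet_size, $word_size) = @_;
    return $alphabet_size ** $word_size;
}

# Smallest word size whose word count covers the number of symbols needed.
sub min_word_size {
    my ($alphabet_size, $symbols_needed) = @_;
    my $word_size = 1;
    $word_size++ while word_count($alphabet_size, $word_size) < $symbols_needed;
    return $word_size;
}

# 4 nucleotides, word sizes 1 to 3: prints 4, 16 and 64 possible words
for my $n (1 .. 3) {
    printf "word size %d: %d possible words\n", $n, word_count(4, $n);
}

# 20 amino acids need words of at least 3 nucleotides
print "minimum word size for 20 amino acids: ", min_word_size(4, 20), "\n";
```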

One hypothesis on the genetic code was that it could be an overlapping code. For instance, consider a DNA sequence like AGATTC. We now know that these six nucleotides (if the open reading frame starts at the beginning) code for two amino acids. AGA is arginine and TTC is phenylalanine. But what if codons (sequences of three nucleotides which code for an amino acid) could overlap? Then, for instance, you could take the same six nucleotides and get AGA, GAT, ATT, and TTC. This would make our DNA much more compact and thus much more energetically efficient, which was then (before the sequencing of the genome) believed to be of critical importance.
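To make the two readings concrete, here is a small Perl sketch that splits the example sequence both ways. The helper name triplets and its step parameter are hypothetical, chosen just for this example: a step of 3 gives the non-overlapping reading we know today, and a step of 1 gives the fully overlapping reading.

```perl
use strict;
use warnings;

# Split a nucleotide string into triplets, advancing by $step each time.
# $step = 3 is the non-overlapping reading, $step = 1 the fully overlapping one.
sub triplets {
    my ($dna, $step) = @_;
    my @codons;
    for (my $i = 0; $i + 3 <= length($dna); $i += $step) {
        push @codons, substr($dna, $i, 3);
    }
    return @codons;
}

print join(' ', triplets('AGATTC', 3)), "\n";  # AGA TTC
print join(' ', triplets('AGATTC', 1)), "\n";  # AGA GAT ATT TTC
```

Six nucleotides yield two codons in the real code, but four in the overlapping one.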

Even before scientists began to decipher the true genetic code in the 1960s, some aspects of the hypothesized codes could be tested. Consider a dipeptide (two adjacent amino acids). If no restriction is placed on the sequence of amino acids, then there are 20*20=400 possible dipeptides. But for an overlapping code, a dipeptide is defined by just four nucleotides (e.g. AGAT gives AGA and GAT). This means that an overlapping code has at most 4^4=256 possible dipeptides. Along these lines, clever combinatorics could put testable constraints on the feasibility of overlapping codes, which is exactly what Sidney Brenner did:

Consider an amino acid within a protein. It has an adjacent amino acid on each side, called its C-neighbor and its N-neighbor. In a fully overlapping code, each unique triplet can be preceded by and followed by any one of the four nucleotides, so a single codon could have at most four C-neighbors and four N-neighbors. If more than four neighbors exist, then there must be more than one codon coding for that amino acid (remember, we have 64 triplets and 20 amino acids, so that allows for 44 redundant codons). For instance, an amino acid with 13 known C-neighbors and 15 known N-neighbors must have at least 4 different codons, as that would allow for 4*4=16 possible neighbors on each side.
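Both counting arguments fit in a few lines of Perl. This is a sketch of the reasoning as described here, not Brenner's own procedure, and the function name min_codons is invented for the example. It assumes each codon contributes at most four neighbors on either side, so the larger of the two neighbor counts determines how many codons are needed.

```perl
use strict;
use warnings;
use POSIX qw(ceil);
use List::Util qw(max);

# Dipeptides available without restriction vs. under an overlapping triplet code,
# where four nucleotides fix both amino acids.
my $unrestricted      = 20 * 20;   # 400
my $overlapping_limit = 4 ** 4;    # 256
print "dipeptides: $unrestricted unrestricted, $overlapping_limit at most if overlapping\n";

# Minimum number of codons for one amino acid, given how many distinct
# C-neighbors and N-neighbors it has been observed with.
# Each codon allows at most 4 neighbors per side; every amino acid needs at least 1 codon.
sub min_codons {
    my ($c_neighbors, $n_neighbors) = @_;
    return max(1, ceil(max($c_neighbors, $n_neighbors) / 4));
}

print min_codons(13, 15), "\n";  # 4, since 4 codons * 4 neighbors = 16 >= 15
print min_codons(20, 20), "\n";  # 5
```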

Back in 1957, protein sequencing was a very young field, but there were a handful of known sequences. Brenner used the sequences of seven known proteins to find the number of C-neighbors and N-neighbors for each amino acid, and then calculated the number of codons that would be needed to represent all of those dipeptides. He found that 70 different codons were required, and since this is more than the 64 that are possible for a simple triplet code, the existence of an overlapping triplet code was disproved.

Now, in 2011, we know the sequences of many more than seven proteins. Brenner’s experiment can be performed on much more comprehensive data with just a bit of programming. So let’s try it. First, we need some protein sequence data. This can be downloaded from UniProt. I’m going to use the UniProtKB/Swiss-Prot database in FASTA format. Once the .gz file is uncompressed, it becomes apparent that it is just a plain text file in a standard format which has protein sequences.

Then, we have to install Perl and BioPerl. On Ubuntu, that’s just an apt-get install bioperl away. Now it’s time to code, starting with some boilerplate and module loading:


#!/usr/bin/perl -w

use strict;

use Bio::SeqIO;

We want to calculate the frequency of all dipeptides. This can be done by scanning through each protein sequence and keeping a count of all the dipeptides. I will store all of the counts in a two-dimensional matrix @count with 20 rows and 20 columns, where the row is the first amino acid of the dipeptide and the column is the second. This will replicate Table 2 in Brenner’s paper. I also define the hash %labels, which maps the one-letter code of each of the 20 amino acids to its row/column index in @count.


# 20x20 matrix for dipeptide frequencies
my @count;
for (my $i=0; $i<20; $i++) {
	for (my $j=0; $j<20; $j++) {
		$count[$i][$j] = 0;
	}
}

# Amino acids corresponding to rows/columns of @count
my %labels = (
	A => 0,
	C => 1,
	D => 2,
	E => 3,
	F => 4,
	G => 5,
	H => 6,
	I => 7,
	K => 8,
	L => 9,
	M => 10,
	N => 11,
	P => 12,
	Q => 13,
	R => 14,
	S => 15,
	T => 16,
	V => 17,
	W => 18,
	Y => 19,
);

Then comes the first bit of BioPerl magic, loading the FASTA file. Without standard file formats and standard programming interfaces, this would require custom code to be written to process every different type of file. I am grateful that other hackers came before me so I can just write some simple code like this:


# Name the format explicitly rather than relying on the file extension
my $seqio = Bio::SeqIO->new(-file => 'uniprot_sprot.fasta', -format => 'fasta');

Then we can use the nice BioPerl data structure to look at all of the protein sequences. The two defined() checks skip any dipeptide containing a character that is not one of the 20 standard amino acids, such as X for an unknown residue.


while (my $seq = $seqio->next_seq()) {
	my @aa = split(//, $seq->seq());
	my $size = $seq->length();

	# Count the frequency of dipeptides
	for (my $i=0; $i<$size-1; $i++) {
		if (defined($labels{$aa[$i]}) && defined($labels{$aa[$i+1]})) {
			$count[$labels{$aa[$i]}][$labels{$aa[$i+1]}]++;
		}
	}
}

Then simply save the matrix to a text file as comma-separated values.


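# Three-argument open with a lexical filehandle; stop if the file cannot be written
open(my $out, '>', 'out.txt') or die "Cannot open out.txt: $!";
foreach my $row (@count) {
	print $out join(',', @{$row}), "\n";
}
close($out);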
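Before waiting on the whole database, the counting logic can be checked on a tiny made-up sequence without BioPerl. The sketch below is not the script above: it uses a hash keyed by dipeptide instead of the 20x20 matrix, and the names count_dipeptides and %is_standard as well as the sequence MAAXAAK are invented for illustration.

```perl
use strict;
use warnings;

# One-letter codes of the 20 standard amino acids
my %is_standard = map { $_ => 1 } split //, 'ACDEFGHIKLMNPQRSTVWY';

# Count every pair of adjacent residues, skipping pairs with a non-standard letter.
sub count_dipeptides {
    my ($sequence) = @_;
    my %count;
    for my $i (0 .. length($sequence) - 2) {
        my $first  = substr($sequence, $i, 1);
        my $second = substr($sequence, $i + 1, 1);
        next unless $is_standard{$first} && $is_standard{$second};
        $count{"$first$second"}++;
    }
    return %count;
}

# Made-up sequence; the X makes the pairs AX and XA get skipped
my %toy = count_dipeptides('MAAXAAK');
for my $pair (sort keys %toy) {
    print "$pair $toy{$pair}\n";
}
# Prints:
# AA 2
# AK 1
# MA 1
```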

On my four-year-old laptop, this script takes several minutes to run on the UniProt file I downloaded. The output is the following table, in which each row is the first amino acid of the dipeptide and each column is the second.

A C D E F G H I K L M N P Q R S T V W Y
A 1604350 195182 822398 1044202 552296 1158886 320361 883073 857065 1588339 357454 529350 645313 613666 874430 939788 785165 1089590 154667 392376
C 174130 58499 138705 148016 103913 227743 70985 135499 130730 235749 45384 103432 139410 99012 144379 190093 130271 161362 31710 81394
D 827445 132752 578593 744270 446042 716855 211155 684994 556923 1019856 220821 383099 479704 332924 515621 609928 495489 741802 130625 350202
E 1078101 142692 686431 1067085 439071 774305 267415 817232 891635 1225847 299511 550953 439135 537320 745843 662946 635891 866060 127733 341910
F 533809 121718 450059 450999 312612 541059 169544 435875 381373 687454 146401 310279 311603 252249 355298 541662 402970 480058 85220 234312
G 1043943 189990 696738 825576 552099 1066213 320105 830120 829867 1215270 312678 479838 512268 484335 750996 864545 722355 951778 158808 411564
H 308719 71130 195973 224532 189524 321458 138965 258138 198526 437238 84493 161914 260307 177599 242326 273822 220482 271193 49896 147018
I 933722 163420 696647 779148 413674 775075 248568 689560 665501 998339 209124 505839 536561 394941 586852 749331 633140 745150 103013 322834
K 866387 123427 597071 845021 352420 674838 224900 689040 833329 999805 230527 499704 478812 433871 618957 649117 592149 734724 102558 330326
L 1577014 245724 1010174 1209093 673188 1234601 415640 973104 1078886 1774109 368336 737749 915158 726217 1038774 1248515 982929 1164291 180014 472504
M 435247 50382 245208 291077 152180 316570 95204 251080 290926 413850 117451 199689 215996 174459 242775 319577 261128 304354 36375 108406
N 554118 106037 378931 459348 309755 524590 166632 520787 453269 714592 162639 418656 421842 292950 368217 491895 385191 503131 87700 248285
P 714406 107407 481661 675169 343068 669937 203444 445798 438734 798215 172183 325933 504394 354379 426646 616487 475868 653357 98682 260902
Q 649443 91724 334988 485523 260767 467674 178795 427814 430406 746243 170464 286123 347358 445493 443371 421132 367001 493498 84418 205712
R 819841 134578 563050 730657 422276 678711 256094 614800 596058 1017683 227775 397183 460706 434279 715987 614290 491147 696504 117367 322554
S 903056 177942 634626 724284 500281 936876 281252 691235 663922 1189788 251608 489757 610758 477067 653919 1039613 671654 797089 139655 353170
T 807206 142149 516545 591659 393125 773111 226037 601197 500720 1020373 201492 376965 571417 354351 496334 668873 600541 739356 109986 277183
V 1096070 182993 747045 887345 481224 853259 265110 790477 735570 1233197 281514 506873 606861 441079 686430 835395 741120 974728 129389 346900
W 144012 31093 111942 116299 85858 129009 50467 119014 115293 225991 49921 91136 77790 99612 121738 125757 99874 129496 31518 62207
Y 385674 89301 316970 336107 243821 408022 136323 314029 290346 529055 105459 237805 246395 232667 318950 359767 286293 352826 66302 190585

Clearly, this is overkill. Rather than a sparsely populated grid like Brenner's Table 2, there are thousands upon thousands of occurrences of every dipeptide combination. This means that each amino acid has 20 C-neighbors and 20 N-neighbors. Since a single triplet allows at most four neighbors on each side, every amino acid would need at least 5 different triplets, or 100 triplets in total, far more than the 64 that exist. Thus, even more conclusively than Brenner's original paper, I have disproved the existence of an overlapping triplet genetic code. Of course we already knew this. A non-obvious thing that these results tell us is that the distribution of dipeptides is certainly not uniform: the counts range from about 31,000 (W followed by C) to almost 1.8 million (L followed by L).
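The final tally can be sketched in Perl as well. The function name required_triplets is hypothetical, and the code makes two assumptions: the matrix is indexed as [first][second] like @count above, and each triplet allows at most four neighbors on a side. Treating the residue that follows as the C-neighbor is also an assumption about the naming, though the total does not depend on it because only the larger of the two counts matters.

```perl
use strict;
use warnings;
use POSIX qw(ceil);
use List::Util qw(max);

# Total number of triplets an overlapping code would need, given a square
# matrix of dipeptide counts where $count->[$first][$second] is a frequency.
sub required_triplets {
    my ($count) = @_;
    my $n     = scalar @$count;
    my $total = 0;
    for my $k (0 .. $n - 1) {
        # Distinct residues seen after (row k) and before (column k) residue k
        my $following = grep { $count->[$k][$_] > 0 } 0 .. $n - 1;
        my $preceding = grep { $count->[$_][$k] > 0 } 0 .. $n - 1;
        # At most 4 neighbors per triplet on each side, at least 1 triplet each
        $total += max(1, ceil(max($following, $preceding) / 4));
    }
    return $total;
}

# Stand-in for the real table: a 20x20 matrix in which every dipeptide occurs
my @full = map { [ (1) x 20 ] } 1 .. 20;
print required_triplets(\@full), "\n";  # 100
```

Passing a reference to the @count matrix from the script would give the same answer, since every cell in the table is nonzero.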

One striking aspect of Brenner's paper is that it is written incredibly confidently. In my own writing, I struggle to convey such confidence (sometimes for good reason). But it is interesting that Brenner does not 100% conclusively prove what he claims, given that posttranslational modification could account for some anomalies in protein sequence data.

References

Brenner, S. (1957). On the Impossibility of all Overlapping Triplet Codes in Information Transfer from Nucleic Acid to Proteins. Proceedings of the National Academy of Sciences, 43(8), 687-694. DOI: 10.1073/pnas.43.8.687
