DNA is often thought of as the "software" of life.
When talking about deoxyribonucleic acid -- DNA, the molecule that carries the genetic information of life -- scientists often make comparisons to computer systems, with DNA being an enormous "program" to be run by the body's hardware. But significant differences exist between the genetic code of DNA and the binary code used by computers, and each system has its advantages and limitations.
The simplest unit of binary code is the binary digit, or "bit," which can have one of two values: 0 or 1. The simplest unit of DNA, on the other hand, is the nucleotide, which can have one of four bases: adenine, cytosine, thymine or guanine (A, C, T or G). This increased variation means that each nucleotide of DNA can hold twice as much information as each digit of a binary program.
Computers and biological systems both read their respective codes in blocks of several units instead of analysing each bit or nucleotide individually. Binary information is grouped into sets of eight bits, called bytes; each byte thus has one of 256 possible configurations of zeros and ones. Genetic information instead comes in triplets of nucleotides known as codons, which represent different amino acids, meaning that each DNA "byte" has only 64 possibilities.
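The counts above follow from simple powers and base-2 logarithms. A minimal Python sketch, assuming every symbol is equally likely (the function name bits_per_symbol is only illustrative):

```python
import math

def bits_per_symbol(alphabet_size: int, length: int = 1) -> float:
    """Bits needed to pick one of alphabet_size**length equally likely sequences."""
    return length * math.log2(alphabet_size)

print(bits_per_symbol(2))              # one binary digit  -> 1.0
print(bits_per_symbol(4))              # one nucleotide    -> 2.0
print(2 ** 8, bits_per_symbol(2, 8))   # one byte          -> 256 8.0
print(4 ** 3, bits_per_symbol(4, 3))   # one codon         -> 64 6.0
```

So a codon carries at most 6 bits, two fewer than a byte.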
Starting and Stopping
Both binary and genetic codes contain signals that indicate where to begin and end the reading of their messages. Computers use start and stop bits for this purpose (in serial communication, for example), while the genetic code contains one start codon and three stop codons. However, DNA often exhibits greater flexibility in starting and stopping, as certain parts of the genetic code can be read in different, overlapping segments. Each way of dividing a sequence into consecutive codons is called a reading frame, and a stretch of a frame that runs from a start codon to a stop codon is an open reading frame. In some genomes, notably those of viruses, overlapping frames each code for a different but still useful final product.
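To see how one stretch of DNA can be read in more than one way, the sketch below splits a short sequence into codons starting at each of the three possible offsets. The sequence is made up for illustration, and the function name reading_frames is hypothetical; only the forward strand is considered.

```python
def reading_frames(seq: str) -> list[list[str]]:
    """Split seq into complete codons for each of the three forward offsets."""
    frames = []
    for offset in range(3):
        # Stop at len(seq) - 2 so that only complete triplets are kept.
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames.append(codons)
    return frames

seq = "ATGGCATGACTAA"  # made-up example sequence
for offset, codons in enumerate(reading_frames(seq)):
    print(offset, codons)
# 0 ['ATG', 'GCA', 'TGA', 'CTA']
# 1 ['TGG', 'CAT', 'GAC', 'TAA']
# 2 ['GGC', 'ATG', 'ACT']
```

The start codon ATG appears in two different frames here, and the stop codons TGA and TAA fall in different frames as well, so the same letters yield different messages depending on where reading begins.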
In digital code, a single inaccurate bit causes its byte to have a different value, which can introduce significant errors to a computer program. DNA is considerably more resilient in comparison, as many nucleotide changes do not result in changes to the value of -- the amino acid coded by -- a codon. Although 64 codons are possible, biological machinery uses only 20 amino acids in the construction of proteins. Many codons that differ by one nucleotide therefore code for the same amino acid, a property known as redundancy. Redundancy protects genetic data from some inevitable errors that occur in the replication and reading of DNA.
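The degree of redundancy can be estimated by brute force. The sketch below builds the standard codon table and counts how many single-nucleotide substitutions leave the meaning of a codon unchanged. It assumes, as a simplification, that every substitution is equally likely and it includes the stop codons; the names CODON_TABLE and synonymous_fraction are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons in the order TTT, TTC, TTA, TTG, TCT, ...
# One letter per amino acid; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_fraction(table: dict) -> float:
    """Fraction of single-nucleotide substitutions that keep the same meaning."""
    same = total = 0
    for codon, meaning in table.items():
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue  # not a substitution
                mutant = codon[:pos] + base + codon[pos + 1:]
                total += 1
                same += table[mutant] == meaning
    return same / total

print(len(CODON_TABLE))                             # 64 codons
print(len(set(CODON_TABLE.values())))               # 21 meanings: 20 amino acids + stop
print(f"{synonymous_fraction(CODON_TABLE):.0%}")    # 24%
```

Under these assumptions roughly one substitution in four is silent, whereas flipping any bit of a byte always changes its value.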
How Much Information Does DNA Encode?
The simplest answer to “How much information does DNA encode?” is “enough data to completely specify an organism’s particular genome and epigenome.” That involves the number of base pairs and the number of possible sites for adding a suppressor. Human DNA has approximately 3 billion base pairs, according to the National Human Genome Research Institute. That means 4^3,000,000,000 possible base sequences.
For simplicity, let’s say that each gene is either suppressed, or not, in the epigenome. That would be a binary choice for each gene. Most humans have between 20,000 and 25,000 genes. Taking 22,500 as the average, that adds about 2^22,500 more choices.
The length of DNA varies for different species. Humans, with about 3 billion base pairs, have neither the largest nor the smallest genome.
Normally we specify the “amount of information” in bits, so 2^n choices require n bits. Note that 4^j = (2^2)^j = 2^(2j).
Therefore, the human genome encodes 4^(3 billion) = 2^(6 billion) choices, or 6 billion bits of information. The epigenome encodes at least 2^22,500 choices, or 22,500 bits. The total information is 6,000,022,500 bits, or approximately 6 Gb (gigabits).
We usually discuss computer storage in bytes rather than bits. Counted in 7-bit ASCII characters, 6 Gb would amount to 6/7 = 0.857 GB (gigabytes), or 857 MB (megabytes); counted in standard 8-bit bytes, it is 6/8 = 0.75 GB, or 750 MB.
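The arithmetic can be checked in a few lines of Python. This is a sketch under the article's own assumptions: 3 billion base pairs at 2 bits each, plus one on/off bit for each of an assumed 22,500 genes. The 8-bit figure is included for comparison with the 7-bit ASCII one.

```python
BASE_PAIRS = 3_000_000_000   # approximate length of the human genome
GENES = 22_500               # assumed average gene count (one bit each)

genome_bits = 2 * BASE_PAIRS          # 4^j = 2^(2j), so 2 bits per base pair
total_bits = genome_bits + GENES

print(f"{total_bits:,} bits")                             # 6,000,022,500 bits
print(f"{total_bits / 7 / 1e6:.0f} MB at 7 bits each")    # 857 MB at 7 bits each
print(f"{total_bits / 8 / 1e6:.0f} MB at 8 bits each")    # 750 MB at 8 bits each
```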
How Much Information Do the Amino Acids Encode?
One might suggest that the genetic information is equally carried by the amino acids produced by the codons. (This still assumes that “junk” DNA also carries exactly that information.) There are 21 possible results from each codon. The one “start” codon encodes one amino acid (methionine); 60 different codons encode the other 19 amino acids; and three codons encode “stop”. The 3 billion base pairs would be grouped into 1 billion codons, and each codon has 21 possible meanings. So that would be 21^(1 billion) possible sequences.
We need to convert 21^(1 billion) to a power of two, since all the other information results are in bits. The conversion factor is ln(21)/ln(2), where “ln” is the natural logarithm function. We have ln(21)/ln(2) = 3.0445/0.6931 = 4.3923 (rounded), according to my calculator. (1 billion) * 4.3923 = 4,392,300,000 bits of information to code amino acids. So that is a total information of 4,392,322,500 bits including the epigenome. In 7-bit ASCII code, that would be about 627 million characters, or roughly 627 MB (megabytes); in 8-bit bytes, it is about 549 MB.
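The same conversion can be done without a calculator. The sketch below uses Python's math.log2, which equals ln(x)/ln(2), and keeps the article's assumptions of 1 billion codons, 21 meanings per codon and 22,500 epigenome bits.

```python
import math

CODONS = 1_000_000_000   # 3 billion base pairs read as triplets
MEANINGS = 21            # 20 amino acids plus "stop"
GENES = 22_500           # assumed one epigenome bit per gene

bits_per_codon = math.log2(MEANINGS)        # same as ln(21) / ln(2)
total_bits = CODONS * bits_per_codon + GENES

print(f"{bits_per_codon:.4f} bits per codon")             # 4.3923 bits per codon
print(f"{total_bits / 7 / 1e6:.0f} MB at 7 bits each")    # 627 MB at 7 bits each
print(f"{total_bits / 8 / 1e6:.0f} MB at 8 bits each")    # 549 MB at 8 bits each
```

Carrying the unrounded factor changes the bit count slightly but not the figures in megabytes.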
Comparing the Genetic Code to Computer Data Storage
Let’s conclude by comparing computer data storage to the genetic code for DNA. Computers store data in two-valued bits, grouped as bytes of 8 bits; an ASCII character uses 7 of them. One byte holds 2^8=256 unique values, and one 7-bit ASCII character holds 2^7=128.
DNA stores data in four-valued base pairs, which are read (via messenger RNA) as codons of 3 bases. One codon holds 4^3=2^6=64 unique values. A sequence of base pairs that conveys biological information is called a gene. DNA includes extra information to express or suppress specific genes. Each gene has at least one bit of information for expression or suppression.
Computer files may be measured in megabytes or gigabytes: millions or billions of bytes. One CD-ROM disc may store about 710 MB. Modern solid-state memory and disk drives can store hundreds of gigabytes or more. If we can fully prescribe one human’s DNA by specifying the full sequence of base pairs, plus a binary flag to express or suppress each gene, then human DNA contains about 6 Gb of information: 857 MB counted in 7-bit ASCII characters, or 750 MB in 8-bit bytes.