There are two main players in gene expression programming: the
chromosomes and the expression trees (ETs) or programs, being the latter the expression of the genetic information encoded in the former. As in nature, the process of information decoding is called
translation. And this translation implies obviously a kind of code and a set of rules. The
genetic code of gene expression programming is very simple: a one-to-one relationship between the symbols of the chromosome and the nodes they represent in the trees. The
rules are also very simple: they determine the spatial organization of nodes in the expression trees and the type of interaction between sub-ETs. Therefore, there are two languages in GEP: the language of the genes and the language of expression trees and, thanks to the simple rules that determine the structure of ETs and their interactions, it is possible to infer immediately the expression tree given the sequence of a gene, and vice versa. This means that one can choose to have a very complex program represented by its compact genome without losing in meaning. This unequivocal bilingual notation is called
Karva language. Its details are explained in the remainder of this section.
The structural organization of GEP genes is better understood in terms of
open reading frames (ORFs). In biology, an ORF or coding sequence of a gene begins with the start codon, continues with the amino acid codons, and ends at a termination codon. However, a gene is more than the respective ORF, with sequences upstream of the start codon and sequences downstream of the stop codon. Although in GEP the start site is always the first position of a gene, the termination point does not always coincide with the last position of a gene. Consequently, it is common for GEP genes to have
noncoding regions downstream of the termination point. These noncoding
regions obviously do not interfere with expression but, nonetheless, they play a crucial role in evolution, for they alone allow the creation of valid programs no matter how profoundly their chromosomes are modified. Well get back to them later.
Consider, for example, the algebraic expression:
It can also be represented as a diagram or an expression tree:
where Q represents the square root function.
This kind of diagram representation is what is called the phenotype of gene expression programming. And the
genotype can be easily inferred from the phenotype as follows:
01234567
Q*-+abcd
(2)
which is the straightforward reading of the expression tree from left to right and from top to bottom (exactly as we read a page of text). The expression (2) is an ORF, starting at Q (position 0) and terminating at d (position 7). These ORFs were named
K-expressions from Karva language.
Consider another ORF, the following K-expression:
01234567890
Q*b**+baQba
(3)
Its expression as an ET is also very simple and straightforward. In order to express the ORF correctly, we must follow the rules governing the spatial distribution of functions and terminals. First, the start of a gene corresponds to the
root of the ET which is placed in the topmost line. Second, in the next line, below each function, are placed as many branch nodes as there are arguments to that function. Third, from left to right, the nodes are filled consecutively with the next elements of the K-expression. Fourth, the process is repeated until a line containing only terminals is formed. In this case, the following expression tree is formed:
Looking at the structure of GEP ORFs only, it is difficult or even impossible to see the advantages of such a representation, except perhaps for its simplicity and elegance. However, when
ORFs are analyzed in the context of a gene, the advantages of this representation become obvious. As previously stated, GEP chromosomes have fixed length, and they are composed of one or more genes of equal length. Consequently, the length of a gene is also fixed. Thus, in gene expression programming, what varies is not the length of genes which is constant, but the length of the
ORFs. Indeed, the length of an
ORF may be equal to or less than the length of the gene. In the first case, the termination point coincides with the end of the gene, and in the latter, the termination point is somewhere upstream of the end of the gene.
What is the function of these noncoding regions of GEP genes? We will see that they are the essence of gene expression programming and evolvability, for they allow the modification of the genome using several genetic operators without restrictions, always producing syntactically correct programs. Thus, in gene expression programming, the fundamental property of genotype/phenotype systems
syntactic closure is intrinsic, allowing the totally unconstrained restructuring of the genotype and, consequently, an efficient evolution.
Lets analyze then the structural organization of GEP genes in order to understand how they invariably code for syntactically correct programs and why they allow an unconstrained application of virtually any genetic operator.
The genes of gene expression programming are composed of a head and a
tail. The head contains symbols that represent both functions and terminals, whereas the tail contains only terminals. For each problem, the length of the head
h is chosen, whereas the length of the tail t is a function of
h and the number of arguments n of the function with more arguments (also called
maximum arity) and is evaluated by the equation:
t = h (n-1) + 1
Consider a gene for which the set of functions F = {Q, *, /, -, +} and the
set of terminals T = {a, b}. In this case n = 2; if we chose an
h = 15, then t = 15 (2 - 1) + 1 = 16; thus, the length of the gene
g is 15 + 16 = 31. One such gene is shown below (the head is shown in
blue):
0123456789012345678901234567890
*b+a-aQab+//+b+babbabbbababbaaa
(4)
It codes for the following ET:
In this case, the K-expression ends at position 7, whereas the gene ends at position
30.
Suppose now a mutation occurred at position 6, changing the Q into *. Then the following gene is obtained:
0123456789012345678901234567890
*b+a-a*ab+//+b+babbabbbababbaaa
(5)
And its expression gives:
In this case, the termination point shifts one position to the right (position
8), changing slightly the daughter tree.
Consider another mutation in chromosome (4) above, the substitution of a at position 5 by +. The following chromosome is obtained:
0123456789012345678901234567890
*b+a-+Qab+//+b+babbabbbababbaaa
(6)
And its expression gives:
In this case, the termination point shifts twelve positions to the right (position
19), enlarging and changing significantly the daughter tree.
Obviously the opposite also might happen, and the daughter tree might shrink. For example, consider again gene (4) above, and suppose a mutation occurred at position 2, changing the + into Q:
0123456789012345678901234567890
*bQa-aQab+//+b+babbabbbababbaaa
(6)
Its expression results in the following ET:
In this case, the ORF ends at position 3, shortening the original ET in
four nodes.
So, despite their fixed length, each gene has the potential to code for expression trees of different sizes and shapes, where the simplest is composed of only one node (when the first element of a gene is a terminal) and the largest is composed of as many nodes as the length of the gene (when all the elements of the head are functions with maximum arity).
It is evident from the examples above, that any modification made in the genome, no matter how profound, always results in a structurally correct program. Obviously, the structural organization of genes must be preserved, always maintaining the boundaries between head and tail.
If you want to explore further the plasticity of GEP chromosomes, please see the next chapter
Genetic Operators where the mechanisms and effects of all the genetic operators used in APS 3.0 are thoroughly analyzed.
|