Gene Expression Programming
Introduction
Gene Expression Programming
(GEP) is the learning algorithm behind
GeneXproTools and what it learns
specifically is about relationships between
variables in sets of data and then builds
models to explain these relationships.
How learning algorithms build models or discover solutions to
problems varies, with some simulating networks of neurons and others
simulating evolution through natural selection. Gene Expression
Programming belongs to the latter group, the so called Evolutionary
Algorithms. And like all evolutionary algorithms, natural or
otherwise, GEP uses populations of individuals (in this case,
populations of models or solutions), selects and reproduces them
according to fitness, and introduces genetic variation using one or
more genetic operators such as mutation or recombination. These
are obviously the prerequisites for evolution to occur.
In GeneXproTools the goal is to design
good statistical models,
which means that GeneXproTools starts by generating a random population of models,
which it then selects according to their performance. Then it
reproduces the selected models, doing so with a certain degree of
genetic variation, thus creating the next generation of new models. By
repeating this process for a certain number of generations,
evolution happens, which in this case means better and better
solutions to the problem at hand.
So these are the general principles of all
evolutionary algorithms and there
are several artificial evolutionary
algorithms that explore them. Of special
interest to the GEP technique are the
Genetic Algorithms (GAs)
and Genetic Programming
(GP), for they serve to illustrate some of
the fundamental characteristics of the GEP
technique and why GEP surpasses the GP
technique by a factor of 100-60,000.
First of all, both GAs and GP are
simple replicator systems, with the
latter considerably more complex than the
former. This means that the stuff or
entities these systems evolve go about their
business doing what has to be done and when
their time comes, they must reproduce their
bodies (hopefully with some variation) to
perpetuate their line. But reproducing
bodies with modification might not be an
easy task, especially if the bodies are
complex structures like the parse trees that
GP evolves. Indeed, the canonical GP
technique is limited to the use of very
conservative operators (almost exclusively a
tree-based crossover) that change the trees
in a way reminiscent of the grafting and
pruning familiar to gardeners and farmers
around the world. And when we compare the
achievements of grafting and pruning
with the achievements of the DNA/protein
system of life on Earth (the most
sophisticated genotype/phenotype
system known to science), it
becomes clear why genotype/phenotype systems
are much more powerful than simple
replicator systems: firstly, in
genotype/phenotype systems there are no
restrictions concerning the type and number
of modifications in the genome and therefore
all the paths and crannies of the fitness
landscape are accessible to meander through,
and secondly, these systems are free to
explore different levels of organization
(for instance, genes are expressed into
proteins; proteins form ribosomes,
microtubules, membranes, etc.; these
macromolecules and structures then form
cells; cells get organized into tissues;
tissues into organs and so forth,
culminating with the organism itself).
Obviously these higher levels of complexity
are completely inaccessible to simple
replicator systems, for no matter how
complex, they continue to be a single
structure, forever incapable of becoming a
part of a much more complex being.
The GEP system is a full-fledged genotype/phenotype system
with expression trees of different sizes and shapes encoded in
linear chromosomes of fixed length. Also important is that GEP
chromosomes are multigenic, encoding multiple expression trees or
sub-programs that can be organized into a much more complex program.
So, like the DNA/protein system of life on Earth, the
genes/trees system of GEP can not only explore all the
paths of the solution space but it's also free to explore
higher levels of organization. But how is this achieved? In order to
answer this question we need to take a closer look at the structures
of the main players of GEP – the chromosomes and the expression
trees – and see how they work together.
The Architecture of GEP Programs
There are two main players in Gene Expression Programming: the
chromosomes and the expression trees (ETs),
being the latter the expression of the genetic information encoded in the former.
As in nature, the process of information decoding is called
translation. And this translation implies obviously a kind of code and a set of rules. The
genetic code of GEP is very simple: a one-to-one relationship between the symbols of the
genes and the nodes they represent in the trees. The
rules are also very simple: they determine the spatial organization of nodes in
the expression trees and the type of interaction between sub-ETs. Therefore, there are
two languages in GEP: the language of genes and the language of expression trees and,
thanks to the simple rules that determine the structure of ETs and their interactions,
it is possible to infer immediately the expression tree given the sequence of a gene, and vice versa.
This means that we can choose to have a very complex program represented by its compact genome
without losing in meaning. This unequivocal bilingual notation is called
Karva language. Its details are explained in the remainder of this
guide.
The structural organization of GEP genes is better understood in terms of
open reading frames (ORFs). In biology, an ORF or coding sequence of a gene
begins with the start codon, continues with the amino acid codons, and ends at a termination codon.
However, a gene is more than the respective ORF, with sequences upstream of the start codon and
sequences downstream of the stop codon.
And although in GEP the start site is always the first position of a gene, the termination point
does not always coincide with the last position of a gene. Consequently, it is common for GEP genes
to have noncoding regions downstream of the termination point. These noncoding
regions obviously do not interfere with expression but, nonetheless, they play a crucial role
in evolution, for they alone allow the creation of valid programs no matter how profoundly
their chromosomes are modified by the genetic operators.
Consider, for example, the algebraic expression:
|
(1)
|
It can also be represented as a diagram or an expression tree:
where “Q” represents the square root function.
This kind of diagram representation is what is called the phenotype
in Gene Expression Programming. And the
genotype can be easily inferred from the phenotype as follows:
which is the straightforward reading of the expression tree from left to right and
from top to bottom. The expression (2) is an ORF, starting at “Q” (position 0) and
terminating at “d” (position 7). These ORFs were named
K-expressions from Karva notation.
Consider another ORF, the following K-expression:
01234567890
|
|
Q*b**+baQba
|
(3) |
Its expression as an ET is also very simple and straightforward. In order to express the ORF correctly, we must follow the rules governing the spatial distribution of functions and terminals. First, the start of a gene corresponds to the
root of the ET which is placed in the topmost line. Second, in the next line, below each function, are placed as many branch nodes as there are arguments to that function. Third, from left to right, the nodes are filled consecutively with the next elements of the K-expression. Fourth, the process is repeated until a line containing only terminals is formed. In this case, the following expression tree is
obtained:
Just by looking at the structure of GEP
K-expressions, it is difficult or even
impossible to see the advantages of such a
representation, except perhaps for its
simplicity and elegance. However, when
K-expressions are analyzed in the context of
a gene, the advantages of this
representation become obvious. As previously
stated, GEP chromosomes have fixed length,
and they are composed of one or more genes
of equal length. Consequently, the length of
a gene is also fixed. Thus, in GEP, what
varies is not the length of genes which is
constant, but the length of the
K-expressions. Indeed, the length of a
K-expression may be equal to or less than
the length of the gene. In the former case,
the termination point coincides with the end
of the gene, and in the latter, the
termination point is somewhere upstream of
the end of the gene.
What is the function of these noncoding regions of GEP genes? We will see that they are the essence of
Gene Expression Programming and evolvability, for they allow the modification of the genome using several genetic operators without restrictions, always producing syntactically correct programs. Thus, in
GEP, the fundamental property of genotype/phenotype systems –
syntactic closure – is intrinsic, allowing the totally unconstrained restructuring of the genotype and, consequently, an efficient evolution.
Let’s analyze then the structural organization of GEP genes in order to understand how they invariably code for syntactically correct programs and why they allow an unconstrained application of virtually any genetic operator.
The genes of Gene Expression Programming are composed of a head and a
tail. The head contains symbols that represent both functions and terminals, whereas the tail contains only terminals. For each problem, the length of the head
h is chosen, whereas the length of the tail t is a function of
h and the number of arguments n of the function with more arguments (also called
maximum arity) and is evaluated by the equation:
t = h (n-1) + 1
Consider a gene for which the set of functions F = {Q, *, /, -, +} and the
set of terminals T = {a, b}. In this case n = 2; if we chose an
h = 15, then t = 15 (2 - 1) + 1 = 16; thus, the length of the gene
g is 15 + 16 = 31. One such gene is shown below (the head is shown in
blue):
0123456789012345678901234567890
|
|
*b+a-aQab+//+b+babbabbbababbaaa
|
(4) |
It codes for the following ET:
In this case, the K-expression ends at position 7, whereas the gene ends at position
30.
Suppose now a mutation occurred at position 6, changing the “Q” into “*”. Then the following gene is obtained:
0123456789012345678901234567890
|
|
*b+a-a*ab+//+b+babbabbbababbaaa
|
(5) |
And its expression gives:
In this case, the termination point shifts one position to the right (position
8), changing slightly the daughter tree.
Consider another mutation in chromosome (4) above, the substitution of “a” at position 5 by “+”,
resulting in the following chromosome:
0123456789012345678901234567890
|
|
*b+a-+Qab+//+b+babbabbbababbaaa
|
(6) |
And its expression gives:
In this case, the termination point shifts 12 positions to the right (position
19), enlarging and changing significantly the daughter tree.
Obviously the opposite might also happen, and the daughter tree might shrink. For example, consider again
gene (4) above, and suppose a mutation occurred at position 2, changing the “+” into “Q”,
giving:
0123456789012345678901234567890 |
|
*bQa-aQab+//+b+babbabbbababbaaa |
(7) |
And now its expression results in the following ET:
In this case, the ORF ends at position 3, shortening the original ET in
four nodes.
So, despite their fixed length, each gene has the potential to code for expression trees of different sizes and shapes, where the simplest is composed of only one node (when the first element of a gene is a terminal) and the largest is composed of as many nodes as the length of the gene (when all the elements of the head are functions with maximum arity).
It is evident from the examples above, that
any modification made in the genome, no
matter how profound, always results in a
structurally correct program. This is
unique to GEP and is obviously at the heart
of its superior performance: for instance,
in the simple replicator system of GP, most
modifications result in syntactically
invalid programs (imagine, for instance,
what would happen if a terminal in a GP tree
is replaced by a function), which is why
most GP implementations rely exclusively on
inefficient tree-specific crossover to
create genetic diversity.
Multigenic Chromosomes and Linking Functions
Genotype/phenotype systems are such well-oiled machines that their
expansion into more complex systems is rather easy, and the
introduction of multiple genes in Gene Expression Programming
clearly illustrates this.
So, the chromosomes of Gene Expression Programming are usually composed of more than one gene of equal length. For each problem or run, the number of genes, as well as the
size of the head, are a priori chosen. Each gene codes for a sub-ET and the sub-ETs interact with one another forming a more complex multi-subunit expression tree.
Consider, for example, the following chromosome with length 39, composed of three genes, each with length
13 (the
heads are shown in blue):
012345678901201234567890120123456789012
QaQ+-Qbbaaaba+Q+ab+abababa*-**b+aabbaba
It has obviously three K-expressions and
each one of them is expressed independently,
resulting therefore in three different
sub-ETs:
For the sake of simplicity, in the linear
representation the start of each
K-expression is always given by position 0;
the end of each K-expression, though, is
only evident upon construction of the
corresponding sub-ET. As shown in the image
above, the first K-expression ends at
position 1; the second at position 7; and
the last at position 10. Thus, the
multigenic chromosomes of GEP contain
multiple K-expressions of different sizes,
each one of them coding for a structurally
and functionally unique sub-ET.
Obviously, these sub-ETs or sub-programs must interact with one another. And in
GeneXproTools they interact
through special functions, the so called linking functions: addition, subtraction,
multiplication, division, average, minimum, and maximum for mathematical models and And, Or,
Nand, Nor, Xor, Nxor, Less Than, Greater Than, Less Or Equal,
and Greater Or Equal for logic functions.
The linking by addition of all the three sub-ETs shown in the image above is illustrated
below (the linking functions are shown in gray):
Note that the final program represented in the image above could be linearly encoded as the following
K-expression:
01234567890123456789012
++*Q+-*aQ+*b+aab+abbaab
However, the use of multigenic chromosomes
is more appropriate to evolve solutions to
complex problems, for they permit the
modular construction of complex,
hierarchical structures, where each gene
codes for a smaller and simpler building
block. These smaller building blocks are
separated from one another and are therefore
free to evolve independently, allowing for
the creation of different new units that
might prove handy in a new situation.
GEP with Random Numerical Constants
The implementation of a system for handling random numerical
constants (RNCs) in Gene Expression Programming is another
example of how easy it is to create and explore higher levels of
complexity in genotype/phenotype systems and, indeed, RNCs are
elegantly and efficiently
implemented in GEP.
This elegant construct involves an extra
terminal “?” that is used for representing the RNCs
and an additional domain Dc at the end of GEP genes. Structurally, the Dc comes after the
tail, has a length t equal to
the length of the tail, and is composed of the symbols used to represent the random constants. Therefore, another region (besides the head and
the tail) with defined boundaries and its own alphabet is created in the gene.
Consider the single-gene chromosome with a head size h = 7 (the Dc is shown in
blue):
01234567890123456789012
+?*+?**aaa??aaa68083295
where the terminal “?” represents the random constants. The expression of this kind of
gene is exactly done as
explained above for genes
with just the head/tail construct, giving:
Then the ?’s in the expression tree are replaced from left to right and from top to bottom by the symbols (numerals, for simplicity) in Dc, obtaining:
The values corresponding to these symbols are kept in an array. For simplicity, the number represented by the numeral indicates the order in the array. For instance, for the 10 elements
zero-based array:
A = {0.611, 1.184, 2.449, 2.98, 0.496, 2.286, 0.93, 2.305, 2.737, 0.755}
the chromosome above gives:
Obviously, the GEP-RNC algorithm can also be used to evolve highly sophisticated programs composed of multiple sub-programs through the use of
multigenic chromosomes.
Obviously in this case the genes of the
multigenic GEP-RNC system carry the extra Dc
domain for encoding the random numerical
constants.
See Also:
Related Tutorials:
Related Videos:
References:
Ferreira, C., 2006. Gene Expression Programming:
Mathematical Modeling by an Artificial Intelligence. 2nd Edition, Springer-Verlag, Germany.
Last modified: May
1, 2013
Cite this as:
Ferreira, C. "Gene Expression Programming" From GeneXproTools
Documentation – A Gepsoft Web Resource.
http://www.gepsoft.com/GeneXproTools/GeneExpressionProgramming.htm
|