Gene Expression Programming

Home

About Us

Contact

Blog


What's New	Products	Buy Now	Downloads	Forum

GeneXproTools Online Guide Learn how to use the 5 modeling platforms of GeneXproTools with the Online Guide

Last update: February 19, 2014

Gene Expression Programming

Introduction
The Architecture of GEP Programs
Multigenic Chromosomes and Linking Functions
GEP with Random Numerical Constants

Introduction

Gene Expression Programming (GEP) is the learning algorithm behind GeneXproTools and what it learns specifically is about relationships between variables in sets of data and then builds models to explain these relationships.

How learning algorithms build models or discover solutions to problems varies, with some simulating networks of neurons and others simulating evolution through natural selection. Gene Expression Programming belongs to the latter group, the so called Evolutionary Algorithms. And like all evolutionary algorithms, natural or otherwise, GEP uses populations of individuals (in this case, populations of models or solutions), selects and reproduces them according to fitness, and introduces genetic variation using one or more genetic operators such as mutation or recombination. These are obviously the prerequisites for evolution to occur.

In GeneXproTools the goal is to design good statistical models, which means that GeneXproTools starts by generating a random population of models, which it then selects according to their performance. Then it reproduces the selected models, doing so with a certain degree of genetic variation, thus creating the next generation of new models. By repeating this process for a certain number of generations, evolution happens, which in this case means better and better solutions to the problem at hand.

So these are the general principles of all evolutionary algorithms and there are several artificial evolutionary algorithms that explore them. Of special interest to the GEP technique are the Genetic Algorithms (GAs) and Genetic Programming (GP), for they serve to illustrate some of the fundamental characteristics of the GEP technique and why GEP surpasses the GP technique by a factor of 100-60,000.

First of all, both GAs and GP are simple replicator systems, with the latter considerably more complex than the former. This means that the stuff or entities these systems evolve go about their business doing what has to be done and when their time comes, they must reproduce their bodies (hopefully with some variation) to perpetuate their line. But reproducing bodies with modification might not be an easy task, especially if the bodies are complex structures like the parse trees that GP evolves. Indeed, the canonical GP technique is limited to the use of very conservative operators (almost exclusively a tree-based crossover) that change the trees in a way reminiscent of the grafting and pruning familiar to gardeners and farmers around the world. And when we compare the achievements of grafting and pruning with the achievements of the DNA/protein system of life on Earth (the most sophisticated genotype/phenotype system known to science), it becomes clear why genotype/phenotype systems are much more powerful than simple replicator systems: firstly, in genotype/phenotype systems there are no restrictions concerning the type and number of modifications in the genome and therefore all the paths and crannies of the fitness landscape are accessible to meander through, and secondly, these systems are free to explore different levels of organization (for instance, genes are expressed into proteins; proteins form ribosomes, microtubules, membranes, etc.; these macromolecules and structures then form cells; cells get organized into tissues; tissues into organs and so forth, culminating with the organism itself). Obviously these higher levels of complexity are completely inaccessible to simple replicator systems, for no matter how complex, they continue to be a single structure, forever incapable of becoming a part of a much more complex being.

The GEP system is a full-fledged genotype/phenotype system with expression trees of different sizes and shapes encoded in linear chromosomes of fixed length. Also important is that GEP chromosomes are multigenic, encoding multiple expression trees or sub-programs that can be organized into a much more complex program. So, like the DNA/protein system of life on Earth, the genes/trees system of GEP can not only explore all the paths of the solution space but it's also free to explore higher levels of organization. But how is this achieved? In order to answer this question we need to take a closer look at the structures of the main players of GEP – the chromosomes and the expression trees – and see how they work together.

The Architecture of GEP Programs

There are two main players in Gene Expression Programming: the chromosomes and the expression trees (ETs), being the latter the expression of the genetic information encoded in the former. As in nature, the process of information decoding is called translation. And this translation implies obviously a kind of code and a set of rules. The genetic code of GEP is very simple: a one-to-one relationship between the symbols of the genes and the nodes they represent in the trees. The rules are also very simple: they determine the spatial organization of nodes in the expression trees and the type of interaction between sub-ETs. Therefore, there are two languages in GEP: the language of genes and the language of expression trees and, thanks to the simple rules that determine the structure of ETs and their interactions, it is possible to infer immediately the expression tree given the sequence of a gene, and vice versa. This means that we can choose to have a very complex program represented by its compact genome without losing in meaning. This unequivocal bilingual notation is called Karva language. Its details are explained in the remainder of this guide.

The structural organization of GEP genes is better understood in terms of open reading frames (ORFs). In biology, an ORF or coding sequence of a gene begins with the start codon, continues with the amino acid codons, and ends at a termination codon. However, a gene is more than the respective ORF, with sequences upstream of the start codon and sequences downstream of the stop codon. And although in GEP the start site is always the first position of a gene, the termination point does not always coincide with the last position of a gene. Consequently, it is common for GEP genes to have noncoding regions downstream of the termination point. These noncoding regions obviously do not interfere with expression but, nonetheless, they play a crucial role in evolution, for they alone allow the creation of valid programs no matter how profoundly their chromosomes are modified by the genetic operators.

Consider, for example, the algebraic expression:

(1)

It can also be represented as a diagram or an expression tree:

An example of a mathematical expression tree or parse tree

where “Q” represents the square root function.

This kind of diagram representation is what is called the phenotype in Gene Expression Programming. And the genotype can be easily inferred from the phenotype as follows:

01234567
Q*-+abcd	(2)

which is the straightforward reading of the expression tree from left to right and from top to bottom. The expression (2) is an ORF, starting at “Q” (position 0) and terminating at “d” (position 7). These ORFs were named K-expressions from Karva notation.

Consider another ORF, the following K-expression:

01234567890
Qb*+baQba	(3)

Its expression as an ET is also very simple and straightforward. In order to express the ORF correctly, we must follow the rules governing the spatial distribution of functions and terminals. First, the start of a gene corresponds to the root of the ET which is placed in the topmost line. Second, in the next line, below each function, are placed as many branch nodes as there are arguments to that function. Third, from left to right, the nodes are filled consecutively with the next elements of the K-expression. Fourth, the process is repeated until a line containing only terminals is formed. In this case, the following expression tree is obtained:

Expressing a k-expression as a mathematical expression tree

Just by looking at the structure of GEP K-expressions, it is difficult or even impossible to see the advantages of such a representation, except perhaps for its simplicity and elegance. However, when K-expressions are analyzed in the context of a gene, the advantages of this representation become obvious. As previously stated, GEP chromosomes have fixed length, and they are composed of one or more genes of equal length. Consequently, the length of a gene is also fixed. Thus, in GEP, what varies is not the length of genes which is constant, but the length of the K-expressions. Indeed, the length of a K-expression may be equal to or less than the length of the gene. In the former case, the termination point coincides with the end of the gene, and in the latter, the termination point is somewhere upstream of the end of the gene.

What is the function of these noncoding regions of GEP genes? We will see that they are the essence of Gene Expression Programming and evolvability, for they allow the modification of the genome using several genetic operators without restrictions, always producing syntactically correct programs. Thus, in GEP, the fundamental property of genotype/phenotype systems – syntactic closure – is intrinsic, allowing the totally unconstrained restructuring of the genotype and, consequently, an efficient evolution.

Let’s analyze then the structural organization of GEP genes in order to understand how they invariably code for syntactically correct programs and why they allow an unconstrained application of virtually any genetic operator.

The genes of Gene Expression Programming are composed of a head and a tail. The head contains symbols that represent both functions and terminals, whereas the tail contains only terminals. For each problem, the length of the head h is chosen, whereas the length of the tail t is a function of h and the number of arguments n of the function with more arguments (also called maximum arity) and is evaluated by the equation:

t = h (n-1) + 1

Consider a gene for which the set of functions F = {Q, *, /, -, +} and the set of terminals T = {a, b}. In this case n = 2; if we chose an h = 15, then t = 15 (2 - 1) + 1 = 16; thus, the length of the gene g is 15 + 16 = 31. One such gene is shown below (the head is shown in blue):

0123456789012345678901234567890
*b+a-aQab+//+b+babbabbbababbaaa	(4)

It codes for the following ET:

Expressing a gene as a mathematical expression tree

In this case, the K-expression ends at position 7, whereas the gene ends at position 30.

Suppose now a mutation occurred at position 6, changing the “Q” into “*”. Then the following gene is obtained:

0123456789012345678901234567890
b+a-aab+//+b+babbabbbababbaaa	(5)

And its expression gives:

Expressing a mutated gene as a mathematical expression tree

In this case, the termination point shifts one position to the right (position 8), changing slightly the daughter tree.

Consider another mutation in chromosome (4) above, the substitution of “a” at position 5 by “+”, resulting in the following chromosome:

0123456789012345678901234567890
*b+a-+Qab+//+b+babbabbbababbaaa	(6)

And its expression gives:

In this case, the termination point shifts 12 positions to the right (position 19), enlarging and changing significantly the daughter tree.

Obviously the opposite might also happen, and the daughter tree might shrink. For example, consider again gene (4) above, and suppose a mutation occurred at position 2, changing the “+” into “Q”, giving:

0123456789012345678901234567890
*bQa-aQab+//+b+babbabbbababbaaa	(7)

And now its expression results in the following ET:

In this case, the ORF ends at position 3, shortening the original ET in four nodes.

So, despite their fixed length, each gene has the potential to code for expression trees of different sizes and shapes, where the simplest is composed of only one node (when the first element of a gene is a terminal) and the largest is composed of as many nodes as the length of the gene (when all the elements of the head are functions with maximum arity).

It is evident from the examples above, that any modification made in the genome, no matter how profound, always results in a structurally correct program. This is unique to GEP and is obviously at the heart of its superior performance: for instance, in the simple replicator system of GP, most modifications result in syntactically invalid programs (imagine, for instance, what would happen if a terminal in a GP tree is replaced by a function), which is why most GP implementations rely exclusively on inefficient tree-specific crossover to create genetic diversity.

Multigenic Chromosomes and Linking Functions

Genotype/phenotype systems are such well-oiled machines that their expansion into more complex systems is rather easy, and the introduction of multiple genes in Gene Expression Programming clearly illustrates this.

So, the chromosomes of Gene Expression Programming are usually composed of more than one gene of equal length. For each problem or run, the number of genes, as well as the size of the head, are a priori chosen. Each gene codes for a sub-ET and the sub-ETs interact with one another forming a more complex multi-subunit expression tree.

Consider, for example, the following chromosome with length 39, composed of three genes, each with length 13 (the heads are shown in blue):

012345678901201234567890120123456789012
QaQ+-Qbbaaaba+Q+ab+abababa*-**b+aabbaba

It has obviously three K-expressions and each one of them is expressed independently, resulting therefore in three different sub-ETs:

Expressing a multigenic chromosome as multiple expression trees

For the sake of simplicity, in the linear representation the start of each K-expression is always given by position 0; the end of each K-expression, though, is only evident upon construction of the corresponding sub-ET. As shown in the image above, the first K-expression ends at position 1; the second at position 7; and the last at position 10. Thus, the multigenic chromosomes of GEP contain multiple K-expressions of different sizes, each one of them coding for a structurally and functionally unique sub-ET.

Obviously, these sub-ETs or sub-programs must interact with one another. And in GeneXproTools they interact through special functions, the so called linking functions: addition, subtraction, multiplication, division, average, minimum, and maximum for mathematical models and And, Or, Nand, Nor, Xor, Nxor, Less Than, Greater Than, Less Or Equal, and Greater Or Equal for logic functions.

The linking by addition of all the three sub-ETs shown in the image above is illustrated below (the linking functions are shown in gray):

Linking three expression trees with addition

Note that the final program represented in the image above could be linearly encoded as the following K-expression:

01234567890123456789012
++*Q+-*aQ+*b+aab+abbaab

However, the use of multigenic chromosomes is more appropriate to evolve solutions to complex problems, for they permit the modular construction of complex, hierarchical structures, where each gene codes for a smaller and simpler building block. These smaller building blocks are separated from one another and are therefore free to evolve independently, allowing for the creation of different new units that might prove handy in a new situation.

GEP with Random Numerical Constants

The implementation of a system for handling random numerical constants (RNCs) in Gene Expression Programming is another example of how easy it is to create and explore higher levels of complexity in genotype/phenotype systems and, indeed, RNCs are elegantly and efficiently implemented in GEP.

This elegant construct involves an extra terminal “?” that is used for representing the RNCs and an additional domain Dc at the end of GEP genes. Structurally, the Dc comes after the tail, has a length t equal to the length of the tail, and is composed of the symbols used to represent the random constants. Therefore, another region (besides the head and the tail) with defined boundaries and its own alphabet is created in the gene.

Consider the single-gene chromosome with a head size h = 7 (the Dc is shown in blue):

01234567890123456789012
+?*+?**aaa??aaa68083295

where the terminal “?” represents the random constants. The expression of this kind of gene is exactly done as explained above for genes with just the head/tail construct, giving:

Expressing a gene with random numerical constants as an expression tree

Then the ?’s in the expression tree are replaced from left to right and from top to bottom by the symbols (numerals, for simplicity) in Dc, obtaining:

The values corresponding to these symbols are kept in an array. For simplicity, the number represented by the numeral indicates the order in the array. For instance, for the 10 elements zero-based array:

A = {0.611, 1.184, 2.449, 2.98, 0.496, 2.286, 0.93, 2.305, 2.737, 0.755}

the chromosome above gives:

Obviously, the GEP-RNC algorithm can also be used to evolve highly sophisticated programs composed of multiple sub-programs through the use of multigenic chromosomes. Obviously in this case the genes of the multigenic GEP-RNC system carry the extra Dc domain for encoding the random numerical constants.

See Also:

Karva Notation: The Native Code of GeneXproTools
Core Algorithms

Related Tutorials:

What is GEP?

Related Videos:

What's New in GeneXproTools 5.0?

References:

Ferreira, C., 2006. Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. 2nd Edition, Springer-Verlag, Germany.

Last modified: May 1, 2013

Cite this as:

Ferreira, C. "Gene Expression Programming" From GeneXproTools Documentation – A Gepsoft Web Resource.
http://www.gepsoft.com/GeneXproTools/GeneExpressionProgramming.htm

Time Limited Trial

Try GeneXproTools for free for 30 days!

Released February 19, 2014

Last update: 5.0.3883

New Entries

Add-ons − GeneXproServer

Subscribe to the GEP-List