Gepsoft - What Is GEP?

Home

About Us

Contact

Blog


What's New	Products	Buy Now	Downloads	Forum	What Is GEP?

GeneXproTools

GeneXproServer

Gene Expression Programming

What Is GEP?

Introduction
The Architecture of GEP Programs
Multigenic Chromosomes and Linking Functions
GEP with Random Numerical Constants
Higher Levels of Complexity

Introduction

Gene Expression Programming (GEP) is the learning algorithm behind GeneXproTools and what it learns specifically is about relationships between variables in sets of data and then builds models to explain these relationships.

How learning algorithms build models or discover solutions to problems varies, with some simulating networks of neurons and others simulating evolution through natural selection. Gene Expression Programming belongs to the latter group, the so called Evolutionary Algorithms. And like all evolutionary algorithms, natural or otherwise, GEP uses populations of individuals (in this case, populations of models or solutions), selects and reproduces them according to fitness, and introduces genetic variation using one or more genetic operators such as mutation or recombination. And these are obviously the prerequisites for evolution to occur.

In GeneXproTools we are obviously interested in designing mathematical models, which means that GeneXproTools works with populations of models and selects them according to their respective fitnesses. Then it reproduces the selected models, doing so with a certain degree of genetic variation, thus creating the next generation of new models. By repeating this process for a certain number of generations, one is bound to get evolution, which in this case means better and better solutions to the problem at hand.

So these are the general principles of all evolutionary algorithms and, in fact, there are several artificial evolutionary algorithms that explore them. Of special interest to the GEP technique are the Genetic Algorithms (GAs) and Genetic Programming (GP), for they serve to illustrate some of the fundamental characteristics of the GEP technique and why GEP surpasses the old GP technique by a factor of 100-60,000.

First of all, both GAs and GP are simple replicator systems, with the latter considerably more complex than the former. This means that the things or entities these systems evolve go about their business doing what has to be done and when their time comes, they must reproduce their bodies (hopefully with some variation) to perpetuate their line. But reproducing bodies with modification might not be an easy task, especially if the bodies are complex structures like the parse trees that GP evolves. Indeed, the canonical GP technique is limited to the use of very conservative operators (almost exclusively a tree-based crossover) that change the trees in a way reminiscent of the grafting and pruning familiar to gardeners and farmers around the world. And when we compare the achievements of grafting and pruning (and I'm not saying they are not considerable) with the achievements of the DNA/protein system of life on Earth (the most sophisticated genotype/phenotype system known to science), it becomes clear why genotype/phenotype systems are much more powerful than simple replicator systems: firstly, in genotype/phenotype systems there are no restrictions concerning the type and number of modifications in the genome and therefore all the paths and crannies of the fitness landscape are accessible to meander through, and secondly, these systems are free to explore different levels of organization (for instance, genes are expressed into proteins; proteins form ribosomes, microtubules, membranes, etc.; these macromolecules and structures then form cells; cells get organized into tissues; tissues into organs and so forth, culminating with the organism itself). Obviously these higher levels of complexity are completely inaccessible to simple replicator systems, for no matter how complex, they continue to be a single structure, forever incapable of becoming a part of a much more complex being.

The GEP system is a full-fledged genotype/phenotype system with expression trees of different sizes and shapes encoded in linear chromosomes of fixed length. Also important is that GEP chromosomes are multigenic, encoding multiple expression trees or sub-programs that can be organized into a much more complex program. So, like the DNA/protein system of life on Earth, the genes/trees system of GEP can not only explore all the crannies and paths of the solution space but it's also free to explore higher levels of organization. But how is this achieved? In order to answer this question we need to take a closer look at the structures of the main players of GEP – the chromosomes and the expression trees – and see how they work together.

The Architecture of GEP Programs

So, there are two main players in GEP: the chromosomes and the expression trees (ETs), being the latter the expression of the genetic information encoded in the former. As in nature, the process of information decoding is called translation. And this translation implies obviously a kind of code and a set of rules. The genetic code of GEP is very simple: a one-to-one relationship between the symbols of the genes and the nodes they represent in the trees. The rules are also very simple: they determine the spatial organization of nodes in the expression trees and the type of interaction between sub-ETs. Therefore, there are two languages in GEP: the language of genes and the language of expression trees and, thanks to the simple rules that determine the structure of ETs and their interactions, it is possible to infer immediately the expression tree given the sequence of a gene, and vice versa. This means that we can choose to have a very complex program represented by its compact genome without losing in meaning. This unequivocal bilingual notation is called Karva language. Its details are explained in the remainder of this tutorial.

The structural organization of GEP genes is better understood in terms of open reading frames (ORFs). In biology, an ORF or coding sequence of a gene begins with the start codon, continues with the amino acid codons, and ends at a termination codon. However, a gene is more than the respective ORF, with sequences upstream of the start codon and sequences downstream of the stop codon. And although in GEP the start site is always the first position of a gene, the termination point does not always coincide with the last position of a gene. Consequently, it is common for GEP genes to have noncoding regions downstream of the termination point. These noncoding regions obviously do not interfere with expression but, nonetheless, they play a crucial role in evolution, for they alone allow the creation of valid programs no matter how profoundly their chromosomes are modified (we’ll get back to this later).

Consider, for example, the algebraic expression:

(1)

It can also be represented as a diagram or an expression tree:

where “Q” represents the square root function.

This kind of diagram representation is what is called the phenotype in Gene Expression Programming. And the genotype can be easily inferred from the phenotype as follows:

01234567
Q*-+abcd	(2)

which is the straightforward reading of the expression tree from left to right and from top to bottom (exactly as we read a page of text). The expression (2) is an ORF, starting at “Q” (position 0) and terminating at “d” (position 7). These ORFs were named K-expressions from Karva notation.

Consider another ORF, the following K-expression:

01234567890
Qb*+baQba	(3)

Its expression as an ET is also very simple and straightforward. In order to express the ORF correctly, we must follow the rules governing the spatial distribution of functions and terminals. First, the start of a gene corresponds to the root of the ET which is placed in the topmost line. Second, in the next line, below each function, are placed as many branch nodes as there are arguments to that function. Third, from left to right, the nodes are filled consecutively with the next elements of the K-expression. Fourth, the process is repeated until a line containing only terminals is formed. In this case, the following expression tree is obtained:

Just by looking at the structure of GEP K-expressions, it is difficult or even impossible to see the advantages of such a representation, except perhaps for its simplicity and elegance. However, when K-expressions are analyzed in the context of a gene, the advantages of this representation become obvious. As previously stated, GEP chromosomes have fixed length, and they are composed of one or more genes of equal length. Consequently, the length of a gene is also fixed. Thus, in GEP, what varies is not the length of genes which is constant, but the length of the K-expressions. Indeed, the length of a K-expression may be equal to or less than the length of the gene. In the former case, the termination point coincides with the end of the gene, and in the latter, the termination point is somewhere upstream of the end of the gene.

What is the function of these noncoding regions of GEP genes? We will see that they are the essence of Gene Expression Programming and evolvability, for they allow the modification of the genome using several genetic operators without restrictions, always producing syntactically correct programs. Thus, in GEP, the fundamental property of genotype/phenotype systems – syntactic closure – is intrinsic, allowing the totally unconstrained restructuring of the genotype and, consequently, an efficient evolution.

Let’s analyze then the structural organization of GEP genes in order to understand how they invariably code for syntactically correct programs and why they allow an unconstrained application of virtually any genetic operator.

The genes of Gene Expression Programming are composed of a head and a tail. The head contains symbols that represent both functions and terminals, whereas the tail contains only terminals. For each problem, the length of the head h is chosen, whereas the length of the tail t is a function of h and the number of arguments n of the function with more arguments (also called maximum arity) and is evaluated by the equation:

t = h (n-1) + 1

Consider a gene for which the set of functions F = {Q, *, /, -, +} and the set of terminals T = {a, b}. In this case n = 2; if we chose an h = 15, then t = 15 (2 - 1) + 1 = 16; thus, the length of the gene g is 15 + 16 = 31. One such gene is shown below (the head is shown in blue):

0123456789012345678901234567890
*b+a-aQab+//+b+babbabbbababbaaa	(4)

It codes for the following ET:

In this case, the K-expression ends at position 7, whereas the gene ends at position 30.

Suppose now a mutation occurred at position 6, changing the “Q” into “*”. Then the following gene is obtained:

0123456789012345678901234567890
b+a-aab+//+b+babbabbbababbaaa	(5)

And its expression gives:

In this case, the termination point shifts one position to the right (position 8), changing slightly the daughter tree.

Consider another mutation in chromosome (4) above, the substitution of “a” at position 5 by “+”, resulting in the following chromosome:

0123456789012345678901234567890
*b+a-+Qab+//+b+babbabbbababbaaa	(6)

And its expression gives:

In this case, the termination point shifts 12 positions to the right (position 19), enlarging and changing significantly the daughter tree.

Obviously the opposite might also happen, and the daughter tree might shrink. For example, consider again gene (4) above, and suppose a mutation occurred at position 2, changing the “+” into “Q”, giving:

0123456789012345678901234567890
*bQa-aQab+//+b+babbabbbababbaaa	(7)

And now its expression results in the following ET:

In this case, the ORF ends at position 3, shortening the original ET in four nodes.

So, despite their fixed length, each gene has the potential to code for expression trees of different sizes and shapes, where the simplest is composed of only one node (when the first element of a gene is a terminal) and the largest is composed of as many nodes as the length of the gene (when all the elements of the head are functions with maximum arity).

It is evident from the examples above, that any modification made in the genome, no matter how profound, always results in a structurally correct program. Now this is unique to GEP and is obviously at the heart of its superior performance: for instance, in the simple replicator system of GP, most modifications result in syntactically invalid programs (imagine, for instance, what would happen if a terminal in a GP tree is replaced by a function), which is why most GP implementations rely exclusively on inefficient tree-specific crossover to create genetic diversity.

Multigenic Chromosomes and Linking Functions

Genotype/phenotype systems are such well-oiled machines that their expansion into more complex systems is rather easy, and the introduction of multiple genes in Gene Expression Programming clearly illustrates this.

So, the chromosomes of Gene Expression Programming are usually composed of more than one gene of equal length. For each problem or run, the number of genes, as well as the size of the head, are a priori chosen. Each gene codes for a sub-ET and the sub-ETs interact with one another forming a more complex multi-subunit expression tree.

Consider, for example, the following chromosome with length 39, composed of three genes, each with length 13 (the heads are shown in blue):

012345678901201234567890120123456789012
QaQ+-Qbbaaaba+Q+ab+abababa*-**b+aabbaba

It has obviously three K-expressions and each one of them is expressed independently, resulting therefore in three different sub-ETs:

For the sake of simplicity, in the linear representation the start of each K-expression is always given by position 0; the end of each K-expression, though, is only evident upon construction of the corresponding sub-ET. As shown in the Figure above, the first K-expression ends at position 1; the second at position 7; and the last at position 10. Thus, the multigenic chromosomes of GEP contain multiple K-expressions of different sizes, each one of them coding for a structurally and functionally unique sub-ET.

Obviously, these sub-ETs or sub-programs must interact with one another. And in GeneXproTools they interact through special functions, the so called linking functions: addition, subtraction, multiplication, and division for mathematical models and And, Or, Nand, Nor, Xor, Nxor, Less Than, Greater Than, Less Or Equal, and Greater Or Equal for logical expressions.

The linking by addition of all the three sub-ETs shown in the Figure above is illustrated below (the linking functions are shown in gray):

Note that the final program represented in the Figure above could be linearly encoded as the following K-expression:

01234567890123456789012
++*Q+-*aQ+*b+aab+abbaab

However, the use of multigenic chromosomes is more appropriate to evolve solutions to complex problems, for they permit the modular construction of complex, hierarchical structures, where each gene codes for a smaller and simpler building block. These smaller building blocks are separated from one another and are therefore free to evolve independently, allowing for the creation of concrete new units that might prove handy in a new situation.

GEP with Random Numerical Constants

The implementation of a facility for handling random numerical constants (RNCs) in Gene Expression Programming is another example of how easy it is to create and explore higher levels of complexity in genotype/phenotype systems and, indeed, RNCs are elegantly and efficiently implemented in GEP.

This elegant construct involves an extra terminal “?” that is used for representing the RNCs and an additional domain Dc at the end of GEP genes. Structurally, the Dc comes after the tail, has a length t equal to the length of the tail, and is composed of the symbols used to represent the random constants. Therefore, another region (besides the head and the tail) with defined boundaries and its own alphabet is created in the gene.

Consider the single-gene chromosome with a head size h = 7 (the Dc is shown in blue):

01234567890123456789012
+?*+?**aaa??aaa68083295

where the terminal “?” represents the random constants. The expression of this kind of gene is exactly done as explained above for genes with just the head/tail construct, giving:

Then the ?’s in the expression tree are replaced from left to right and from top to bottom by the symbols (numerals, for simplicity) in Dc, obtaining:

The values corresponding to these symbols are kept in an array. For simplicity, the number represented by the numeral indicates the order in the array. For instance, for the 10 elements zero-based array:

A = {0.611, 1.184, 2.449, 2.98, 0.496, 2.286, 0.93, 2.305, 2.737, 0.755}

the chromosome above gives:

Obviously, the GEP-RNC algorithm can also be used to evolve highly sophisticated programs composed of multiple sub-programs through the use of multigenic chromosomes. Obviously in this case the genes of the multigenic GEP-RNC system carry the extra Dc domain for encoding the random numerical constants.

Higher Levels of Complexity

When the mapping is perfect, genotype/phenotype systems are extremely efficient, and Gene Expression Programming is the only GP-style algorithm with such a mapping. And this means obviously that these full-fledged systems can grow: not à la GP which can only bloat, but rather grow in order to explore higher levels of complexity. And we've already seen here different levels of organization: unigenic systems, both with and without a Dc domain for handling numerical constants, and multigenic systems also with and without Dc domains. But there are many more GEP systems: unicellular and multicellular systems for evolving automatically defined functions; systems for parameter optimization that explore both the multigenic organization and the Dc domain; systems for inducing polynomials, neural networks, and decision trees, which also explore the concerted action of different gene domains for fine-tuning all kinds of numerical constants: polynomials' coefficients, the weights and thresholds of neural networks, and the numerical constants of decision trees with mixed/numeric attributes. You can learn all about these systems in Ferreira's book Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence (Ferreira 2006).

References

Ferreira, C., 2006. Gene Expression Programming: Mathematical Modeling by an Artificial Intelligence. 2nd Edition, Springer-Verlag, Germany.

Last modified: September 1, 2007

Cite this as:

Ferreira, C. "What Is GEP?" From GeneXproTools Tutorials – A Gepsoft Web Resource.
https://www.gepsoft.com/tutorial002.htm

Time Limited Trial

Try GeneXproTools for free for 30 days!

Academic Editions

Academic licenses of all GeneXproTools editions are available at a discount for education institutions & students.

Software Bundles

Bundles of GeneXproTools & GeneXproServer are available at a discount price for all editions of GeneXproTools, including academic editions.

Subscribe to the GEP-List

"Finally, a world class user interface in the field of genetic programming and evolutionary computation !! GeneXproTools is simply unrivaled in its marvelous user interface, the breadth of its Fitness Functions, the choice and flexibility in Math and Logic functions, the clarity of its final Model Presentation, and a built in panel for Scoring new data, right inside the interface. This kind of functionality and ease of use has never been seen before in the field of Genetic Programming. Additionally, Dr Ferreira's specific methodology of Gene Expression Programming makes important contributions to the field of evolutionary computation, and the various algorithms she has developed and deployed inside of GeneXproTools are brilliantly conceived, and her methodologies evolve highly predictive models that solve real business problems. GeneXproTools is an extraordinary structural tour de force."

Brian C. Watt, CRM
Chief Risk Officer / Chief Financial Officer
GECC Inc, USA

"I have been using GeneXproTools against a variety of drug research related problems and have found the GUI to be readily-usable and well-attuned to the stages of predictive modeling..."

Steven J Barrett, Ph.D
Principal Scientist
New Applications Team
Data Exploration Sciences GlaxoSmithKline

"Gene Expression Programming, combined with GeneXproTools, allow us here at Mercator GeoSystems to explore new and exciting methods for spatially modelling the relationship between a company's outlets and their customers. The GeneXproTools software is simple to use, well-designed and very flexible. In particular the ability to load training data from a database, and the option to create models in the programming language of our choice, really make this product stand out. Product support is excellent and very responsive - heartily recommended!"

Steve Hall
Mercator GeoSystems Ltd
United Kingdom

"As a professional software developer, I could have attempted to read up on all the latest developments in the field of evolutionary programming and start writing my own modeling tools. One look at the GeneXproTools demo, however, was enough to convince me of the absurdity of that thought. Not only does GeneXproTools have all the power that I would ever need, but it also allows me to customize all parts of the modeling process. I don't have to know the first thing about evolutionary algorithms and yet I can write my own grammars or fitness functions if I wanted to. It is obvious that a huge amount of work went into the making of GeneXproTools, and I am now a very happy customer. Keep up the great work, Gepsoft!"

Glenn Lewis
Software developer, USA

"I've been working as a coastal engineer and mathematical modeler for more than 10 years and now I'm using GeneXproTools to discover complex nonlinear relations that exist in hydraulic and wave processes. For example, GeneXproTools recently helped me establish several explicit approximations to the Wave Dispersion Equation and now with the new version, which allows more independent variables, fitness functions and unlimited records, I plan to develop my own formulae to evaluate the wave overtopping of breakwaters and seawalls. Thanks Gepsoft for providing such an exciting, creative and useful software tool to the scientific community."

Ricardo Carvalho
PROMAN - Centro de Estudos & Projectos, SA
Lisbon, Portugal

"GeneXproTools is being used to look at problems involving parasite populations, where the data is highly skewed. The results using GeneXproTools are considerably better than those obtained using conventional statistics."

Prof John Barrett
Head of the Parasitology Group
University of Wales, UK

"We are using GeneXproTools for modeling the rainfall-runoff process and time series forecasting. GeneXproTools has a nice graphical user interface system and a lot of flexibility in choosing the type of input file. Configuring the problem setup, running and visualizing the graphical outputs with GeneXproTools is indeed user friendly. Being able to get the final model in the languages of our choice makes GeneXproTools stand out from other packages."

Professor S. Mohan
Professor & Head of the Department
Department of Civil Engineering
Indian Institute of Technology, Madras

More