Correlation Coefficient

Choosing the Fitness Function

GeneXproTools 4.0 implements the Correlation Coefficient fitness function both with and without parsimony pressure. The version with parsimony pressure puts a little pressure on the size of the evolving solutions, allowing the discovery of more compact models.

For all classification problems, in order to be able to apply a particular fitness function, the learning algorithms of GeneXproTools 4.0 must convert the value returned by the evolved model into “1” or “0” using the 0/1 Rounding Threshold. If the value returned by the evolved model is equal to or greater than the rounding threshold, then the record is classified as “1”; otherwise it is classified as “0”.

Thus, the 0/1 Rounding Threshold is an integral part of all fitness functions used for classification and must be appropriately set in the Settings Panel -> Fitness Function Tab.
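The thresholding rule described above can be sketched in a few lines of Python. This is only an illustration, not GeneXproTools code: the function name `classify` and the default threshold of 0.5 are assumptions chosen for the example, since the actual threshold is whatever is set in the Settings Panel.

```python
def classify(model_output, rounding_threshold=0.5):
    """Map a raw model output to class 1 or 0.

    The default threshold of 0.5 is an assumption made for this sketch;
    the real value is whatever the user sets as the 0/1 Rounding Threshold.
    """
    # Equal to or greater than the threshold -> "1", otherwise "0".
    return 1 if model_output >= rounding_threshold else 0


raw_outputs = [-0.3, 0.49, 0.5, 2.7]
predicted_classes = [classify(value) for value in raw_outputs]
print(predicted_classes)  # [0, 0, 1, 1]
```

Note that an output exactly equal to the threshold is classified as “1”, which follows from the "equal to or greater than" wording.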

The Correlation Coefficient fitness function of GeneXproTools 4.0 is, as expected, based on the standard correlation coefficient, which is a dimensionless index that ranges from -1 to 1 and reflects the extent of a linear relationship between the predicted values and the target values.

The correlation coefficient C_i of an individual program i is evaluated by the equation:

$$C_i = \frac{\mathrm{Cov}(T,P)}{s_t \cdot s_p}$$

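where Cov(T,P) is the covariance of the target and model outputs, and s_t and s_p are the corresponding standard deviations, which are given by:

$$\mathrm{Cov}(T,P) = \frac{1}{n} \sum_{j=1}^{n} (T_j - \bar{T})(P_{ij} - \bar{P}_i)$$

$$s_t = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (T_j - \bar{T})^2}$$

$$s_p = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (P_{ij} - \bar{P}_i)^2}$$

where P_(ij) is the value predicted by the individual program i for fitness case j (out of n fitness cases or sample cases); T_j is the target value for fitness case j; and $\bar{T}$ and $\bar{P}_i$ are the mean target and mean predicted values, given by the formulas:

$$\bar{T} = \frac{1}{n} \sum_{j=1}^{n} T_j$$

$$\bar{P}_i = \frac{1}{n} \sum_{j=1}^{n} P_{ij}$$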

The correlation coefficient is confined to the range [-1, 1]. When C_i = 1, there is a perfect positive linear correlation between T and P, that is, P increases in exact linear proportion to T. When C_i = -1, there is a perfect negative linear correlation between T and P, that is, they vary in opposite ways (when T increases, P decreases in exact linear proportion). When C_i = 0, there is no linear correlation between T and P. Intermediate values describe partial correlations, and the closer C_i is to 1 or -1, the better the model.
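As a rough illustration of how C_i could be computed from a list of target values and a list of predicted values, here is a minimal Python sketch. It is not the GeneXproTools implementation. It assumes the population (1/n) form of the covariance and standard deviations; the choice of 1/n or 1/(n-1) cancels in the ratio, so C_i is the same either way. The function name `correlation_coefficient` and the decision to return 0.0 when either series is constant (where the ratio is undefined) are assumptions made for this example.

```python
import math


def correlation_coefficient(targets, predictions):
    """Correlation between target values T_j and predicted values P_ij."""
    n = len(targets)
    if n == 0 or n != len(predictions):
        raise ValueError("targets and predictions must be non-empty and equal in length")

    t_mean = sum(targets) / n
    p_mean = sum(predictions) / n

    # Covariance and standard deviations, all with the same 1/n normalization.
    cov = sum((t - t_mean) * (p - p_mean) for t, p in zip(targets, predictions)) / n
    s_t = math.sqrt(sum((t - t_mean) ** 2 for t in targets) / n)
    s_p = math.sqrt(sum((p - p_mean) ** 2 for p in predictions) / n)

    # Assumption for this sketch: a constant series carries no linear
    # information, so the undefined 0/0 case is reported as 0.0.
    if s_t == 0.0 or s_p == 0.0:
        return 0.0
    return cov / (s_t * s_p)


targets = [1.0, 2.0, 3.0, 4.0]
print(correlation_coefficient(targets, [2.0, 4.0, 6.0, 8.0]))  # close to 1
print(correlation_coefficient(targets, [8.0, 6.0, 4.0, 2.0]))  # close to -1
```

The constant-series case matters in classification, where a model that predicts the same class for every record has zero variance in its outputs.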

The fitness f_i of an individual program i is expressed by the equation:

f_i = 1000*C_i*C_i

and therefore ranges from 0 to 1000, with 1000 corresponding to the ideal.
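The fitness equation is simple enough to check by hand, and a short Python sketch makes one consequence explicit: because C_i is squared, a perfect negative correlation scores as highly as a perfect positive one. The function name `correlation_fitness` is hypothetical and used only for illustration.

```python
def correlation_fitness(c_i):
    """Fitness f_i = 1000 * C_i * C_i for a correlation coefficient C_i."""
    if not -1.0 <= c_i <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    return 1000.0 * c_i * c_i


for c_i in (1.0, -1.0, 0.5, 0.0):
    print(c_i, correlation_fitness(c_i))
# 1.0 1000.0
# -1.0 1000.0
# 0.5 250.0
# 0.0 0.0
```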

Its counterpart with parsimony pressure uses this fitness measure f_i as raw fitness rf_i and complements it with a parsimony term.

Thus, in this case, the raw maximum fitness rf_max = 1000, and the overall fitness fpp_i (that is, fitness with parsimony pressure) is evaluated by the formula:

where S_i is the size of the program, S_max and S_min represent, respectively, maximum and minimum program sizes and are evaluated by the formulas:

S_max = G (h + t)

S_min = G

where G is the number of genes, and h and t are the head and tail sizes (note that, for simplicity, the linking function was not taken into account). Thus, when rf_i = rf_max and S_i = S_min, fpp_i = fpp_max. This is highly improbable, though, because S_i = S_min means that every sub-ET consists of a single node, which can only happen for very simple functions. The value of fpp_max is evaluated by the formula:
