Getting Started with Function Finding
This tutorial covers all the steps for the creation of runs for Function Finding.
Loading Data.
Before evolving a model with GeneXproTools 4.0 you must first load the input data for the learning algorithm.
GeneXproTools 4.0 allows you to work either with databases/Excel or text
files and, for text files, accepts two different data matrix formats.
The first is the standard Samples x
Variables format where samples are in rows and variables in
columns, with the dependent variable occupying the
rightmost position. In the small example below with five
samples, PRODUCTION is the dependent variable and LABOR, MATERIAL,
and CAPITAL are the independent variables:
LABOR MATERIAL CAPITAL PRODUCTION
0.88287491 0.70249262 0.64540872 0.73044339
0.76598265 0.59360711 0.62686264 0.64924032
0.64562062 0.56965563 0.53160265 0.58497820
0.72908206 0.58913859 0.52247179 0.62160041
0.50690655 0.23498841 0.42968111 0.36493144
The second is the Gene Expression Matrix format commonly used
in DNA microarray studies where samples are in columns and
variables in rows, with the dependent variable
occupying the topmost position. For instance, in Gene Expression
Matrix format, the small dataset above corresponds to:
PRODUCTION 0.73044339 0.64924032 0.5849782 0.62160041 0.36493144
LABOR 0.88287491 0.76598265 0.64562062 0.72908206 0.50690655
MATERIAL 0.70249262 0.59360711 0.56965563 0.58913859 0.23498841
CAPITAL 0.64540872 0.62686264 0.53160265 0.52247179 0.42968111
which is not visually very appealing, but of course very handy for
datasets with a relatively small number of samples and thousands of
variables. Note, however, that for Excel files this format is not
supported and if your data is kept in this format in Excel, you must
copy it to a text file so that it can be loaded into GeneXproTools.
GeneXproTools uses the Samples x Variables format throughout and
therefore all formats are automatically converted and shown in this format.
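The relationship between the two text formats can be sketched in a few lines of JavaScript. This is only an illustration of the conversion described above, not the code GeneXproTools uses internally; the function names parseMatrix and samplesToGeneExpression are hypothetical, and the sketch assumes whitespace-separated text with a label row.

```javascript
// Parse whitespace-separated text into a matrix of string cells.
function parseMatrix(text) {
  return text
    .trim()
    .split(/\r?\n/)
    .map((line) => line.trim().split(/\s+/));
}

// Convert Samples x Variables rows (label row first, dependent variable in
// the rightmost column) into Gene Expression Matrix rows (one variable per
// row, dependent variable in the topmost row).
function samplesToGeneExpression(rows) {
  const columnCount = rows[0].length;
  const variableRows = [];
  for (let c = 0; c < columnCount; c++) {
    variableRows.push(rows.map((row) => row[c]));
  }
  // The last column holds the dependent variable; move it to the top.
  const dependent = variableRows.pop();
  return [dependent, ...variableRows];
}

const samplesText = `
LABOR MATERIAL CAPITAL PRODUCTION
0.88287491 0.70249262 0.64540872 0.73044339
0.76598265 0.59360711 0.62686264 0.64924032
`;

const geneExpressionRows = samplesToGeneExpression(parseMatrix(samplesText));
console.log(geneExpressionRows.map((row) => row.join(" ")).join("\n"));
```

Cells are kept as strings so that the values are copied unchanged from one layout to the other.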
GeneXproTools supports the standard separators (space,
tab, comma, semicolon, and pipe) and detects them automatically. The
use of labels to identify your variables is optional and
GeneXproTools also detects automatically whether they are
present or not. If you use them, however, you will be able to
generate more intelligible code where each variable is identified by
its name, by checking the Use Labels box in the Model Panel.
To Load Input Data for Modeling
- Click the File Menu and then choose New.
The New Run Wizard appears. You must give a name to your new run file (the default filename extension of
GeneXproTools 4.0 run files is .gep) and then choose Function
Finding in the Problem Category box and the kind of source file
in the Data Source Type box.
GeneXproTools 4.0 allows you to work either with Excel/databases or text
files.
- Then go to the Training Data window by clicking the Next button.
Choose the path for the training set by browsing the Open dialog
box and choose the appropriate data matrix format. Irrespective
of the data format used,
GeneXproTools shows the loaded data in the standard Samples x
Variables format, with the dependent variable occupying the
rightmost position.
- Then go to the Testing Data window by clicking the Next button.
Repeat the same steps of the previous point if you wish to use a
testing set to evaluate the predictive accuracy of your model.
- Click the Finish button to save your new run file.
The Save As dialog box appears and after choosing the directory where you want your new run file to be saved, the
GeneXproTools modeling environment appears.
Then you just have to click the Evolve button to create a model as
GeneXproTools automatically chooses, from a gallery of templates, default settings that will enable you to evolve a model immediately.

In data mining, be it performed by learning algorithms or conventional statistical methods, it really pays to take a good look at your data before embarking on a complex, usually time consuming modeling
process. It's true that evolutionary algorithms are particularly well equipped to deal with noisy data, but the better the data you feed them the better the models they produce.
GeneXproTools helps you find missing and invalid (usually nominal) values in your data sets and prompts you to fix them before they are used for modeling. But the preparation of a well-balanced data set should be done before loading the data into
GeneXproTools, and we recommend that you take particular care with the following:
- Avoid duplicated samples, as they can bias the modeling process considerably.
- Choose a well balanced data set.
- Choose a reasonable number of samples for training.
An excessively large data set will slow the modeling process unnecessarily. If you have access to huge databases, it’s good practice to use the surplus samples for testing instead. A good rule of thumb is to use about 8-10 samples for each independent variable in your training data.
- Check your data sets carefully for inaccurate values. Typographical or measurement errors generally cause outliers that can be detected by graphing one variable at a
time, a task that can be easily accomplished by GeneXproTools in
the Data Panel.
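Some of the checks in the list above can be done with a short script before the data is loaded. The sketch below is an assumption about how one might do this outside GeneXproTools; the function names are hypothetical, and the samples are assumed to be arrays of cells read from a text file.

```javascript
// Remove duplicated samples, keeping the first occurrence of each row.
function removeDuplicateSamples(samples) {
  const seen = new Set();
  return samples.filter((sample) => {
    const key = sample.join("|");
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// List the cells that are empty or cannot be read as a finite number.
function findInvalidCells(samples) {
  const invalid = [];
  samples.forEach((sample, rowIndex) => {
    sample.forEach((cell, columnIndex) => {
      const text = String(cell).trim();
      if (text === "" || !Number.isFinite(Number(text))) {
        invalid.push({ row: rowIndex, column: columnIndex, value: cell });
      }
    });
  });
  return invalid;
}

const rawSamples = [
  ["0.88", "0.70", "0.64", "0.73"],
  ["0.76", "0.59", "0.62", "0.64"],
  ["0.88", "0.70", "0.64", "0.73"], // duplicate of the first sample
  ["0.50", "n/a", "0.42", "0.36"], // contains a nominal value
];

const uniqueSamples = removeDuplicateSamples(rawSamples);
const invalidCells = findInvalidCells(uniqueSamples);
console.log(uniqueSamples.length, invalidCells);
```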
The graphical visualization tools of
GeneXproTools 4.0 make it easy to identify outliers, which may well represent errors in the data files. After loading your data into
GeneXproTools, in the Data Panel
you can visualize the distribution of values for each variable and also plot each independent variable against the dependent variable.
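If you also want a numerical check before loading the data, one common stand-in for visual inspection is the interquartile range (IQR) rule. The sketch below is not what the Data Panel does; it is a separate, conventional technique, and the 1.5 multiplier and the function names are assumptions.

```javascript
// Quantile of an already sorted array, using linear interpolation.
function quantile(sortedValues, q) {
  const position = (sortedValues.length - 1) * q;
  const lower = Math.floor(position);
  const upper = Math.ceil(position);
  const weight = position - lower;
  return sortedValues[lower] * (1 - weight) + sortedValues[upper] * weight;
}

// Return the values lying more than 1.5 * IQR outside the quartiles.
function findOutliers(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const low = q1 - 1.5 * iqr;
  const high = q3 + 1.5 * iqr;
  return values.filter((value) => value < low || value > high);
}

// One variable at a time; 7.3 stands for a typing error (0.73 intended).
const laborValues = [0.88, 0.77, 0.65, 0.73, 0.51, 7.3];
const laborOutliers = findOutliers(laborValues);
console.log(laborOutliers);
```

A flagged value is only a candidate for review; whether it is an error or a genuine extreme still has to be decided by looking at the data.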
Explore the Function Set
GeneXproTools 4.0 offers a total of 279 built-in mathematical
functions, including 186 different IF THEN ELSE rules, that can be used for designing nonlinear regression models. This wide set of mathematical functions allows the evolution of complex
and rigorous models quickly built with the most appropriate functions. You can find the description of all the 279 built-in mathematical functions available in GeneXproTools 4.0, including their representation in
knowledge base.
The Function Selection Tool of GeneXproTools 4.0 helps you select different function sets very quickly
through the combination of the Show options with the
Random/Default/Clear/Select All buttons plus the Increase/Reduce
Weight buttons in the Functions
Panel. Furthermore, GeneXproTools automatically balances your
function set with the number of independent variables in your data,
so now you just have to select the right functions for your problem and then choose their relative
proportions by adjusting their weights.
Despite the wide set of GeneXproTools built-in mathematical functions, some users sometimes want to model with different
ones.
GeneXproTools 4.0 lets you create custom-tailored functions
(Dynamic UDFs or DDFs) and evolve models with them. A note of
caution though: DDFs slow the evolutionary process considerably and should therefore be used in moderation.
By selecting the Functions Tab in the Functions Panel, you have full access to the 279 built-in mathematical functions of
GeneXproTools 4.0. Here is also the place where you can add
Dynamic UDFs to your modeling kit.
To select a function, just check the box on the left. By default, the weight of each function is 1, but you can increase the probability of a function being included in your models by increasing its weight in the Select/Weight column.
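One way to picture the effect of weights is as a pool in which each selected function appears as many times as its weight. This is only a sketch of the idea; how GeneXproTools actually samples functions, and how it balances them against the variables, is not described here, and the names below are hypothetical.

```javascript
// Build a pool in which each function appears `weight` times.
function buildWeightedPool(functionWeights) {
  const pool = [];
  for (const [name, weight] of Object.entries(functionWeights)) {
    for (let i = 0; i < weight; i++) pool.push(name);
  }
  return pool;
}

// Pick one function from the pool; `random` returns a number in [0, 1).
function pickFunction(pool, random = Math.random) {
  return pool[Math.floor(random() * pool.length)];
}

// Addition and multiplication keep the default weight of 1,
// while the square root is given a weight of 3.
const functionPool = buildWeightedPool({ "+": 1, "*": 1, sqrt: 3 });
console.log(functionPool.length, pickFunction(functionPool));
```

With these weights the square root fills three of the five slots in the pool, so under this assumption it would be drawn three times as often as addition.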
To add a DDF to your modeling kit, just click the Add button on the Dynamic UDFs frame and the
DDF Editor appears.

When you choose the arity (minimum 1, maximum 4) in the Arity box, the function header appears
in the code window. Then you just have to write the body of the function in the code editor. The code must be in JavaScript and can be
conveniently tested for compilation errors by pressing the Test button.
In the Definition box, you can write a brief description of the function for your future reference. The text you write
there will appear in the Definition column.
Dynamic UDFs are extremely powerful and interesting tools as they are treated exactly like the built-in functions of
GeneXproTools and therefore can be used to model all kinds of relationships between variables or complex expressions. For instance, you can design a DDF so that it will model the sum of four expressions, that is, DDF = (expression 1) + (expression 2) + (expression 3) + (expression 4), where the value of each expression will depend on the context of the DDF in the expression tree. A note of caution, though: although extremely interesting, DDFs decrease the speed of the algorithm considerably, and we therefore advise you to choose your functions from the wide set of
GeneXproTools 4.0 built-in functions.
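The sum-of-four-expressions DDF mentioned above could look roughly like the sketch below. The function name and parameter names are hypothetical: in the DDF Editor the header is generated for you once you choose an arity of 4, and you only write the body.

```javascript
// Hypothetical arity-4 DDF. Each argument receives the value of one
// sub-expression placed below the DDF in the expression tree.
function sumOfFour(x0, x1, x2, x3) {
  return x0 + x1 + x2 + x3;
}

// Quick check outside the editor with plain numbers.
console.log(sumOfFour(1, 2, 3, 4));
```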
Explore all the available architectures for building mathematical models.
The chromosome architecture of your models includes the
head size, the number of genes and the linking
function. You choose these parameters in the Settings Panel -> General
Settings Tab.
The Head Size determines the complexity of each term in your model. In the
heads of genes, the GeneXproTools
learning algorithm tries out different arrangements of functions and
terminals (variables and constants) in order to model your data. The plasticity of this architecture allows the discovery of
a virtually infinite number of models of different sizes and shapes which are afterwards tested and selected during the learning process.
The heads of genes are shown in blue in the compact, linear
representation of your models in the Model Panel.
More specifically, the head size h of each gene determines the maximum width
w and maximum depth d of the sub-expression trees
encoded in the gene, which are given by the formulas:
w = (n - 1) * h + 1
d = ((h + 1) / m) * ((m + 1) / 2)
where m is the minimum arity and n is the maximum arity.
Thus, the GeneXproTools learning algorithm selects its models between these extreme cases, fine-tuning the ideal size and shape during the evolutionary process without human intervention.
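To get a feel for these limits, the two formulas can be evaluated for a few head sizes. The sketch below simply transcribes them; the function names and the sample values of h, m and n are chosen for illustration only.

```javascript
// Maximum width of a sub-expression tree: w = (n - 1) * h + 1
function maxWidth(headSize, maxArity) {
  return (maxArity - 1) * headSize + 1;
}

// Maximum depth of a sub-expression tree: d = ((h + 1) / m) * ((m + 1) / 2)
function maxDepth(headSize, minArity) {
  return ((headSize + 1) / minArity) * ((minArity + 1) / 2);
}

// Example: head size 8 with functions of arity 1 and 2 (m = 1, n = 2).
const exampleWidth = maxWidth(8, 2);
const exampleDepth = maxDepth(8, 1);
console.log(exampleWidth, exampleDepth);
```

For h = 8, m = 1 and n = 2, both formulas give 9.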
The number of genes per chromosome is also an important parameter. It will determine the number of (complex) terms in your model as each gene codes for a different
parse tree (sub-expression tree or sub-ET). Theoretically, one could just use a huge single gene in order to evolve very complex models. But the partition of the chromosome into simpler, more manageable units gives an edge to the learning process and more efficient and elegant models can be discovered using
multigenic chromosomes.
Whenever the number of genes is greater than one, you must also choose a suitable
linking function for linking the mathematical terms encoded in each gene.
GeneXproTools 4.0 allows you to choose
addition, subtraction, multiplication, or division to link the sub-ETs. As expected, addition (and obviously
subtraction) works very well for virtually all problems, but sometimes one of the other linkers can be useful for searching different solution spaces and finding a very good, albeit unexpected, model.
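The role of the linking function can be sketched as follows: each gene yields one value for a given sample, and the linker combines those values into the model output. The left-to-right order of combination and the names used here are assumptions made for this illustration.

```javascript
// The four linking functions available for Function Finding.
const linkers = {
  addition: (a, b) => a + b,
  subtraction: (a, b) => a - b,
  multiplication: (a, b) => a * b,
  division: (a, b) => a / b,
};

// Combine the values of the sub-ETs from left to right with one linker.
function linkSubExpressions(subExpressionValues, linkerName) {
  const link = linkers[linkerName];
  if (!link) throw new Error("Unknown linking function: " + linkerName);
  return subExpressionValues.reduce(link);
}

// Three genes evaluated on one sample.
const geneValues = [2, 3, 4];
console.log(linkSubExpressions(geneValues, "addition"));
console.log(linkSubExpressions(geneValues, "multiplication"));
```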
Explore the best fitness functions.
For Function Finding problems, in the Fitness Function Tab of the Settings Panel you have access to 36 built-in fitness
functions. Additionally, you can also design your own custom fitness function and explore the solution space with it.
By choosing Custom in the Fitness Function box, the custom fitness
editor is activated.
You write the code of your custom fitness function in the Custom Fitness Function window. The code must be in JavaScript and can be tested before evolving a model with
it by pressing the Test button.
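As an illustration of what a custom fitness function might compute, the sketch below scores a model by its mean squared error and maps the error to a fitness between 0 and 1000, with 1000 for a perfect fit. The function signature, the parameter names and the scaling are assumptions; the header and the inputs expected by the custom fitness editor may differ.

```javascript
// Hypothetical fitness function: higher is better, 1000 is a perfect fit.
// `targets` holds the observed values of the dependent variable and
// `predictions` the values produced by the model for the same samples.
function mseFitness(targets, predictions) {
  let squaredErrorSum = 0;
  for (let i = 0; i < targets.length; i++) {
    const error = predictions[i] - targets[i];
    squaredErrorSum += error * error;
  }
  const meanSquaredError = squaredErrorSum / targets.length;
  return 1000 / (1 + meanSquaredError);
}

console.log(mseFitness([1, 2, 3], [1, 2, 3]));
console.log(mseFitness([1, 2], [2, 3]));
```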
The 36 built-in fitness functions of GeneXproTools 4.0 for Function Finding:
The fitness function you choose will most probably depend on the statistical measure you are most familiar with. There is nothing wrong with this, as all of them can drive an efficient evolution, but you might want to try different fitness functions because they travel the fitness landscape differently: some pursue their goal very directly, while others take less obvious paths.
Explore all the learning algorithms.
GeneXproTools 4.0 uses two different learning algorithms for
Function Finding problems. The first – the basic gene expression algorithm
or simply Gene Expression Programming (GEP) – does not support the direct manipulation of random numerical constants,
whereas the second – GEP with Random Numerical Constants or GEP-RNC
for short – has a facility for handling them directly. So, these
two algorithms search the solution landscape differently and
therefore you might wish to try them both on your problems.
The kinds of models these algorithms produce are quite different
and, when both of them perform equally well on the problem at hand,
you still might prefer one to the other. There are cases, however,
where numerical constants are crucial for efficient modeling and,
therefore, the second algorithm is the default in
GeneXproTools 4.0. You activate this algorithm in the Settings Panel -> Numerical Constants by checking the Use Random Numerical Constants box.

The GEP-RNC algorithm is slightly more complex than GEP
as it uses an additional gene domain (Dc) for encoding the random
numerical constants. Consequently, this algorithm comes equipped
with an additional set of genetic operators (RNC mutation, Dc mutation, Dc inversion, and Dc IS transposition) especially developed for handling random
numerical constants (if you are not familiar with these operators,
please use the default values by clicking the Default button for
they work very well in all cases).
Last but not least, since these parameters are crucial when you handle numerical constants directly, you must also choose and adjust the range and type of numerical constants that will be used by the
GEP-RNC algorithm during the learning process. As for the
Number of Constants per Gene parameter, a good rule of thumb is to use a small set of 10 different constants per gene, as this seems to provide enough diversity for most problems without inflating the structural complexity much.
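The effect of the range, type and number-of-constants settings can be pictured with a small generator. This is a sketch under assumptions: the function name, the integerOnly option and the bounds of -10 and 10 are hypothetical, and the way GEP-RNC actually creates and mutates its constants is not shown.

```javascript
// Draw `count` random constants in the range [min, max].
// With integerOnly set, the constants are whole numbers in that range.
function randomConstants(count, min, max, integerOnly = false, random = Math.random) {
  return Array.from({ length: count }, () => {
    if (integerOnly) {
      return Math.floor(min + random() * (max - min + 1));
    }
    return min + random() * (max - min);
  });
}

// Ten constants per gene, following the rule of thumb above.
const realConstants = randomConstants(10, -10, 10);
const integerConstants = randomConstants(10, -10, 10, true);
console.log(realConstants, integerConstants);
```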
Explore essential evolutionary strategies for efficient learning.
Predicting unknown behavior efficiently is of course the foremost goal in modeling. But extracting knowledge from the blindly designed models is becoming more and more crucial as this knowledge can be used not only to enlighten further the modeling process but also to understand the complex relationships between variables.
So, the evolutionary strategies we recommend in the GeneXproTools templates
for Function Finding reflect these two main concerns: efficiency and simplicity. Basically, we recommend starting the modeling process with the GEP-RNC algorithm
and a function set well adjusted to the complexity of the problem.
GeneXproTools 4.0 chooses the appropriate template for your problem according
to the number of variables in your data. This kind of template is a good starting
point that allows you to start the modeling process immediately with just a mouse click. Indeed, even if you are not familiar with evolutionary computation in general and
Gene Expression Programming in particular, you will be able to design complex nonlinear models immediately thanks to the templates of
GeneXproTools 4.0. In these templates, all the adjustable parameters of the learning algorithm are already set and, for instance, you don’t have to know how to create genetic diversity, how to set the appropriate population size, the chromosome architecture, the fitness function,
how to increase the complexity of your models, and so forth. Then, as you learn more about
GeneXproTools, you will be able to explore all its modeling tools and create very good models quickly and efficiently, models that will allow you to understand your data like never before.
There is, however, a very important setting in GeneXproTools that is not controlled by
GeneXproTools 4.0 templates and must be wisely chosen by you: the
number of training samples. Theoretically, if your data is well balanced and in good condition, evolutionarily
speaking, the more samples the better. But there's a catch, obviously: the larger the training set, the
slower the evolution or, in other words, the more time will be needed for generations to go by. So, you must compromise here and choose a training set of the appropriate size. A good rule of thumb is to choose 8-10 training samples for each independent variable in your data; all the remaining samples can be used for testing the generalizing capability of the evolved models.
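The rule of thumb above translates into a simple split. The sketch below takes the first samples for training and leaves the rest for testing; the function name and the default of 10 samples per variable are assumptions, and in practice you would want to make sure that both sets remain well balanced, for instance by shuffling the samples first.

```javascript
// Split samples into training and testing sets using the rule of thumb of
// about 8-10 training samples per independent variable.
function splitSamples(samples, independentVariableCount, samplesPerVariable = 10) {
  const trainingSize = Math.min(
    samples.length,
    independentVariableCount * samplesPerVariable
  );
  return {
    training: samples.slice(0, trainingSize),
    testing: samples.slice(trainingSize),
  };
}

// 50 samples with 3 independent variables (for example LABOR, MATERIAL, CAPITAL).
const allSamples = Array.from({ length: 50 }, (_, i) => [i, i + 1, i + 2, i + 3]);
const split = splitSamples(allSamples, 3);
console.log(split.training.length, split.testing.length);
```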
So, after creating a new run you just have to click the Evolve button in the
Run Panel in order to design a model. Then observe the evolutionary process carefully, especially the
curve fitting plot. Whenever you see fit, you can stop the run without fear of stopping the evolutionary process prematurely, as
GeneXproTools 4.0 allows you to continue the evolutionary process at a later time by using the best-of-run model as the starting
point (evolve with seed). For that you just have to click the Optimize button in the
Run Panel.

This strategy has enormous advantages as you might choose to stop the run at any time and then take a closer look at the evolved model. For instance, you can analyze its mathematical representation, its performance in the testing set,
evaluate a wide set of statistical functions for a quick and rigorous assessment of its accuracy,
see how it performs on a different testing set, and so on. Then you might choose to adjust a few parameters, say, choose a different fitness function, expand the function set, add a neutral gene,
apply parsimony pressure, change the training set for model
refreshing, and so forth, and then explore this new evolutionary path. You can repeat this process for as long as you want or until you are completely satisfied with the evolved model.
In addition, if you wish for GeneXproTools 4.0 to increase the complexity of
your models automatically, you just have to activate the Complexity Increase Engine
and then click the Evolve button, and GeneXproTools will evolve better and better models composed of an increasing number of terms until no increase in best fitness takes place for a certain period of time.
Last modified:
May 26, 2008