Getting Started with Function Finding

This tutorial covers the fundamental steps in the creation of nonlinear regression models in the Function Finding Platform of GeneXproTools. We’ll start with a quick hands-on introduction to get you started, followed by a more detailed overview of the fundamental tools you can explore in GeneXproTools to create very good predictive models that accurately explain your data.


Hands-on Introduction to Nonlinear Regression

Designing a good nonlinear regression model in GeneXproTools is really simple: after loading your data, GeneXproTools takes you immediately to the Run Panel where you just have to click the Start button to create a model. This is possible because GeneXproTools comes with pre-set defaults that work very well with virtually all problems. We’ll learn later how to choose some of the most basic settings so that you can explore all the advanced tools of the software, but you can in fact design highly sophisticated and accurate models with just a click.

Run Panel - Function Finding


Monitoring the Design Process

While the model is being created by the learning algorithm, you can evaluate and visualize the actual design process through the real-time monitoring of different curve fitting charts and statistics in the Run Panel.

Run Panel - Designing Nonlinear Models


Model Evaluation & Testing

Then in the Results Panel you can further evaluate your model using different charts and statistics. It’s also in the Results Panel that you can check more thoroughly how your model generalizes to unseen data by checking how well it performs in the test set.

Results Panel - Evaluating & Testing Nonlinear Regression Models


Generating the Model Code

Then in the Model Panel you can see and analyze the model code not only in the programming language of your choice but also as expression trees. GeneXproTools has 16 built-in grammars for Function Finding to generate code automatically in some of the most popular programming languages: Ada, C, C++, C#, Excel VBA, Fortran, Java, JavaScript, Matlab, Pascal, Perl, PHP, Python, Visual Basic, VB.Net, and VHDL. But you can also add your own grammar through user-defined grammars, which means that in GeneXproTools you really can generate code automatically in the programming language of your choice.

Model Panel - Model Code in C++


Making Predictions

And finally, in the Scoring Panel of GeneXproTools you can make predictions using the generated JavaScript code of your model. This means that you don’t have to know how to deploy the model code to make predictions outside GeneXproTools: you can make them immediately within the GeneXproTools environment. GeneXproTools also deploys individual models and model ensembles automatically to Excel using the generated Excel VBA code of your models. So you can also make your predictions very conveniently in Excel.

Scoring Panel - Making Predictions with a Model


Loading Data

The nonlinear regressors GeneXproTools creates are statistical in nature or, in other words, data-based. And therefore GeneXproTools needs training data from which to extract the information needed to create the models. Data is also needed for validating and testing the generalizability of the generated models.

Data Formats

Before evolving a model with GeneXproTools you must first load the input data. GeneXproTools lets you work with Excel files and databases as well as with text files.

New Run Wizard - Training Data Window


For text files GeneXproTools supports two different data matrix formats. The first is the standard Records x Variables format where records are in rows and variables in columns, with the dependent or response variable occupying the rightmost position. In the small example below with five records, PRODUCTION is the response variable and LABOR, MATERIAL, and CAPITAL are the independent or predictor variables:

LABOR	MATERIAL	CAPITAL	PRODUCTION
0.88287491	0.70249262	0.64540872	0.73044339
0.76598265	0.59360711	0.62686264	0.64924032
0.64562062	0.56965563	0.53160265	0.58497820
0.72908206	0.58913859	0.52247179	0.62160041
0.50690655	0.23498841	0.42968111	0.36493144

And the second is the Gene Expression Matrix format commonly used in DNA microarray studies, where records are in columns and variables in rows, with the dependent variable occupying the topmost position. For instance, in Gene Expression Matrix format the small dataset above corresponds to:

PRODUCTION 0.73044339 0.64924032 0.58497820 0.62160041 0.36493144
LABOR 0.88287491 0.76598265 0.64562062 0.72908206 0.50690655
MATERIAL 0.70249262 0.59360711 0.56965563 0.58913859 0.23498841
CAPITAL 0.64540872 0.62686264 0.53160265 0.52247179 0.42968111

This kind of format is the standard for datasets with a relatively small number of records and thousands of variables. Note, however, that this format is not supported for Excel files: if your data is kept in this format in Excel, you must copy it to a text file and then use this file to load your data into GeneXproTools.

GeneXproTools uses the Records x Variables format internally and therefore all kinds of input format are automatically converted and shown in this format in the Data Panel and Scoring Panel.
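The conversion from Gene Expression Matrix format to Records x Variables format can be pictured as a transpose in which the dependent variable moves from the topmost row to the rightmost column. The Python sketch below only illustrates that idea on the small dataset above; the function name gem_to_records is hypothetical and this is not how GeneXproTools itself implements the conversion.

```python
def gem_to_records(text):
    """Convert Gene Expression Matrix text to Records x Variables rows.

    Hypothetical helper for illustration; fields are assumed to be
    separated by whitespace, with the variable label first on each line.
    """
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    # The dependent variable is the topmost row: move it to the end so that
    # it becomes the rightmost column after transposing.
    rows = rows[1:] + rows[:1]
    # Transpose: each row (one variable) becomes a column.
    return [list(column) for column in zip(*rows)]


gem_text = """
PRODUCTION 0.73044339 0.64924032 0.58497820 0.62160041 0.36493144
LABOR 0.88287491 0.76598265 0.64562062 0.72908206 0.50690655
MATERIAL 0.70249262 0.59360711 0.56965563 0.58913859 0.23498841
CAPITAL 0.64540872 0.62686264 0.53160265 0.52247179 0.42968111
"""

table = gem_to_records(gem_text)
for record in table:
    print("\t".join(record))
```

The first printed line holds the labels LABOR, MATERIAL, CAPITAL, PRODUCTION, followed by one line per record, matching the Records x Variables example shown earlier.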

GeneXproTools supports the standard separators (space, tab, comma, semicolon, and pipe) and detects them automatically. The use of labels to identify your variables is optional and GeneXproTools also detects automatically whether they are present or not. However, if you use them you will be able to generate more intelligible code with each variable clearly identified by its name.
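To make the idea of automatic detection concrete, here is a minimal Python sketch of one possible heuristic: pick the first candidate separator that splits every line into the same number of fields, and treat the first row as labels if any of its fields is not numeric. The heuristic and all names in it are assumptions for illustration; the actual detection logic of GeneXproTools is not documented here.

```python
# Candidate separators named in the text: tab, comma, semicolon, pipe, space.
CANDIDATE_SEPARATORS = ["\t", ",", ";", "|", " "]


def split_fields(line, separator):
    """Split one line; runs of spaces count as a single separator."""
    if separator == " ":
        return line.split()
    return [field.strip() for field in line.split(separator)]


def detect_separator(lines):
    """Return the first candidate that splits every line into the same
    number of fields (more than one). Assumed heuristic, not the tool's."""
    for separator in CANDIDATE_SEPARATORS:
        field_counts = {len(split_fields(line, separator)) for line in lines}
        if len(field_counts) == 1 and min(field_counts) > 1:
            return separator
    raise ValueError("no consistent separator found")


def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def has_labels(first_row):
    """Treat the first row as labels if any field is not numeric."""
    return not all(is_number(field) for field in first_row)


sample = [
    "LABOR;MATERIAL;CAPITAL;PRODUCTION",
    "0.88287491;0.70249262;0.64540872;0.73044339",
    "0.76598265;0.59360711;0.62686264;0.64924032",
]
separator = detect_separator(sample)
labels_present = has_labels(split_fields(sample[0], separator))
print(repr(separator), labels_present)
```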

Run Panel - C++ Code with Labels

Loading Data Step-by-Step

To Load Input Data for Modeling

  1. Click the File Menu and then choose New.
    The New Run Wizard appears. Give your new run file a name (the default filename extension of GeneXproTools run files is .gep), then choose Function Finding in the Problem Category box and the kind of source file in the Data Source Type box. GeneXproTools lets you work with Excel files and databases as well as with text files.
  2. Then go to the Training Data window by clicking the Next button.
    Choose the path for the training set by browsing the Open dialog box and choose the appropriate data matrix format. Irrespective of the data format used, GeneXproTools shows the loaded data in the standard Records x Variables format, with the dependent variable occupying the rightmost position.
  3. Then go to the Testing Data window by clicking the Next button.
    Repeat the same steps of the previous point if you wish to use a testing set to evaluate the generalizability of your models.
  4. Click the Finish button to save your new run file.
    The Save As dialog box appears and after choosing the directory where you want your new run file to be saved, the GeneXproTools modeling environment appears. Then you just have to click the Start button to create a model as GeneXproTools automatically chooses, from a gallery of templates, default settings that will enable you to evolve a model immediately.

Data Pre-Processing

In data mining, whether performed by learning algorithms or conventional statistical methods, it really pays to take a good look at your data before embarking on a complex, usually time-consuming modeling process. It's true that evolutionary algorithms are particularly well equipped to deal with noisy data, but the better the data you feed them the better the models they produce.

GeneXproTools helps you find missing or invalid values in your datasets and prompts you to fix them before they are used for modeling. But the preparation of a well-balanced dataset should be done before loading the data into GeneXproTools, and we recommend paying particular attention to the following:

  • Choose a well balanced dataset.
  • Choose a reasonable number of records for training.
    An excessively large dataset will slow the modeling process unnecessarily. If you have access to huge datasets, it’s good practice to use the surplus records for testing instead. A good rule of thumb consists of using about 10 records for each independent variable in your training data.
  • Check your datasets carefully for errors. Typographical or measurement errors generally cause outliers that can be detected by plotting one variable at a time, a task that can be easily accomplished within GeneXproTools in the Data Panel.

The visualization tools of GeneXproTools make it easy to identify outliers, which may well represent errors in the data. After loading your data into GeneXproTools, in the Data Panel you can visualize the distribution of values for each variable and also analyze the correlation between each independent variable and the dependent variable by studying their scatter plots.
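Outside GeneXproTools, the same kind of quick check can be approximated numerically. The Python sketch below computes the Pearson correlation between each independent variable and the dependent variable for the five example records shown earlier; it is only an illustration of what the scatter plots convey, and five records are of course far too few for a meaningful analysis.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


names = ["LABOR", "MATERIAL", "CAPITAL", "PRODUCTION"]
records = [
    [0.88287491, 0.70249262, 0.64540872, 0.73044339],
    [0.76598265, 0.59360711, 0.62686264, 0.64924032],
    [0.64562062, 0.56965563, 0.53160265, 0.58497820],
    [0.72908206, 0.58913859, 0.52247179, 0.62160041],
    [0.50690655, 0.23498841, 0.42968111, 0.36493144],
]

columns = list(zip(*records))   # one tuple per variable
response = columns[-1]          # dependent variable is the rightmost column
correlations = {
    name: pearson(column, response)
    for name, column in zip(names[:-1], columns[:-1])
}
for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```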

Data Panel - Function Finding


Choosing the Function Set

GeneXproTools allows you to choose your function set from a total of 279 built-in mathematical functions and an unlimited number of custom functions, designed using the JavaScript language in the GeneXproTools environment.

Built-in Mathematical Functions

GeneXproTools offers a total of 279 built-in mathematical functions, including 186 different IF THEN ELSE rules, that can be used to design nonlinear regression models. This wide set of mathematical functions allows the evolution of complex and accurate models, easily built with the most appropriate functions. You can find the description of all 279 built-in mathematical functions, including their representation, in the Knowledge Base.

The Function Selection Tools of GeneXproTools help you select different function sets very quickly by combining the Show options with the Random/Default/Clear/Select All buttons plus the Increase/Reduce Weight buttons in the Functions Panel.

Functions Panel - Built-in Mathematical Functions


User Defined Functions

Despite the wide set of GeneXproTools built-in mathematical functions, some users sometimes want to model with different ones. GeneXproTools lets you create custom functions (called Dynamic UDFs and also DDFs in GeneXproTools) and evolve models with them. A note of caution though: custom functions slow the evolutionary process considerably and should therefore be used in moderation.

By selecting the Functions Tab in the Functions Panel, you have full access to the 279 built-in mathematical functions of GeneXproTools. This is also where you can add your custom functions (Dynamic UDFs) to your modeling kit.

Functions Panel - A Function Set with Built-in & Custom Functions

To select a function, just check the Select box on the left of the Functions Panel. By default, the weight of each function is 1, but you can increase the probability of a function being included in your models by increasing its weight in the Select/Weight column. GeneXproTools automatically balances your function set with the number of independent variables in your data, so you just have to select the set of functions for your problem and then set their relative proportions through their weights.
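One way to picture the effect of weights is as relative selection probabilities. The Python sketch below assumes that the chance of picking a function is simply proportional to its weight; that proportionality, the example function set, and the helper name are all assumptions for illustration, and the sketch ignores the automatic balancing against the number of independent variables mentioned above.

```python
import random

# Hypothetical function set: every function has the default weight 1,
# except multiplication, whose weight was increased to 2.
function_weights = {"+": 1, "-": 1, "*": 2, "/": 1, "sqrt": 1}


def selection_probabilities(weights):
    """Probability of each function, assuming it is proportional to weight."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


probabilities = selection_probabilities(function_weights)
for name, p in probabilities.items():
    print(f"{name}: {p:.3f}")

# Draw ten functions at random according to those weights.
rng = random.Random(42)
drawn = rng.choices(
    list(function_weights), weights=list(function_weights.values()), k=10
)
print(drawn)
```

With these weights, multiplication is twice as likely to be drawn as any other function in the set.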

To add a custom function to your modeling kit, just click the Add button on the Dynamic UDFs frame and the DDF Editor appears.

Custom Function Editor

By choosing the arity (minimum is 1 and maximum is 4) in the Arity box, the function header appears in the code window. Then you just have to write the body of the function in the code editor. The code must be in JavaScript and can be conveniently tested for compiling errors by pressing the Test button.

In the Definition box, you can write a brief description of the function for your future reference. The text you write there will appear in the Definition column of the Functions Panel.

Dynamic UDFs are extremely powerful and interesting tools as they are treated exactly like the built-in functions of GeneXproTools and therefore can be used to model all kinds of relationships not only between the original variables but also between derived features created on the fly by the learning algorithm. For instance, you can design a DDF so that it will model the log of the sum of four expressions, that is, DDF = log((expression 1) + (expression 2) + (expression 3) + (expression 4)), where the value of each expression will depend on the context of the DDF in the expression tree. A note of caution, though: although extremely useful, DDFs considerably decrease the speed of the algorithm, and we therefore advise you to choose your functions, whenever possible, from the wide set of GeneXproTools built-in functions.
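The logic of the log-of-sum example can be sketched as follows. Note that real DDFs must be written in JavaScript in the DDF Editor, whose function header is generated for you; this Python version only illustrates the computation, the name log_sum4 is hypothetical, and the use of the natural logarithm is an assumption.

```python
import math


def log_sum4(a, b, c, d):
    """Arity-4 function: natural log of the sum of its four arguments.

    Raises ValueError if the sum is not positive, so a real custom
    function would need to decide how to handle that case.
    """
    return math.log(a + b + c + d)


# Each argument can itself be any expression built by the learning
# algorithm; here x and y stand in for two input variables.
x, y = 2.0, 4.0
value = log_sum4(x * y, x + 1.0, y / 2.0, 3.0)  # log(8 + 3 + 2 + 3) = log(16)
print(value)
```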


Creating Derived Features/Variables

Derived variables or new features can be easily created in GeneXproTools. They are created in the Functions Panel, in the Static UDFs Tab.

Functions Panel - User Defined Functions or New Features

Historically, derived variables were called UDFs or User Defined Functions and in GeneXproTools they are represented as UDF0, UDF1, UDF2, and so on. Note however that UDFs are in fact new features derived from the original variables in the training and test datasets. Like DDFs, they are implemented in JavaScript using the JavaScript editor of GeneXproTools.

New Features Editor

These user defined features are then used by the learning algorithm exactly as the original features, that is, they are incorporated into the evolved models adaptively, with the most important being chosen and selected according to the increase in performance they impart on the models.
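As an illustration of what a derived feature is, the Python sketch below adds a hypothetical feature, the ratio of LABOR to CAPITAL, to the example records shown earlier. In GeneXproTools such features are written in JavaScript in the New Features Editor; the feature chosen here, the function name udf0, and the column placement are assumptions made only for this example.

```python
names = ["LABOR", "MATERIAL", "CAPITAL", "PRODUCTION"]
records = [
    [0.88287491, 0.70249262, 0.64540872, 0.73044339],
    [0.76598265, 0.59360711, 0.62686264, 0.64924032],
    [0.64562062, 0.56965563, 0.53160265, 0.58497820],
]


def udf0(labor, material, capital):
    """Hypothetical derived feature: labor-to-capital ratio."""
    return labor / capital


# Insert the derived feature after the original independent variables,
# keeping the dependent variable in the rightmost position.
augmented_names = names[:-1] + ["UDF0", names[-1]]
augmented = [row[:-1] + [udf0(*row[:-1]), row[-1]] for row in records]

print("\t".join(augmented_names))
for row in augmented:
    print("\t".join(f"{value:.8f}" for value in row))
```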


Choosing the Model Architecture

In GeneXproTools the evolving models are encoded in linear strings or chromosomes. And the chromosome architecture includes the head size, the number of genes and the linking function. You choose these parameters in the Settings Panel -> General Settings Tab.

Settings Panel - General Settings Tab

The Head Size determines the complexity of each term in your model. In the heads of genes, the learning algorithms try out different arrangements of functions and terminals (original & derived variables and constants) in order to model your data. The plasticity of this architecture allows the discovery of a virtually infinite number of models of different sizes and shapes which are afterwards tested and selected during the learning process. The heads of genes are shown in blue in the compact Karva representation of your models in the Model Panel. This linear code is then translated into any of the built-in programming languages of GeneXproTools (Ada, C, C++, C#, Excel VBA, Fortran, Java, JavaScript, Matlab, Pascal, Perl, PHP, Python, Visual Basic, VB.Net, and VHDL).

Model Panel - Karva Code

More specifically, the head size h of each gene determines the maximum width w and maximum depth d of the sub-expression trees encoded in the gene, which are given by the formulas:

w = (n - 1) * h + 1

d = ((h + 1) / m) * ((m + 1) / 2)

where m is minimum arity and n is maximum arity.
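The two formulas can be evaluated directly. The short Python sketch below simply transcribes them; the function names and the example values (a head size of 8 with a function set whose minimum arity is 1 and maximum arity is 2) are chosen only for illustration.

```python
def max_width(h, n):
    """Maximum width of a sub-expression tree: w = (n - 1) * h + 1."""
    return (n - 1) * h + 1


def max_depth(h, m):
    """Maximum depth of a sub-expression tree: d = ((h + 1) / m) * ((m + 1) / 2)."""
    return ((h + 1) / m) * ((m + 1) / 2)


head_size = 8
min_arity, max_arity = 1, 2

print(max_width(head_size, max_arity))   # 9
print(max_depth(head_size, min_arity))   # 9.0
```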

Thus, the learning algorithm selects its models between these extreme cases, fine-tuning the ideal size and shape during the evolutionary process, creating and testing new nonlinear features on the fly without human intervention.

Model Panel - Expression Trees

The number of genes per chromosome is also an important parameter. It will determine the number of fundamental terms or building blocks in your model as each gene codes for a different sub-expression tree (sub-ET). Theoretically, one could just use a huge single gene in order to evolve very complex models. But the partition of the chromosome into simpler, more manageable units gives an edge to the learning process and more efficient and elegant models can be discovered using multigenic chromosomes.

Whenever the number of genes is greater than one, you must also choose a suitable linking function for linking the mathematical terms encoded in each gene. GeneXproTools allows you to choose addition, subtraction, multiplication, or division to link the sub-ETs. As expected, addition (and obviously subtraction) works very well for virtually all problems but sometimes one of the other linkers could be useful for searching different solution spaces.
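To see what linking means in practice, the Python sketch below combines the values of three sub-ETs with each of the four linking functions. The left-to-right order of application and the example values are assumptions made for illustration.

```python
from functools import reduce
import operator

# The four linking functions named in the text.
LINKERS = {
    "addition": operator.add,
    "subtraction": operator.sub,
    "multiplication": operator.mul,
    "division": operator.truediv,
}


def link(sub_et_values, linker):
    """Combine the sub-ET values left to right with the chosen linker."""
    return reduce(LINKERS[linker], sub_et_values)


# Hypothetical values produced by three genes for one record.
sub_et_values = [2.0, 3.0, 4.0]

for name in LINKERS:
    print(name, link(sub_et_values, name))
```

For these values, addition gives 9.0, subtraction gives -5.0, and multiplication gives 24.0; division computes (2.0 / 3.0) / 4.0.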

Model Panel - Karva Code Showing Linking by Multiplication


Choosing the Fitness Function

For Function Finding problems, in the Fitness Function Tab of the Settings Panel you have access to a wide range of built-in fitness functions. Additionally, you can also design your own custom fitness functions and explore the solution space with them. By choosing Custom in the Fitness Function box, the custom fitness editor is activated.

Settings Panel - Fitness Function Tab

You can design your own custom fitness function using the Custom Fitness Function window to write the code of your fitness function. The code for the custom fitness function must be in JavaScript and can be tested before evolving a model with it by pressing the Test button.


The built-in fitness functions of GeneXproTools for Function Finding:

The kind of fitness function you choose will most probably depend on the statistical function or error measure you are most familiar with. There is nothing wrong with this, as all of them can accomplish an efficient evolution. Still, you might want to try different fitness functions because they travel the fitness landscape differently: some pursue their goal very straightforwardly, while others choose less travelled paths, considerably enhancing the search process.
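As a rough illustration of how an error measure can be turned into a fitness value, the Python sketch below computes two common error measures and maps an error to a fitness that grows as the error shrinks. The mapping max_fitness / (1 + error) and the scale of 1000 are assumptions for this example, not a description of the built-in fitness functions, and custom fitness functions in GeneXproTools must be written in JavaScript.

```python
from math import sqrt


def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    n = len(targets)
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n


def root_mean_squared_error(targets, predictions):
    return sqrt(mean_squared_error(targets, predictions))


def mean_absolute_error(targets, predictions):
    n = len(targets)
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / n


def fitness_from_error(error, max_fitness=1000.0):
    """Assumed mapping: a perfect model (error 0) gets max_fitness."""
    return max_fitness / (1.0 + error)


# Hypothetical model output for the first three example records.
targets = [0.73044339, 0.64924032, 0.58497820]
predictions = [0.70, 0.66, 0.60]

for measure in (mean_squared_error, root_mean_squared_error, mean_absolute_error):
    error = measure(targets, predictions)
    print(measure.__name__, error, fitness_from_error(error))
```

Different error measures penalize large and small deviations differently, which is one reason why they can guide the search along different paths.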


Exploring the Learning Algorithms

GeneXproTools uses two different learning algorithms for Function Finding problems. The first – the basic gene expression algorithm or simply Gene Expression Programming (GEP) – does not support the direct manipulation of random numerical constants, whereas the second – GEP with Random Numerical Constants or GEP-RNC for short – has a facility for handling them directly. These two algorithms search the solution landscape differently and therefore you might wish to try them both on your problems. For example, GEP-RNC models are usually more compact than models generated without random numerical constants.

The kinds of models these algorithms produce are quite different and, even if both of them perform equally well on the problem at hand, you might still prefer one over the other. There are cases, however, where numerical constants are crucial for efficient modeling and, therefore, the GEP-RNC algorithm is the default in GeneXproTools. You activate this algorithm in the Settings Panel -> Numerical Constants by checking the Use Random Numerical Constants box.

Settings Panel - Random Numerical Constants Tab

The GEP-RNC algorithm is slightly more complex than the basic gene expression algorithm as it uses an additional gene domain (Dc) for encoding the random numerical constants. Consequently, this algorithm comes equipped with an additional set of genetic operators (RNC mutation, Dc mutation, Dc inversion, and Dc IS transposition) especially developed for handling random numerical constants. If you are not familiar with these operators, please use the default values by clicking the Default button, as they work very well in all cases; you can also learn more about them in the Knowledge Base.

And last but not least, since these parameters are crucial if you are handling numerical constants directly, you must also choose and adjust the range and type of numerical constants that will be used by the GEP-RNC algorithm during the learning process. As for the Number of Constants per Gene parameter, a good rule of thumb consists of using a small set of 10 different constants per gene as this seems to provide enough diversity for most problems without inflating the structural complexity much.
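The settings described above (number of constants per gene, range, and type) can be pictured with a small Python sketch that draws one set of random constants for a gene. The range of -10 to 10, the function name, and its parameters are assumptions for illustration; only the count of 10 constants per gene comes from the rule of thumb in the text.

```python
import random


def random_constants(count=10, low=-10.0, high=10.0, integers=False, seed=None):
    """Draw `count` random constants in [low, high].

    With integers=True the constants are whole numbers, otherwise they
    are floating-point values drawn uniformly from the range.
    """
    rng = random.Random(seed)
    if integers:
        return [rng.randint(int(low), int(high)) for _ in range(count)]
    return [rng.uniform(low, high) for _ in range(count)]


float_constants = random_constants(seed=1)
integer_constants = random_constants(integers=True, seed=1)
print(float_constants)
print(integer_constants)
```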


Exploring Different Evolutionary Strategies for Efficient Learning

Predicting unknown behavior efficiently is of course the foremost goal in modeling. But extracting knowledge from the blindly designed models is also extremely important as this knowledge can be used not only to enlighten further the modeling process but also to understand the complex relationships between variables.

So, the evolutionary strategies we recommend in the GeneXproTools templates for Function Finding reflect these two main concerns: efficiency and simplicity. Basically, we recommend starting the modeling process with the GEP-RNC algorithm and a function set well adjusted to the complexity of the problem.

GeneXproTools chooses the appropriate template for your problem according to the number of variables in your data. This kind of template is a good starting point that allows you to start the modeling process immediately with just a mouse click. Indeed, even if you are not familiar with evolutionary computation in general and Gene Expression Programming in particular, you will be able to design complex nonlinear regression models immediately thanks to the templates of GeneXproTools. In these templates, all the adjustable parameters of the default learning algorithm are already set and therefore you don’t have to know how to create genetic diversity, how to set the appropriate population size, the chromosome architecture, the fitness function, how to increase the complexity of your models, and so forth. Then, as you learn more about GeneXproTools, you will be able to explore all its modeling tools and quickly and efficiently create very good regression models that will allow you to understand and model your data like never before.

There is, however, a very important setting in GeneXproTools that is not controlled by GeneXproTools templates and must be wisely chosen by you: the number of training records. Theoretically, if your data is well balanced and in good condition, evolutionarily speaking, the more records the better. But there's a catch, obviously: the larger the training set, the slower the evolution or, in other words, the more time will be needed for generations to go by. So, you must compromise here and choose a training set of an appropriate size. A good rule of thumb consists of choosing between 10 and 100 training records for each independent variable in your data; all the remaining records could be used to test how well the evolved models generalize.
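The rule of thumb can be applied before loading the data, for example with a small script that sets aside the training records and keeps the rest for testing. The Python sketch below is one possible way to do this; the function name, the random shuffling, and the placeholder records are assumptions for illustration.

```python
import random


def split_records(records, n_independent, records_per_variable=10, seed=0):
    """Split records into training and testing sets using the rule of thumb.

    The training set gets n_independent * records_per_variable records
    (or all of them if there are fewer); the rest is kept for testing.
    """
    if not 10 <= records_per_variable <= 100:
        raise ValueError("the rule of thumb suggests 10 to 100 records per variable")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # avoid any ordering in the source file
    n_train = min(len(shuffled), n_independent * records_per_variable)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder dataset: 200 record indices standing in for real records
# with 3 independent variables.
all_records = list(range(200))
training, testing = split_records(all_records, n_independent=3, records_per_variable=20)
print(len(training), len(testing))  # 60 140
```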

So, after creating a new run you just have to click the Start button in the Run Panel in order to design a nonlinear regression model. Then you can monitor the evolutionary process, especially the different curve fitting charts. Then, whenever you see fit, you can stop the run without fear of stopping evolution prematurely as GeneXproTools allows you to continue the evolutionary process at a later time by using the best-of-run model as the starting point (evolve with seed). For that you just have to click the Continue button in the Run Panel to continue the search for a better model.

Run Panel - Designing and Optimizing Nonlinear Models

This strategy has enormous advantages as you might choose to stop the run at any time and then take a closer look at the evolved model. For instance, you can analyze its mathematical representation, its performance in the testing set, evaluate a wide set of statistical functions for a quick and rigorous assessment of its accuracy, see how it performs on a different testing set, and so on. Then you might choose to adjust a few parameters, say, choose a different fitness function, expand the function set, add a neutral gene, apply parsimony pressure for simplifying its structure, change the training set for model refreshing, and so on, and then explore this new set of conditions. You can repeat this process for as long as you want or until you are completely satisfied with your model.


Last modified: October 2, 2012

   

 
 

Copyright (c) 2000-2013 Gepsoft Ltd. All rights reserved.