Classification
						    
							
This section covers the fundamental steps in creating nonlinear classification models in the Classification Platform of GeneXproTools. We’ll start with a quick hands-on introduction, followed by a more detailed overview of the tools you can explore in GeneXproTools to create accurate predictive models that explain your data well.
							
								
							                               
							
						   
								
									|  | Hands-on Introduction to Classification | 
					    
							
							
							
							
Designing a good nonlinear classifier in GeneXproTools is really simple: after importing your data from Excel, a database, or a text file, GeneXproTools takes you straight to the Run Panel, where you just have to click the Start Button to create a model. This is possible because GeneXproTools comes with pre-set default parameters and data pre-processing procedures (including dataset partitioning and the handling of categorical variables and missing values) that work well with virtually all problems. We’ll see later how to choose the most basic settings so that you can explore all the advanced tools of the application, but you can in fact design sophisticated and accurate classification models in GeneXproTools with a single click.
							
	                        
							
							Monitoring the Design Process
							
While the learning algorithm is creating the classification model, you can evaluate and visualize the design process through real-time monitoring of model fitting charts and statistics in the Run Panel, such as the Binomial Fitting Charts, the Classification Scatter Plot, the ROC Curve, the Classification Tapestry, the Confusion Matrix, the Classification Accuracy, and the Fitness.
	                        
							
							Model Evaluation & Testing
							
Then, in the Results Panel, you can further evaluate your model using different charts and additional measures of fit, such as the area under the ROC curve, sensitivity, specificity, recall, precision, and the Matthews correlation coefficient. It’s also in the Results Panel that you can check more thoroughly how your model generalizes to unseen data by seeing how well it performs on the validation/test set.
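As an illustration of these measures of fit (a generic sketch, not code generated by GeneXproTools; the function name is ours), the following Python function computes sensitivity, specificity, precision, and the Matthews correlation coefficient from the four cells of a confusion matrix:

```python
from math import sqrt

def classification_metrics(tp, fp, tn, fn):
    """Measures of fit derived from the cells of a 2x2 confusion matrix:
    true/false positives (tp/fp) and true/false negatives (tn/fn)."""
    sensitivity = tp / (tp + fn)   # also known as recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    # Matthews correlation coefficient: +1 perfect, 0 random, -1 inverse
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, precision, mcc

sens, spec, prec, mcc = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```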
							
	                        
							Generating the Model Code
							
Then, in the Model Panel, you can view and analyze the model code, not only in the programming language of your choice but also as a diagram representation or expression tree. GeneXproTools includes 17 built-in programming languages or grammars for Classification. These grammars allow you to generate code automatically in some of the most popular programming languages around, namely Ada, C, C++, C#, Excel VBA, Fortran, Java, JavaScript, MATLAB, Octave, Pascal, Perl, PHP, Python, R, Visual Basic, and VB.NET. More importantly, GeneXproTools also allows you to add your own programming languages through user-defined grammars, which can be easily created using one of the existing grammars as a template.
							
							
	                        
							Making Predictions
							
And finally, in the Scoring Panel of GeneXproTools you can make predictions with your models using the generated JavaScript code. This means you don’t have to know how to deploy the model code in order to make predictions outside GeneXproTools: you can make them straightaway within the GeneXproTools environment, in the Scoring Panel.
							
	                        
GeneXproTools also deploys individual models and model ensembles automatically to Excel using the generated Excel VBA code. Moreover, GeneXproTools allows you to embed the Excel VBA code of your models in the Excel worksheet, so that you can make predictions with your models conveniently in Excel. In addition, for model ensembles, GeneXproTools also evaluates the majority class model, the average probability model, and the median probability model, which are usually more robust and accurate classifiers than individual models.
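The three ensemble combiners can be sketched in a few lines of Python (a generic illustration, not the VBA code GeneXproTools generates; the thresholds and function names are our own):

```python
from statistics import mean, median

def majority_class(votes):
    """Class predicted by most ensemble members (votes are 0/1 labels)."""
    return 1 if sum(votes) > len(votes) / 2 else 0

def average_probability(probs, threshold=0.5):
    """Positive if the mean of the members' probabilities clears the threshold."""
    return 1 if mean(probs) >= threshold else 0

def median_probability(probs, threshold=0.5):
    """Like the average model, but the median resists outlier members."""
    return 1 if median(probs) >= threshold else 0
```

For example, `majority_class([1, 1, 0])` classifies the record as positive, while `median_probability([0.9, 0.1, 0.2])` classifies it as negative because the median member probability falls below the threshold.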
							
	                        
							
							
							
						   
								
									|  | Loading Data | 
					    
							
							
							
																
							
							
The nonlinear classifiers GeneXproTools creates are statistical in nature or, in other words, data-based. GeneXproTools therefore needs training data from which to extract the information required to create the models. Data is also needed for validating and testing the generalizability of the generated models. Validation/test data, although recommended, is not mandatory, and need not be used if data is in short supply.
							
							
Data Sources & Formats
Before evolving a model with GeneXproTools you must first import the input data. GeneXproTools allows you to import data from Excel & databases, text files, and GeneXproTools files.
							
							
Excel Files & Databases

Loading data from Excel or a database requires making a connection with the Excel file or database and then selecting the worksheets or columns of interest. By default GeneXproTools sets the last column as the dependent or response variable, but you can easily set any other column as the response variable by selecting the variable of interest and then checking Response in the context menu.
                            
	                        
	                        
Text Files

For text files GeneXproTools supports three commonly used data matrix formats. The first is the standard Records x Variables (Response Last) format, where records are in rows and variables in columns, with the dependent or response variable occupying the rightmost position. In the small example below with seven records, iris_class is the response variable and sepal_length, sepal_width, petal_length, and petal_width are the independent or predictor variables:
							
							
								
									
								
								
									| sepal_length | sepal_width | petal_length | petal_width | iris_class | 
								
									| 4.9 | 3.1 | 1.5 | 0.1 | 0 | 
								
									| 6.7 | 3.3 | 5.7 | 2.5 | 1 | 
								
									| 5.1 | 3.5 | 1.4 | 0.2 | 0 | 
								
									| 5.8 | 2.6 | 4 | 1.2 | 0 | 
								
									| 4.3 | 3 | 1.1 | 0.1 | 0 | 
								
									| 5.7 | 2.9 | 4.2 | 1.3 | 0 | 
								
									| 5.6 | 2.8 | 4.9 | 2 | 1 | 
							
							
							
The second format, Records x Variables (Response First), is similar to the first, except that the response variable is in the first column.
							
The third data format is the Gene Expression Matrix or GEM format, commonly used in DNA microarray studies or whenever the number of variables far exceeds the number of records. In this format records are in columns and variables in rows, with the dependent variable occupying the topmost position. For instance, the small dataset above in GEM format maps to:
							
							
							
								
									
								
								
									| iris_class | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 
								
									| sepal_length | 4.9 | 6.7 | 5.1 | 5.8 | 4.3 | 5.7 | 5.6 | 
								
									| sepal_width | 3.1 | 3.3 | 3.5 | 2.6 | 3 | 2.9 | 2.8 | 
								
									| petal_length | 1.5 | 5.7 | 1.4 | 4 | 1.1 | 4.2 | 4.9 | 
								
									| petal_width | 0.1 | 2.5 | 0.2 | 1.2 | 0.1 | 1.3 | 2 | 
							
							
							
This format is the standard for datasets with a relatively small number of records and thousands of variables. Note, however, that it is not supported for Excel files: if your data is kept in this format in Excel, you must copy it to a text file and use that file to load your data into GeneXproTools.
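The conversion between the two layouts is just a transpose plus a column move. As a sketch (GeneXproTools performs this internally; the helper below is ours):

```python
def gem_to_records(rows):
    """Convert a Gene Expression Matrix (variables in rows, response on
    top, records in columns) to Records x Variables (Response Last)."""
    transposed = list(zip(*rows))   # records now in rows
    # each row starts with the response value; move it to the end
    return [list(row[1:]) + [row[0]] for row in transposed]

gem = [
    ["iris_class",   0,   1],
    ["sepal_length", 4.9, 6.7],
    ["sepal_width",  3.1, 3.3],
]
table = gem_to_records(gem)
# first row is now the header: sepal_length, sepal_width, iris_class
```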
							
GeneXproTools uses the Records x Variables (Response Last) format internally, so all input formats are automatically converted and shown in this format in both the Data Panel and the Scoring Panel.
							
For text files GeneXproTools supports the standard separators (space, tab, comma, semicolon, and pipe) and detects them automatically. The use of labels to identify your variables is optional, and GeneXproTools also detects automatically whether they are present. Note, however, that using labels allows you to generate more intelligible code in which each variable is clearly identified by its name.
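This kind of separator auto-detection can be reproduced with Python’s standard csv module, for instance (a generic sketch, not how GeneXproTools implements it):

```python
import csv
import io

sample = "5.1,3.5,1.4,0.2,0\n4.9,3.1,1.5,0.1,0\n"
# restrict the sniffer to the separators listed above
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t| ")
rows = list(csv.reader(io.StringIO(sample), dialect))
```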
							
	                        
							GeneXproTools Files
							
							
GeneXproTools files can be very convenient as a data source, since they allow you to select exactly the same datasets used in a particular run. This is especially useful if you want to use the same datasets across different runs.
	                        
							
							
Loading Data Step-by-Step
							
To Load Input Data for Modeling

- Click the File Menu and then choose New. The New Run Wizard appears. Give a name to your new run file (the default filename extension of GeneXproTools run files is .gep), then choose Classification in the Problem Category box and the kind of source file in the Data Source Type box. GeneXproTools allows you to work with Excel & databases, text files, and gep files.
- Go to the Entire Dataset (or Training Set) window by clicking the Next button. Choose the path to the dataset by browsing the Open dialog box and choose the appropriate data matrix format. Irrespective of the data format used, GeneXproTools shows the loaded data in the standard Records x Variables (Response Last) format, with the dependent variable occupying the rightmost position.
- Go to the Validation/Test Data window by clicking the Next button. Repeat the steps of the previous point if you wish to use a specific validation/test set to evaluate the generalizability of your models. Loading a specific validation/test set is optional, as GeneXproTools allows you to split your data into different datasets for training and validation/testing in the Dataset Partitioning window.
- Click the Finish button to save your new run file. The Save As dialog box appears; after choosing the directory where you want your new run file to be saved, the GeneXproTools modeling environment appears. Then you just have to click the Start button to create a model, as GeneXproTools automatically chooses default settings from a gallery of templates that enable you to evolve a model with just a click.
							    
		                            |  | Kinds of Data | 
					    
		                    
		                    
                    
GeneXproTools supports both numerical and categorical variables, and for both types it also supports missing values. Categorical and missing values are replaced automatically by simple default mappings so that you can create models straightaway, but you can choose more appropriate mappings through the Category Mapping Window and the Missing Values Mapping Window.
                            
                            
                            Numerical Data
                            
                           
GeneXproTools supports all kinds of tabular numerical datasets, with variables usually in columns and records in rows. In GeneXproTools all input data is converted to numerical data prior to model creation, so numerical datasets are handled natively, and sophisticated visualization tools and statistical analyses are available for analyzing them.
                            
As long as the data fits in memory, dataset size is unlimited for both the training and validation/test datasets. For big datasets, however, efficient heuristics for splitting the data are applied automatically when a run is created, in order to ensure efficient evolution and good model generalizability. Note that these default partitioning heuristics only apply if the data is loaded as a single file; for data loaded as two different datasets you can access all the partitioning (including the default) and sub-sampling schemes of GeneXproTools in the Dataset Partitioning Window and the General Settings Tab.
                            
                            
                            
                            Categorical Data
                            
                           
GeneXproTools supports all kinds of categorical variables, whether as part of entirely categorical datasets or intermixed with numerical variables. In all cases, during data loading, the categories of all categorical variables are automatically replaced by numerical values so that you can start modeling straightaway. GeneXproTools uses simple heuristics to make this initial mapping, but then lets you choose more meaningful mappings in the Category Mapping Window.
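One simple heuristic of this kind (our illustration only; the exact heuristic GeneXproTools uses is not documented here) is to assign consecutive numeric codes to the distinct categories of a variable:

```python
def default_category_mapping(values):
    """Assign 1.0, 2.0, ... to distinct categories in first-seen order.
    At scoring time, unseen categories can fall back to 0.0, mirroring
    the fallback branch in the generated code."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = float(len(mapping) + 1)
    return mapping

codes = default_category_mapping(["A11", "A12", "A11", "A13"])
```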
							
	                        
Dependent categorical variables are also supported in Classification problems with more than two classes. In these cases the mapping singles out one class, resulting in a binomial outcome such as {0, 1} or {-1, 1}, which can then be used to create classification models. The merging of the response variable in Classification is handled in the Class Merging & Discretization Window.
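The merging itself is a one-vs-rest recoding of the response. As a minimal sketch (the function name and labels are ours):

```python
def merge_classes(labels, positive_class):
    """Single out one class as 1 and merge all the others into 0,
    yielding the binomial response used for classification."""
    return [1 if y == positive_class else 0 for y in labels]

y = merge_classes(["setosa", "versicolor", "virginica", "setosa"], "setosa")
```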
							
The support of GeneXproTools for categorical variables goes beyond giving you a sophisticated and extremely useful tool for changing mappings and experimenting with different scenarios quickly to see how they impact modeling. GeneXproTools also generates code that supports data in exactly the same format that was loaded into GeneXproTools. This means that all the code generated, both for external model deployment and for scoring internally in GeneXproTools, also supports categorical variables. Below is an example in C++ of a classification model with both numeric and categorical variables.
							
							
//------------------------------------------------------------------------
// Classification model generated by GeneXproTools 5.0 on 09/10/2013 
// GEP File: D:\GeneXproTools\Version5.0\Tutorials\LoanRisk_02.gep
// Training Records:  667
// Validation Records:   333
// Fitness Function:  ROC Measure, ROC Threshold
// Training Fitness:  725.459596831407
// Training Accuracy: 77.21% (515)
// Validation Fitness:   776.96843875231
// Validation Accuracy:  78.68% (262)
//------------------------------------------------------------------------
#include "math.h"
#include "string.h"
#include "stdlib.h"
int gepModel(char* d_string[]);
double gep3Rt(double x);
double gepMin2(double x, double y);
double gepMax2(double x, double y);
void TransformCategoricalInputs(char* input[], double output[]);
int gepModel(char* d_string[])
{
    const double ROUNDING_THRESHOLD = 12.0532552252155;
    const double G1C9 = -2.3947874385815;
    const double G2C1 = -6.17908261360515;
    const double G2C9 = 7.41535691396832;
    const double G4C6 = 5.4155095065157;
    double d[20];
    TransformCategoricalInputs(d_string, d);
    double dblTemp = 0.0;
    dblTemp = (d[11]*G1C9);
    dblTemp += (((((d[2]*G2C9)+(d[2]-d[1]))+d[10])/2.0)-gepMin2(gepMax2
	      ((d[18]-d[15]),d[4]),((d[7]+G2C1)/2.0)));
    dblTemp += (d[0]*((d[5]*atan(d[0]))*d[6]));
    dblTemp += (((((d[9]+d[14])/2.0)+d[13])*d[0])-gepMin2
	      (gep3Rt((G4C6*d[13])),d[12]));
    return (dblTemp >= ROUNDING_THRESHOLD ? 1 : 0);
}
double gep3Rt(double x)
{
    return x < 0.0 ? -pow(-x,(1.0/3.0)) : pow(x,(1.0/3.0));
}
double gepMin2(double x, double y)
{
    double varTemp = x;
    if (varTemp > y)
        varTemp = y;
    return varTemp;
}
double gepMax2(double x, double y)
{
    double varTemp = x;
    if (varTemp < y)
        varTemp = y;
    return varTemp;
}
void TransformCategoricalInputs(char* input[], double output[])
{
    if(strcmp("A11", input[0]) == 0)
        output[0] = 1.0;
    else if(strcmp("A12", input[0]) == 0)
        output[0] = 2.0;
    else if(strcmp("A13", input[0]) == 0)
        output[0] = 3.0;
    else if(strcmp("A14", input[0]) == 0)
        output[0] = 4.0;
    else output[0] = 0.0;
    
    output[1] = atof(input[1]);
    
    if(strcmp("A30", input[2]) == 0)
        output[2] = 1.0;
    else if(strcmp("A31", input[2]) == 0)
        output[2] = 2.0;
    else if(strcmp("A32", input[2]) == 0)
        output[2] = 3.0;
    else if(strcmp("A33", input[2]) == 0)
        output[2] = 4.0;
    else if(strcmp("A34", input[2]) == 0)
        output[2] = 5.0;
    else output[2] = 0.0;
    
    output[4] = atof(input[4]);
    
    if(strcmp("A61", input[5]) == 0)
        output[5] = 1.0;
    else if(strcmp("A62", input[5]) == 0)
        output[5] = 2.0;
    else if(strcmp("A63", input[5]) == 0)
        output[5] = 3.0;
    else if(strcmp("A64", input[5]) == 0)
        output[5] = 4.0;
    else if(strcmp("A65", input[5]) == 0)
        output[5] = 5.0;
    else output[5] = 0.0;
    
    if(strcmp("A71", input[6]) == 0)
        output[6] = 1.0;
    else if(strcmp("A72", input[6]) == 0)
        output[6] = 2.0;
    else if(strcmp("A73", input[6]) == 0)
        output[6] = 3.0;
    else if(strcmp("A74", input[6]) == 0)
        output[6] = 4.0;
    else if(strcmp("A75", input[6]) == 0)
        output[6] = 5.0;
    else output[6] = 0.0;
    
    output[7] = atof(input[7]);
    
    if(strcmp("A101", input[9]) == 0)
        output[9] = 1.0;
    else if(strcmp("A102", input[9]) == 0)
        output[9] = 2.0;
    else if(strcmp("A103", input[9]) == 0)
        output[9] = 3.0;
    else output[9] = 0.0;
    
    output[10] = atof(input[10]);
    
    if(strcmp("A121", input[11]) == 0)
        output[11] = 1.0;
    else if(strcmp("A122", input[11]) == 0)
        output[11] = 2.0;
    else if(strcmp("A123", input[11]) == 0)
        output[11] = 3.0;
    else if(strcmp("A124", input[11]) == 0)
        output[11] = 4.0;
    else output[11] = 0.0;
    
    output[12] = atof(input[12]);
    
    if(strcmp("A141", input[13]) == 0)
        output[13] = 1.0;
    else if(strcmp("A142", input[13]) == 0)
        output[13] = 2.0;
    else if(strcmp("A143", input[13]) == 0)
        output[13] = 3.0;
    else output[13] = 0.0;
    
    if(strcmp("A151", input[14]) == 0)
        output[14] = 1.0;
    else if(strcmp("A152", input[14]) == 0)
        output[14] = 2.0;
    else if(strcmp("A153", input[14]) == 0)
        output[14] = 3.0;
    else output[14] = 0.0;
    
    output[15] = atof(input[15]);
    
    if(strcmp("A191", input[18]) == 0)
        output[18] = 1.0;
    else if(strcmp("A192", input[18]) == 0)
        output[18] = 2.0;
    else output[18] = 0.0;
}
                            
                            
                            Missing Values
                            
                           
GeneXproTools supports missing values for both numerical and categorical variables. The supported representations of missing values are NULL, Null, null, NA, na, ?, blank cells, ., ._, and .*, where * can be any letter in lower or upper case.

When data is loaded into GeneXproTools, missing values are automatically replaced by zero so that you can start modeling right away. GeneXproTools then allows you to choose different mappings through the Missing Values Mapping Window.
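A minimal sketch of this default replacement, assuming the token list above (the helper names are ours, not GeneXproTools functions):

```python
def is_missing(token):
    """True for the supported missing-value spellings listed above."""
    if token in {"NULL", "Null", "null", "NA", "na", "?", "", "."}:
        return True
    # "._" and ".x" where x is any single letter, upper or lower case
    return (len(token) == 2 and token[0] == "."
            and (token[1] == "_" or token[1].isalpha()))

def replace_missing(values, replacement=0.0):
    """Default mapping: missing entries become 0.0; a different value
    (e.g. the training average) can be chosen instead."""
    return [replacement if is_missing(v) else float(v) for v in values]

clean = replace_missing(["1.5", "NA", "2.0", ".b", "?"])
```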
							
	                        
In the Missing Values Mapping Window you have access to pre-computed data statistics, such as the majority class for categorical variables and the average for numerical variables, to help you choose the most effective mapping.

As mentioned above for categorical values, GeneXproTools is not just a useful platform for trying out different mappings for missing values, seeing how they impact model evolution, and then choosing the best one: GeneXproTools generates code with support for missing values that you can immediately deploy without further hassle, using exactly the same format that was used to load the data. The sample Python code below shows a classification model with missing values in 5 of the 16 input variables:
							
							
#------------------------------------------------------------------------
# Classification model generated by GeneXproTools 5.0 on 09/10/2013 
# GEP File: D:\GeneXproTools\Version5.0\Tutorials\CreditApproval_01.gep
# Training Records:  460
# Validation Records:   230
# Fitness Function:  Bounded ROC, ROC Threshold
# Training Fitness:  910.514625888451
# Training Accuracy: 86.74% (399)
# Validation Fitness:   928.183091315842
# Validation Accuracy:  86.96% (200)
#------------------------------------------------------------------------
from math import *
def gepModel(d_string):
    ROUNDING_THRESHOLD = -8.05237665014851
    G1C8 = 1.01779229102451
    G2C9 = 1.44871364482559
    G4C2 = -4.84108096560564
    d = [0.0] * len(d_string)
    TransformCategoricalInputs(d_string, d)
    varTemp = 0.0
    varTemp = G1C8
    varTemp = varTemp + pow(atan((((1.0/((d[5]-G2C9)))+((d[1]+d[2])-
	     (d[7]*d[14])))-d[5])),2.0)
    varTemp = varTemp + (d[9]/d[3])
    varTemp = varTemp + (exp(tanh((tanh(d[13])+d[4])))*(G4C2+
	     pow(log(d[8]),2.0)))
    if (varTemp >= ROUNDING_THRESHOLD):
        return 1
    else:
        return 0
def TransformCategoricalInputs(inputList, outputList):
    
    outputList[1] = { 
        "?" : lambda : 0.0,
    }.get(inputList[1], lambda : float(inputList[1]))()
    
    outputList[2] = float(inputList[2])
    outputList[3] = { 
        "l" : lambda : 1.0,
        "u" : lambda : 2.0,
        "y" : lambda : 3.0,
        "?" : lambda : 2.0,
    }.get(inputList[3], lambda : 0.0)()
    
    outputList[4] = { 
        "g" : lambda : 2.0,
        "gg" : lambda : 1.0,
        "p" : lambda : 3.0,
        "?" : lambda : 2.0,
    }.get(inputList[4], lambda : 0.0)()
    
    outputList[5] = { 
        "aa" : lambda : 1.0,
        "c" : lambda : 2.0,
        "cc" : lambda : 3.0,
        "d" : lambda : 4.0,
        "e" : lambda : 5.0,
        "ff" : lambda : 6.0,
        "i" : lambda : 7.0,
        "j" : lambda : 8.0,
        "k" : lambda : 9.0,
        "m" : lambda : 10.0,
        "q" : lambda : 11.0,
        "r" : lambda : 12.0,
        "w" : lambda : 13.0,
        "x" : lambda : 14.0,
        "?" : lambda : 2.0,
    }.get(inputList[5], lambda : 0.0)()
    
    outputList[7] = float(inputList[7])
    outputList[8] = { 
        "f" : lambda : 1.0,
        "t" : lambda : 2.0,
    }.get(inputList[8], lambda : 0.0)()
    
    outputList[9] = { 
        "f" : lambda : 1.0,
        "t" : lambda : 2.0,
    }.get(inputList[9], lambda : 0.0)()
    
    outputList[13] = { 
        "?" : lambda : 0.0,
    }.get(inputList[13], lambda : float(inputList[13]))()
    
    outputList[14] = float(inputList[14])
    
    
                            
                            Normalized Data
                            
                           
GeneXproTools supports different kinds of data normalization (Standardization, 0/1 Normalization, and Min/Max Normalization), normalizing all numeric input variables using data statistics derived from the training dataset. This means that the validation/test dataset is also normalized using the training data statistics, such as the averages, standard deviations, and min and max values evaluated for all numeric variables.
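The key point is that the statistics come from the training data only. A minimal standardization sketch in generic Python (function names are ours):

```python
def training_stats(column):
    """Average and sample standard deviation from the TRAINING data only."""
    n = len(column)
    avg = sum(column) / n
    stdev = (sum((x - avg) ** 2 for x in column) / (n - 1)) ** 0.5
    return avg, stdev

def standardize(column, avg, stdev):
    """Apply the training statistics to any dataset, training or
    validation/test alike."""
    return [(x - avg) / stdev for x in column]

avg, sd = training_stats([2.0, 4.0, 6.0])   # training column
train_z = standardize([2.0, 4.0, 6.0], avg, sd)
test_z = standardize([5.0], avg, sd)        # test reuses training stats
```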
							
	                        
	                        
Data normalization can be beneficial for datasets with variables on very different scales or ranges. Note, however, that data normalization is not a requirement even in these cases, as the learning algorithms of GeneXproTools handle unscaled data quite well. But since normalization might help, GeneXproTools lets you see very quickly whether it improves modeling; if not, you can just as quickly revert to the original raw data.
							
It’s worth pointing out that GeneXproTools offers more than just a convenient way of trying out different normalization schemes. As with categorical variables and missing values, GeneXproTools generates code that also supports data scaling, allowing you to deploy your models confidently, knowing that you can use exactly the same data format that was used to load the data into GeneXproTools. Below is sample code in R of a classification model created using data standardized in the GeneXproTools environment.
							
#------------------------------------------------------------------------
# Classification model generated by GeneXproTools 5.0 on 09/10/2013 
# GEP File: D:\GeneXproTools\Version5.0\Tutorials\CreditApproval_02.gep
# Training Records:  460
# Validation Records:   230
# Fitness Function:  Bounded ROC, ROC Threshold
# Training Fitness:  919.23449068333
# Training Accuracy: 88.70% (408)
# Validation Fitness:   940.661471804493
# Validation Accuracy:  88.70% (204)
#------------------------------------------------------------------------
gepModel <- function(d_string)
{
    ROUNDING_THRESHOLD <- 0.58279768801143
    G1C3 <- 6.29523819696646
    G1C2 <- 7.55729850154118
    G1C4 <- 8.78109073152867
    d <- TransformCategoricalInputs(d_string)
    d <- Standardize(d)
    dblTemp <- 0.0
    dblTemp <- (((min(exp(d[9]),(d[6]+G1C4))+(d[10]/G1C2))-G1C3)*d[10])
    dblTemp <- dblTemp + min(d[15],d[7])
    dblTemp <- dblTemp + tanh((tanh(((gep3Rt(d[11])+d[7])-min(d[14],d[14])))+
	      (d[15]/d[4])))
    dblTemp <- dblTemp + min(d[12],(tanh(((exp((d[4]-d[14]))*
	      (d[13]*d[14])) ^ 2))-d[5]))
    return (if(dblTemp >= ROUNDING_THRESHOLD) 1 else 0)
}
gep3Rt <- function(x)
{
    return (if (x < 0.0) (-((-x) ^ (1.0/3.0))) else (x ^ (1.0/3.0)))
}
TransformCategoricalInputs <- function (input)
{
    output <- rep(0.0, 15)
    output[4] = switch (input[4],
        "l" = 1.0,
        "u" = 2.0,
        "y" = 3.0,
        "?" = 2.0,
        0.0)
    output[5] = switch (input[5],
        "g" = 2.0,
        "gg" = 1.0,
        "p" = 3.0,
        "?" = 2.0,
        0.0)
    output[6] = switch (input[6],
        "aa" = 1.0,
        "c" = 2.0,
        "cc" = 3.0,
        "d" = 4.0,
        "e" = 5.0,
        "ff" = 6.0,
        "i" = 7.0,
        "j" = 8.0,
        "k" = 9.0,
        "m" = 10.0,
        "q" = 11.0,
        "r" = 12.0,
        "w" = 13.0,
        "x" = 14.0,
        "?" = 2.0,
        0.0)
    output[7] = switch (input[7],
        "bb" = 1.0,
        "dd" = 2.0,
        "ff" = 3.0,
        "h" = 4.0,
        "j" = 5.0,
        "n" = 6.0,
        "o" = 7.0,
        "v" = 8.0,
        "z" = 9.0,
        "?" = 8.0,
        0.0)
    output[9] = switch (input[9],
        "f" = 1.0,
        "t" = 2.0,
        0.0)
    output[10] = switch (input[10],
        "f" = 1.0,
        "t" = 2.0,
        0.0)
    output[11] = as.numeric(input[11])
    output[12] = switch (input[12],
        "f" = 1.0,
        "t" = 2.0,
        0.0)
    output[13] = switch (input[13],
        "g" = 1.0,
        "p" = 2.0,
        "s" = 3.0,
        0.0)
    output[14] = switch (input[14],
        "?" = 0.0,
        as.numeric(input[14]))
    output[15] = as.numeric(input[15])
    return (output)
}
Standardize <- function (input)
{
    AVERAGE_11 = 2.21739130434783
    STDEV_11 = 3.88625331874295
    input[11] = (input[11] - AVERAGE_11) / STDEV_11
    AVERAGE_14 = 177.708695652174
    STDEV_14 = 155.270169808247
    input[14] = (input[14] - AVERAGE_14) / STDEV_14
    AVERAGE_15 = 1077.55652173913
    STDEV_15 = 5812.28518492274
    input[15] = (input[15] - AVERAGE_15) / STDEV_15
    return (input)
}
		                    
		                   
							    
		                            |  | Datasets | 
					    
		                    
		                    
Through the Dataset Partitioning Window and the sub-sampling schemes in the General Settings Tab, GeneXproTools allows you to split your data into different datasets that can be used to:

- Create the models (the training dataset or a sub-set of the training dataset).
- Check and select the models during the design process (the validation dataset or a sub-set of the validation dataset).
- Test the final model (a sub-set of the validation set reserved for testing).
							
Of all these datasets, the training dataset is the only mandatory one, as GeneXproTools requires data to create data models. The validation and test sets are optional, and you can indeed create models without checking or testing them. This approach is not recommended, however: you have better chances of creating good models if you check their generalizability regularly, both during model design and during model selection. Nevertheless, if you don’t have enough data, you can still create good models with GeneXproTools, as its learning algorithms are not prone to overfitting the data. In addition, if you are using GeneXproTools to create random forests, validating/testing the models of the ensemble is less important, as ensembles tend to generalize better than individual models.
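A simple sequential split into these three datasets can be sketched as follows (the counts and function name are illustrative, not GeneXproTools defaults):

```python
def partition(records, n_train, n_val):
    """Split records into a training set, a validation set, and a test
    sub-set reserved from the remainder of the data."""
    train = records[:n_train]
    validation = records[n_train:n_train + n_val]
    test = records[n_train + n_val:]
    return train, validation, test

train, validation, test = partition(list(range(10)), n_train=7, n_val=2)
```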
							
                            
                            Training Dataset
                            
                           
The training dataset is used to create the models, either in its entirety or as a sub-sample of the training data. The sub-samples of the training data are managed in the Settings Panel.

GeneXproTools supports different sub-sampling schemes, such as bagging and mini-batch. For example, to operate in bagging mode you just have to set the sub-sampling to Random. In addition, you can also change the number of records used in each bag, allowing you to speed up evolution if you have enough data to ensure good generalization.
							
							Besides 
Random Sampling (which is done with replacement) you can also choose 
Shuffled 
							(which is done without replacement), 
							
Balanced Random (in which a sub-sample is 
							randomly generated so that the proportions of 
							positives and negatives are the same), 
Balanced Shuffled 
							(similar to Balanced Random, 
							but with the sampling done without replacement),
							
Odd/Even 
							Cases (particularly useful for time series 
							data), and different 
Top/Bottom partition 
							schemes with control over the number of records 
							drawn either from the top or bottom of the dataset.
							
							
All types of random sampling can be used in 
mini-batch mode, an extremely useful 
							sampling method for handling big datasets. In mini-batch mode a sub-sample of the 
							training data is generated every 
p generations (the 
							
period of the mini-batch, which 
							is adjustable and can be set in the Settings Panel) and is used for training 
							during that period. This way, 
							for large datasets, good models can be generated quickly using an 
							overall high percentage of 
							the records in the training data, without stalling the whole evolutionary process 
							with a huge dataset that is evaluated every generation. It’s important, however, to find a 
							good balance between the size of the mini-batch and an efficient model evolution. 
							This means that you’ll still have to choose an appropriate number of records to ensure 
							good generalizability, which is true for all datasets, big and small. A simple rule 
							of thumb is to check whether the best fitness is increasing overall: if you see it fluctuating 
							up and down, evolution has stalled and you need either to increase the mini-batch size or 
							increase the time between batches (the period). The chart below clearly shows the overall 
							upward trend in best fitness for a run in mini-batch mode with a period of 
							50.
							
	                        
		                         
	                         
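							This rule of thumb is easy to automate if you log the best fitness per generation. A minimal sketch (the fixed comparison window and its default size of 100 generations are arbitrary choices for illustration, not GeneXproTools settings):

```javascript
// Returns true when best fitness shows no overall upward trend over
// the last `window` generations, suggesting evolution has stalled and
// the mini-batch size or the period should be increased.
function hasStalled(bestFitnessHistory, window = 100) {
  if (bestFitnessHistory.length < window) return false; // not enough data yet
  const recent = bestFitnessHistory.slice(-window);
  // No overall improvement across the window means a stall.
  return recent[recent.length - 1] <= recent[0];
}
```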
							The training dataset is also used for evaluating 
data statistics needed by certain models, 
							such as the average and the standard deviation of the predictor variables used in models created 
							with 
standardized data. Other data statistics used in GeneXproTools include: pre-computed 
							suggested mappings for 
missing values; training data constants used in Excel worksheets 
							both for models and ensembles deployed to Excel; the min and max values of variables when 
normalized data 
							is used (
0/1 Normalization and 
Min/Max 
							Normalization); and so on.
							
							It’s important to note that when a sub-set of the training dataset is used in a run, the 
							
training data constants are still computed from the entire training dataset as 
							defined in the 
Data Panel, not from the sub-set. This matters, for example, when designing 
ensemble models 
							using the different random sampling schemes selected in the Settings Panel.
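							The key point is that these constants are computed once on the training data and then reused unchanged when the model is applied to new records. A sketch of 0/1 Normalization illustrating this role of the training constants (the exact formulas GeneXproTools uses for its normalization schemes are in its documentation; this only shows the mechanism):

```javascript
// Training data constants for one variable: its min and max over
// the ENTIRE training dataset.
function minMax(values) {
  return { min: Math.min(...values), max: Math.max(...values) };
}

// 0/1 Normalization of any value (training or new data) using the
// constants computed on the training data.
function normalize01(x, { min, max }) {
  return (x - min) / (max - min);
}
```

							A new record is normalized with the training constants even if its value falls outside the training range, which is precisely why these constants must travel with the deployed model.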
							
                            
                            
                            Validation Dataset
                            
                           
                            GeneXproTools supports the use of a 
validation dataset, which can be either loaded as a 
                            separate dataset or generated from a single dataset using the GeneXproTools 
partitioning algorithms. 
                            If a single dataset is loaded during the creation of a new run, GeneXproTools automatically 
                            splits the data into Training and Validation/Test datasets, using strategies chosen to 
                            ensure good model design and evolution. These 
default partition strategies 
                            offer useful guidelines, but you can choose different partitions in the 
Dataset Partitioning Window 
                            to meet your needs. For instance, the 
Odds/Evens partition is useful for 
                            
time series data, allowing 
                            for a good split without losing the time dimension of the original data, which can help you 
                            better understand both the data and the generated models.
							
	                        
							GeneXproTools also supports 
sub-sampling for the validation dataset, which, as explained for the 
							training dataset 
above, is controlled in the Settings Panel.
							
							The sub-sampling schemes available for the validation data are exactly the same available for the 
							training data, except of course for the 
							
mini-batch strategy which pertains only to the training data.
							
							It’s worth pointing out, however, that the same sampling scheme in the training and validation data 
							can play very different roles. For example, by choosing the Odds or the Evens, or the Bottom Half or 
							Top Half for validation, you can reserve the other part for testing 
							and only use this 
test dataset at the very 
							end of the modeling process to evaluate the accuracy 
							of your model.
							
							
Another ingenious use of the random sampling schemes available for 
							the validation set in the Classification Platform (especially Random 
							and 
							Shuffled sub-sampling) consists of calculating the 
cross-validation accuracy of 
							a model. A clever and simple way to do this consists of creating a run with the model you want 
							to cross-validate. Then you copy this model 
n times, for instance by importing it 
n times. 
							Then in the History Panel you can evaluate the performance of the model for different sub-samples 
							of the validation dataset. The average values for the fitness and 
							
favorite statistic shown in the 
							statistics summary of the History Panel constitute the cross-validation results for your model. 
							Below is an example of a 30-fold cross-validation 
							evaluated for the training and validation datasets 
							using random sampling of the respective datasets. 
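							If you export the per-copy statistics, the cross-validation summary is just their average (and spread). A small sketch of that computation, assuming foldStats holds the favorite statistic (e.g. accuracy) of each of the n model copies:

```javascript
// Mean and (population) standard deviation of a statistic evaluated
// over n random sub-samples of the validation data.
function crossValidationSummary(foldStats) {
  const n = foldStats.length;
  const mean = foldStats.reduce((s, x) => s + x, 0) / n;
  const variance = foldStats.reduce((s, x) => s + (x - mean) ** 2, 0) / n;
  return { mean, stdDev: Math.sqrt(variance) };
}
```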
							
	                        
                            
                            Test Dataset
                            
                           
                            The dividing line between a 
test dataset and a 
							
validation dataset is not always clear. 
                            A popular 
                            definition comes from modeling competitions, where part of the data is held out and not 
                            accessible to the 
                            people doing the modeling. In this case, of course, there’s no other choice: you create your 
                            model and then others check whether it is any good. But in most real situations people do have 
                            access to all the data, and they are the ones who decide what goes into training, validation and 
                            testing.
							
							GeneXproTools allows you to experiment with all these scenarios and you can choose what works best 
							for the data and problem you are modeling. So if you want to be strict, you can hold out part of 
							the data for testing and load it only at the very end of the modeling process, using the 
							
Change Validation Dataset functionality of GeneXproTools.
							
							Another option is to use the technique described 
							above for the validation dataset, where you 
							hold out part of the validation data for testing. For example, you hold 
							out the Odds or the Evens, 
							or the Top Half 
							or Bottom Half. This obviously requires strong-willed and very disciplined people, so it’s perhaps 
							best practiced only if a single person is doing the modeling.
							
							The takeaway message of all these what-if scenarios is that, after working with GeneXproTools 
							for a while, you’ll be comfortable with what constitutes good practice in testing the accuracy of 
							the models you create with it. We like to claim that 
							the learning algorithms of GeneXproTools are not prone to 
overfitting, and with 
							all its partitioning and sampling schemes you can develop a better sense of 
							the quality of the models it generates.
							
							And finally, the same 
cross-validation technique described above 
							for the validation dataset can be 
							performed for the test dataset.
							
                            
                            
							
						   
								
									|  | Choosing the Function Set | 
					    
							
							
							
							GeneXproTools allows you to choose your function set from a total of 
							279 built-in mathematical functions and an unlimited number of 
							custom functions, designed using the JavaScript language in the GeneXproTools environment.    
							
	                        
							Built-in Mathematical Functions
							
							GeneXproTools offers a total of 279 built-in mathematical
							functions, including 186 different if-then-else rules, that 
							can be used to design both linear and nonlinear classification models. This wide 
							range of mathematical functions allows the evolution of 
							highly sophisticated 
							and accurate models, easily built with the most appropriate functions. 
							You can find the description of all 279 built-in mathematical functions 
							available in GeneXproTools, including their representation, 
							here.
							
							
							
							The Function Selection Tools of GeneXproTools 
							help you select different function sets very quickly 
							by combining the Show options with the 
							Random, Default, Clear, and Select All buttons, plus the 
							Add/Reduce
							Weight buttons, in the Functions
							Panel.
							
	                        
							User Defined Functions
							
							Despite the great diversity of the 
							built-in mathematical functions of GeneXproTools, you may 
							sometimes want to model with functions of your own. 
							GeneXproTools gives you the possibility 
							of creating custom functions (called
							Dynamic UDFs and represented as DDFs 
							in the generated code) and evolving models with them. 
							Note, however, that custom functions are 
							computationally demanding, considerably slowing the evolutionary process, and should therefore be used in moderation.
							
							
							
							By selecting the Functions Tab in the Functions Panel, you have full access to 
							all the available functions, including all the 
							functions you've designed and all the built-in math functions. It's also here in the Functions Panel 
							that you add
							the custom functions (Dynamic UDFs or DDFs) to your modeling 
							toolbox.
							
	                        
		                         
	                         
							To add a custom function to your function set, just check the checkbox in the Select/Weight column and 
							select the appropriate weight for the function (the weight determines the probability of the function being drawn 
							during mutation and other random events in the creation/modification of programs). 
							By default, the weight of each newly added function is 1, but you can increase the probability of a function being included in your models by increasing its weight in the Select/Weight column. GeneXproTools automatically balances your
							function set with the number of independent variables in your data,
							so you just have to select the set of functions for your problem and then set their relative
							proportions by choosing their weights.
							
							To create a new custom function, just click the 
Add button on the Dynamic UDFs frame and the
							DDF Editor appears. You can also edit existing functions 
							through the 
Edit button or remove them 
							altogether from your modeling toolbox by clicking 
							the 
Remove button.
							
	                        
		                         
	                         
							
							By choosing the number of arguments (minimum 1, maximum 4) in the 
							Arguments combobox, the function header appears
							in the code window. Then you just have to write the body of the function in the code editor. The code must be in JavaScript and can be
							conveniently checked for compilation errors by clicking the Test button.
							
							In the Definition box, you can write a brief description of the function for your future reference. The text you write
							there will appear in the Definition column in the Functions Panel.
							
							Dynamic UDFs are extremely powerful and interesting tools as they are treated exactly 
							like the built-in functions of GeneXproTools and therefore can be used to model 
							all kinds of relationships not only between the original variables but also between 
							derived features created on the fly by the learning algorithm. For instance, you can design 
							a DDF so that it will model the log of the sum of four expressions, that is, 
							DDF = log((expression 1) + (expression 2) + (expression 3) + (expression 4)), 
							where the value of each expression will depend on the context of the DDF in the 
							program.
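							For instance, the log-of-sum DDF just described could be coded in the DDF Editor roughly as follows. The function name and header shown here are illustrative, as the editor generates the actual header when you choose four arguments:

```javascript
// Four-argument Dynamic UDF: the log of the sum of its arguments.
// Each argument stands for an expression evolved on the fly by the
// learning algorithm, so in context the DDF models
// log((expression 1) + (expression 2) + (expression 3) + (expression 4)).
function DDF(x1, x2, x3, x4) {
  return Math.log(x1 + x2 + x3 + x4);
}
```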
							
						   
								
									|  | Creating Derived Features/Variables | 
					    
							
							
							
							Derived variables or 
new features can be easily created in GeneXproTools 
							from the original variables. 
							They are created in the Functions Panel, in the Static UDFs Tab.
							                                   
	                        
							Derived variables were originally called 
							
UDFs or 
User Defined Functions 
							and therefore in the code generated by GeneXproTools they are represented as UDF0, UDF1, UDF2, and so on. Note however that 
							UDFs are in fact new features derived from the original variables in the training and 
							validation/test datasets. 
							Like 
DDFs, they are implemented in JavaScript using the JavaScript editor of GeneXproTools.
							
	                        
							These user defined features are then used by the learning algorithm exactly as 
							the original features, that is, they are incorporated into the evolving models 
							adaptively, with the most important being chosen and selected according to 
							how much they contribute to the performance of each 
							model.
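							As a concrete illustration, a derived feature might be the ratio of two original variables. The name UDF0 follows the naming used in the generated code, but the argument convention here is only a sketch of the idea, not the exact editor convention:

```javascript
// Static UDF: a new feature derived from two original variables,
// here their ratio, guarded against division by zero.
function UDF0(d0, d1) {
  return d1 === 0 ? 0 : d0 / d1;
}
```

							Once defined, such a feature competes with the original variables, and the learning algorithm keeps it only if it improves model performance.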
							
						   
								
									|  | Choosing the Model Architecture | 
					    
							
							
							
							
							In GeneXproTools the candidate solutions or models are 
							encoded in linear strings or chromosomes with 
							a special architecture. This architecture includes 
							genes with different gene domains (head, 
							tail, and 
							random constants domains) and a linking function to 
							link all the genes. So the parameters that you can 
							adjust include the head size, the number of genes and the linking
							function. You set the values for these parameters in the Settings Panel -> General
							Settings Tab.
							
	                        
		                         
	                         
	                        
							
							The Head Size determines the complexity or 
							maximum size of each term in your model. In the
							heads of genes, the learning algorithm tries out different arrangements of functions and
							terminals (original & derived variables and constants) in order to model your data. 
							The plasticity of this architecture allows the creation of an infinite number of models 
							of different sizes and shapes. A small number of these models (a 
							population) is 
							randomly generated and then tested to see how well 
							each model explains the data. Then according 
							to their performance or fitness the models are selected to reproduce with some minor changes, 
							giving rise to new models. This process of selection and 
							reproduction is repeated for 
							a certain number of generations, leading to the discovery of better and better models. 
							
							
							
							
							The heads of genes are shown in blue in the compact
							 
							linear representation (Karva notation) of the model 
							code in the Model Panel. 
							This linear code, which is the representation 
							that the learning algorithms of GeneXproTools use 
							internally, is then translated into any 
							of the built-in programming languages of 
							GeneXproTools (Ada, C, C++, C#, Excel VBA, Fortran, Java, JavaScript, 
							Matlab, Octave, Pascal, Perl, PHP, Python, R, Visual Basic, and VB.Net) 
							or any other language you add through the use of 
							GeneXproTools custom 
							grammars.
							
	                        
							
							More specifically, the head size h of each gene determines the maximum width
							w and maximum depth d of the sub-expression trees
							encoded in each gene, which are given by the formulas:
							
								w = (n - 1) * h + 1
								d = ((h + 1) / m) * ((m + 1) / 2)
							
							
							where m is the minimum arity (the smallest 
							number of arguments taken by the functions in the 
							function set) and n is the maximum arity (the 
							largest number of arguments taken by the functions 
							in the function set).
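							Transcribing the two formulas into code makes it easy to experiment with different head sizes and arities (a direct transcription of the formulas above, nothing more):

```javascript
// Maximum width of a sub-expression tree: w = (n - 1) * h + 1,
// where h is the head size and n the maximum arity.
function maxWidth(h, n) {
  return (n - 1) * h + 1;
}

// Maximum depth of a sub-expression tree:
// d = ((h + 1) / m) * ((m + 1) / 2), where m is the minimum arity.
function maxDepth(h, m) {
  return ((h + 1) / m) * ((m + 1) / 2);
}
```

							For example, with a head size of 7 and a function set of one- and two-argument functions (m = 1, n = 2), the formulas give a maximum width of 8 and a maximum depth of 8.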
							
							GeneXproTools also allows you to visualize clearly the model structure 
							and composition by showing the expression trees of all your models in the Model Panel. 
							
	                        
							Thus, the learning algorithm selects its 
							models between the minimum possible size (which is a 
							model composed only of one-element trees) and the 
							maximum allowed size, 
							fine-tuning the ideal size and shape during 
							the evolutionary process, creating and 
							testing new features on the fly 
							without human intervention.
							
							The
 number of genes per chromosome is also an important parameter. It determines the number of 
							fundamental 
terms or 
building blocks in your models, as each gene codes for a different
							sub-expression tree (sub-ET). Theoretically, one could just use a huge single gene 
							to evolve very complex models. However, dividing the chromosome into simpler, 
							more manageable units gives an edge to the learning process, and more efficient and elegant models 
							can be discovered this way.
							
							Whenever the number of genes is greater than one, you must also choose a suitable
							linking function to connect the mathematical terms encoded in each gene.
							GeneXproTools allows you to choose
							
addition, 
subtraction, 
multiplication,
							
 division, 
average, 
min, and 
							max to link the sub-ETs. As expected, addition, subtraction 
							and average work very well for virtually all problems but sometimes one of the other linkers could be useful for searching different solution spaces. 
							For example, if you suspect the solution to a problem involves a quotient between two big terms, you can choose two genes linked by division. 
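							Conceptually, each gene evaluates to a number (the output of its sub-ET) and the linking function folds those outputs into the final model output. An illustrative sketch of just this linking step:

```javascript
// A few of the available linking functions, each combining the
// outputs of a model's sub-ETs into one value.
const linkers = {
  addition: (outputs) => outputs.reduce((a, b) => a + b, 0),
  multiplication: (outputs) => outputs.reduce((a, b) => a * b, 1),
  average: (outputs) => outputs.reduce((a, b) => a + b, 0) / outputs.length,
  min: (outputs) => Math.min(...outputs),
  max: (outputs) => Math.max(...outputs),
};

// Combine the sub-ET outputs of a multigenic model with the chosen linker.
function linkGenes(subETOutputs, linker) {
  return linkers[linker](subETOutputs);
}
```

							As described above, a two-gene model linked by division would likewise compute the quotient of the two sub-ET outputs.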
							
							
							
						   
								
									|  | Choosing the Fitness Function | 
					    
							
							
							
														For Classification problems, in the Fitness Function 
							Tab of the Settings Panel you have access to a 
							total of 52
							
							built-in fitness
							functions, most of which combine multiple objectives, such as the use 
							of different simple reference models, cost matrices, 
							different types of rounding thresholds (including evolvable thresholds), 
							lower and upper bounds for 
							the model output, parsimony pressure, variable pressure, and many more.
							
	                        
	                        
							Additionally, you can also design your own 
							
							custom fitness functions and explore the solution space with 
							them.
							By clicking the Edit Custom Fitness button, the 
Custom 
							Fitness
							Editor is opened and there you can write the code of 
							your fitness function in JavaScript.
							
	                        
	                        
							The kind of fitness function you choose will most probably depend on the 
							
							cost function 
							or error measure you are most familiar with. Although there is nothing wrong with this, 
							since all of them can drive an efficient 
							evolution, you might want to try different 
							fitness functions, for they travel the 
							fitness landscape differently: some head 
							very straightforwardly toward their goal, 
							while others take less travelled paths, 
							considerably enhancing the search process. Having 
							different fitness functions in your modeling toolbox is also 
							essential for designing ensemble models.
							
							
						   
								
									|  | Exploring the Learning Algorithms | 
					    
							
							
							
							
							GeneXproTools uses two different learning algorithms for
							Classification problems. The first – the basic gene expression algorithm
							or simply
							Gene Expression Programming (GEP) – does not support the direct manipulation of random numerical constants,
							whereas the second – GEP with Random Numerical Constants or 
							
							GEP-RNC
							for short – implements a structure for handling them directly. These
							two algorithms search the solution landscape differently and
							therefore it might be a good idea to try them both on your problems. 
							For example, GEP-RNC models are usually more 
							compact than models generated without random 
							numerical constants. 
							
							
							The kinds of models these algorithms produce are quite different
							and, even if both of them perform equally well on the problem at hand,
							you might still prefer one over the other. There are cases, however,
							where numerical constants are crucial for efficient modeling, and
							therefore the GEP-RNC algorithm is the default in
							GeneXproTools. You activate this algorithm in the Settings Panel -> 
							Numerical Constants Tab by checking the Use Random Numerical Constants checkbox. 
							In the Numerical Constants Tab you can also adjust the range and 
							type of the constants, as well as the number of constants per gene. 
							
	                        
		                         
	                         
							
							The GEP-RNC algorithm is slightly more complex than 
							the basic gene expression algorithm, as it uses an additional gene domain (Dc) for encoding the random
							numerical constants. Consequently, this algorithm 
							includes an additional set of genetic operators (RNC 
							Mutation, Constant Fine-Tuning, Constant Range 
							Finding, Constant Insertion, Dc Mutation, Dc Inversion, Dc IS 
							Transposition, and Dc Permutation) especially developed for handling random
							numerical constants. If you are not familiar with these operators, 
							use the default Optimal Evolution strategy by 
							selecting Optimal Evolution in the Strategy combobox, 
							as it works very well in all cases; you can also learn more about the 
							genetic operators in 
							the
							
							Legacy Knowledge Base.
			                
	                        
							
							
						   
								
									|  | Exploring Different Evolutionary Strategies for an Efficient Learning | 
					    
							
							
							Predicting unknown behavior efficiently is of course the foremost goal in modeling. 
							But extracting knowledge from the blindly designed models is 
							also extremely important, 
							as this knowledge can be used not only to further inform the modeling process but also 
							to understand the complex relationships between the variables.
							
							So, the evolutionary strategies we recommend in the GeneXproTools 
templates
							for Classification reflect these two main concerns: efficiency and simplicity. Basically, we recommend starting the modeling process with the 
							
GEP-RNC algorithm 
							and a function set well adjusted to the complexity of the problem.
							
							GeneXproTools selects the appropriate template for your problem according
							to the number of independent variables in your data. This kind of template is a good starting
							point that allows you to begin the modeling process 
							straightaway with just a mouse click. Indeed, even if you are not familiar with evolutionary computation in general and
							
Gene Expression Programming in particular, you will be able to design complex classification models 
							(both linear and nonlinear) immediately thanks to the templates of
							GeneXproTools. In these templates, all the adjustable parameters of the 
							default learning algorithm are already set, 
							so you don’t have to know how to create genetic diversity or how to set the appropriate population size, the chromosome architecture, 
							the number of constants, the type and range of the 
							random constants, the fitness function, the stop 
							condition, and so forth. Then, as you learn more about
							GeneXproTools, you will be able to explore all its modeling tools and quickly and efficiently create very good 
							classification models that will allow you to understand 
							and model your data like never before.
							
							So, after creating a new run you just have to click the 
							
Start button in the
							Run Panel in order to design a classification model. 
							GeneXproTools allows you 
							to monitor the evolutionary process by giving you 
							access to
							different model fitting charts, including different 
							curve fitting charts, scatter plots and residual 
							plots. Then, whenever you see fit, you can stop the run 
							(by clicking the 
Stop button) without fear of stopping 
							evolution prematurely as
							GeneXproTools allows you to continue to improve the 
							model by using the best model thus far (or any other 
							model) as the starting
							point (
evolve with seed 
							method). For that you just have to click the 
							
Continue button in the
							Run Panel to continue the search for a 
							better model. Alternatively you can simplify or 
							complexify your model by clicking either the 
							Simplify or 
Complexify buttons.
							
	                        
							
							This strategy has enormous advantages as you might choose to stop the run at any time and then take a closer look at the evolved model. For instance, you can analyze its mathematical representation, its performance in the 
							validation or test set,
							evaluate essential statistics and measures of fit 
							for a quick and rigorous assessment of its accuracy, 
							see how it performs on a different test set, and so on. Then you might choose to adjust a few parameters, say, choose a different fitness function, expand the function set, add a neutral gene,
							apply parsimony pressure for simplifying its 
							structure, change the training set for model
							refreshing, and so on, and then explore this new 
							set of conditions to further improve the model. You can repeat this process for as long as you want or until you are completely satisfied with 
							your model.