Gepsoft - FAQ

Frequently Asked Questions

		Installation Issues and Demo Functionality How can I move the license to a system when I replace a system? Does the e-mailed key for unlocking the software unlock it for a limited time... How to restore the DEMO functionality? Is GeneXproTools compatible with Windows XP/Vista/7/Server?

		Classification Issues Which values should be used for the 0/1 rounding threshold... I was wondering if it is fair to add false positives and false negatives from the testing set...

		General Issues How does the software deal with calculation errors in the code? Can I use nominal attributes in addition to numeric attributes? Is there a way to weight the data cases? I have a data set for classification with a lot of zeros in it... Is there a method of inserting specific numerical constants? Is there a facility so the user can write in their own functions? I can't find a detailed description of the functions gau(x), gau(x,y)... How can one know that the model obtained is the best possible one? Is the name of the variables provided in the generated code arbitrarily assigned... Is there a possibility of reusing function blocks? What is the difference between Static and Dynamic UDFs? Once a model has been evolved, is it possible to apply it against a test set of data? Is it possible for me to create my own fitness functions? Is it possible to modify the evolved equation by hand and then run the program? How do I simplify the generated code? It is unclear how functions are weighted... The DNA microarrays sample run has 4 functions and more than 7000 variables...

		Learning Algorithms and Evolution Issues There appears to be no control for elitism in the evolutionary process... I have some questions about elitism. Theoretically, elitism means... I know that the new population is chosen using some selection technique... I have some doubts about how the selection process and genetic modification work... During mutation how is it decided that a function is used rather than a terminal? How is mutation implemented? With or without replacement? Is there a way to turn on a record of all genomes that are produced in each generation?



Installation Issues and Demo Functionality

		How can I move the license to a system when I replace a system?
		The license is a small file that is copied to the folder where GeneXproTools was installed and can be moved from system to system.

		Does the e-mailed key for unlocking the software unlock it for a limited time or can I run the software for an unspecified period?
		The e-mailed key unlocks GeneXproTools for an unspecified time period, as the license is perpetual.

		How to restore the DEMO functionality? I have the Professional Edition, but I want to 'test' the ability for classification using about 5400 input variables. Of course, this is beyond what is allowed for the Professional version, but can I use the DEMO version just to test this case (without getting the coded results)? I already tried downloading the DEMO again, but the install automatically registered me for only the Professional Edition.
		You don’t have to download and install the Demo again. To restore the Demo functionality, temporarily remove the license file (gxpt40.license) from the application folder and restart GeneXproTools.

		Is GeneXproTools/GeneXproServer compatible with Windows XP/Vista/7/Server?
		Both GeneXproTools and GeneXproServer have been tested and are compatible with Windows 2000, Windows XP, Windows Vista, Windows 7, Windows Server 2003, and Windows Server 2008. They are compatible with both the 32-bit and 64-bit (x64 and Itanium) versions of these operating systems.



Classification Issues

		Which values should be used for the 0/1 rounding threshold: will the current model not adjust itself to any value you have chosen?
		You are right: the plasticity of GeneXproTools learning algorithms is extraordinary and they can, indeed, find a way around this (you see the same plasticity operating with different function sets, different linking functions, different random constants, different program structures and so on). But for most problems there is a certain 0/1 rounding threshold that works best. It will depend mainly on three factors: (1) on the kind of data you have (only positive values, only negative values, a mixture of the two, normalized or non-normalized data, etc.); (2) on the kind of functions you are using (for instance, if you are using functions that by themselves already round the values they return to 0 or 1, in this case a rounding threshold between 0 and 1 will do); and (3) on the presence of random numerical constants (when a diverse range of random numerical constants is available it is always possible for the algorithm to work its way around even the most adverse situations, such as a negative rounding threshold combined with only positive inputs and a function set with only addition and multiplication).
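		As a minimal sketch of what the 0/1 rounding threshold does, the snippet below converts raw model outputs into class labels. The function name and the "greater than or equal" convention are assumptions made for illustration, not a description of GeneXproTools internals.

```python
def classify(model_output: float, threshold: float = 0.5) -> int:
    """Return 1 when the raw model output reaches the threshold, else 0."""
    return 1 if model_output >= threshold else 0


raw_outputs = [-1.3, 0.2, 0.5, 4.7]

# A threshold inside the range of the outputs separates the two classes.
print([classify(y) for y in raw_outputs])

# A threshold below every output labels all samples as class 1, which is
# why an ill-chosen threshold makes the learning task harder.
print([classify(y, threshold=-2.0) for y in raw_outputs])
```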

		I have a classification task with two classes and I was wondering if it is fair to add false positives and false negatives from the testing set back into the training set.
		You should prepare your training data to match the statistics of the testing data.



General Issues

		How does the software deal with calculation errors in the code? Consider, for instance, a modified piece of code from a model I generated: ((((y33)(y246))/(xyz))*(1.0/3.0)). Turns out that the first part of this code can be negative. In which case you are raising this negative term to a fractional power. How does the code deal with this?
		All the code generated by GeneXproTools is generated against a training dataset, and in this dataset the generated models never return calculation errors: any model that returns a calculation error during the evolutionary process has zero fitness and is therefore excluded from the population. Calculation errors might, however, appear in the testing set or during scoring, as the model was not checked against these particular values during training. In both cases GeneXproTools shows you clearly for which data points calculation errors are returned.
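		The kind of check described above can be sketched outside the application as follows. The model here is a hypothetical expression (a cube root of a ratio) chosen only because it can fail on negative or zero inputs; it is not the model from the question, and the helper names are illustrative.

```python
import math


def hypothetical_model(row):
    # Illustrative only: a fractional power of (a * b / c).
    a, b, c = row
    return math.pow((a * b) / c, 1.0 / 3.0)


def find_calculation_errors(model, rows):
    """Return the indices of rows for which the model yields no finite value."""
    bad_rows = []
    for index, row in enumerate(rows):
        try:
            value = model(row)
            if not math.isfinite(value):
                bad_rows.append(index)
        except (ValueError, ZeroDivisionError, OverflowError):
            # math.pow raises ValueError for a negative base with a
            # fractional exponent; division by zero raises ZeroDivisionError.
            bad_rows.append(index)
    return bad_rows


scoring_rows = [(8.0, 1.0, 1.0), (-1.0, 2.0, 1.0), (1.0, 2.0, 0.0)]
print(find_calculation_errors(hypothetical_model, scoring_rows))
```

		Running such a check on new data before trusting the predictions mirrors what the testing and scoring views report.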

		Can I use nominal attributes in addition to numeric attributes?
		It is possible to use GeneXproTools to solve problems with nominal values, such as Yes and No. For that you’ll just have to convert them into numbers, for instance, 1 and 0 or 1 and -1 for Yes and No, respectively.
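		A minimal sketch of that conversion, done before the data is loaded into GeneXproTools, might look like this. The helper name and the mapping dictionaries are illustrative assumptions.

```python
YES_NO_AS_1_0 = {"Yes": 1, "No": 0}
YES_NO_AS_1_MINUS_1 = {"Yes": 1, "No": -1}


def encode_nominal(values, codes):
    """Replace each nominal value with its numeric code."""
    return [codes[value] for value in values]


answers = ["Yes", "No", "Yes"]
print(encode_nominal(answers, YES_NO_AS_1_0))
print(encode_nominal(answers, YES_NO_AS_1_MINUS_1))
```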

		Is there a way to weight the data cases? I would like to be able to process the data I'm using by weighting it. The data that I am working with contains both very high quality cases and some that are not as accurate. When modeling this data, I would like to place more importance on the accuracy of the model for the high quality data but not neglect the low quality data that I have. Initially, I did this by assigning a weighting factor to each test case. I then used this weighting factor to scale the residual from the model for each case, which led to an R-square fitness value that drove the regression model to focus on the best quality data. I would like to use GeneXproTools to model my data, but I still need to weight the data. Is there a way to do this? Any suggestions would be appreciated.
		The simplest way of doing this consists of duplicating a certain number of times the high quality sample cases, thus giving more clout to these sample cases in model design. Another alternative would be the creation of a custom fitness function where you could take into account your “weighting factor”.
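		The two options can be related with a small sketch. Assuming integer weights and a mean squared error measure (both assumptions for illustration, as are the function names), duplicating a case w times has the same effect on the error as weighting its squared residual by w.

```python
def replicate_by_weight(items, weights):
    """Repeat each item as many times as its integer weight."""
    replicated = []
    for item, weight in zip(items, weights):
        replicated.extend([item] * weight)
    return replicated


def mse(targets, predictions):
    """Plain mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def weighted_mse(targets, predictions, weights):
    """Mean squared error with each squared residual scaled by its weight."""
    total = sum(w * (t - p) ** 2 for t, p, w in zip(targets, predictions, weights))
    return total / sum(weights)


targets = [1.0, 2.0, 3.0]
predictions = [1.5, 2.0, 2.0]
weights = [3, 1, 1]  # the first case is treated as high quality

duplicated_error = mse(
    replicate_by_weight(targets, weights),
    replicate_by_weight(predictions, weights),
)
print(duplicated_error, weighted_mse(targets, predictions, weights))
```

		Duplication needs no custom code but only supports whole-number weights; a custom fitness function can use fractional weights as well.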

		I have a data set for classification with a lot of zeros in it, and I cannot get the program to initialize. The data set has 10 variables and 200 samples with almost equal numbers of members in each of the two categories (0 and 1). Is it possible to increase the number of attempts beyond 2000 or which settings would be best to try and alter?
		The fact that you are unable to initialize the run after 2000 tries might mean several things, and most of the time there’s a way around this. But without seeing the dataset and the settings you are using, it’s hard to say what the cause might be. Sometimes increasing the number of tries (you’ll see how this is done below) or increasing the population size gets the program to initialize, but it is not uncommon for the problem to persist and, in this case, you’ll need to look carefully at your data in order to determine the cause and then fix it. First, start by checking the rounding threshold: perhaps it is not the most appropriate for your data. Second, use random numerical constants and experiment with different intervals. Third, try different function sets in order to pinpoint which functions are generating calculation errors (for instance, for a dataset with lots of zeros it might be advisable not to give the division operator too much weight). And finally, it is always a good idea to start with smaller subsets of your training data to see if the problem disappears, and then increase the number of training samples until the problem appears again. This could help you understand what the problem might be. In any case, GeneXproTools allows you to change the number of tries either directly or indirectly. The simplest way consists of increasing the population size as this, for the present purposes, is similar to increasing the number of tries. But you can also change the default 2000 tries to another value by changing this setting directly in the App.config file of GeneXproTools. Search for "Run.Startup.Timeout" value="2000" and replace 2000 with the new value. Then you must close the application and start it again.

		Is there a method of inserting specific numerical constants (such as the speed of light or the Faraday constant) into the equations being generated?
		There are different ways of doing this. The most straightforward and most efficient way consists of creating different UDFs, each returning a constant value. Another straightforward way consists of introducing the numerical constant through the Change Seed window. This obviously implies having the GEP-RNC algorithm turned on and a starting model (seed) in which the desired constant is introduced by hand. The problem with this strategy is that if the seed is not good enough, it (plus the numerical constants) might be lost in the evolutionary process; but if the seed is very good and its numerical constants play an important role, the constants will most certainly be kept, and other models will most probably evolve that are descendants of this seed and use the same constants. For those cases where only one specific constant is involved and you are sure that a good model can be created using that one constant, you can turn on the GEP-RNC algorithm and then choose the same lower and upper bounds for the random numerical constants so that the evolved models will only use the constant you are interested in. Another straightforward way, albeit not very efficient in terms of computational resources, consists of creating different DDFs, each returning a constant value. Contrary to UDFs, which are only evaluated at the beginning of each run, DDFs are very high maintenance as they are evaluated each time they appear in the models, so you should always prefer UDFs for such simple functions returning constant values. Yet another way of introducing specific numerical constants is the popular method used in evolutionary computation of adding extra inputs that carry the constants. This obviously means that the numerical constants must be introduced in all the datasets (training, testing, and scoring). This method is also very efficient and, although not as elegant, is in fact very similar to using UDFs for introducing the constants.
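		The last method (carrying a constant as an extra input) can be prepared with a few lines of preprocessing. The sketch below assumes each row is a list whose last column is the target; that layout and the helper name are assumptions made for illustration.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def add_constant_input(rows, constant):
    """Insert the constant as an extra input just before the last column."""
    return [row[:-1] + [constant] + row[-1:] for row in rows]


training = [[1.0, 2.0, 10.0], [3.0, 4.0, 20.0]]
testing = [[5.0, 6.0, 30.0]]

# The same column must be added to every dataset used with the model.
training_with_c = add_constant_input(training, SPEED_OF_LIGHT)
testing_with_c = add_constant_input(testing, SPEED_OF_LIGHT)
print(training_with_c[0])
```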

		Is there a facility so the user can write in their own functions (I was thinking of hyperbolic functions for example)?
		As a matter of fact hyperbolic functions are already part of the built-in functions of GeneXproTools, but yes, through the use of Dynamic UDFs (DDFs) you can write your own functions and then evolve models with them. And except for the maximum number of arguments these user defined functions take, there are no restrictions whatsoever concerning their form or structure.

		I can't find a detailed description of the functions gau(x), gau(x,y), gau(x,y,z), and gau(a,b,c,d).
		All GeneXproTools functions are implemented in C++, so the best way to find out their exact definitions is by checking the C++ grammar or the code output for each one of them. For example, the gau series corresponds to:
		gepGau = return exp(-pow(x,2));
		gepGau2 = return exp(-pow((x+y),2));
		gepGau3 = return exp(-pow((x+y+z),2));
		gepGau4 = return exp(-pow((a+b+c+d),2));
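		For readers who want to experiment with these definitions outside the application, the four expressions translate directly into Python. The Python function names are illustrative; only the formulas come from the definitions above.

```python
import math


def gep_gau(x):
    """exp(-(x)^2)"""
    return math.exp(-(x ** 2))


def gep_gau2(x, y):
    """exp(-(x + y)^2)"""
    return math.exp(-((x + y) ** 2))


def gep_gau3(x, y, z):
    """exp(-(x + y + z)^2)"""
    return math.exp(-((x + y + z) ** 2))


def gep_gau4(a, b, c, d):
    """exp(-(a + b + c + d)^2)"""
    return math.exp(-((a + b + c + d) ** 2))


# Each function peaks at 1 when the sum of its arguments is zero.
print(gep_gau(0.0), gep_gau2(1.0, -1.0), gep_gau3(1.0, 1.0, -2.0))
```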

		How can one know that the model obtained is the best possible one?
		For complex real-world problems there is no guarantee that any of the evolved models is the ‘best possible one’ (the same can obviously be said for any mathematical model handcrafted by a human modeler). But one can make sure that it is most definitely one of the best possible models, with the performance on the testing set being the real indicator of how good these models are.

		Is the name of the variables provided in the generated code arbitrarily assigned (and indexed) or are they made from the headers of the variables columns?
		When no headers are present in the datasets to identify each variable or attribute, GeneXproTools automatically assigns an index to each variable: for instance, d₀ corresponds to the variable in the first column, d₁ to the variable in the second, and so on. When headers are present, they not only let you generate more intelligible code (by choosing Use Labels in the Model Panel) but also make things more readable elsewhere, as they are used in the Variable Heat Maps of the Run Panel, in the Data Panel, and in the History Section of the Report Panel. In either case, GeneXproTools internally uses the d₀ – d_(n-1) notation, including in both the parse trees and the compact Karva code.

		In the models I am trying to discover with GeneXproTools, some function blocks are of particular importance. I am trying to make their discovery easier. These function blocks are of the form: v1^c1/(c2 + v2^c1)c3*v3, and so on, where v_i are variables and c1 is a positive small integer, c2 and c3 are real numbers. Being a beginner I don't know yet if I can use Static or Dynamic UDFs for this goal. More precisely it is not clear to me if this UDF can deal only with variables or if it can use constants too.
		You should use Static UDFs in order to explore and reuse particular function blocks that you are interested in. There are virtually no limits concerning the form of UDFs and they can use both variables and constants.

		What is the difference between Static and Dynamic UDFs?
		Static UDFs (UDFs for short) are used to describe and explore relationships between the variables in your data such as (v1+v5+v9)/15. These UDFs are static in nature because the variables v_i they use represent certain variables in your data and, in the parse trees, they always occupy a terminal or leaf position. Dynamic UDFs (DDFs for short) behave exactly like a normal built-in function of GeneXproTools and therefore their variables or arguments are fluid, that is, their meaning depends on their context in the parse trees. For instance, the DDF (e1+e2+e3)/15 will take three arguments and in the parse trees will appear as a bud node with three branches where each branch represents a particular expression e_i. These expressions might be as simple as a constant or as complex as functions involving hundreds of variables.

		Once a model has been evolved, is it possible to apply it against a test set of data?
		Yes, and if you load your testing data before evolving the model, GeneXproTools will immediately evaluate its performance on the testing set and show you the results straightaway on the Run Panel. You can also load different testing sets and have GeneXproTools evaluate model performance on all of them.

		Is it possible for me to create my own fitness functions?
		Yes, you can create very interesting custom fitness functions as GeneXproTools gives you access not only to basic parameters such as the number of samples, averaged target output, and so forth but also useful information about the topology of the evolving models such as program size, used variables, and number of literals.

		Is it possible to modify the evolved equation by hand and then run the program? This would be useful when trying to simplify the equation by hand.
		You can do this with GeneXproTools and it is very simple. For that you must know how to perform basic manipulations in Karva notation. In the online tutorial Karva Notation: The Native Language of GeneXproTools you can find some tips. The most important step, though, is to write the equation as a tree and then linearize the tree by writing down all its elements from top to bottom and from left to right. After that you’ll be able to create a valid seed program, introduce it through the Change Seed Window, and then run it just as you run any other program.
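		The linearization step (top to bottom, left to right) is a breadth-first traversal of the expression tree. The sketch below shows it for the hypothetical expression (a + b) * (c - d); the tuple-based tree representation and the single-letter symbols are assumptions for illustration, and a real seed must still fit the gene structure of the run.

```python
from collections import deque


def to_k_expression(root):
    """Read a tree level by level, left to right, into a string of symbols.

    Each node is a (symbol, children) tuple; leaves have an empty list.
    """
    symbols = []
    queue = deque([root])
    while queue:
        symbol, children = queue.popleft()
        symbols.append(symbol)
        queue.extend(children)  # children are visited after the current level
    return "".join(symbols)


# (a + b) * (c - d) written as a tree with "*" at the root.
tree = (
    "*",
    [
        ("+", [("a", []), ("b", [])]),
        ("-", [("c", []), ("d", [])]),
    ],
)
print(to_k_expression(tree))
```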

		How do I simplify the generated code? Is there a program available that will convert the code generated into a simplified algebraic expression or parse tree? It is easy enough to do by hand for simple codes but for the more elaborate ones it is rather error prone by hand.
		GeneXproTools allows you to do automatic simplifications through the Simplification Method and also allows you to do simplifications by hand through the Change Seed Window. And since GeneXproTools also draws the evolved code in tree form (parse trees or expression trees) you shouldn’t have any problems making and checking the simplifications you are doing.

		It is unclear how the weighting of functions and the non-weighting of possibly hundreds or thousands of terminals is dealt with. There is a short note on this on the Help File that suggests that there is some rationalization of the difference between the numbers of functions with weights and the numbers of terminals (variables). For example, when selecting terminals for the tails of a gene, are all terminals weighted equally? If not, how are they weighted since there is no domain knowledge that would assist in doing this other than give equal weight to all? Also, when choosing in the head (functions or terminals) if we had only 4 functions and 100 terminals then clearly the terminals would be selected much too often. Do the weightings get adjusted so that the functions and terminals get approximately the same overall weight or is there some other rule of thumb used? It is important to understand this especially when trying to understand the impact of changing function weights.
		First of all, terminals are not weighted in GeneXproTools, that is, they are all weighted equally, no matter whether they are being chosen for the heads or the tails. So the only way of weighting certain variables would be either by duplicating them in the datasets or creating UDFs returning a certain terminal. In the heads, however, terminals are chosen together with functions and, given that in version 4.0 there are no limits concerning the number of variables, it became essential to automatically balance the function set. As you point out, this is documented in the Help File (also accessible through the Online Knowledge Base) and was optimized to produce an efficient evolution. The rule of thumb is very simple: when the number of functions in the function set is smaller than the number of terminals, for each position in the heads, the probability of it being a function is 1/2. When functions outnumber terminals though, all elements (functions and terminals) are equally weighted. These rules are operational both during the creation of the initial population and during mutation. Also important is that when you are using random numerical constants, the rule of thumb for choosing the terminal “?” (the special terminal that represents the random numerical constants) both for the heads and tails is 1 out of 3 terminals.
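		The rule of thumb for head positions can be written down as a small function. This is a sketch of the stated rule only: the function name is hypothetical, and counting each function once (ignoring its weight) in the "equally weighted" case is an assumption.

```python
def function_probability(n_functions: int, n_terminals: int) -> float:
    """Probability that a head position holds a function rather than a terminal."""
    if n_functions < n_terminals:
        # Fewer functions than terminals: the function set is balanced to 1/2.
        return 0.5
    # Otherwise every element (function or terminal) is equally likely.
    return n_functions / (n_functions + n_terminals)


print(function_probability(4, 100))   # few functions, many terminals
print(function_probability(10, 2))    # functions outnumber terminals
```

		Without the balancing rule, 4 functions among 100 terminals would be chosen for only 4 out of 104 head positions, which is the imbalance raised in the question.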

		The DNA microarrays sample run has 4 functions (+ - * / with weights 5 5 5 1 respectively) and more than 7000 variables. As runs proceed it doesn't seem to me that functions are actually getting selected 1/2 of the time. Do the functions only get this probability of selection in the initial population selection? Or is it that the terminals are getting higher allocation since there are so many more of them?
		No, the rules are exactly as explained in the previous Q&A and they apply both during the creation of the initial population and during mutation. The reason it might appear that functions are being introduced with a lower probability is that you are looking at the results of an evolutionary process with simple elitism and if the best-of-generation happens to be a small program then it will leave its mark in future generations (it also depends on the type of genetic operators you are using and their rates). There are some simple things you can try using the DNA microarrays example to study this: First, set the plots in the Run Panel to All Sizes and Avg/Best Size; second, set the stop condition to zero generations (the initial population) in order to check the sizes of all the programs randomly created in the initial population; third, revert to a normal stop condition and create several runs to see the size profile of the evolving populations; fourth, see also what happens when only mutation is switched on; and finally, see also what happens when you evolve with parsimony pressure.



Learning Algorithms and Evolution Issues

		In GeneXproTools there appears to be no control for elitism in the evolutionary process. Does GeneXproTools just choose the single best (or n best) candidates to promote to the next generation?
		GeneXproTools practices the simplest form of elitism, which consists of keeping the single best program of each generation intact and cloning it into the next generation. And because GeneXproTools is optimized for efficiency, the number of programs cloned into the next generation is equal to 1 and cannot be changed.

		I have some questions about elitism. Theoretically, elitism means that from the current population the n elite members of the population are moved to the next generation (n is likely to be 1 but I’m not sure... not even really sure if elitism is used in GeneXproTools). If this is done are these elite individuals not subjected to further genetic manipulation (mutation, recombination, etc.)?
		GeneXproTools uses simple elitism (or the cloning of the single best individual of the population). This means that, each generation, the genome of the best individual is copied unchanged into the next generation. However, like all individuals of the population, this individual is also subjected to selection and, being the very best, it has the highest probability of being selected to reproduce with modification. So, this individual not only gets special treatment (cloning) but also is selected according to its fitness like everybody else. And its lucky descendants (besides the clone, of course) are subjected to the genetic operators of mutation, inversion, transposition, and recombination in order to create genetic diversity.
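		A minimal sketch of this scheme is shown below, with chromosomes represented as plain strings. The function name is hypothetical, and `random.choices` with fitness weights stands in for the selection step; the genetic operators that would then modify the selected parents are left out.

```python
import random


def next_generation(population, fitness, rng):
    """Clone the best individual unchanged, then select the rest by fitness."""
    best_index = max(range(len(population)), key=lambda i: fitness[i])
    clone = population[best_index]  # copied intact, never modified

    # The best individual also competes in selection like everybody else,
    # so it can appear again among the parents chosen for modification.
    parents = rng.choices(population, weights=fitness, k=len(population) - 1)
    return [clone] + parents


rng = random.Random(0)
population = ["aaa", "bbb", "ccc"]
fitness = [1.0, 5.0, 2.0]
new_population = next_generation(population, fitness, rng)
print(new_population)
```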

		I know that the new population is chosen (selection process) using some selection technique. One guesses that the process is either roulette (weighted selection based on fitness) or tournament selection. I would guess that roulette is used based on some comments in Ferreira’s book but I’m not sure.
		Yes, you are right. In GeneXproTools roulette-wheel selection is used and it is implemented as described in Ferreira’s book.
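		For readers unfamiliar with the technique, a generic roulette-wheel (fitness-proportionate) selection can be sketched as follows. This is a textbook version, not the GeneXproTools implementation, and the function name is illustrative.

```python
import bisect
import itertools
import random


def roulette_select(fitness, rng):
    """Return the index of one individual, chosen in proportion to fitness."""
    cumulative = list(itertools.accumulate(fitness))
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("at least one individual needs positive fitness")
    spin = rng.random() * total  # where the wheel stops
    index = bisect.bisect_right(cumulative, spin)
    return min(index, len(fitness) - 1)  # guard against rounding at the edge


rng = random.Random(42)
fitness = [1.0, 3.0, 6.0]
picks = [roulette_select(fitness, rng) for _ in range(10_000)]

# The observed shares should be close to 0.1, 0.3 and 0.6.
print([picks.count(i) / len(picks) for i in range(len(fitness))])
```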

		I have some doubts about how the selection process and genetic modification work. Is the entire set of new individuals selected at once and then mutation, inversion, etc. applied by selecting from this new population? And how are the rates applied? For example, are they picked up randomly n times (where n is the number of individuals in this selected population)... then apply mutation, inversion etc and then put back into the population? And for the combination operators, are they chosen 2 at a time and put the 2 new children back into the population, removing the parents?
		The selection and reproduction processes are done as shown in the GEP flowchart, that is, first, individuals are selected by roulette-wheel to be reproduced (so, in your words, they are all selected at once). And when they reproduce, depending on the rates and number of the genetic operators being used, their genomes might get modified or not. There are, however, basically two different kinds of modification rates: (1) rates like mutation rates, which relate to points (or individual targets) in the chromosomes; and (2) rates like inversion or transposition rates, which relate to individual chromosomes. This means that when you are applying the mutation rate (the first kind of modification rate) the probability of mutation is evaluated relative to the number of points in the chromosome. But since chromosomes in artificial evolutionary systems are usually very small (compared with chromosomes in nature), it is advisable to use all the chromosomes in the population in order to determine the number of points to mutate. For instance, for 30 chromosomes of size 50, there are a total of 1500 potential targets, so, in this case, a mutation rate of say 0.05 means that a total of 75 points are randomly mutated each generation. When, for example, you are evaluating a transposition rate or an inversion rate (the second kind of modification rate), the probability of transposition/inversion is evaluated relative to the number of chromosomes in the population. For instance, for a population with 30 individuals, a transposition rate of 0.3 means that, in this case, a total of 9 different chromosomes are randomly chosen to undergo transposition. Recombination is similar to the latter, with the difference that in this case you’ll need an even number of chromosomes to recombine (if, for instance, the total number of chromosomes to recombine gives 9, only 8 are actually recombined).
		And in all cases, when a chromosome is modified, say chromosome 2, the new version replaces the old one (recombination is no different and, for instance, if chromosomes 5 and 8 were selected to undergo recombination, then the two new versions would replace the old chromosomes 5 and 8). So, as you can see, different modifications might accumulate on a chromosome during its reproduction or, in other words, a chromosome might be modified more than once by different genetic operators during its reproduction.
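		The arithmetic in the examples above can be reproduced with a few helper functions. The names are hypothetical, and rounding to the nearest whole number is an assumption, since the answer does not say how fractional counts are handled.

```python
def mutation_points(population_size, chromosome_length, mutation_rate):
    """Number of points mutated per generation across the whole population."""
    return round(population_size * chromosome_length * mutation_rate)


def chromosomes_to_modify(population_size, rate):
    """Number of chromosomes chosen for a per-chromosome operator."""
    return round(population_size * rate)


def chromosomes_to_recombine(population_size, rate):
    """Recombination works on pairs, so an odd count drops one chromosome."""
    count = chromosomes_to_modify(population_size, rate)
    return count - (count % 2)


print(mutation_points(30, 50, 0.05))       # 1500 targets at a rate of 0.05
print(chromosomes_to_modify(30, 0.3))      # transposition or inversion
print(chromosomes_to_recombine(30, 0.3))   # 9 chromosomes leave 8 to pair up
```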

		During mutation how is it decided that a function is used rather than a terminal (here again is the question of selecting a symbol based on weightings of functions and non-weighting of terminals)?
		During mutation, the same rules that are used to create the initial population apply, that is, when the number of functions in the function set is smaller than the number of terminals, for each mutation point in the heads, the probability of it mutating into a function is 1/2. When functions outnumber terminals though, all elements (functions and terminals) are equally weighted.

		How is mutation implemented in GeneXproTools? With or without replacement? And the other operators?
		Mutation is implemented with replacement, but the remaining operators (inversion, transposition, and recombination) are implemented without replacement.
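		In sampling terms, "with replacement" means the same target can be drawn more than once, while "without replacement" guarantees distinct targets. The sketch below illustrates the difference using the standard library; the counts are borrowed from the rate examples elsewhere in this FAQ, and how GeneXproTools draws its targets internally is not shown here.

```python
import random

rng = random.Random(1)

# With replacement: 75 mutation points drawn from 1500 positions,
# so the same position may be drawn more than once.
mutation_targets = rng.choices(range(1500), k=75)

# Without replacement: 9 chromosomes drawn from 30, all of them different.
transposition_targets = rng.sample(range(30), 9)

print(len(mutation_targets), len(set(mutation_targets)))
print(len(transposition_targets), len(set(transposition_targets)))
```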

		Is there a way to turn on a record of all genomes that are produced in each generation? This would be most useful for teaching and for understanding details of the evolutionary process.
		The amount of information required to keep track of all the programs created during a run is quite voluminous and therefore is not saved in any way in GeneXproTools. However, since more and more people are using GeneXproTools for teaching purposes we’ll try and include this feature in a future version.
