How to limit the number of used variables
  • Instead of using a custom fitness function and checking aModelInfo[1], is there an option in GeneXproTools to set the maximum number of variables in the final model?

    I couldn't find anything.

  • Version 6 will let you control the number and type of variables used directly. But for now there’s a cool way of controlling this that requires knowing how to manipulate Karva code and using the Sub-set Selection Modeling Strategy. I’ll try and explain this with an example.

    Let’s use the ConcreteStrength sample run. Suppose you wanted to create a model with only the 4 predictor variables with the highest correlations with the target. In this case, these would be d0, d3, d4, and d7 (you can see their R-squares in the Statistics Charts in the Data Panel).

    Now you need to create a starting model with a relatively high fitness value or, better yet, an R-square of around 0.5-0.6. (At this stage it doesn’t matter what variables are being used in the model.)

    Then, in the Change Seed Window, you tweak the seed so that only the variables you’re interested in are expressed. For example, if you have a term like d1-d3, try replacing the d1 by d0, d4 or d7 and check if the fitness remains more or less the same and then choose the best one. Do that for all the var nodes that are being expressed (the best way to do this is to open the Change Seed Window with the Model Panel behind, showing the Expression Trees).

    Now you need to replace all the non-expressed var terminals either by the vars you are interested in (d0, d3, d4, and d7, in this case) or constants (the best way to do this is again to open the Change Seed Window with the Model Panel behind, showing in this case the Karva code with the K-expressions so that you can see clearly the parts that are not being expressed).

    And you are now almost ready to use this model as seed (it’s a good idea to save the run at this stage because you might need to come back to this seed model).

    But first you need to set the Modeling Strategy to Sub-set Selection in the Genetic Operators Tab (this strategy will only use the variables that exist in a particular population). But there’s still the problem of the initial population with its initial random programs: you could be unlucky and get some good models with unwanted variables in them. If these models replace your seed or propagate (that’s why you needed a seed model with fairly high fitness), you’ll have to try again (just select the seed model again and click Continue).

    One way of increasing your chances of success (i.e., not losing your seed or having unwanted variables appear in the population) is to set the population size to just 10 programs (you can increase it later when you have a better seed).

    Have fun!

    Candida Ferreira

  • Thanks, Candida, for the detailed explanation.

    I tried this method, but as I'm testing a database with more than 200 variables, it's almost impossible to try all of them.

    It's better to wait for version 6 to have more control over the models.
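
With a couple of hundred variables, picking the predictors with the highest R-square against the target (the first step in the reply above) is tedious to do by eye. A minimal sketch of that ranking done outside GeneXproTools follows. It assumes the data has been exported as one list of values per variable; the function names and the tiny dataset are hypothetical, and R-square here means the squared Pearson correlation.

```python
def r_square(xs, ys):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # A constant column carries no information about the target.
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def top_k_variables(columns, target, k=4):
    """Return the k (name, r_square) pairs with the highest R-square.

    `columns` maps a variable name such as "d0" to its list of values.
    """
    scores = {name: r_square(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up data, only to show the call; real data would come from the run's dataset.
columns = {
    "d0": [1, 2, 3, 4, 5],
    "d1": [2, 2, 2, 2, 2],
    "d2": [5, 1, 4, 2, 3],
}
target = [2, 4, 6, 8, 10]

for name, score in top_k_variables(columns, target, k=2):
    print(name, round(score, 3))
```

The resulting short list is what you would then try to keep in the seed model; it does not replace checking the fitness after each substitution.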

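
The seed-editing steps in the reply above (find which part of the Karva code is expressed, then overwrite unwanted variables in the part that is not) can also be sketched in code. This is an illustration under stated assumptions, not GeneXproTools' own API: it handles a single gene written as dot-separated symbols, the `ARITY` table is a hypothetical function set, and anything not in that table is treated as a terminal. Unwanted variables that are still expressed are only reported, since swapping them changes the model and needs the manual fitness check described in the reply.

```python
import itertools
import re

# Hypothetical function set; extend it to match the functions used in the run.
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "Sqrt": 1}
VARIABLE = re.compile(r"d\d+")


def expressed_length(symbols):
    """Number of leading symbols that form the K-expression of the gene."""
    needed = 1  # the root still has to be filled
    for position, symbol in enumerate(symbols):
        # Each symbol fills one open slot and opens `arity` new ones.
        needed += ARITY.get(symbol, 0) - 1
        if needed == 0:
            return position + 1
    raise ValueError("gene is too short to form a complete expression")


def restrict_gene(gene, allowed):
    """Replace unwanted variables in the non-expressed region of a gene.

    Returns the edited gene and the unwanted variables that are still
    expressed (these have to be swapped by hand).
    """
    symbols = gene.split(".")
    cut = expressed_length(symbols)
    fill = itertools.cycle(sorted(allowed))

    def unwanted(symbol):
        return VARIABLE.fullmatch(symbol) is not None and symbol not in allowed

    still_expressed = [s for s in symbols[:cut] if unwanted(s)]
    rest = [next(fill) if unwanted(s) else s for s in symbols[cut:]]
    return ".".join(symbols[:cut] + rest), still_expressed


# Made-up gene: the first five symbols encode (d3 * d0) - d1.
gene = "-.*.d1.d3.d0.d5.d2"
edited, manual = restrict_gene(gene, {"d0", "d3", "d4", "d7"})
print(edited)
print(manual)
```

Because only terminals are swapped for other terminals, the gene stays structurally valid; the expressed part, and therefore the model's fitness, is left unchanged.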
