Dataset Partitioning & Sub-sampling

Home

About Us

Contact

Blog


What's New	Products	Buy Now	Downloads	Forum

GeneXproTools Online Guide Learn how to use the 5 modeling platforms of GeneXproTools with the Online Guide

Last update: February 19, 2014

Dataset Partitioning & Sub-sampling

Through the Dataset Partitioning Window and the sub-sampling schemes in the General Settings Tab, GeneXproTools allows you to split your data into different datasets that can be used to:

Create the models (the training dataset or a sub-set of the training dataset).
Check and select the models during the design process (the validation dataset or a sub-set of the validation dataset).
Test the final model (a sub-set of the validation set reserved for testing).

Of all these datasets, the training dataset is the only one that is mandatory as GeneXproTools requires data to create data models. The validation and test sets are optional and you can indeed create models without checking or testing them. Note however that this approach is not recommended and you have indeed better chances of creating good models if you check their generalization regularly not only during model evolution but also during model selection. However if you don’t have enough data, you can still create good models with GeneXproTools as the learning algorithms of GeneXproTools are not prone to overfitting the data. In addition, if you are using GeneXproTools to create random forests, the need for validating/testing the models of the ensemble is less important as ensembles tend to generalize better than individual models.

Training Dataset

The training dataset is used to create the models, either in its entirety or as a sub-sample of the training data. The sub-samples of the training data are managed in the Settings Panel.

GeneXproTools supports different sub-sampling schemes, such as bagging and mini-batch. For example, to operate in bagging mode you just have to set the sub-sampling to Random. In addition, you can also change the number of records used in each bag, allowing you to speed up evolution if you have enough data to get good generalization.

Besides Random sampling (which is done with replacement), you can also choose Shuffled (which is done without replacement); Balanced Random (used in classification and logistic regression runs where a sub-sample is randomly generated so that the proportion of positives and negatives are the same); and Balanced Shuffled (similar to Balanced Random, but with the sampling done without replacement).

All types of random sampling can be used in mini-batch mode, which is an extremely useful sampling method for handling big datasets. In mini-batch mode a sub-sampling of the training data is generated each p generations (the period of the mini-batch, which is adjustable and can be set in the Settings Panel) and is used for training. This way, for large datasets good models can be generated quickly using a high percentage of the records in the training data, without stalling the whole evolutionary process with a huge dataset that is used each generation. It’s important however to find a good balance between the size of the mini-batch and an efficient model evolution. This means that you’ll still have to choose an appropriate number of records in order to ensure good generalizability, which is true for all datasets, big and small. A simple rule of thumb is to see if the best fitness is increasing overall: if you see it fluctuating up and down, evolution has stalled and you need either to increase the mini-batch size or increase the time between batches (the period). The chart below shows clearly the overall upward trend in best fitness for a run in mini-batch mode with a period of 50.

The training dataset is also used for evaluating data statistics that are used in certain models, such as the average and the standard deviation of predictor variables used in models created with standardized data. Other data statistics used in GeneXproTools include: pre-computed suggested mappings for missing values; training data constants used in Excel worksheets in models and ensembles deployed to Excel; min and max values of variables when normalized data is used, and so on.

It’s important to note that when a sub-set of the training dataset is used in a run, the training data constants pertain to the data constants of the entire training dataset as defined in the Data Panel. For example, this is important when designing ensemble models using different random sampling schemes selected in the Settings Panel.

On the other hand when a sub-set of the training dataset is used to evolve a model, the model statistics or constants, if they exist, remain fixed and become an integral part of the evolved model. Examples of such model parameters include the evolvable rounding thresholds of classification models and logistic regression models.

In addition to random sampling schemes, GeneXproTools supports non-random sampling schemes, such as using just the odd or even records; the top half or bottom half records; and the top n or bottom n records.

Validation Dataset

GeneXproTools supports the use of a validation dataset, which can be either loaded as a separate dataset or generated from a single dataset using GeneXproTools partitioning algorithms. Indeed if during the creation of a new run a single dataset is loaded, GeneXproTools automatically splits the data into Training and Validation/Test datasets. GeneXproTools uses optimal strategies to split the data in order to ensure good model design and evolution. These default partition strategies offer useful guidelines, but you can choose different partitions in the Dataset Partitioning Window to meet your needs. For instance, the Odds/Evens partition is useful for times series data, allowing for a good split without losing the time dimension of the original data, which obviously can help in better understanding both the data and the generated models.

GeneXproTools also supports sub-sampling for the validation dataset, which as explained for the training dataset above, is controlled in the Settings Panel.

The sub-sampling schemes available for the validation data are exactly the same available for the training data, except of course for the mini-batch strategy which pertains only to the training data.

It’s worth pointing out, however, that the same sampling scheme in the training and validation data can play very different roles. For example, by choosing the Odds or the Evens, or the Bottom Half or Top Half for validation, you can reserve the other part for testing and only use this test dataset at the very end of the modeling process to evaluate the accuracy of your model.

Another ingenious use of all the random sampling schemes available for the validation set (Random, Shuffled, Balanced Random and Balanced Shuffled, with the last two available only for Classification, Logistic Regression and Logic Synthesis) consists of calculating the cross-validation accuracy of a model. A clever and simple way to do this, consists of creating a run with the model you want to cross-validate. Then you can copy this model n times, for instance by importing it n times. Then in the History Panel you can evaluate the performance of the model for different sub-samples of the validation dataset. The average value for the fitness and favorite statistic shown in the summary of the History models consist of the cross-validation results for your model. Below is an example of a 30-fold cross-validation evaluated for the training and validation datasets using random sampling of the respective datasets.

Test Dataset

The dividing line between a test dataset and a validation dataset is not always clear. A popular definition comes from modeling competitions, where part of the data is hold out and not accessible to the people doing the modeling. In this case, of course, there’s no other choice: you create your model and then others check if it is any good or not. But in most real situations people do have access to all the data and they are the ones who decide what goes into training, validation and testing.

GeneXproTools allows you to experiment with all these scenarios and you can choose what works best for the data and problem you are modeling. So if you want to be strict, you can hold out part of the data for testing and load it only at the very end of the modeling process, using the Change Validation Dataset functionality of GeneXproTools.

Another option is to use the technique described above for the validation dataset, where you hold out part of the validation data for testing. For example, you hold out the Odds or the Evens, or the Top Half or Bottom Half. This obviously requires strong willed and very disciplined people, so it’s perhaps best practiced only if a single person is doing the modeling.

The takeaway message of all these what-if scenarios is that, after working with GeneXproTools for a while, you’ll be comfortable with what is good practice in testing the accuracy of GeneXproTools models. We like to claim that GeneXproTools is not prone to overfitting and now with all the partitioning and sampling schemes of GeneXproTools you can develop a better sense of the quality of the models generated by GeneXproTools.

And finally, the same cross-validation technique described above for the validation dataset can be performed for the test dataset.

See Also: