Why does the New Run Wizard ask me for two datasets?
  • I'm modeling a data set and when I step through the wizard, I load an initial file and then I'm asked to load a 2nd file (I think it's the validation file)...

    My question is: Is there a way to load just one single file that holds both the training AND the validation data ?


    It feels awkward to have to split my file into two separate files. I've used other modeling apps and it's always been a single file, and then I split the data for training / validation within the app. It looks like on the General Settings tab I have an option to set the training records and validation test records, but why would I set this if I've already been forced to import two separate files?
  • The New Run Wizard gives you the option to load two datasets, but you don't have to. It covers two cases: when you have only one dataset, and when you already have separate training and validation/test datasets.

    When you have the training and validation data in different files
    In this case you should load the training data in the Training Data window of the New Run Wizard and load the validation data in the final step. This will create a run where each dataset in GeneXproTools corresponds to the data you loaded.

    When you have one file only
    In this case GeneXproTools assumes that you want to split the file into a Training Set and a Validation Set and does so automatically. You only need to load the data file in the second step of the wizard and then press Finish. 

    In this case GeneXproTools shuffles the data and then uses some heuristics to split it into two datasets. After the run is created you can change the size and type of the partitions by choosing the Data->Dataset Partitioning menu. Here is the Dataset Partitioning window:


    As you can see, there are three ways to split the data into two datasets: use the odd records for training and the even ones for validation; partition them in order (the top n records are used for training and the remaining ones for validation/test); or shuffle them, as GeneXproTools did by default. You can also change the size of the datasets.
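
To make the three options concrete, here is a rough Python sketch of what each partitioning scheme does to a list of records. It is only an illustration, not GeneXproTools code: the function name `partition`, its parameters, and the two-thirds default training size are made up for the example, and the heuristics GeneXproTools actually uses to size the partitions are not reproduced here.

```python
import random


def partition(records, mode="shuffled", n_train=None, seed=0):
    """Split records into (training, validation) lists.

    Hypothetical helper for illustration only. `mode` is one of
    "odd_even", "in_order" or "shuffled".
    """
    records = list(records)

    if mode == "odd_even":
        # Records are counted from 1, so odd records sit at even indices.
        return records[0::2], records[1::2]

    if n_train is None:
        # Arbitrary default for this sketch: about two thirds for training.
        n_train = (2 * len(records)) // 3

    if mode == "shuffled":
        # Shuffle first, then cut; a fixed seed keeps the split repeatable.
        random.Random(seed).shuffle(records)
    elif mode != "in_order":
        raise ValueError(f"unknown mode: {mode!r}")

    # The top n_train records go to training, the rest to validation/test.
    return records[:n_train], records[n_train:]


if __name__ == "__main__":
    data = list(range(1, 11))  # ten records, numbered 1..10
    print(partition(data, "odd_even"))
    print(partition(data, "in_order", n_train=7))
    print(partition(data, "shuffled", n_train=7))
```

Whatever the scheme, every record ends up in exactly one of the two datasets; only the assignment rule and the sizes change.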

    The Data Sampling in the Settings Panel
    This is a different feature of GeneXproTools. Here is a snippet of that part of the application:


    Sampling is only applied when GeneXproTools is creating models or visualizing them in the Results Panel. This is useful for bagging and mini-batch mode. This is also useful if you want to reserve part of the validation set for testing, for example, by choosing the top half of the validation set for validation and reserving the bottom half for testing at the very end.
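
As a small illustration of the last point, the sketch below splits a validation set into a top half kept for validation and a bottom half reserved for testing. The function name `split_validation_for_testing` is hypothetical, and the code only mimics the idea in plain Python; in GeneXproTools itself this is configured through the Data Sampling settings rather than in code.

```python
def split_validation_for_testing(validation_records):
    """Return (validation_half, test_half) from an ordered validation set.

    Hypothetical helper for illustration only: the top half is kept for
    validation during modeling, the bottom half is held back for a final test.
    With an odd number of records, the extra record goes to the test half.
    """
    records = list(validation_records)
    half = len(records) // 2
    return records[:half], records[half:]


if __name__ == "__main__":
    validation_half, test_half = split_validation_for_testing(range(1, 7))
    print(validation_half, test_half)
```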

    I hope this answers your question.

  • It DID answer my question, and so did the text "Select data source (entire dataset or training dataset)" within the wizard panel, which I saw once I opened my two eyes! :>

