Using sub-sampling
  • Hello,


    My question would be:

    I have a dataset for classification. Positive cases make up 21.81% of it and negative cases 78.19%.

    Is there any advantage in setting Balanced Random or Balanced Shuffled sub-sampling instead of using all the data? Is it known whether the relative proportion of positive and negative cases has any effect on the algorithm? Should mini-batch mode be used in this case?

    Thank you for the answer in advance!

  • Hi,

    It depends on what you want to do with your model. But in general, and for the distribution of classes you have, I wouldn’t use any of the Balanced schemes because the algorithm won’t have any problems with such a dataset. I only tend to use Balanced sub-sampling when the dataset is very unbalanced, say 1-5% of positives, and the models I’m getting are not satisfactory even after skewing the values in the cost matrix to suit my needs. It seems to me that it’s better to try to create models using class distributions that are as similar to the original as possible, because you’ll get better generalization. So I usually try to find a good combination of fitness function, rounding threshold, and cost matrix that allows me to do that. Only when I’m unable to find a good solution do I resort to a balanced sub-sampling scheme and see if the generalization I’m getting is any good. Sometimes, however, especially if the dataset is extremely unbalanced, using a balanced scheme is the only way to design a good model.
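
    As a rough illustration of how a rounding threshold and a cost matrix interact, here is a minimal Python sketch. It is not how the software computes anything internally; the function names, the two-entry cost matrix (`cost_fp`, `cost_fn`), and the scores and labels are all made up for the example.

```python
def total_cost(scores, labels, threshold, cost_fp=1.0, cost_fn=1.0):
    """Sum misclassification costs when scores >= threshold are rounded to 1."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 0:
            cost += cost_fp  # false positive
        elif predicted == 0 and label == 1:
            cost += cost_fn  # false negative
    return cost


def best_threshold(scores, labels, candidates, cost_fp=1.0, cost_fn=1.0):
    """Return the candidate threshold with the lowest total cost (first wins ties)."""
    return min(
        candidates,
        key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn),
    )


# Hypothetical model outputs and true classes, for illustration only.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1, 1]
candidates = [0.3, 0.5, 0.7]

equal_costs = best_threshold(scores, labels, candidates)
missed_positives_costly = best_threshold(scores, labels, candidates, cost_fn=5.0)
```

    With equal costs the highest candidate threshold wins on this toy data; making a missed positive five times as expensive pushes the choice down to the lowest one. That is the kind of skewing I mean, and it leaves the class distribution of the data untouched.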

    The balanced sub-sampling schemes are also useful if you’re creating ensembles, as blending models generated using different sampling methods is a good strategy for creating good random forests.
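
    To make the idea of a balanced random sub-sample concrete, here is a small Python sketch, assuming binary labels coded as 0 and 1. The function name `balanced_random_subsample` and its parameters are hypothetical; this only shows the general technique of drawing the same number of cases from each class, not the software's actual implementation.

```python
import random


def balanced_random_subsample(rows, labels, per_class, seed=None):
    """Draw per_class positives and per_class negatives without replacement."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if per_class > min(len(positives), len(negatives)):
        raise ValueError("per_class exceeds the size of the smaller class")
    chosen = rng.sample(positives, per_class) + rng.sample(negatives, per_class)
    rng.shuffle(chosen)  # mix the two classes so they are not grouped together
    return [rows[i] for i in chosen], [labels[i] for i in chosen]


# Toy dataset with roughly the class split from the question: 22 positives, 78 negatives.
labels = [1] * 22 + [0] * 78
rows = list(range(len(labels)))

sub_rows, sub_labels = balanced_random_subsample(rows, labels, per_class=20, seed=0)
```

    Each call with a different seed gives a different 50/50 sample, which is what makes this useful for building the members of an ensemble. Note that most of the negative cases are left out of any single sample.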

    As for the mini-batch mode, I would only use it if I had a large dataset and needed to speed things up.

    It would be great to hear what experience other users have had.



  • Many thanks for the answer!

    I tried sub-sampling and it did not give better results.

    The problem at hand is not obvious, and it is difficult to find good models.

    I have had more luck with the complexity increase. I started with only one gene and allowed the models to grow to up to 7 genes. The results are still under evaluation, but this seems to be the best method so far.

    Anyway, many thanks for the answer!


