Multi-Class Classification

Shivakumar

If I have a variable that has multiple classes, e.g. about 8-9 classes, and I want to predict it, how can I do this? In the data import, it dichotomizes the variable to '0' and '1'.

Shivakumar

Hi, As I understand it the software allows one to discretize the predicted variable into '0' & '1' and we can choose which value we want to use. So if there are 4 classes that need to be predicted, then I can dichotomize it 4 different times and run it.

However, the desired end outcome is to classify a new set of records into these 4 classes. How do we do that? The separate model for each of the 4 classes will operate independently and predict whether a record belongs to its specific class or not. A record could therefore be shown as belonging to more than 1 class after the prediction. How is this resolved?

Dr. Candida Ferreira

Hi,

Yes, you’re right. If you have 4 classes then you have to create 4 different models, one for each class, and then combine the models to evaluate the final outcome.

You can create the n models in GeneXproTools either in a single run or in n different runs, but this is a matter of personal preference as both methods are quite straightforward. For simple problems, I usually create all my models under the same gep file, keeping the best model for a particular sub-task before proceeding to the next task by changing the Singled Out Class, and so on, until I have all my models, one for each class.

For more complex problems where I need to do a lot of model selection, I usually create and keep my models in n different gep files, but I use the same starting gep file and only do Save As before proceeding to the next sub-task and changing the Singled Out Class. Then when I’m done, I put all the models (one for each class) together in a gep file by importing them (the parameters of each model, such as the Rounding Threshold, don’t change when you import them, so they’ll continue to predict the same class).

Now that you have all your models, you can deploy them to Excel, for example, choosing the Probability[1] output. This way you will be able to evaluate quite easily the class a particular record belongs to by applying the argmax function to the probability outputs of all your n models. I show how to implement this in Excel in an example with the Satellite Images dataset with 6 classes. Both the final gep file with all the 6 models and the Excel file with all the calculations and formulas for evaluating the argmax function in Excel are available for download from the links below:

GeneXproTools file

Excel file
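For readers working outside Excel, here is a minimal Python sketch of the same argmax step. It assumes the Probability[1] outputs of the n models have already been collected into one list per record; the function name, class labels and probability values are made up for illustration and are not taken from the Satellite Images example.

```python
def predict_class(probabilities, labels):
    """Return the label whose one-vs-rest model gives the highest Probability[1]."""
    if len(probabilities) != len(labels):
        raise ValueError("need exactly one probability per class label")
    # argmax: index of the largest probability (the first one wins on ties)
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best]


# Hypothetical outputs of 4 one-vs-rest models for a single record.
labels = ["class_1", "class_2", "class_3", "class_4"]
record_probs = [0.62, 0.71, 0.08, 0.15]

print(predict_class(record_probs, labels))  # prints class_2
```

Note that two models (class_1 and class_2) both score above 0.5 here, so the record would be claimed by more than one class if each model were thresholded on its own; taking the argmax assigns it to a single class.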

Shivakumar

Thank you very much for the response. Makes a lot of sense.

Suppose I use an ensemble for each class: with 6 classes, I would have 6 ensembles and each ensemble would predict its own class. How would one then combine the results to predict the class membership of a record? I have a real-life study that I need to analyse and I am currently checking out various ensemble applications to see which one predicts best.

Thanks

Shiv

Dr. Candida Ferreira

Hi Shiv,

The process is quite straightforward and very similar to what I outlined above.

I would create the 6 ensembles in the Logistic Regression Platform, creating a different GeneXproTools file for each ensemble. Then I would deploy each ensemble to Excel choosing Probability[1] as the model output.

Then in Excel you have access both to the Average Probability Model (AvgProb) and Median Probability Model (MedianProb) for each ensemble. For each ensemble you then choose the one that performs best (you can also adjust the thresholds for both these models quite easily in Excel). Then you just have to copy/paste the output of each ensemble to a new spreadsheet in order to combine the 6 ensembles and then evaluate the class for each record using the argmax function as I explained above.
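The combination step can also be sketched in Python. This is only an illustration under stated assumptions: the per-model Probability[1] values of each ensemble are assumed to be available as plain lists, and the function names, class labels and numbers are hypothetical. It reduces each ensemble to its average or median probability and then applies argmax across the ensembles.

```python
from statistics import mean, median


def ensemble_probability(model_probs, method="avg"):
    """Collapse the member probabilities of one ensemble into a single value."""
    if method == "avg":
        return mean(model_probs)
    if method == "median":
        return median(model_probs)
    raise ValueError("method must be 'avg' or 'median'")


def classify(record, methods):
    """record: class label -> list of member probabilities for that ensemble.
    methods: class label -> 'avg' or 'median' (defaults to 'avg')."""
    scores = {
        label: ensemble_probability(probs, methods.get(label, "avg"))
        for label, probs in record.items()
    }
    # argmax over the combined probabilities of all ensembles
    return max(scores, key=scores.get), scores


# Hypothetical record scored by two 3-model ensembles.
record = {"class_1": [0.10, 0.20, 0.90], "class_2": [0.60, 0.50, 0.55]}
choice = {"class_1": "median", "class_2": "avg"}

label, scores = classify(record, choice)
print(label, scores)
```

Which of the two combiners to use for a given ensemble is a per-ensemble choice, as described above; the sketch simply takes that choice as an input.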

Candida Ferreira

Rebecca

Dr. Candida Ferreira

Hi Rebecca,

The first argument is correct. I explain how to do that in the comment above and I also include an example complete with the .gep file with all the models for each individual class and an Excel spreadsheet with all the calculations for evaluating the class using the argmax function.

Regarding your last question, the GEP models are indeed called from Excel using embedded macros that GeneXproTools automatically generates during Model/Ensemble Deployment to Excel. You can access the VBA code of the gepModels in the Excel worksheet by enabling the Developer tools and then selecting View Code in the Developer tab.

Hope this helps.

Best,

Candida

Rebecca

Dear Candida,

I have gone through the classification as you described. I have 3 classes. The data for each class are saved in three different text files. The problem is that when I try to import the data into GeneXproTools, I get the following error, since I included only one class for training:

“Unable to proceed: The response Variable has only 1 kind of class”

So how can I train a model separately for each class and its corresponding data? Is it feasible to add a sample from another class (e.g. just one sample) to solve the problem?

Thanks

Rebecca

Dr. Candida Ferreira

Hi Rebecca,

If you're solving a 3-class problem, your dataset should include records from all the 3 classes and GeneXproTools generates all the necessary datasets for you automatically so you don’t have to generate them yourself. These datasets are binary as GeneXproTools models each class separately both in the Classification Framework and in the Logistic Regression Framework (see the Knowledge Base article on Class Merging & Discretization). You then can combine the models as described above.
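To make the idea of the binary datasets concrete, here is a small Python sketch of one-vs-rest binarization. It is an illustration of the concept only, not GeneXproTools' actual implementation; the function name and the sample labels are hypothetical. It also shows why a file containing a single class cannot be used on its own.

```python
def one_vs_rest_targets(class_column):
    """Build one 0/1 target column per class from a multi-class column."""
    classes = sorted(set(class_column))
    if len(classes) < 2:
        # With a single class there is nothing to discriminate against.
        raise ValueError("the response variable has only 1 kind of class")
    return {
        c: [1 if value == c else 0 for value in class_column]
        for c in classes
    }


# Hypothetical response column containing records from all 3 classes.
column = ["A", "B", "C", "A", "B"]
for singled_out, target in one_vs_rest_targets(column).items():
    print(singled_out, target)
```

Each target column keeps every record: the singled-out class becomes 1 and all other classes become 0, which is why the three text files need to be merged into one dataset before import.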

Hope this helps.

Candida Ferreira

Rebecca

Dr. Candida Ferreira

Hi Rebecca,

Your first question:

1. After ensembling the model to Excel, there is a parameter called "Training Data Constant" which is the same for all of the five trained models! I know it is used when the model output is not a numeric value. But what is this parameter, how it is obtained and why it is fixed for all?

The "Training Data Constants" (either Prior Probability[1] or Majority Class, depending on the Model Output you chose for your models) are constants derived from the Training Data and therefore will have the same value if the same training data was used to create the models. So, the Prior Probability[1] is the proportion of positive cases in the training data (a value between 0 and 1); and the Majority Class equals 1 if there are more positive cases in the training data than negative cases and zero otherwise.
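As a rough sketch of these two definitions in Python (the function name and the sample targets are hypothetical, and the code follows only the description given here, not the generated Excel macros):

```python
def training_data_constants(targets):
    """targets: 0/1 values of the binary response in the training data."""
    if not targets:
        raise ValueError("training data is empty")
    positives = sum(targets)
    negatives = len(targets) - positives
    prior_probability_1 = positives / len(targets)
    # 1 only when positives strictly outnumber negatives, zero otherwise
    majority_class = 1 if positives > negatives else 0
    return prior_probability_1, majority_class


print(training_data_constants([1, 0, 0, 1, 0]))  # prints (0.4, 0)
```

Because both values depend only on the 0/1 response column, every model trained on the same binary training data gets the same constants.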

Your second question:

2. Using the logistic regression analysis, we obtain the following equation for each of the classes:

probabilityOne = 1.0 / (1.0 + exp(-(SLOPE * y + INTERCEPT)))

and for a given dataset, the probabilities for each class are compared to find the most probable class. I have gone through the manual to understand the philosophy behind this concept but it was not easy to grasp some technical parts such as:

- How SLOPE and INTERCEPT are calculated for each class?

All the details are given in the documentation for the Logistic Regression Framework:

Logistic Regression Analytics Platform

- The general formula for the logistic probability p is: p = 1 / (1 + exp(-(a*x + b))). So here x in the general equation is replaced with y (which is a nonlinear GEP-based equation in terms of the inputs)?

That's correct (this is also explained in the above mentioned tutorial).
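A minimal Python sketch of the formula, for anyone who wants to check values by hand. The slope and intercept used in the example call are made-up numbers, not values from any real model, and the function name is hypothetical. The branch on the sign of z is only there to avoid overflow in exp for large negative arguments; it computes the same quantity.

```python
import math


def probability_one(y, slope, intercept):
    """Logistic probability 1 / (1 + exp(-(slope * y + intercept)))."""
    z = slope * y + intercept
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Algebraically identical form that stays finite for very negative z
    e = math.exp(z)
    return e / (1.0 + e)


# y stands for the raw output of the GEP model for one record (hypothetical value).
print(probability_one(y=0.8, slope=2.5, intercept=-1.0))
```

Here y plays the role of x in the general logistic formula, with the slope as a and the intercept as b.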

Rebecca

I recently found out that a good paper on multi-class classification with GEP has been published by a group of researchers at MSU. It seems to be a sound reference for those interested in multi-class classification. This is the link: http://www.sciencedirect.com/science/article/pii/S026322411500679X
