Log Odds Analysis

GeneXproTools Online Guide Learn how to use the 5 modeling platforms of GeneXproTools with the Online Guide

Last update: February 19, 2014

Log Odds and Logistic Regression
Logistic Fit Chart

Log Odds and Logistic Regression

The Log Odds Chart is central to the Logistic Regression Model: it is from this chart that the slope and intercept of the Logistic Regression Model are calculated. The algorithm is quite simple. As mentioned previously, it is quantile-based, and only a few additional calculations are required to evaluate the regression parameters.

So, based on the Quantile Table, one first evaluates the odds ratio for each bin, that is, the ratio of the positives rate of the bin to its negatives rate (all the values are listed on the Log Odds Table under Odds Ratio). Then the natural logarithm of this ratio, the Log Odds, is evaluated (these values are also shown on the Log Odds Table, under Log Odds).

Note, however, that the log odds cannot be evaluated for bins with zero positive cases, because the odds ratio of such a bin is zero and the natural logarithm of zero is undefined. Although rare for large datasets, this can sometimes happen, and it is easily fixed with standard techniques. GeneXproTools handles it with a slight modification to the Laplace estimator to get what is called a complete Bayesian formulation with prior probabilities. In essence, this means that when a particular Quantile Table has bins with only negative cases, we do the equivalent of priming all the bins with a very small amount of positive cases.

The formula GeneXproTools uses in the evaluation of the Positives Rate values p_i for all the quantiles is the following:

$$p_i = \frac{Q_i + \mu P}{T_i + \mu}$$

where μ is the Laplace estimator that in GeneXproTools has the value of 0.01; Q_i and T_i are, respectively, the number of Positive Cases and the number of Total Cases in bin i; and P is the Average Positive Rate of the whole dataset.
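As an illustration, the calculation can be sketched in Python. This is a minimal sketch and not GeneXproTools code: it assumes the adjusted rate has the prior-weighted form p_i = (Q_i + μP) / (T_i + μ), which is consistent with the symbols defined above, and the function name and the four-bin quantile table are hypothetical.

```python
import math

MU = 0.01  # Laplace estimator value quoted in the text


def adjusted_positive_rate(positives, total, average_rate, mu=MU):
    """Assumed form of the adjusted rate: (Q_i + mu * P) / (T_i + mu)."""
    return (positives + mu * average_rate) / (total + mu)


# Hypothetical quantile table: (positive cases Q_i, total cases T_i) per bin.
# The first bin has no positive cases, so its raw log odds would be undefined.
bins = [(0, 50), (5, 50), (20, 50), (45, 50)]

# Average Positive Rate P of the whole (hypothetical) dataset.
P = sum(q for q, _ in bins) / sum(t for _, t in bins)

rates = [adjusted_positive_rate(q, t, P) for q, t in bins]
odds = [p / (1.0 - p) for p in rates]   # odds of a positive case in each bin
log_odds = [math.log(o) for o in odds]  # natural logarithm of the odds

for (q, t), p, lo in zip(bins, rates, log_odds):
    print(f"Q={q:2d} T={t:2d} rate={p:.6f} log odds={lo:.4f}")
```

With the adjustment, the bin without positive cases gets a very small but nonzero rate, so its logarithm can be evaluated. The sketch applies the adjustment unconditionally, whereas the text describes it as a fix for tables that contain bins with only negative cases.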

So, in the Log Odds Chart, the Log Odds values (adjusted or not with the Laplace strategy) are plotted on the Y-axis against the Model Output on the X-axis. As for Quantile Regression, there are special rules to follow here, depending on whether the predominant class is “1” or “0” and whether the model is normal or inverted. To be precise, the Log Odds are plotted against the Model Upper Boundaries if the predominant class is “1” and the model is normal, or if the predominant class is “0” and the model is inverted; they are plotted against the Model Lower Boundaries if the predominant class is “1” and the model is inverted, or if the predominant class is “0” and the model is normal.
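The four cases above reduce to a single rule, which might be sketched as follows. The function name and its arguments are hypothetical and only encode the rule as stated in the text.

```python
def boundary_for_log_odds(predominant_class, inverted):
    """Return which bin boundary ('upper' or 'lower') the Log Odds are plotted against.

    predominant_class: 1 or 0, the predominant class of the dataset.
    inverted: True for an inverted model, False for a normal one.
    """
    if predominant_class not in (0, 1):
        raise ValueError("predominant_class must be 0 or 1")
    # Upper boundaries: class 1 with a normal model, or class 0 with an inverted model.
    use_upper = (predominant_class == 1) != inverted
    return "upper" if use_upper else "lower"


for cls in (1, 0):
    for inv in (False, True):
        kind = "inverted" if inv else "normal"
        print(f"class {cls}, {kind} model: {boundary_for_log_odds(cls, inv)} boundaries")
```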

Then a weighted linear regression is performed and the slope and intercept of the regression line are evaluated. And these are the parameters that will be used in the Logistic Regression Equation to evaluate the probabilities.
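A weighted least-squares fit of this kind can be sketched with the closed-form formulas below. The text does not say which weights are used, so weighting each bin by its number of cases is an assumption here, as are the function name and the data, which were chosen to lie exactly on a line so the result is easy to check.

```python
def weighted_linear_fit(xs, ys, ws):
    """Weighted least squares for y = a*x + b; returns (slope, intercept, r_square)."""
    sw = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / sw
    mean_y = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Weighted R-square: 1 - (weighted residual sum of squares / weighted total).
    ss_res = sum(w * (y - (slope * x + intercept)) ** 2 for w, x, y in zip(ws, xs, ys))
    ss_tot = sum(w * (y - mean_y) ** 2 for w, y in zip(ws, ys))
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical bins: model output boundary, log odds, and number of cases (assumed weight).
model_outputs = [-2.0, -1.0, 0.0, 1.0, 2.0]
log_odds = [0.8 * x - 0.3 for x in model_outputs]
cases_per_bin = [10, 20, 40, 20, 10]

a, b, r_square = weighted_linear_fit(model_outputs, log_odds, cases_per_bin)
print(f"slope a = {a:.4f}, intercept b = {b:.4f}, R-square = {r_square:.4f}")
```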

The regression line can be written as:

$$\ln\left(\frac{p}{1-p}\right) = a x + b$$

where p is the probability of being “1”; x is the Model Output; and a and b are, respectively, the slope and intercept of the regression line. GeneXproTools draws the regression line and shows both the equation and the R-square in the Log Odds Chart.

Solving the logistic equation above for p gives:

$$p = \frac{1}{1 + e^{-(a x + b)}}$$

which is the formula for evaluating the probabilities with the Logistic Regression Model. The probabilities estimated for each case are shown in the Logistic Fit Table.
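As a minimal sketch, the probability p = 1 / (1 + e^(-(ax + b))) can be evaluated as below. The function name and the slope and intercept values are hypothetical; the two-branch form is only a common way to avoid overflow for large negative arguments and is not taken from GeneXproTools.

```python
import math


def logistic_probability(x, a, b):
    """Probability of class 1 for model output x, with slope a and intercept b."""
    z = a * x + b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Algebraically identical form that avoids overflow when z is very negative.
    ez = math.exp(z)
    return ez / (1.0 + ez)


a, b = 0.8, -0.3  # hypothetical regression parameters
x = 1.5
p = logistic_probability(x, a, b)

# Round trip: the log odds of p recover the regression line a*x + b.
recovered = math.log(p / (1.0 - p))
print(f"p = {p:.4f}, log odds = {recovered:.4f}, a*x + b = {a * x + b:.4f}")
```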

Besides the slope and intercept of the Logistic Regression Model, another useful and widely used parameter is the exponential of the slope, usually represented by Exp(slope). It describes the proportionate rate at which the predicted odds ratio changes with each successive unit of x: a one-unit increase in x multiplies the predicted odds by Exp(slope). GeneXproTools also shows this parameter both in the Log Odds Chart and in the companion Log Odds Stats Report.
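A short numerical check makes the meaning of Exp(slope) concrete. The slope and intercept values below are hypothetical, and the helper name is only illustrative.

```python
import math

a, b = 0.8, -0.3  # hypothetical slope and intercept


def predicted_odds(x):
    """Odds p / (1 - p) implied by the regression line ln(p / (1 - p)) = a*x + b."""
    return math.exp(a * x + b)


exp_slope = math.exp(a)
ratio = predicted_odds(2.0) / predicted_odds(1.0)  # effect of one extra unit of x

print(f"Exp(slope) = {exp_slope:.4f}, odds(2) / odds(1) = {ratio:.4f}")
```

The ratio is the same for any pair of outputs that are one unit apart, which is why a single number summarizes the effect.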

Logistic Fit Chart

The Logistic Fit Chart is a very useful graph that allows a quick visualization not only of how good the Logistic Fit is (the shape and steepness of the sigmoid curve are excellent indicators of the robustness and accuracy of the model), but also of how the model outputs are distributed across the model range.

The blue line (the sigmoid curve) on the graph is the logistic transformation of the model output x, using the slope a and intercept b calculated in the Log Odds Chart, and is evaluated by the already familiar formula for the probability p:

$$p = \frac{1}{1 + e^{-(a x + b)}}$$

Since the proportion of Positive responses (1’s) and Negative responses (0’s) must add up to 1, both probabilities can be read on the vertical axis on the left. Thus, the probability of “1” is read directly on the vertical axis; and the probability of “0” is the distance from the line to the top of the graph, which is 1 minus the axis reading.

But there’s still more information on the Logistic Fit Chart. By plotting the dummy data points, which consist of up to 1000 randomly selected model scores paired with dummy random ordinates, one can clearly visualize how model scores are distributed. Are they all clumped together, or are they finely distributed, which is the telltale sign of a good model? This is valuable information both to guide the modeling process (in choosing model architecture and composition, and in exploring the different fitness functions and class encodings that you can use to model your data) and to sharpen one’s intuition and knowledge about the workings of learning evolutionary systems.
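The construction of the dummy data points might be sketched as follows. Only the limit of 1000 points comes from the text; the function name, the seed, and the choice of ordinates drawn uniformly from [0, 1) are assumptions.

```python
import random


def dummy_points(scores, max_points=1000, seed=0):
    """Pair up to max_points randomly selected model scores with random ordinates."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    scores = list(scores)
    if len(scores) > max_points:
        scores = rng.sample(scores, max_points)  # sampling without replacement
    # The ordinate carries no information; it only spreads the points vertically.
    return [(score, rng.random()) for score in scores]


# Hypothetical model scores for 5000 cases.
all_scores = [i / 100.0 for i in range(5000)]
points = dummy_points(all_scores)
print(len(points), points[:3])
```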

Indeed, browsing through the different models created in a run might prove both insightful and great fun. And you can do that easily as all the models in the Run History are accessible through the Model selector box in the Logistic Regression Window. Good models will generally allow for a good distribution of model outputs, resulting in a unique score for each different case. Bad models, though, will usually concentrate most of their responses around certain values and consequently are unable to distinguish between most cases. These are of course rough guidelines as the distribution of model outputs depends on multiple factors, including the type and spread of input variables and the complexity of the problem. For example, a simple problem may be exactly solved by a simple step function.
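One rough way to put a number on this is the fraction of distinct scores a model produces. This metric is not part of GeneXproTools; the function name and the rounding tolerance are assumptions, and, as noted above, a low value is not necessarily a sign of a bad model.

```python
def distinct_score_fraction(scores, ndigits=6):
    """Fraction of distinct model scores after rounding to ndigits decimals."""
    rounded = [round(s, ndigits) for s in scores]
    return len(set(rounded)) / len(rounded)


spread_out = [0.11, 0.25, 0.42, 0.58, 0.73, 0.91]  # every case has its own score
clumped = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]           # step-like model, two values only

print(distinct_score_fraction(spread_out), distinct_score_fraction(clumped))
```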

Below is a Gallery of Logistic Fit Charts typical of intermediate models generated during a GeneXproTools run. It was generated using the same models used to create the twin ROC Curve Gallery presented in the ROC Analysis section. The models were created for a risk assessment problem with a training dataset of 18,253 cases, using a small population of just 30 programs. The Classification Accuracy, the R-square, and the Area Under the ROC Curve (AUC) of each model, as well as the generation at which they were discovered, are also shown as illustration. From top to bottom, they are as follows:

Generation 0, Accuracy = 65.33%, R-square = 0.0001, AUC = 0.5273
Generation 5, Accuracy = 66.03%, R-square = 0.0173, AUC = 0.5834
Generation 59, Accuracy = 66.92%, R-square = 0.0421, AUC = 0.6221
Generation 75, Accuracy = 68.99%, R-square = 0.1076, AUC = 0.7068
Generation 155, Accuracy = 69.93%, R-square = 0.1477, AUC = 0.7597
Generation 489, Accuracy = 74.15%, R-square = 0.2445, AUC = 0.7968

Besides its main goal, which is to estimate the probability of a response, the Logistic Regression Model can also be used to make categorical or binary predictions. From the logistic regression equation introduced in the previous section, we know that when a Positive event has the same probability of happening as a Negative one, the log odds term in the logistic regression equation becomes zero, giving:

$$a x + b = 0 \quad\Rightarrow\quad x = -\frac{b}{a}$$

where x is the model output at the Logistic Cutoff Point; and a and b are, respectively, the slope and the intercept of the regression line.
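As a small sketch, the cutoff follows from setting the log odds a·x + b to zero. The function name and the parameter values are hypothetical.

```python
import math


def logistic_cutoff(a, b):
    """Model output at which Prob[1] = 0.5, i.e. where a*x + b = 0."""
    if a == 0:
        raise ValueError("a zero slope gives a constant probability and no cutoff")
    return -b / a


a, b = 0.8, -0.3  # hypothetical slope and intercept
cutoff = logistic_cutoff(a, b)
p_at_cutoff = 1.0 / (1.0 + math.exp(-(a * cutoff + b)))
print(f"cutoff = {cutoff:.4f}, Prob[1] at cutoff = {p_at_cutoff:.4f}")
```

Note that scores above the cutoff have Prob[1] above 0.5 only when the slope is positive; with a negative slope the direction is reversed.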

The Logistic Cutoff Point can be used to evaluate a Confusion Matrix (in the Logistic Regression Window it is called Logistic Confusion Matrix to distinguish it from the ROC Confusion Matrix), in which model scores with Prob[1] higher than or equal to 0.5 are classified as Positive cases and all others as Negative.
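The classification rule can be sketched as below. The function names, the slope and intercept, and the five cases are hypothetical; only the rule that Prob[1] of 0.5 or more means Positive comes from the text.

```python
import math


def probability_of_one(x, a, b):
    """Logistic probability of class 1, written to avoid overflow."""
    z = a * x + b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logistic_confusion_matrix(scores, targets, a, b):
    """Count TP, FP, TN and FN, predicting Positive when Prob[1] >= 0.5."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for x, actual in zip(scores, targets):
        predicted = 1 if probability_of_one(x, a, b) >= 0.5 else 0
        if predicted == 1:
            counts["TP" if actual == 1 else "FP"] += 1
        else:
            counts["TN" if actual == 0 else "FN"] += 1
    return counts


# Hypothetical model scores and actual classes; with a = 2 and b = -1 the cutoff is 0.5.
scores = [-1.0, 0.2, 0.5, 0.9, 2.0]
targets = [0, 1, 1, 0, 1]
matrix = logistic_confusion_matrix(scores, targets, a=2.0, b=-1.0)
print(matrix)
```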

In the Logistic Fit Table, GeneXproTools shows the Most Likely Class, the Match, and Type values of the Logistic Confusion Matrix (you can see the graphical representation of the Logistic Confusion Matrix in the Confusion Matrix Tab). For easy visualization, the model output closest to the Logistic Cutoff Point is highlighted in light green in the Logistic Fit Table. Note that the exact value of the Logistic Cutoff Point is shown in the companion Logistic Fit Stats Report.