Dear Gepsoft Team,
I think that the new Knowledge Base is quite promising. However, I have some difficulty understanding some of the Fitness Functions.
I was able to find some definitions in Wikipedia, but there are a few which are not listed there. I could not find a definition of Dual Margin, for example, and there are some others as well.
Is there some literature on this subject, which could be useful?
Thanks in advance!
The Dual Margin fitness function, and also the Margin with Penalty and Triple Margin fitness functions, were inspired by the idea of a large margin in Support Vector Machines. All these “margin” fitnesses are multi-objective in the sense that they try to optimize more than one constraint at the same time. For example, they all care for correct classifications and misclassifications, but they give more weight to correct classifications outside the margins (that is, as far away from the rounding threshold as possible) than inside the margins. Ideally this would allow the learning algorithm to find good classifiers with large margins and good generalizability. But of course the ultimate test is always how well the model does indeed generalize in the test set.
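To make the margin idea concrete, here is a minimal Python sketch. It is not the GeneXproTools definition of Dual Margin, Margin with Penalty, or Triple Margin; the function name, the 0.5 rounding threshold, the margin width, and the two weights are all assumptions chosen for illustration. It only shows the general principle: a correct classification earns more when the raw model output lies outside the margin band around the rounding threshold.

```python
def margin_weighted_score(outputs, labels, threshold=0.5, margin=0.25,
                          outside_weight=2.0, inside_weight=1.0):
    """Toy margin-style score in [0, 1] (illustrative only).

    outputs: raw model outputs, one per record
    labels:  actual classes (0 or 1), one per record
    All default parameter values are assumptions, not GeneXproTools settings.
    """
    if not labels:
        raise ValueError("need at least one record")
    score = 0.0
    for out, label in zip(outputs, labels):
        predicted = 1 if out >= threshold else 0
        if predicted != label:
            continue  # a misclassification earns nothing in this toy version
        distance = abs(out - threshold)
        # Correct and far from the threshold: full weight.
        # Correct but inside the margin band: reduced weight.
        score += outside_weight if distance >= margin else inside_weight
    # Normalize by the best possible total (every record correct and outside).
    return score / (outside_weight * len(labels))


example = margin_weighted_score([0.9, 0.6, 0.1, 0.4], [1, 1, 0, 1])
print(example)
```

In this example two records are correct and well outside the margin, one is correct but inside it, and one is misclassified, so the score is lower than plain accuracy would suggest for the "barely correct" record.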
Candida
Many thanks for your answer. To be honest, I do not know how to put the problem I face without exposing my ignorance...
I think the best way is to say straightforwardly that I just do not know what margin means in statistics. I also have trouble guessing what the entropy, purity, or rank measures could mean. I understand that it is not your task to explain basic statistics, so I would be very glad if you could suggest some literature, as I was unable to find a description of every fitness function in Wikipedia.
Many thanks for the help, and please forgive my ignorance.
Balazs
Hi Balazs,
Ignorance is not a problem in this case; we do say after all that GeneXproTools is for everyone and no math/statistics/programming is required. And I have no doubts that it’s really true.
But anyway, the purity and entropy fitness functions are very similar in terms of behavior. They both were borrowed from clustering, where they are used to measure the purity of a cluster, that is, they both measure the extent to which a cluster contains records of a single class. In GeneXproTools we are not using these measures for clustering, but the same idea applies: we measure the purity/entropy of the model output (the predicted class) compared to the actual values: if they are the same, their purity is max and the entropy is min and the model is a perfect one with 100% accuracy.
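As a rough illustration of the clustering-style measures, the sketch below treats each predicted class as a "cluster" and computes purity and entropy against the actual classes. This follows the usual textbook clustering definitions, not necessarily the exact formulas used in GeneXproTools; the function name is made up for this example.

```python
from collections import Counter
from math import log2


def purity_and_entropy(predicted, actual):
    """Clustering-style purity and entropy of predicted vs. actual classes.

    Each distinct predicted class is treated as one cluster.
    Returns (purity, entropy); purity is 1.0 and entropy is 0.0
    when every cluster contains records of a single actual class.
    """
    n = len(actual)
    if n == 0:
        raise ValueError("need at least one record")
    # Group the actual classes by the class the model predicted.
    groups = {}
    for p, a in zip(predicted, actual):
        groups.setdefault(p, []).append(a)

    purity = 0.0
    entropy = 0.0
    for members in groups.values():
        counts = Counter(members)
        size = len(members)
        # Purity: share of records that belong to the cluster's majority class.
        purity += max(counts.values()) / n
        # Entropy: size-weighted Shannon entropy of the class mix per cluster.
        cluster_entropy = -sum((c / size) * log2(c / size)
                               for c in counts.values())
        entropy += (size / n) * cluster_entropy
    return purity, entropy


print(purity_and_entropy([1, 1, 0, 0], [1, 1, 0, 0]))  # perfect model
print(purity_and_entropy([1, 1, 0, 0], [1, 0, 0, 0]))  # one mixed cluster
```

Note that in this plain clustering form a model that consistently swaps the two classes would also count as perfectly pure, since the measure only looks at how mixed each cluster is.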
So I guess now you can see how they can be used as fitness functions. We also combined these measures with the MSE (mean squared error) because in Information Theory circles it seems that people dream of finding some algebraic process of combining entropy with MSE. Well, in GeneXproTools that’s easy! And indeed, it seems that by combining the two, better models can be created than just by entropy alone. Which of course should not be surprising, at least in this context, because when we combine the entropy/purity with the MSE, we are exploring what I like to call the model structure dimension, not just the elements of the confusion matrix (TP, TN, FP, FN).
As for the Rank Fitness Function, it’s a very interesting measure that moves in a solution space similar to the ROC measure, which is based on the area under the ROC curve (AUC ROC).
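For readers unfamiliar with the ROC side of this comparison, the area under the ROC curve can itself be read as a ranking statistic: it is the fraction of (positive, negative) record pairs in which the positive record receives the higher model output, with ties counted as half. The sketch below computes that quantity directly. It illustrates the AUC only and is not the definition of the Rank Fitness Function; the function name is an assumption.

```python
def auc_by_ranking(scores, labels):
    """AUC computed as the share of correctly ordered positive/negative pairs.

    scores: raw model outputs, one per record
    labels: actual classes (1 = positive, 0 = negative)
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative record")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))


print(auc_by_ranking([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

A measure of this kind depends only on how the model orders the records, not on where the rounding threshold sits.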
We have plans to provide rigorous definitions of all these terms in the Knowledge Base, along with pseudocode for the fitness functions (it’s a big project and it will take some time). In the meantime, I think that a high-level definition of each fitness function, together with an understanding of the domain in which it operates (at the level of the confusion matrix only, or in the richer dimension of the model structure), is sufficient to explore them efficiently.
Candida