Classification for tradeables

balintba

Hello,

I am trying to develop profitable trading strategies for the foreign exchange market with your software. I keep running into a problem that seems difficult, or at least time-consuming, to overcome.

I have created an Excel file in which each record represents a trading day. The variables include the open/high/low/close of the current trading day and the same figures for the 2 previous days. I also have the following derived variables: True Range (High - Low), Average True Range (with periods 5, 10, 20, 50), Moving Averages (with the same periods), and the percentage price change for today, yesterday, and the last 5, 10, and 20 days.
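For readers who want to reproduce the derived variables outside Excel, here is a minimal sketch in plain Python. It assumes the simplified True Range from the post (High - Low) and a simple moving average for the ATR; the function names and the sample prices are hypothetical, not taken from the actual data.

```python
from statistics import mean

def true_range(high, low):
    # Simplified true range as defined in the post: High - Low.
    return [h - l for h, l in zip(high, low)]

def rolling_mean(values, period):
    # Simple moving average; None until a full window is available.
    out = []
    for i in range(len(values)):
        if i + 1 < period:
            out.append(None)
        else:
            out.append(mean(values[i + 1 - period:i + 1]))
    return out

def pct_change(close, lag):
    # Percentage change of the close versus the close `lag` days earlier.
    out = []
    for i in range(len(close)):
        if i < lag:
            out.append(None)
        else:
            out.append((close[i] - close[i - lag]) / close[i - lag] * 100.0)
    return out

# Hypothetical sample prices, for illustration only.
high = [1.10, 1.12, 1.11, 1.13, 1.15, 1.14]
low = [1.08, 1.09, 1.09, 1.10, 1.12, 1.11]
close = [1.09, 1.11, 1.10, 1.12, 1.14, 1.12]

tr = true_range(high, low)
atr5 = rolling_mean(tr, 5)    # Average True Range, period 5
ma5 = rolling_mean(close, 5)  # Moving Average, period 5
chg1 = pct_change(close, 1)   # today's percentage price change
```

The other periods (10, 20, 50) follow by changing the `period` and `lag` arguments.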

The classification criterion: if we had bought the tradeable today, which level would the price have reached first? If it reaches Position Open Price + 3 * Average True Range (period 5) first, the classification value is 1. If it reaches Position Open Price - 1 * Average True Range (period 5) first, the classification value is 0.

This criterion means that in the first case we would have made a profit, and in the second we would have ended up with a loss. Under this strategy, the average profit is 3 times the average loss.
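The labelling rule could be sketched roughly as follows. This is only an illustration under assumptions of mine: the bars passed in are the ones after the position was opened, a bar that touches both levels is counted as a loss (daily data cannot tell which came first), and `label_day` is a hypothetical name.

```python
def label_day(entry, atr5, future_highs, future_lows, up_mult=3.0, down_mult=1.0):
    """Return 1 if the profit level is reached first, 0 if the loss level
    is reached first, and None if neither is reached in the given bars."""
    target = entry + up_mult * atr5   # Position Open Price + 3 * ATR(5)
    stop = entry - down_mult * atr5   # Position Open Price - 1 * ATR(5)
    for high, low in zip(future_highs, future_lows):
        # Assumption: when both levels lie inside the same bar,
        # the conservative choice is to treat it as a loss.
        if low <= stop:
            return 0
        if high >= target:
            return 1
    return None

# Hypothetical example: entry 1.1000, ATR(5) = 0.0100,
# so the target is 1.1300 and the stop is 1.0900.
winner = label_day(1.1000, 0.0100,
                   future_highs=[1.105, 1.120, 1.131],
                   future_lows=[1.095, 1.100, 1.110])
loser = label_day(1.1000, 0.0100,
                  future_highs=[1.105, 1.110],
                  future_lows=[1.095, 1.089])
print(winner, loser)  # 1 0
```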

I really hope I have described the problem properly; please forgive my bad English...

Now:

GeneXPro generates a lot of models that share a similar characteristic: over long periods of time, almost every record is classified as 1, or almost every record as 0 (please see the attachment). This is very bad, because if I traded according to such a rule set, the chance of losing my money would be pretty high. I would like to avoid models that classify every record as positive or negative in long runs.

Is there any built-in fitness function that prevents such classifications? I have had the most luck with Hinge Loss, Maximum Likelihood and Triple Margin, but if I let the program run for a considerable amount of time, even these fitness functions are prone to finding similar solutions. It would be nice to find a solution whose results are more evenly distributed, even if they are less accurate. (I have approx. 75% negatives and only approx. 25% positives in my sample, which means that a very primitive function that classifies every record as negative returns high accuracy.)
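To make the imbalance problem concrete, here is a small sketch in plain Python. It is not a GeneXproTools fitness function and does not use its API; it only compares plain accuracy with balanced accuracy (the mean of sensitivity and specificity) and measures the longest run of identical predictions on a hypothetical 75/25 sample.

```python
def confusion(actual, predicted):
    # Count the four cells of the confusion matrix.
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(actual, predicted):
    tp, tn, fp, fn = confusion(actual, predicted)
    return (tp + tn) / len(actual)

def balanced_accuracy(actual, predicted):
    # Mean of sensitivity and specificity; a constant model scores 0.5.
    tp, tn, fp, fn = confusion(actual, predicted)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (sensitivity + specificity) / 2

def longest_run(predicted):
    # Length of the longest streak of identical consecutive predictions.
    best = current = 0
    previous = None
    for p in predicted:
        current = current + 1 if p == previous else 1
        best = max(best, current)
        previous = p
    return best

# Hypothetical sample: 75% negatives, 25% positives.
actual = [0, 0, 0, 1] * 25
all_negative = [0] * 100

print(accuracy(actual, all_negative))           # 0.75
print(balanced_accuracy(actual, all_negative))  # 0.5
print(longest_run(all_negative))                # 100
```

A custom fitness could, as one possible design, reward balanced accuracy and subtract a penalty that grows with `longest_run`; how to weight the two would need experimenting.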

If there is no such built-in fitness function, does anyone have an idea how I should write one myself?

Does the rounding threshold have any influence on this?

I would appreciate hearing about other users' experience, as I have heard that I am not the only one trying to use the program for such problems.

Many thanks in advance!

Best regards,

Balazs

Dr. Candida Ferreira

Hi Balazs,

Great question. I’m not sure if I’ll be able to help without taking a look at the data.

There are two things that come to mind that you could try. The first is to try different values for the cost matrix. The second is to try the False Balance series of fitness functions, for they have very different behaviors compared to most others. These fitnesses were specially designed to force a balance between FP and FN, which is very hard to get if you don’t force it somehow. BTW, I designed this class of fitness functions to help in the evolution of good random forests.

Candida

balintba

Thank you very much for the answer, Candida!

I want to report that it was possible to develop acceptable models; please see the attachment. This model hopefully allows trading profitably in the other direction. It classifies positive values in roughly the same proportion as the risk-reward ratio of the strategy (10.92% TP - 29.36% FP in the training dataset and 10.45% TP - 26.66% FP in testing). If I am not mistaken, for every 1 $ invested per trade I could expect 0.1092*3 - 0.2936*1 = 0.034 $ in the training dataset and 0.1045*3 - 0.2666*1 = 0.047 $ in the testing dataset. Although both values are positive, I would not trade this, because the proportions are somewhat too close to each other.

On the other hand, if I traded the negative signals (a sell position instead of a buy), I would be able to realize more profit (13.39% FN - 49.50% TN in the testing dataset and 10.89% FN - 48.83% TN in the training dataset). This means 0.4883*1 - 0.1089*3 = 0.1616 $ in the training dataset and 0.4950*1 - 0.1339*3 = 0.0933 $ in the testing dataset, for each 1 $ invested per trade.
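The arithmetic above can be checked with a few lines of Python. This is only a sketch of the expectancy calculation as described in the post; the function name `expectancy` is hypothetical, and the rates are the confusion-matrix percentages quoted above.

```python
def expectancy(win_rate, loss_rate, win_size, loss_size):
    # Expected result per 1 $ invested per trade:
    # share of winning trades times the win size,
    # minus share of losing trades times the loss size.
    return win_rate * win_size - loss_rate * loss_size

# Buying on positive signals: TP wins 3, FP loses 1.
buy_train = expectancy(0.1092, 0.2936, 3.0, 1.0)
buy_test = expectancy(0.1045, 0.2666, 3.0, 1.0)

# Selling on negative signals: TN wins 1, FN loses 3.
sell_train = expectancy(0.4883, 0.1089, 1.0, 3.0)
sell_test = expectancy(0.4950, 0.1339, 1.0, 3.0)

for name, value in [("buy train", buy_train), ("buy test", buy_test),
                    ("sell train", sell_train), ("sell test", sell_test)]:
    print(f"{name}: {value:.4f}")
```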

The distribution of positives and negatives is better. It is a pity that the profit factor deteriorated by approx. 43% in the testing dataset compared with the training dataset, but the signals are still valid and tradeable. That is probably market reality. Although the charts seem noisy and full of false classifications, I would not say that the model is bad. It is common knowledge that excess returns cannot be made on the market without having extra information over other market participants.

This model was developed with the Hinge Loss fitness function. False Balance returned good models as well, as you proposed; their evaluation takes a little more time.

May I ask what you would propose for the Cost & Gain Matrix? I used the following values: TP - 3, FP - 1, TN -1, FN - 3, as these are the returns I would realize in each true/false classification case.

I can of course send you the data, but I do not want to burden you... If you are interested in this kind of problem, however, please let me know.

It would be interesting to hear about other users' experience with such problems. Is it possible to develop better models?

Balazs

Dr. Candida Ferreira

For the cost matrix, and starting from the defaults of GeneXproTools, I would try moving the decimal separator 1 place to the right for the FP and FN. For example, if the defaults are:

TP = 0.355 TN = 0.645
FP = -0.645 FN = -0.355

try changing them to:

TP = 0.355 TN = 0.645
FP = -6.45 FN = -3.55

This will give you a sense of the direction they are moving and then you can adjust it to give you what you need.
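As a rough illustration of why scaling FP and FN changes the direction of the search, the sketch below scores a constant "always negative" model under both matrices. It assumes the score is a plain weighted sum of the confusion-matrix rates, which may not be how GeneXproTools actually computes its cost/gain fitness, and it uses a hypothetical sample with 75% negatives.

```python
def weighted_score(rates, matrix):
    # Assumption: score = sum over the four cells of rate * cost/gain value.
    return sum(rates[cell] * matrix[cell] for cell in ("TP", "TN", "FP", "FN"))

default_matrix = {"TP": 0.355, "TN": 0.645, "FP": -0.645, "FN": -0.355}
scaled_matrix = {"TP": 0.355, "TN": 0.645, "FP": -6.45, "FN": -3.55}

# Hypothetical model that classifies every record as negative
# on a sample with 75% negatives and 25% positives.
all_negative = {"TP": 0.0, "TN": 0.75, "FP": 0.0, "FN": 0.25}

default_score = weighted_score(all_negative, default_matrix)
scaled_score = weighted_score(all_negative, scaled_matrix)
print(round(default_score, 5))  # 0.395
print(round(scaled_score, 5))   # -0.40375
```

Under this assumption the constant model goes from a positive score to a negative one, so it stops looking attractive once the errors are penalized more heavily.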

Also important is that if you’re using the ROC threshold or the ROC fitness with such a skewed cost matrix, you’ll usually get better results if you use another threshold such as the average threshold.

Another thing that you could try is the Avg2 linking function. I found that it works slightly better than addition with time series data.

And yes, it would be great if you could send me the gep file for me to take a look.

The Cost Matrix entry on the Knowledge Base has some additional information that you might find useful:

http://www.gepsoft.com/genexprotools/FitnessFunctions/CostMatrix.htm

Candida
