System failure: Can we predict in advance if a seemingly 'robust' system will fail?

So,

NOTE: THIS THREAD HAS NOTHING TO DO WITH THE QUESTION OF WHETHER OR NOT P123 SHOULD REQUIRE ROBUSTNESS TESTS ON R2G SYSTEMS. PLEASE DON’T POST ANY COMMENTS ON THAT TOPIC HERE.

This thread is attempting to start, or see if there’s any interest in, a discussion on:
a) What are some inherent factors in system design that might cause seemingly robust systems to ‘fail’ dramatically out of sample? And -
b) What are some new / additional tests, beyond the ‘usual suspects’, that we might run to gain insight into system weakness in advance?

It is hoped that 1-2 ‘really good new ideas’ will emerge from this thread. If that happens, I’ll consider it a success.

I am beginning this thread specifically for new ideas for testing / predicting robustness on systems that have already passed a) even/odd testing, b) date shift testing, c) ranking system parameter shift sensitivity studies, d) 5,10 and 20 position size testing and e) ‘random’ in buy rules testing.

I am beginning this thread in light of the initial tests Olikea posted on this system that ‘failed’ out of sample:
https://www.portfolio123.com/app/opener/sim/search?searchIncludeSaved=1&searchUserName=olikea&quickSearch=1&searchName=Robust+Test

In that initial thread from Oliver, many people threw out (mostly) untested ideas about why they thought the system failed out of sample.

I ran some additional tests that I am sharing here:
https://docs.google.com/presentation/d/1x6C7j4KdiOeo4s1JQTnObBS_ydGB7CyCpcx90N9T-nM/pub?start=false&loop=false&delayms=3000

I believe that this topic, while an old one, may lead to some new ideas about whether, and how, we can know if any trading system we’ve built is worth allocating capital to, and whether we can come up with new/better ways of determining how much we might want to allocate to individual subsystems (although I haven’t really taken that on yet).

Best,
Tom

Tom, you’ve spent time on this and it shows. Thanks.

A note, though, about the universe having inherent alpha. Is it really possible for 3,000 stocks to have alpha? I think 3,000 stocks would qualify as a benchmark in itself - in this case a tradable micro/small/lower-mid cap benchmark adjusted for dividends.

I read somewhere (“Predictive Analytics” I think it was) where a quant fund manager had a quality check. He made lots of money for a number of years, then, when the quality check dropped below his threshold, he closed the fund. I have no idea what this quality check should consist of for a P123 model.

It is an interesting study and thank you for posting it.

A long time ago I argued that larger portfolios are more robust.

However, since that time I have come to a slightly different conclusion. It is not so much the number of holdings per se that is important, it is the number of transactions - i.e. completed trade cycles (bought/sold).

The front page of the original simulation shows the number all too clearly:

Realised Winners: 65/99

In other words, the model had a total of 99 trades.

Stock returns have a large random component, e.g. the director leaves, the building burns down, they discover a new drug. You can call this “noise”. The bit we are interested in is that which is not noise but something that gives us some insight into how well certain stock picks might do. We might call this our “signal”.

The thing about noise is that the law of large numbers says that as you include more and more observations (trades) the noise will tend to cancel itself out. The more you flip a coin, the closer the distribution comes to 50/50. Once you can get the noise down, what you are left with is the “signal”, something that actually works, that provides predictive value.
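As a rough illustration of this (a toy simulation, not anything from P123 - the per-trade “signal” and “noise” numbers are made up), averaging over more trades shrinks the random component while the edge stays put:

```python
import numpy as np

rng = np.random.default_rng(0)
signal = 0.005    # assumed true edge per trade: +0.5%
noise_sd = 0.15   # assumed random component per trade: 15% standard deviation

for n_trades in (99, 1000, 6000):
    # average return over n_trades simulated trades, repeated 2000 times
    runs = signal + noise_sd * rng.standard_normal((2000, n_trades))
    avg = runs.mean(axis=1)
    print(f"{n_trades:5d} trades: mean per-trade return {avg.mean():+.4f}, "
          f"spread across runs {avg.std():.4f}")
```

The spread shrinks roughly with 1/sqrt(n); the signal is what is left over.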

What I have noticed is that high turnover systems tend to do better “out of sample” than low turnover systems. The same goes for larger portfolio sizes. In some of the high turnover 20 stock systems I have, there are close to 6,000 realised transactions during the simulation, and these are the ones that have worked well out of sample.

I have also noticed that larger cap models tend to do better than smaller cap models, out of sample (all other things - turnover and holdings- equal). This might be because they are less volatile, so there is less “noise” in the system.

I would argue the original model is not robust because it was based on 99 trades. How many conclusions can be drawn from so few observations, especially among volatile stocks?

It is an interesting analysis of ranking weights that you did too. I think it actually goes even further than this - a lot of the factors in the ranking overlap with each other. While they may be slightly different flavours, they are likely to pick up similar stocks. So having more than one is like “double weighting” or “triple weighting” a factor even if you “distribute evenly”. I think this explains why the model still produces alpha (even at a much reduced level) during the in-sample period.

I guess robustness is about maximising the signal-to-noise ratio.

Perhaps one could simulate a number of random trades on a given universe that matches the turnover of the trading system and work out the standard deviation. Then there is the concept of “Standard Error”, which relates to the standard deviation by the factor 1/sqrt(n) - where n is the number of observations. We could look at the profit per trade and the standard error on the profit per trade, and clearly this must be related to the number of observations (number of trades).

Clearly, the standard error is much higher on a system with 99 trades, than with 6,000 trades.
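A minimal sketch of that calculation (the per-trade numbers here are hypothetical, not taken from the original sim):

```python
import numpy as np

def standard_error(trade_returns):
    """Standard error of the mean profit per trade: sd / sqrt(n)."""
    trades = np.asarray(trade_returns, dtype=float)
    return trades.std(ddof=1) / np.sqrt(len(trades))

rng = np.random.default_rng(1)
per_trade_sd = 0.12   # assumed volatility of a single trade in this universe
true_edge = 0.01      # assumed 1% true profit per trade

for n in (99, 6000):
    trades = true_edge + per_trade_sd * rng.standard_normal(n)
    print(f"n={n:5d}  mean profit per trade={trades.mean():+.4f}  "
          f"standard error={standard_error(trades):.4f}")
```

With 99 trades the standard error is on the same order as any realistic edge; with 6,000 trades it is not.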

It isn’t a perfect approach, because backtests inevitably have hindsight bias. Only the designer knows how much optimisation was required (which is effectively using the benefit of hindsight).

To me the “rule of thumb” approach is that you can have:

-low position count high turnover
-high position count low turnover
-medium position count, medium-low turnover, high market cap

But what you can’t have is:

-low position count, low turnover, low market cap

Thanks,

Oliver

In the past weeks I’ve spent some time trying out the ideas described in an article called “The Probability of Backtest Overfitting”: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2326253

The basic idea is pretty simple. You use all the models that you have backtested (*) and you get the equity curves for them. Then, you divide the backtest period into a number of subperiods, let’s say 14 individual years. Then you take a set of 7 years, and pretend this is in-sample. The other 7 years will be out-of-sample. Based on the 7 years in-sample, you select the “best” model, and see how it does in the 7 out of sample years. You also test all other models on those 7 out of sample years, and then you check how well your selected model did compared to all other models by computing its rank among all models. You repeat this for every possible set of 7 in-sample / out-of-sample years, which will give you a lot of samples (“14 choose 7” = 3432 in fact). If the selected model consistently outperforms the other models, then you’re on to something. This test also gives some insight into the performance drop you can expect out of sample, by comparing the in-sample performance with the out-of-sample performance for all of the 3432 samples.

(*): In theory you should use all models you ever backtested. This is impossible of course, especially if you’re trying to evaluate someone else’s model. Nevertheless, I think this approach can be used to gain insight into robustness.
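For anyone who wants to experiment with the idea, here is a bare-bones sketch of that combinatorial test (my own simplification of the paper, not their code - `yearly_returns` is a hypothetical models-by-years array of annual returns, and I score each model by its mean return):

```python
from itertools import combinations
import numpy as np

def out_of_sample_ranks(yearly_returns):
    """For every split of the years into two halves, pick the best model
    in-sample and record its percentile rank out-of-sample."""
    n_models, n_years = yearly_returns.shape
    years = range(n_years)
    ranks = []
    for in_sample in combinations(years, n_years // 2):    # "14 choose 7" = 3432 splits
        is_idx = list(in_sample)
        oos_idx = [y for y in years if y not in in_sample]
        is_perf = yearly_returns[:, is_idx].mean(axis=1)
        oos_perf = yearly_returns[:, oos_idx].mean(axis=1)
        best = is_perf.argmax()                            # the model you would have picked
        ranks.append((oos_perf < oos_perf[best]).mean())   # 1.0 = best of all, 0.0 = worst
    return np.array(ranks)

# Example with 50 made-up models over 14 years
rng = np.random.default_rng(2)
ranks = out_of_sample_ranks(rng.normal(0.08, 0.15, size=(50, 14)))
print("median out-of-sample rank of the selected model:", np.median(ranks))
```

If the models are pure noise, the median rank hovers around 0.5; a selection method with real predictive value keeps it well above that.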

I have not tried this on every model I could find. Instead, I created a lot of variations of the TF12 ranking system by using random weights for all the factors. My goal was not to test TF12 itself, but to evaluate which selection method works best: should we select the “best” model based on AR, DD, Sharpe, etc? Which metric or combination of metrics allows us to separate models with true alpha from the ones that are over-optimized and fail out of sample?

Some preliminary conclusions regarding model selection:

  • Selecting a model based purely on its annual return, or based on the Sharpe ratio, works best. It outperforms all other methods I tried.
  • Selecting a model that shows the lowest volatility or lowest drawdown is a very bad idea. (It might be the case that the Sharpe ratio works because of the return part, not the volatility part).
  • Selecting a model because it has the highest slope in the top half or top decile of the ranking system performance works reasonably well.
  • Selecting a model because it has the most consistent slope (best approximated by a straight line) in the top half or top decile of the ranking system performance works reasonably well as well.

Another thing I tested was the influence of portfolio sizes on out-of-sample performance. Roughly speaking, my results are consistent with what Tom and Oliver are saying: larger ports are much more robust. Ports of 5 or 10 positions do much worse than 20 position ports. I also found that more than 20 does not add much additional robustness though. (edit: I have not looked at turnover in relation to port size)

In nearly all runs I see that out of sample performance is lower than in sample performance. I don’t have exact numbers from my testing, and the tests take too long to re-run them now, but lower out of sample return is common. I need to spend some more time on investigating this.

I have also run the algorithm on all R2G models. A few models clearly dominate all others according to this test. However, all R2G models were specifically developed on this whole backtest period, and some models show great results on every possible subperiod (individual years). That does not convince me that those models will necessarily work well out of sample (they might, they might not). After this exercise, I realized that this method works best when you also “develop” the models in all of the 3432 in-sample periods. Of course no human will do that, but you could optimize the weights with some heuristics and then see whether fiddling with the weights hurts or helps. In a sense, my random weights were a very trivial form of this.

The main surprise for me was the fact that in-sample performance (AR, equivalent to looking at the top 1% bucket or so) works better than any of the other methods I tested, even better than looking at the slope of the ranking system.

Btw, if you have suggestions for other selection methods, I’d be happy to try them out and report the results back.

Hi Tom:

Great work – and very timely, as I’m revising the process I’ll be using to develop new ranking systems. Your conclusions have confirmed several steps I was planning to take. One major thing I’m planning to add to what you outline is to increase the “views” of the ranking results from 2 (odd and even universes) to 8 (by looking at 4 sub time periods separately). Ideally, I’d like to see the addition of a factor show improvement in all 8 views. Fewer than 8 will involve a judgment call. If it only makes 5 or 6 views better, then it is definitely suspect in my mind.

Here are the time periods I’m considering using:

Years 2000-2002 (3 years, sideways for small-cap value)
Years 2004-2006 (3 years, gentle bull)
Years 2010-2012 (3 years, gentle bull with a small dip in 2011)
Years 2003, 2009, 2013 (3 “rocket” years, a ranking system has to be really bad not to do well in these.)

I don’t plan to look at 1999 because my understanding is Compustat’s earnings estimate data is not complete for that year. Also, I’ll not be giving much weight to the 2007-2008 period because those are drawdown years, which could be moderated by adding a market timing system as the final stage of development.
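In pseudo-code, the check I have in mind looks something like this sketch (run_rank_test is just a hypothetical stand-in for whatever backtest metric you compare, not a P123 function):

```python
from itertools import product

universes = ["odd-ticker universe", "even-ticker universe"]
periods = ["2000-2002", "2004-2006", "2010-2012", "2003+2009+2013"]

def views_improved(run_rank_test, ranking_with_factor, ranking_without_factor):
    """Count how many of the 8 views improve when the candidate factor is added."""
    improved = 0
    for universe, period in product(universes, periods):
        if (run_rank_test(ranking_with_factor, universe, period)
                > run_rank_test(ranking_without_factor, universe, period)):
            improved += 1
    return improved   # I'd like 8 out of 8; 5 or 6 is a judgment call
```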

Any comments?

Brian

Tom,

Great post! I’m still trying to digest all of the information in order to find something to contribute.

After looking for about 10 minutes, I must say you are probably on to something with regard to the universe. I was initially awed by this wonderful ranking system. After downloading it and running it on my usual, more liquid universe, I’m trying to remember and recreate what I thought was so wonderful. Still good, but not something I haven’t seen before.

Anyway, great work and I will continue to look for something worth contributing.

Brian,

I am not sure the solution to the problem is more testing and more analysis. I think this is part of the problem to begin with.

Even if a super “robustness test” is created, we may test thousands of models until one happens to pass, and then go along with that.

The thing I have noticed with a lot of “robustness testing” is that people seem to do tests that make them “feel better” but actually have no value. It is more like a ritual. Odd/even tests, different time periods - I think it’s all highly dubious.

Let’s say we split up the universe into 8 or some other number of pieces. We then take a thousand trading systems, test them on each universe, and reject those that didn’t perform well in all of them.

How is this any different from just starting off with the “big” universe to begin with?

Similarly, another neat “trick” traders use is to consider an “in sample” period and set aside an “out of sample” period. They then optimise in-sample, and test on the “out of sample” period to verify the system really does work well on unseen data.

However, if you pick a million trading systems that choose stocks by throwing darts at a dartboard, a thousand may do great “in sample”, and of those thousand, maybe 10 will also do great in the “out of sample” period too. So those are the ones you trade, and you subsequently get crushed.

The separation of “in sample” and “out of sample” is really only in the imagination. If you reject all trading systems that do not perform well “out of sample” then really, all you have ended up doing is including your “out of sample” period in the “in sample”. There just comes a point where it doesn’t make any difference any more.

I think the solution involves going back to common sense. Going back to financial theory and starting off with a hypothesis about how a particular trading strategy might work - then testing it. Blindly testing thousands or millions of factors and permutations is sure to give you the most astonishing backtested performance you will ever find, but it’s going to fall apart.

The time spent testing “robustness” is better spent thinking about why trading strategies work, and why they don’t work. I had some (hard) lessons fairly early on. Buying low trailing P/E stocks often fails because the “E” is about to fall off a cliff. Projected E might give you a hint as to what is going to happen, but this also is at best a guess, and the analyst may be wildly optimistic or pessimistic.

I think a lot of the reason the TF-12 based sim fell apart is precisely because of the way the ranking was constructed. In fact the clue is in the name - TF - Top Factors, each found as a result of a large run of tests. Like a chef trying to cook a great meal, the best ingredients were chosen and mixed together, and the precise weights repeatedly tested over and over until the optimal result was found.

It isn’t a “terrible” ranking by any means, and on larger portfolios it has produced alpha in the out of sample period, and the factors do have some merit. But it massively overstated the potential returns.

Instead, I prefer to think more about the big picture: what are we trying to achieve, what is “value”, what is “momentum”. Rather than simply trying to combine many of the pre-canned factors together, I am more interested in writing custom formulae for valuation. I think that the way ranking works, with its fuzzy logic, can be unreliable and produce unreliable results. Case in point: one of my systems is near the top of the “low risk” rank. When you try to combine too many rank factors together, what you may end up with is a mish-mash of everything - not terrific in any one particular area, or great in one area and terrible in another, and so on. Instead, I think it may be a good idea to go back to good old-fashioned maths and try to come up with a better formula for valuation. I think I started this way back with 4th Gen, when all of my valuation metrics were custom formulas, and I have noticed that they have even made it into official P123 ranking systems.

Of course, this does not provide a prescription for finding robustness among black box models. But I think the approach “make a hypothesis, then test” is much better than “test, test, test”.

The latter will give you a much better backtest, the former is much more likely to work.

Oliver

I think you are supposed to optimize to your in-sample data, then check the results out-of-sample, just as a final sanity check. You’re not supposed to then optimize for this out-of-sample data set, because then it’s like optimizing for the entire data set to begin with, like you said Oliver.

When you go from even stocks to odd stocks, you’ve effectively doubled your data set, reducing the chances of overfitting. But then if you continue to optimize, including the new data, the chances of overfitting goes up again.

MisterChang - the point is that if the out of sample results are not good then you chuck the system and try again. Thus the out of sample is inadvertently optimized.
Steve

Mr. Chang,

Does optimization hurt your future returns? I fully understand that optimization will make it so that your future returns are not as good as your backtested results. Furthermore, adding new factors is dangerous. It has been my belief that optimizing factors that have already been tested out-of-sample probably will not hurt - it could hurt, but it could also help some (there is chance and probability in all of this). In any case, I wonder if anyone is aware of any study (or reason to think) that optimization actually harms results on average (as opposed to harming your ability to predict results). This would be optimization of factors that have already been included in your system.

“System failure: Can we predict in advance if a seemingly ‘robust’ system will fail?”
Quite simply, I don’t think so. Back when this system was created, who could have predicted the 2008 bear market, or the 2010 flash crash, or even the 2011 drop? Who knows what lies around the corner? You could spend a lot of time ruling out possible causes for failure and the future could then throw you something completely unexpected.

Rather than spending so much time on one strategy, I spend time trying to develop strategies that can profit in any market environment: bull market, bear market, crash. A crash is the toughest, but I’m trying out one strategy that has a good chance of profiting on a regular basis as well as profiting big during a crash, and I’m investigating another.

There are a number of ways to handle the type of breakdown in the example system: trade a number of different strategies (diversification truly is the holy grail), or compare real-time equity returns of a number of strategies and only trade the best - I do this in an informal manner. When I feel a strategy is underperforming my expectations, I take it out of rotation and bring in another strategy, or create a new one, or if I don’t have something better I leave the money sitting until I do. I did do a formal test rotating among a diverse set of strategies; it performed comparably to a strategy that was developed based on the top factors over the long term. If it were easier to test and implement, I probably would have traded this way. It’s very much like comparing a buy-and-hold portfolio of funds vs. a fund rotation strategy - a good rotation strategy will generally have higher risk-adjusted returns.

I’d say two factors:

One problem is to assume the context, environment, backdrop, or whatever you want to call it, is constant. That may be fine in some disciplines. When working with the markets, it is a major error and a powerful system killer. I gave an example in another thread of spectacular five-year backtest results I got for a fixed income CEF model. But I know with 100% certainty, or at least as near 100% as anything in markets can be, that it’s bogus and will implode out of sample since we’ll be switching from a falling interest-rate environment to one in which rates are stable (best case) or rising.

The other problem is designer expectations. Any alpha above zero is excellent, spectacular. It’s great to do better, but trying to push it is asking for trouble. All credible strategies run into cold spells now and then. Trying to design a system so that it wallops the market every year, to me, raises nothing but red flags.

A system that does OK a decent amount of the time is much more likely to reflect bona fide merit and do well out of sample than one that looks too good to be true.

I’m not sufficiently expert in technical analysis to address that, but on the fundamental side, most of the “usual suspects” achieve that status because over time, they really do tend to work well. But they must be applied sensibly. For example, Olikea2 pointed out that PE could be a problem if E is going into the tank. Right! That’s why PE should be accompanied by other factors that suggest the company more likely than not isn’t going into the tank. Also, we need to be sensitive to the frequent and often dreadful distortions that find their way into TTM EPS, which, of course, throw PE TTM out of whack. So accompany TTM PE with PECurrY, PENextY, P/Sales (a far more useful metric than many realize), etc. Watch out for growth rank factors – chances are the top few percentiles are filled with companies with whacky and clearly unsustainable growth numbers. Think like this about everything you use.


If you want a recommendation for a process, I propose this (it’s essentially the process I use):

  1. Decide why each factor/rule you use “should” work and do not use any numerical measurements, test results, etc. to support this. If you can’t explain in theory why an item should work, don’t use it.

  2. Evaluate backtest/sim results on a three-part scale (“Great,” “OK,” or “Oops”) and don’t take more than thirty seconds to do it, bearing in mind that any alpha above zero is spectacular.

  3. Repeat using various time horizons selected on the basis of different market environments and the likely relevance to what you expect in the market going forward. This is a very important step. (Aside from that, do not do any robustness testing.)

  4. Run a list of the stocks that make it into your model as of today.

  5. Go to the p123 data panels and/or whatever other source you like and evaluate each company/stock. Ask yourself: Is this stock consistent with the sort I’d hoped to find? Take negative answers as cues as to how and where your models need to be refined. (And going back to the usual suspects, this process will guide you in doing what it takes to work with them.) This is a critical step on which I spend a lot of time.

  6. If your test results show exceptionally bad episodes you can’t easily explain, get a historical list of stocks that passed as of that time, and look at the companies/stocks that especially damaged performance. This variation of Step 5 can be very revealing. BUT, but, but, but . . . do not do this for the max drawdown periods when everything went into the tank. There’s a lot more seeming-randomness to relative performance during crises and obsessing on those periods can tangle you up and make a mess that will hurt more than it helps over the long term. (If you expect another max drawdown, forget your regular model: go cash or into a short etf or a short model.)

  7. If/when you do encounter episodes of bad out-of-sample performance (it happens to EVERYBODY), review to determine whether this is simply a storm that needs to be weathered (as would be the case if the theory of your model remains sound and the market environment, culture, etc. remains consistent with what you assumed) or a structural development that requires revision of your strategy.

This may not produce eye-popping backtest/sim results. But it will likely produce for you more satisfying out-of-sample results.

Jim, I think optimization done right is good, and optimization done wrong is overfitting and bad. If you guys are talking about the optimizer in p123, then I don’t know because I’ve never used it. Maybe it’s just my terminology, but any changing of your model in an effort to make it better can be considered optimization. The unoptimized model is just your equal-weighted universe or index fund. I think the best models look good in backtest AND have justification for why they work. To this point, I wonder if there are diminishing returns here, where all the best strategies have already been found and we are just making small tweaks here and there to squeeze out a little more alpha.

Don, how do you decide when to drop a model? Once I decide to trade a model, I follow all the stock picks 100% mechanically, because if I add my personal judgement it will probably reduce returns. So then I also need hard rules to tell me when to stop trading a model. It’s very tricky because any good model can underperform from time to time so it’s hard to tell if it’s temporary or it permanently broke. Even Warren Buffett had several major drawdowns in his career.

Here’s a tentative set of rules:
Stop trading the model if:
-It underperforms the benchmark 3 years in a row
-Its drawdown is 10% worse than the benchmark
-It underperforms the benchmark by more than 20% in any time frame
-If the trailing bucket rank trend deteriorates

I just made these up; I guess they are dependent on the strategy being traded. I’d encourage R2G designers to think about what types of conditions would qualify as “this model no longer works” and educate subscribers.
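Just to make the rules above concrete, here is a rough sketch of how such a check could look in code (the thresholds and inputs are the made-up ones from my list, and the rank-trend rule is left out because it needs bucket data):

```python
def should_stop_trading(model_yearly, bench_yearly, model_dd, bench_dd, worst_relative):
    """Tentative stop rules. Drawdowns are negative fractions (e.g. -0.35);
    worst_relative is the model's worst under-performance vs. the benchmark."""
    underperf = [m < b for m, b in zip(model_yearly, bench_yearly)]
    three_in_a_row = any(all(underperf[i:i + 3]) for i in range(len(underperf) - 2))
    return (
        three_in_a_row                       # underperforms the benchmark 3 years in a row
        or model_dd < bench_dd - 0.10        # drawdown 10% worse than the benchmark
        or worst_relative < -0.20            # trails the benchmark by more than 20%
    )

# e.g. should_stop_trading([0.05, -0.02, 0.10], [0.08, 0.01, 0.07], -0.35, -0.20, -0.12) -> True
```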

I think rotating strategies is going to work/not work, at least in part, depending on whether you are using trend-following or whether your strategy uses regression to the mean.

Don. Do you think your strategies are trend following?

Thanks.

Like I said, I don’t have a formal process. If I am not comfortable with the strategy’s performance, or volatility, then I will review the live results vs. the backtest and make a decision. It’s not something I do often. I agree that my personal judgement is likely to hurt performance. I only overrule trades in my daily short strategy - the individual trades are exceptionally risky and I’ve been doing it long enough to recognize a couple of chart patterns that are dangerous from a short perspective. Almost certainly I miss more winning trades than losing ones, but the losses on a bad short trade can be extreme. In any case, I feel it is better to miss an opportunity to make money than to rush into an opportunity to lose it. Although I would like his long-term returns, the type of drawdowns that Warren Buffett has experienced are not acceptable to me. I firmly believe such drawdowns can be mitigated by trading low-correlated trading strategies without sacrificing returns - the book feature supports this.

Jim,
In my formal test there were periods where different strategies outperformed, so I think using a simple trend following strategy to switch to the outperforming strategy has merit. Like any trend following approach, you will be late at the changes during which time the approach will underperform, and just like price changes, market regime changes can be sudden.

I would like to make a couple of observations on this thread. First is that I stopped the sim at the end of 2004 and 2005 and took screenshots of the monthly performance. I have circled the high volatility months relative to the benchmark (which I changed to S&P 600). As you can see, there are many huge months but also some huge losses. I’m not sure about others but I would not be able to trade such a system as I have difficulty stomaching a 10% loss in one month despite the offsetting good months. For me this is a warning sign because 2004 and 2005 are “in-sample” years. A little degradation and the losing months will catch up with the big months. When this happens, the nature of compounding volatility is such that the system will lose money. i.e. a 10% loss followed by a 10% gain does not bring you back to 100%. This phenomenon means that volatility always works against you and this is one reason why more stocks are better than less.
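The arithmetic behind that point is worth spelling out (a trivial made-up example, not numbers from the sim):

```python
start = 100.0
after = start * (1 - 0.10) * (1 + 0.10)   # a 10% loss followed by a 10% gain
print(after)                               # 99.0: still down 1% even though the "average" month is 0%
```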

I also ran one-year sims starting in 1999 all the way up to 2012. As you can see, the beta from 1999-2003 is about 0.5, more or less. Then from 2004 onward, the beta is consistently ~1.00. I’m hoping a stats major will chime in to explain the sudden jump in beta and the consistency of it. (Jim?)

Steve



Beta.gif

Also 5 stocks is too few to accurately assess the system as a failure. It could be just bad luck.
Same system with 10 or 20 holdings creates alpha. TF12 is not dead yet.

I haven’t been here for very long, but my understanding was that 5-stock models were specially made for R2G, because for models to be subscribed to they need great performance, and so designers need to push the envelope. However, for a personal model it doesn’t make sense to use just 5 stocks. It’s inherently so risky!

I want to explain a little more about why I think a lot of approaches to robustness are flawed.

Let’s imagine a trading system, that picks stocks based on a pseudo-random number generator.

A “pseudo” random generator is something computers use to generate a sequence of numbers from a given starting number called a seed. They are not truly random, because it is a sequence and if you start at the same point it will produce the same set of numbers, so it is actually deterministic. However, for all intents and purposes the numbers it spits out look random.

This is important because it means such a “backtest” is repeatable and gives the same result each time, provided the seed is a fixed constant.
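For example (any language’s standard random module will do; Python shown here):

```python
import random

random.seed(123)
first = [random.random() for _ in range(3)]
random.seed(123)
second = [random.random() for _ in range(3)]
print(first == second)   # True: same seed, same "stock picks", same backtest result
```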

This is clearly a useless trading system, we can be pretty sure of that ex-ante, because a system that selects stocks randomly is no more or no less likely to select stocks that outperform the market. It is possible to “get lucky” but then the point about trading systems is to rely on luck as little as possible - there needs to be some overall strategy.

So given how this is set up, let’s go through a thought experiment of how this might be developed. And let’s imagine that you, the developer, do not know that this is a pseudo-random number generator. This may be the result of the latest “machine learning” or some other sophisticated formula.

After building a supercomputer to conduct millions of backtests, you trial a million different “trading systems”. The average performance is going to be about the same as the benchmark (less transaction costs). However, by definition, 1% of the trading systems will produce results that are in the top 1% of everything tested. Of your original million, 10,000 trading systems are in the top 1% of performance. If you analyse any one of these trading systems individually, you will conclude that it really does have some “benefit”. In fact, the probability that the results are from pure chance alone is less than 1%. All of your statistical analysis will “prove” that. Great, you’re really onto something.

As a conscientious trader, however, you have held back some time period for “out of sample” testing. So we take the 10,000 systems in the top 1% and test them in our “out of sample” period. Because they are really just random, we know the average result is going to be the same as in the original test - 10,000 systems with an average result likely no better than the benchmark. However, even out of those, some will have done very well. If you take the top 1%, the top 100 trading systems in the out-of-sample period, their performance is likely to match the performance of the “in sample” period - in both cases, in the top 1%.

So you take these 100 trading systems and think: great, they have performed in the top 1%, very unlikely by chance alone, and they have worked great in both the “in sample” and “out of sample” periods. They are clearly robust, and even if one or two blow up, if we diversify the money among all 100 then we should have a great result.

What happens?

Well, your result is likely to be no better than the benchmark.

Ironically, you could present such a system to your clients, and they could analyse it using the techniques mentioned and come to the conclusion that there is only a 0.01% chance the result is from luck alone. With a high degree of confidence, they will say your system really “has something”. Their analysis is correct: there is only a 0.01% chance the systems would have done so well from luck alone. And yet they really have achieved their result from luck alone. They didn’t know that what they were looking at had already been pre-selected from a million systems.
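You can reproduce the whole thought experiment in a few lines (a toy simulation with zero true edge by construction; the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n_systems, edge, sd = 1_000_000, 0.0, 0.10                # random dart-throwers: zero true alpha

in_sample = rng.normal(edge, sd, n_systems)               # in-sample "alpha" of each system
survivors = in_sample >= np.quantile(in_sample, 0.99)     # keep the top 1% (~10,000 systems)

out_sample = rng.normal(edge, sd, n_systems)              # an independent "out of sample" period
final = survivors & (out_sample >= np.quantile(out_sample[survivors], 0.99))   # top ~100 again

future = rng.normal(edge, sd, n_systems)                  # what actually happens when you trade them
print("systems surviving both filters:", final.sum())
print("their average future alpha:    ", round(future[final].mean(), 4))   # ~0.0
```

The double filter produces around 100 systems with spectacular in-sample and out-of-sample track records, and their future alpha is still zero.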

One issue is that the distinction between “out of sample” and “in sample” is something of an illusion. (It’s why I don’t like Even/Odd tests). If you use the “out of sample” period to filter out all of the trading systems that didn’t work, then really, the “out of sample” has become part of the sample.

Most people would see that it is clearly ridiculous to construct a trading system based off a pseudo-random number generator, and might think my example is rather contrived. Possibly. But if you construct a highly “sophisticated” trading system that is so advanced that even you don’t quite know what it does or why it works - how can you be sure it isn’t just a fancy pseudo-random number generator?

I think if you want “robustness” it has to involve going back to basics, establishing what is really going on, based on economic and financial principles.

I do not think the solution is “more testing”. Splitting the universe into more components, conducting more “out of sample” tests, more “variations” and so on may well make you “feel better” but, as I hope I have demonstrated, is useless in determining robustness.


Oliver,

It is clearly true that if you are running a lot of trials you will find good looking systems by chance. Kurtis made this point when he said that he had used some automated systems that got Sharpe Ratios greater than 5. He said the systems were useless out-of-sample. I’m guessing the computer was able to run millions of iterations.

I believe you like a larger number of trades and holdings, as well as fewer factors. I think you also like out-of-sample results. I don’t know what you think about “proven” factors that are published and tested over long periods and in foreign markets.

I’m only just beginning to get any out-of-sample results. I’m forced to either use my own backtesting or trust a black box (with backtesting results) at this point. Despite the limitations, what can be looked at in the backtesting?