Robustness Testing.... tested

Well, with all this talk about robustness testing, I thought it would be a good idea to make this more evidence-based. Are any of the proposed “robustness tests” actually any good at picking up a model that is or is not going to work in real time?

I pick on the usual example, TF-12, as that is one I know failed in real time. I ran a series of tests that simulate what a user in 2006 would have seen if they had run such robustness tests.

I have created a presentation here with the results.

I believe these results indicate the proposed robustness tests are not of value. I admit this confirms my previous instinct, so there may be presenter bias. I encourage debate on the issue.

All of the tests are public and can be found here.

Oliver

That is why the robustness test should be done in the ranking system itself: vary the node weightings and then run the model for the many possible combinations of weights.
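Roughly, the idea in code (a minimal sketch only; the node names, the ±5-point step, and the run_backtest call are placeholders for whatever factors and backtest routine you actually use):

```python
from itertools import product

def weight_sensitivity(run_backtest, base_weights, step=0.05):
    """Re-run the model for every +/-step perturbation of each node weighting."""
    results = []
    for deltas in product((-step, 0.0, step), repeat=len(base_weights)):
        trial = {name: max(w + d, 0.0)
                 for (name, w), d in zip(base_weights.items(), deltas)}
        total = sum(trial.values())
        trial = {name: w / total for name, w in trial.items()}  # re-normalize to 1.0
        results.append((trial, run_backtest(trial)))             # user-supplied backtest
    return results

# Example: weight_sensitivity(my_backtest, {"Value": 0.4, "Growth": 0.3, "Momentum": 0.3})
# A robust system should show broadly similar returns across the trials rather than a
# sharp peak at the original weights.
```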

Georg

Interesting exercise.

But I looked at the ranking system and right off the bat found factors amounting to 40% of the model that fail the common-sense test:

ROI%Q
Pr2CashFlQ
Pr2SalesQ
Pr2SalesInclDebt – ranked relative to universe
Pr2SalesInclDebt – ranked relative to industry
AssetTurnQ

Most of these items reflect a lack of appreciation of the seasonal fluctuations that are quite normal in the business world, and of the lumpiness of many businesses apart from seasonality. And Pr2SalesInclDebt is a completely off-the-wall item; I have no idea why any investor would bother to compute such a thing. And that rank relative to industry – Oy!

When designing in hindsight, many bizarre items can look good if the designer is sufficiently persistent, and the fact that these items got into a popular ranking system back in 2006 is a case in point. (And by the way, this whole robustness crisis is not by any means limited to markets. Check the cover feature in the Economist a few weeks ago, which described how, all through the academic and scientific communities, studies that pass juries and peer reviews and get into prestigious journals fail upon subsequent replication.)

As Marco and Paul probably recall, I muttered and cursed every time items like these (which were embedded in the database by people without finance backgrounds long before p123 was founded) came up for QA during the course of database projects, and openly wished we could just kill them (and others like them). But that’s easier said than done, considering they were in widespread use and, I now suppose, featured in some models that were very successful based on backtesting. And based on some quick searches I just did among publicly-visible systems, it looks like some are still quite popular.

As James O’Shaughnessy said in What Works on Wall Street, “if there is no sound theoretical, economic, or intuitive, common sense reason for the relationship, it’s most likely a chance occurrence.”

Oliver,

I noticed that when I run the performance on that ranking system, it also looks almost perfect, with the first bucket being the lowest and each step being higher in a near-linear fashion, no matter how many buckets I use (before 2006). Also, there are not that many factors.

Is it even remotely possible that that can happen just by accident (that it can be that good)?

I have to wonder if the market changed (just a hypothesis). A piece of evidence that could support this hypothesis is that another ranking system of yours (Advanced Pullback w/ATRN) looks great before 2006 and looks useless now. One common denominator that has stopped working for me is long-term momentum factors.

Random < 0.8 can’t protect us against changes in the market (though it may not be changes in the market at all). I can see why you like to diversify your systems.

In any case, very interesting!!! It will be interesting to hear your comments (and others’) as to how that could be.

I think the idea behind Pr2SalesInclDebt is that you are adding in the amount of debt per share. Sometimes companies (like GM) have a low P/S ratio because they are highly leveraged, i.e. most of their capital structure comes from debt financing, rather than equity financing. If equity capital is only 10% of the invested capital, is it fair to assign 100% of the sales to that equity? I think adding debt is rather blunt - it is far from a perfect ratio, but I wouldn’t necessarily write it off.
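Roughly, my reading of the ratio in code (an assumption about its exact definition; the figures are made up purely to show the leverage effect):

```python
def p2s(market_cap, sales):
    return market_cap / sales

def p2s_incl_debt(market_cap, total_debt, sales):
    # Count debt alongside equity, in the spirit of EV/Sales
    return (market_cap + total_debt) / sales

# A highly leveraged firm can look "cheap" on plain P/S but not once debt is counted:
print(p2s(10e9, 100e9))                  # 0.10 -- looks very cheap
print(p2s_incl_debt(10e9, 90e9, 100e9))  # 1.00 -- much less so
```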

As for whether to use Q or TTM values: well, you might argue that TTM helps normalise seasonal fluctuation. However, this is a ranking, not an absolute, so if companies are doing better than their peers in their sector/industry in the most recent quarter, maybe they are a better bet?
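A toy illustration of the seasonality point (numbers invented): single-quarter sales swing widely for a Q4-heavy business, while the trailing-twelve-month sum barely moves.

```python
quarterly_sales = [80, 90, 85, 200, 82, 92, 88, 205]  # two hypothetical years, Q4-heavy

ttm = [sum(quarterly_sales[i - 3:i + 1]) for i in range(3, len(quarterly_sales))]
print(quarterly_sales[3:])  # [200, 82, 92, 88, 205] -- lumpy quarter to quarter
print(ttm)                  # [455, 457, 459, 462, 467] -- smooth year over year
```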

I am slightly playing devil’s advocate here: it wasn’t obvious to me at the time that there was anything wrong with the system; it looked quite sensible.

I think the problem is that it is an “engineer’s” ranking. It has been put together by a lot of testing: first testing all the factors, then selecting the top ones, then altering the weights for the best result. It has no particular theme to it, such as “growth at a reasonable price” or “turnaround situations”; it is just a collection of factors. If you were asked to explain in plain English what it does, you would struggle.

I want to follow up a bit on my last post regarding the Q items.

On the Street, you need to compare 12 months to 12 months (A, TTM, PTM, etc. in p123 lingo) or Q to year-ago Q (PYQ in p123 lingo) to get a proper picture. (PQ, consecutive quarter, came into vogue back around 2000 as pundits and company executives looked for anything they could latch onto to cheerlead internet stocks higher, but we all know what happened to those.)

That said, in hindsight testing, it’s easy to understand why the Q items would do well. They’re part of the mix when the Street reacts to strong TTM-type items, and if the Q was good, so much the better for the TTM item. So there is going to be some correlation between Q and TTM. But going forward, given the lumpiness of the Q items, models such as the one mentioned above will veer all over the place as the stock selection responds to the piece rather than the whole of what the Street considers. So whether live performance works or doesn’t is more a matter of luck, or perhaps even a stable business climate (which would give non-seasonal firms a better chance of having Qs reasonably represent 12-month tallies).

Marc, could you expand on your reasoning about Pr2SalesInclDebt, please?

Oliver - I would like to applaud you for providing a sample simulation for analysis. I hope others will do the same.

Now, I have to disagree with the conclusion that the ranking system “is the problem”. You are providing results of a complete system, not a ranking system in isolation. If you look at the ranking system “out of sample” from 2006 to present using the ranking system performance page, you will see that it is picture perfect. Same universe, same minimum price; the only differences are the minimum liquidity and market capitalization rules. You can’t ask for much more than the out-of-sample performance, can you? I don’t buy the explanations on the table so far, and we need to look deeper to come up with an answer. This may involve anything from how ranking system performance is measured, to how buy/sell rules are chosen, to position sizing.

BTW - does anyone know whether the ranking system performance uses previous close or next open prices? If it uses previous close then there is a serious issue there, especially with weekly rebalance.

Steve


Exactly,
Move the buy rules to the universe, set 10 positions, variable slippage, and 0.5 cents per share. Oliver, your ranking system is still getting alpha.


Robustness testing has to have two components, to show that a system:

a - Has no curve fitting
b - Works in different cycles.

The Random test (or something similar) takes care of (a). (b) is harder; we need as much good data as possible, and should probably turn market timing off.

Oliver, your presentation shows a ranking system that stops working due to a market cycle.

“The Random test (or something similar) takes care of (a).”

Marco - there is nothing like a demonstration, i.e., show me the proof.
Steve

Marco - I would like to explain further. Although I didn’t agree with Oliver’s conclusion that the ranking system was the problem, he was absolutely correct in his assessment that the proposed ruggedness test is simply moving down the ranking system. This will degrade performance of ranking systems that rely on Rank>xx. However, those developers who have overly optimized ranking systems will get a very big pat on the back whether deserved or not. If you feel that there is some benefit to the random() test then you should be able to demonstrate it by taking a publicly available ranking system, perhaps one developed by P123, and creating a sim with the RS that has good results until you execute the random test. Then analyze the results and show that the test flagged an overly optimized set of buy/sell rules.

I would suggest a better ruggedness test is to examine ranking systems to see if they are overly optimized. This can be done by examining the top bucket of 200 buckets to see if it is overly exaggerated compared to the rest of the buckets. If it gives a 150% annual return, for example, and the buckets to the left are significantly lower, then you can be sure the RS has been manipulated for this effect.
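In rough code, a check along these lines might look like the following (the 2x threshold and 10-bucket neighbourhood are arbitrary illustrations, not proposed standards):

```python
def top_bucket_exaggerated(bucket_returns, neighbours=10, factor=2.0):
    """bucket_returns: annualized return per bucket, lowest rank first (e.g. 200 values).

    Flags a ranking system whose top bucket dwarfs the buckets just below it.
    Assumes the list has at least neighbours + 1 entries.
    """
    top = bucket_returns[-1]
    nearby = bucket_returns[-1 - neighbours:-1]
    avg_nearby = sum(nearby) / len(nearby)
    return top > factor * max(avg_nearby, 0.0)

# Example: a top bucket of 150% next to neighbouring buckets averaging ~40% would be flagged.
```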

Steve

Curve-fitting is the process of tweaking a system (data torture) until the best backtested results are achieved. P123 has tools that, if misused, can help you do that, like the optimizer. For example, Rank < 97.6 or PE < 13 can be the result of tweaking. These rules target or exclude particular stocks to achieve maximum performance. Curve-fitting is much more dangerous with fewer holdings, like 5 or 10.

Valid ways to test for curve-fitting include:
- introducing a random component
- taking the actual trades and excluding the top 10%
- running on an odd-ID universe, then an even-ID universe (a sketch of this split follows below)

Not sure a proof is needed. But let us know if you have a better way.
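A minimal sketch of that odd/even split, assuming you have numeric stock IDs and some backtest routine of your own (run_backtest here is a placeholder, not a p123 function):

```python
def odd_even_split(stock_ids):
    """Split a universe of numeric IDs into odd-ID and even-ID halves."""
    odd = [s for s in stock_ids if s % 2 == 1]
    even = [s for s in stock_ids if s % 2 == 0]
    return odd, even

def curve_fit_check(run_backtest, stock_ids):
    odd, even = odd_even_split(stock_ids)
    in_sample = run_backtest(universe=odd)    # the half used for development and tuning
    out_sample = run_backtest(universe=even)  # the untouched half
    # A big drop from in_sample to out_sample suggests the rules were fit to noise.
    return in_sample, out_sample
```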

thanks

I should make it more clear - I am not “blaming” the ranking system as such, but I am saying its ability to pick 5 stocks has been over-estimated.

A long time ago, I argued that portfolios with more positions are more robust.

here is the out of sample with 10

and here it is with 20

The 20-position one has actually performed the best.

I knew, and argued a priori, that larger portfolios are more robust. The evidence seems to support that conclusion a full five years later.

I would modify my argument from five years ago, though, to say it is actually not just about the size, but about the ratio of number of trades to degrees of freedom. Usually, larger portfolios end up with more trades. The original sim only had 99 trades, which is not enough to draw meaningful conclusions. I have witnessed smaller portfolios perform well out of sample, but they tend to be high-turnover.
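As a back-of-the-envelope example (the 99 trades come from the original sim; the parameter count is an assumption purely for illustration):

```python
trades = 99
free_parameters = 20   # e.g. rank-factor weights plus buy/sell rule thresholds (assumed)
print(trades / free_parameters)   # ~5 trades per tunable parameter -- very thin evidence
```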

I second Steve’s argument for evidence about whether the “random” test does anything useful. How exactly is it used to determine if a portfolio is robust?

Marco - I can repeat what I said previously. Overly optimized ranking systems will perform best using the random test. Overly optimized ranking systems push stocks that had a negative event down so they don’t show up in the system. All you have to do is optimize to an extreme for the top bucket of 200, then put in a buy rule of Rank > 99.5. Done! Incredible system! How is the random function demonstrating ruggedness? It is doing the exact opposite of what you are trying to achieve…rewarding over-optimization of ranking systems.

Perhaps the best solution is to publicize the ranking system performance with the exact universe and buy/sell filters with buckets consistent with the number of positions.

As for a demonstration: I understand the downside of what you are proposing, but I have some difficulty believing the positive benefit. Seeing is believing.

Steve

I would also like to emphasize that time is much better spent correcting the difficulties with creating a good system, such as the buy/sell difference issue and the inability to use #Previous in ports/sims. These issues make it more difficult to generate a system that will perform well with this ruggedness test.

Steve

Steve,

I would like #previous in sims too. RatingPos (not RankPos) would also be good.

What are your concerns with buy/sell difference? Does it affect the performance of some of your sims? Does the sell rule RankPos > X not work in your sims? Replacing Rank < 101 with RankPos < 5 (or 10, depending on which sim) actually reduces my returns by about 2%.

Just curious. I know you have a good reason.

One strategy with buy/sell difference is to force a sell so that the system goes back to make sure the stock is still the best one to own. So not only is this process (sell -> buy) going to be accentuated by the fact that stock ranks are lower (and no longer fit the rules in place), but re-buys won’t happen in some instances. This will cause lowered performance and a large increase in broker fees and slippage. Add onto this the fact that the stats are wrong to begin with.

As for #previous, one cannot make a tight system based on FOrder because you have to account for all of the buy filtering going on. For example, I may want to buy the top-ranked stock within an industry, but because of all the other buy rules, the top-ranked stock may have already been excluded. So instead of using FOrder(…) = 1, I have to use something like FOrder(…) < 5. This problem propagates to the sell side as well. (A rough sketch of the selection I have in mind is below.)

I assume the same thing would occur with RankPos.
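A generic illustration of that selection in plain code, assuming a list of stocks with rank, industry, and a flag for whether the other buy rules pass (hypothetical data structures, not p123 objects): the ordering is computed after filtering, so the top survivor per industry is picked directly.

```python
def top_per_industry(stocks):
    """stocks: list of dicts with 'ticker', 'industry', 'rank', 'passes_buy_rules'."""
    best = {}
    for s in stocks:
        if not s["passes_buy_rules"]:
            continue  # apply all other buy rules first
        ind = s["industry"]
        if ind not in best or s["rank"] > best[ind]["rank"]:
            best[ind] = s  # keep the highest-ranked survivor in each industry
    return list(best.values())
```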

Let’s be clear. I don’t know if Random() is a problem with any of my R2G systems. I’m arguing over the principle of throwing this in without understanding the consequences. And I feel that these other issues are more important to clear up for the creation of better systems, over and above this “ruggedness test”.

The other problem I have is that there are no definitive pass/fail criteria, something that every “test” has. I’m sure with your background you know this, Jim. Therefore any degradation of results, no matter how small, could cause upset subscribers and lead to never-ending discussions as to why something is or isn’t a problem. And worse, it may lead to a loss of subscribers for artificial reasons.

Steve

Steve, I see exactly what you mean. I missed that you meant the buy/sell difference with regard to using Random() for robustness testing. Thanks.

Isn’t testing for over-fitting different from testing for robustness?

Being over-fit has to do with having too few data points and too many rules. You test for it by setting aside data for out-of-sample testing: for example, designing and optimizing with even stocks and then testing with odd stocks, or designing with small caps and testing with mid caps. When we get international stocks, it will be of tremendous value, if for nothing other than out-of-sample testing.

Robustness has to do with the results being insensitive to variation. But there is a wide variety of variation that you can cover. You can vary the rank weightings, the buy rules, the sell rules, the universe, the number of positions, etc. Wherever you have a number, you can vary it. If you have a rule like PE < 13, the way to vary it is to try PE < 12, then PE < 14, etc. If you vary factor A by 10% and the results change by 50%, and you vary factor B by 10% and the results change by 1%, then the system is robust to factor B but fragile to factor A. (This assumes varying your factors by 10% is reasonable; you have to justify that 10% is better than, say, 5%.)
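A minimal sketch of that kind of sensitivity check (run_backtest is a placeholder for your own routine; the ±10% step mirrors the example above):

```python
def sensitivity(run_backtest, base_params, name, rel_step=0.10):
    """Nudge one numeric parameter up and down and compare the backtest results."""
    base = run_backtest(base_params)
    results = {}
    for sign in (-1, +1):
        trial = dict(base_params)
        trial[name] = base_params[name] * (1 + sign * rel_step)
        results[sign] = run_backtest(trial)
    # Large swings relative to `base` mean the system is fragile to this parameter.
    return base, results

# Example: sensitivity(my_backtest, {"max_pe": 13, "min_rank": 90}, "max_pe")
```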

You may also want robustness to macroeconomic conditions, in which case you need more data covering more different types of environments.

It’s not clear what using Random < 0.8 actually tests for. It pushes you down the ranking system, right? So it’s like randomly removing highly ranked stocks that you would have bought. Does this test for over-fitting or robustness? I’m scratching my head. It doesn’t actually vary the rank weights or the rules. I guess it’s varying the universe. So at best, it varies one dimension but leaves all other dimensions fixed. As we know, each model is multi-dimensional, and it’s very easy to unintentionally neglect important dimensions.
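In toy form, here is what that random filter appears to do (purely illustrative, not p123’s implementation): each stock survives with 80% probability before selection, so the model sometimes has to buy further down the list.

```python
import random

def apply_random_filter(ranked_stocks, keep_prob=0.8, seed=None):
    """ranked_stocks: list ordered best rank first; each survives with keep_prob."""
    rng = random.Random(seed)
    return [s for s in ranked_stocks if rng.random() < keep_prob]

ranked = [f"STOCK_{i}" for i in range(1, 21)]      # hypothetical ranked list
picks = apply_random_filter(ranked, seed=42)[:5]   # top 5 of the survivors
print(picks)  # depending on the seed, some original top-5 names get replaced by lower-ranked ones
```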

And the future can always change in unpredictable ways that no amount of statistical rigor can account for. But if a model is over-fit or fragile, then you shouldn’t even consider it, because it’s not realistic. So statistical rigor is necessary, but not sufficient, for a model to work in real time.

It seems like “over-fit” is a clear, unambiguous term. If you have two data points and fit a quadratic equation through them, it is over-fit; no confusion there. Robustness is much more murky. Robust to what? Robust to the collapse of capitalism and a return of communism? Robust to a nuclear bomb in New York City? Different people will demand different levels of robustness, just like different people have different risk requirements.