Over-Optimization

I have three related questions for P123 users.

  1.  What are the warning signs that a system is over-optimized?
    
  2.  Given two roughly comparable systems (i.e. with a good number of common elements), one of which performs much better in a backtest than another, in what circumstances would you choose the lesser-performing system?
    
  3.  Are strategies subject to regression to the mean? In other words, given ten strategies, will the one that performs the strongest be more likely to perform more poorly in the future, and will the one that performs the weakest be more likely to excel?
    

Here are my answers. But I’m eager to hear yours.

To the first question, the warning signs for me are: a) if the returns are very high and the number of stocks held is very low (fewer than fifteen); b) if slippage is inadequately accounted for; c) if you double the number of holdings, the returns are very much lower; d) if it includes rules that can’t be explained in a way that makes sound financial or mathematical sense.
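For (c), here’s roughly how I’d make the check mechanical rather than impressionistic. This is just a sketch; `run_backtest` is a hypothetical stand-in for whatever simulation tool you actually use, not a P123 function:

```python
# Robustness check for warning sign (c): does doubling the holdings gut the returns?
# `run_backtest` is a hypothetical callable that returns an annualized return
# for a given number of holdings -- plug in whatever sim you actually run.

def doubling_check(run_backtest, base_holdings=15, max_drop=0.5):
    """Flag the system if doubling the holdings cuts the annualized
    return by more than `max_drop` (0.5 = more than half)."""
    base = run_backtest(n_holdings=base_holdings)
    doubled = run_backtest(n_holdings=base_holdings * 2)
    drop = (base - doubled) / abs(base) if base else 0.0
    return {
        "base_return": base,
        "doubled_return": doubled,
        "relative_drop": drop,
        "looks_over_optimized": drop > max_drop,
    }
```

The exact threshold is arbitrary; the point is simply to test the sensitivity rather than eyeball it.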

To the second question, it depends on the backtest. If the higher-performing system doesn’t violate any of the above rules, then I think I would always choose it if the backtest were ten or twelve years long and the outperformance was pretty consistent. If the systems are not roughly comparable, then sure, the lower-performing system might have some advantages. But if they’re truly comparable, I would think the better-performing system would always have a slightly higher probability of performing better out of sample.

To the third question, I’m of two minds. I know that regression to the mean is statistically equivalent to imperfect correlation, and that if past results are negatively correlated with future results, regression to the mean will be more prevalent than persistence. I also know that if you take a large set of actively managed mutual funds and compare two multi-year periods, you’ll probably find reversal to be more prevalent than persistence (i.e. a negative correlation between the periods). On the other hand, I’m also convinced that well-designed strategies will always outperform poorly designed strategies, and that one way to tell how well a strategy is designed is to look at its performance. So I would tend to bet that a strategy that outperformed in one period would continue to do so in a second period.
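To make the “imperfect correlation” point concrete, here’s a toy simulation with entirely made-up numbers (nothing estimated from real funds): each strategy gets a fixed “skill” plus period-specific luck, so the period-to-period correlation is below one, and the period-1 winner usually looks more ordinary in period 2.

```python
import numpy as np

rng = np.random.default_rng(0)

n_strategies = 10
skill = rng.normal(0.05, 0.02, n_strategies)   # true annual edge per strategy (made up)

def period_returns():
    # realized return = skill + period-specific luck
    return skill + rng.normal(0.0, 0.10, n_strategies)

period1, period2 = period_returns(), period_returns()
winner = period1.argmax()

print("period-1 vs period-2 correlation:", round(float(np.corrcoef(period1, period2)[0, 1]), 2))
print("period-1 winner's rank in period 2:",
      int((period2 > period2[winner]).sum()) + 1, "out of", n_strategies)
```

Shrink the luck relative to the spread in skill and persistence wins; inflate it and regression dominates. That trade-off is really what my third question turns on.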

I’m not really sure about any of this, so I’d be happy to be swayed.

Yuval,

without having read your answers…

Good performance for 6-24 months and then a sudden drop-off, OR stellar in-sample returns with OOS returns similar to or below the benchmark.

If there is an argument for overfitting in the better-performing strategy, then I would rather choose the lesser-performing system; I would do the same if the better-performing system trades far fewer stocks than the lesser-performing one.

Most stocks experience regression to the mean, hence this should also apply to strategies, simply because statistically it is not possible to always pick the outperforming stocks. However, what is more relevant for a strategy is that its investment theme is usually only relevant for a couple of years, before a “regime change” kicks in. Unfortunately a lot of strategies on p123 are still focused on factors which generated brilliant small-cap returns in the last decade. In this regard a backtested strategy also has its limitations, as it only shows the response to past market trends.

Just my 2 satoshis!

Yuval, I don’t know if I’d call this over-optimization, but a warning sign I try to be mindful of is when an inordinate # of top-rated stocks have dramatic blowups / gut-kick moments. To me it’s a sign that the market has figured out the relevant factors for the strategy, and the only remaining high-rated stocks have already been picked through, leaving the chaff. Several years ago, after reading Greenblatt’s book, I built some Excel-based models inspired by the approach (I was unaware of p123 for a long time), and I got the sense that the model he described was already significantly breaking down. The Quality-Value approach described seemed to be selecting for a high number of companies with pending but not yet disclosed bad news. There were just too many numerically cheap, quality companies having big 20-50% losses. Obviously, they weren’t as cheap or as high quality as the numbers were saying.

All,

I was having a discussion with Parker once. He was talking about “mean reversion” and I was talking about “regression-to-the-mean.” It quickly became obvious we were talking about 2 different things.

Doubt that I could explain the difference and keep Nisser awake;-)

Yuval, I believe some of your discussion is most consistent with “mean reversion.” It is a good concept: likely to be used by Finance people like Parker for sure. Just a different definition, I think.

Mean reversion happens sometimes and other times not. Perhaps it is likely with, say, seasonal stocks: the decline in one part of the season is likely to be reversed in another portion of the cycle.

Regression-to-the-mean is always there and never goes away. It is as undeniably true as the fact that the three angles of a triangle add up to 180 degrees. It follows as a theorem from the definition of correlation, and correlation, being a mathematical definition, behaves as the theorems predict. The only time it does not happen is when the correlation is exactly one (or minus one).
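For the record, the textbook version of the theorem (a standard result for any pair of variables with correlation $\rho$; nothing P123-specific): standardize the period-1 and period-2 results to $X$ and $Y$, and the best linear prediction of $Y$ given $X = x$ is

$$\hat{y} \;=\; \rho\, x, \qquad |\rho| \le 1 .$$

Whenever $|\rho| < 1$, the predicted follow-up result is pulled toward the mean (zero, in standardized units); the pull disappears only when $|\rho| = 1$.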

It will always be there affecting the aggregate out-of-sample Designer Model results, for example. You have a better chance of defying gravity.

-Jim

Let me give you a long and winding answer (gee, how out-of-character for me!!!) that may set important context before addressing the specifics.

There is no such thing as a mathematical or statistical series or phenomenon. When we speak of them, we are using verbal shorthand to describe something else. (Actually, that can probably be said for every field to which statistics applies.) For those who are knowledgeable about investing (or whatever other field may be applicable), the notion that statistical information is just a shorthand description of something else is so well known, it need not be stated over and over again. And in fact, people who do state it again and again find it hard to participate in sensible conversation and are likely to be shunned as a-holes.

Example: Stock price momentum. It doesn’t exist. It never existed. It never will exist. And because old-time academicians thought investors really believed in it, they came up with all sorts of nonsense to refute it, such as the random walk theory, etc. When legit folks are talking about momentum, what they are really talking about, even if they don’t verbalize it (and they pretty-much never do) is this: On day 1, a stock moves because of reason A. If reason A continues to exist on days 2, 3, 4, etc., the stock will continue to move as it did on day 1. It’s a lot easier to talk in terms of momentum rather than in terms of the sustainability of the underlying reason for the price action. And since the world does tend to change in more of an evolutionary than revolutionary manner, we usually have enough sustainability to create the conditions that empiricists see and refer to as the momentum factor.

This applies all over the place. In technical analysis, we speak of support-and-resistance, overbought-oversold, MACD, RSI, etc., etc., etc. All flow from behavioral phenomena that are so well accepted that they need not be referred to in discussion (and are best not referred to lest the speaker come to be seen as an overly pedantic jerk).

The same holds true of mean reversion. It’s a verbal shorthand way of addressing a very, very well-established body of knowledge relating to what is known to happen when conditions become extreme. There is an inherent tendency to exhaust and reverse. Lao Tzu noticed it and described it in the Tao Teh Ching. Aristotle learned of it and it forms the foundation for the Nicomachean Ethics. In Economics 101, they describe it by discussing, for example, how excess profits attract additional suppliers into a market and how this keeps happening until profits recede to the point of unacceptability, at which time suppliers start to exit and things move in the other direction. Sometimes these things play out quickly (technical analysis). Sometimes, they play out generationally (the current Democratic Left representing a corrective stage that started when Reagan and the Republicans took over DC back in 1980). Etc., etc., etc.

So starting with your third topic, regression to the mean: yes, it exists. It’s built into the human condition. (Actually, we never regress “to” the mean; we tend to regress, or rather move, “to and then beyond” the mean in the other direction, back and forth.)

As to whether a good-performing model will regress to the mean, the answer is tied not to the model’s performance but to the market phenomenon to which the model is tying itself. The more a model tilts toward market extremes, the more confidently we can assume its performance will regress as extremes correct. (This is why I am so bothered by things like MktCap-smaller-is-better and similar items. I know it’s pushing toward the “risk-on” extreme, and as good as it can perform when the market is friendly toward high levels of risk, I also know how ugly it will get when the market climate changes — and I have seen too many p123 users, including most of the 1st-generation Designer Model community, ignore warnings about it, and we all know what happened.) The challenge is one of timing. We can never know how long the pendulum will swing. Sometimes, it’s fast. But we are seeing the business cycle lengthening; ditto the interest rate cycle, etc. And external factors always intervene to mess things up, because the world has an irritating way of always changing. (Considering where interest rates have been for a while now, we should be experiencing rampant inflation; that we aren’t is testament to how much the world (external variables) has changed since traditional notions came into being.)

So to choose one model vs. another based on an assumption of regression to the mean, one would have to follow and understand the markets: what they are doing and how things develop. Modeling is useless. We’ve already busted all the relevant historically established precedents, so we’re all in new territory.

If one isn’t, or doesn’t want to become, proficient in understanding market dynamics, the most prudent course is to aim a model at central tendencies that are likely to perform badly under some extremes, wonderfully under others, but on the whole more or less middle of the road. (This, by the way, is why there’s such a Quality bias in the Invest models I built, even though it’s hurting relative performance now. In that business/advisory model, it’s not feasible for me to swap models in and out all the time based on what I think of the market, so I push them to a place better able to tolerate autopilot. I aimed to do likewise in Designer Models; my goal is sustainability, not to win a performance contest in any particular month.)

So I have to disagree about well-designed models performing better. Well designed models are those that are more proficient in producing stocks consistent with one’s goal, but no amount of model design can influence what the market wants to favor or not favor in any particular time period.

Moving up to question 2, it’s a similar answer. Always pick the model that best delivers what one wants. One can always want the best performance, but that often leads to anger or disappointment because it often doesn’t match what the real world can deliver. And the longer it takes for Mr. Market to switch gears, the more shocking it is to those who got caught. (This is why, a generation ago, when mutual fund families got big, they worked so hard, often in vain, to discourage the public and fund newsletters from chasing the highest-performing funds.) It’s the goal that counts. Here’s a link to an article from a year or so ago in which I actually wrote up and chose a model that tested badly (and explained why).
https://seekingalpha.com/article/4185747-screening-shield-reit-yield-hogs-butchers-knife?source=all_articles_title

As to the first question, item “d” alone seals the deal. If that occurs, ditch the model without even bothering to define over-optimized. Items “a,” “b” and “c” are things that should prompt serious thought about over-optimizing, although none are silver bullets. Even slippage: for liquid stocks, we need not even bother, except insofar as we use slippage as a proxy for commissions, because bid-ask spread differentials vaporize over the course of even weekly hold periods. For illiquid stocks, real-world slippage can be much higher than any p123 member would ever assume in a model. In the real world, any alpha above zero makes the portfolio manager a hero, so in a sim, where even the best designers inevitably have some 20-20 hindsight, I’d say any alpha above 3-5% is suspect; the higher the alpha, the bigger the danger. A small number of holdings is definitely a danger factor (but we have to be sure to define “small” differently for stocks and ETFs; for the latter, 5 positions can be too many).

I defer to Marc and others on this as mean reversion is a Finance term.

And Marc is making no claim that he is using any statistical theory to shape his opinion here. I think he would agree with me on that.

-Jim

Hi Marc, I’ve seen you mention this several times. I tend to express this sentiment in terms of lower liquidity (an “illiquidity premium”), but the end result is very similar to tilting models towards smaller caps. I would expect larger companies to be priced more efficiently, and it seems to make sense that individual investors searching for alpha would generally have an advantage focusing on smaller companies, where larger amounts of money are relatively disadvantaged. I register what you say about considering and managing risk, but I also feel like it makes sense to focus the search in areas with reduced liquidity.

Regarding small caps and microcaps, I agree with Michael (but you already knew that). My reasons are spelled out in this article https://seekingalpha.com/article/4078881-invest-microcaps and Jim O’Shaughnessy’s are spelled out in this one https://osam.com/pdf/Commentary_TrueMicroCapStrategy_Mar-2016.pdf . . . Small caps are far riskier than large caps. If you’re throwing darts or letting a monkey choose your stocks, by all means invest in large caps. Over the last three years or so small caps have fared terribly compared to large caps, and especially over the last twelve months. These periods of underperformance happen, and they happen a lot, and the pain can be protracted. But that doesn’t invalidate my reasoning, nor O’Shaughnessy’s.

Regarding choosing models to invest in, I think we should do so probabilistically. Marc’s advice/example about an “all-weather” model is a great example. Given the high probability that we will be wrong in prognosticating about the future market environment, it’s best to invest in a model that will weather a variety of different market environments.

But we can take that probabilistic approach and apply it to backtested returns too. If the backtested returns of strategy A are superior to those of strategy B when tested across a variety of market environments (insofar as that is possible) and on a variety of partial universes and with a variety of position sizes, and if the two strategies are equally well-designed, then the probability of strategy A outperforming strategy B is higher than the probability of strategy B outperforming strategy A. That is, unless we believe that strategies are like mutual funds or sectors, for which time periods are negatively correlated–i.e. those that perform the worst in one time period will outperform in another.

No matter what we believe, we must acknowledge that strategies can be well or poorly designed, and that we want to avoid poorly designed strategies at all costs. We must also acknowledge that backtesting strategies is not the only way to tell how well designed a strategy is. That requires a sound knowledge of finance–and a lot of other considerations besides, including how the strategy was created, whether each of its components is sensible, and whether it takes into account the kinds of problems that automated strategies often fall prey to–excessive churn, for instance, or taking for granted the ability to quickly purchase and dispose of illiquid stocks.

My own tendency is to let backtesting answer too many questions. It’s a failing of mine, and one of the reasons for this post is to try to cure myself of it.

This is fine IF . . . and I really do mean IF with all-caps . . . you understand and really choose to take on the higher risk.

My experience is that things like this are easy to say in conversation, but a lot harder to live with when the you-know-what hits the fan. This, in fact, is probably at the root of the implosion of the original designer models: everybody was excited to chase the returns at the high-risk end of the spectrum, but when the market favored risk-off, the models imploded and the subscribers ran for the hills.

P123 folks are not alone in having trouble accepting the full reality of their choices. I know seasoned pros who say they want low correlation to the market, but when the portfolio goes down on market up days, they point fingers and start carrying on, forgetting that low correlation means low correlation, not perpetual positive returns. Human nature can be brutal and can unravel almost anything we do in modeling.

So if you want to go to the high-risk, high-reward end of the spectrum, go for it. But make sure you accept that outsized losses are part of the game. That’s an easy thing to have lost sight of, given that so many years’ worth of falling interest rates made the market very tolerant, for the most part, of risk, resulting in great returns. And with POTUS breathing down Jay Powell’s neck on rates, the prospect of reversal does not appear immediate. But keep following the markets, because you really don’t want to be caught by surprise if the market takes a different stance.

Correct. A fancier phrase I’ve heard that describes what I’m leaning on is “domain knowledge.”

Agree 100%. Glad to have your experience (not to mention the degree).

And considering regression-to-the-mean is a totally different concept, I have to say I am glad you are not part of the cult that claims to believe in statistics while denying anything taught in statistics 101 is true.

-Jim

Here’s a cure:

Create this screen and backtest from 4/1/16 to 1/1/18, test it out of sample from 1/1/18 to 11/1/18, and then assume you invested real money.

Ticker(“aapl”)

Is that a stupid screen? Yes. Of course it is. But we can’t tell from the backtest. We can’t even tell when we follow an in-sample test with an out-of-sample follow-up. The only way we can know the model is dumb is by looking at it and seeing that it’s bad.

Nothing changes just because instead of Ticker(“aapl”), we use a ranking system with lots of factors each one of which looks credible on its own. The backtest can only tell us what happened during the sample period(s) we examined and Lady Luck may allow an out-of-sample follow-up to also look good. There’s no substitute for looking at a model and knowing if it makes sense.

I’ve actually implemented a more bar-belled approach. I don’t know if it’s the best approach, but it’s what I’ve allocated. I’m using preferreds and utilities as the expected lower-volatility side of things, and mostly p123 models for the rest of the equity. The equities are a mix of large and small caps (I’m constantly evolving models and have holdings spread across several, so it’s inaccurate to say I’m following a single particular system), but the underlying learning from working with the “books” tool is that a mix of a specific lower-volatility portion of a book with a more aggressive equity portion (including a small-cap model designed to have relatively lower vol) seems to produce better Sharpes than an all-purpose equity model designed to have similar volatility. I’d be curious about your thoughts on this. I don’t know if it’s sound thinking or not, but in testing, the barbell approach seemed to have merits (one part designed for lower risk, another part focused on high Sharpe/alpha). I admit it is frustrating to hold the lower-risk stuff when the market is ripping, but it’s nice during periods of drawdown to see the lower-risk stuff buoy the overall portfolio.
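For what it’s worth, here is the arithmetic of why the mix can win on Sharpe. The numbers below are entirely made up, just to show the mechanics; the input doing the work is the low correlation between the sleeves:

```python
import numpy as np

# Hypothetical annualized excess returns and vols for the two sleeves
# (low-vol income sleeve, aggressive equity sleeve) -- illustrative only.
mu  = np.array([0.05, 0.12])
vol = np.array([0.06, 0.20])
rho = 0.3                        # assumed correlation between the sleeves
w   = np.array([0.5, 0.5])       # book weights

cov = np.array([[vol[0]**2,             rho * vol[0] * vol[1]],
                [rho * vol[0] * vol[1], vol[1]**2            ]])

blend_mu  = w @ mu
blend_vol = np.sqrt(w @ cov @ w)
print(f"blend:        return {blend_mu:.2%}, vol {blend_vol:.2%}, Sharpe {blend_mu/blend_vol:.2f}")

# A single "middle of the road" model with a similar expected return
single_mu, single_vol = 0.085, 0.14
print(f"single model: return {single_mu:.2%}, vol {single_vol:.2%}, Sharpe {single_mu/single_vol:.2f}")
```

As long as the sleeves aren’t highly correlated, the blend’s volatility comes in below the weighted average of the two, which is the whole case for the barbell versus one middle-of-the-road model.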

Strong—albeit anecdotal—evidence for regression-to-the-mean (the statistical variety).

Marc, your “domain knowledge” (and more) takes you a long way.

-Jim

Yuval, specifically to your points I don’t have much good input, because many of my answers would be “I don’t know,” but I’ll share this:

  • I do try to look into the companies the models are selecting to try to understand the business realities. This includes reading the conference calls, which I find valuable. Sometimes I really can’t understand why my models pick some companies, and I can try to adjust to get more of what I want going forward in model iterations or new designs. Perhaps I’m trying to incorporate some of my own conference-call “sentiment” into the process itself. Different conference calls definitely leave different impressions on me.
  • Currently I note that many of my models have a tendency to rate highly both staffing-service companies (TBI, KELYA, KFRC, and others) and office-furniture companies (no matter what I do, it’s hard to make stuff like KBAL, MLHR, etc. go away; they just don’t seem like very good businesses to me). It makes me realize maybe I have some economic-cycle-peak concerns I could try to address going forward, and I suspect these industries would get slammed hard in an economic downturn. I also tend to get lots of companies (more than I would expect) with lower ROA, ROI and ROE. (High ROI or ROE in and of itself is not a good predictor, but I find it odd that so many “meh” companies that I wouldn’t want to hold for long periods can be decent trading vehicles; I still haven’t come to an understanding of that one way or another.) I suppose this may not be an indication of an over-optimization problem (maybe it is, though?), but it is a recognition of the deviation between the type of company I’d like to see the model selecting and what it actually selects.
  • So I guess, to answer question #2 on which system I would prefer: I really would prefer a system that picks stocks that resonate with my idea of investments, with more of the characteristics I’d want to buy absent the model itself. I find myself actively trying to get away from potential dissonance. There are lots of ways to put factors together to get good models, and I’m finding that alignment of model output with internal preferences is apparently an important part of it for me.

The problem with this is that one of the main reasons your models are selecting these companies is their share price (unless your model doesn’t include value factors). And the reason their share price is so low may be the same reason that the conference calls are not very impressive. Here’s an experiment you might want to try. Choose thirty companies that your model would have chosen a year ago and read their conference calls from shortly before that date. See if those calls give you any indication as to whether they were good companies to buy or not, and compare that to the actual price change over the next few months.

I get a lot of the same companies, Michael, and I think it’s because a company like KELYA is simply a superb value no matter how you look at it. It’s massively underpriced by every measure. Whether or not it’s a value trap is a different question. One of my long-term projects is to design industry-specific ranking systems that would tell me whether KELYA is indeed a better buy than the rest of the companies in the GICS=2020 industry group. As for ROA, low ROA is a leading indicator of high earnings growth if a company is good at converting assets to earnings. I actually look for companies with low ROA for that reason. They’re more likely to give high earnings growth numbers going forward than high ROA companies are.

What a great response. I like that idea a lot.

Perhaps the best strategy is Loss Avoidance, and otherwise just keep up with the benchmark. That’s why you need a good market timer.

A simple 4-model market timer will do. When 3 of the timers are in stocks, invest in stocks; otherwise go to fixed income. Here are links to 4 timers which are easy to reconstruct in Excel. We also report them weekly/monthly at iMarketSignals. Attached is the model RSP-UST, showing a 20% annualized return with a max D/D of -20% and only 15 realized trades from 2000 to 2019 (how boring is that?).

Inflation:

Golden Cross:

Coppock Indicator:

Google Trends:
https://imarketsignals.com/2018/timing-market-google-trends/
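The 3-of-4 voting logic itself is trivial to reproduce. A minimal sketch (the boolean inputs stand for whatever your spreadsheet computes for the four indicators above; this is not iMarketSignals code):

```python
def composite_timer(inflation_on, golden_cross_on, coppock_on, trends_on, threshold=3):
    """Each argument is True when that timer currently says 'be in stocks'.
    Hold stocks when at least `threshold` of the four agree; otherwise
    hold fixed income."""
    votes = sum([inflation_on, golden_cross_on, coppock_on, trends_on])
    return "stocks" if votes >= threshold else "fixed income"

# Example: three of the four timers bullish -> stocks
print(composite_timer(True, True, False, True))
```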


A good reverse-engineering exercise I’ve found is to take a company I really like or admire and see if I can build a robust ranking system from the ground up that puts it in a really high percentile bucket. It really helps you get your head around what makes it a great company and how you can find similar ones. I recently did this with LULU. I mean, you start with a basic look at the snapshot and see their astronomical ROA and ROI compared to the industry. So you just start from there and keep adding factors to the rank and building it out, and see if you can keep it in the 99% bucket of your rank as you keep adding more factors. In LULU’s case, I can build a pretty extensive 20-25 factor quality-based ranking that keeps it at a 95% rank … you start to look at other companies ranked similarly in that rank, and you notice the market has really been rewarding similar companies in the last few years. Checking the performance of the ranking, I would have done very well with a 20-holding port of these companies over the last 2-3 years. But if I start adding value factors into the rank, it all goes to pot. :slight_smile:
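Outside of P123, the mechanics of that check are easy to mock up. A rough pandas version (the tickers other than LULU, the column names, the values and the weights are all invented; percentile ranking here is a crude stand-in for P123’s bucketed ranks):

```python
import pandas as pd

# Hypothetical factor table: one row per stock, higher raw value = better.
df = pd.DataFrame({
    "ticker":   ["LULU", "AAA", "BBB", "CCC"],
    "roa":      [0.25, 0.05, 0.10, 0.02],
    "roi":      [0.30, 0.06, 0.12, 0.03],
    "gm_trend": [0.04, 0.00, 0.01, -0.02],
}).set_index("ticker")

weights = {"roa": 0.4, "roi": 0.4, "gm_trend": 0.2}   # made-up factor weights

pct = df.rank(pct=True)                               # percentile-rank each factor
composite = sum(pct[col] * w for col, w in weights.items())
print(composite.rank(pct=True).sort_values(ascending=False))
```

Then, as described above, you keep adding factors and watch whether the company you’re reverse-engineering stays near the top of the composite.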

The staffing companies tend to be great cash generators and I generally like them as businesses, with the caveat that they will be the tip of the spear during an economic downturn. I worry that my models may be picking up on these economically sensitive sectors and overvaluing them vs. a market that’s perhaps more appropriately discounting them based on the stage of the economic cycle. For example: industry leader RHI (Robert Half) has a history of 50-60%+ declines during recessions, and about 40% even during minor slumps like 2015-2016.

Normalizing factors over longer periods of time might help smooth these types of scenarios, but I generally find longer-term data far less useful when modeling (it can hurt more than it helps; more recent data is usually a more powerful signal). So I just kind of keep these kinds of things in my head, but I’m not sure what to do with it.

I will have to try that. Do you recall if the model worked well over longer periods of time? Did the factors you selected also make “sense” to you, or was it more like “why does the market care about this, but not that?”