backtests, mean reversion, and factor persistence

If you believe that your ranking system should reflect the most recent developments in the market, then it would make sense to run your backtests covering a relatively recent period, say just the last five years. If you believe that some factors revert to the mean and others stick around, then it would make sense to run your backtests covering the entire seventeen years of data available at P123, or perhaps even exclude the last five years.

I did a little experiment. Using 30 factors that I know work in general, I developed a ranking system optimized for just the period 4/2006 to 4/2011 and another one optimized for the period 1/1999 to 4/2011. They were quite similar in many ways: each contained 16 or 17 factors, the most heavily weighted one was growth in operating income, and they both used the same universe. But some of the factors were quite different between the two systems. I then tested both systems to see how well they functioned since 5/2011. Both performed about half as well as they had before that time, but the one optimized for just the prior five-year period outperformed the one optimized for the more-than-twelve-year period by almost 5%.
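For anyone who wants to try the same comparison, here is a minimal sketch of the bookkeeping. The CSV file, the column layout, and the toy mean-return “optimizer” are all stand-ins for whatever ranking optimization you actually use, not P123 functionality:

```python
# Fit factor weights on two in-sample windows, then compare both
# weightings on the same out-of-sample window (5/2011 onward).
import pandas as pd

def fit_weights(window: pd.DataFrame) -> pd.Series:
    # Crude stand-in for an optimizer: weight factors by in-sample
    # mean return, dropping the losers.
    mu = window.mean().clip(lower=0)
    return mu / mu.sum()

# One column of monthly long-short returns per factor (hypothetical file).
rets = pd.read_csv("factor_returns.csv", index_col=0, parse_dates=True)

w_short = fit_weights(rets.loc["2006-04":"2011-04"])  # 5-year window
w_long  = fit_weights(rets.loc["1999-01":"2011-04"])  # 12+ year window

oos = rets.loc["2011-05":]  # out of sample
print("5-yr-optimized OOS mean: ", (oos @ w_short).mean())
print("12-yr-optimized OOS mean:", (oos @ w_long).mean())
```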

I was wondering if anyone else has performed similar backtests to ascertain whether it’s better to backtest just a few years in the recent past or whether it’s better to backtest over a longer time. I’ve noticed a huge difference in the efficacy of certain factors over the course of the last seventeen years. And a ranking system optimized for the last five years will look very different from one optimized for the last seventeen. What I worry about is that a factor that worked very well in, say, the 1999 to 2008 period and not very well since will, because of its efficacy in the earlier period, be a major part of a system that is optimized over the long run, and will simply fail to work in the near future since it hasn’t worked in the near past. After all, the market today is in many ways more similar to the market of 2013 than it is to the market of 2001, considering the rise in ETFs (especially leveraged and inverse ones), the number of investors using quantitative approaches rather than relying on analysts and brokers, the ready availability of market data, the ease of making quick and cheap transactions, the increase in volatility, the shorter-term investment horizon, and so on. Perhaps most significantly, I think market behavior has fundamentally changed because of whatever lessons folks have taken away from the dot-com and housing bubbles.

Yuval - how does no optimization compare to your results? i.e. equal weight all 30 or so factors…
Steve

Good question. Much, much worse in every period. That’s in large part because I have a lot more “quality” factors than growth or value factors. The growth and value factors seem to work well in all periods but the quality factors seem to vary greatly over time. In other words, an even balance will render the most effective factors insignificant.

Yuval, you raise some good points. As you also said, I think it depends on the factors you are using.

Personally, I tend to feel more comfortable with factors that make lots of sense and have worked over a very long period of time, like various value factors. I think the risk of whipsaws is too big: if you see a factor that “has worked” recently and it then stops working for a few months, how do you distinguish between a normal drawdown and a strategy that no longer works?

With regards to value factors, I have not found a reliable way to time them, other than looking at the spreads, but even this is not that helpful other than for very long periods of time.

For other factors, such as momentum, I think it makes sense to use them after it is clear a bull market has started, and almost until the end, when you start seeing divergences, weaker market internals, etc., as momentum stocks tend to do well until the end. In a bear market, or in the early stages of a bull market, though, I would not use momentum, since it does not make much sense to me. I see momentum’s logic as the human psychology of crowds: buying what has gone up, in a confident environment.

This is why I like combining various factor models and applying them tactically based on market conditions. For value strategies, I prefer to be invested all of the time though since I have not found a reliable way to time them, hedging the market risk when appropriate (like now).
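A minimal sketch of that tactical gating, with the caveat that “a bull market has started” is proxied here by a 200-day moving average filter (my stand-in, not the poster’s actual signal, which also weighs divergences and market internals):

```python
# Gate the momentum sleeve on a simple bull-market proxy: the benchmark
# trading above its 200-day moving average.
import pandas as pd

def momentum_allowed(benchmark_close: pd.Series) -> pd.Series:
    ma200 = benchmark_close.rolling(200).mean()
    return benchmark_close > ma200  # True -> OK to use momentum factors

# Usage: zero out the momentum weight whenever the regime filter is off,
# e.g. weights["momentum"] = weights["momentum"].where(momentum_allowed(spx), 0.0)
```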

This – using factors just because they “worked” – is necessarily a dangerous game unless they are supported by sufficient intuition to have allowed you to know they would work even before you ran your first test. That said, if one is determined to work in an empirical manner, you tilt the probabilities in your favor if you stick with recent sample periods and stay nimble (continually test and refine as you go along). This way, at least, you reduce the probability of getting burned by data mining. Empirical models must, will, and do fail when the market environment changes. If you stay recent, you at least improve the odds that your live environment will resemble your sample environment closely enough to let you get by for a while. (It’s like the difference between driving drunk in an empty parking lot versus on a busy Interstate.)

See attached pdf for a more detailed discussion of the nature of investment oriented testing.

When you go back to 1/2/99, you raise the risk that your conclusions will have been infected by environments significantly and structurally different from the present. While I have presented sims going back to 1999 for presentation/publication purposes, I will never green light a model for real-money use if I’m not satisfied with what I’ve seen over the past 5 and fewer year periods.

A series of models I’m working on now underperformed dismally in 1999 and the early 2000s and so be it. The models make sense and post 2003 testing indicates the specification is reasonable. The market back then didn’t make sense, and those who outperformed back then got brutalized shortly afterward and many never recovered.

The MAX test is really best suited for academic types looking to discover universal truths and who don’t really need to think about forward live performance. On that level, if I were not going to commit funds, then yes, I’d have to account for the early-period nature of the market. But if real money is on the line, no way.


Effective Testing on Portfolio123.pdf (110 KB)

Marc,

Just wanted to say I agree, though I would use different wording. Admittedly, my wording would not be as likely to be published, as yours has been.

I would call this a Bayesian Prior with a high probability.

This is just Bayesian updating.

I read a book on backtesting once; I could probably find the title, but it was a long time ago. It actually had some pretty sound math, but he forgot to use Bayesian Priors. He seriously concluded that the market was related to the phases of the moon. Really good p-values. But what would be the Bayesian Prior on that?

Also, if he continued to update his statistics, as you suggest, he probably would have abandoned his theory. Hopefully, sooner rather than later.

That having been said, I do not know what factors Yuval is using (with one exception). This is no comment on his work.

Marc - this is ENORMOUSLY helpful, particularly with regards to how far back I should test, and disregarding the earlier years. And I love the PDF.

I certainly don’t even TEST factors unless I know they have a good probability of working from things that I’ve read and researched. And then, with a couple of exceptions, I don’t include anything in my ranking factors unless I can see from using the performance ranking ALONE (without other factors) that it works–i.e. that breaking it into deciles results in a generally upward bar graph with a decent standard deviation.
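If it helps anyone, that decile check can also be run outside the platform. A rough sketch, assuming a panel with one row per stock per date (the column names are mine):

```python
# Rank a single factor into deciles within each date, then average the
# next period's returns per decile; the factor "works" if the averages
# step generally upward from decile 1 to decile 10.
import pandas as pd

df = pd.read_csv("factor_panel.csv")  # columns: date, ticker, factor, fwd_ret

def decile_profile(panel: pd.DataFrame) -> pd.Series:
    panel = panel.copy()
    panel["decile"] = panel.groupby("date")["factor"].transform(
        lambda x: pd.qcut(x, 10, labels=False, duplicates="drop") + 1
    )
    return panel.groupby("decile")["fwd_ret"].mean()

print(decile_profile(df))
```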

For me, at least, this is really one of the most helpful things you’ve written. Thanks a million.

Yuval - if you have the time and the desire, you should try walk-forward testing. Start by optimizing the period 2000-2004 and test 2005, then optimize 2001-2005 and test 2006, and so forth up to 2015. After you do this, compare the yearly results to no optimization, or to optimization over one sample period.
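A sketch of that loop, reusing the toy mean-return “optimizer” idea from earlier in the thread (again assuming a hypothetical CSV of monthly factor returns, not P123’s machinery):

```python
# Walk-forward: re-optimize on each rolling 5-year window, then score
# the following year out of sample.
import pandas as pd

rets = pd.read_csv("factor_returns.csv", index_col=0, parse_dates=True)

def fit_weights(window: pd.DataFrame) -> pd.Series:
    mu = window.mean().clip(lower=0)
    return mu / mu.sum()

for test_year in range(2005, 2016):
    train = rets.loc[f"{test_year - 5}":f"{test_year - 1}"]
    test = rets.loc[str(test_year)]
    w = fit_weights(train)
    print(test_year, "OOS mean monthly return:", (test @ w).mean())
```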

Steve

Hi Yuval,

I did similar research on factor persistence a few years ago.

What I found was that:

  1. What worked in the beginning of a bull market (after the first bounce off the bottom) tended to keep working for the rest of that same bull market.
  2. However, the correlation between what worked in one bull market and the next is a little weaker. This is especially true for momentum factors. (One way to measure this is sketched just after this list.)
  3. The correlation between what works in a bull market and what works in a bear market is very weak. In fact, very few factors “work” at all during bear markets; and even if a factor happens to predict which stocks will do better during one bear market, it does not necessarily predict them in the next.
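For what it’s worth, one simple way to put a number on points 2 and 3 is a Spearman rank correlation of per-factor performance between two periods. A sketch, with illustrative period boundaries and an assumed table of monthly factor returns:

```python
# Rank factors by mean return in each of two periods, then correlate
# the ranks: high correlation = persistent factor leadership.
import pandas as pd

rets = pd.read_csv("factor_returns.csv", index_col=0, parse_dates=True)

def factor_persistence(rets, period_a, period_b):
    perf_a = rets.loc[period_a[0]:period_a[1]].mean()
    perf_b = rets.loc[period_b[0]:period_b[1]].mean()
    return perf_a.corr(perf_b, method="spearman")

# e.g. one bull market vs. the next (dates are illustrative only):
print(factor_persistence(rets, ("2003-04", "2007-10"), ("2009-03", "2013-03")))
```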

Something else to keep in mind is that the pre-2004 data was a bit different in a few important ways: short interest and some fundamentals were NAs, and prices were sometimes estimated.

Chaim

Jim - back in the '90s there was a three-year methodology showdown including some of the biggest names in technical analysis. They included George Lane (Stochastics), Glenn Neely (Elliott Wave or “NeoWave”), Gary Smith (Tape Reading), Gerald Appel (MACD/Time Trend), and Peter Eliades (Cycles). This competition was for trading the S&P 500 index only.

Note the name of the winner.

Do you know what his methodology was?

Astro Harmonics.

Take care
Steve

If one confines oneself to backtesting just the last five years, the value factors don’t work nearly as well as if one takes into account a longer period. From Marc’s and Steve’s remarks, it seems that for strong performance in the near future, we should concentrate mainly on growth and quality, with only a little value in the mix. However, if we extend our backtesting to ten years, which includes the huge gains made in the early period of the bull market that started in 3/2009, value becomes a major factor. Any thoughts on this?

Yuval,

IMHO it depends. Here is a very simple 4-function value momentum strategy. You probably created it (with different weights) at some point yourself. I think it might be characteristic of what you are talking about. I’m not recommending it.

It did very well from the bottom in 3/2009, rising about 900% up until 7/2014.

Whether it does well from here depends upon whether we are coming off a new relative bottom. It could do well if that was a bottom, or not. Depends.
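For concreteness, a generic stand-in for a simple four-factor value/momentum rank might look like the following; the factor choices and equal weights are guesses for illustration, not the actual system:

```python
# Composite rank: average of four percentile ranks, two value and two
# momentum; higher composite = better.
import pandas as pd

def rank_pct(s: pd.Series) -> pd.Series:
    return s.rank(pct=True) * 100  # 0-100, higher raw value = higher rank

def value_momentum_rank(df: pd.DataFrame) -> pd.Series:
    # Assumed columns: earnings_yield, fcf_yield, ret_26wk, ret_52wk
    return (rank_pct(df["earnings_yield"])   # value
            + rank_pct(df["fcf_yield"])      # value
            + rank_pct(df["ret_26wk"])       # momentum
            + rank_pct(df["ret_52wk"])) / 4  # momentum
```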


Yuval - it could take years for value to recover, or value could start outperforming tomorrow. There is no real way of knowing. That is one reason why I suggested walk forward testing. It will give you some idea how well a periodically re-optimized system could perform.

Steve

Why not concentrate on factors that have worked (almost) all the time since 1870?

The thing is, for example, value works and will work in the future; the same goes for price momentum, earnings momentum, and size (small caps outperforming big caps).

All this has been tested by academia, and it outperforms. Yes, there were times when momentum tanked, but all in all it worked just fine.

And all those factors make sense too!

So why not just combine the stuff that has worked and not bother about the flow of the market?

That is the way I do it. KISS is important: keep it simple.

I just do not believe that most of us can figure out the market. We might be right about the flow, we might be wrong, and I believe that we are wrong more often than we are right.

Yuval,

Interesting to see that you found about a 50% drop-off out of sample. That is about the same as what I found across 30 different multifactor models. A data summary can be found here:

https://www.portfolio123.com/mvnforum/viewthread_thread,9680#52134

Factors cycle. That is just the nature of the market. Sometimes value works better, sometimes growth works better. My experiment was to see if factors/systems that are working well continue to work well, and for how long. Based on my work, I blend 50% 6-month and 50% 4-year performance. That works best for 4-week timeframes using P123 data, but the predictive power still isn’t great. To put numbers to it, the best system in my study is only in the top half of performance 60% of the time and is only in the top 10% of performance 17% of the time. It also spent 12% of the time as the worst screen of the 30. Using 6-month/4-year performance increases returns from an average of 2% to almost 3% per month. That sounds great until you see that the top 3 systems return 12% per month! Plus, even though you are averaging 3% a month under that scenario, you still lose money 35% of the time. Isn’t the stock market great?
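A sketch of that 50/50 blend, assuming a table of monthly returns with one column per candidate system (the rolling windows and the pick-top-3 step are illustrative):

```python
# Score each system by the average of its trailing 6-month and trailing
# 4-year return percentile ranks, then pick the top few each month.
import pandas as pd

sys_rets = pd.read_csv("system_returns.csv", index_col=0, parse_dates=True)

trailing_6m = sys_rets.rolling(6).sum()   # ~6-month performance
trailing_4y = sys_rets.rolling(48).sum()  # ~4-year performance

blend = (0.5 * trailing_6m.rank(axis=1, pct=True)
         + 0.5 * trailing_4y.rank(axis=1, pct=True))

# Each month, the 3 systems with the best blended score.
top3 = blend.apply(lambda row: list(row.nlargest(3).index), axis=1)
print(top3.tail())
```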

The moral of the story is to build a good system based on fundamentals, test it over a decent amount of time, stick to it, and just be ready to go through periods of underperformance, because it’s going to happen.

Well, you see, I keep running across or finding factors that improve my results. I hadn’t thought, for example, to try year-to-year free cash flow growth earlier, or capital turnover (sales to equity ratio), or a percentage decrease in shares over time. All those work well. In addition, using the accrual ratio improved my performance a lot by making more “turnaround” stocks (stocks with negative but improving earnings) come into my portfolio. So with all these developments, I feel like I need to set the best parameters for my backtests. Yes, the usual growth and value measures work and will always work in the long run (and I’m constitutionally allergic to momentum). But improvements in quality make a huge difference. And for some reason some of the factors that have worked very well for quality over the last five years (return on capital, for instance) did not work as well earlier. So time-frame parameters are hugely important to me. Anyway, thanks for all the opinions and advice.
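For reference, two of the ratios mentioned above, as commonly defined; definitions vary, and these are not necessarily P123’s exact formulas:

```python
def capital_turnover(sales_ttm: float, common_equity: float) -> float:
    # Sales-to-equity ratio; higher generally indicates more efficient
    # use of shareholder capital.
    return sales_ttm / common_equity

def accrual_ratio(net_income: float, cfo: float, total_assets: float) -> float:
    # Cash-flow-statement version: earnings not backed by operating cash
    # flow, scaled by assets. Lower is generally considered better.
    return (net_income - cfo) / total_assets
```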

This is an area in which quant techniques may be elbowing into a field without truly understanding it.

Value, buying stocks priced favorably relative to what they are worth, always works. It must work. It is the only thing that works. It is the only thing that can possibly ever work. It is a 100% certainty.

That is not the same as a strategy that buys low PE stocks, or low PB, low PFCF, low EV2EBITDA, etc. This is a completely different thing because it looks at ratios without reference to what the stock is actually worth. It might work. It might not work. Whether it works or doesn’t work depends not on whether PE (to stick with one ratio) is low but whether it is low relative to 1/(R-G). Fama French and other quant researchers measure the former. They do not address the latter. That’s why these confusing conversations (value worked when xxx and not under yyy) tend to arise. What the big-sample Fama French findings really say is that there’s an information arbitrage opportunity. Other academicians fleshed it out and demonstrated the market’s tendency to excessively extrapolate; i.e., investors presume high G in the past will mean high G in the future and vice versa; ditto with other factors. Low PE strategies work when it turns out the market was unjustifiably pessimistic about future G.
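To make the 1/(R-G) reference concrete: it comes out of the constant-growth dividend discount model, so a “justified” P/E can be computed directly (the numbers below are made up for illustration):

```python
# Gordon growth: P0 = D1 / (R - G). With D1 = E1 * payout, the justified
# forward P/E is payout / (R - G), which reduces to 1 / (R - G) when all
# earnings are paid out.
def justified_pe(payout: float, r: float, g: float) -> float:
    assert r > g, "model requires R > G"
    return payout / (r - g)

# Illustrative only: full payout, R = 10%, G = 4% -> justified P/E ~16.7.
# The same stock trading at a P/E of 10 would be "cheap" in this sense.
print(round(justified_pe(1.0, 0.10, 0.04), 1))
```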

It might be supposed that as PEs fall across the board, low PE stocks should still be better than high PE stocks. But we have to remember that low PE presumes weaker companies (higher risk and/or lower G). The risk component alone may be causing lower PE stocks to move more out of favor. (We can never assume the world impacts one factor and leaves everything else alone.)

So where are we now? Will low G companies really remain low G and/or become even lower-G companies? Or will rates rise due to strong economic activity that causes G to strengthen, leaving now bloodied value investors looking down from the top of the mountain two or three years hence laughing at everybody else?

That may be going too far. But even though Value stinks now and may do so for a while longer, I’m not sure I buy into Mr. Market’s assumption that G and Risk are getting worse for low PE companies than for others. So I have not taken funds out of any of my strategies that are being pulled back by Value’s recent and current traumas.

The challenge, though, is how to evaluate tests of Value models. The benchmark is the problem. We have some S&P Value Indexes, but if you use those, make a mental adjustment to add the impact of dividends to the benchmarks. What I’m doing instead is creating my own value benchmarks as Custom Series and using them; they have the added virtue of being equally weighted, just like our models, so the size effect is removed. For now, I download the custom series and do my comparisons in Excel. But I’m saving the custom series because this is likely to be the path to our being able to create on-platform custom benchmarks.
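The Excel step could equally be done in pandas once the series is downloaded; a sketch, assuming two CSVs of daily values with a date index (the file and column names are mine):

```python
import pandas as pd

model = pd.read_csv("value_model.csv",
                    index_col=0, parse_dates=True)["value"]
bench = pd.read_csv("value_benchmark_series.csv",
                    index_col=0, parse_dates=True)["value"]

# Daily excess return of the model over the equal-weighted value benchmark.
excess = (model.pct_change() - bench.pct_change()).dropna()

print("Cumulative excess vs. value benchmark:", (1 + excess).prod() - 1)
```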

It’s probably unrealistic to expect a value strategy to beat the market now. But I should, at least, expect it to do well versus a proper value benchmark. Beyond that, watch (for signs that G and Risk are deteriorating more for lower PE stocks) and wait (no legitimate strategy can win all the time; sometimes you just have to suck it up and go with your convictions, so long as you do the watching to verify the ongoing reasonableness of those convictions).

And Marc has given us an idea of what it depends upon.

Marc - I cannot say how many times I have thought what you just said. I know I am in the habit of saying something is a value company just because it has a low value ratio, too. But one has to determine whether that low value ratio is justified or not. It could be a really bad company, still be overvalued, and be headed to zero. I think I am accurately rephrasing part of what you said.

BTW, when I read Fama’s papers he seems to always mention the DDM in the introduction. He certainly seems to agree with it even if his implementation of it is not a complete work-through of the equation.

I suggest that this should be amended as follows: change “what they are worth” to “what they will be worth.”

Not to knock value–I believe that one should buy and sell based only on fundamentals, not on market movements–but I look at it slightly differently.

Look at the stock market for a moment without regard to dividends and buybacks and mergers and acquisitions, without regard to a buy-and-hold-for-years strategy. A stock’s price goes up when investors are buying it and goes down when they’re selling. So a stock’s price is based entirely on investor sentiment, and it changes when that sentiment changes. And what will cause that sentiment to change? A lot of things, but one thing is actual company performance that is higher or lower than expected. And that is what I am trying to get at through judicious use of the information in earnings statements that P123 provides.

A strategy that doesn’t take into account the price at which a stock is bought in comparison to some measure of value can indeed work in the long run if it’s based on strong fundamentals including measures of growth and quality, and if it sticks to low-volume stocks. How is this possible? Because there are lots of really high-quality under-the-radar companies with strong and stable growth, and the strategy is to buy stock in those companies before mainstream investors do, regardless of their price. Now the performance of this strategy, in my opinion, won’t be nearly as good as one that DOES take some measure of value (price) into account. But it’ll still work in the long run if you’re a small investor like myself (investing millions of dollars in this might cause slippage to overwhelm it). It will outperform the R2000 + dividends, which is a good benchmark for under-the-radar companies.

This (including a measure of value) is the strategy I’ve been using since Halloween, with real money, and I’m up 3.5%, compared to the R2000, which is down 4.2%, or the S&P plus dividends, which is down 0.7%. That’s an annualized excess return of 14.2% over the R2000 and 7.2% over the S&P. And that takes into account all my expenses, including my P123 subscription costs.

The question I’m trying to answer is how to weight growth, quality, and value factors in my portfolio, and which of those factors will work best in the near future. And that depends more than anything else on the time period I use for backtesting. The conditions in 2008 and 2009 were extreme. Do I take into account what worked during those years or exclude them? The devil’s in the details.

I heavily agree with this sentiment. Borrowing from Greenblatt, I believe the best factors are ones that don’t work all the time but demonstrably work over an extended time period.

Think about it: if there were factors that worked all the time, they would already be exploited and arb’ed away. The only real edge is staying the course with factors for which you have strong intuition/Bayesian prior (nod to others in this thread), through good times and bad.

Another way to say this is that there are larger-scale structures (aka “regimes”) in the market that cause certain factors to come in and out of fashion. I’m planning to investigate these a bit more, but it’s tough to draw conclusions, as there aren’t enough multi-year samples to test this intuition empirically.

If anyone has any ideas regarding this, I’m all ears.