2 Bad Years: Proven Statistically

If you have had a tough time for the last 2 years, it is not just you.

The Designer Models are inferior to their benchmarks to a highly statistically significant degree, enough so that one could get this published in a peer-reviewed journal showing the folly of retail investors.

Running a one-sample t-test on the 2-year excess returns of all the Designer Models shows that the models have (as a whole) underperformed their benchmarks:

t-score: -9.06 WOW!!! That is close to the odds of getting attacked by a polar bear and a regular bear on the same day. And it feels like that.

p-value < .00001: less than a 1-in-100,000 chance that this is just bad luck.

t-score = sqrt(n) * μ / σ

μ = -16.34 (mean of the excess returns)

σ = 22.31 (standard deviation of the excess returns)

n = 153 (number of models)
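
For anyone who wants to check the arithmetic, here is a minimal sketch in Python (just my own illustration using the summary numbers above; with the raw per-model excess returns you would call scipy's one-sample t-test directly):

```python
# A quick sanity check of the numbers above (my own sketch, using only the
# summary statistics quoted in this post).
import numpy as np
from scipy import stats

mu = -16.34    # mean 2-year excess return of the Designer Models (%)
sigma = 22.31  # standard deviation of the excess returns (%)
n = 153        # number of models

t_score = np.sqrt(n) * mu / sigma
p_value = 2 * stats.t.sf(abs(t_score), df=n - 1)  # two-sided p-value

print(f"t-score = {t_score:.2f}, p-value = {p_value:.2e}")

# With the raw per-model excess returns in hand you could simply call:
# stats.ttest_1samp(excess_returns, popmean=0.0)
```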

Edge over institutions? No. Not for the last 2 years. Were the models ever really ahead?

Maybe we could catch up using bootstrapping, which Yuval (and the whole rest of the world) thinks works. But not just bootstrapping. I think it will take the community using all of the modern tools available (e.g., through Python), and the combined work of the community providing modern models to each other.

I believe there is a market for that. And new people would bring new techniques and ideas to P123.

Yuval, as good as he is, cannot do everything for us. We cannot all pile into his micro-cap model.

We cannot take as long adopting the other modern tools as we have taken with bootstrapping (which most of us are not even using).

Maybe I am wrong. Maybe we can wait and see if the t-score gets worse—and hope it gets better. Maybe you are special and none of this applies to you.

Maybe you can show me a cherry-picked model—which will, of course, prove that we can ignore the numbers. Prove that we have an edge over the institutions after all.

-Jim, a.k.a. “The Grocer” because of all the bagging and stacking I do.

Out of curiosity, what were you using for your benchmarks?

Each designer has chosen their own benchmark. Presumably the benchmark is, for the most part, a good benchmark or one that puts the model in a favorable light. I used excess returns: the return in excess of the benchmark they chose.

Thanks for the good question!!

-Jim

The question in my mind is to what extent this is driven by the performance of two factors - SMB and HML. If you look at the cumulative performance of each of these factors in the published data from AQR since 01/02/2018, they are just abysmal, and I expect a large percentage of the designer models in the ecosystem load heavily on exactly those two factors.


Why does the financial investment industry use different benchmarks?

My thoughts:
1. Portfolio theory likes the idea of using different investment vehicles to give higher risk-adjusted returns for a whole portfolio. OK, that makes sense.
2. Cherry-picking benchmarks makes it easier for the industry to show added value over just buying a low-cost index fund.

I have always thought that the only benchmark that really counts today is SPY/VOO or some other S&P 500 index fund.

Jim, if you ran your analysis using SPY as the benchmark for ALL of the DMs, I wonder what kind of information ratio you would come up with for each?

https://corporatefinanceinstitute.com/resources/knowledge/finance/information-ratio/
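
For what it's worth, here is a rough sketch (mine, not anything built into P123) of how that information ratio could be computed for each model against SPY, assuming monthly return series for the model and the benchmark:

```python
# A rough sketch of an information ratio against a single common benchmark
# such as SPY (my own illustration; inputs are assumed monthly return series).
import numpy as np

def information_ratio(model_returns, benchmark_returns, periods_per_year=12):
    """Annualized mean active return divided by annualized tracking error."""
    active = np.asarray(model_returns) - np.asarray(benchmark_returns)
    mean_active = active.mean() * periods_per_year
    tracking_error = active.std(ddof=1) * np.sqrt(periods_per_year)
    return mean_active / tracking_error
```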

SMB (small minus big)–yes, absolutely. HML (high minus low)–no. That’s a measurement of book-value-to-market-cap and I doubt many designer models relied heavily on that factor.

I think the dire performance is a result of optimization, which I am as guilty of as anyone–and guilty of advocating it too! Not every designer model was built on optimizing–certainly Marc Gerstein doesn’t optimize–but most of the rest of us were doing it with great relish. Marc has told us over and over again that optimizing is a fool’s errand–if only we had listened to him.

I recently did an experiment. I took the AAII screening models and created ranking systems out of them. I then backtested the ranking systems over a ten-year period. Then I tested the out-of-sample performance of them over the subsequent three years. I then combined the best-performing ranking systems. I also optimized the best-performing system and also optimized the combination of the best systems. Here were my out-of-sample results:

Best AAII ranking: 9.12%
Second-best: -3.18%
Third-best: 13.52%
Average of top three: 6.49%
Average of top six: 7.33%
Combination of top three: 11.00%
Combination of top six: 12.25%
Best optimized: 10.78%
Combination of top six optimized: 9.94%

By average, I simply mean the average out-of-sample result; by combination, I mean averaging the weights and including all the factors of the various ranking systems in one new ranking system.
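
If it helps to make "combination" concrete, here is a toy sketch (my own illustration, not Yuval's actual workflow or a P123 feature) of pooling the factors of several ranking systems and averaging their weights:

```python
# Toy illustration of combining ranking systems: pool every factor from the
# source systems, average its weight across them, then renormalize to 100%.
from collections import defaultdict

def combine_ranking_systems(systems):
    """systems: list of dicts mapping factor name -> weight (each summing to 100)."""
    combined = defaultdict(float)
    for system in systems:
        for factor, weight in system.items():
            combined[factor] += weight / len(systems)  # average across systems
    total = sum(combined.values())
    return {factor: round(100 * w / total, 2) for factor, w in combined.items()}

# Two hypothetical ranking systems:
sys_a = {"EPS growth": 50, "Price to sales": 50}
sys_b = {"Price to sales": 40, "ROE": 60}
print(combine_ranking_systems([sys_a, sys_b]))
# {'EPS growth': 25.0, 'Price to sales': 45.0, 'ROE': 30.0}
```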

Now I admit this is only one sample period. It would take many, many hours to repeat the experiment over other sample periods. But the results indicate to me that combining top-performing ranking systems will give me better out-of-sample results than optimizing a ranking system. The decisive thing was that when the combination of the top ranking systems was optimized, it gave me worse out-of-sample results than using the raw combination. The absolute best in-sample result was “Best optimized,” and its out-of-sample result was worse than either of the combinations.

This goes well with an article I’m writing about correlation and probability. It turns out that if in-sample and out-of-sample performance has a very low but above-zero correlation, your bets should be placed not on the best-performing model but on the second or third or fourth best or on some combination thereof. This came as a complete surprise to me, but makes a lot of sense. (And there’s always the possibility that there’s no correlation at all, or a negative one. I don’t believe that myself, but it’s possible.)
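
Here is a quick toy simulation (mine, not from the article) of one facet of that idea: when in-sample and out-of-sample results are only weakly correlated, the in-sample winner is rarely the out-of-sample winner, and a blend of the top few in-sample models gives a much steadier out-of-sample result.

```python
# Toy Monte Carlo: in-sample (IS) and out-of-sample (OOS) performance share
# only a weak correlation; compare betting on the IS winner vs. an IS top-3 blend.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_trials, rho = 20, 10_000, 0.1   # assumed weak IS/OOS correlation

winner_oos, blend_oos, winner_is_oos_best = [], [], 0
for _ in range(n_trials):
    is_perf = rng.standard_normal(n_models)
    # OOS performance = rho * IS performance + independent noise
    oos_perf = rho * is_perf + np.sqrt(1 - rho**2) * rng.standard_normal(n_models)
    order = np.argsort(is_perf)[::-1]             # models ranked by IS result
    winner_oos.append(oos_perf[order[0]])         # bet everything on the IS winner
    blend_oos.append(oos_perf[order[:3]].mean())  # blend the IS top three
    winner_is_oos_best += order[0] == np.argmax(oos_perf)

print(f"P(IS winner is also the OOS winner): {winner_is_oos_best / n_trials:.2%}")
print(f"IS winner   OOS: mean {np.mean(winner_oos):.3f}, std {np.std(winner_oos):.3f}")
print(f"IS top-3 mix OOS: mean {np.mean(blend_oos):.3f}, std {np.std(blend_oos):.3f}")
```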

Now all this doesn’t mean that all my optimizing was for naught. I still ended up with a 32% CAGR in real money over the last four years. I’ve lagged the S&P 500 severely this year, but so has practically everyone, and at least I haven’t lost money (I don’t count a drawdown as losing money since it followed a “drawup”). Optimizing can enable you to discover factors that you never suspected might work, and test combinations of factors, and so on. But you have to be careful with it. And for all I know, bootstrap aggregating may just be another way to optimize except a lot more time-consuming. You’re still trying to find out what worked best in a past period. And maybe that whole line of thinking is somewhat poisonous. I don’t know. I’m not going to pretend to have all the answers, or even very many of them.

Anyway, this is just my stab at understanding the low performance of the designer models. It may not be correct at all, and it certainly doesn’t represent the opinions of the rest of the P123 staff.

Is the above an example of what you are calling bootstrap aggregating (bagging)? Or are you doing something different that you call bagging?

Either way, good stuff, but perhaps definitions matter. Perhaps this explains why I cannot do bootstrap aggregating, by my understanding of the definition, with a spreadsheet.

It is clearly model averaging (aggregating). Is there randomization in there somewhere? Anyway, I do not see anything that could be called bootstrapping.

IMHO, we will not be able to use much bootstrap aggregating at P123 unless we can use Python within the P123 platform. P123 should not be asked to reinvent the wheel and program that for us if Python could be used, and I am not sure I would like what P123 ends up doing anyway. It is fine if we never have this; being able to use bootstrap aggregating is just a request, if it is even possible. True bootstrap aggregating generally uses a lot of data, and downloads are the main limitation. It is not impossible now, just limited, and more than a spreadsheet is required when I do it.

I am going to take a little bit of a leap here and guess that when Patrick O’Shaughnessy talks about bootstrap aggregation he is doing what is defined in a standard textbook. And that is what I would like to be able to do more of.
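
For definitions' sake, here is what I mean by textbook bootstrap aggregating, in a minimal sketch (an illustration only, not a P123 feature): resample the training rows with replacement, fit a model on each resample, and average the predictions. scikit-learn's BaggingRegressor wraps the same idea in one call.

```python
# Textbook bagging in miniature: bootstrap-sample the training rows, fit a
# weak learner on each sample, and average the learners' predictions.
# X_train, y_train, X_test are assumed to be numpy arrays.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, n_estimators=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    predictions = []
    for _ in range(n_estimators):
        rows = rng.integers(0, n, size=n)        # bootstrap sample, with replacement
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X_train[rows], y_train[rows])
        predictions.append(tree.predict(X_test))
    return np.mean(predictions, axis=0)          # aggregate by averaging
```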

-Jim

Yuval… agreed that Price to Book didn’t have an explicit heavy weight in the designer models, but I would wager quite a bit that if you ran a regression model on the returns for the designer models they would still load quite heavily on HML. Despite Price to Book’s performance decaying relative to other measures of value, I expect the performance of each of the value factors is still highly correlated month-to-month.

For reference, here is a regression of the performance of the new Core Value Ranking System, applied to the SP500 Universe, against the AQR factors. As you can see, it loads heavily on HML with a t-stat of ~8.9.
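
For anyone who wants to replicate that kind of test, here is a bare-bones sketch (mine, not dnevin123's actual code) of regressing a strategy's monthly excess returns on the AQR factor returns with statsmodels:

```python
# Sketch of a factor regression: regress monthly excess returns on the AQR
# factors and inspect the HML loading and its t-stat. Inputs are assumed to be
# aligned pandas objects; the column names here are placeholders.
import pandas as pd
import statsmodels.api as sm

def factor_regression(strategy_excess: pd.Series, factors: pd.DataFrame):
    X = sm.add_constant(factors[["MKT", "SMB", "HML"]])  # constant = alpha
    results = sm.OLS(strategy_excess, X).fit()
    return results.summary()  # loadings, t-stats, R-squared
```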


[quote]
Running a one-sample t-test on the 2-year excess returns of all the Designer Models shows that the models have (as a whole) underperformed their benchmarks:

t-score: -9.06 WOW!!!
[/quote]

Yes. WOW!!!

Does a t-score of -9.06 prove that the models are picking random stocks (the null hypothesis)? I don’t think so. But what does it mean?

[quote]
Does a t-score of -9.06 prove that the models are picking random stocks (the null hypothesis)? I don’t think so. But what does it mean?
[/quote]

It proves that the models are different from their benchmarks when looking at the last 2 years. The null would be that the models have the same performance as their benchmarks, which we can reject, I think. Does anyone think the null is true, that we are doing as well as the benchmarks for the last 2 years? With or without statistics?

-Jim

Jrinne,

Please encourage us to be positive.
Statistics are for the novice investor.

We are focusing on the 1% to 2% of the stock universe that is intended to perform well; the news and statistics are about the general market.

AAPL has returned 50%+ since Jan 2019.
TSN and SBUX are up 50%+ since Jan 2019.
AMAT, a value stock, has doubled since the Dec 2018 meltdown.
KL, TEAM, and SHOP have doubled in the last 12 months.

Eventually, our models should pick these kinds of winners. I just ignore the stats and the market news.

Thanks
Kumar

I am positive about your results.

Someone investing equally in your Designer Models for the last 2 years would be up 1% over the benchmark (excluding any fees).

My apologies if that is as positive as I can get. I will look at some of the other designers’ results and maybe I can highlight them. Your results are the only ones I have felt comfortable discussing up until now.

-Jim

Guys… I don’t speak for professionals here…but as individual investors, the 2 edges we have over billion-rich quant funds are:
1/ accepting more volatility
2/ fixing multi-year constraints and objectives.
We must accept the good and bad sides of our 2 edges.
We know why P123 models lag: most have a principal component in quantitative value combinations, investing styles are cyclical, and value has not been the place to be for the last 2 years. Period. The previous years were not so bad, so let’s enjoy life lagging the market and take some vacations. Half joking.
There are a few interesting posts on Alphaarchitect’s blog by very smart people about that. Their quant ETFs also lag seriously by the way. When a bunch of people smarter than me don’t have a solution, it probably means there is no problem.
I am also going toward ML, but the main reason is that I enjoy the intellectual game. I have a feeling that bagging is not the thing to start with, but it may be part of it. Ideas often come along the way without looking for them.

Small minus big may help explain why some of us are having pain.

It does not explain why we are underperforming the BENCHMARKS so badly, I think.

Most designers in small-caps will use small-cap BENCHMARKS.

The large-cap models seem to be underperforming the benchmarks too. Whatever the story, there is more to it.

Daniel (dnevin123) may be on to something with the correlation of our models to HML. I believe Frederic is saying something consistent with this. Both may be right.

But one thing is certain: THE DESIGNER MODELS HAVE NEVER BEEN EVIDENCE FOR US HAVING AN EDGE.

Also certain: The P123 system is a wonderful thing. I had never seen or dreamed of such a thing before I came here. I was (still am) in awe. I remain very much the student. It did make money for me (luck a factor?). But incremental improvements on this wonderful system–with spreadsheets–will not help much going forward.

The fact that the Product Manager at P123 is recommending bootstrapping with a spreadsheet as a possible way of keeping an edge over the institutions is a concern of mine. An edge that is debatable at this point.

It is true that we used to build jets with slide rules, so I am not saying older technologies do not deserve credit for getting us here. My father used to navigate commercial airliners with a slide rule and he hated computers. But even he liked (and adopted) inertial navigation. He liked to tell a story about his navigator having problems determining their exact location while (somewhere) close to the Aleutian Islands. After moving in the wrong direction for some time, the slide rule (fuel consumption calculations) said he wouldn’t be making it home that night. Yeah, he ended up liking some of the newer technologies. But the question now is: are we (at P123) going to make it?

There may be a reason that Boeing survives while “Slide Rule Aviation, Inc” went bankrupt. Am I wrong?

It goes without saying that I am hoping for a turn-around in the Designer Models like everyone else.

-Jim

Here are two interesting papers that came to my attention in the past few days.

On Value:
https://www.aqr.com/Insights/Perspectives/Its-Time-for-a-Venial-Value-Timing-Sin

On Quality:
https://www.researchaffiliates.com/documents/717-What%20Is%20Quality.pdf

These are by very well respected quants with impressive pedigrees who have big followings. But . . .

On reading both, it strikes me as obvious that:

No author has ever looked at a 10-Q or 10-K
No author has ever had a conversation with management
No author has ever listened to a conference call or read a conference call transcript
No author has ever worked up a set of projections for a company
(If any of the above statements are not strictly accurate, then I suggest, in the alternative, that such authors feel compelled, for professional reasons, to suppress any such knowledge or understanding they may have gleaned from such activities and to avoid allowing any such knowledge infect their professional efforts.)
All authors are exceptional when it comes to math and statistics
All authors are extremely conversant with Fama-French and their successors

Can I make a suggestion to the quants here: Identify a successful company, preferably one that interests you. Then, go to the company’s IR (Investor Relations) site and download (1) the most recent 10-K, (2) the most recent 10-Q, and (3) the most recent investor presentation. Then, go to Seeking Alpha and get (1) the latest conference call transcript and (2) some recent articles.

Study the material. Think about it. Set up your own earnings model. (The company will likely have given guidance as to upcoming sales and EPS – you fill in the blanks and project everything else to see how those #s are derived, and to judge whether they are believable.) Then, decide how you should value the stock. And then BUY it or SHORT it (or buy puts). It’s important that you take a real-money position; that will motivate you to keep your expectations re: projections and valuations in line with practical reality, as opposed to textbook stupidity (such as detailed 10 year discounted cash fluff . . . I mean flow . . . projections).

By way of comparison, do the same with UBER and BYND.

However good a quant you are, if you can’t bring “domain knowledge” into your work, your models are DOA and must be regarded as such even if, as has happened to many over the years, luck bails you out. (And for a long time the Fed has been pumping out the luck.)

Bringing domain knowledge into your work will (1) help you create more sensible models and (2) help you develop more sensible expectations (Hint: running a perpetual footrace against a benchmark is not it.)

“DOA.” That’s not good. I do not have a lot of “domain knowledge.” Call me: The Walking Dead.

With regard to the Designer Models: if you invested equal amounts in a designer’s models (pick the designer), how would you be doing now? Or maybe the designers can tell us ahead of time which models will do well.

And, as Marc suggests, you can try becoming a discretionary trader. Just make sure you share your out-of-sample results with us if you feel so inclined.

We now have two P123 staff members questioning whether something else is needed to supplement the present P123 “FinTech” methods. Each taking an entirely different approach: Bagging vs. Discretionary methods.

I am in strong agreement with P123’s position: I have advocated additional methods myself.

Thank you, Marc.

-Jim


Has anyone considered the simple hypothesis that using ranking systems (and the potential problem of over-optimizing) might–I emphasize might–be inferior to screening (where there might be less of a problem of over-fitting data)? A while back, Marc looked at AAII investors’ screening records versus the S&P 500 (a much harder hurdle than the average DM’s benchmark) and, if I remember correctly, the performance lagged over the last couple of years, but not the way our DMs’ has. In my opinion, screening versus ranking is an important topic that I don’t think is even discussed here.

Another reason I bring this topic up is that I would like a feature that imports my screens into a portfolio without having to change them into a ranking system (which deteriorates their performance and doesn’t seem to allow for a wide variance in the number of stocks in the output). Someone suggested this feature a while back, but I didn’t see a response from Portfolio123. Should I put in another request at this point?

As a result, I dropped my subscription down to screening only. In addition, I can’t implement a DM. Portfolio123 is losing revenue (at least from me!) by not having this feature.

Caveat: I’m still relatively new here so I might be wrong on the features or otherwise off base.

Thanks,
Doug

The title says it all:

We, as traders/investors, should not be thinking like this at all. There is no magical truth or trading system that will work through all time frames, and for all eternity.

Anyone who steps back to take a look at the forest instead of the trees will recognize that we are not in the early part of the business cycle, but the twilight. We are not in a market where company valuations have been driven into the ground (like 2002 and 2009) and where any value play is pretty much a guarantee of success. We are at the opposite end of the spectrum. The business cycle is long in the tooth and we are in a trader’s market, not an investment market. Yes, you can still find good investments, but they are a needle in a haystack. You should not be surprised to find that your value strategies are not working, simply because most of the value plays are gone and the vultures are picking over the remains.

Once you recognize the above, then you must understand that it isn’t a question of “2 bad years, statistically proven”. We are in a place that designers weren’t designing for, a late business cycle trader’s market. Value investing will surely make a recovery, but not until a bear market is unleashed and the poorly managed companies are washed away, while the remaining companies are potentially excellent investments.

We are in a strange place right now because politics and the Fed’s agenda are superseding the business cycle. Once upon a time, recessions were perceived as a necessary evil: the weak companies should be cut down so the strong can prosper and gain new heights. But for whatever reason, politics now dictates that recessions are a bad thing, even though they are a necessary evil to ensure the long-term fitness of the economy. Thus we see the length of the business cycle pushed out, and the weak are allowed to carry on indefinitely at the expense of the well-managed companies.

Once you realize that one trading system can’t fulfill all business cycle conditions, whether early or late, then you also realize that investment strategies are not static. We must adapt to the changing conditions. There are two ways of doing this, of course. The first is to become a superhuman analyst, which is a lot of work, fraught with human error, and with no guarantee of being right. The second way is to use the modern technology at hand to identify the current trends and attempt to capitalize on them. The second approach is considered by some as black magic, and perhaps it is to some extent. But the first approach (superhuman analysis) can only be theoretical; there are no metrics to tell you whether your analysis is correct now, in the past, or in the future.

Instead of the superhuman approach, I prefer to look at (1) how do I identify current trends; (2) how long can I expect the trend(s) to last (i.e. persistence); and (3) how do I recognize when the trend is coming to an end. If you start thinking along these lines, then you may begin to understand the importance of optimization and the importance of backtesting optimization strategies.

One other point of note I would like to discuss is the concept of Discounted Cash Flow analysis. While it is an interesting analysis, it has been pretty much established that it is NOT a useful analytical technique except for companies with an economic moat that have very predictable forward growth figures. One also needs to have a host of bull/bear narratives and use the bearish narratives to come up with a tradeable present value for a stock. Morningstar is a leader in this approach and has demonstrated some moderate success with the MOAT ETF (see attached image).
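
For readers unfamiliar with the mechanics, a bare-bones discounted cash flow calculation looks something like this sketch (my own illustration with placeholder numbers, not anyone's valuation model):

```python
# A bare-bones DCF sketch (placeholder numbers, not a recommendation):
# discount projected free cash flows plus a terminal value back to the present.
def dcf_value(fcf0, growth, discount, terminal_growth, years=10):
    """Present value of `years` of cash flows growing at `growth`, plus a
    Gordon-growth terminal value, all discounted at `discount`."""
    value = 0.0
    fcf = fcf0
    for t in range(1, years + 1):
        fcf *= 1 + growth
        value += fcf / (1 + discount) ** t
    terminal = fcf * (1 + terminal_growth) / (discount - terminal_growth)
    value += terminal / (1 + discount) ** years
    return value

# e.g., $100M of free cash flow growing 5%/yr, a 9% discount rate, 2% terminal growth
print(f"{dcf_value(100, 0.05, 0.09, 0.02):.0f}")  # rough intrinsic value, in $M
```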

Keep in mind that the Morningstar approach is supported by hundreds of professional analysts. I don’t believe that P123 can compete in this venue. We simply don’t have the human manpower to accomplish what Morningstar is able to achieve with the MOAT ETF. BTW You can also subscribe to Morningstar and see their valuation figure for any stock.

However, everything is not lost. There is no reason why we can’t stand on Morningstar’s shoulders. P123 now has the stocks held by MOAT over time and in fact all ETFs for that matter. There are other ETF strategies that P123 subscribers may be interested in. So P123 management, if you are listening, please give us the ability to select an ETF as a guide for an underlying stock universe. It has to be adjusted over time, the same as the underlying ETF swaps out companies over time. This feature will allow model designers to not only build upon the great work being done at Morningstar with economic moats and stock valuation, but also add our own additional P123 strategies on top of that. We get the best of all worlds… stock valuations by a company with much larger human analyst resources, and our own custom strategies that may supercharge the results.


Hi Grocer,

The same was true two years ago, three years ago, etc.

In fact, most models on P123 lag behind their benchmarks over a longer period of time. Often these take their gains from high turnover, while we all know that investing in stocks is safer with a long-term “buy-and-hold” strategy.

Maybe we would do better by just using a combination of index funds and simple market timing rules (Andreas Himmelreich proposed some simple rules in this forum).

Also, it is always healthy to look into other asset classes and emerging trends outside of P123.

Just my 2 satoshis.

I have been using this strategy for over 5 years now using the PIT holdings of USMV and VDIGX updated quarterly, and selecting 12 and 10 stocks, respectively, from each of them. This strategy has easily outperformed the funds.

However, it is difficult to get the funds’ historic holdings. I don’t know where P123 could obtain them from.