Evaluating models

After consulting jrinne, wwasilev and the Grinold/Kahn text on Active Portfolio Management, I have tentatively decided to rank my ports based on information ratio, t-stat, and CALMAR ratio, focusing on the trailing 3-year and 1-year periods.

For those who don’t know, the information ratio is annualized excess returns divided by annualized standard deviation of excess returns. If I were looking at a spreadsheet of monthly returns for a model, I would average the excess returns over the last 36 months and multiply by 12 to get an average annual excess return. I would take the standard deviation of the last 36 months’ excess returns and multiply by the square root of 12 to get annualized standard deviation of excess returns (this particular std dev is also known as “tracking error”).

Once I had the 3-year information ratio, I would multiply by the square root of 3 (since it is a 3-year measure) to get the T-Stat. I am looking for T-Stats in excess of 2, as that corresponds to the proverbial 95% confidence level: at T-Stats of 2+, the chance that random luck is driving returns is less than 5%.

(Dividing a T-Stat of 2 by the square root of 3 really means I am looking for 3-year information ratios in excess of 1.155).
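A minimal sketch of that arithmetic in Python (the return series here are randomly generated placeholders, not real data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical monthly returns for the model and its benchmark, last 36 months
model = rng.normal(0.015, 0.05, 36)
bench = rng.normal(0.008, 0.04, 36)

excess = model - bench
ann_excess = excess.mean() * 12                    # annualized excess return
tracking_error = excess.std(ddof=1) * np.sqrt(12)  # annualized std dev of excess

info_ratio = ann_excess / tracking_error
t_stat = info_ratio * np.sqrt(3)                   # 3-year measure

# t-stat >= 2 is equivalent to a 3-year IR >= 2 / sqrt(3) ~ 1.155
print(f"IR = {info_ratio:.2f}, t-stat = {t_stat:.2f}")
```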

The CALMAR ratio is the 3-year annualized total return divided by the 3-year max drawdown. This can be calculated for the benchmark and the models. Obviously, you want to see a model CALMAR ratio greater than the benchmark CALMAR. For my purposes, I will use Benchmark CALMAR * 1.5 as the bogey.
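A sketch of the Calmar calculation from a monthly equity curve (the data and the `calmar_ratio` helper are illustrative, not real):

```python
import numpy as np

def calmar_ratio(equity, periods_per_year=12):
    """Annualized total return divided by the max drawdown over the window."""
    equity = np.asarray(equity, dtype=float)
    years = (len(equity) - 1) / periods_per_year
    ann_return = (equity[-1] / equity[0]) ** (1 / years) - 1
    peaks = np.maximum.accumulate(equity)
    max_drawdown = ((equity - peaks) / peaks).min()  # a negative number
    return ann_return / abs(max_drawdown)

rng = np.random.default_rng(1)
model_eq = 100 * np.cumprod(1 + rng.normal(0.012, 0.05, 37))  # hypothetical
bench_eq = 100 * np.cumprod(1 + rng.normal(0.007, 0.04, 37))

print(calmar_ratio(model_eq), 1.5 * calmar_ratio(bench_eq))   # model vs. bogey
```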

Since multiplying the 3-year T-Stat by the 3-year CALMAR ratio will not change much week to week for the models, I will also multiply by the one year information ratio calculated from weekly returns. I am looking for a one-year info ratio of at least 1. In this way, a model going out of favor (such as value or ADRs recently) can be quickly seen in the numbers.

So: 1 * 2 * 1.5 * (Benchmark CALMAR) will be my cutoff for model investment. Any score higher than that is investible. Any score lower than that, and the model gets benched. Currently, the cutoff is 2.97.
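Putting the pieces together, the cutoff logic is simple arithmetic. In this sketch the inputs are hypothetical, except the 0.99 benchmark Calmar implied by the 2.97 cutoff, and the score is taken to be the product of the three measures described above:

```python
one_year_ir = 1.10        # from weekly returns
three_year_t_stat = 2.30  # 3-year IR * sqrt(3)
model_calmar = 1.40
benchmark_calmar = 0.99   # implies the current 2.97 cutoff

score = one_year_ir * three_year_t_stat * model_calmar
cutoff = 1 * 2 * 1.5 * benchmark_calmar  # = 2.97

print("investible" if score > cutoff else "benched")
```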

Thoughts?

How do you select your benchmarks?
Why did you use just 1 and 3 years? Why not at least 10 years (or longer)?

SPY is the usual equity benchmark, unless it’s a small-cap or Nasdaq model, etc.

Most of my models have been running for 2 years, give or take a few months. I would prefer to examine (mostly) actual performance rather than mostly back-tested results.

Evaluating models is a challenge at the best of times. I know of sites that have struggled to find an appropriate scoring system that has some relevance to future performance. Collective2 comes to mind in that respect. I haven’t visited that site for years, but at the time they were using “number of subscribers” as a major factor in scoring the systems, the rationale being that if people were paying money for the stock picks then the system must be good. This scoring strategy, of course, has problems, one of which is the biased feedback loop: more subscribers result in a higher score, which results in more subscribers.

How do Calmar, I.R., etc. work in an environment where 8 models are being “harvested,” 7 of which are underperforming the benchmark while the 8th is exceeding expectations? Let’s call these models “Model A,” “Model B,” and so forth :slight_smile: At some point in time the 7 underperforming models will be canceled and replaced by new models. With luck, 1 or 2 of the new models will outperform and be kept as showcase models. From this scenario, it should be clear that one cannot rate a system in isolation. One must rate the model designer him(her)self, and that can only be done if all models ever created are maintained and monitored for eternity.

But is that even fair? Market outlook changes over time, the computing platform evolves, and the intended strategy may only be appropriate for one market segment over one time period. For example, SMS Aerospace & Defense was designed with the idea that a Republican president will boost defense spending. What happens if, after the next election, the government swings to the Democrats? The model may not have the same performance after that happens. Another example is SMS Cloud Computing, which targets businesses involved with the cloud. One would have to be crazy not to be invested in such companies right now, either via stock holdings or an ETF such as SKYY. 5 or 10 years from now, the cloud will have matured and such a designer model won’t have the same bullish performance after that. But for now, investors not into the cloud are missing a once-in-a-generation opportunity.

Another issue that I have is the choice of benchmark. At P123, there is a limitation on benchmarks, although there has been a great improvement in variety. My opinion is that the only relevant benchmark is one that is constructed from the custom universe that the model picks stocks from. This is the only benchmark that is meaningful for measuring system performance.

Once this is understood, the investor can design his/her holdings using a top-down approach. The investor starts by choosing benchmarks (custom universes) based on industries with high future growth potential or with overall risk reduction in mind. The benchmarks should have minimal overlap. Once a diverse set of benchmarks is chosen, the next step is to design ports around these benchmarks, or to find designer models that have a strong probability of outperforming them. This top-down approach to portfolio design is much more interesting to me than the alternative, which is to choose a handful of designer models with little insight into how each model is diversified from the “pack,” and with performance measured against a generic benchmark such as the S&P 500 with little or no predictive value.

This is also EXCELLENT for comparing 2 of your ports.

Instead of the usual “benchmark” one of your ports serves as a “benchmark.”

If you get a t-stat of 2 (showing one of your ports is that much better than the other), it is time to consider moving some of your money to the better port, I think. You might consider keeping some ports on automatic for comparison. You would begin to fund an automatic port that has done well, or move a funded port to automatic if it has underperformed.
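One way to compute that port-versus-port t-stat is a paired t-test on the matched weekly returns; a minimal sketch with made-up data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
port_a = rng.normal(0.003, 0.02, 104)           # hypothetical weekly returns, 2 years
port_b = port_a + rng.normal(0.001, 0.01, 104)  # second port, slightly better on average

# Paired t-test on matched weekly returns: one port serves as the other's benchmark
t_stat, p_value = stats.ttest_rel(port_b, port_a)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")   # t >= 2 would argue for moving money
```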

Simple accounting (without any statistics) shows this has worked for me.

-Jim

Or perhaps you are jumping ship to an overachieving port that is about to mean revert.

A definite possibility!!!

But if you wait until the t-stat is 2, you are not jumping from ship to ship very often. And when you do, the odds that it is actually a better port are pretty good.

I wish I could say it was 100%. But if it were guaranteed what would be the fun in that?

-Jim

What about rolling tests? I test my models with 1-year periods and a 1-week offset to capture all possible outcomes.
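For concreteness, a minimal sketch of the rolling calculation (hypothetical weekly returns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
weekly = pd.Series(rng.normal(0.003, 0.02, 520))  # ~10 years of made-up weekly returns

# Compound every 52-week window, stepping forward one week at a time
rolling_1y = (1 + weekly).rolling(52).apply(np.prod, raw=True) - 1

print(rolling_1y.dropna().describe())  # the distribution across all 1-year windows
```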

Steve,

The formal way to deal with more than one comparison is with an ANOVA. And you are right: there could be a survivorship bias with the Designer Models. Here is an example of 5 ports/sims. Is there a difference between the ports/sims? If so, which ones are clearly better/worse?

This is actually the Friedman test, named after Milton Friedman. It is a nonparametric version of a within-groups ANOVA.

So the p-value is 0.041 and shows that at least one port is statistically different from the others. Conover’s post hoc test then looks at whether a comparison between any 2 ports/sims is significant, keeping in mind that there are 5 ports/sims.

In this example, Yellow is statistically better than Violet (t-stat 2.949 and p(Bonferroni) = 0.033). None of the other ports/sims show significance in one-to-one comparisons when considering 5 models.

If you compared just Violet to Green it would be significant (p = 0.014), but p(Bonferroni) takes into account that you are looking at 5 models: p(Bonferroni) = 0.136. Not significant.
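If anyone wants to replicate the first step, scipy ships the Friedman test (the returns below are randomly generated placeholders):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Hypothetical monthly returns for 5 ports/sims over the same 36 months
ports = {name: rng.normal(0.01, 0.04, 36)
         for name in ["Red", "Green", "Blue", "Yellow", "Violet"]}
ports["Yellow"] += 0.01  # give one port an edge

# Friedman test: nonparametric within-groups (repeated measures) ANOVA
stat, p_value = stats.friedmanchisquare(*ports.values())
print(f"chi-squared = {stat:.2f}, p = {p_value:.3f}")
```

The pairwise Conover step with a Bonferroni correction can then be run on the same table, e.g. with the scikit-posthocs package’s posthoc_conover_friedman.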

In practice, I pretty much do it Parker’s way. I do not wait for the ANOVA to show significance before starting to move some money or beginning to fund a model that is outperforming the benchmark: it is a demanding test.

-Jim


While 3 years of data might be somewhat valuable, 1 year is going to be useless, in my opinion. And combining your out-of-sample returns with your backtested returns so that you have a six-year or longer lookback is going to get you better results.

I have never seen a study in which drawdowns in one period were predictive of or correlated with drawdowns in another period. I would advise you to abandon this method. Using the information ratio is far more reliable than looking at drawdowns.

For what it’s worth, here’s how I choose which system to use. I use a modification of the Omega ratio (see http://backland.typepad.com/investigations/2018/07/ultimate-omega-the-best-risk-adjusted-performance-measure.html ) to look at backtested results over four different half-universes (I divide my universe into 2 using evenid and also by random subindustries) over the last 6, 9, 12, 15, and 18 years. I end up with 20 results for which is the best system, and I implement the weighted average of them all.
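(My version is a modification described at the link; for reference, a sketch of the standard Omega ratio, with hypothetical returns and the threshold at zero:)

```python
import numpy as np

def omega_ratio(returns, threshold=0.0):
    """Standard Omega: sum of gains above the threshold over sum of losses below it."""
    excess = np.asarray(returns, dtype=float) - threshold
    return excess[excess > 0].sum() / -excess[excess < 0].sum()

rng = np.random.default_rng(5)
print(omega_ratio(rng.normal(0.01, 0.05, 120)))  # hypothetical monthly returns
```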

[quote]
For what it’s worth, here’s how I choose which system to use. I use a modification of the Omega ratio (see http://backland.typepad.com/investigations/2018/07/ultimate-omega-the-best-risk-adjusted-performance-measure.html ) to look at backtested results over four different half-universes (I divide my universe into 2 using evenid and also by random subindustries) over the last 6, 9, 12, 15, and 18 years. I end up with 20 results for which is the best system, and I implement the weighted average of them all.
[/quote]

Yuval, could you please explain?

The way I understand this, you test 6, 9, 12, 15, and 18 years (five lookback periods) for evenid, !evenid, subindustry group A and subindustry group B. This gives us twenty different combinations of lookbacks and universes. Each one of these twenty combinations is tested for each ranking system that you are using. Then, you weight each RS in proportion to its average omega ratio across all twenty tests. Is this accurate?
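If that reading is right, the weighting step might look something like this sketch (all names and numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical omega ratios: one row per ranking system, one column for each
# of the 20 (lookback, half-universe) combinations described above
omegas = {
    "RS_value":    rng.uniform(0.8, 1.6, 20),
    "RS_momentum": rng.uniform(0.8, 1.6, 20),
    "RS_quality":  rng.uniform(0.8, 1.6, 20),
}

averages = {name: o.mean() for name, o in omegas.items()}
total = sum(averages.values())
weights = {name: avg / total for name, avg in averages.items()}
print(weights)  # each RS weighted in proportion to its average omega
```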

Another question: How well does this system work for you compared to picking a single RS that had the highest omega ratio across the entire period?

I’m going to commit heresy and claim that the only performance metric that matters is dollars earned.

Everything else is a refinement of that basic metric. There’s rarely any additional information contained in any metric that is not already included in a time-series graph of account balance (the exceptions are co-movement metrics, like correlation, beta, and, yes, the information ratio).

But even then, the reason we have co-movement metrics is to separate luck versus skill on the bottom line (which is still measured in dollars).

Moreover, dollars matter more than percents due to capacity constraints. Also, dollars can always be expressed as percent returns, but percent returns cannot always be converted back into dollars.

I look at annualized returns, and at how close the log-scale equity curve comes to a straight diagonal line.
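A minimal sketch of one way to quantify that straightness: fit a line to the log equity curve and look at the R² (the equity series is made up):

```python
import numpy as np

rng = np.random.default_rng(7)
equity = 100 * np.cumprod(1 + rng.normal(0.002, 0.02, 520))  # hypothetical weekly equity

log_eq = np.log(equity)
t = np.arange(len(log_eq))
slope, intercept = np.polyfit(t, log_eq, 1)  # straight line in log space

residuals = log_eq - (slope * t + intercept)
r_squared = 1 - (residuals ** 2).sum() / ((log_eq - log_eq.mean()) ** 2).sum()

cagr = np.exp(slope * 52) - 1  # weekly periods -> annualized growth
print(f"R^2 = {r_squared:.3f}, CAGR ~ {cagr:.1%}")
```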

I don’t look at sharpe or sortino or information ratio, because every time I look up the formula, I soon forget them.

I don’t look at max drawdown because it is only one data point, and it is extremely unlikely to repeat in the future. Your future drawdown is going to be much better or much worse. Besides, does it really matter if your portfolio is down 50% vs 60%? If you trust the model to survive -50%, then you can probably stomach -60%. The only time I look at drawdown is to verify that a constant hedge is working properly.

Real-time performance is the only true out-of-sample. I don’t bother splitting the historic data into “training” and “validation” test sets. Historic data is still historic, no matter how you slice it.

Using real-time performance as the validation, I look to see how close it is to the backtest. Unfortunately, there is no other choice than to wait months or years before you gain confidence in the model. We all do this when we look at someone else’s Designer Model. “This model looks good, but I’ll wait longer to see if the performance isn’t a fluke.”

I never do rolling backtests. Maybe because most of my models are high turnover, so changing the start date makes no difference.

But I do lots of tests changing the holding period, and changing the liquidity, because I want to see how the alpha decays as you change those parameters. I find that for my low liquidity models, you can actually crank up the liquidity quite a bit and see gradual alpha decay, but then when you increase the liquidity beyond a certain point, the model just kinda falls apart very quickly.

BTW, Parker is being kind here. I first became aware of the information ratio in one of his feature requests a while ago.

My enthusiasm for the information ratio is based on the fact that the information ratio and sharpe ratio are derived from sound mathematical theorems. It is not surprising that many people would have the same opinion about a mathematical theorem. If there were legitimate criticisms it would not be a theorem any more.

And it has a minimum number of assumptions that can (and should) be questioned. It does not assume linearity, homoscedasticity, sphericity, etc.

And for the life of me, I do not understand why people do not normalize their regressions and why they put any special meaning on zero returns (look at alpha). Zero has no special meaning on a number line with infinite real numbers, and the meaning in the real world is purely psychological. Zero return (over long periods) is very unlikely for the benchmark: exactly zero is impossible, in fact. Most modern literature normalizes things first, making alpha a moot consideration. A relic?

But Parker found this before me and developed his ideas independently. I invented none of this, and he had already found it when I read his post about the information ratio. I hope I have not hijacked this thread or presented too many ideas that he does not endorse.

-Jim

Don’t be silly Jim. Your contributions are invaluable.

“…the information ratio and sharpe ratio are derived from sound mathematical theorems…”

The issue with any of these theorems isn’t mathematical soundness, but the lack of evidence that there is a connection to future performance. Now it would be nice to do an empirical study, but even if the results of such a study came out positive, there would be naysayers.

Steve

I actually agree with this, without equivocation. Well, except for there not being any evidence at all. Maybe it is only anecdotal, based on my personal experience. But I think there is other evidence: Fama and French, to start with.

Can there be a compelling pattern on a P123 output and absolutely nothing when you look at the statistics? Isn’t the excess return a statistic and essentially half (the numerator) of the information ratio?

The day I have serious questions about there being patterns that can be recognized with statistics, or by just looking at the returns on a P123 output, is the day I stop paying P123 any money.

On that day you will be able to walk the fine line of paying money while believing none of it, without any dissenting opinion from me.

BTW, nice graphic!!! You have set a high bar if you are going to try to duplicate StockMarketStudent’s accomplishments;-)

-Jim

“Can there be a compelling pattern on a P123 output and absolutely nothing when you look at the statistics?”
“Maybe it is only anecdotal based on my personal experience.”

Jim - you will be very surprised when you compile statistics on something that you believe to be quite obvious. For instance, Buy/Sell signals marked on a graph, signals that seem to nail the market swings. Then when you compile the hard statistics you find out that the results are actually insignificant when using the actual entry/exit values. I have done this many times in the past, usually very disappointed with the hard facts.

“…Fama and French to start with…”
“…Fama and French to start with…”
Fama and French (momentum/value) have a lot of empirical data to back them up. Even with all of the empirical data, there are still doubters, people with studies showing it doesn’t work.

“…The day I have serious questions about there being patterns that can be recognized with statistics or by just looking at the returns on a P123 output is that day I stop paying P123 any money…”
P123 has a set of wonderful tools. But the wisdom for how to use the tools comes from the user. Mathematics unto itself cannot be the judge as to whether a system is good or bad. There has to be a narrative, and the investor has to believe in that narrative. Mathematical formulae can only play a support role, not provide the golden egg.

Which is why I use the statistics.

But, “Inspector,” I no longer work for the “Statistics Police.” Good statistics should all give the same answer if done correctly. And besides, working for the Statistics Police did not pay well.

People do not always have to use the statistics I use. Other statistics work well. And people are free to use the statistics they like. I hope my posts are positive.

Even bad statistics can work extremely well as a machine learning tool, IMHO. I now work for the “Machine Learning Police” and that does seem to pay.

-Jim

Jim - you are using statistics for past performance. The problem is that there are no statistics to support the assumption that future performance = past performance. This is where you need to do a massive empirical study to support that thought. I’m not saying it is wrong, just that there is no evidence to back it up. You can of course claim it is a narrative that you strongly believe in, if that makes you feel better :slight_smile: