All that glitters is not gold: Comparing backtest and out-of-sample performance on a large cohort of trading algorithms

https://blog.curtisnybo.com/comparing-backtest-and-out-of-sample-performance-on-a-large-cohort-of-trading-algorithms/


All that Glitter is not Gold.pdf (532 KB)

Thanks, James. I found it interesting and informative.

Thanks Bob, I am glad you find the paper interesting and informative.

In fact, due to the very weak correlations (R² < 0.25) between in-sample (IS) and out-of-sample (OOS) performance reported in the paper based on Quantopian’s data, the scoring system at Quantopian for submitted algos now includes a six-month out-of-sample evaluation period.

For Quantiacs, another crowdsourced quant site, the scoring system automatically calculates the Sharpe ratio of your submitted algo, which makes up the first score. They then simulate your system for three months with live data, which produces your second Sharpe ratio score. The lower of the two is your final score. The arrangement is designed to rule out overfitting (in the backtest) and lucky wins (in the three months of live data).
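As a rough sketch of that scoring rule (the function names here are my own, not Quantiacs' actual API):

```python
import numpy as np

def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio of a series of periodic returns (zero risk-free rate)."""
    returns = np.asarray(returns, dtype=float)
    return np.sqrt(periods_per_year) * returns.mean() / returns.std(ddof=1)

def final_score(backtest_returns, live_returns):
    """Quantiacs-style score: the lower of the backtest and live Sharpe ratios."""
    return min(sharpe_ratio(backtest_returns), sharpe_ratio(live_returns))
```

Taking the minimum means an algo cannot win on a great backtest alone; a lucky live period is capped by a weak backtest, and vice versa.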

James

Hi James,

I made a New Year’s resolution not to discuss statistics on this forum. But this seems to be a topic of interest for at least one person reading the forum today so I may make a small exception (while my neural-net-training runs on Python).

That is an extremely high R^2 for what we do!!!

For comparison, take something that we can all agree upon (probably). We all agree, I think, that AvgDailyTotal is correlated with slippage.

When I look, there is a correlation, as one would expect, but not THAT MUCH of a correlation. Also, the papers on slippage do not show that high a correlation.

Still, I need to digest this. The average Alpha is near zero in the study. The IR correlation is negative and of course the annualized returns are negatively correlated. A high Sharpe Ratio, by itself, is not what I am interested in.

This all leads to one rhetorical (but serious) question: will this help me make money? Rhetorical because I do not want to be put on the spot of having to answer my own question.

Maybe the lesson is to find a low-variance portfolio (with a higher Sharpe ratio that may persist) and leverage it. That strategy led to disaster for Long-Term Capital Management, which was low variance until it wasn’t. But as I said, I am still thinking about what the paper means.

Thank you for the post and perhaps my resolution should have been to not post statistics unless the thread is about statistics;-)

-Jim

Hi Jim.

You are right about not focusing too much on the Sharpe ratio.

As you can see in the abstract of the paper, they actually find that the Sharpe ratio (by itself) offers little value in predicting out-of-sample performance (R² < 0.025). However, the paper also mentions that the latest-year (IS) Sharpe ratio is one of the better evaluation metrics based on Quantopian’s data.

I believe they found the overall predictivity, including all the backtest evaluation metrics, is R² < 0.25, which implies that less than 25% of the variance in out-of-sample performance can be explained by the in-sample data. I think this level is not high at all.
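For concreteness, R² here is just the squared correlation between an IS metric and the OOS outcome. A quick sketch with made-up numbers (not Quantopian's data):

```python
import numpy as np

def r_squared(in_sample, out_of_sample):
    """Squared Pearson correlation between IS and OOS performance metrics."""
    r = np.corrcoef(in_sample, out_of_sample)[0, 1]
    return r ** 2

# Illustrative only: OOS Sharpe weakly related to IS Sharpe, mostly noise.
rng = np.random.default_rng(42)
is_sharpe = rng.normal(1.0, 0.5, 200)
oos_sharpe = 0.3 * is_sharpe + rng.normal(0.0, 0.5, 200)
print(r_squared(is_sharpe, oos_sharpe))  # a small value, well below 1.0
```

An R² of 0.25 means three-quarters of the OOS variation is unexplained by the backtest, which is why the live evaluation period matters.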

James

I disagree. I think it is very high.

Quite a bit higher than the Designer Models' annualized returns, for example (a result that is duplicated by the study you link to). Negative correlation for the Designer Models, too.

For me it is just that you COULD have a high Sharpe ratio buying bonds that pay 2%. A high Sharpe ratio, but I would rather be in the market long term. BTW, Long-Term Capital Management thought leveraging those bonds was a winning strategy (it was not).

-Jim

Jim,

If you disagree, I guess you are disagreeing with the findings in the paper and with Quantopian’s data, not with me.

I just share their view that more focus should be put on the out-of-sample evaluation period when comparing trading algos.

James

Actually, not disagreeing.

Just asking if you are aware of anything with a higher positive correlation than what is cited in the study. I suppose I could get a more negative correlation than the annualized returns cited in the study by burning my money.

But, personally, I cannot think of a study (in finance) with a higher positive correlation. As I said, the correlation of ADT with slippage is an example of a study with a lower positive correlation. I sincerely cannot think of one that had that high a correlation.

Probably just me.

-Jim

Instead of leveraging bonds, which led to the collapse of LTCM, another way to achieve high return / high Sharpe is to follow in the footsteps of the Medallion Fund (Renaissance Technologies) and run a leveraged statistical-arbitrage stock trading strategy, which remains a secret sauce at RenTech.

Yes! That is a VERY interesting method.

My takeaway:
Quantopian

  1. (Robustness) Limitations of Quantopian:
  • Price only,
  • mostly no fundamentals,
  • mostly single-instrument systems. My experience: if you only use price, momentum / trend following and mean reversion are the only ways you have to outperform (and only across a big portfolio of instruments). Very limiting! You can make money this way (@thechartist has done it for 30 years), but not more than 15%-18% ann.
  • AND MOST IMPORTANT: no ranking! The beauty of ranking is that you do NOT NEED an assumption about the distribution of returns or fundamentals. The ranking encapsulates this complexity, and that is the reason why P123 (plus its PIT data) is, for me, the only place right now.

I am looking at different platforms, and after an hour of research I usually find out they have no ranking (on price or fundamental data), and that is the point where I stop researching, because I know I need ranking to outperform: it encapsulates the (fat-tailed) complexity of the data and does not rest on a distributional assumption.

Sharpe is the absolute worst! Because, from what I know, it weights downside vola the same as upside vola. BUT WE WANT VOLA ON THE UPSIDE: if you had only upside vola (say, for the sake of an example, the “perfect trading system”), you would have a bad Sharpe ratio! Sortino is a bit better, but it also assumes the data is normally distributed!
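To make the point concrete, here is a minimal sketch (my own toy definitions, zero risk-free rate): a series that only ever jumps up still gets penalized by Sharpe, while a downside-deviation measure like Sortino does not.

```python
import numpy as np

def sharpe(returns):
    """Sharpe ratio: ALL volatility, up or down, counts as risk."""
    r = np.asarray(returns, dtype=float)
    return r.mean() / r.std(ddof=1)

def sortino(returns, target=0.0):
    """Sortino ratio: only returns below the target count as risk."""
    r = np.asarray(returns, dtype=float)
    downside = np.minimum(r - target, 0.0)
    dd = np.sqrt((downside ** 2).mean())
    return np.inf if dd == 0 else (r.mean() - target) / dd

# "Perfect" system: flat most days, occasional big up-moves, never down.
perfect = np.array([0.0, 0.0, 0.05, 0.0, 0.08, 0.0, 0.0, 0.06])
print(sharpe(perfect))   # finite: the upside jumps are treated as risk
print(sortino(perfect))  # infinite: there is no downside deviation at all
```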

  2. They stripped the outliers; not good. Basically the same as if one took all the fat tails out of the IS data and backtested on that. If only 5% of systematic traders are successful and you strip them out, you do not capture their results.

  3. It is NO NEWS that everything that uses the normal-distribution assumption (CAPM, alpha, beta, Sortino, correlation, regression, and also statistical tests based on ANY normal-distribution assumption) does not work for OOS predictability, because the assumption is wrong (prices and fundamentals have extensive fat tails, left and right, and correlations change constantly!). If your method needs an assumption about the distribution of the data, you are dead in the water from the start!

A. We simply do not know the distribution of prices in the future! The SP500 can drop 30% or more tomorrow; not likely, but possible, and everything possible will happen (that is just a matter of time, and that is also the reason why leverage is, in general, always a bad idea).

B. If you want to have an assumption about the distribution, make sure you assume big fat tails with almost no limit to the left and the right. And if you do, I do not know of any statistical methods that can handle such big fat tails.
  4. If you want to find out what works, you must talk to successful people who have skin in the game (ask me, perhaps, in 30 years ;-)).

What works in my opinion (other stuff works too, just my framework)?

  • PIT on fundamentals and price
  • Ranking on fundamentals and price, on stocks and industries (because it encapsulates ALL of the data set without an assumption about the distribution of the data)
  • Macro I: Demographics is 90% of the game (you live or invest in a “free” country, let’s say the USA)
  • Macro II: Invest in a country dominated by a “free” shareholder-value culture (just compare the actual performance of China's stock market to the USA's, relative to GDP growth)
  • Backtest every factor you use back to 1926 and further or find academics that have done this (https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html)
  • Make sure you can explain the factor by a human emotional factor (e.g. “it's hard to buy 52-week highs”, “it's hard to buy micro-cap value momentum stocks”, “it's hard to trade volatile trading systems”). Software-based trading systems exploit either the emotional FAILURES of other market participants (“oh, I cannot buy a stock that is up this much”) or their structural ones (e.g. big institutions cannot invest in micro caps). It is a competition: hack other people's feelings (or institutions' structural competitive disadvantages) and outperform, or do what everybody else is doing and do not outperform.
  • Or (or better, AND) make sure you have legal insider or alternative data (which is PIT, because only very few use it!)
  • Robustness tests, as many as possible: change your parameters; if the capital curve changes big time in response to a small change (of a parameter, a market, the number of stocks in the port, etc.) and is “all over the place”, put the system in the trash. Change markets and volume universes (SP500, Russell, small and micro caps, Canada, in the future international markets). Change sectors (all of them!). Divide the dataset (EvenID stuff). Change the number of stocks (1, 3, 5, 10, 20, 50, 100, 200, 400). Do not look at statistics (that carry a non-fat-tail assumption) when you do your robustness tests; look at the capital curve with your eyes only. And if you have a DD and are in doubt, go back to your capital curve and see whether something similar has happened before.
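The parameter-perturbation part of this can be sketched like so (the backtest function is a stand-in of my own, not a real P123 call):

```python
def robustness_check(backtest, base_params, param, deltas, tolerance=0.5):
    """Re-run a backtest with one parameter nudged by each delta; flag the
    system if the final equity swings by more than `tolerance` (as a fraction
    of the base result) for a small change."""
    base_equity = backtest(**base_params)
    for d in deltas:
        tweaked = dict(base_params, **{param: base_params[param] + d})
        equity = backtest(**tweaked)
        if abs(equity - base_equity) / abs(base_equity) > tolerance:
            return False  # capital curve "all over the place": trash it
    return True

# Placeholder backtest: final equity as a smooth function of a lookback parameter.
def toy_backtest(lookback):
    return 100.0 * (1.0 + 0.001 * lookback)

print(robustness_check(toy_backtest, {"lookback": 20}, "lookback", [-5, 5]))  # True
```

The same wrapper can loop over universes, sectors, and position counts; the point is that the verdict comes from how the equity reacts to perturbation, not from a distribution-based statistic.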

That is also what they say: IS drawdowns and vola have predictive power for OOS.

I cannot judge their AI stuff, because I do not know the methods used and it's a black box.

Even if I can use this kind of AI stuff and new statistical methods in the future, I would still always apply the above rules; I just need to see a very long backtest, and the explanation of the factors needs to be behaviourally / institutionally grounded.

Best Regards

Andreas

OK, so it's a paper proving that most backtests are curve-fitted? Yes, we already knew this. It's very easy to fall into the trap.

Interesting info on the Sharpe ratio and how they used AI. Maybe DataRobot has an API we can use to bolt AI onto P123 instead of doing it ourselves. That's a much bigger project for later this year.

My main takeaway for simple, low-hanging fruit to improve P123:

We should add the Tail Ratio and the last-year Sharpe ratio to our risk statistics and DMs. They look to be the most powerful predictors of OOS performance.
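For reference, a common definition of the tail ratio is the 95th-percentile return divided by the absolute 5th-percentile return (a sketch; the exact percentiles vary by source):

```python
import numpy as np

def tail_ratio(returns, upper=95, lower=5):
    """Ratio of the right tail to the left tail of the return distribution.
    Above 1.0 means the big up-days outweigh the big down-days."""
    r = np.asarray(returns, dtype=float)
    return np.percentile(r, upper) / abs(np.percentile(r, lower))
```

A symmetric return distribution lands near 1.0; a strategy with fat right tails and thin left tails scores above 1.0.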

Any other low-hanging fruit?

Andreas

Thanks very much for your reply.

I am sure you have your preference for P123 (and probably a financial interest in P123), but Quantopian probably has a larger investment community than P123. P123 has its advantage since no Python coding is required, but Quantopian can definitely do fundamentals, as one can access a lot of data libraries and code already written and shared by others. Regarding the Sharpe ratio, it is definitely not a perfect metric, but the Sharpe/Sortino/Upside Potential/Information ratios are the most commonly used risk-adjusted measurements, and they need some kind of measurement to evaluate the submitted trading algos (including the six-month out-of-sample evaluation period), together with other statistical evaluation metrics, for their investment contests.

As far as I know, Quantopian is backed by Steve Cohen, and the winning algos in their investment contests are allocated real money from hedge funds for investment. Furthermore, the main finding in the paper is that a backtest on in-sample data is not a good predictor of out-of-sample performance, and you really need to see how the algos perform with live data.

Regards
James

We discussed the underlying article back in 2016 when it came out.

Hi Marco,

Why not look at Amazon AWS also?

Of course, you would pass the cost for the use of AWS to the member.

As far as doing it yourself, I think you can. AWS does some things like supporting GPUs (graphics cards, not CPUs) for TensorFlow, which you may or may not want to deal with. TensorFlow uses a lot of vector operations that are best handled by a GPU. My understanding is that parallel CPUs work fine, just not as well as GPUs. Parallel CPUs certainly work for a single user (works for me).

My TensorFlow (neural net) just crashed. I may need to reload everything. So there are some programming hassles: probably nothing new for you, but perhaps a reason to look at AWS or DataRobot.

I will say that building an artificial neural net is not hard. Scikit-learn's MLPRegressor (not TensorFlow, but a neural net) worked out of the box with the defaults. Something to take into account when considering the market. TensorFlow adds options for the high-end user and is not hard (if the program does not crash).
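To illustrate the "out of the box" point, something like this is the whole exercise (toy data of my own, scikit-learn defaults apart from a larger max_iter so it converges):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy problem: learn y = x1 + 2*x2 from noisy samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X[:, 0] + 2.0 * X[:, 1] + rng.normal(0.0, 0.05, 500)

# Defaults: one hidden layer of 100 units, ReLU activation, Adam optimizer.
model = MLPRegressor(max_iter=2000, random_state=0).fit(X, y)
print(model.score(X, y))  # in-sample R^2, close to 1.0 on this easy target
```

The whole point is that no architecture decisions were needed; the defaults already fit a simple target well.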

FWIW. You have probably already considered AWS.

Thank you!

-Jim

Hi James,

I do agree that you have to wait and see OOS results. Definitely, 100%!

I do not agree with any statistical method that assumes normally distributed data being applied to stock prices or stock fundamentals.
That the whole industry is using it does not mean it is right to use it.

Out-of-sample Sortino is kind of OK, because it only measures the upside vola (which we want!).

“Risk-adjusted” is another concept that does not get into my head (I tried!). All the big traders / investors were able to withstand big DDs (Buffett in 1999: 50%!), and a strategy that has DDs but has zing to the upside after a DD (e.g. recovers much faster than the market) is actually a robustness test I use. If I see a strat with 50% ann. that has not had a 50% DD in the last 20 years, I am not trading it.

And I would rather trade a strat that makes 50% ann. with 50% DDs than a strat with 25% ann. and 25% DD. Total return buys Porsches (Turbo S! ;-)), not risk-adjusted return (e.g. 12.5% ann. and 12.5% DD).

Put me in a room with 1000 traders and I will be the only one with this opinion, and that is nuts and crazy. But anything else I cannot get into my head. I would rather invest in my mental toughness and being able to withstand a huge DD while capturing the risk premium at the same time. The DD will come anyway, no matter how good the strat.

Sorry to be opinionated.

Best Regards

Andreas

Andreas,

I think you have mixed up the Sortino ratio and the Upside Potential Ratio.

Sortino only factors in the downside, or negative, volatility (not the upside volatility, as you stated).

Regards
James

Has anyone considered why the Sharpe Ratio has a positive correlation while the information ratio is negative in the paper?

Has anyone read the paper? I cannot download it.

The most likely explanation (without reading the paper) for why this would be is that some of the algorithms use hedging to reduce volatility (the standard deviation in both the Sharpe ratio and the IR). If the same benchmark is used for all of the algorithms, they are fully invested, and it is all equities (no bonds), then I think this is very likely the reason.
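A toy illustration of why hedging can flip the sign (stylized, made-up returns; not data from the paper):

```python
import numpy as np

def sharpe(r):
    """Sharpe ratio of periodic returns (zero risk-free rate)."""
    r = np.asarray(r, dtype=float)
    return r.mean() / r.std(ddof=1)

def information_ratio(r, bench):
    """Mean active return over tracking error versus the benchmark."""
    active = np.asarray(r, dtype=float) - np.asarray(bench, dtype=float)
    return active.mean() / active.std(ddof=1)

# Stylized daily returns: a volatile equity benchmark, and a heavily hedged
# algorithm that keeps 10% market exposure plus a small constant edge.
bench = np.tile([0.03, -0.02, 0.025, -0.015], 250)
hedged = 0.1 * bench + 0.001

print(sharpe(hedged) > sharpe(bench))        # True: hedging boosts the Sharpe ratio
print(information_ratio(hedged, bench) < 0)  # True: but the algo lags its benchmark
```

The hedged algo's own volatility is tiny, so its Sharpe is high, while its active return against a fully invested equity benchmark is negative, so its IR is negative.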

James, if this is the case your comments about pairs trading and leverage are important.

I will not go further as I have not read the paper and there could be another reason.

But you should read the complete paper before changing too much. If you have read it I would be interested in whether the paper explains this.

-Jim

From what I gleaned, the correlations were both so close to zero that whether one had a negative or positive correlation was pretty irrelevant. I have an extremely low opinion of the Sharpe ratio for the same reasons as Andreas does. The lesson I gleaned from the paper is never to expect correlation between in-sample and out-of-sample performance over very short periods. If I remember the 2016 paper correctly, it looked at in-sample and out-of-sample performance over periods that were less than two years long (maybe even less than one). I also think that the regression lines in the illustrations are comical at best. One shouldn’t try to draw linear regression lines using OLS methods through data that looks like a hornet’s nest. This is all just my opinion, of course, and I may be misremembering the paper, and maybe someone will convince me that I’m mistaken here.

I just wanted to say that you’re not alone here. I recently read an excellent book, Behavioral Portfolio Management by C. Thomas Howard, which makes the same point.