Proposed Study Framework: a mechanical model for predicting out-of-sample (OS) system performance

So…the purpose of this thread is to see if there is sufficient interest to find 10-20 participants willing to invest 5-10 hours each in a study aimed at finding a mechanical model for better system selection; i.e., predicting out-of-sample (OS) benchmark-relative system performance.

I will lay out the case and proposed framework below.

There seem to be three major stages to ‘beating the market’ with quantitative strategies:

  1. Generating good-looking sims.
  2. Generating a process for reliably choosing among those sims.
  3. Generating a reliable process for weighting, combining, and allocating to that basket of sims.

This thread deals with issue #2.

  1. Why a simple mechanical model?
    A significant body of research shows that mechanical models nearly always do at least as well as expert forecasters, and in many cases and contexts outperform them by a wide margin. Some of the specific reasons hypothesized for this:
    a) Humans fail to identify the most salient variables that influence outcomes.
    b) Humans suboptimally weight decision variables.
    c) Humans are unduly influenced by the most recent events.
    d) Humans fail to gather / monitor outcomes and therefore fail to learn from decisions and improve their decision processes.
    e) Humans have emotions, get tired, etc., and these moods affect decision making.
    A sample meta-study from the research is attached. There are many other studies in this area.

Expert forecasters tend to do most poorly when the input variables are numerous. In these situations experts often do worse than naive forecasters, yet are much more confident in their forecasts (in clinical diagnosis, for example, experts do worse if they have conducted an interview with the patient).

  2. Can a simple mechanical system be developed for the question ‘How likely is a system to outperform a benchmark out of sample?’

I think yes. This question is very analogous to the following questions, which have been posed and answered rather well within the field of finance by simple mechanical models:
A. Can we predict if a firm is likely engaging in financial reporting manipulation? (accruals; Richard Sloan and ‘The Detection of Earnings Manipulation’)
B. Can we predict if a firm is likely to experience financial distress? (‘In Search of Distress Risk’)
C. Can we predict if a firm is likely to experience bankruptcy?

All of these questions have been answered with a very similar methodology. The predictive formulas created by the studies have, in many cases, held up reasonably well out of sample for a decade or so.

  3. How would the study work?

THE DATA SET. Half of the data set, from 1/2000 to 12/31/2006, would be in-sample (IS). From 1/2007 to 12/25/2013 would be out-of-sample (OS). Within the in-sample date range, we would look for 100-200 sims to test. Half would be used for training; half for testing the resulting models.

These sims would (ideally) be drawn 50% from public sims and 50% from private sims. No one would see the rules of the sims. Ideally half the sims would have continued to work in some way out of sample and half would have failed. Ideally the sims would come from a variety of liquidity ranges, although they may all focus on micro/small caps. That’s TBD.

THE PREDICTED VARIABLE. The predicted variable would be ‘outperforming the SP500 (or Russell 2000)’ out-of-sample (OS). I would suggest measuring this out-of-sample outperformance as (SIM_AR% * SIM_Sortino) - (SP500AR% * SP500Sortino).
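
A minimal sketch of computing this score, assuming the second term is also an AR% * Sortino product (the column names below are illustrative, not an agreed format):

```python
# Minimal sketch of the proposed predicted variable; column names are placeholders.
import pandas as pd

def os_outperformance_score(df: pd.DataFrame) -> pd.Series:
    """Score = (sim AR% * sim Sortino) - (benchmark AR% * benchmark Sortino)."""
    return df["sim_ar_pct"] * df["sim_sortino"] - df["bench_ar_pct"] * df["bench_sortino"]

# If a binary target is preferred, code 'outperformed OS' as:
# outperformed = (os_outperformance_score(os_results) > 0).astype(int)
```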

THE INPUT VARIABLES. We could debate this endlessly, but I would initially suggest the following (IS = in-sample); a sketch of how one coded row might look follows the list:
a) IS AR%
b) IS Sortino
c) Total number of factors in the ranking system
d) Ranking system weights optimized or not (1=optimized, 0=not).
e) Number of positions / holdings in the sim
f) Min. Liquidity of the system
g) Annual Turnover
h) Estimated annual trading costs (run sim once with zero slippage and once with variable slippage and then take the difference).
i) Number of buy rules
j) Total number of market timing / hedge rules
k) Total number of sell rules
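
For concreteness, here is a rough sketch of how one coded row of the data set might look; every field name and value is a placeholder rather than an agreed format:

```python
# Rough sketch of one coded row; fields correspond to input variables a) through k)
# above plus the predicted variable. All names and values are illustrative only.
sample_row = {
    "sim_id": "example_sim_001",
    "is_ar_pct": 32.5,            # a) in-sample AR%
    "is_sortino": 1.8,            # b) in-sample Sortino
    "n_rank_factors": 12,         # c) factors in the ranking system
    "weights_optimized": 1,       # d) 1 = optimized, 0 = not
    "n_positions": 20,            # e) positions / holdings in the sim
    "min_liquidity": 200_000,     # f) minimum liquidity of the system
    "annual_turnover_pct": 400,   # g) annual turnover
    "est_trading_cost_pct": 6.0,  # h) zero-slippage AR% minus variable-slippage AR%
    "n_buy_rules": 5,             # i) number of buy rules
    "n_timing_rules": 1,          # j) market timing / hedge rules
    "n_sell_rules": 4,            # k) number of sell rules
    "os_outperformed": 1,         # predicted variable, coded from OS results
}
```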

How the study would work:

  1. Agree on a list of the systems to include and divide them up among the people participating.
  2. Each person codes the above into an Excel or Google spreadsheet ‘data set’.
  3. One person runs the analysis to find an IS ‘prediction equation’ and presents the top 2-3 equations back to the group. Either SPSS (statistical analysis software) or machine learning ‘predictive analytics’ software could be used (a rough sketch using free tools follows this list). I know how to use these, but no longer have access to the software since my licenses expired.
  4. Each person then runs an analysis on their systems out of sample (OS) and reports back the predicted SP500 outperformance vs. the actual.
  5. From this, one person combines the results to see which of the top 2-3 predicted equations worked best, if at all, out of sample.
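
Since SPSS access is an open question, here is a rough sketch of how step 3 could be done with free tools instead; it assumes the coded target is available for the training sims and that the spreadsheet is exported to a CSV with the placeholder columns sketched earlier (the file name and column names are assumptions, not decisions):

```python
# Rough sketch of step 3 without SPSS: fit a simple prediction equation on the
# training half of the in-sample sims. Column and file names are placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

FEATURES = ["is_ar_pct", "is_sortino", "n_rank_factors", "weights_optimized",
            "n_positions", "min_liquidity", "annual_turnover_pct",
            "est_trading_cost_pct", "n_buy_rules", "n_timing_rules", "n_sell_rules"]

data = pd.read_csv("is_sims.csv")   # hypothetical export of the shared spreadsheet
X_train, X_test, y_train, y_test = train_test_split(
    data[FEATURES], data["os_outperformed"], test_size=0.5, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
print(dict(zip(FEATURES, model.coef_[0].round(3))))  # the candidate 'prediction equation'
```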

Incentive to participate. Only the people participating get the study results and final equation.

QUESTIONS:

  1. Can anyone suggest flaws, improvements, or simplifications to the above?
  2. Does anyone want to volunteer 5-10 hours?
  3. Does anyone have SPSS or predictive equation generation software they can volunteer to use on the data set?

Best,
Tom


MECHANPREDIC.pdf (1.04 MB)

Good ideas.

Another idea could be to have four time periods over which a system’s static variables can be optimized. Optimize the weights on each period, then measure OS performance over the other three periods. You can average the difference between IS and OS performance to see what effect optimizing has, and compare this against equal weighting of factors (a rough sketch of the averaging is below).
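
A rough sketch of the averaging step (the per-period returns would come from P123 runs and be entered by hand; the period labels are just examples):

```python
# Rough sketch: results[(period_optimized_on, period_evaluated_on)] = AR% from P123.
periods = ["2000-2003", "2003-2006", "2007-2010", "2010-2013"]

def avg_is_os_gap(results):
    """Average of (IS return minus mean OS return) across the four optimization periods."""
    gaps = []
    for opt in periods:
        is_perf = results[(opt, opt)]
        os_perfs = [results[(opt, p)] for p in periods if p != opt]
        gaps.append(is_perf - sum(os_perfs) / len(os_perfs))
    return sum(gaps) / len(gaps)

# Compare avg_is_os_gap(optimized_results) with avg_is_os_gap(equal_weight_results);
# a much larger gap for the optimized weights would suggest curve fitting.
```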

Better yet, restrict the development of the model factors to one randomly chosen period only, to prevent optimizing over the entire history, which would create look-ahead bias. This would include ranking system development. It would be improper to have created and optimized a system from '99 to '13 and then slice it up for IS and OS optimal-weighting tests afterwards.

But maybe I am missing the point of this test as well.

Hi,
I agree that mechanical models are generally better at prediction than humans and you certainly have a worthy goal. I’m curious how you will deal with the hindsight bias that is inherent in most P123 strategies - that is, you may have difficulty finding strategies developed prior to 2007, and any strategy developed or updated since 2007 has already been developed if not optimized for your out of sample time frame. So it’s not really out of sample. This is one advantage of the AAII dataset, as I believe most of their models have been around for a number of years.

Research has shown that stock performance over 3 to 12 months is predictive of the next month’s performance. My tests on the AAII data set were longer-term, so perhaps when I have a chance I’ll go back and test a more active approach. Have you seen any research quantifying the nature of market regime shifts (e.g., between small cap and large cap, or between value and growth)? Such shifts may have a significant impact on which models are best suited for a given regime.

Kurtis,

Thanks for the suggestions. Those are also worthwhile studies clearly.

Kurtis / Don,

I guess the point for me is to see which factors matter, if any, and how much. There’s a lot of talk around whether more factors are better or fewer are better, whether optimization is good or bad. I would like to see if I could quantify this for my own decision making. Right now, I am looking at converting a sim to a port, and I’m not sure whether I should run the live version with more buy rules and more holdings or not. So…I would love some studies that give my choices more grounding.

Agree 100% on look-ahead bias having to be avoided. My thought was to just take systems that existed at a given point in time. Many of the public sims can be run with dates attached (you can see the date the sim was run, and in many cases find data around when it was created in threads). You can look up what rankings existed then. You can also comb through old P123 threads and find ‘top sims’ referenced at various points in time. I have some folders from 2009, after I joined. People who’ve been around longer likely have older folders. So…we’d start with those.

But…if it’s just me, the study will be less ambitious than I outlined. Probably 10-20 sims looked at, at least until sometime in February next year.

I agree more sub-periods would likely be better. But…I’m not sure the periods are really long enough to avoid random drift and style regime shifts overwhelming all the factors. So…I still think that two periods to start might be best.

Don,

As far as regime shifts. My recall on these studies is poor. I’m far from an expert. I’ve seen lots of research, but don’t recall titles off-hand.

I remember (barely) a study from Alliance Bernstein on Growth-Value. I wanna say they showed roughly 5-year cycles in growth and longer cycles in value (going back to the 1960’s or so, if I remember), but fairly large standard deviations / variations in both, in terms of cycle lengths and regularity. Because of this they weighted growth at 40% or so and value at 60% in most of their clients’ equity sleeves, and rebalanced annually or at a 5% or so drift from target weights. But…this is all from memory. That meeting was in 2009 or so. So…I might be way off. If I find it, I’ll post it (if I legally can). But…you could fairly easily look up the underlying style indexes for large cap growth and large cap value and create your own chart of their relative performance. I recently saw a paper showing that from 1980 or 1990 till now the value premium has declined significantly. That may be random variation…or it may be the case that most published ‘premiums’ lose a fair amount after publication. There are papers on this; I wanna say something like 30%-40% of pre-publication alpha is lost. But…many factors still do exhibit alpha. You can definitely find these studies on SSRN. There was a really good meta-study on this that I have somewhere.

There is a ton of data / studies on small cap vs. large cap. There are also style indexes available going back to the early 1900s. So…again, I don’t recall specific papers on this, but I may try to dig them up next week. Small caps generally win over most 10-year cycles. Over shorter horizons, there are 1-, 3-, and 5-year cycles where large caps dominate; rarely, but occasionally, even for a decade.

But the variations in these cycles are very large. So…most professionals I know just pick a static weight target and then rebalance to it (i.e. 60% value, 40% growth; X% small, X% big). At most, they add a ‘tactical overlay’ of roughly one-year forecasts on top of a ‘core static’ portfolio. But the tactical overlay piece is typically pretty small (no more than 50%, more typically in the 10% or so range). But…most pros are trying to avoid significantly underperforming the SP500 or a 60/40 port. They are looking for 1% or so of alpha with smaller risk.

Best,
Tom

I think that it is important to identify, when alpha is created, what the source of the alpha is, and whether it can continue.

For example.

If a “value” type strategy has outperformed in a backtest, the question is why, and will it continue in the future.

The answer to “why” is reversion to the mean in earnings growth, a fact that is on the whole under-appreciated by the market.

The question of will it continue is more difficult to answer. “Value” shares do have lower growth rates, and deserve lower valuation. It’s just that they end up with a lower valuation than they deserve, historically. If too many people run this backtest and start bidding up value, then soon people running backtests are going to realise the secret is “Growth”, that buying companies with a high EPS growth rate is clearly the way to go, as stock prices are going to go up in line with EPS growth. And that strategy too is flawed for the same reason.

Ultimately I think there is a flaw in the approach of ranking logic, which is simply to say “more is better”, “lower is better” and so on. The question which would be more appropriate is “how much”. Or more specifically, we know that EPS growth might be good, “higher is better”, but how much are we willing to pay for each percentage increase.

I did attend a class in decision analysis recently, and this is one of the things that came up - how much are you willing to pay for an improvement in factor x?

At the moment we just shove together a bunch of factors into a ranking and hope for the best. I think this is not the ideal approach. To be even more quantitative, one should ask “how much”.

E.g. take a model looking at earnings surprise. Betting on stocks with positive earnings surprise may be a great backtest, suggesting investors are not immediately appreciating the effect of the surprise sufficiently. Will this continue? Again, perhaps a better approach than the usual “more is better” is to consider: what is the average gain for a particular surprise, what is the outperformance after the fact, and therefore, approximately, how much is such a surprise actually “worth”? Namely, if the market has been underreacting to surprises in the past, how will we know if and when, in the future, the market reacts correctly, or even over-reacts to them? This would give better, more robust logic than “more is better”.
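
A rough sketch of putting a number on “how much” a surprise is worth: bucket historical surprises by size and look at the average subsequent outperformance per bucket (the column names are illustrative only):

```python
# Rough sketch: average forward excess return by surprise-size bucket.
# surprise_pct and fwd_excess_return_pct are placeholder column names.
import pandas as pd

def surprise_value_table(events: pd.DataFrame, n_buckets: int = 10) -> pd.DataFrame:
    events = events.copy()
    events["bucket"] = pd.qcut(events["surprise_pct"], n_buckets, duplicates="drop")
    return (events.groupby("bucket", observed=True)["fwd_excess_return_pct"]
                  .agg(["mean", "median", "count"]))

# Tracking this table over rolling windows would show whether the market's
# reaction to a given surprise size is drifting toward "correct" over time.
```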

I know I may be taking this thread in a slightly different direction, but that is also kinda the point!

Oliver:

I’d love to have answers to these types of questions. I could do so much with such answers. It would be the next best thing to a crystal ball for stock prices.

But I think they will be impossible to answer without 1,000 to 3,000 years of detailed economic and stock data. I’m pretty sure we will not get any meaningful answers by just looking at how much alpha comes from an earnings surprise of a particular size over the past 20 years. That is not enough data to “normalize” our results taking into account the current interest rate level, consumer confidence, bullish/bearish sentiment, etc., etc. How many years of data would we need to get 30 periods which have the same levels of interest rates, unemployment, productivity trends, etc.? My guess is 2,000 years, give or take 1,000. If we can’t control for the other major factors, then we’ve no way to calculate the statistics for the “typical” alpha that would result from earnings surprises.

Brian

Oliver:

I respectfully disagree. I would not call the “relative” approach taken by P123 ranking systems flawed. That’s its “genius”.

Of course, it is not the only way to invest in stocks. Another major way to invest is to use an “absolute” approach. Both have their strengths and limitations. The strength of one is the limitation of the other, and vice versa. Which approach one prefers is, I think, based on a person’s temperament, and on whether they have the time to do all the leg work that the “absolute” approach requires.

Warren Buffett is a great example of the “absolute” approach. I must admit that I really like Warren’s approach, and if I could do it, I would be doing it. Warren focuses on those parts of the market that have companies he can understand well enough to predict their future earnings. He looks at a wide range of things, from the quality of the company’s management to the medium- and long-term prospects for the company and its industry. And he stays away from companies, like tech ones, that he does not understand well enough to confidently predict their future earnings stream. Once one is confident of one’s estimate of future earnings, it is simple math to calculate the “present value” of that future stream of earnings. If the stock is below that number, then buy. If the market is so overvalued that no stocks are selling below the present value of their future earnings, then don’t buy anything. Just sit on the cash, and be willing to sit on it for several years if the market’s overvaluation goes through a bubble phase before correcting. At times Buffett has sat on a pile of cash for considerable periods because he could not find companies selling below the value of their predictable future earnings.

Since I can’t do Warren’s in-depth approach, I’m glad to have the option of using P123’s “relative” approach. In fact, I am very glad to have it available. To be sure, this relative approach cannot be used to tell one when the market is overpriced or underpriced. That’s a “limitation” to be sure, but I would not call it a “flaw”.

Best regards,
Brian

Oliver,

I think what you are saying, if I try to apply it to my specific ‘how to select a system’ test, is that you would like to code a single variable:

  1. Clarity / focus of the original system design. A generalist system gets 0, a ‘focused idea’ system gets 1. We might need to educate / have rules for coders and make sure coders’ ratings would generally agree.

That could be a valuable addition. Thank you.

For me, I am very interested in a ‘data set’ and ‘tests’ that will allow me to have a more systematic framework for ‘system selection’ among competing system choices. When I develop a sim, I might have 50 versions of it from backtesting (or more). I have a process for choosing now (involving lots of sensitivity studies), trying to find the system with the greatest risk-adjusted returns across the greatest number of versions tested. I would like to see if I could make that process better by a) building a good, survivorship-bias-free, point-in-time data set of ‘good systems’, then b) designing and c) running experiments on different system selection processes. That’s my goal: both creating the ‘bias free’ data sets of P123 systems so studies can be run, and testing various selection choices.

When Fama/French and the other pioneers did the basic Small-Big, Value-Growth work, they were bluntly testing ‘single factors’ to get at major ‘premiums’ that could beat a market index. Their work was groundbreaking within the academic community and has had a major impact on the ways professionals invest.

More broadly, I am trying to see if there is any community interest in smaller groups and shared research / studies. I take it most have ended badly in the past? Most research in general learns little. The sum of research learns a lot. That’s kinda the history of science. It’s why I left psychology. My mentors were working on very ‘thin slices’ of knowledge for many decades…and it wasn’t clear it made any difference to anyone.

But…I would potentially be open to collaborating on other research projects.

Now…Let’s take your example of a study around a single company.

Yes…we can design experiments to study a single factor and analyze ‘decile’ performance, or run hard ‘constraint’ limit studies factor by factor, to try to understand whether earnings upgrades >5% are better than >3%, or for which stock universes that has been the case historically (small cap value or tech growth, etc.). Those are all good / great ways to learn about the data set.

Here are some sample tests:

  1. A stock that has had a falling stock price, missed earnings four quarters in a row, shown near-term stock weakness, and maybe even had a rise in short interest, but then has a 5% surprise, is likely to do better than a stock that has had 4 quarters of 10% earnings surprises and a near-term price run-up before earnings and then has a 5% surprise.
  2. A stock in an industry with falling earnings that has a big surprise is likely to do better than a stock in an industry with rising earnings whose earnings surprise is smaller than the average earnings surprise of the rest of the industry.
  3. A stock with heavy near-term selling is likely to do better around a surprise than a stock with heavy near-term buying.
  4. Market expectations can be measured by VMA movements around earnings, particularly with options activity. A stock with a near-term rise in options VMA that then has a 1 st. dev. or larger ‘earnings surprise’ in the opposite direction is likely to have a major near-term stock (and option) price move.
  5. A stock with an earnings surprise that is 2 or more st. dev.’s above (below) its most recent 4 (or 8) quarters’ movements can be expected to have a major move.

Using option data around announcements to predict expectations and their relative size would likely be much better. We don’t have that data here…but it’s said to be ‘smarter money.’ We could then measure the ‘earnings surprise’ against that predicted by options price moves.

But…as for Brian’s question on wanting to invest like Buffett: I definitely don’t. I actually don’t believe in the Buffett model. It’s too hard to model any single stock and the interactions of all the market players in the game, and ‘impossible’ gets raised to a large exponent as the time frame lengthens. I do believe in ‘clusters’ of stocks with similar properties over short periods, because we can measure what the market players, in sum, are doing. That’s why I’m a systematic trader with short-ish holding periods.
Etc.

Some issues -
A. I want to study the system selection issue:

  1. Build the data set of systems (possibly).
  2. Construct some tests from the ‘best community’ systems at the time.
  3. Run these tests on the data set, both 1 factor at a time, and with clusters of factors.
B. Any interest in the community in collaborating on this project (or other projects that are specific, that people can post)? Fama/French collaborated. Shiller/Case collaborated. Science and research is very often a group effort.

I have no issues with these ‘how to build a better system’ experiments. But they need to be defined. And…it’s somewhat different from ‘how to better select among systems.’

So…Rather than being abstract, are there specific tests you would want to run with community collaboration? Post them, please. I will definitely let you know if I have time / interest/hours to contribute.

Or it may be that you are saying that my study should probably be called: ‘A mechanical model for choosing among competing systems designed by the same designer to get at the same return driver, using only the same factors in the ranking system.’

That would clearly help narrow it…would it make it ‘better’? I don’t know.

Best,
Tom

Clearly this is an important question. One of the few basic questions in fact.

How much am I willing to pay is a question many of us ask and answer each and every trading day. It is explicitly answered when we place a limit order. We are saying I am willing to pay this much and no more.

Often our limit order is even an implicit answer to the question of how much am I willing to pay for an “improvement in factor X.” Assuming factor x is in the ranking system, a person’s willingness to buy the stock at a certain price is an answer to this question.

One could argue the only other question is when to sell.

How to answer these questions is the hard part.

Oliver,

Although I don’t think that “the ranking system approach of selecting more is better or lower is better” is a “flaw”, many factors can be significantly improved by limiting outliers from being ranked. Below is an approach to answering “how much”.

Back in late 2008, when there was a lot of discussion in the Forum about factors that had “failed”, many of the single-factor ranking system tests were based on running the ranking system performance tests using 20 buckets. However, I found that was flawed. For example, look at the factor Pr2SalesQ, lower is better. Before the recession, we were limited to data starting 03/31/2001. If you run the ranking system performance for Pr2SalesQ from that date through 10/01/2007 using 20 buckets, you will get an annual return in the top bucket greater than 20%. If you re-run it using 200 buckets, you will get -3.7% in the top bucket! (I used 200 buckets, weekly, and a filter: ADT > $200K, Price > $1.) The rest of the top 5% of buckets do average over 20%, but many of these top 0.5% outliers are just bad companies. This is because companies near or in bankruptcy have very low values of Price/Sales: they tend to have a low price and high sales, but are losing money.

Back then I created a ranking system (private) that started with an existing public system that I changed by adding a Boolean limit to each factor that showed that it failed in the top 0.5% bucket during the recession. The approach I took was to test each factor of the system as a single factor ranking system for the dates 10/01/2007 > 03/09/2009, (the dates from the S&P 500 high to the low). For each factor that showed worse performance in the highest buckets than the lower buckets I added a Boolean limit function to give an NA ranking to stocks that had a value outside of the limit.

For example, the factor Pr2SalesQ “failed” during that period with an annual return of -70% for the top bucket. By adding a second Boolean function, Pr2SalesQ > 0.1, to the single-factor ranking system, the annual return for Pr2SalesQ was improved to “only” -57%. I retested it today to see how it performed out of sample from 03/09/2009 through yesterday. I got an annual return of 47% with only Pr2SalesQ, and that increased to 61% with the addition of the Boolean filter. Over the full time from January 1999 to yesterday, the top-bucket annual return of Pr2SalesQ increased from 6% to 19% by adding the Boolean filter. This shows that, at least for that factor, this approach for limiting bad outliers is robust. I have also tested my private modified ranking system out of sample using the above approach, and it is also very robust, significantly outperforming the original unmodified system.
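
To make the mechanics concrete, here is an illustrative pandas sketch (not P123 syntax) of the Boolean-limit idea on a single factor; the limit value 0.1 follows the example above, everything else is an assumption:

```python
# Illustrative sketch (not P123 syntax): rank Pr2SalesQ "lower is better", but give
# an NA-equivalent (worst) rank to stocks outside the Boolean limit, so they are
# down-ranked on this factor yet can still rank highly on other factors.
import pandas as pd

def limited_factor_rank(pr2sales: pd.Series, lower_limit: float = 0.1) -> pd.Series:
    values = pr2sales.where(pr2sales > lower_limit)        # outside the limit -> NaN
    ranks = values.rank(ascending=False, pct=True) * 100   # lower P/S -> higher rank
    return ranks.fillna(0.0)                               # NA behaves like the worst rank

# In a multi-factor composite this differs from a universe filter: the stock keeps
# its ranks on every other factor instead of being excluded entirely.
```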

Denny :sunglasses:

Denny,

Sounds a lot like…‘cleaning the data’ or trying to ‘throw out trash’ that a lot of quant firms do. They might eliminate some % of accruals for example. Or companies with other ‘iffy’ finances. They then add another stage on top. But…typically I think this is done at the ‘universe’ creation stage. No reason it has to be, it’s just how I’ve seen it most. People will create universes that attempt to eliminate companies where fraud, misrep., or total loss of capital is most likely (or some other factor…if for example…it’s a breakout universe…might eliminate all companies valued at more than 10X PEG or whatever).

They will then run some ‘simple’ish’ form of ranking on this universe. Often there are multiple stages in this. I.e. find top 10% of small cap growth from this ‘clean’ universe. Then rank those ‘clean’ small-cap growth companies on some other factor (like down momentum over 24 months or whatever). So, I’ve seen it usually done in stages…where each stage is fairly simple.

So…it’s interesting. I think it’s slightly easier for me to wrap my head around what’s really happening if I do it sequentially. Eliminate these companies. Rank remaining on this group of factors. Then rank the top X% of that universe on Y. When multiple factors are ‘lumped’ in the ranking and some portion of rankings are eliminated…gets hard to figure out the factor interactions for my small brain.

I think I’d rather remove ‘companies’ at the universe stage…not the ranking. Thoughts?

Best,
Tom

3 observations:

  1. I’m pretty sure we are not talking about the same thing

  2. everyone is making a good and very true point

  3. as usual, Denny’s point (others too) will make you some money if you listen :slight_smile:

Tom,

You really, really need to reread my post and try to wrap your head around what’s really happening. You can’t remove the outliers with universe filters unless you first run the ranking system for each factor to determine what value to set the Boolean limit to, and then what’s the point? If you add the limit rule to the universe, it will remove any stocks that hit that limit from further consideration. In the ranking system it only assigns a low rank value to that stock for that factor (same as all NAs), but still ranks the stock on all the other factors and functions in the ranking system. So a stock that may be set to NA for one factor may rise to the top of the rankings for many other functions and still be rightfully selected by a Sim. Think this through. You might pick up on some ideas. That is what you started this thread for, isn’t it?

Denny :sunglasses:

Guys,
Let us not complicate matters when there is no reason to do so. The principle is KISS: keep it simple, stupid. I still say count the number of estimated parameters in a model. Apply a scientific Model Selection Criterion (MSC) which punishes models with too many estimated parameters; models get a score depending on performance and the number of estimated parameters in the model. The score will then point to the models which are most likely to be the most robust ones. This is a scientific way to do things. We do not have to re-invent the wheel.
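
To make the idea concrete, here is an illustrative sketch of a parameter-penalized score (an AIC-style criterion plus a cruder CAGR haircut); the exact MSC is not spelled out in this thread, so treat the details as assumptions:

```python
# Illustrative only: two parameter-penalized scores, not the exact MSC referred to above.

def aic_like_score(log_likelihood: float, n_params: int) -> float:
    """Lower is better: 2k - 2*ln(L); extra parameters must buy real improvement in fit."""
    return 2 * n_params - 2 * log_likelihood

def return_penalty_score(cagr_pct: float, n_params: int, haircut_per_param: float = 1.0) -> float:
    """Higher is better: reported CAGR minus a fixed haircut per estimated parameter."""
    return cagr_pct - haircut_per_param * n_params

# Example: a 40% CAGR model with 30 parameters scores 10, while a 20% CAGR model
# with 7 parameters scores 13 under a 1-point-per-parameter haircut.
print(return_penalty_score(40, 30), return_penalty_score(20, 7))
```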

The reason I am getting no support for this idea is that most people have stuffed their models full of parameters in order to show high returns. The MSC would punish these models no matter what returns they show. One can get any desired return by introducing more and more variables into the algorithm. Then, if some of them break down, the model implodes.

That is my view on this matter. I will soon provide the rules for a model with only 7 estimated parameters which has a CAGR of about 20% from 2000-2013. I have tested the algorithm in an Excel spreadsheet and am pleased to report that the signals from the spreadsheet matched those from P123. This model is not likely to break down and should be very robust. Obviously it does not make 100% each year, but what is wrong with a steady 20%, considering it only had 14 realized trades since 2000, all of them winners?

Georg

Georg,

Although I agree with most of your posts and the great info on your website, we are just going to have to agree to disagree on this. I spent 42 years running thousands of statistical models on billions of engineering data points, and some statistical approaches just don’t work very well on many forms of data. We must be VERY careful in what statistic we select and how we apply it to our data. I have been bitten badly by misapplying statistics to my data; when lives are at stake in aircraft design you can’t afford to misuse statistics. KISS and MSC are a couple of approaches that sound good, and they are for many things, but not necessarily in this case. I have made a LOT of money over the last 8 years running Ports with a large number of parameters. I have had over 100 live Ports running on automatic for years, and NONE of my simple Ports have come anywhere near the performance of the more complex ones.

With MSC, and to some extent with AIC, we want to find the model that best represents the “true” model, that is, the model that accurately and realistically describes the data. Those approaches work well when there is true causality between the parameters in the model and the data set. If you have a classical data set of 1000 data points and you can approximate the true model to within only 5% with a half dozen parameters, you will have lost little information. If you try to add a dozen more parameters that have little or no independence from the original half dozen, then you may be able to “fit” the data to within 0.5%, but there will be significant lost information. That is the basis of MSC’s and AIC’s ability to identify potential overfitting.

However, we are dealing with 8000 stocks, over 13 years, 240 days/year, and 5 values/day (O, H, L, C & volume); that’s more than 100 million samples. And from this we are simply trying to find WHY some things go up. Now, that data set is scattered all over the realm of possible outcomes. It would truly take hundreds (if not thousands) of independent parameters to come close to a true model (that is, “The Stock Market”). The scatter and noise of the data set are what require additional parameters to come closer to representing the market. Remember, the scatter and noise ARE REAL DATA POINTS. We just call them scatter and noise because they mess up an otherwise pretty chart. And with all that data we are left wondering how much the future will be like the past.

We have at our disposal over 1,000 factors, functions, indexes, ratios, statistics, indicators, and estimates to choose from as parameters when selecting a subset to create a model that represents the true model. Now, I feel that it is IMPOSSIBLE to represent the true model with a small number of parameters, and in order to have any chance of getting a system that comes even close to representing the true model, it may take hundreds of truly independent parameters.

We can come close with fewer parameters by first defining a specific universe of stocks to design a system for. And, as in my example above, we can narrow down the things that go up by adding functions like Pr2SalesQ > 0.1 to our models. If the data set is much smaller and the stocks in it perform more similarly to each other, then a dozen parameters can get closer to representing the smaller set. But I contend that two dozen carefully chosen parameters will have a better chance of finding diamonds in the noise. At least it works for me.

Denny :sunglasses:

Denny, any idea what the annualized return for this perfectly-fit or “true” model of the stock market would be? Right now the top R2G’s have AR in the range of 120%+. I can totally see why some people simply ignore these as “too good to be true”. I wonder often about this myself. But compared to the true model, it’s probably quite low.

Mister Chang, check out some of the public simulations. Some have >2000% AR.

I think the problem comes when this is over-used to the point of extrapolating previous trends into the future and assuming that is how things will continue.

I shall illustrate some examples that show how a system could “backtest” great but fail totally in real time.

Imagine we are in 1999. First question, stocks or gold? Our “backtest” would have shown that for the last 20 years, stocks had done very much better than gold. The S&P 500 was floating just over 100 in 1979, and by 1999 it was over 1200. Gold on the other hand stayed around the same $350-$400 per ounce, with mild decline in the 1990s, even worse if you take into account inflation. A terrible investment for sure.

We could even run lots of “robust” tests (was it just a few stocks, what if we pick a portfolio randomly, what were the Sharpe and Sortino ratios) and statistically analyse all the results to come to the conclusion that yes, stocks are certainly better than gold, and that to maximise the return you want to allocate 100% to stocks, if you were to do this in the strict “no discretionary” way.

The result since has been quite the reverse, now the S&P 500 is at 1700-1800, meanwhile gold has soared to over $1100. In fact, it did the exact opposite of what the “backtest” told you.

Similarly, if you had run a stock backtest at the end of 1999, it would have overwhelmingly told you “internet stocks”. If you had piled into those, well you would have probably had a 90%+ drawdown without recovery.

Part of the problem is that during the period 1980-2000 there was a revaluation upwards in equities. While the “E” was going up, so was the P/E ratio too, so there was a double whammy, and many people had not realised that the latter was a “one off” transient thing that was not going to persist into the future.

Now, as we sit in 2013, I am a bit concerned about the same thing happening in “value” shares. As the world gets more quantitative and “value” investing is picking up, the gap between value and growth is narrowing. Similarly, a lot of the spectacular returns of “value” should be taken with a pinch of salt, because they too may be a “one time” event not to be repeated. In fact, at some point investors may indeed re-discover the merits of selecting companies with superior earnings growth prospects, and thus the cycle comes around again.

Right. And more importantly, he is the one with the billions of dollars. I think billionaire investors are underrated, I find it amazing how they give such good advice that is so little listened to.

As Buffett said though, “better to be approximately correct than precisely wrong”. I don’t think we want to try and build a pin point accurate model. In fact, in the previous example, running a backtest repeatedly up to 1999 shows exactly what “precisely wrong” means. You could apply all sorts of sharpe and sortino type analysis to it, different baskets of stocks, and it would all be “precisely wrong”. He did in 1997 express concern that equities were “very expensive” - that is approximately correct.

We want to select our assets that are going to perform the best in the future, if we can identify what is likely to outperform in the future by quantitative analysis, then if we have a basket of those stocks that is sufficiently large, e.g. 50, then it is likely that we will get the result we are looking for, without necessarily having to be completely accurate on a case-by-case basis. I am not sure if the latter is possible given the random and chaotic nature of the world and markets anyway.

I am not sure to what extent these ideas can be integrated into a system for estimating robustness in a mechanical way.

I still advocate (number of trades / degrees of freedom) as a good number to consider; the number of trades gives statistical confidence, and lower degrees of freedom is arguably about being “approximately correct”.

Denny,

The language in my first post in this thread may have been too imprecise or felt flippant. That wasn’t my intention, though.

Agree that clearly we have to run either single-factor screen tests or single-factor ranking tests in advance to decide where to ‘set the value.’ I do have plenty of systems that do both (some put the rules in the universe, some in the rank). The difference is in what conditions, if any, we might want a hard constraint on.

You recently added some hard buy rule constraints to one of your R2G’s, allowing the system to only buy Canadian and US stocks, for example. That ‘rule’ could easily have been added to the ranking as a country preference. But…you didn’t do that. You felt a hard constraint was more appropriate. I am guessing there are many other systems where you have buy rules that could easily be in rankings (say a minimum 1-year change in earnings growth or sales growth, or liquidity, or market cap top-side constraints). If that’s the case, why aren’t they ranking system factors to you in these systems?

So…that’s the question: to put factors in the ranking, or in the universe, or in the buy and/or sell rules. To use ‘hard’ or ‘soft’ constraints. It could be interesting to hear how people think about it. Don is someone who has a clear preference, I think: almost all rules in the ranking, apart from market timing. But…for others…is the only decision criterion maximizing sim results at ‘ideal settings’? Or are there other reasons we make these choices?

I’ve recently been talking to a quant fund that puts some of the rules you mentioned in only at the universe stage, because they want the hard constraints: they never want to own certain types of stocks. It’s part of both a) their belief system and b) how they sell themselves to clients. They believe that a hard constraint is more effective in shifting the distribution of possible outcomes: if the underlying universe has a hard constraint, it helps cut off the left tail of possible ‘extremely negative’ outcomes. They’ve backtested back to the 1970’s and been vetted by family offices. I haven’t done either.

The difference is clearly if we want a hard constraint or not…or just relative factors. In a 100 factor ranking system, for example, it’s very possible for stocks that we might not want to own for some specific reason(s) to still pass.

For example: maybe a fund never wants to own the top 2% most shorted stocks in a system, no matter what. Or maybe a fund never wants to own a stock that is entering merger talks. Or it never wants to own stocks that have had falling earnings for 10 quarters in a row, still cost “a lot” relative to their industry, and have accruals that indicate financial trickery is likely.

So…What’s a ‘Universe factor’ and what’s a ranking system factor and what’s a ‘buy’ rule factor to you? Or is it just the ‘art’ of things and system to system choice? Or only based on what produces the best sim on a case-by-case basis? That was what was meant in the earlier post.

Best,
Tom

Oliver,

You believe that number of trades / number of parameters should likely be added as a test variable. Thanks. I agree that’s worth looking at. Many of the existing variables will be combined if a ‘machine learning’ type approach is used to generate prediction models. But…any ratios people think definitely should be included are welcome. I recognize I will likely be doing this study myself.

Beyond that, as to your post: I agree that no system predicts the future perfectly. But…we’re building systems because we think they’ll help us better manage our investments. A lot more institutional money has flowed into trading systems than into discretionary trading in the past two decades as well. That may be good or bad, but it is the ‘smart money’ speaking. We should all listen to Buffett, agreed. But…it is often hard to apply to what we do here, beyond general rules of ‘buy good companies, with moats and good management, and hold them forever, or until their value proposition falls a lot.’ Buffett does not place much value in systematic trading systems, at least from what I remember him saying. But…yet we build them. There are billionaire systematic traders (Paul Tudor Jones…though his new fund is getting killed, David Harding, etc.). They might hold additional clues for us.

And…many people still use systematic methods to avoid most of the specific issues you are mentioning. The specific examples you mention are more at the portfolio construction level.

Sample rules used in these cases, for example:

  1. US large cap equities can range from 5% to 15%.
  2. US hi-quality bonds can range from 10% to 20%.
  3. Pattern trading systems can range from 5% to 10%.
  4. Gold can be from 0% to 5%.
  5. No single system can get more than 1% of capital.
  6. No ‘type’ or class of system can get more than 2% of the ‘total risk’ of the portfolio, etc.

There will be additional constraints on the portfolio. i.e. total trailing period volatility and correlations and predicted forward period CVAR will be used to dynamically allocate within allowable ranges. People build complex systems to manage this.

They then either have a) rules based systems that alter allocation weights, b) discretionary ‘meetings’ where they make their allocation decisions or c) some combination of the two.

Some people also use things like trailing X-period correlations to set allocations. They have hard rules like the following (a rough sketch of checking such rules follows the list):

  1. 50% of all systems must be below 0.1 correlation.
  2. 95% of all systems must be below 0.3 correlation.
  3. No system or ‘style’ or country can get more than 5% weight.
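
A rough sketch of checking rules like these against a matrix of pairwise system correlations and a vector of proposed weights (rules 1 and 2 are interpreted here as applying to pairwise correlations; the thresholds come from the list above):

```python
# Rough sketch: verify the hard correlation / weight rules for a proposed allocation.
import numpy as np

def check_allocation(corr: np.ndarray, weights: np.ndarray, max_weight: float = 0.05) -> dict:
    pairs = corr[np.triu_indices_from(corr, k=1)]            # unique off-diagonal pairs
    return {
        "pct_pairs_below_0.1": float(np.mean(pairs < 0.1)),  # rule 1 wants >= 0.50
        "pct_pairs_below_0.3": float(np.mean(pairs < 0.3)),  # rule 2 wants >= 0.95
        "max_weight_ok": bool(weights.max() <= max_weight),  # rule 3
    }

# Example with 10 random return series and equal 10% weights (which fails rule 3):
rng = np.random.default_rng(0)
returns = rng.normal(size=(252, 10))
print(check_allocation(np.corrcoef(returns, rowvar=False), np.full(10, 0.1)))
```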

The combinations are close to infinite. The programming is likely more challenging than what P123’s done so far…as there are more constraints and predictions…and model feedback into other modules.

But…for me…the basic question of the thread initially was a specific methodology to try to develop a study people might want to participate in to make better choices when choosing among similar P123 systems. To make this criteria better, I would limit the ‘choice’ to systems that are in the same basic ‘style bucket.’

I threw out a study I am thinking of undertaking to see if people want to collaborate or could suggest specifics to make it better. From your comments, I would agree that it’s better to focus on limiting the ‘comparison’ systems to similar-style systems. So…initially…equity small- and microcap systems, likely with some value tilt. Are there other specific thoughts you have on a study of how best to choose systems? Or do you feel the study is pointless / hopeless from the get-go?

Best,
Tom