FACT: Simulation results are no different from out-of-sample results

There’s been a lot of unrest about simulations generating different results.

It’s true. There have been algorithm changes in some ratios (see posts that start with IMPORTANT), there have been bug fixes, and there have been data fixes in Compustat too. Because we do not snapshot the majority (90%+) of the point-in-time ratios but re-calculate them on the fly using the latest versions of the data, simulations run in the past give different results when run today. There are arguments for and against on-the-fly ratios, but we are sticking to on-the-fly. It is the only manageable way to do it, so this problem with changing sims will persist. I’ll come back to how we’ll address this…

I have noticed another disturbing trend: out-of-sample results are being elevated to gospel.

This is just not so. Out-of-sample results, just like simulations, are nothing more than possible outcomes of a strategy. Let’s remember what we are doing: constructing a portfolio of 10 stocks where a decimal precision of 0.01 in the rank determines the order. On top of that, ranks are based on the relative order of financial data full of N/As, pre-announcements, and outliers. It’s no wonder that any small perturbation can cause the top 10 to be completely different.
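To make the sensitivity concrete, here is a minimal Python sketch. It is not P123's ranking code: the universe of 3000 hypothetical tickers, the uniform spread of ranks, the noise size, and the function names are all assumptions chosen for illustration. It measures how much of the original top 10 survives when every rank is nudged by a small random amount.

```python
import random

def top_n(ranks, n=10):
    """Return the set of tickers with the n highest ranks."""
    ordered = sorted(ranks, key=ranks.get, reverse=True)
    return set(ordered[:n])

def perturbed_overlap(ranks, noise=0.05, n=10, trials=200, seed=0):
    """Average fraction of the original top n that survives small rank noise."""
    rng = random.Random(seed)
    base = top_n(ranks, n)
    total = 0.0
    for _ in range(trials):
        # Nudge every rank by a uniform amount in [-noise, +noise].
        noisy = {t: r + rng.uniform(-noise, noise) for t, r in ranks.items()}
        total += len(base & top_n(noisy, n)) / n
    return total / trials

# Hypothetical universe: 3000 stocks, ranks 0..100 stored to 0.01 precision.
universe_rng = random.Random(1)
ranks = {f"STK{i}": round(universe_rng.uniform(0, 100), 2) for i in range(3000)}

print(perturbed_overlap(ranks, noise=0.0))   # no noise: the top 10 is unchanged
print(perturbed_overlap(ranks, noise=0.25))  # a quarter of a rank point of noise
```

With thousands of stocks squeezed into 100 rank points, neighbours near the top are separated by only a few hundredths, so even small noise can swap names in and out of a 10-stock portfolio.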

In other words: a ranking system’s worth can only be judged on large groups of ranks, deciles or even quintiles. But systems of a handful of positions are being created in P123 using fractions of 1% rank! It’s absurd. And on top of that, layers upon layers of buy & sell rules are applied, and market timing to boot. This makes falling into the curve-fitting trap very easy and natural. I’ll go as far as saying that engraving past ratios in stone would make curve-fitting even worse.

So what to do? It’s our presentation that needs to change. Simulations will no longer be presented the way they are now. Rather, they will be presented as a multitude of 3-year simulations with varying start dates and random noise. The true worth of a trading system is not its past results (whether sim or “live” makes no difference, it’s the past) but the probability of outperforming a benchmark in a three-year period by x%. Or something like that. No more “Portfolios”. We should also rename Portfolio123 to System123.
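One possible reading of this idea, sketched in Python under stated assumptions: weekly returns, a 156-week (three-year) window, Gaussian noise added directly to the strategy's weekly returns, and a synthetic return series. None of these choices come from P123; where and how noise would actually be injected is not specified in this post, and `prob_outperform` is a hypothetical name.

```python
import random

def window_return(period_returns, start, length):
    """Compound the per-period returns over one window."""
    growth = 1.0
    for r in period_returns[start:start + length]:
        growth *= 1.0 + r
    return growth - 1.0

def prob_outperform(strategy, benchmark, length=156, excess=0.10,
                    noise=0.002, runs_per_start=10, seed=0):
    """Share of noisy rolling windows in which the strategy's compounded
    return beats the benchmark's by at least `excess`."""
    rng = random.Random(seed)
    wins = trials = 0
    for start in range(len(strategy) - length + 1):
        bench = window_return(benchmark, start, length)
        for _ in range(runs_per_start):
            # Assumption: noise is added to each weekly strategy return.
            noisy = [r + rng.gauss(0.0, noise)
                     for r in strategy[start:start + length]]
            if window_return(noisy, 0, length) - bench >= excess:
                wins += 1
            trials += 1
    return wins / trials

# Synthetic ten years of weekly returns, for illustration only.
data_rng = random.Random(42)
benchmark = [data_rng.gauss(0.0015, 0.02) for _ in range(520)]
strategy = [b + data_rng.gauss(0.001, 0.01) for b in benchmark]

p = prob_outperform(strategy, benchmark)
print(f"P(beat benchmark by 10% over 3 years) = {p:.2f}")
```

The output is a single probability rather than one equity curve, which is the change in presentation being described.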

Thanks

Marco,

I applaud your effort to generate more realistic ‘ranges of outcomes’ (and generally agree about the folly of precision and microprecision on tiny ports that are highly optimized, although to each his own and we should let people trade how they want). I have long supported (and often posted on) my desire to have the ability to automate the creation of ‘envelopes’ and run batch tests of these with a bunch of computer-generated ‘system changes’ on variables I want to change.

I have also posted on things like ‘randomizing the returns’ in backtests to see how the ‘order of wins/losses’ worked for/against us historically - this is a backtesting best practice.
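A minimal Python sketch of what ‘randomizing the returns’ can look like, assuming a plain list of per-trade (or per-period) returns; the sample numbers and function names are hypothetical. Shuffling leaves the final compounded return unchanged but shows how much the worst drawdown depends on the order in which wins and losses happened to arrive.

```python
import random

def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve."""
    equity = peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst

def shuffled_drawdowns(returns, trials=1000, seed=0):
    """Sorted max drawdowns over many random reorderings of the same returns."""
    rng = random.Random(seed)
    sample = list(returns)
    results = []
    for _ in range(trials):
        rng.shuffle(sample)
        results.append(max_drawdown(sample))
    return sorted(results)

# Hypothetical trade returns, for illustration only.
trades = [0.08, -0.05, 0.12, -0.10, 0.03, 0.15, -0.07, 0.04, -0.12, 0.09]
dds = shuffled_drawdowns(trades)
print(f"historical order: {max_drawdown(trades):.1%}")
print(f"median shuffled:  {dds[len(dds) // 2]:.1%}")
print(f"worst shuffled:   {dds[-1]:.1%}")
```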

But, I have suggested limiting these ‘batch tests’ to off-peak hours - although I’d love to run them anytime.

However, this should only be launched on a beta site first. It’s a very large undertaking. I am not sure it should be the ‘default’ new P123 interface. I think, perhaps, it should just be another ‘advanced’ section in testing available to people building systems - and perhaps limited to more expensive memberships, as it is likely to be CPU intensive, right?

And just adding things like ‘ranges of transaction prices’ is not going to do much at all to change the underlying issue. At the end of the day, people need to invest a lot of time learning to use P123 and to understand system building and backtesting and financial theory. I don’t think there will ever be a great short cut for that.

Longer term, for better systems, developers have to be able to have the system automatically vary a lot of variables (I’ve posted on this elsewhere, way back in the early R2G days), such as position size, turning all rules and ranking weights on and off, start and end dates, etc., before we begin to get meaningful ‘envelopes’ of possible performance ranges. And, if designers are committed enough, they will still be able to game this on R2Gs.

My main desire in all this is to get to better tools that let me get more reliable alpha with my own portfolio. I would still love for that to be the core of P123. R2G is helping P123 find issues with its tool set and/or data set, that’s great. But, please don’t make huge changes to the P123 interface - just put all this new functionality on a beta site, and consider rolling it into a set of advanced tools only once it’s been around for 6-12 months and we have found out the issues with it.

Best,
Tom

Why keep the optimization tool?

To compare the noise of one sim variation to another?

Really, add noise?

Jim,

I’m not suggesting more testing or optimization is good or bad. I just view it as another tool. But Marco’s section of this note on ‘changing the default P123 interface’ completely to do this really scares me. But the tool is one I would welcome. That’s all I was trying to say.

I don’t want to wade into all the other threads on ‘changing historical results’. I get that sometimes there are errors and sometimes data gets revised. We, as users / system builders should however, be able to get a sense of to what degree:
a) Data. Changes are coming from vendor data revisions of certain factors and which factors and market spaces are most problematic (for example, a historical study of the range and size of any vendor changes by factor type / category and market cap would be very valuable in building new systems) vs.
b) Errors / learning. P123 internal changes of formulae, and how many are happening annually. I haven’t attempted any study on these. But, does formula complexity change the error rate? Or are certain types of factors more likely to be revised later on?

Any data that is constantly revised or has large non-PIT elements should likely be clearly labeled in the factor description, or kept in another section entirely (like the economic indicators currently - they aren’t PIT, but no new user will know that and it’s easy to produce faulty systems with them). The documentation on this seems very messy and it’s likely P123 has no data on the underlying data revisions right now.

Knowing more exactly what’s causing the changes will help us all design better systems, beyond just saying - let’s all build large numbers of factor value quality systems with 50 positions. That is one solution and it’s not wrong, but we should have a little more understanding of the size of the issues.

I have found several ‘glitches’ over the years myself and had these fixed. I clearly threw out any systems using them. In my case, they all happened before live money was invested. My view is that P123 is a ‘blunt tool’ that can help us make better choices. We don’t all know exactly how blunt, or where all the changes come from.

Best,
Tom

Tom,

It scares me too. This was not achieved by looking at changes in noise: real edited results. I guess I will stay as long as I can still run the port. It is making me money and I have P123 to thank.

I have no complaints about P123. I’m not sure that the assumption that my statistics or the way I do my backtests is poor is a correct assumption. There is more than one way to do something well.


[Attachment: Not achieved with noise.PNG]

“I have noticed another disturbing trend: out-of-sample results are being elevated to gospel. This is just not so. Out-of-sample results, just like simulations, are nothing more than possible outcomes of a strategy.”

So while out-of-sample results are not “gospel”, they are what designers’ reputations live and die by. There is always a probability that OOS will under-perform its promise, but if you make the statements you are making, then what are we here for? The bottom line is that each and every designer has a set way of designing their models. The designer can’t determine the outcome of any individual model but he can control his underlying development process. And that development process is really what it is all about. The process will ultimately show up in the designer’s track record. Statistically speaking, some models will under-perform, some will outperform. But comparing an OOS track record to a backtest simulation is simply ludicrous.

I would like to suggest, for the benefit of P123 users, that the future of R2G not be swayed by personal agendas, or statements to the effect that there is one and only one investment strategy. Keep the next generation of R2G fair for everyone. And I have to repeat what I said earlier about how rolling backtests are conducted. The startup of both sims and ports is an issue, and if not done properly then some models will be unfairly put in a bad light.

I would also like to know where noise will be injected.

Steve

As “Tomyani” suggests, please be conservative when making changes so that those of us who have become comfortable with p123’s current toolbox can manage future transitions smoothly.

Thanks!

Hugh

I fully support Marco’s desire to change the way R2G Ports are presented. However, PLEASE allow those of us that understand the pitfalls to use the Sim tool as it currently functions. I don’t need to know the “possible range of outcomes” of every Sim run. I can figure that out for myself. And please don’t add noise to my private Sims.

Maybe all public Sims & Ports, R2G or otherwise, need to be presented as a range of possible outcomes, and allow private ones to be run as they currently are.

Jim,

I only use the optimization tool to determine the “possible range of outcomes”. After I have a final Sim developed I use the optimization tool by changing all the rules and start dates by 5 or 10%. That gives me a good estimate of the expected range of returns. I never use it to find the best combination of things that optimize the Sim. That is a formula for guaranteeing data mining.
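As a rough sketch of this kind of robustness check (not the P123 optimization tool itself), the Python below nudges every numeric parameter by up to ±10% and reports the range of results. `run_sim` stands for whatever function runs a backtest and returns an annualized return; `toy_sim` and its parameter names are hypothetical stand-ins so the example runs on its own.

```python
import random

def outcome_range(run_sim, params, pct=0.10, trials=100, seed=0):
    """Run `run_sim` repeatedly with each parameter nudged by up to +/- pct
    and return (lowest, median, highest) result."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        nudged = {k: v * (1.0 + rng.uniform(-pct, pct))
                  for k, v in params.items()}
        results.append(run_sim(nudged))
    results.sort()
    return results[0], results[len(results) // 2], results[-1]

def toy_sim(p):
    # Hypothetical stand-in for a real backtest: the return falls off as
    # the parameters move away from an arbitrary sweet spot.
    return (0.20 - 0.01 * abs(p["max_pe"] - 15.0)
            - 0.002 * abs(p["min_rank"] - 90.0))

low, mid, high = outcome_range(toy_sim, {"max_pe": 15.0, "min_rank": 90.0})
print(f"return range: {low:.1%} .. {high:.1%} (median {mid:.1%})")
```

The point is to look at the spread, not to pick the best run; a start-date offset could be treated as one more parameter in the same way.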

Marco,

Please figure out a way to let us use the current Sim algorithms to develop new systems.

Thanks Denny.

I was mainly wanting to make the same request, really, as you. Please don’t add noise.

I think Andreas says this: “P123 rocks!” It is making me money. The sims the way they are are responsible for that. And I have never complained about data: indeed, I depend on that data. I do try to understand how it is best used.

Marco,

All the talk about sims and ports being probabilities of outcomes rather than solid results reminds me of quantum physics. Perhaps the act of just observing a sim can change the result?

I concur with Denny, Steve, Jim, Tom and other members. Please don’t let the vocal minority cause you to remake p123 into ‘s123’ or something else. While showing the range of results would be a nice feature, perhaps it should be a checkbox option or a beta site for quite a while.

Having been with p123 since the beginning, like others, I am worried that the tool I have been making money with for a decade will be changed for the worse. Baby steps, please.

Hi Marco,

I agree with Denny. It is perfectly fine to incorporate a range of possible outcomes, perhaps as an uncertainty envelope, but please do not introduce any noise!!!

Also, not all model designers are curve fitters. I have spent almost all of my free time in the last 6 months developing ports with universal factors which work independently of the market conditions. At least the simulations so far show promising results (see capital curve below); but by introducing noise, how would one be able to judge what the actual potential of a designed strategy is?

Perhaps we could avoid the whole wrong focus on simulated performance by just eliminating the simulation portion in an R2G. That way people will just judge a model by out-of-sample performance, as is the case for any mutual fund.


[Attachment: steady_returns.jpg]

I am with Marco, the p123 “Pros” know how to “read” a sim and add all kinds of tests in order to find out the probable outcome.
“Beginners” and maybe the “middle of the pack” do not.

I would go so far as to say that the major model I trade only works well under the assumption that I have no slippage at all (it works with slippage, but not nearly as well), and I do not have slippage trading this model! But it takes a lot of experience to understand this.

So I think it is good to change this, and it can be done by leaving “today’s red line of the sim” as it is but showing other probable outcomes (different start dates, different assumptions about open, close, or between open and close price, maybe even a different number of stocks, etc.) in the same graphic with different colors (maybe with a switch that is on by default and can be turned “off” on the personal options page, or maybe no switch and we just deal with it).

I think Marco posted that in a different thread, based on open, close, and in-between prices.

Such a graphic would help the pros as it does today, but also help the “beginners” judge the sims and R2Gs in a more realistic way.

Win-win (the performance impact can be dealt with), the right way to go…

(Sorry to be so opinionated pro-P123 all the time, but I do exactly what P123 is doing in my real job (figuring out the demand and channeling it to design a good software product) and I know how hard that is, because even if you “go above water” it is never enough, and that can be tough, so I am on the supportive side almost all the time. I profit big time and I want a “happy” P123. I also believe P123 is going in absolutely the right direction. I trust that; there are not a lot of decisions in the last 4 years that I did not like, and when there were, I looked at the overall picture. I do not care about the 1%; I look at the 99% that is almost perfect for me, and I am building a big part of my future on P123, so that might explain my position.)

Regards

Andreas

Andreas,

Don’t forget about the stats. What variations of the Sim are going to be included in an “average” of the various runs to calculate the stats with? I make most of my decisions based on the stats, not on the pretty charts.

Marco & All:

 I agree with Denny.  Please do not change the simulation tool.  It's totally fine the way it is.  I use it to make money.

 I understand your "curve-fitting" concerns, but changing the Sim tool would be unhelpful.  As an example, if I want noise in a Sim, I'll introduce it, but I don't want to be stuck with it. Thanks.

 Bill

I’m very happy with P123 and how it has developed over the past years. But I heartily agree that we don’t need more noise in backtests. Please leave sims as they are. I’m fine with a little variation caused by Compustat data. I’m not fine with variation deliberately introduced by the platform.

It’s like making the best knife in the world blunt on purpose: “You can’t cut yourself as easily anymore!”. Right… but you also can’t use it anymore to cut your delicious steak.

Hi Denny,

The stats could stay as they are and continue to refer to the “old red line”.
Gee, I only look at the pretty chart (at least at first), a bit at MDD, and at the statistics of the single years.

Regards

Andreas

Unfortunately I feel compelled to make another post on this issue and I’m sure everyone is tired of my point of view.

In my last post I was specifically concerned about R2Gs. I will get back to that in a moment. As for what Marco is describing as a development tool, I am absolutely for it. But like others, I too am concerned about what it will do to CPU resources.

Noise is a very necessary component of this feature. Without noise generation, there is no point to it. I think what may work is to make it an option of the simulation optimizer. If you want this to be a permanent port feature then make the prerequisite for generation of a port the simulation option, including noise. I think this is a very viable solution as people can do most of their work without the added burden.

As Marco has indicated, this is a change in presentation, so it should not affect the results of any portfolio. If I am wrong on this point then someone please correct me. And as per my suggestion above, I propose that the presentation change be permanent to portfolios, not simulations as was suggested. This will reduce the strain on CPU resources.

Now to R2Gs, this is where the majority of my concern is. Marco is correct in that many models have a great deal of back-test optimization. But let’s put the blame where it belongs. This is a direct function of how R2Gs were presented for the last two years. The answer lies not in providing a new form of back-simulation, but in the complete elimination of the back-test. Providing a new face to the back-test will only put a new face on an old problem. We will have new salesmen knocking at the door, but they are still selling snake oil.

Showing off a back-test cannot provide a level playing field, it never will, it will always favor one investment style over another. We have endured this problem for 2 years, I, for one, do not wish to suffer through this for another 2 years.

As for “lots of NAs”, well this is the first time I have heard of this as an issue in over 10 years at P123. Why is that? It wasn’t an issue before, so let’s not make it an issue now. As for the model only representing 1% of the ranking system, ta da …, well guess what? That is where the designer earns his money.

Now I am going to be blunt. P123 management needs to seriously reconsider what they are saying. If the P123 position is that OOS performance is no different than simulation results then you had better cancel R2G today. Subscribers are paying good money for performance not presentation of a probability graph that shows a possible moderate profit over time.

In addition, P123 needs to understand who their customers are. Many of my subscribers have many ports and subscriptions to models. Going to the effort of creating a “system” at the port level doesn’t make much sense. More effort needs to be put into the book level, and perhaps that is where the noise and probability of outcomes needs to be performed.

Steve

Steve, [quote]
Now I am going to be blunt.
[/quote] You can always point to this later not to be inconsistent in your argumentation. [quote]
P123 management needs to seriously reconsider what they are saying. If the P123 position is that OOS performance is no different than simulation results
[/quote] Lesson learned? “OOS results are no different than simulation performance”: does anyone see the difference? Better to call it simulation non-performance! You can stop; others will continue to call things by their names.

For everyone to understand: the problem is not what to call it, the problem is when you compare.

[quote]
…then you had better cancel R2G today.
[/quote] Armageddon. I see no need for any action here by P123 staff; this is a self-executing process initiated by designers. [quote]
Subscribers are paying good money for performance not presentation of a probability graph that shows a possible moderate profit over time.
[/quote] Subscribers are paying for future results, not for past performance. “Presentation of a probability graph” is just a means to pass “performance” to subscribers. Do you expect to pass it by holy spirit or telepathy? Can I be blunt too? I know, “let OOS do the job”, and here we come back to the beginning. You believe in OOS performance; I, as an investor, not so much. Instead I believe in SOME consistency of OOS vs. sim, so I want this to be presented. Not to estimate how much I will get in return, but to get a clue how much I am being f$%^^ by the designer. [quote]
Marco is correct in that many models have a great deal of back-test optimization.
[/quote] Now tell me, your prospective investor: you want to hide this from me!? [quote]
29 R2Gs have at least 365 Days Launch. Still OK having in mind Liquidity. Out of them:

  • 7 R2Gs are free, have on average 17.06% Annualized Combined vs. 18.38% Annualized Launch outperforming by 7.7% with total 583 subscribers and
  • rest 22 R2Gs with average cost $42/month, have on average 31.16% Annualized Combined vs. 16.82% Annualized Launch which is still OK in absolute, but less than free R2Gs and underperforming by 46.0% with total 270 subscribers.
    [/quote] no matter how much PIT mess is in the data and how the calculation algos have changed! This is what I call robust.
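The 7.7% and 46.0% in the quoted figures appear to be relative gaps, i.e. the Launch (out-of-sample) return divided by the Combined return, minus one. That reading is an assumption, but it reproduces the quoted numbers, as this small Python check shows (`relative_gap` is a hypothetical helper name):

```python
def relative_gap(combined, launch):
    """Launch return relative to the combined return, minus 1."""
    return launch / combined - 1.0

print(f"free R2Gs: {relative_gap(17.06, 18.38):+.1%}")  # +7.7%
print(f"paid R2Gs: {relative_gap(31.16, 16.82):+.1%}")  # -46.0%
```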

All, here are my understanding and “blunt” thoughts as an investor.

I think Marco’s proposal was about publicly presented performance, not the sims, so I see no reason to worry. There are not so many public offerings, so not too many outcome calculations, and it is not an instant task, so no problems with CPU consumption.

If someone wants to fool himself I have no objection; let sims and any other private stuff stay as is.

Optimization tool → Robustness tool. So as not to fall into another round of optimization, and to save CPU consumption, robustness tests should be ordered and performed overnight when CPU load is low. This will give designers a clue how their models will look when offered publicly and presented as probabilities of outcomes. Designers should have no control over the public presentation process; that is a direct conflict of interest. P123 should take care of investors.

P123 → S123. Not exactly what Marco means, but to my understanding it is natural. Current portfolios have nothing to do with their name in the broader meaning understandable to investors. They are mostly focused trading systems, as I see it. The real portfolio is the current Book, which obviously needs an upgrade and a public offering in line with R2Gs (probably instead of them). So Book123 → P123 and P123 → S123.

Konstantin - if you have an issue with what I am saying then send me an EMAIL. I don’t care to be attacked in this manner anymore.
Steve

Marco & All:

 I'm not an R2G designer or subscriber, and I don't have a particular view on how they should be vetted or performance described.  That being said, my comments above about changing the Sim tool stand:  It's totally fine the way it is; it shouldn't be changed.

Bill