RE: Historical results changing? Please look at this...

Dear All,

Inevitably, as a result of our recent change, several users reported differences in their simulation re-runs. It is a recurring theme, and almost always it involves re-runs under-performing the older results for sims with 5 positions or fewer.

I’ve come to the conclusion that P123 simulations are the problem: P123 is presenting simulation results with incredible detail and double decimal precision, as if they are scientific experiments. They are not.

Sims are based on many assumptions:

-you followed a system to the letter, never a sick day
-you executed the orders at exactly the same prices as the simulation
-you did not curve-fit
-financial data is well behaved (i.e., low PE = undervalued, etc.)

For those reasons they are just one possible outcome. I think we should re-do our simulation presentation. Here’s an idea… When you run a simulation, P123 will run three: one using the next open price, one with (hi+lo)/2, one with the next close. The simulation will be presented as a chart of the three together to illustrate a small set of possible results. Below is an example using the P123 GARp model. In this case the terminal value of the sim ranges from 400 to 600, a whopping 50% difference. I think a picture like this is really worth 1000+ words.

PS: I ran this with 5 positions. A sim with 20 positions will show a much tighter envelope of results.
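For anyone who wants to play with the idea before we build anything, here is a rough toy sketch in Python. The bars are random made-up data and the once-a-week full-turnover logic is a stand-in for the real engine; the only point is to show how three fill-price conventions applied to the same trades produce an envelope of equity curves.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
N_WEEKS = 520      # roughly 10 years of weekly rebalances
N_POS = 5          # a 5-position sim, as in the example above

def random_bar(anchor):
    """Make a plausible (open, high, low, close) bar around an anchor price."""
    o = anchor * (1 + rng.normal(0, 0.02))
    c = anchor * (1 + rng.normal(0.002, 0.03))
    hi = max(o, c) * (1 + abs(rng.normal(0, 0.01)))
    lo = min(o, c) * (1 - abs(rng.normal(0, 0.01)))
    return o, hi, lo, c

def fill_price(bar, mode):
    o, hi, lo, c = bar
    return {"next open": o, "(hi+lo)/2": (hi + lo) / 2, "next close": c}[mode]

curves = {mode: [100.0] for mode in ("next open", "(hi+lo)/2", "next close")}
for week in range(N_WEEKS):
    # Toy assumption: every position is bought on this week's bar and sold on
    # next week's bar of the same (made-up) stock.
    trades = []
    for _ in range(N_POS):
        entry_anchor = 100.0
        exit_anchor = entry_anchor * (1 + rng.normal(0.002, 0.04))
        trades.append((random_bar(entry_anchor), random_bar(exit_anchor)))
    for mode, curve in curves.items():
        rets = [fill_price(exit_bar, mode) / fill_price(entry_bar, mode) - 1
                for entry_bar, exit_bar in trades]
        curve.append(curve[-1] * (1 + np.mean(rets)))

for mode, curve in curves.items():
    plt.plot(curve, label=mode)
plt.legend()
plt.title("One toy strategy, three fill-price conventions")
plt.show()
```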


Marco - I think this is an excellent idea.
Steve

Hi Marco, I’ve often wondered what a sim would do if the model parameters were slightly perturbed. For example, if the ranking weights were varied by small (+/-2%) amounts and the model run many, many times, the family of equity curves might be interesting.
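Roughly, something like this sketch. `run_sim` here is a hypothetical stand-in for whatever would actually run a P123 sim for a given set of ranking weights (it just returns a toy random walk), so this is only the shape of the idea:

```python
import numpy as np

rng = np.random.default_rng(42)
BASE_WEIGHTS = {"value": 0.40, "growth": 0.35, "sentiment": 0.25}   # example nodes

def perturb(weights, pct=0.02):
    """Jitter each node weight by up to +/- pct, then renormalize to sum to 1."""
    jittered = {name: w * (1 + rng.uniform(-pct, pct)) for name, w in weights.items()}
    total = sum(jittered.values())
    return {name: w / total for name, w in jittered.items()}

def run_sim(weights, n_weeks=520):
    """Hypothetical stand-in for a P123 sim run; returns a toy equity curve."""
    drift = 0.001 + 0.002 * weights["value"]          # pretend the weights matter a bit
    return 100 * np.exp(np.cumsum(rng.normal(drift, 0.02, size=n_weeks)))

family = [run_sim(perturb(BASE_WEIGHTS)) for _ in range(100)]   # 100 perturbed runs
terminal = [curve[-1] for curve in family]
print(f"terminal values: min={min(terminal):.0f}, "
      f"median={np.median(terminal):.0f}, max={max(terminal):.0f}")
```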

In your proposal, you’re perturbing the price inputs. Without seeing more examples, I guess it may be useful.

Are you just altering the opening price and leaving the closing price at the EOD price (or whatever it currently is for sims)?

Walter

Running a set of simulations with slightly different parameters is a good idea, but we first need to have the servers / processors upgraded to be faster.

Otherwise, it will take far too long for each run (some single runs are already often a bit of a wait).

Even if the “multi-run” is an option, e.g. a checkbox, I suspect everyone will use it by default since it is a more interesting output.

Jerome

One more thought: another possible way to get a “range” of results would be to add a Monte Carlo simulation.
I.e., running a simulation now gives a result where the dot-com bubble blows up first, then a recovery until 2007, then the 2008 crisis, then a massive recovery in 2009-2010, some volatile years in 2011-2012, then a massive ramp-up in 2013, etc.

It won’t be perfect, but running a Monte Carlo simulation using the daily or weekly returns would emulate, to a degree, market events happening in a different order or not happening at all. (It is already possible to do that outside of P123 with some additional software or even using Excel, so I still much prefer to have European data first…)

It is not perfect, as it uses as a starting point the dataset of daily/weekly returns obtained in the original sim, i.e. it assumes that the last 15 years contain the system’s typical responses to the complete set of conditions that markets might create. But it is still better than nothing and would give additional confidence.
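To make it concrete, this is the kind of thing I mean by doing it outside of P123 (Python here rather than Excel; the weekly_returns array is just a placeholder for the weekly returns you would export from the original sim):

```python
import numpy as np

rng = np.random.default_rng(1)
# Placeholder: in practice these would be the ~15 years of weekly returns
# exported from the original sim, not synthetic numbers.
weekly_returns = rng.normal(0.003, 0.03, size=780)

def bootstrap_curve(returns, rng):
    """Resample the historical returns with replacement and compound them,
    i.e. replay the same kinds of weeks in a different order / frequency."""
    resampled = rng.choice(returns, size=len(returns), replace=True)
    return 100 * np.cumprod(1 + resampled)

curves = [bootstrap_curve(weekly_returns, rng) for _ in range(1000)]
terminal = np.array([c[-1] for c in curves])
print("5th / 50th / 95th percentile terminal value:",
      np.percentile(terminal, [5, 50, 95]).round(0))
```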

Jerome

Nothing bad about this. I suspect most people who have been here for a while have already done this on their own. Many have looked closely at their real trades, although everyone looks at different things. I think I know how my fill prices compared to the opening price within a small confidence interval. So, fun to have in one graph, but not a lot of additional information.

I like it but don’t spend a lot of time or computer resources.

I like the graphic with 3 results based on the different prices.

Will the other sim metrics also be multiplied by 3?
I.e., 3 average trade numbers, 3 maximum drawdown numbers? Or is the idea just to present 3 equity charts and a single set of sim metrics?

Personally, I already do this with my sims, using other variables too; it’s normal sensitivity testing.

The problem to me is not so much P123 but some users, who have a limited understanding of what they are doing and live under the illusion that a backtest of a 5-stock portfolio is reliable. I would rather have some warning if someone uses fewer than 10-15 stocks in the sim, so that the user is more aware of this.

If you were to develop something like this, which I really hope would come after all the important updates (Euro data, variable position sizing, real daily sims, etc.), I would prefer to see it in a separate section showing sensitivity analysis, where these and other variables (slippage, start date, portfolio size, etc.) are changed and the various results are visible in charts and data.

Can you obtain historical VWAP prices and add those to the mix, please?

Marco,
My comments in the thread you referenced were directed towards backwards compatibility. As I’ve spent nearly half of my life and all of my professional career developing and debugging software, backwards compatibility is near and dear to my heart. The portfolio simulation I was referencing was a 30-stock port, not a 5-stock port. I request only that, as the definitions of stock factors evolve, the old logic be retained and listed as deprecated, as has been done in the past. That would give those using these deprecated factors an opportunity to test the new definition in their ranking systems and portfolios, simulated and live.

With regard to the new equity curve, created by three embedded simulation runs… Three simulations means three completely different sets of stock selections and all the statistics that go along with them. I worry this is going to be very confusing to new users. I worry this is going to be very confusing for me. I think it may stymie users from doing their own sensitivity testing. Would you consider rolling this out as a separate tool before replacing the existing simulation interface? Also, please reconsider before removing all of the decimal points, as well as the digits that follow them, from the simulations. I see the precision available on P123 as a tremendous strength, not a weakness.

iavanti,
These forums are for P123 users, many of whom are still learning; I count myself among them.

“The problem to me is not so much P123 but some users, who have a limited understanding of what they are doing and live under the illusion that a backtest of a 5-stock portfolio is reliable.”

As Chris said, the repeatability problem is not limited to 5 stocks. One can’t speak of reliability without addressing the issue of repeatability.

Steve

I totally agree with repeatability being necessary. It makes perfect sense.

I was addressing Marco’s point of P123 being a problem.

:slight_smile:

To do a sensitivity study or Monte Carlo, why limit yourself to next open price, (hi+lo)/2, and next close?

Why not vary a nearly limitless number of other factors, some of which are brought up in this thread? (A sketch of how quickly the combinations multiply follows the list below.)
-Vary the weights on the ranking nodes
-Vary the day of the week it trades
-Vary any parameter in any rule or rank node (example: AvgDailyTot(60)>200000 vs AvgDailyTot(60)>210000 vs AvgDailyTot(50)>200000)
-Randomly remove or add stocks to the universe
-Randomly move stocks up or down the rank
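To get a feel for how quickly the run count multiplies, here is a small sketch (Python; the knob names and values are made up for illustration, not actual P123 settings):

```python
from itertools import product

# Hypothetical variations for a few knobs; the real ones would be whatever
# rules, ranking weights, and universe tweaks your sim actually uses.
variations = {
    "AvgDailyTot_lookback":  [50, 60, 70],
    "AvgDailyTot_threshold": [190_000, 200_000, 210_000],
    "trade_day":             ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "rank_weight_jitter_%":  [-2, 0, 2],
}

grid = list(product(*variations.values()))
print(f"{len(grid)} sim runs for just these four knobs")   # 3 * 3 * 5 * 3 = 135
for combo in grid[:3]:                                     # peek at a few permutations
    print(dict(zip(variations.keys(), combo)))
```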

As you know, if you truly want to do a Monte Carlo, it quickly becomes a monumental task to capture all the variations.

If you vary only some inputs, but hold other inputs fixed, then show the graph of the variation in the output, it gives a false sense of completeness.

In this case, if your model picks low volatility stocks, then the three equity curves for next open price, (hi+lo)/2, and next close will be very close together, yes? If your model picks high volatility stocks, then those three curves will have huge variation, right? You are testing for sensitivity to a very specific input, and holding all other inputs fixed.

By the way, we have all seen R2G models with extremely detailed sensitivity studies underperform severely in real time. Some of those models you can’t see anymore because they have been removed.

What exactly does an incomplete sensitivity study prove? A newbie who does not understand the statistics will think that the upper curve is the absolute upper bound and the lower curve is the absolute lower bound, which is not true at all.

Yes, as Chris suggests, let’s not confuse the consistency of a P123 portfolio’s performance (i.e. does out-of-sample profitability of live trading meet backtest profitability?) with the P123 software platform’s “consistency” (i.e. after adding a new feature or fixing a bug in one section of code, will the REST of the code perform identically as it should, not inadvertently breaking another feature: same inputs, same outputs?). Platform consistency is achieved through “regression testing”: re-running a collection of test vectors that exercise every corner of the software platform after every modification. For P123, the test vectors could be a diverse collection of sims and ports, in conjunction with a static, frozen-in-time test database, that exercise all factors and functions. Deprecated factors must be included for this to work.
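For what it’s worth, the harness I have in mind is nothing exotic; a minimal sketch (the sim names, stats, and run_sim stub are hypothetical stand-ins, not P123 internals):

```python
# Hypothetical regression harness: re-run a fixed set of test sims against a
# frozen test database and compare each statistic to a stored baseline.
TOLERANCE = 1e-9   # same inputs, same outputs: differences should be zero

BASELINES = {
    "sim_smallcap_value_5pos":   {"ann_return": 23.41, "max_drawdown": -38.72, "trades": 612},
    "sim_largecap_growth_20pos": {"ann_return": 11.07, "max_drawdown": -51.15, "trades": 1488},
}

def run_sim(name):
    """Stand-in for executing one test sim against the frozen test database."""
    return dict(BASELINES[name])   # a real harness would call the sim engine here

def regression_test(baselines):
    failures = []
    for name, expected in baselines.items():
        actual = run_sim(name)
        for stat, old_value in expected.items():
            if abs(actual[stat] - old_value) > TOLERANCE:
                failures.append((name, stat, old_value, actual[stat]))
    return failures

print("regression failures:", regression_test(BASELINES))   # expect an empty list
```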

Of course, wrt the profitability of a portfolio, there is no time-invariant “regression test,” as stock fundamentals fluctuate over time, the economic climate changes, database corrections are made, etc.

It’s usually uncool to double-post, but the following, which I just posted to the Industry Factors thread, is equally relevant here given the insistence on repeatability. So here goes:

Folks,

Repeatability, or replication as the idea is expressed in other research disciplines, is for us a good thing, but it is not an end in itself. To quote the great Kevin O’Leary, a/k/a Mr. Wonderful from ABC’s “Shark Tank,” our goal here is to help you “MAKE MONEY!” And frankly, you can’t earn a nickel from a simulation, even a great simulation with perfectly repeatable results. You make money by applying your ideas in the real-world stock market using real money stakes. That means live performance, out-of-sample.

Given that, consider what you are asking for when a demand for repeatability is so vigorous as to criticize a decision we make to replace a questionable algorithm with a better one. Do you REALLY REALLY REALLY want us to refrain from replacing bad numbers with good numbers simply in the name of repeatability? Really?!

You cannot succeed here by blindly applying the scientific method in the abstract. The scientific method works only if it is combined with and properly applied to the domain in question, with due consideration of the latter’s unique characteristics. This domain, investment strategy, depends very much on modeling based on factors that can reasonably be assumed to be persistent into the future. (That’s the genesis of the SEC’s “past performance is no guarantee” mantra.) Absent that, you have no defensible reason to expect to make money from a strategy regardless of what the sim shows.

In this context, go back again and review Marco’s explanation for the change in our approach to Ind. The old approach, the one now being described as bad, was actually developed by me in an effort (and a reasonably successful effort) to come as close as possible to reverse engineering the long-lost algorithm created by a one-time South Carolina-based contractor engaged by a company called Market Guide back in the 1990s. (Market Guide was bought by Multex, which was bought by Reuters, which merged with Thomson; now you get a sense of how lost that spec is.) But when the point-in-time charts were being developed, Marco noted some massive and erratic jumps in the Ind numbers over time due, as he explained, to the CapAvg approach. That means you might have a PEGInd figure of, say, 0.85 if you rebalance today, but five days later that same figure might be 2.14, three weeks later 1.44, then a month or so down the road 0.50, and then back up to 3.00. If you have a model that worked with that old version of PEG but not with a repaired version, that’s a sign you need to rework your model, because if you continue to use it, you’ll either get immediate bad out-of-sample performance, or a bit of luck for a while and ongoing vulnerability to luck running out.

So would you really prefer to use such a datapoint in a model designed to help you make money in the real world? Really?

Come on. You know PEInd figures should not run that way over time. I have long made heavy use of Ind factors, and I invested a ton of man-hours developing the now-discarded reverse-engineering algorithm, and I did my share of whining when Marco showed me the then-in-development point-in-time charts and vehemently stated that the Ind algorithm was $&*!@ and that we’d have to fix it. But I agree. No matter how many man-hours I or anybody else invested, we cannot consider leaving it as is once we discovered the problem, unless we change our mission and start telling subscribers that we only care about good sims and couldn’t care less whether you can make money with them in the real world.

We’re not going to change our mission. Making real-world money is still what counts. And toward that end, our priority remains to give you the best information and tools we can that will assist you to accomplish that goal. And that means we cannot and will not allow a problem we discover to persist.

We believe most subscribers are happy about that approach, especially since it’s not standard in the business world at large, and in our area in particular. I still recall, back when I was at Multex, assisting Steve Liesman, the CNBC senior economics reporter, on a major project. Back then, I was working on FactSet (Multex had no platform like P123). And after a huge number of man-hours (Steve was, and for all I know still is, a major-league a** hole) I thought I was done . . . until Steve called back screaming and cursing about all the f***** up data I gave him. On investigation, it turned out that the great FactSet ($60,000/year for the platform, and then you pay separately for a data license) had not been correctly lining up fiscal years for the companies in the Multex database (and who knows how many others); in other words, if PG’s fourth fiscal quarter ended 6/30 and CL’s ended 12/31, FactSet would simply show PG on a calendar-year basis and compare its 6/30 # with CL’s 12/31 # as if they both applied to the 12/31 quarter. How’s that for a royal screw-up! But here’s the point of the story. Obviously, besides having a bunch of our own data guys do a ton of manual crunching so Liesman could meet his deadline, we reported this as a major bug. After a couple of years of checking and nagging, I finally gave up. I have no idea when they finally got around to fixing it; in fact, I’m no longer on FactSet, so I have no idea if they ever did address it.

One more note about precision re: the past. Don’t assume that what you may know from other disciplines about precision is the case here. Financial statements are a model of a company’s past performance, and a database is a model of a collection of financial statements. We model the past just as we model the future. And any model means differences of opinion. (That’s true even among the most reputable of accountants, and they’ve got some hot topics on the table now at FASB that could badly upset what all investors do with financials; so badly that I offered an accounting professor I know who attends their confabs that I’d be willing to appear there as a “consumer of financial statements” to make a case against the proposed change.)

When it comes to recreating the past, we go above and beyond what most others do in the major must-have matters relating to survivorship and look-ahead bias. That is important. Preservation of algorithms discovered to be flawed is not in that category.

One thing, by the way, that is replicable is the underlying theory of investing and the approaches you can take to applying it. That’s why I so vigorously argue for learning this stuff, and I’m here to help. I’ve seen and made use of academic ideas based on 1960s-90s data samples that STILL WORK! And even where I’ve encountered some that don’t, I can identify clearly visible changes in the nature of the markets that explain why they don’t. That’s what should be revered.

As far as we know now, all of our data items and algorithms are good. That said, we remain open minded and aware that there is always a possibility we may learn something somewhere that shows us otherwise. So please, let’s not argue or advocate for preservation of anything discovered flawed, not now, not ever.

If circumstances develop such that we would feel commercially compelled to add such capabilities, I would hope that members whose goal is to MAKE MONEY in the market would be savvy enough to refrain from using them. I cannot imagine anything worse, for the purposes of MAKING MONEY, than wanting to take extra trouble to test ideas against algorithms subsequently determined to be inadequate.

Folks, I urge the vigorous-repeatability advocates to step back and really think about what this is all about. P123 is about developing investing (equity investing) strategies. It’s not about mathematics. It’s not about physics. It’s not about engineering. It’s not about medicine. It’s not about operations management. Etc. Attempting to import practices from one field to another without considering and adapting to the unique characteristics of the new field would be about as effective as a major league baseball player showing up on a tennis court and swinging the racket exactly the way he swings a baseball bat. (I know. Even though I’m not a major leaguer, I did swing a tennis racket like that my first time on a court and the outcome really sucked.)

Deprecated factors are deprecated for a reason, and the idea of wanting to revive them for any sort of strategy testing, yikes!

I think we all need to keep the big picture in sight.

First, many of the ideas presented in this thread do have some merit, but not as the general presentation of the results chart or as an average of the stats in the tables. The speed of the analysis is of primary concern when running Sims. During Sim development we can’t afford a delay of 3 times the current run time before we can make a change for the next iteration. The effect of a 3X delay for every run of the automatic optimization of 20 permutations would make that tool run at a crawl. A Monte Carlo study tool would be a good future feature, but with the number of iterations necessary for statistical use, it has no place in every Sim run.

Second, while developing a new approach, I run one Sim at a time, adding or changing one rule at a time that I feel should improve the Sim. If it doesn’t, I try to find out why by carefully comparing the charts, stats, and transactions. That detailed analysis of a Sim leads me to the next change that I feel should improve it (I actually never use the Sim optimization until I think I have a final system, and then only to examine the variability of the system). If every Sim were the result of 3 (or more) runs, it would be VERY difficult to determine why a change I thought should improve a system failed.

Third, too many members feel that there is something exact about a Sim’s results. It is nothing but an approximation of what could have been achieved IF EVERY ASSUMPTION IN THE SIM WERE ACTUALLY CARRIED OUT IN A LIVE PORT EXACTLY AS THE SIM DID IT. That is impossible so stop worrying about the changes P123 makes to the functions, factors, or data corrections affecting the results of a rerun of an old Sim. The new run of an old Sim is the new normal, so accept it and move on from there.

I do agree that deprecated factors should be available for comparison checks only, to give us a warm fuzzy about what changed, but if you really want to keep the old factor (and I have no idea why you would), then it would be best to convert it to an Aggregate() function. One thing that would be very helpful would be a continuously updated list of changes: the date of the change, a description of the change, and the rationale for the change. Currently that information is buried in a thousand posts.

If we really want to present a less accurate picture to the new or novice member, then all we need to do is add (+/- 10%) to every stat. Or should that be (+/- 30%)? :smiley:

Marc/Marco - you guys absolutely bewilder me. If changing industry factors should not impact a well-designed model, then why are these factors being offered at all? And if you can’t make money from simulations, then why are they being offered? If that is how P123 feels, then the best thing you can do is stop providing them. And at the same time stop providing the in-sample data for R2G models. Also, don’t use in-sample data to “frame” post-launch performance. The arguments being put forward are completely inconsistent with P123’s practice. Do as you say. Walk the walk.

Now, back to the real world.

“Platform consistency is achieved through “regression testing”: re-running a collection of test vectors that exercise every corner of the software platform after every modification. For P123, the test vectors could be a diverse collection of sims and ports, in conjunction with a static, frozen-in-time test database, that exercise all factors and functions. Deprecated factors must be included for this to work.”

What Chris is alluding to is a fundamental part of any successful software company. Last time I checked, P123 produces software. When a change is made, you must have some level of comfort that you have not affected anything else. This is what NOT LOSING MONEY is all about.

Repeatability and bug fixing go hand in hand. From what I can see, variability in results is being used as an excuse many times when a problem is observed, without any attempt to get at the root cause of the issue. Sorry, but that is what I feel.

“Deprecated factors are deprecated for a reason, and the idea of wanting to revive them for any sort of strategy testing, yikes!”

You are missing the point, and that is probably because you don’t have an engineering, mathematics, or scientific background. Deprecated factors are needed not for strategy testing, as you seem to think. They are necessary to get to the root cause of an issue. All evidence needs to be gathered and scrutinized, not swept under a carpet. Issues need to be investigated, solved, and put to rest. If not, they will come back to haunt you, sometimes in a big way. This is what NOT LOSING MONEY is all about.

Now back to the issue of repeatability. Yes, you can make money at some gross level without having repeatable results. But repeatability is the foundation of a stable platform. The issue isn’t about how much profit a sim claims to make. If you can’t provide a tool that gives repeatable results, then the tool is not particularly useful.

Steve

And if this were an engineering, mathematical or scientific platform, that would be a huge problem. But it’s not. It’s an investment-strategy platform, and if we were to not understand how to approach that, then you’d have cause to worry.

The “platform” is impressively stable, and I’m really baffled by why that is being called into question. If Compustat shows TTM EPS of $1.50 and our pricing data provider shows a close(0) price of 18, then the platform will compute a TTM PE of 12.00, and that will repeat no matter how many times you re-run the model.

This whole topic has ABSOLUTELY POSITIVELY NOTHING TO DO WITH REPEATABILITY as it relates to platform stability.

Going back to the example, if you run the TTM PE 35 times and get 12.00 but then, starting on the 36th time, you get 11.4 and keep getting 11.4, this has nothing to do with platform stability. The platform correctly does what it always does. What changed is one or both of the inputs (let’s say a different EPS number). You don’t have to go on a 007 espionage mission to uncover the change. We tell you when we’re making big changes (as with the Ind factors and the switch from Reuters to Compustat; heck, I even posted a 20-plus-page white paper explaining it), and Marco has often described the reasons why little changes may occur as data gets cleaned up.
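If it helps to see the distinction in code terms (toy numbers only, not our actual engine): the platform is a pure function of its inputs; what changes across re-runs is an input, not the function.

```python
def ttm_pe(price, ttm_eps):
    """The 'platform': a pure function of its inputs."""
    return round(price / ttm_eps, 2)

# Platform stability: same inputs, same output, on every run.
assert ttm_pe(18.00, 1.50) == 12.00

# A later run returns a different PE only because an input changed
# (say the data provider restated TTM EPS; 1.58 is a made-up number).
print(ttm_pe(18.00, 1.50))   # 12.0
print(ttm_pe(18.00, 1.58))   # 11.39
```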

So please, stop talking about platform stability. That’s not at issue. This platform is stable.

What you are really demanding is factor stability. And for the reasons I discussed above, in investing, where everything is a matter of opinion including the past (if you think debates here can get hot, put a group of accountants or database designers into a room and settle in with your popcorn for a helluva show), complete factor stability is not something you can properly insist on. Nobody anywhere in this area gets perfect factor stability. It’s why results get restated. It’s why economic data gets revised. And it’s why better data algorithms replace lesser algorithms when shortcomings of the former are discovered.

Now obviously, nobody would want to see all the numbers jumping all over the place all the time. It’s important to recognize which factors must stay stable and which ones need not. Fortunately, we have backgrounds in investing, not in engineering, mathematics, or science. So we have the skill sets to know how to make good judgements in this area.

Yes, that’s right! And bear in mind this comes from Denny who, I believe, has an engineering background.

I guess at the end of the day, it all comes down to proper recognition of the issue and finding an appropriate answer; that’s about as inter-disciplinary as it gets.