The P123/Python Array

I think what I am suggesting is much easier than anyone realizes.

Doesn’t P123 create an array when it does a sim? Matrix or DataFrame may be other terms for it.

The rows are the stocks and the dates (probably a hierarchical index).

The columns are the ranks of the features and functions. There is also a “label.” The label is the future returns for the ticker and date (the index). The label is just another column in the array.
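
For concreteness, here is a minimal sketch in pandas of the kind of array I mean; the tickers, dates, column names, and numbers below are made up for illustration:

[code]
import numpy as np
import pandas as pd

# A made-up miniature of the array: rows indexed by (date, ticker),
# columns holding factor ranks plus the label (the next period's return).
index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-01-06", "2020-02-03"]), ["AAPL", "MSFT", "XOM"]],
    names=["date", "ticker"],
)
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {
        "value_rank": rng.uniform(0, 100, len(index)),
        "quality_rank": rng.uniform(0, 100, len(index)),
        "label_fwd_1m_ret": rng.normal(0.01, 0.05, len(index)),
    },
    index=index,
)
print(data)
[/code]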

[color=firebrick]You will still use ranks in the data. You will still use ranks in the array.[/color]

I just want to move that m x n array over to a server with Python on it. I want to move an array that I think must already exist.

I just want to manipulate that array with a Python library. It is quite amazing what can be done to that array using the already existing libraries. These libraries are as easy to use as the programming at P123 for a screener—once you get used to it.

I think the linear manipulation of that array that we do now is limiting, and it causes a whole host of problems. There are other ways to do it. But when the present linear manipulation of the ranks works, and it often will, stick with it.

I guess my question is does that array already exist? Is it that hard to move it to a server with Python on it?

ARE RANK DOWNLOADS EVEN LIMITED BY THE DATA PROVIDERS?

My ignorance of programming is showing, no doubt.

[color=firebrick]BUT I DO KNOW THIS: BAGGING THAT ARRAY IS TRIVIAL WITH THE PYTHON LIBRARIES.[/color]

[color=firebrick]I ALSO KNOW THIS: AT THE END OF THE DAY YUVAL IS RIGHT. BEING ABLE TO BAG THAT ARRAY IS A GOOD THING.[/color]

[color=firebrick]AND THIS: BAGGING IS JUST THE START OF WHAT CAN BE DONE.[/color]
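
To give a sense of how little code that takes, here is a minimal sketch using scikit-learn's BaggingRegressor on stand-in data shaped like the array above; the numbers are made up, so this is an illustration rather than a recommended model:

[code]
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Stand-in data: 1,000 rows of two rank columns (0-100) and a noisy forward return.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(1000, 2))            # e.g. value_rank, quality_rank
y = 0.0002 * X[:, 0] + rng.normal(0, 0.05, 1000)   # label: next-period return

# Bagging: fit many shallow trees on bootstrap resamples of the rows, then average them.
model = BaggingRegressor(DecisionTreeRegressor(max_depth=3),
                         n_estimators=100, bootstrap=True, random_state=0)
model.fit(X, y)
predicted_returns = model.predict(X)
[/code]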

[color=crimson]AND THIS: THE ARRAY–AND PERHAPS ALTERNATIVE DATA–IS ALL THAT SEPARATES US FROM AQR CAPITAL MANAGEMENT.[/color]

I do not want to turn off any abilities P123 has now. I think I will still use what we have now, too. Thanks.

-Jim

Jim - I’m not sure that P123 is interested in this as it is quite a deviation from their product offering. You can do what you want to do yourself using a third-party vendor such as xIgnite https://www.xignite.com/ which allows you to access Factset data via an API. Ranking is a simple algorithm that you can probably find within a Python library or can be easily written, the advantage being that you can control the properties of the ranking function (linear vs. nonlinear, etc.).
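
For what it's worth, a cross-sectional percentile rank of that sort is only a couple of lines in pandas. A minimal sketch, with made-up column and index names:

[code]
import pandas as pd

# Toy frame: a (date, ticker)-indexed raw factor column.
df = pd.DataFrame(
    {"pe_ratio": [8.0, 15.0, 30.0, 12.0, 9.0, 25.0]},
    index=pd.MultiIndex.from_product(
        [["2020-01-06", "2020-02-03"], ["AAA", "BBB", "CCC"]],
        names=["date", "ticker"]),
)

# Percentile rank within each date, scaled 0-100 (here the lowest P/E gets the highest rank).
df["pe_rank"] = (df.groupby(level="date")["pe_ratio"]
                   .rank(ascending=False, pct=True) * 100)
print(df)
[/code]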

Since you are not a programmer, you would need to hire one to get you started. I could do it but I don’t come cheap :-).

There are options other than xIgnite, such as Quandl, which provides drivers for Python, R, and other software. With Quandl, you won’t get FactSet data, but there are a ton of options for low-cost data and many different kinds of data available.

Steve

Will look at that.

Was planning on looking at going to FactSet directly or ClariFI if the market gets back to normal, or to what it was. FactSet advertises what I recommend for P123. I get that there is a $24,000 per year license fee (or about that, from what I read).

May have to become an LLC too as FactSet only takes “institutions.” Yea, I self-identify as an institution, that is the ticket. Will be nice to have a “Prop Shop” though. As the boss I will not have to ask anyone’s permission before I do a little bagging and even stacking (another ensemble technique). Yea, The Grocer, LLC. We will see.

You probably have a better way and I will check it out at some point. Thanks.

-Jim

In that case, you will certainly need to brush up on your programming skills. With a small business, you would be better off with an intermediary that provides you with a clean, easy-to-use API.

Thanks Steve,

“Not cheap?” Would pay a little extra to have you around :wink:

-Jim

It is worth re-coding a screen or a ranking system on another platform if you need to do serious ML or statistical manipulations on it. Quantopian looks like a good option for that: they have statistics modules, sklearn, a module for portfolio performance analysis (Pyfolio), and a module for factor analysis (Alphalens, similar to rank performance analysis in P123). It is possible to create any DataFrame you want with ranks calculated from FactSet and Morningstar fundamental data. I’m trying to find the best way to do it on a few simple examples (not for bagging; I just want to play with classification algos). However, P123 is unique for its ergonomic interface, fast modeling, fast backtests and simulations, and execution. Compared to that, the interface and backtest performance on Quantopian are horrible.
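
A rough sketch of the kind of classification experiment described above, using scikit-learn with made-up ranks and a made-up binary label (did the stock beat the median next month or not):

[code]
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data: 2,000 rows of five factor ranks and a binary "beat the median" label.
rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(2000, 5))
y = (0.01 * X[:, 0] + rng.normal(0, 1, 2000) > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())   # cross-validated accuracy on the toy data
[/code]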

Frederic,

Agree completely, I think.

  1. There is more than bagging.

  2. None of this is rocket science. It is, in fact, simple for someone like you who has training in the field.

We are literally 6 weeks into a freshman 101 course on this stuff on the forum. We are debating stuff that has been around for 40 years—using even older technology (spreadsheets) to kind of simulate the 40-year-old technologies.

Is it just me? Does it seem like S&P 500 does not want us using this technology? They tried to shut down even a small trickle of data that would allow us to use some modern methods. P123 is making every effort to keep the data stream without interruption. Kudos to P123 for what it has been able to do. I truly hope it is enough for a while.

-Jim

Hi Jim,

Just out of curiosity, do you think this increased level of sophistication will be the difference between negative excess returns and positive excess returns? Or do you have something that is already working well, that is producing positive excess returns, and that you are looking to improve?

Hi Philip,

I hope you understand that I am limited in my data. Perhaps you got that already from my posts, but I want to emphasize it so you can take it into consideration.

So, let me tell you what I know. I started here in 2013 and have true out-of-sample data on some stuff since then. Edit: by this I mean P123 ports using no other methods.

I have been able to cobble together some data from elsewhere—sorry my NDA does not allow me to expand on this.

My VALIDATION set from 2013 forward (using the cobbled-together data) soundly beats every P123 method I have out-of-sample data on. I do not have enough data to do a holdout test set on the cobbled-together data.

If someone were to say a validation set and an out-of-sample set (or holdout test set) are not the same thing then I would have to agree. In fact, let me say it first: “A validation set is not the same as an out-of-sample set.” I have even included the quotes if anyone wants to copy and paste it elsewhere in the forum. It is the best I can do for now.

But let me also repeat that it soundly beats every P123 system I have out-of-sample data on. And it should, theoretically.

It beats the P123 system in ways one would expect given the limitations of the P123 method. I have said it before and I will say it here: there are reasons we like the 5-stock models that have to do with the limitations of the method. It is a good method, but every method has its limitations.

The simplest and most complete answer I can give for now is that I am not sure, and that is one reason I am not buying a $24,000 license today. I do know we can do better, and even Yuval has come to this view.

Thanks for the question.

Edit: I should add that validation sets are less prone to overfitting than a backtest, say, but there is some overfitting, especially if there are a lot of hyperparameters. So, better than a backtest, but it should not be taken as proof of anything. That would be a legitimate criticism (if I were not asserting it myself).

-Jim

OK thanks. Just to make sure I’m following correctly:

  1. Since 2013, you have invested real money in the stock market following a strategy that uses data from alternative sources.
  2. That strategy has outperformed a relevant benchmark since 2013.
  3. You think you can improve that strategy and make more money by putting the P123 data together with your alternative data?

Sorry about my unclear explanation. But a clear no on this.

The real money part would be the out-of-sample part, which I clearly do not have now. I have started the real money part recently, albeit with a very small amount in a FolioInvesting account. Something that has real slippage: worse slippage than what I would get using Trade at P123, I think.

Yes, I think. That is my belief. But I clearly do not have proof of this.

If it did work out, I would be the first (but not the only) person to make my models available to the P123 community. Albeit S&P 500 models that would not conflict with my personal models.

-Jim

OK thanks, very interesting. Well, if it’s not a high-turnover strategy, you could manually pull the ranks of each stock in your universe for each rebalance date using the screener and compile it all into a Python-compatible file?

Right.

An m x n array

m = number of rebalance dates x number of stocks in the universe at each rebalance date. So, monthly for the S&P 500 from the year 2000 to now: 500 stocks x 12 months x almost 20 years of data (soon) = 120,000 rows.

n = number of factors or functions + 3. The additional 3 are for the ticker, the date, and the “label.” The label is the future return for the next month (monthly rebalance). Some people like 20 factors? I will count Marc’s system if anyone wants, but let’s assume 20 + 3.

The m x n array is 120,000 x 23: 2,760,000 individual data points!!! I will leave it up to you to decide how long that will take you.
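
The same arithmetic in a few lines of Python, with a rough memory estimate added on my own assumption of 8-byte floats:

[code]
stocks, months, years = 500, 12, 20
factors, extra_cols = 20, 3            # extra: ticker, date, label

rows = stocks * months * years         # 120,000
cells = rows * (factors + extra_cols)  # 2,760,000
approx_mb = cells * 8 / 1e6            # roughly 22 MB if every cell were an 8-byte float
print(rows, cells, round(approx_mb, 1))
[/code]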

This is certainly more than I would like to do by hand, some universes are larger, and I like to rebalance weekly.

Now bag that correctly with a spreadsheet.

Again, find your own number.

Mine is never; I will not try to do that. But plug it into Python, have a cup of coffee, and it will be waiting for you on the screen.

-Jim

You can pull the entire universe, all at once, on each rebalancing date. So it’s actually 12 x 20 = 240 pulls.

You can also pull up to 10 or so variables in one pull, so if you wanted to do 23, it would be 240 x 3 = 720 pulls.

I could do that in a day NO PROBLEM.
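
If each of those pulls were saved as a CSV (say, one file per rebalance date), stitching them into the single Python-ready array is a short script. A sketch assuming hypothetical file names like screen_2020-01-06.csv, each with a Ticker column plus the rank columns:

[code]
from pathlib import Path
import pandas as pd

# Hypothetical layout: one CSV per rebalance date, named screen_YYYY-MM-DD.csv,
# each with a Ticker column plus the rank columns pulled from the screener.
frames = []
for path in sorted(Path("downloads").glob("screen_*.csv")):
    frame = pd.read_csv(path)
    frame["Date"] = pd.to_datetime(path.stem.replace("screen_", ""))
    frames.append(frame)

# One m x n array, indexed by (Date, Ticker), ready for pandas/scikit-learn.
data = pd.concat(frames).set_index(["Date", "Ticker"]).sort_index()
data.to_csv("combined_array.csv")
[/code]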

Not bad.

Let me ask you this (a true question, just trying to learn). The label is required. That is just the return for the next month in this example. I do not do a lot of screens, so I will assume the ticker and date columns are downloaded. You could just add the date easily enough (each download will have the same date).

How do you get the label from the screener? Does P123 want to help if you do not have an easy answer for that?

Good thing the S&P 500 is not reading your workaround. Hmmm…. Maybe they already figured it out and do not want us to have even this ability.

Maybe I am being paranoid and P123 and FactSet will facilitate/allow this.

But I like the way you think (whatever FactSet might allow). Do you have some coding experience? A rhetorical question: you must. Anyway, I am impressed.

-Jim

It’s not a workaround; the screener lets you pull 500 companies at once.

“Label” is easy: just set your end date to 1 month, and the Pct column reflects the forward one-month returns… maybe spend less time on the forum and more time using the P123 features?
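
If the Pct column is saved with each pull, it can simply be renamed to serve as the label. If not, a forward one-month return can be built from month-end prices; a sketch with made-up prices and names:

[code]
import pandas as pd

# Made-up month-end closes indexed by (Date, Ticker).
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-01-31", "2020-02-28", "2020-03-31"]), ["AAA", "BBB"]],
    names=["Date", "Ticker"])
prices = pd.Series([10.0, 20.0, 11.0, 19.0, 12.1, 20.9], index=idx).sort_index()

# Forward one-month return per ticker, aligned to the decision date: this is the label.
label = (prices.groupby(level="Ticker").pct_change()   # return over the month just ended
               .groupby(level="Ticker").shift(-1))     # shift back one period: next month's return
print(label)
[/code]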

So, I wonder if P123 could help expand on this a little. My understanding is that ranks may not count against the data download limits, but I do not know.

Each column requires a separate screen.

Not really that easy for 20 factors (20 columns in the array) and a weekly rebalance. I get your point that it may be doable.

-Jim

showvar(@WHATEVER, frank("node_whatever", #ALL, #DESC))

Give me a challenge, please.

Challenge?

Please understand that I have gotten data where I can, done what I can, and shared much of what I do with you.

Why do you think you have done better than I have? Assuming you have done anything with this at all.

Arrogance?

-Jim

Hahahaha.

You’re being sarcastic right?