One (small) example of machine learning marketing

P123,

P123 is the premier platform for machine learning.

P123 is integrating machine learning into its platform, as I understand it. I have little specific information on this, but I anxiously await the roll-out.

In addition, Dan Parquette is continuously improving the API that some members use for machine learning, or for a whole host of algorithms that may be entirely separate from machine learning (I guess). Obviously, bootstrapping is entirely separate from machine learning. Obviously.

Yep, P123 is the premier platform for machine learning for retail investors (and some professionals).

James (ustonapc) made me aware of this site: Numerai.

I probably will not participate in this but it may be interesting for several reasons:

  1. Probably easier than Quantopian because the data is closer to that array I keep requesting: an array with columns of features and a target (excess returns). Numerai sees the wisdom in this. Heck, upload it into SPSS (or JASP), run a multiple regression and see if you win; total time required: 10 minutes (see the sketch after this list). Better yet, just do it in Excel. This is probably the ultimate goal for P123, as Dan seems to be progressively moving toward it, recently providing Google spreadsheets with easy loading into Colab.

[color=firebrick]Yep. From the image: “# Training data contains features and targets”[/color] This is definitely just a CSV file with factors (features) and targets. What a great idea! And thank you, Dan, for moving in that direction.

  2. Probably similar to how a Kaggle competition handles data. Kaggle may be a large target audience for P123 marketing.

  3. Useful for P123 to gauge interest and find its market for machine learning.

  4. Supports what Steve Auger (InspectorSector) has said and what P123 MAY be doing with machine learning. Namely, Boosting is often an optimal or near-optimal machine learning tool. Or, at least, it MAY belong in the marketing literature. Image of loading XGBoost at Numerai below.
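As for the “10 minutes” claim in item 1 above, here is a minimal sketch of that workflow in Python rather than SPSS or Excel. The file name and column names are hypothetical; the only real assumption is the one-array format of features plus a target:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file in the one-array format: a target column plus
# feature columns, one row per observation.
df = pd.read_csv("training_data.csv")

y = df["target"]                                # excess returns
X = sm.add_constant(df.filter(like="feature"))  # features + intercept

# Plain multiple regression on all of the features at once.
model = sm.OLS(y, X).fit()
print(model.summary())
```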

More importantly, Boosting is ideally suited to using ranks as factors (features), something that Marco understands, I believe. Few other methods can claim this, although arguably neural nets can be constructed that use ranking data effectively.

[color=firebrick]This is a pretty good example of how P123 should ultimately want to handle the data, I think.[/color] Down to keeping a hold-out set with no target, to be used at the very end to test the model. Apparently done by Numerai (and not the user) here. But important no matter how you look at it.

Not too many lines of code either.
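In fact, the whole train-then-predict-on-a-targetless-hold-out workflow fits in a handful of lines. A minimal sketch with hypothetical file and column names (including the "id" column); this is a generic sketch, not Numerai's actual script:

```python
import pandas as pd
import xgboost as xgb

# Hypothetical files: the training data has a target, the hold-out
# (tournament) data deliberately does not.
train = pd.read_csv("training_data.csv")
holdout = pd.read_csv("tournament_data.csv")

features = [c for c in train.columns if c.startswith("feature")]
model = xgb.XGBRegressor(n_estimators=100, max_depth=3)
model.fit(train[features], train["target"])

# Predict on the hold-out set at the very end; the provider, not the
# user, scores these predictions against the withheld targets.
holdout["prediction"] = model.predict(holdout[features])
holdout[["id", "prediction"]].to_csv("predictions.csv", index=False)
```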

Jim


If you want to see the out-of-sample performance of the Numerai models that are staked (which, combined together, make up the Numerai hedge fund), check out the link below. The average 3M performance is 33.52% as of last Friday's close. The numbers are updated daily.

https://numer.ai/tournament

This is not real performance but rather performance on staked NMR, which is based on CORR and MMC (the correlation between prediction and label/target, neutralized by factors provided by Numerai and by Numerai plus other models). The return is probably highly correlated with the real performance but with a lot of leverage.
The real performance of the meta model (the model that combines all the other models that are staked with NMR) is unknown. They don’t want to show it.

True, the performance of the combined model is unknown but should be similar to the average return of all the NMR-staked models (the stake-weighted average 3-month return on staked NMR). The 1-day/3-month/1-year returns for individual models are based on real NMR staked, including all payouts and burns. Nothing about leverage is mentioned on the website, but it should be noted that leverage is common among hedge funds.

Azouz,

That would be interesting to know wouldn’t it?

P123 may find little here to copy. I am a big fan of the way Numerai presents the data (an array with features and a target). Presumably, it works for the machine learners using their site. But DataMiner works too, and Dan is continuing to improve it.

In addition to machine learning, Numerai is using a type of crowdsourcing, AND giving the people who submit machine learning programs a financial incentive. There is a LOT of literature on this. Related, perhaps, but not just machine learning.

Azouz, maybe this can also be considered a type of ensemble, which you have discussed previously. I think you may have (correctly) mentioned that ensembles are frequently used in machine learning.

Quantopian was also just a competition for hedge funds to get ideas, although Quantopian was not as transparent about how they used the algorithms people submitted (or who was paying for it).

Ultimately, P123 is much better than either of them: more transparency, and users have access to the most recent data. We get to use the models we develop for our own investing at P123. Something I think P123 should be able to market successfully.

Marketing aside, it would be nice to know how this is working for Numerai, which, as you correctly point out, they are not telling us.

Best,

Jim

Thanks for the link!

After signing up, I downloaded the dataset (there seems to be only one place to download the data). A few quick comments:

  1. The example_predictions.csv and numerai_tournament_data.csv have 1.7M rows. Obviously it’s not just stocks. But what could it be? 1.7M distinct individual financial instrument predictions is a lot. And it can’t be intraday data since each round lasts 1 month.

  2. The training data seems very small compared to the tournament data: 1.7M rows for the tournament data vs. 500K rows for training. Shouldn’t it be the other way around?

  3. There are 311 features. That’s a lot of features.

Whatever it is, it’s well obfuscated and a large dataset.

We are designing our AI system with these parameters for a typical job:

4000 stocks
200 features
15 years of weekly values for training

So that’s about 3.1M rows for training (4,000 * 15 * 52 = 3,120,000) and 4,000 rows for predictions.

Something is obviously strange compared with Numerai. I will ask our data scientist to make sense of this.

Thanks

Yes, sure, but they are not planning to disclose this. Their excuse is that their investors do not want them to. Regarding the leverage, what I meant is that their displayed 3M performance will always be amplified in both directions.

Numerai’s idea is very interesting, but nobody knows if it really works or not. Their NMR token is also a problem. You will be paid in NMR depending on how your model is correlated with their target (after being neutralized to all the signals they have). NMR, like any other cryptocurrency, is very volatile, so your payout will vary widely. As an example, it has lost, I think, around 50% of its value in the past few weeks.

Jim,

Yes, it is an ensemble of all user models. The advantage is that users don’t have to show their model, just send their signals (a score between 0 and 1 for each stock). The ensemble is weighted by how much NMR you stake. The idea is that the more you stake, the more confident you are in your model, and thus the more weight your model will have in the ensemble. A sketch of the idea is below.
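Here is a minimal sketch of that stake-weighted averaging, with made-up signals and stakes (the model names and numbers are purely illustrative):

```python
import pandas as pd

# Hypothetical signals: one column per user model, one row per stock,
# values are each model's score in [0, 1].
signals = pd.DataFrame({
    "model_a": [0.9, 0.2, 0.5],
    "model_b": [0.7, 0.4, 0.1],
    "model_c": [0.6, 0.3, 0.8],
}, index=["stock_1", "stock_2", "stock_3"])

# Hypothetical NMR staked on each model; more stake means more weight.
stakes = pd.Series({"model_a": 100.0, "model_b": 25.0, "model_c": 50.0})

# Stake-weighted ensemble signal for each stock.
weights = stakes / stakes.sum()
meta_signal = signals.mul(weights, axis=1).sum(axis=1)
print(meta_signal)
```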

Most of their users’ models are based on the obfuscated data provided by Numerai. They keep saying that their data is top quality, but no one can verify this (though I assume it is in their best interest to provide the best-quality data). It is probably fundamental data (like P123) and technical data as well. Even the dates and the feature names are obfuscated. Users who are generating signals based on their own data are doing so using multiple sources like news, Reddit, technicals… But most of them, I assume (given what I have read in their forums/chat over several months), are using very basic technical signals like moving averages and stuff like that.


Marco,

P123 really is the premier site for machine learning, so I have not bothered to download data from Numerai. But maybe I will this weekend to try to help sort this out.

But I have used Boosting before, and my math is different.

I have so many rows that they do not fit into one Excel spreadsheet. As you probably know, Excel has a maximum number of rows: 1,048,576.

So my math: 1,600 stocks in a universe, repeated (with some changes in the tickers) every week. So there is a hierarchical index of stock and week in the rows.

Column headers: date, ticker, target (excess return), factor1, factor2, factor3, …, factorN.

Focusing on the number of rows: tickers in the universe (e.g., 1,600) * 52 (weeks in a year) * number of years (e.g., 20),

or 1,600 * 52 * 20 = 1,664,000

Yep, time for a second Excel spreadsheet when I do it.

IMHO, this is probably how you will want to do it: ready to load into XGBoost with no manipulation necessary, but also ready to use the hierarchical index you have just created (date, ticker) if you do want to do any data wrangling (munging). A sketch of what I mean is below.
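A minimal sketch of that one-array format, with a hypothetical file name and column layout:

```python
import pandas as pd
import xgboost as xgb

# Hypothetical file: columns are date, ticker, target (excess return),
# then one column per factor (rank).
df = pd.read_csv("p123_data.csv", parse_dates=["date"])
df = df.set_index(["date", "ticker"])  # hierarchical (MultiIndex) rows

# Ready for XGBoost with no further manipulation:
y = df["target"]
X = df.drop(columns=["target"])
model = xgb.XGBRegressor(n_estimators=200, max_depth=3)
model.fit(X, y)

# ...and the index makes the data wrangling easy, e.g. slicing out
# one year by the date level:
df_2019 = df.loc["2019"]
```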

Best,

Jim

It is just stocks. US stocks + International stocks.

Correct. Repeated each week (or rebalance period). In case people are not aware, Azouz has an incredible amount of formal education and practical professional experience.

But in the end it is ALWAYS, ALWAYS, ALWAYS just one VERY SIMPLE array for Python (Pandas) machine learning programs: an index (date & ticker), a target (future excess returns) and factors (ranks) as column heads.

BTW, seems like a lot of rows and I guess it is.

But XGBoost and TensorFlow run in a reasonable amount of time even on a MacBook Pro. That is not to say that one cannot find programs that have to be shut down after running an entire weekend with no results (e.g., kernel regression or LOESS, and support vector machines).

FWIW for P123 and members.

I think it’s great that P123 is trying to integrate AI into its platform. Regarding the required training data, I have a question: As far as I know, the DataMiner allows you to download the data you need for training (date, ticker, target, factor1, factor2, …). But if I understand it correctly, you need a separate data license with FactSet to get this data. Or is there another way to obtain this training data?

Hi everyone,

OK, I’m quite new here and I have a lot to learn, but I would like to participate in any project of this kind (AI, stats, quantitative finance…).
If you have some links/tutorials/…, that would be great.
Best, and have fun programming.

Michael,

Your question is about AI. I believe one of the reasons Marco is doing this is that ranks work well with many AI methods. For boosting (and several other methods) it makes no difference whatsoever whether the data is raw data or a rank, as long as the data remains in the same order when it is converted to a rank.

This is true for boosting with all data: only the order matters. This is because of the way programs like XGBoost choose where to split the tree branches, which depends only on the ordering of the observations. A minimal demonstration is below.
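For anyone who wants to check this for themselves, here is a small, self-contained demonstration on synthetic data (not P123 or Numerai data): the trees split the same observations whether they are fed raw values or ranks, so the in-sample predictions agree.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(500, 3))             # raw factor values
y = X_raw[:, 0] + 0.1 * rng.normal(size=500)

# Rank-transform each column; this is monotonic, so the ordering of
# values within each feature is unchanged.
X_rank = X_raw.argsort(axis=0).argsort(axis=0).astype(float)

params = dict(n_estimators=50, max_depth=3, tree_method="exact")
model_raw = xgb.XGBRegressor(**params).fit(X_raw, y)
model_rank = xgb.XGBRegressor(**params).fit(X_rank, y)

# Should print True: the splits partition the same samples either way.
print(np.allclose(model_raw.predict(X_raw),
                  model_rank.predict(X_rank), atol=1e-5))
```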

For P123 data, I would argue that ranks are generally better for P123 members because the data gets normalized each rebalance period. P123 already uses this advantage in its ports and rank performance testing. One of the benefits of P123 that not everyone appreciates.

Also, for multiple regression, for example, one could argue that a Z-score (the way P123 does it) is better than the raw data (again, because it gets normalized or standardized each rebalance period). A sketch of that per-period standardization is below.
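To make the idea concrete, here is a minimal sketch of cross-sectional Z-scoring by rebalance date (the tickers and values are made up):

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, ticker).
df = pd.DataFrame({
    "date":   ["2021-01-08"] * 3 + ["2021-01-15"] * 3,
    "ticker": ["AAA", "BBB", "CCC"] * 2,
    "value":  [1.0, 2.0, 4.0, 3.0, 5.0, 10.0],
})

# Standardize within each rebalance date so that every period is on
# the same scale, which is what makes pooling periods reasonable.
df["zscore"] = (df.groupby("date")["value"]
                  .transform(lambda v: (v - v.mean()) / v.std()))
print(df)
```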

Ranks do not require a data license; I am not sure about Z-scores, but you can get ranks through DataMiner now, and Z-scores are probably coming if you cannot get them already.

I think you can also get raw data, but you need a license for that. I would make sure that you really need the raw data before you pay for it. For AI (including boosting, random forests and neural nets) you do not need the raw data. In fact, the way P123 normalizes the data each rebalance period makes using ranks an advantage, even with ports in their present form. And certainly with AI.

Jim

Jean,

Coursera has many free courses at multiple levels. You can even get a degree there, but there are plenty of introductory courses too.

Coursera.org

To get the free courses you have to find and click on “audit.” They do not make it prominent, and it is not available for every course.

IMHO, one could specialize in just XGBoost and do well. Here is the link to the XGBoost documentation.

Here is a pretty good step by step introduction to XGBoost on the web: A Gentle Introduction to XGBoost for Applied Machine Learning

If you already know some Python or want a decent book to learn Python while learning some machine learning applications, buy this book: Introduction to Machine Learning with Python: A Guide for Data Scientists

Jim

Jim,
thanks for your feedback. You made a very good point that it is probably better to use ranks for each factor with the data already normalized rather than the raw data. That should make data pre-processing much easier. Since you already have some experience in this area, I would be interested in your opinion on where you currently see the biggest challenges in finding a good model using machine learning techniques like XGBoost, LSTM, etc.

Michael,

By LSTM you mean long short-term memory for time-series data. Wow! You are knowledgeable! To start with, I would recommend playing to P123’s strength of fundamental data and using cross-sectional methods. I have done some things with LSTM but could not get them to work for me. That does not mean that I will not try again or that you could not get it to work. But just a regular neural net with P123 fundamental ranking data will work if you use good factors, I believe.

XGBoost is supposed to be able to pick and weight factors on its own to some extent (picking the most important factors at each iteration of the boosting). In some of the texts (not always about stock data), they just throw every conceivable factor into the program, see how it weights the factors using the “feature importance,” and leave every factor in for prediction. While I think this may work for a lot of data, stock data is very noisy without much signal. No signal at all for some of the factors we like to use, I believe.

For that reason, I think a person has to do something to pick the factors they want to use. You can’t just throw every factor into the program and expect it to work. It turns out not to be a cookbook method like some of the textbooks suggest.

I do not have a strong opinion on how one should do this. I think P123’s rank performance tool is great. You could use the feature importance in XGBoost to exclude factors. Or you could use Dan Parquette’s Google sheets. You may have some statistical methods you like. You may not want to use just one method of picking the factors.

Of course, ultimately you may want to be guided by cross-validation when choosing the final factors. A rough sketch of that idea is below.
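Here is a rough sketch of time-ordered cross-validation plus feature importances, on synthetic data (the data, parameters and scoring choice are all illustrative, not a recommendation):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic stand-in for factor data: rows are ordered by date,
# columns are factor ranks, y is excess return.
rng = np.random.default_rng(1)
X = rng.uniform(size=(2000, 10))
y = 0.05 * X[:, 0] + 0.02 * rng.normal(size=2000)

model = xgb.XGBRegressor(n_estimators=100, max_depth=3)

# Walk-forward cross-validation: each fold trains on the past and
# validates on the future, never the other way around.
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(scores.mean())

# Feature importances from a full-sample fit can then guide which
# factors to drop before re-running the cross-validation.
model.fit(X, y)
print(model.feature_importances_)
```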

Hope this helps some.

Best,

Jim

Thanks a lot Jim, I really appreciate your advice. Since I’m not very familiar with tools like DataMiner, you gave me some very good hints (especially using ranking data instead of raw data).

I fully agree with you that it’s probably a very good idea to start with a well-thought-out set of “good” factors that are believed to correlate well with future stock performance. I guess that should make predictions much better and more reliable.

Can you tell me what you mean by Dan Parquette’s Google sheets?

Michael,

I have not used this, but I think it relies heavily on the rank performance tester. Dan is well organized and has likely screened a large number of factors. So, again, I have not used this, but it is definitely worth a look (and let us know what you think): https://www.portfolio123.com/mvnforum/viewthread_thread,12915

Michael,

I do want to mention the most important thing. I forgot because I have mentioned this multiple times before and I (wrongly) assume everyone may have read it. ONLY USE EXCESS RETURNS AS YOUR TARGET.

Otherwise, I believe the exercise will be a total waste of your time. There is a lot of noise in the market. Data like Fed minutes, the likelihood of tax policies making it through Congress, interest rates and jobs numbers (you get the idea) have a large effect and are difficult to exploit with any algorithm.

All this noise will drown out any signal. Some of the effect of this noise will be mitigated by using excess returns, as your benchmark will subtract some of this noise out.

Ideally, the “benchmark” will be the returns of your universe, I think. A minimal sketch of computing that kind of target is below.
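For concreteness, here is one way to build an excess-return target against an equal-weighted universe benchmark (the tickers and returns are made up):

```python
import pandas as pd

# Hypothetical long-format returns: one row per (date, ticker).
df = pd.DataFrame({
    "date":   ["2021-01-08"] * 3 + ["2021-01-15"] * 3,
    "ticker": ["AAA", "BBB", "CCC"] * 2,
    "ret":    [0.010, -0.005, 0.020, -0.015, 0.002, 0.030],
})

# Use the equal-weighted universe return each period as the benchmark
# and subtract it out to get the excess-return target.
benchmark = df.groupby("date")["ret"].transform("mean")
df["excess_ret"] = df["ret"] - benchmark
print(df)
```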

I highlight your comments about DataMiner because it is great but clunky. I cannot even use it because Dan has not been able to prioritize making it available on macOS. Dan does a lot of really cool things (note the Google sheets above for an example of this) and I understand priorities, so this is not a complaint in any way.

But while on the subject of DataMiner, I think it will not be all that easy to get the excess returns that you need. Still, I am thankful that it is there and that it can be done (even if it is not easy).

The main point of this post is that Numerai has one simple download ready to be uploaded into anything (SPSS, Excel, Python, other computer languages).

IMHO, there is one ideal format for this: date, ticker, target (excess returns) and factors can be immediately loaded into Python programs. And if, for example, you wanted to slice a year out using a Pandas DataFrame, the year is right there at the beginning of the index. Dates can be sorted, etc. Obviously, if someone wants to add another column to a standard P123 format, this could be easily used or not used in a Pandas DataFrame.

BTW, Pandas was originally developed at AQR Capital Management for this very purpose, as you probably already know. We should find the best integration possible.

But IMHO, P123’s DataMiner does not need another method of doing a rolling backtest. Maybe P123’s AI expert can get together with Dan and suggest a different format if mine is not the best. But Numerai may have already invented the wheel on this.

Anyway, just suggestions for P123: the best machine learning site on the planet for retail investors. And let us know about your experience with DataMiner. You are obviously highly informed on this, and your suggestions may be helpful for P123.

And use excess returns, or do something else with your time. That is the most important thing.

Best,

Jim

Well, yes, all the noise in the stock market will definitely be a problem for any reliable prediction. Being able to filter out at least some portion of this noise is probably going to be critical to the success of such an AI-based model.
I have also thought about using alpha (excess returns) as a target, but right now I don’t know how to get it either. If anybody here has any clue on how to do that, I’d love to hear it.