
Posts: 22    Pages: 3    1 2 3 Next
This topic has been viewed 709 times and has 21 replies
marco
ML integration Update

An update on our ML features and how we plan to integrate ML on P123. There are several threads on the topic in case you want to read more. If I missed some important posts let me know.

Python code for calling 123 API - https://www.portfolio123.com/mvnforum/viewthread_thread,12580
Nice ML integration on FactSet - https://www.portfolio123.com/mvnforum/viewthread_thread,12693
Machine Learning for Factor Investing - https://www.portfolio123.com/mvnforum/viewthread_thread,12733

We are currently deciding on an approach that is user-friendly enough for most people, keeps flexibility for power users, won't require a huge effort from us, and won't cost the end user too much.

Cost

DataRobot was helpful for showing our data scientist an example with real data and for getting a sense of the costs. He uploaded the S&P 500 stocks with about 60 features for 10 years and burned through $500 of credit pretty fast. So as a back-of-the-envelope estimate, to train a model for the Russell 3000 you would spend maybe 18x that (6x for 3,000 stocks and 3x for different feature combos). That's ~$10K, which seems expensive since, as far as I know, we only used CPU-intensive algorithms. No GPUs or TPUs. We're going to try different clouds to compare. DataRobot is likely the most expensive. And, if we only require CPUs for these algos, then doing it on our own servers will be the best way to make it affordable. We would just use a cloud for peak usage.
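The arithmetic above can be sketched in a couple of lines (the $500 baseline and the 6x/3x multipliers are the figures from this post; everything else is just arithmetic):

```python
# Back-of-envelope version of the cost estimate above.
sp500_run_cost = 500          # dollars burned on the S&P 500 test
universe_factor = 6           # ~3,000 stocks vs. ~500
feature_combo_factor = 3      # trying different feature combinations

russell3k_estimate = sp500_run_cost * universe_factor * feature_combo_factor
print(russell3k_estimate)     # 9000, i.e. roughly $10K
```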

How it will be cohesive and easy to use ...

There are too many ways to screw up if you are downloading from P123, uploading to train, then downloading predictions to use. So we need this to be seamless. The bare-minimum simple integration involves these components:

1. A front end to create a feature set and target with some simple tools to examine data and transform it
2. A front end to kick off the learning: universe selection, periods, models, # of "cores" used
3. A front end to examine results
4. A way to use the trained model in P123 systems which will be just another function

Notice that nowhere in the use case above is the API mentioned, because it's all behind the scenes. The learning part will (initially) run in a cloud service like AWS under a P123 account. We would simply charge the user for the cloud cost plus some profit factor. The great thing about the integration is that you will be able to use the actual values if you want, not just the ranked ones, since there's no data license issue (the data is never downloaded).

For advanced users who want to use the API, we will add a way to import the results back into P123 so they can run backtests with predictions or run a live model.

That's the current direction. Let us know your thoughts.

Portfolio123 Staff.

Feb 20, 2021 9:53:35 AM       
Edit 2 times, last edit by marco at Feb 20, 2021 11:29:05 AM
Jrinne
Re: ML integration Update

Wow!

I found this interesting and pretty cool:

And, if we only require CPUs for these algos, then doing it in our own servers will be the best way to make it affordable. We would just use a cloud for peak usages…..
4. A way to use the trained model in P123 systems which will be just another function...


Jim

From time to time you will encounter Luddites, who are beyond redemption.
--de Prado, Marcos López on the topic of machine learning for financial applications

Feb 20, 2021 11:19:48 AM       
InspectorSector
Re: ML integration Update

He uploaded the sp500 stocks with about 60 features for 10y and he burned $500 credit pretty fast. So back of the envelope estimate to train a model for russell 3k you would spend maybe 18x that (6x for 3K stocks and 3x for different feature combos). So that's ~$10K which seems expensive since we only used CPU intensive algorithms as far as I know.


Thank you for providing this cost estimate. Personally, I think it is on the light side, based on my own experience developing systems. Generally, most factors will strike out, and I have to go through hundreds of iterations before I'm happy.

When you implement this, please make sure you have a solution that generates indicators, not just buy/sell trade signals. For example, I have a couple of use cases, one is for enhanced analysts' estimates. (I have mentioned this previously.) Another that I am extremely excited about right now is to predict whether an income stock will cut or increase dividends in the coming year.

Down the road, you might also want to consider introducing properties for indicators, so that users can distinguish between in-sample and out-of-sample for a given indicator.

Feb 20, 2021 11:24:22 AM       
Jrinne
Re: ML integration Update

He uploaded the sp500 stocks with about 60 features for 10y and he burned $500 credit pretty fast. So back of the envelope estimate to train a model for russell 3k you would spend maybe 18x that (6x for 3K stocks and 3x for different feature combos). So that's ~$10K which seems expensive since we only used CPU intensive algorithms as far as I know.


Thank you for providing this cost estimate. Personally, I think it is on the light side based on my own personal experiences of developing systems. Generally, most factors will strike out and I have to go through hundreds of iterations before I'm happy.

When you implement this, please make sure you have a solution that generates indicators, not just buy/sell trade signals. For example, I have a couple of use cases, one is for enhanced analysts' estimates. (I have mentioned this previously.) Another that I am extremely excited about right now is to predict whether an income stock will cut or increase dividends in the coming year.

Down the road, you might also want to consider introducing properties for indicators, so that users can distinguish between in-sample and out-of-sample for a given indicator.

So just adding to what Steve is saying, I think: people like Steve will hopefully have a lot of access under the hood to Python and be able to munge the data, creating their own validation samples and out-of-sample test sets, for example.

Steve really understands validation and overfitting. The more access he has, the fewer problems he will have with overfitting.

But P123 will want to create some default methods to perform validation and generally prevent overfitting by newbies. I don't actually think that would be too hard to implement. There are plenty of templates for this over at scikit-learn. Even without the templates, it is really just a little slicing followed by some for-loops, df.append(), and then df.mean() with Pandas. P123 could probably improve on what scikit-learn does (as a default for newbies).
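As a rough illustration of the slicing-plus-loop validation described here, a minimal walk-forward sketch with Pandas and numpy (all data below is randomly generated for illustration; nothing is a real P123 feed):

```python
import numpy as np
import pandas as pd

# Hypothetical factor data: 60 periods of one feature and a forward return.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2015-01-01", periods=60, freq="7D"),
    "feature": rng.normal(size=60),
    "fwd_return": rng.normal(size=60),
})

# Walk-forward validation: fit on an expanding window, test on the next slice.
fold_mse = []
for split in range(36, 60, 6):              # grow the training window in steps of 6
    train, test = df.iloc[:split], df.iloc[split:split + 6]
    beta = np.polyfit(train["feature"], train["fwd_return"], 1)  # toy linear model
    pred = np.polyval(beta, test["feature"])
    fold_mse.append(np.mean((pred - test["fwd_return"]) ** 2))

avg_mse = pd.Series(fold_mse).mean()        # the df.mean() step at the end
```

Because each test slice comes strictly after its training window, no future data leaks into the fit, which is the property a default P123 validation scheme would want to guarantee.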

And actually, if you could work with Steve and others to get a conservative default that they could use most of the time, that would be incredible! And possible, I think.

As you know, Pandas was created by Wes McKinney with funding from AQR Capital Management for this very purpose. I definitely think P123 has the skills and the tools to do this.

Jim

From time to time you will encounter Luddites, who are beyond redemption.
--de Prado, Marcos López on the topic of machine learning for financial applications

Feb 20, 2021 11:36:49 AM       
Edit 11 times, last edit by Jrinne at Feb 20, 2021 4:19:18 PM
enisbe
Re: ML integration Update

Hi Marco – I've been following this topic with great interest over the past several months, and it seems you guys are getting close to something that can be really cool.
From my perspective, I am interested in this comment:



For advanced users that want to use the API we will add a way to import the results back into P123 so they can run backtests with predictions or run a live model.



Here is my use case. Let's say I have a model pipeline where I download data via the API and build a DNN model in TensorFlow. For this model to be usable, I would need to be able to upload it into P123 and then have P123 functionality to create a ranking system that uses my TF model to score my universe. Furthermore, I should be able to use it in Simulations/Screens/Live portfolios, etc. Then my models would be (could be) fully ML driven. This is the same as what you have today, except that the model weights would not be visible to the user because they are nonsensical.

For this to happen on your end, you would need TensorFlow installed and implemented (or some other ML library that you decide to support) to read models and use them for scoring. This should not be difficult to implement. I've done something similar with a RESTful API to return a scored observation to a client application. One thing I spent time debugging was ensuring that my scoring library was the same version as the model-building library.
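The version pitfall mentioned here can be guarded against up front. A minimal sketch (the function name and version strings are made up for illustration; this is not a real P123 or TensorFlow API):

```python
# Refuse to score with a model built under a different library version.
def check_model_version(saved_version: str, runtime_version: str) -> None:
    # Compare major.minor only; patch releases are assumed compatible here.
    if saved_version.split(".")[:2] != runtime_version.split(".")[:2]:
        raise RuntimeError(
            f"model built with {saved_version}, runtime is {runtime_version}")

check_model_version("2.4.1", "2.4.0")   # same major.minor: accepted
try:
    check_model_version("2.4.1", "2.3.0")
except RuntimeError as err:
    print(err)                          # mismatch is caught before scoring
```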

I hope this makes sense. Would love to see this functionality. The only thing missing in P123 right now is the ML scoring module for Ranking Systems.

Feb 20, 2021 4:50:29 PM       
marco
Re: ML integration Update

enisbe, having our own ML libraries in our system is still being evaluated and will take a bit. There's lots of "wiring" to do, and NNs require specialized hardware which we do not have right now.

But you can use your results now, anywhere in P123, by importing the scores/predictions from your model. Did you try it? The import is under Research->Imported Stock Factors. The factors you import will be visible in the reference under the MY FACTORS folder.

You can import historical values for backtesting as well as current factors for live portfolios. The big pain right now is that you have to import using CSV files with three columns: date, value, and one of id/ticker/gvkey/cusip/cik. We should have the import-factor API soon, so that importing ML results can be much more seamless with very little effort once programmed.
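For anyone building these files programmatically, a sketch of the three-column layout described above (the ids and values below are made up for illustration; 'id' is the identifier recommended in the PS):

```python
import pandas as pd

# Build a factor-import CSV with the columns date, value, id.
rows = [
    {"date": "2021-02-19", "value": 0.83, "id": 12345},
    {"date": "2021-02-19", "value": 0.41, "id": 67890},
]
pd.DataFrame(rows)[["date", "value", "id"]].to_csv("my_factor.csv", index=False)
```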

Let us know, thanks

PS. We recommend using 'id' (our own unique integer identifier for a stock issue). Using tickers is OK as long as they were freshly downloaded.

Portfolio123 Staff.

Feb 21, 2021 10:06:15 AM       
Edit 1 times, last edit by marco at Feb 21, 2021 10:06:50 AM
Jrinne
Re: ML integration Update

enisbe, having our own ML libraries in our system is still being evaluated and will take a bit. There's lots of "wiring" to do and NN require specialized hardware which we do not have right now.

Marco,

I assume you have definitive information on this, as that is your area of expertise.

But just in case you are listening to hype from people who have GPUs (or are trying to sell them to you), you might think about where you got that information and quantify how much better GPUs really are. P123's data seems like a lot to us, but I do not think it is compared to, say, Google's. I think CPUs will probably be fine for P123.

There are multiple lines of evidence that lead me to think GPUs have a lot of hype behind them, assuming you are referring to GPUs as the limiting problem.

Again, assuming you are talking about CPUs vs. GPUs, you might look at Intel Deep Learning Boost. But there is even hype in this from Intel. Nearly every modern processor has added instruction sets to improve performance with neural nets.

I took a Coursera course on neural nets where a very famous Google engineer (working on speech recognition and self-driving cars) talked about the hype around GPUs. Perhaps that is my most definitive source. I would get his exact quote if it mattered.

He does seem to be right in my experience. I have run a LOT of neural nets, deep nets with lots of layers, on my OLD MacBook Pro. And to put it quite simply, it is FAST. As fast as boosting.

There is a lot of additional specific and general information on the internet to support this as not being just my perception (or the Google engineer's). For example, MacBooks have long been preferred hardware in Kaggle competitions (and competitors always at least try a neural-net model now).

Especially with the potential use of AWS for peak demand, I think you might look at just how good CPUs really are with regard to neural nets. I think you will find they are quite adequate once you get past the hype.

According to the Google engineer on Coursera, the main advantage of GPUs (as you probably know) is faster matrix multiplication (with numpy). Of course, all of his code used numpy matrix multiplication (and few loops) for what is called backpropagation. But matrix multiplication works on any processor, and newer CPUs have instruction sets to help speed it up.
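To make "backpropagation is mostly matrix multiplication" concrete, a toy single-layer gradient step in numpy (shapes and data are made up for illustration):

```python
import numpy as np

# One backpropagation step for a linear layer, written as matmuls,
# the operation GPUs accelerate.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 60))       # batch of 256 samples, 60 features
W = rng.normal(size=(60, 1)) * 0.1   # single-layer weights
y = rng.normal(size=(256, 1))

pred = X @ W                         # forward pass: one matmul
grad = X.T @ (pred - y) / len(X)     # gradient of mean squared error: another matmul
W -= 0.01 * grad                     # gradient-descent update
```

On a dataset of P123's scale, a modern CPU executes these matmuls quickly; GPUs only pull far ahead when the matrices get much larger.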

FWIW, as you continue to look at what is practical as a business model: I can use my MacBook for my purposes, and I do not have a personal need for everyone at P123 to be using neural nets right away.

I will say that one needs a lot of control to build a good neural net (number of layers? use dropout? other types of regularization? batch normalization? standardization or normalization? etc.). Boosting, ridge regression, random forests, etc. will be a lot better for newbies.
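Two of the preprocessing choices in that list, sketched with numpy on made-up data: standardization (zero mean, unit variance per column) versus min-max normalization (each column rescaled to [0, 1]):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Standardization: subtract the column mean, divide by the column std.
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: rescale each column to the [0, 1] range.
normalized = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```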

Jim

From time to time you will encounter Luddites, who are beyond redemption.
--de Prado, Marcos López on the topic of machine learning for financial applications

Feb 21, 2021 10:32:33 AM       
Edit 19 times, last edit by Jrinne at Feb 21, 2021 11:26:11 AM
marco
Re: ML integration Update

Jim, I'm a newbie in ML; I rely on others since I have very little first-hand experience. We conducted a study last year with a data scientist using a relatively small data set, and the NN training was taking days. I think he told me it would have taken a week+ on our hardware, which was not that bad. We do have newer machines, so it might be a different story now. But the learning time has to come down by orders of magnitude, so I don't know. We'll see. I will show him your post.

Portfolio123 Staff.

Feb 21, 2021 11:28:46 AM       
Jrinne
Re: ML integration Update

Jim, I'm a newbie in ML; I rely on others since I have very little first-hand experience. We conducted a study last year with a data scientist using a relatively small data set, and the NN training was taking days. I think he told me it would have taken a week+ on our hardware, which was not that bad. We do have newer machines, so it might be a different story now. But the learning time has to come down by orders of magnitude, so I don't know. We'll see. I will show him your post.

You can borrow my MacBook if you want.

I think they did not have TensorFlow then. All these programs now do a lot to speed things up in many different ways, parallel processing being just the most obvious. That is actually XGBoost's main claim to fame, but it has regularization and other things too; it is just a great program. KD-trees, sorting algorithms, and a bunch of other things can be important for different machine learning programs.

I do not know what he was doing, but I will compare a useful boosting model you (or Steve Auger) develop against a neural-net model whenever you want. I will bet a year's membership at P123 that it will take about the same amount of time.

Ask Steve Auger. He too has run both. He likes boosting for a lot of practical reasons, but speed was not one of them.

For practical member models, I think what you describe is TOTALLY OFF. I cannot explain why, but something is off.

In fact, because I generally use a slow learning rate with boosting (e.g., eta = 0.001), I have found neural nets to be faster when running the same model, while marveling at how fast both of them are.
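To make the learning-rate point concrete, here is a toy numpy sketch (not XGBoost itself; each constant-fit "round" stands in for a tree) showing why eta = 0.001 multiplies the number of boosting rounds:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, size=100)            # made-up target values

def rounds_until_fit(eta, tol=0.01, max_rounds=100_000):
    # Each round fits a constant to the current residuals, damped by eta.
    pred = np.zeros_like(y)
    for i in range(1, max_rounds):
        residual_mean = (y - pred).mean()
        if abs(residual_mean) < tol:
            return i
        pred += eta * residual_mean
    return max_rounds

fast = rounds_until_fit(eta=0.3)
slow = rounds_until_fit(eta=0.001)           # thousands of rounds at eta = 0.001
```

The residual shrinks by a factor of (1 - eta) each round, so a 300x smaller eta needs on the order of 300x more rounds, which is why a slow learning rate can make boosting runs comparable to, or longer than, a neural-net fit.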

I have had kernel regressions run for a weekend before I shut them off (and support vector machines do not run on my Mac either). So I know the problem. But I just have not seen it with neural nets, even on my Mac.

At worst, you would just have to ask people to consolidate some of their factors into nodes for a while until you eventually got the processing power you need through AWS or your own processors.

Jim

From time to time you will encounter Luddites, who are beyond redemption.
--de Prado, Marcos López on the topic of machine learning for financial applications

Feb 21, 2021 11:37:15 AM       
Edit 10 times, last edit by Jrinne at Feb 21, 2021 12:26:59 PM
piard2
Re: ML integration Update

Marco, do you have a reason to prefer AWS over Azure? I had a two-week course a few months ago on Azure ML Studio, and it seemed ahead in modeling pipelines, fast prototyping, and managing the full app life cycle. They also have an auto-ML feature with a GUI that takes a set of algos with ranges of parameters as inputs, to compare them on your datasets when you don't have a clear idea where to start. I don't know about cost/performance compared with AWS, but it seems cost is easier to control on Azure. I have read a few bad stories from AWS users (individuals and small businesses) who received unexpectedly large bills because there was no way to set a hard spending limit in AWS (only alerts). Most of the time they found a way to negotiate the bill, but it was time-consuming and stressful.

Feb 22, 2021 6:36:00 AM       
Edit 4 times, last edit by piard2 at Feb 22, 2021 7:03:34 AM