Data Mining
This
description gives more detail than was provided in
Investing Process about how data
mining is used to develop high-return
strategies.
The normal scientific process is to
postulate a hypothesis and then attempt to disprove the null hypothesis, on
the logic that if the null can be rejected, the alternative must be true. Rather
than starting from a hypothesis or set of logical screen criteria that are
then back-tested, data mining is a statistical tool that explores thousands
of possible statistical relationships that predict the dependent
variable. Instead of testing a relationship that seems plausible, data
mining finds the relationships first and then leaves it to the explorer to
try to make sense of these relationships. While it is nice to have
reasons for what works, and certainly makes it easier to sell a strategy
to a client, it really doesn’t matter to me why something works if I have
reason to think it will work in the future.
The research process is to first
formulate the data to be imported into KnowledgeSEEKER, which is the data
mining tool that I use. In the example and illustration which follows, I
have selected the annual return rate for the next month as the dependent
variable that I’m trying to predict. In addition, I have to find, import
and clean the data for the other variables which I think might predict or
explain the dependent variable. Most of my data is imported from Stock
Investor Pro, a data source from the American Association of Individual Investors (www.AAII.com)
with about 1,500 data elements each week on about 8,000 stocks. I usually
eliminate stocks priced less than one dollar and trading fewer than 5,000
shares daily. I also limit the extreme returns, as such outliers distort
the averages. (Usually the top and bottom 1%, which comes to about
±1,000%.) My data source is limited to stocks that currently exist,
meaning that in a ten-year study my results are distorted by survivorship
bias: stocks that disappeared because of acquisition or extinction are
missing from the data.
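The screening rules above can be sketched in a few lines (a hypothetical helper; the field names `price`, `volume`, and `ret` are my assumptions, not the Stock Investor Pro schema):

```python
def clean_records(records):
    """Apply the screening rules described above:
    1. Drop stocks priced under $1 or trading fewer than 5,000 shares daily.
    2. Trim the top and bottom 1% of returns to limit outlier distortion."""
    kept = [r for r in records
            if r["price"] >= 1.0 and r["volume"] >= 5000]
    kept.sort(key=lambda r: r["ret"])
    cut = len(kept) // 100          # number of rows in each 1% tail
    return kept[cut:len(kept) - cut] if cut else kept
```

On the real data, that 1% trim comes out to roughly the ±1,000% returns mentioned above.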
KnowledgeSEEKER (KS) produces results in
a decision tree, rather than as a neural network, which is another form of
data mining. The level of required statistical significance is usually
set at .01, meaning the relationship would occur by chance less than 1 time
in 100. With a large number of rows, such as in the example
below, the significance typically extends beyond six decimal places.
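The F value and Degrees of Freedom reported in the tree can be reproduced, at least in spirit, with the standard one-way ANOVA computation; a sketch using textbook formulas, not KnowledgeSEEKER's internals:

```python
def f_statistic(groups):
    """One-way ANOVA F value and degrees of freedom for a candidate
    split of the records into clusters (lists of returns)."""
    k = len(groups)                          # number of clusters
    n = sum(len(g) for g in groups)          # total records
    grand = sum(sum(g) for g in groups) / n  # grand mean return
    means = [sum(g) / len(g) for g in groups]
    # between-cluster and within-cluster sums of squares
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df1, df2 = k - 1, n - k
    return (ss_between / df1) / (ss_within / df2), df1, df2
```

A larger F relative to its degrees of freedom means the split is less likely to be chance, which is why large row counts produce such tiny p-values.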
The tree shown in the example below is
from the research I did when trying to find out the longer-term returns
for low-priced stocks. The top cell says that the average return each
month was .5%, the standard deviation was 20%, and there were 747,814
records in the analysis: the aggregated monthly returns over ten years.
The program divides each of the variables
into ten roughly equal-sized clusters, in this case about 75,000, and then
searches to see if adjacent clusters have a similar predictability in
explaining the dependent variable, in this case monthly return. If the
results are similar, it combines the cells. It then sorts the variables
in the order of significant predictive value. As is typical, date is the
most predictive, but since we can’t screen by what the market will do for
a specific time period, I have chosen price as the top cluster to show
you. The 0.0 means that the findings are significant beyond .0000001.
The F value and Degrees of Freedom are statistical measures. The price
breaks shown above the rows of cells are the cut points used to form the
ten roughly equal-sized clusters. If you look at the average returns by
price, you can see that stocks between $.64 and $18.51 had returns above
.7% each month, while stocks below $.64 or above $25.71 were
negative. Stocks between $9.76 and $13.67 have the best returns at 1%
each month. Within that cluster, the New York Stock Exchange (N) had
returns of 1.8% each month and a standard deviation two-thirds that of the
NASDAQ (M). This is remarkable because the higher returns have the effect
of raising the standard deviation. Within this cluster, returns are thus
more than three times normal, and 60% as volatile. Nice.
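The decile-then-merge step described above can be sketched as follows. This is a simplified stand-in for KnowledgeSEEKER's CHAID-style procedure: the real program merges adjacent clusters on a significance test, not the fixed tolerance `tol` assumed here.

```python
def bin_and_merge(values, returns, n_bins=10, tol=0.002):
    """Split a predictor into roughly equal-sized bins, then combine
    adjacent bins whose average returns are similar."""
    pairs = sorted(zip(values, returns))             # sort by the predictor
    size = len(pairs) // n_bins
    bins = [pairs[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(pairs[(n_bins - 1) * size:])         # last bin takes the remainder
    merged = [bins[0]]
    for b in bins[1:]:
        mean_prev = sum(r for _, r in merged[-1]) / len(merged[-1])
        mean_b = sum(r for _, r in b) / len(b)
        if abs(mean_prev - mean_b) < tol:            # similar predictability: combine
            merged[-1] = merged[-1] + b
        else:
            merged.append(b)
    # report each cluster as (low cut, high cut, mean return)
    return [(c[0][0], c[-1][0], sum(r for _, r in c) / len(c)) for c in merged]
```

Run on price as the predictor, this is the kind of step that produces breaks like the $9.76–$13.67 cluster shown in the tree.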
Within the 25,032 stocks of this
cluster, we see that the best returns are for April, at an average of
4.6%. However, there were only ten Aprils in this study, not enough to
have much statistical meaning. You can also see the poor returns for the
third quarter over the last ten years for this particular profile.
By pointing and clicking, one can
explore all of the variables that are statistically predictive under any
one cell. One can search for the positive results to include, or the
negative results to exclude (or go short, which I haven’t done). I spend
hours exploring these relationships on one computer, and log the results
into an outline in a spreadsheet on an adjacent computer. As with any
research, preparing the data is a major part of the task.
KnowledgeSEEKER Decision Tree
Example (Ten Years of Monthly Market Data)
My experience from data mining is that
the rules for successful screens change with a major shift in the market
such as occurred in 3/03, 4/04, or 8/04. Data mining works well to
identify momentum and get on board until such a change occurs. I have not
found a way to use it to predict the timing of reversals. My response to
date has been to mix the time frames, such as researching weekly returns
within the framework of ten-year monthly results.