Validation is an important step in the data mining process. Before you deploy your targeted mailing mining models into a production environment, you need to know how well they perform against real data.
On the Mining Accuracy Chart tab of Data Mining Designer, you can calculate how well each of your models makes predictions, and compare the results of each model directly against the results of the other models. This method of comparison is referred to as a lift chart. Typically, the predictive accuracy of a mining model is measured by either lift or classification accuracy.
The first step in testing the accuracy of your mining models is to select the data source that you will use for testing. You will first test how well the models perform against your test data, and then use them with external data.
The next step is to select the models that you want to include in the lift chart, the predictable column against which to compare the models, and the value to predict.
By default, all the models in the mining structure are selected. You can decide not to include a model, but for this tutorial leave all the models selected.
However, the money budgeted for the project is not enough to reach all 10,000 customers in the database. Based on the budget, the marketing department can afford to mail an advertisement to only 5,000 customers, and it has two choices: select the 5,000 recipients at random, or use a mining model to target the customers who are most likely to respond.
If the company randomly selects 5,000 customers, it can expect to receive only 500 responses, based on the typical response rate. This scenario is what the random line in the lift chart represents. However, if the marketing department uses a mining model to target its mailing, it can expect a larger response rate, because it can target the customers who are most likely to respond. If the model were perfect, its predictions would never be wrong, and the company could expect to receive 1,000 responses by mailing to only the 1,000 potential customers recommended by the model. This scenario is what the ideal line in the lift chart represents. In reality, the mining model most likely falls somewhere between these two extremes, between a random guess and a perfect prediction. Any improvement over the random guess is considered to be lift.
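The arithmetic behind the scenario above can be sketched as follows. This is a minimal illustration, not part of the tutorial's tooling; the figures (10,000 customers, budget for 5,000 mailings, 500 expected random responses, which implies a typical response rate of 10 percent) all come from the text.

```python
# Illustrative arithmetic for the lift-chart scenario described above.
total_customers = 10_000
budget_mailings = 5_000
response_rate = 0.10  # implied by 500 responses from 5,000 random mailings

# Random selection: responses are proportional to the number of mailings.
random_responses = int(budget_mailings * response_rate)

# Perfect ("ideal") model: it identifies every likely buyer, so mailing to
# just the 1,000 actual responders yields a response from each of them.
total_responders = int(total_customers * response_rate)
ideal_responses = min(budget_mailings, total_responders)

print(random_responses, ideal_responses)
```

A real model's response count falls between these two values, and the gap above the random figure is its lift.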
When you click the tab, a prediction query runs against the server and database for the mining structure and the input table or test data. The results are plotted on the graph.
When you enter a Predict Value in the Input Selection, the lift chart plots a Random Guess Model as well as an Ideal Model.
In this chart, the target attribute is Bike Buyer and the target value is “yes”, meaning that the customer purchased a bike or is likely to do so.
You can tell from the chart that the ideal line peaks at around 42 percent (1), meaning that with a perfect model you could reach 100 percent of your targeted customers by sending a mailing to only 42 percent of the total customers. The actual lift for the STM-Decision-Tree model when you target around 30 percent of the population is about 73 percent (2), meaning you could reach about 73 percent of your targeted customers by sending the mailing to 30 percent of the total customer population.
The Mining Legend contains the actual values at any point on the curves.
From these results, you can see that, when measured at 29.70 percent of all cases (the Population percentage), the STM-Decision-Tree model can predict the bike-buying behavior of 63.84 percent of the target population. In other words, if you sent out a targeted mailing to only 30 percent of the prospective customers in your database, you could reach nearly two-thirds of your target audience. If you used the STM-Clustering model instead, you could reach 49.10 percent of your targeted customers.
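To make the Mining Legend figures concrete, the following sketch converts them into customer counts. It assumes the 10,000-customer database from the earlier scenario and its implied pool of 1,000 likely buyers; the percentages are the ones quoted above.

```python
# Turning the Mining Legend percentages into customer counts (assumes the
# 10,000-customer database and 1,000 likely buyers from the earlier scenario).
total_customers = 10_000
target_population = 1_000     # likely buyers implied by the 10% response rate

population_pct = 0.2970       # point measured on the lift chart
mailed = total_customers * population_pct

reached_dt = target_population * 0.6384   # STM-Decision-Tree
reached_cl = target_population * 0.4910   # STM-Clustering

print(round(mailed), round(reached_dt), round(reached_cl))
```

Under these assumptions, mailing to roughly 2,970 customers would reach about 638 likely buyers with the decision tree model, versus about 491 with the clustering model.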
The value for Predict probability represents the threshold required to include a customer among the “likely to buy” cases. For each case, the model estimates the accuracy of the prediction and stores that value, which you can use to filter out or to target customers. For example, to identify the likely buyers according to the STM-Decision-Tree model, you would use a query that retrieves cases with a Predict probability of at least 76.95 percent.
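The filtering step described above would normally be done with a prediction query against the model (for example, a DMX query that filters on the predicted probability). The logic itself is simple thresholding, sketched here in Python over invented sample data; the customer identifiers and probabilities are hypothetical.

```python
# Hypothetical sketch of filtering prediction results by the Predict
# probability threshold reported in the Mining Legend. In practice these
# (case, probability) pairs would come from a prediction query against
# the mining model; the sample values below are invented for illustration.
predictions = [
    ("customer-101", 0.8412),
    ("customer-102", 0.7695),
    ("customer-103", 0.5210),
    ("customer-104", 0.9033),
]

THRESHOLD = 0.7695  # Predict probability from the Mining Legend

# Keep only the cases that meet or exceed the threshold.
likely_buyers = [cid for cid, prob in predictions if prob >= THRESHOLD]
print(likely_buyers)
```

Only the cases at or above the threshold are treated as likely buyers; the rest are filtered out of the targeted mailing.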
The value for Score helps you compare models by calculating the effectiveness of each model across a normalized population. A higher score is better, so in this case you might decide that the STM-Decision-Tree model is the most effective strategy.