
Working with the Clustering Model

The mining structure that Susan created contains a single mining model that is based on the Microsoft Decision Trees algorithm. To identify customers for the targeted mailing, Susan will create two additional models and test all of her models against the testing set. Because the data in the testing set already contains known values for bike buying, it is easy to determine whether each model's predictions are correct.

The model that performs the best will be used by the AdventureBikes marketing department to identify the customers for their targeted mailing campaign.

Validation is an important step in the data mining process. Knowing how well your targeted mailing mining models perform against real data is important before you deploy the models into a production environment.
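
The comparison against the testing set can be sketched in a few lines of Python. This is only an illustration of the idea, not what Data Mining Designer runs internally: the eight test cases, the predictions, and the model labels (including "Third model") are made-up assumptions.

```python
# Hypothetical holdout data: the known Bike Buyer value for each test case.
actual = [1, 0, 1, 1, 0, 0, 1, 0]

# Hypothetical predictions from three models for the same eight cases.
predictions = {
    "Decision Trees": [1, 0, 1, 0, 0, 0, 1, 1],
    "Clustering": [1, 0, 0, 0, 0, 1, 1, 1],
    "Third model": [1, 1, 1, 1, 0, 0, 0, 1],
}


def accuracy(predicted, known):
    """Return the fraction of cases where the prediction matches the known value."""
    if len(predicted) != len(known):
        raise ValueError("prediction and test set sizes differ")
    hits = sum(1 for p, k in zip(predicted, known) if p == k)
    return hits / len(known)


# Score every model against the same testing set and pick the best one.
scores = {name: accuracy(pred, actual) for name, pred in predictions.items()}
best_model = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
print("Best model:", best_model)
```

Because every model is scored on the same cases with known outcomes, the scores are directly comparable.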

Create a Clustering Mining Model

Clustering models identify relationships in a data set that you might not logically derive through casual observation. For example, you can logically discern that people who commute to their jobs by bicycle do not typically live a long distance from where they work. The algorithm, however, can find other characteristics of bicycle commuters that are not as obvious.

Consider a group of people who share similar demographic information and who buy similar products from the AdventureBikes company. This group of people represents a cluster of data. Several such clusters may exist in a database. By observing the columns that make up a cluster, you can more clearly see how records in a data set are related to one another.

You can customize the way the algorithm works by specifying a clustering technique, limiting the maximum number of clusters, or changing the amount of support required to create a cluster.
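
To make these options concrete, here is a minimal sketch of one clustering technique, plain k-means, with a cluster limit and a support threshold. It is a stand-in for illustration only and not the Microsoft Clustering implementation (Analysis Services exposes comparable settings as algorithm parameters such as CLUSTER_COUNT and MINIMUM_SUPPORT). The function name, the sample customers, and the (age, commute distance in km) columns are assumptions.

```python
def kmeans(points, cluster_count, min_support=1, iterations=20):
    """Plain k-means on 2-D points; a stand-in, not the Microsoft Clustering algorithm."""
    cluster_count = min(cluster_count, len(points))
    # Deterministic start: use the first cluster_count points as centroids.
    centroids = [points[i] for i in range(cluster_count)]
    groups = [[] for _ in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its members; keep it if the group is empty.
        centroids = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    # Drop clusters that have fewer members than the support threshold.
    return [g for g in groups if len(g) >= min_support]


# Hypothetical customers as (age, commute distance in km).
customers = [(25, 2), (52, 20), (90, 60), (27, 3), (50, 22), (24, 1), (55, 18)]

all_clusters = kmeans(customers, cluster_count=3)
supported = kmeans(customers, cluster_count=3, min_support=2)
print(len(all_clusters), "clusters without a support threshold")
print(len(supported), "clusters with min_support=2")
```

Raising the support threshold discards the one-member cluster around the outlying customer, which is roughly the effect a minimum-support setting is meant to have.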

  • Switch to the Mining Models tab in Data Mining Designer.
  • Right-click the Structure column and select New Mining Model.

  • In the New Mining Model dialog box, in Model name, type STM-Clustering.
  • In Algorithm name, select Microsoft Clustering.
  • Click OK.

The new model now appears in the Mining Models tab of Data Mining Designer. This model, built with the Microsoft Clustering algorithm, groups customers with similar characteristics into clusters and predicts bike buying for each cluster. Although you can modify the column usage and properties for the new model, no changes to the STM-Clustering model are necessary.

Enable Drill-Through

  • Go to the Mining Model.
  • Right-click the mining model name and select Properties.
  • In the property list, set AllowDrillThrough to True.

Deploying and Processing the Model

In Data Mining Designer, you can process a mining structure, a specific mining model that is associated with a mining structure, or the structure and all the models that are associated with it. Here, process the structure together with all of its models.

  • In the Mining Model menu, select Process Mining Structure and All Models.
  • Click Run in the Processing Mining Structure - Targeted Mailing dialog box.

The Process Progress dialog box opens to display the details of model processing. Model processing might take some time, depending on your computer.

  • Click Close in the Process Progress dialog box after the models have completed processing.

Exploring the Clustering Model

The Microsoft Clustering algorithm groups cases into clusters that contain similar characteristics. These groupings are useful for exploring data, identifying anomalies in the data, and creating predictions.

  • At the top of the Mining Model Viewer tab, use the Mining Model list to switch to the STM-Clustering model.
  • In the Viewer list, select Microsoft Cluster Viewer.
  • In the Shading Variable box, select Bike Buyer Flag.

The default variable is Population, but you can change this to any attribute in the model, to discover which clusters contain members that have the attributes you want.

  • Select yes in the State box to explore those cases where a bike was purchased.
  • The Density legend describes the density of the attributes selected in the Shading Variable and the State. In this example it tells us that the cluster with the darkest shading has the highest percentage of bike buyers.
  • Pause your mouse over the cluster with the darkest shading.

A tooltip displays the percentage of cases in that cluster that have the attribute Bike Buyer Flag = yes.
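
The density behind the shading can be thought of as the share of cases in each cluster that have the selected state. The sketch below computes that share for made-up data; the cluster names and cases are assumptions, and the viewer's exact calculation may differ.

```python
from collections import defaultdict

# Hypothetical (cluster name, Bike Buyer Flag) pairs; not taken from the tutorial data.
cases = [
    ("Cluster 1", "yes"), ("Cluster 1", "yes"), ("Cluster 1", "no"), ("Cluster 1", "yes"),
    ("Cluster 2", "no"), ("Cluster 2", "no"), ("Cluster 2", "yes"), ("Cluster 2", "no"),
    ("Cluster 3", "yes"), ("Cluster 3", "no"),
]


def state_density(cases, state):
    """Return, per cluster, the fraction of cases whose value equals the given state."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for cluster, value in cases:
        totals[cluster] += 1
        if value == state:
            matches[cluster] += 1
    return {cluster: matches[cluster] / totals[cluster] for cluster in totals}


density = state_density(cases, "yes")
darkest = max(density, key=density.get)   # highest share of bike buyers
lightest = min(density, key=density.get)  # lowest share of bike buyers

for cluster, share in density.items():
    print(f"{cluster}: {share:.0%} bike buyers")
print("Darkest:", darkest, "| Lightest:", lightest)
```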

  • Select the cluster that has the highest density, right-click the cluster, select Rename Cluster and type Bike Buyers High for later identification.
  • Click OK.
  • Find the cluster that has the lightest shading (and the lowest density). Right-click the cluster, select Rename Cluster and type Bike Buyers Low.
  • Click OK.
  • Move the slider to the left of the network all the way up to display all links.
  • Click the Bike Buyers High cluster and drag it to an area of the pane that will give you a clear view of its connections to the other clusters.

When you select a cluster, the lines that connect this cluster to other clusters are highlighted, so that you can easily see all the relationships for this cluster. When the cluster is not selected, you can tell by the darkness of the lines how strong the relationships are amongst all the clusters in the diagram. If the shading is light or nonexistent, the clusters are not very similar.

Use the slider to the left of the network to filter out the weaker links and find the clusters with the closest relationships. The AdventureBikes marketing department might want to combine similar clusters when determining the best method for delivering the targeted mailing.
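
The effect of the slider can be sketched as a threshold on a pairwise similarity score. The similarity measure used here (1 minus the total variation distance between state distributions) is an assumption chosen for illustration, not the measure the viewer uses, and the profile numbers are made up.

```python
from itertools import combinations

# Hypothetical share of each Age Group state inside three clusters.
profiles = {
    "Bike Buyers High": {"46-55": 0.5, "56-65": 0.4, "other": 0.1},
    "Cluster 2": {"46-55": 0.4, "56-65": 0.4, "other": 0.2},
    "Bike Buyers Low": {"46-55": 0.1, "56-65": 0.1, "other": 0.8},
}


def similarity(a, b):
    """Return 1 minus the total variation distance between two state distributions."""
    states = set(a) | set(b)
    return 1 - 0.5 * sum(abs(a.get(s, 0.0) - b.get(s, 0.0)) for s in states)


def strong_links(profiles, threshold):
    """Keep only the cluster pairs whose similarity reaches the threshold."""
    links = []
    for left, right in combinations(profiles, 2):
        score = similarity(profiles[left], profiles[right])
        if score >= threshold:
            links.append((left, right, round(score, 2)))
    return links


print(strong_links(profiles, 0.0))  # slider at "all links"
print(strong_links(profiles, 0.5))  # slider moved toward "strongest links"
```

With the higher threshold only the link between the two most similar clusters remains, which is the pair a marketing team might consider merging.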

To explore the model in the Cluster Profiles tab

The Cluster Profiles tab contains a column for each cluster in the model. The first column lists the attributes that are associated with at least one cluster. The rest of the viewer contains the distribution of the states of an attribute for each cluster.

The distribution of a discrete variable is shown as a colored bar; the Histogram bars list sets the maximum number of bars displayed.

Continuous attributes are displayed with a diamond chart, which represents the mean and standard deviation in each cluster.

  • If the Mining Legend blocks the display of the Attribute profiles, move it out of the way.
  • Select the Bike Buyers High column and drag it to the right of the Population column.
  • Select the Bike Buyers Low column and drag it to the right of the Bike Buyers High column.
  • Click the Bike Buyers High column.

The Variables column is sorted in order of importance for that cluster. Scroll through the column and review the characteristics of the Bike Buyers High cluster.

For example, customers in this cluster are more likely to be in the 46 to 55 or 56 to 65 age group.

  • Double-click the Age Group cell in the Bike Buyers High column.

The Mining Legend displays a more detailed view and you can see the age range of these customers as well as the mean age.
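
What a cluster profile summarizes can be sketched as follows: a share per state for a discrete attribute, and a range, mean, and standard deviation for a continuous one. The five cluster members and both helper functions are hypothetical, and the viewer may compute its statistics differently (for example, a sample rather than a population standard deviation).

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical members of one cluster as (Age Group, Age).
members = [("46-55", 48), ("46-55", 52), ("56-65", 58), ("56-65", 62), ("36-45", 40)]


def discrete_profile(values):
    """Return the share of each state, most frequent first (the colored bar)."""
    counts = Counter(values)
    total = len(values)
    return {state: count / total for state, count in counts.most_common()}


def continuous_profile(values):
    """Return range, mean, and population standard deviation (the diamond chart)."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "stdev": pstdev(values),
    }


age_groups = discrete_profile([group for group, _ in members])
ages = continuous_profile([age for _, age in members])

print(age_groups)
print(f"age {ages['min']}-{ages['max']}, mean {ages['mean']}, stdev {ages['stdev']:.1f}")
```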

With the Cluster Characteristics tab, you can examine in more detail the characteristics that make up a cluster. Instead of comparing the characteristics of all of the clusters, you can explore one cluster at a time.

For example, if you select Bike Buyers High from the Cluster list, you can see the characteristics of the customers in this cluster. Though the display is different from the Cluster Profiles viewer, the findings are the same.

  • Select the cluster Bike Buyers High and explore some probabilities.

With the Cluster Discrimination tab, you can explore the characteristics that distinguish one cluster from another. After you select two clusters, one from the Cluster 1 list, and one from the Cluster 2 list, the viewer calculates the differences between the clusters and displays a list of the attributes that distinguish the clusters most.

  • In the Cluster 1 box, select Bike Buyers High.
  • In the Cluster 2 box, select Bike Buyers Low.

  • Click Variables to sort alphabetically.

Some of the more substantial differences between the customers in the Bike Buyers Low and Bike Buyers High clusters include Age Group (age between 56 and 65), Month of Sales (March and December), and Distance to Sales Office (less than 5 km).
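
The idea behind such a comparison can be sketched by ranking attribute states by how much their probability differs between two clusters. The scoring used here (a plain difference of probabilities) is an assumption for illustration and not the viewer's actual formula, and the probabilities are made up.

```python
# Hypothetical P(state | cluster) values; not taken from the tutorial model.
high = {
    ("Age Group", "56-65"): 0.40,
    ("Month of Sales", "March"): 0.30,
    ("Distance to Sales Office", "< 5 km"): 0.60,
}
low = {
    ("Age Group", "56-65"): 0.10,
    ("Month of Sales", "March"): 0.25,
    ("Distance to Sales Office", "< 5 km"): 0.20,
}


def discriminate(cluster_1, cluster_2):
    """Rank attribute states by the absolute difference in probability.

    A positive difference favors cluster_1, a negative one favors cluster_2.
    """
    keys = set(cluster_1) | set(cluster_2)
    diffs = [(key, cluster_1.get(key, 0.0) - cluster_2.get(key, 0.0)) for key in keys]
    return sorted(diffs, key=lambda item: abs(item[1]), reverse=True)


ranking = discriminate(high, low)
for (attribute, state), diff in ranking:
    favors = "Bike Buyers High" if diff > 0 else "Bike Buyers Low"
    print(f"{attribute} = {state}: {diff:+.2f} (favors {favors})")
```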

bicn01/dm04.txt · Last modified: 2018/12/04 08:39 (external edit)