User Tools

Site Tools


bicn01:dm03

Working with Decision Trees

Analysis Services enables you to analyze large quantities of data. With it, you can design, create, and manage multidimensional structures that contain detail and aggregated data from multiple data sources. An Analysis Services database also contains mining structures and mining models, and supplemental objects such as data sources and data source views.

Business Intelligence Development Studio is the primary environment for data warehouse designer that will use to develop business solutions that include Analysis Services, Integration Services, and Reporting Services projects. Each project type supplies templates for creating the objects required for business intelligence solutions, and provides a variety of designers, tools, and wizards to work with the objects.

To create new OLAP cubes, designers use Business Intelligence Development Studio. Business Intelligence Development Studio is the environment that you use to develop OLAP cubes and data mining models.

Create an Analysis Services Project

You create Analysis Services projects in Business Intelligence Development Studio by using the Analysis Services Projects template. The Analysis Services project helps you organize and manage the items that are required to build, debug and deploy an Analysis Services database.

Susan works in the marketing department of Adventure Bikes. Susan wants to increase sales by targeting specific customers for a mailing campaign. The company's database, contains a list of past customers and a list of prospective new customers. By investigating the attributes of previous bike buyers, the company hopes to discover patterns that they can then apply to potential customers. They hope to use the discovered patterns to predict which potential customers are most likely to purchase a bike from AdventureBikes.

The prospective customers addresses came from a recent marketing campaign in Bike Magazine where AdventureBikes Germany invited the readers to enter a raffle drawing to win a 7-day adventure tripp to North America.

  • Start Data Tools from SQL Server 2012.
  • Click File | New | Project.
  • In the New Project dialog box, in the Installed Templates pane, click the Business Intelligence folder.
  • In the Templates pane, click the Analysis Services Project icon.
  • In the Name text box, type NO SID Sales Campaign.
  • In the Location text box, leave the default directory path or change it to your needs.
  • In the New Project dialog box, click OK.

Assign the Analysis Server Connection

The design surface in Data Tools is specialized for each object that you work with in Analysis Services. For example, there is a designer for working with data mining models, named Data Mining Designer, and another designer for working with cubes, named Cube Designer.

The Solution Explorer, located to the right of the design surface, provides a mechanism for navigating between design surfaces and managing the items in a project. Business Intelligence Development Studio also contains a deployment window that displays the progress of a deployment, and a properties window, enabling you to change the properties of selected objects.

If the Solution Explorer does not occur, open the window in the View menu group.

When you are working with an Analysis Services project, you must build and deploy the project to an Analysis Services instance to instantiate the objects. Processing the objects in an instance of Analysis Services copies the data from the underlying data sources into the cube objects.

  • In Solution Explorer, right-click the Sales Campaign project, and then click Properties.
  • In the Configuration Properties node in the left pane, click Deployment.

  • Make sure that the server name is set to NO MDB.
  • Click OK.

Create a Data Source

A data source is a data connection that is saved and managed in your project and deployed to your Microsoft SQL Server Analysis Services database. The data source contains the names of the server and database where your source data resides, in addition to any other required connection properties.

  • In the Solution Explorer pane, under the Sales Campaign project, right-click the Data Sources folder, and then select New Data Source from the context menu.

  • In the Data Source Wizard dialog box, on the Welcome to the Data Source Wizard page, click Next.
  • On the Select how to define the connection page, make sure the Create a data source based on an existing or new connection radio button is chosen.
  • Click New….
  • In the Connection Manager dialog box, select the SQL Server Native Client 11.0 from the Native OLE DB folder in the provider drop down list at the top of the page.
  • Use SQL Server Authentication and enter server, username and password.
Server Name: dwh.hdm-server.eu
User Name: Susan.Brown
Password: password123
  • Check the Save my Password box.
  • To connect to a database select or enter AdventureBikes CRM DataMart as the database name.
  • Test the connection with the Test Connection button.

50

  • Click on OK.
  • In the Impersonation Information page, choose Inherit and click Next.
These settings determine the user account that the Analysis Services service uses when connecting to the underlying source of data using Windows Authentication. The appropriate setting depends upon how this data source is being used. In our case, we want to inherit the login information from the data source connection defined above.

The new data source, Adventure Bikes CRM DataMart appears in the Data Sources folder in Solution Explorer.

Create a Data Source View

Projects in Analysis Services are designed based on a logical data model of related tables, views, and SQL queries from one or more data sources. This logical data model is called a data source view. A data source view contains the metadata from selected data source objects, including any relationships between these objects defined in the underlying data source or defined in the data source view.

  • In the Solution Explorer pane, under the NO SID Sales Campaign project, right-click the Data Source Views folder, and then select New Data Source View from the context menu.

  • In the Data Source View Wizard dialog box, on the Welcome to the Data Source View Wizard page, click Next.
  • On the Select Data Source page, in the Relational data sources pane, verify that Adventure Bikes CRM Data Mart is selected, and click Next.
  • On the Name Matching page, make sure that Create logical relationship by matching columns is selected and choose Same name as primary key.
  • On the Select Tables and Views page, select the following objects, and then click the right arrow to include them in the new data source view:
  • DataSet Actual Customers – a dataset with sales data from actual customers.
  • DataSet Prospective Customers – a dataset with prospective customers.
  • Niesons ZIP Data – a dataset with information about ZIP codes.
  • Click Next.
  • On the Completing the Wizard page, by default the data source view is named Adventure Bikes CRM Data Mart. Change the name to Customer DataSets, and then click Finish.

The new data source view opens in the Customer DataSets.dsv

Explore the Data

We take a look at the data sets.

  • Click anywhere in the table field and right click on Explore Data.

0009

Scrolling through the data, Susan notices that there is a lot of interesting information which can be used for analyzing customer sales.

  • Customer Gender
  • Customer Age
  • ZIP code from the Customer City
  • Customer City
  • Customer Address
  • etc.

The marketing department subscribe to the Nielson's data marketplace where they can download sociology-demographic and market research data about different aspects. One aspect is geographical data. They receive a monthly data-set that contains data about cities. The cities are identified by the ZIP code, which is used in postal addresses. From the ZIP code, the creator of the Actual Customer DataSet - perhaps the data warehouse team - has enriched the customer data with.

Have a look at the Nielsons ZIP Data. You will find a lot of useful information about the places where your customers live.

  • Bike Friendly Area
  • Landscape
  • Distance to our next bicycle shop
  • Public transport quality,
  • Population, etc

Create Calculations

Wouldn't it make sense to create a customer age group.

  • Close the Explore Data pane and right-click the DataSet Actual Customers title and select New Named Calculation.
  • In the Column name box, type Age Group.
  • In the Expression box, type
CASE
WHEN [Customer Age] <= 25 THEN 'Age under 25'
WHEN [Customer Age] BETWEEN 26 and 35 THEN 'Age between 26 and 35'
WHEN [Customer Age] BETWEEN 36 and 45 THEN 'Age between 36 and 45'
WHEN [Customer Age] BETWEEN 46 and 55 THEN 'Age between 46 and 55'
WHEN [Customer Age] BETWEEN 56 and 65 THEN 'Age between 56 and 65'
WHEN [Customer Age] > 65 THEN 'Age above 65'
END
  • Click OK.
  • Explore the data and verify that your calculation was correct.

Next, we want to identify all customers who bought a bicycle. If you look on the data, you will notice, that you can derive this information from the product category column.

  • Close the Explore Data pane and right-click the DataSet Actual Customers title and select New Named Calculation.
  • In the Column name box, type Bike Buyer Flag.
  • In the Expression box, type
CASE WHEN [Product Category] like '%Bike%'
THEN 'yes' ELSE 'no' END

Add the ZIP Data

Finally, we want to connect the Nielson ZIP Data to the DataSet Actual Customer. This is an easy job, because both, the ZIP code and the ZIP city name is available in both tables.

A named query is a SQL expression represented as a table. In a named query, you can specify a SQL expression to select rows and columns returned from one or more tables in one or more data sources. A named query is like any other table in a data source view with rows and relationships, except that the named query is based on an expression.
  • Right-click the DataSet Actual Customers title and select Replace tableWith New Named Query.

:-? Named Query - Login failed for user

  • Add the table Nielson ZIP Data and select all columns.

56

Note that there is no relationship between the two tables. Also note that you cannot see the column names! This seems to be a software failure. As a workaround, do the following steps.
  • Click on OK to return to the data source view.
  • Right click on the header of your named query.
  • Click on Edit Named Query and voila, the column names will be displayed.

Next, we manually create a relationship between the Actual Customer Data and the Nielson's ZIP data.

Please note that in Germany a single ZIP code may have more than one village assigned to. For example, the ZIP code 78239 is the ZIP code for Rielasingen and Worblingen and Arlen, which have many different attributes, such as public transport. This is the reason, why we have to use the two columns ZIP and ZIP City Name to connect the customer data and the ZIP data.
  • Drag and Drop the Customer City ZIP from the DataSet Actual Customer to the ZIP at the Nielsons ZIP Data.
  • Drag and Drop the Customer City to the ZIP City Name.
  • Click OK.

57

Your data set should look like this.

Building a Targeted Mailing Structure

By investigating the attributes of previous bike buyers, Susan hopes to discover patterns that they can then apply to potential customers.

In this task Susan will create the initial mining structure, based on the Microsoft Decision Trees algorithm.

Decision tree learning is a common method used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. The algorithms that are used for constructing decision trees usually work by choosing a variable at each step that is the next best variable to use in splitting the set of data items. “Best” is defined by how well the variable splits the data set into data subsets that have the same value of the target variable. Different algorithms use different formulae for measuring “best”.

The Microsoft Decision Trees algorithm is an algorithm that incorporates different methods for creating a tree.

The first step in creating a targeted mailing scenario is to use the Data Mining Wizard in Data Tools to create a new mining structure and decision tree mining model.

  • In Solution Explorer, right-click Mining Structures and select New Mining Structure to start the Data Mining Wizard.

  • On the Welcome to the Data Mining Wizard page, click Next.
  • On the Select the Definition Method page, verify that From existing relational database or data warehouse is selected, and then click Next.
  • On the Create mining structure with a mining model page, under Which data mining technique do you want to use?, select Microsoft Decision Trees.
  • Click Next.
  • On the Select Data Source View page, in the Available data source views pane, select Customer DataSets.

You can click Browse to view the tables in the data source view and then click Close to return to the wizard.

  • Click Next.
  • On the Specify Table Types page, select the checkbox in the Case column for DataSet Actual Customers to use it as the case table, and then click Next.

You will use the DataSet Prospective Customer table later for testing; ignore it for now.

  • On the Specify the Training Data page, you will identify at least one predictable column, one key column, and one input column for your model. Select the check box in the Predictable column in the Bike Buyer Flag row.

:!: Verify that the checkbox in the Key column is selected in the Customer ID row.

If the source table from the data source view indicates a key, the Data Mining Wizard automatically chooses that column as a key for the model.

Select the checkboxes in the Input column in the following rows. These attributes will be used by the data mining algorithmn.

  • Age Group
  • Customer Gender
  • Month of Sales
  • ZIP Bike Friendly Area
  • ZIP Distance to Sales Office
  • ZIP Landscape
  • ZIP Public Transport Quality

On the far left column - the structure columns - of the page, also select the check boxes in the following rows.

  • Customer Last Name
  • Customer First Name
  • Customer City ZIP
  • Customer City
  • Customer Address
  • Product Category
  • Customer Age
These columns will be added to your mining structure, but they will not be included in the data mining algorithm model. However, after the model is built, they will be available for the drill-through function.
  • Click Next.

The Microsoft Decision Trees algorithm supports modeling of both discrete and continuous attributes.

  • A discrete attribute has finite and countable possible values. Examples are characteristics such as Gender, ProductCategory, etc. that have measurements like “male” or “female”, or “City Bike” or “Mountain Bike”, etc.
  • A continuous attribute has real numbers or uncountable possible attribute values. Examples are: YearlyIncome, StockAmount or Total Sales Amount.

Typically, the wizard will detect numbers and assign an appropriate numeric data type, but there are many scenarios where you might want to handle a number as text instead. For example, the GeographyKey should be handled as text, because it would be inappropriate to perform mathematical operations on this identifier.

  • On the Specify Columns' Content and Data Type page, click Detect to run an algorithm that determines the default data and content types for each column.
  • Review the entries in the Content Type and Data Type columns and change them to discrete if necessary, to make sure that the settings are the same as those listed in the following image.

53

  • Click Next.

Separating data into training and testing sets is an important part of evaluating data mining models. Typically, when you partition a data set into a training set and testing set, most of the data is used for training, and a smaller portion of the data is used for testing. Analysis Services randomly samples the data to help ensure that the testing and training partitions are similar. By using similar data for training and testing, you can minimize the effects of data discrepancies and better understand the characteristics of the model.

After a model has been processed by using the training set, you test the model by making predictions against the test set. Because the data in the testing set already contains known values for the attribute that you want to predict, it is easy to determine whether the model's guesses are correct.

  • On the Create Testing Set page, for Percentage of data for testing, leave the default value of 30.
  • For Maximum number of cases in testing data set, clear the field.
  • Click Next.

Drill-through can be enabled on models and on structures. The check-box in this window enables drill-through on the named model and enables you to retrieve detailed information from the model cases that were used to train the model.

  • On the Completing the Wizard page, in Mining structure name, type Sales Targeted Mailing.
  • In Mining model name, type STM-Decision-Tree.
  • Select the Allow drill through check box :!:
  • Click Finish.

Deploying and Processing the Model

In Data Mining Designer, you can process a mining structure, a specific mining model that is associated with a mining structure, or the structure and all the models that are associated with that structure. For this task, we will process the structure and the model at the same time.

  • In the Mining Model menu, select Process Mining Structure and All Models.

If you made changes to the structure, you will be prompted to build and deploy the project before processing the models.

  • Click Yes.
  • Click Run in the Processing Mining Structure - Sales Targeted Mailing dialog box.

The Process Progress dialog box opens to display the details of model processing. Model processing might take some time, depending on your computer.

  • Click Close in the Process Progress dialog box after the models have completed processing.
  • Click Close in the Processing Mining Structure - <structure> dialog box.

Exploring the Targeted Mailing Model

After the models in your project are processed, you can explore them in Business Intelligence Development Studio to look for interesting trends. Because the results of mining models are complex and can be difficult to understand in a raw format, visually investigating the data is often the easiest way to understand the rules and relationships that the algorithms have discovered within the data. Exploring also helps you to understand the behavior of the model and discover which model performs best before you deploy it.

Each model you created is listed in the Mining Model Viewer tab in Data Mining Designer. Each algorithm that you used to build a model in Analysis Services returns a different type of result. Therefore, Analysis Services provides a separate viewer for each algorithm.

55

Analysis Services also provides a generic viewer that works for all model types. The Generic Content Tree Viewer displays detailed model content information that varies depending on the algorithm that was used.

On the Decision Tree tab, you can examine all the tree models that make up a mining model.

Because the targeted mailing model in this tutorial project contains only a single predictable attribute, Bike Buyer, there is only one tree to view. If there were more trees, you could use the Tree box to choose another tree.
  • Select the Mining Model Viewer tab in Data Mining Designer.

By default, the designer opens to the first model that was added to the structure – in this case, Decision-Tree.

  • Use the zoom buttons to adjust the size of the tree display.

By default, the Microsoft Tree Viewer shows only the first three levels of the tree. If the tree contains fewer than three levels, the viewer shows only the existing levels. You can view more levels by using the Show Level slider or the Default Expansion list.

  • Slide Show Level to the second bar.
  • Change the Background value of the Bike Buyer Flag to yes.

By changing the Background setting, you can quickly see the number of cases in each node that have the target value of “yes” for the attribute Bike Buyer Flag. Remember that in this particular scenario, each case represents a customer. The value “yes” indicates that the customer previously purchased a bicycle; the value “no” indicates that the customer has not purchased a bicycle.

The darker the shading of the node, the higher the percentage of cases in the node that have the target value.

Place your cursor over the node labeled All. A tool tip will display the following information:

  • Total number of cases.
  • Number of non bike buyer cases.
  • Number of bike buyer cases.
  • Number of cases with missing values.

Alternately, place your cursor over any node in the tree to see the condition that is required to reach that node from the node that comes before it. You can also view this same information in the Mining Legend.

  • Click on the node for Age Group = 'Age under 25' and open all nodes of it.

The histogram is displayed as a thin horizontal bar across the node and represents the distribution of customers in this age range who previously did (pink) and did not (blue) purchase a bike. The Viewer shows us that customers with an age under 25 living in an Very Bike-Friendly Region or Middle Bike-Friendly Region are likely to purchase a bicycle. If you open the Middle Bike-Friendly Region node you can note, that the sales is also determined by the landscape.

Because you enabled drill-through when he created the structure and model, he can retrieve detailed information from the model cases and mining structure, including those columns that were not included in the mining model (e.g., City name, Address, FirstName).

  • Right-click a node, and select Drill Through then Model Columns Only.

The details for each training case are displayed in spreadsheet format. These details come from the DataSet Actual Customers view that you selected as the case table when building the mining structure.

  • Right-click a node, and select Drill Through then Model and Structure Columns.

The same spreadsheet displays with the structure columns appended to the end.

bicn01/dm03.txt · Last modified: 2019/01/14 15:09 by admin