Data Mining Architecture
Data mining is a core component of SQL Server Analysis Services (SSAS) 2012, baked into SSAS's multidimensional designer and delivery architecture. The data mining structures are stored in the same database as SSAS analytical cubes, but they share only a few of the project assets. To define a data mining model in SQL Server Data Tools (SSDT), you need to create an SSAS multidimensional project, but you don't need to define any cubes or dimensions. A mining model can get its data directly from any data source or database table defined in the project's data source view, as Figure 1 shows.
Figure 1: Examining the Data Mining Architecture
Data Mining Tools
When data mining was first introduced, the only way to create and use a model was through the Business Intelligence Development Studio (BIDS), which was a database development tool rather than an application suited for data analysts. Several data-mining viewers were also developed so that a mining model could be viewed graphically, but all these viewers were baked into the development environment and not accessible to business users. Programmers could integrate some of these viewers into custom applications, but that wasn't done very often.

When Microsoft introduced two data mining add-ins (Data Mining Client and Table Analysis Tools) for Microsoft Excel 2007, data mining was brought to the business community. Many of the model viewers used in the development environment were integrated into the Excel add-ins, along with several features that use Excel's native charts, pivot tables, filters, slicers, and conditional formatting capabilities.
Since then, Microsoft has been providing tools that let business users do their own analyses. Data mining remains a core component of SSAS 2012, but the target audience for the design and delivery tools has shifted from IT developers to business users, with Excel as the delivery vehicle. The latest data mining add-ins for Excel 2013, introduced with SQL Server 2012 SP1, let business users create and consume data mining models and perform advanced predictive analyses.
A Guided Tour
In the following short tour, I'll introduce you to the Data Mining Model Designer in SSDT and the data mining add-ins for Excel 2013. If you want to follow along, I've provided a sample database that I derived from real data obtained from the National Oceanic and Atmospheric Administration (NOAA). The database contains weather observations and climatic events—including tornados, hurricanes, tsunamis, earthquakes, and volcanoes—that have occurred over the past 40 years. It's more interesting to work with real information, but I make no guarantee about the accuracy or reliability of this data, so you shouldn't use it as the basis for making any decisions.

To follow along, you need to have:
- The Developer or Enterprise edition of SQL Server 2012 SP1, with the relational database engine, SSAS in multidimensional storage mode, and the client tools installed either locally on a single development machine or on a server to which you have administrative access
- An SSAS instance (installed locally on a single development machine or a server) on which you have permission to create databases and objects
- Access to a SQL Server relational instance that can read and process data for the mining structures
- Excel 2013 (32 bit or 64 bit) installed
- The Microsoft SQL Server 2012 SP1 Data Mining Add-ins for Microsoft Office, downloaded and installed from the Microsoft Download Center
- The sample Weather and Events database, downloaded (by clicking the Download the Code button near the top of the page) and restored
Using SSDT's Data Mining Model Designer
In the following example, I'll walk you through creating a data mining project in SSDT. The data mining structure and data mining model that you'll create and explore will deal with tornado data from the U.S. states that are in "Tornado Alley," a region known for a high number of seasonal tornados. Those states are:
- Kansas (KS)
- Missouri (MO)
- Nebraska (NE)
- Oklahoma (OK)
- South Dakota (SD)
- Texas (TX)
Step 1: Create a New Data Mining Project
The first step is to create a new data mining project. To do so, open SSDT, select New on the File menu, and choose Analysis Services Multidimensional and Data Mining Project. Name both the project and the solution Weather and Events.
Step 2: Prepare the Data
The next step is to prepare the source data by simplifying, grouping, aggregating, and cleansing it. Don't underestimate the importance of this step. Data preparation is usually an iterative process. Start with small and simple sets of data. Create views or transform source data into separate tables, and don't be afraid to create multiple sets of data in different structures. Some mining models work best with values in separate columns, whereas other mining models work better with different attribute values in the same column. For ongoing analyses and complex data sources, your solution might need to include an extraction, transformation, and loading (ETL) process using SQL Server Integration Services (SSIS) packages.

The data preparation for this sample project has been completed for you. The views I've created in the Weather and Events database include data transformation logic, so this data is in the correct format for the analyses you'll perform.
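To give a sense of what this preparation logic looks like, here's a rough T-SQL sketch of how a view such as vw_TornadosByYearByState might pivot raw event rows into one column per state. The source table and column names (dbo.TornadoEvents, EventDate, State) are assumptions for illustration only; the actual view definitions ship with the sample database.

```sql
-- Hypothetical sketch of the shape of vw_TornadosByYearByState:
-- one row per year, one tornado-count column per Tornado Alley state.
-- dbo.TornadoEvents and its columns are assumed names, not the
-- sample database's actual schema.
CREATE VIEW dbo.vw_TornadosByYearByState
AS
SELECT  YEAR(EventDate)                               AS [Year],
        SUM(CASE WHEN State = 'KS' THEN 1 ELSE 0 END) AS KS,
        SUM(CASE WHEN State = 'MO' THEN 1 ELSE 0 END) AS MO,
        SUM(CASE WHEN State = 'NE' THEN 1 ELSE 0 END) AS NE,
        SUM(CASE WHEN State = 'OK' THEN 1 ELSE 0 END) AS OK,
        SUM(CASE WHEN State = 'SD' THEN 1 ELSE 0 END) AS SD,
        SUM(CASE WHEN State = 'TX' THEN 1 ELSE 0 END) AS TX
FROM    dbo.TornadoEvents   -- assumed source table name
GROUP BY YEAR(EventDate);
```

Putting this logic in a view rather than an ETL package keeps the transformation close to the data and easy to iterate on, which suits the experimental nature of data preparation.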
Step 3: Add the Data Source to the Project
At this point, you need to add the Weather and Events database as a data source in your project. In SSDT's Solution Explorer, right-click the Data Sources folder and select New Data Source to start the Data Source Wizard.

In the Data Source Wizard, click Next, then New to add a new data source. In the Connection Manager dialog box, connect to the relational database server and select the Weather and Events database, as Figure 2 shows. Click OK in the Connection Manager dialog box, then click the Next button.
Figure 2: Adding the Weather and Events Database as a Data Source
Figure 3: Selecting the Type of Authentication to Use
After you select your authentication method on the Impersonation Information page, click Next. On the final page, accept the default data source name and click Finish to add the data source and close the Data Source Wizard.
Step 4: Add the Views
As I mentioned previously, the Weather and Events database already includes the views for this sample project. To add the views to your project, right-click the Data Source Views node in Solution Explorer and choose New Data Source View.

When the Data Source View Wizard appears, click the Next button three times so that you're on the Select Tables and Views page. In the Available objects list on this page, select the six objects highlighted in Figure 4, then click the top-most button between the two list boxes to move the selected views to the Included objects list. Click Next, then click Finish on the following page to add the views and close the wizard.
Figure 4: Adding the Views
Figure 5: Setting the Logical Primary Key for One of the Data Source Views
Step 5: Create a Data Mining Structure
You're now ready to create a data mining structure that will have one new mining model. Right-click the Mining Structures node in Solution Explorer and select New Mining Structure. When the Data Mining Structure Wizard appears, click Next twice so that you're on the Create the Data Mining Structure page. As Figure 6 shows, there are nine mining model algorithms included in the Microsoft data mining framework. Each algorithm applies a unique set of mathematical formulas, logic, and rules to analyze data in the mining structure. Think of each as a separate black box, capable of analyzing a set of data and making predictions in different ways. This sample project uses the Microsoft Time Series algorithm, so select that algorithm from the drop-down list, then click Next twice to go to the Specify Table Types page. In the Input tables list on this page, select the Case check box for the vw_TornadosByYearByState view and click Next.
Figure 6: Selecting the Mining Model Algorithm
Figure 7: Specifying the Training Data
On the Specify Columns' Content and Data Type page, change the data type for the KS, MO, NE, OK, SD, and TX columns from Long to Double, because the Microsoft Time Series algorithm works best with a floating point data type and might return errors with long integer values. Leave the Year column set to Long. Click Next.
In the Completing the Wizard page, you need to give the mining structure and mining model appropriate names. The mining structure will become the container for multiple models, and each model uses a specific model algorithm that should be incorporated into the name. The name of the structure should also reflect the name of the table or view on which it's based.
For this example, modify the default names so that the mining structure is named Tornados By Year By State and the mining model is named Time Series - Tornados By Year By State. Click Finish to create the data mining structure.
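Behind the scenes, the wizard creates these objects with DMX (Data Mining Extensions), the query language for SSAS data mining. The following is an approximate DMX sketch of the structure and model you just defined — a simplified illustration rather than the wizard's exact output:

```sql
-- DMX sketch (run against the SSAS instance, e.g. from an SSMS DMX
-- query window). Approximates what the wizard creates; column
-- bindings to the data source view are omitted for brevity.
CREATE MINING STRUCTURE [Tornados By Year By State] (
    [Year] LONG KEY TIME,     -- the time key for the series
    [KS] DOUBLE CONTINUOUS,
    [MO] DOUBLE CONTINUOUS,
    [NE] DOUBLE CONTINUOUS,
    [OK] DOUBLE CONTINUOUS,
    [SD] DOUBLE CONTINUOUS,
    [TX] DOUBLE CONTINUOUS
);

ALTER MINING STRUCTURE [Tornados By Year By State]
ADD MINING MODEL [Time Series - Tornados By Year By State] (
    [Year],
    [KS] PREDICT,   -- each state's series is both input and output
    [MO] PREDICT,
    [NE] PREDICT,
    [OK] PREDICT,
    [SD] PREDICT,
    [TX] PREDICT
) USING Microsoft_Time_Series;
```

You don't need to run any DMX for this walkthrough; the sketch just shows that the mining structure (the column definitions) and the mining model (the algorithm applied to those columns) are distinct objects, which is why one structure can later host multiple models.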
Step 6: Process and Explore the Mining Structure
With the mining structure created, it's time to process and explore it. On the Mining Models tab in the Data Mining Model Designer, right-click the Microsoft_Time_Series box and select Process Mining Structure and All Models, as Figure 8 shows.
Figure 8: Choosing the Option to Process the Mining Structure and Its Model
Figure 9: Watching the Progress in the Processing of the Mining Structure and Its Model
When the Mining Model Viewer is displayed, you'll see a line chart like the one in Figure 10, which shows historical and predicted tornado data by year for the states in Tornado Alley. Specifically, it shows the number of tornados (as a percentage of deviation from a baseline value) in each state from 1973 through 2011, with predictions for five more years. The first thing you're likely to notice is a rather tall spike predicted for Kansas. We know this prediction is wrong: forecasting forward from 2011, it projects roughly a 5,000 percent increase in tornados (i.e., nearly 500 tornados) in Kansas in 2012, which didn't happen. This brings us back to Dr. Box's statement that "all models are wrong but some are useful." This one is neither correct nor useful. I'll deal with it a little later. For now, clear the check box next to KS. As you can see in Figure 11, the projected trend is much better.
Next, clear all the check boxes, except for SD, which will isolate the results for South Dakota. Use the Prediction steps option to increase the prediction steps to 25. Notice that you're now projecting future tornado patterns 25 years into the future, to the year 2036. It's important to note that unless there's a very strong and regular pattern in the historical data, the time series algorithm might not be accurate beyond a few periods. However, looking at several periods will help you spot a predicted pattern and verify that the time series algorithm is doing its job.
Check the Show Deviations box to display the range of confidence in the accuracy of the predicted values. Figure 12 shows the results. South Dakota has had a fairly regular pattern of tornado activity from 1973 to 2011, which gives the time series algorithm a lot to work with. Even if you were to move the line to the upper or lower end of the deviation range, you could still see the predicted pattern.
Now, back to Kansas. Remember the big spike predicted for 2012? Clearly, the time series algorithm is having problems making a prediction with this data when using the default settings. This scenario is actually very common, and you just need to offer some guidance to get it on the right track.
Every one of the nine Microsoft data mining algorithms has a different set of parameters that do different things. These are the knobs and switches that control the behavior of the complex mathematical processes and rules used to make predictions. There are a lot of complex details that warrant further discussion and a deeper understanding. Many of these settings are covered in depth in the book Data Mining with Microsoft SQL Server 2008 (Wiley Publishing, 2009) by Jamie MacLennan, ZhaoHui Tang, and Bogdan Crivat. Making adjustments to these settings can either make a model work well or make the model go crazy. I encourage you to experiment with different settings by making a change and reprocessing the model. It can be time consuming, but this is an important part of the process for creating a useful data mining solution.
For this project, switch to the Mining Models tab, right-click the Microsoft_Time_Series box, and select Set Algorithm Parameters. Note that the default settings for the MAXIMUM_SERIES_VALUE and MINIMUM_SERIES_VALUE parameters are huge numbers. When these values are left unconstrained, the algorithm blows a fuse and gives crazy results. Change MAXIMUM_SERIES_VALUE to 200 and MINIMUM_SERIES_VALUE to 0, then click the OK button to save the settings.
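If you prefer scripting to the dialog box, algorithm parameters can also be supplied in DMX when a model is added to a structure. This sketch shows the same two settings applied to a hypothetical single-series model (the model name here is an illustration, not one created by the walkthrough):

```sql
-- DMX sketch: constraining the Time Series algorithm at model
-- definition time. [Time Series - KS Constrained] is a hypothetical
-- model name used for illustration.
ALTER MINING STRUCTURE [Tornados By Year By State]
ADD MINING MODEL [Time Series - KS Constrained] (
    [Year],
    [KS] PREDICT
) USING Microsoft_Time_Series (
    MAXIMUM_SERIES_VALUE = 200,  -- cap predictions at 200 tornados
    MINIMUM_SERIES_VALUE = 0     -- a negative tornado count is meaningless
);
```

Bounding the series to the range of physically plausible values is often the single most effective "guidance" you can give the Time Series algorithm.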
Reprocess and browse the model. This time the prediction results for KS are in a moderate range. If you increase the number of prediction steps, you'll see that the model seems to be making a reasonable set of predictions for annual tornado counts for the next 25 years. However, if you select the Show Deviations check box, you'll see that the algorithm has very little confidence in its ability to make a prediction with the information provided, as Figure 13 shows.
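The same predictions the viewer charts can be retrieved directly with a DMX query. Assuming the model name created earlier in this walkthrough, a sketch that asks for the next 25 yearly values for Kansas looks like this:

```sql
-- DMX sketch: retrieve the next 25 predicted yearly values for KS.
-- PredictTimeSeries returns a nested table of (time, value) rows;
-- FLATTENED unnests it into an ordinary result set.
SELECT FLATTENED
    PredictTimeSeries([KS], 25) AS PredictedTornados
FROM [Time Series - Tornados By Year By State];
```

You can run a query like this from a DMX query window in SQL Server Management Studio connected to the SSAS instance, which is handy when you want to feed predictions into a report rather than browse them in the viewer.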
Why can't this model predict the future of tornado activity in Kansas? I posed this question to Mark Tabladillo, who does a lot of work with predictive modeling and statistical analysis. He said, "Typically, we do not get 'whys' in data mining." It's often necessary to create multiple models with different filters and variables to validate a pattern and a reliable prediction. The desire to explain "why" is human nature, but a scientific explanation might not always be possible. According to Tabladillo, "Correlation and causality are different, and most data mining results are correlation alone. Through time and patience, we can make a case for causality, though people, from academics to news reporters, are tempted to jump to a causal conclusion, either to project that they have done that requisite homework or simply to be the first mover-of-record."
In this case, it might be that Kansas doesn't have a strong fluctuating pattern of annual tornado counts like South Dakota does. Keep in mind that, so far, you're considering only the absolute count of all tornados in each state, aggregated over a year. You're not considering other attributes such as each tornado's category, strength, or duration or the damage caused by each tornado. This information is in the data and can be used to create more targeted models.