Data Mining Architecture
Data mining is a core component of SQL Server Analysis Services (SSAS) 2012, baked into SSAS's multidimensional designer and delivery architecture. The data mining structures are stored in the same database as SSAS analytical cubes, but they share only a few of the project assets. To define a data mining model in SQL Server Data Tools (SSDT), you need to create an SSAS multidimensional project, but you don't need to define any cubes or dimensions. A mining model can get its data directly from any data source or database table defined in the project's data source view, as Figure 1 shows.
Figure 1: Examining the Data Mining Architecture
Data Mining Tools
When data mining was first introduced, the only way to create and use a model was through the Business Intelligence Development Studio (BIDS), which was a database development tool rather than an application suited for data analysts. Several data-mining viewers were also developed so that a mining model could be viewed graphically, but all these viewers were baked into the development environment and not accessible to business users. Programmers could integrate some of these viewers into custom applications, but that wasn't done very often.

When Microsoft introduced two data mining add-ins (Data Mining Client and Table Analysis Tools) for Microsoft Excel 2007, data mining was brought to the business community. Many of the model viewers used in the development environment were integrated into the Excel add-ins, along with several features that use Excel's native charts, pivot tables, filters, slicers, and conditional formatting capabilities.
Since then, Microsoft has been providing tools that let business users do their own analyses. Data mining remains a core component of SSAS 2012, but the target audience for the design and delivery tools has shifted from IT developers to business users, with Excel as the delivery vehicle. The latest data mining add-ins for Excel 2013, introduced with SQL Server 2012 SP1, let business users create and consume data mining models and perform advanced predictive analyses.
A Guided Tour
In the following short tour, I'll introduce you to the Data Mining Model Designer in SSDT and the data mining add-ins for Excel 2013. If you want to follow along, I've provided a sample database that I derived from real data obtained from the National Oceanic and Atmospheric Administration (NOAA). The database contains weather observations and climatic events—including tornados, hurricanes, tsunamis, earthquakes, and volcanoes—that have occurred over the past 40 years. It's more interesting to work with real information, but I make no guarantee about the accuracy or reliability of this data, so you shouldn't use it as the basis for making any decisions.

To follow along, you need to have:
- The Developer or Enterprise edition of SQL Server 2012 SP1, with the relational database engine, SSAS in multidimensional storage mode, and the client tools installed either locally on a single development machine or on a server to which you have administrative access
- An SSAS instance (installed locally on a single development machine or a server) on which you have permission to create databases and objects
- Access to a SQL Server relational instance that can read and process data for the mining structures
- Excel 2013 (32 bit or 64 bit) installed
- The Microsoft SQL Server 2012 SP1 Data Mining Add-ins for Microsoft Office, downloaded and installed from the Microsoft Download Center
- The sample Weather and Events database, downloaded (by clicking the Download the Code button near the top of the page) and restored
Using SSDT's Data Mining Model Designer
In the following example, I'll walk you through creating a data mining project in SSDT. The data mining structure and data mining model that you'll create and explore will deal with tornado data from the U.S. states that are in "Tornado Alley," a region known for a high number of seasonal tornados. Those states are:
- Kansas (KS)
- Missouri (MO)
- Nebraska (NE)
- Oklahoma (OK)
- South Dakota (SD)
- Texas (TX)
Step 1: Create a New Data Mining Project
The first step is to create a new data mining project. To do so, open SSDT, select New on the File menu, and choose Analysis Services Multidimensional and Data Mining Project. Name both the project and the solution Weather and Events.
Step 2: Prepare the Data
The next step is to prepare the source data by simplifying, grouping, aggregating, and cleansing it. Don't underestimate the importance of this step. Data preparation is usually an iterative process. Start with small and simple sets of data. Create views or transform source data into separate tables, and don't be afraid to create multiple sets of data in different structures. Some mining models work best with values in separate columns, whereas other mining models work better with different attribute values in the same column. For ongoing analyses and complex data sources, your solution might need to include an extraction, transformation, and loading (ETL) process using SQL Server Integration Services (SSIS) packages.

The data preparation for this sample project has been completed for you. The views I've created in the Weather and Events database include data transformation logic, so this data is in the correct format for the analyses you'll perform.
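To give a sense of what this preparation logic looks like, here's a rough T-SQL sketch of how a view such as vw_TornadosByYearByState might pivot raw event rows into one column per state. The source table and column names (dbo.TornadoEvents, EventDate, State) are assumptions for illustration only; the actual view definitions ship with the sample database.

```sql
-- Hypothetical sketch of the shape of vw_TornadosByYearByState:
-- one row per year, one tornado-count column per Tornado Alley state.
-- dbo.TornadoEvents and its columns are assumed names, not the
-- sample database's actual schema.
CREATE VIEW dbo.vw_TornadosByYearByState
AS
SELECT  YEAR(EventDate)                               AS [Year],
        SUM(CASE WHEN State = 'KS' THEN 1 ELSE 0 END) AS KS,
        SUM(CASE WHEN State = 'MO' THEN 1 ELSE 0 END) AS MO,
        SUM(CASE WHEN State = 'NE' THEN 1 ELSE 0 END) AS NE,
        SUM(CASE WHEN State = 'OK' THEN 1 ELSE 0 END) AS OK,
        SUM(CASE WHEN State = 'SD' THEN 1 ELSE 0 END) AS SD,
        SUM(CASE WHEN State = 'TX' THEN 1 ELSE 0 END) AS TX
FROM    dbo.TornadoEvents   -- assumed source table name
GROUP BY YEAR(EventDate);
```

Putting this logic in a view rather than an ETL package keeps the transformation close to the data and easy to iterate on, which suits the experimental nature of data preparation.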
Step 3: Add the Data Source to the Project
At this point, you need to add the Weather and Events database as a data source in your project. In SSDT's Solution Explorer, right-click the Data Sources folder and select New Data Source to start the Data Source Wizard.

In the Data Source Wizard, click Next, then New to add a new data source. In the Connection Manager dialog box, connect to the relational database server and select the Weather and Events database, as Figure 2 shows. Click OK in the Connection Manager dialog box, then click the Next button.
Figure 2: Adding the Weather and Events Database as a Data Source
Figure 3: Selecting the Type of Authentication to Use
After you select your authentication method on the Impersonation Information page, click Next. On the final page, accept the default data source name and click Finish to add the data source and close the Data Source Wizard.
Step 4: Add the Views
As I mentioned previously, the Weather and Events database already includes the views for this sample project. To add the views to your project, right-click the Data Source Views node in Solution Explorer and choose New Data Source View.

When the Data Source View Wizard appears, click the Next button three times so that you're on the Select Tables and Views page. In the Available objects list on this page, select the six objects highlighted in Figure 4, then click the top-most button between the two list boxes to move the selected views to the Included objects list. Click Next, then click Finish on the following page to add the views and close the wizard.
Figure 4: Adding the Views
Figure 5: Setting the Logical Primary Key for One of the Data Source Views
Step 5: Create a Data Mining Structure
You're now ready to create a data mining structure that will have one new mining model. Right-click the Mining Structures node in Solution Explorer and select New Mining Structure. When the Data Mining Structure Wizard appears, click Next twice so that you're on the Create the Data Mining Structure page. As Figure 6 shows, there are nine mining model algorithms included in the Microsoft data mining framework. Each algorithm applies a unique set of mathematical formulas, logic, and rules to analyze data in the mining structure. Think of each as a separate black box, capable of analyzing a set of data and making predictions in different ways. This sample project uses the Microsoft Time Series algorithm, so select that algorithm from the drop-down list, then click Next twice to go to the Specify Table Types page. In the Input tables list on this page, select the Case check box for the vw_TornadosByYearByState view and click Next.
Figure 6: Selecting the Mining Model Algorithm
Figure 7: Specifying the Training Data
On the Specify Columns' Content and Data Type page, change the data type for the KS, MO, NE, OK, SD, and TX columns from Long to Double, because the Microsoft Time Series algorithm works best with a floating point data type and might return errors with long integer values. Leave the Year column set to Long. Click Next.
In the Completing the Wizard page, you need to give the mining structure and mining model appropriate names. The mining structure will become the container for multiple models, and each model uses a specific model algorithm that should be incorporated into the name. The name of the structure should also reflect the name of the table or view on which it's based.
For this example, modify the default names so that the mining structure is named Tornados By Year By State and the mining model is named Time Series - Tornados By Year By State. Click Finish to create the data mining structure.
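Behind the scenes, the wizard creates these objects with DMX (Data Mining Extensions), the query language for SSAS data mining. The following is an approximate DMX sketch of the structure and model you just defined — a simplified illustration rather than the wizard's exact output:

```sql
-- DMX sketch (run against the SSAS instance, e.g. from an SSMS DMX
-- query window). Approximates what the wizard creates; column
-- bindings to the data source view are omitted for brevity.
CREATE MINING STRUCTURE [Tornados By Year By State] (
    [Year] LONG KEY TIME,     -- the time key for the series
    [KS] DOUBLE CONTINUOUS,
    [MO] DOUBLE CONTINUOUS,
    [NE] DOUBLE CONTINUOUS,
    [OK] DOUBLE CONTINUOUS,
    [SD] DOUBLE CONTINUOUS,
    [TX] DOUBLE CONTINUOUS
);

ALTER MINING STRUCTURE [Tornados By Year By State]
ADD MINING MODEL [Time Series - Tornados By Year By State] (
    [Year],
    [KS] PREDICT,   -- each state's series is both input and output
    [MO] PREDICT,
    [NE] PREDICT,
    [OK] PREDICT,
    [SD] PREDICT,
    [TX] PREDICT
) USING Microsoft_Time_Series;
```

You don't need to run any DMX for this walkthrough; the sketch just shows that the mining structure (the column definitions) and the mining model (the algorithm applied to those columns) are distinct objects, which is why one structure can later host multiple models.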
Step 6: Process and Explore the Mining Structure
With the mining structure created, it's time to process and explore it. On the Mining Models tab in the Data Mining Model Designer, right-click the Microsoft_Time_Series box and select Process Mining Structure and All Models, as Figure 8 shows.
Figure 8: Choosing the Option to Process the Mining Structure and Its Model
Figure 9: Watching the Progress in the Processing of the Mining Structure and Its Model
When the Mining Model Viewer is displayed, you'll see a line chart like the one in Figure 10, which shows historical and predicted tornado data by year for the states in Tornado Alley. Specifically, it shows the number of tornados (as a percentage of deviation from a baseline value) in each state from 1973 through 2011, with predictions for five more years. The first thing you're likely to notice is a rather tall spike predicted for Kansas. We know this prediction is wrong: forecasting forward from 2011, it projects roughly a 5,000 percent increase in tornados (i.e., nearly 500 tornados) in Kansas in 2012, which didn't happen. This brings us back to Dr. Box's statement that "all models are wrong but some are useful." This one is neither correct nor useful. I'll deal with it a little later. For now, clear the check box next to KS. As you can see in Figure 11, the projected trend is much better.
Next, clear all the check boxes, except for SD, which will isolate the results for South Dakota. Use the Prediction steps option to increase the prediction steps to 25. Notice that you're now projecting future tornado patterns 25 years into the future, to the year 2036. It's important to note that unless there's a very strong and regular pattern in the historical data, the time series algorithm might not be accurate beyond a few periods. However, looking at several periods will help you spot a predicted pattern and verify that the time series algorithm is doing its job.
Check the Show Deviations box to display the range of confidence in the accuracy of the predicted values. Figure 12 shows the results. South Dakota has had a fairly regular pattern of tornado activity from 1973 to 2011, which gives the time series algorithm a lot to work with. Even if you were to move the line to the upper or lower end of the deviation range, you could still see the predicted pattern.
Now, back to Kansas. Remember the big spike predicted for 2012? Clearly, the time series algorithm is having problems making a prediction with this data when using the default settings. This scenario is actually very common, and you just need to offer some guidance to get it on the right track.
Every one of the nine Microsoft data mining algorithms has a different set of parameters that do different things. These are the knobs and switches that control the behavior of the complex mathematical processes and rules used to make predictions. There are a lot of complex details that warrant further discussion and a deeper understanding. Many of these settings are covered in depth in the book Data Mining with Microsoft SQL Server 2008 (Wiley Publishing, 2009) by Jamie MacLennan, ZhaoHui Tang, and Bogdan Crivat. Making adjustments to these settings can either make a model work well or make the model go crazy. I encourage you to experiment with different settings by making a change and reprocessing the model. It can be time consuming, but this is an important part of the process for creating a useful data mining solution.
For this project, switch to the Mining Models tab, right-click the Microsoft_Time_Series box, and select Set Algorithm Parameters. Note that the default settings for the MAXIMUM_SERIES_VALUE and MINIMUM_SERIES_VALUE parameters are huge numbers. When these values are left unconstrained, the algorithm blows a fuse and gives crazy results. Change MAXIMUM_SERIES_VALUE to 200 and MINIMUM_SERIES_VALUE to 0, then click the OK button to save the settings.
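If you prefer scripting to the dialog box, algorithm parameters can also be supplied in DMX when a model is added to a structure. This sketch shows the same two settings applied to a hypothetical single-series model (the model name here is an illustration, not one created by the walkthrough):

```sql
-- DMX sketch: constraining the Time Series algorithm at model
-- definition time. [Time Series - KS Constrained] is a hypothetical
-- model name used for illustration.
ALTER MINING STRUCTURE [Tornados By Year By State]
ADD MINING MODEL [Time Series - KS Constrained] (
    [Year],
    [KS] PREDICT
) USING Microsoft_Time_Series (
    MAXIMUM_SERIES_VALUE = 200,  -- cap predictions at 200 tornados
    MINIMUM_SERIES_VALUE = 0     -- a negative tornado count is meaningless
);
```

Bounding the series to the range of physically plausible values is often the single most effective "guidance" you can give the Time Series algorithm.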
Reprocess and browse the model. This time the prediction results for KS are in a moderate range. If you increase the number of prediction steps, you'll see that the model seems to be making a reasonable set of predictions for annual tornado counts for the next 25 years. However, if you select the Show Deviations check box, you'll see that the algorithm has very little confidence in its ability to make a prediction with the information provided, as Figure 13 shows.
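The same predictions the viewer charts can be retrieved directly with a DMX query. Assuming the model name created earlier in this walkthrough, a sketch that asks for the next 25 yearly values for Kansas looks like this:

```sql
-- DMX sketch: retrieve the next 25 predicted yearly values for KS.
-- PredictTimeSeries returns a nested table of (time, value) rows;
-- FLATTENED unnests it into an ordinary result set.
SELECT FLATTENED
    PredictTimeSeries([KS], 25) AS PredictedTornados
FROM [Time Series - Tornados By Year By State];
```

You can run a query like this from a DMX query window in SQL Server Management Studio connected to the SSAS instance, which is handy when you want to feed predictions into a report rather than browse them in the viewer.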
Why can't this model predict the future of tornado activity in Kansas? I posed this question to Mark Tabladillo, who does a lot of work with predictive modeling and statistical analysis. He said, "Typically, we do not get 'whys' in data mining." It's often necessary to create multiple models with different filters and variables to validate a pattern and a reliable prediction. The desire to explain "why" is human nature, but a scientific explanation might not always be possible. According to Tabladillo, "Correlation and causality are different, and most data mining results are correlation alone. Through time and patience, we can make a case for causality, though people, from academics to news reporters, are tempted to jump to a causal conclusion, either to project that they have done that requisite homework or simply to be the first mover-of-record."
In this case, it might be that Kansas doesn't have a strong fluctuating pattern of annual tornado counts like South Dakota does. Keep in mind that, so far, you're considering only the absolute count of all tornados in each state, aggregated over a year. You're not considering other attributes such as each tornado's category, strength, or duration or the damage caused by each tornado. This information is in the data and can be used to create more targeted models.