# 5 Steps to Start Data Mining

By: Martin Brown

We’ve never had it so good when it comes to data, and to the tools and physical storage required to record it. That’s fortunate, because there has been a corresponding surge in the amount of data being stored. Businesses now keep everything from web access logs, user profile information, and system logs to sensor readings and physical content such as maps and geographical data. The result is massive quantities of data.

To make use of it, we need to extract useful information from this mountain of data by digging through it, and looking for sense among the bytes. This is called data mining.

Data mining is a five-step process:

1. Identifying the source information
2. Picking the data points that need to be analyzed
3. Extracting the relevant information from the data
4. Identifying the key values from the extracted data set
5. Interpreting and reporting the results

## Identifying Source Information

As described in *Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition*, you need to examine different datasets and collections of information, and combine them to build up a real picture of what you want:

*There are several standard datasets that we will come back to repeatedly. Different datasets tend to expose new issues and challenges, and it is interesting and instructive to have in mind a variety of problems when considering learning methods. In fact, the need to work with different datasets is so important that a corpus containing around 100 example problems has been gathered together so that different algorithms can be tested and compared on the same set of problems.*

As the first step in the list above indicates, you need to identify the data, or the sources of information, and from that determine which information you should be studying. This requires building rules and structure around the information to extract the critical elements. In Chapter 3 of *Data Mining: Practical Machine Learning Tools and Techniques*, you’ll find different techniques for building rules, along with clustering techniques that help you concentrate on the information you need. Chapter 6 covers some important points on how to build a learning structure that correctly gets the data you need.

## Picking Data Points

This learning structure helps you identify the data that needs to be analyzed. Bayesian techniques rely on building a corpus of data and then working out the probability that a given piece of data is related to the information you have extracted. Depending upon the complexity of the data and the information you are working with, the extraction and the probability calculation can be straightforward or complex. In many cases, though, the probability can be estimated simply by counting frequencies, sometimes drawing on past analysis of similar data sources.
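The frequency-based estimate described above can be sketched in a few lines of Python. This is only an illustration, not a method taken from either book: the `history` records, the source names, and the `relevance_probability` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical past observations: (data source, was the record relevant?).
history = [
    ("web_log", True), ("web_log", False), ("web_log", True),
    ("sensor", False), ("sensor", False), ("sensor", True),
    ("web_log", True), ("sensor", False),
]

def relevance_probability(records, source):
    """Estimate P(relevant | source) as a relative frequency in past data."""
    totals = Counter(src for src, _ in records)
    hits = Counter(src for src, relevant in records if relevant)
    if totals[source] == 0:
        raise ValueError(f"no past records for source {source!r}")
    return hits[source] / totals[source]

p_web = relevance_probability(history, "web_log")
p_sensor = relevance_probability(history, "sensor")
print(f"P(relevant | web_log) = {p_web:.2f}")
print(f"P(relevant | sensor)  = {p_sensor:.2f}")
```

With so few records the estimates are crude. A larger corpus, or past analysis of similar sources, would give more stable frequencies.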

*Doing Bayesian Data Analysis*, by John Kruschke, goes into significantly more detail about the process of building the rules that ultimately define your Bayesian analysis. The book starts by examining the core data structure, and then covers building rules using the R language to calculate the probabilities. The beauty of the book is the simple way these processes are introduced, first through simple examples, and then by forming specific hypotheses using these data points:

*A crucial application of Bayes’ rule is to determine the probability of a model when given a set of data. What the model itself provides is the probability of the data, given specific parameter values and the model structure. We use Bayes’ rule to get from the probability of the data, given the model, to the probability of the model, given the data.*
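To make the quoted idea concrete, here is a minimal sketch of Bayes’ rule applied to two candidate models, written in Python rather than the R used in the book. The coin-flip setting, the model names, the priors, and the data (8 heads in 10 flips) are all assumptions chosen for illustration. Each model supplies the probability of the data, P(D | M), and Bayes’ rule turns that into P(M | D) = P(D | M) P(M) / P(D).

```python
from math import comb

def likelihood(theta, heads, flips):
    """P(data | model): binomial probability of `heads` in `flips` flips."""
    return comb(flips, heads) * theta**heads * (1 - theta) ** (flips - heads)

def posterior(priors, thetas, heads, flips):
    """P(model | data) for each model, via Bayes' rule."""
    joint = {m: priors[m] * likelihood(thetas[m], heads, flips) for m in priors}
    evidence = sum(joint.values())  # P(data), summed over all candidate models
    return {m: j / evidence for m, j in joint.items()}

# Hypothetical models: a fair coin and a coin biased towards heads.
thetas = {"fair": 0.5, "biased": 0.8}
priors = {"fair": 0.5, "biased": 0.5}

post = posterior(priors, thetas, heads=8, flips=10)
for model, p in post.items():
    print(f"P({model} | data) = {p:.3f}")
```

Changing the priors shows how strongly the conclusion depends on what was believed before the data arrived.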

The book also covers a more critical element of the process: justifying the results by comparing the computed value against both the original hypothesis and the null hypothesis. The book concentrates on the mechanics of the Bayesian calculations and rules, but this is only one part of the overall data analysis process.

Once the basics of the data extraction and identification process have been completed, it is time to turn that information and structure into a result. Chapter 6 of *Data Mining: Practical Machine Learning Tools and Techniques* covers implementing this process and building the decisions that help to generate the final result. Again, the complexity of the process is not hidden here. Using straightforward statistics, it covers Bayesian techniques and more advanced clustering and learning-based solutions. Clustering involves setting up ranges and groups to align data into specific clusters. The difficulty with clustering is determining the size and complexity of each cluster, and what the groupings will ultimately define and describe.
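As a rough illustration of grouping values into clusters, the sketch below runs a tiny one-dimensional k-means in plain Python. It is not the algorithm as presented in the book. The `load_times` data and the three starting centers are hypothetical, and choosing them is exactly the sizing difficulty mentioned above.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center,
    then move each center to the mean of its assigned values."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical page-load times in seconds.
load_times = [0.8, 1.1, 0.9, 1.0, 4.8, 5.2, 5.0, 9.7, 10.1, 10.2]
centers, clusters = kmeans_1d(load_times, centers=[0.0, 5.0, 12.0])

for center, members in zip(centers, clusters):
    print(f"center {center:.2f}: {members}")
```

A different number of starting centers would produce different groupings from the same data, so the choice needs to reflect what the clusters are meant to describe.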

## Extracting and Identifying Key Values

Learning techniques are more complex. They rely on current and past data to build a structure of past, valid experiences against which new information can be compared, interpreted, and extracted. These steps help with both the extraction and the identification of key values (points 3 and 4 from our step-by-step list).
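One simple way to compare new information against stored past experiences is a nearest-neighbour lookup. The sketch below is an assumption about how that idea could look in Python, not a technique attributed to the books. The weather examples and the `nearest_label` helper are hypothetical.

```python
# Hypothetical past experiences: (temperature in °C, humidity in %) -> outcome.
past = [
    ((30.0, 40.0), "dry"),
    ((28.0, 45.0), "dry"),
    ((18.0, 90.0), "rain"),
    ((16.0, 85.0), "rain"),
]

def nearest_label(examples, query):
    """Return the label of the stored example closest to `query`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(examples, key=lambda ex: squared_distance(ex[0], query))
    return label

prediction = nearest_label(past, (17.0, 88.0))
print(prediction)
```

In practice the features would usually be scaled first, so that one attribute does not dominate the distance.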

Clustering, learning, and data identification are also covered in detail in *Data Mining: Concepts and Techniques, 3rd Edition*. This book covers the identification of valid values and information, and how to spot and exclude data that does not form part of the useful dataset. For example, when looking at weather data, it is key to ignore values that fall outside a sensible range. Temperature readings above 50°C are probably bogus in most regions, but temperatures slightly outside the typical ranges may indicate extreme, rather than impossible, weather.

*As explained in Chapter 2, one way of handling [missing values] is to treat them as just another possible value of the attribute; this is appropriate if the fact that the attribute is missing is significant in some way. In that case, no further action need be taken. But if there is no particular significance in the fact that a certain instance has a missing attribute value, a more subtle solution is needed. It is tempting to simply ignore all instances in which some of the values are missing, but this solution is often too draconian to be viable. Instances with missing values often provide a good deal of information. Sometimes the attributes with values that are missing play no part in the decision, in which case these instances are as good as any other.*
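The weather example and the treatment of missing values can be combined in a short Python sketch. The readings, the 50°C cutoff (borrowed from the example above), and the "typical" range are assumptions for illustration. Missing readings are kept as their own category rather than silently dropped.

```python
# Hypothetical daily temperature readings in °C; None marks a missing value.
readings = [21.5, 23.0, None, 19.8, 87.0, 41.2, None, 22.4, -3.0]

PLAUSIBLE_MAX_C = 50.0          # assumed cutoff for "probably bogus" readings
TYPICAL_RANGE_C = (-5.0, 35.0)  # assumed typical range for this region

def classify(reading):
    """Label a reading as missing, bogus, extreme, or normal."""
    if reading is None:
        return "missing"  # treated as just another possible value
    if reading > PLAUSIBLE_MAX_C:
        return "bogus"
    low, high = TYPICAL_RANGE_C
    if reading < low or reading > high:
        return "extreme"  # unusual, but still plausible weather
    return "normal"

labels = [classify(r) for r in readings]

# Keep plausible readings, including extreme ones; exclude bogus and missing.
usable = [r for r, label in zip(readings, labels) if label in ("normal", "extreme")]

print(labels)
print(usable)
```

Whether to keep the "missing" category or to fill in a value depends on whether the absence itself carries meaning, as the quoted passage points out.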

By this point, you should have collated, identified, and extracted the correct information from the larger corpus of data. Now you need to interpret the results of this collation. There are many different approaches to doing this, but all of them build on the previous steps, using further validation and qualification of the information to pick out the key data required. The results can also point to a wider picture that the extracted data highlights:

*When wise people make critical decisions, they usually take into account the opinions of several experts rather than relying on their own judgment or that of a solitary trusted advisor. For example, before choosing an important new policy direction.*

## Interpreting and Reporting Results

This final stage of our five-step process involves resolving the information into values that are easier to quantify and compare, such as basic numerical counts, direct value comparisons, or group comparisons that pick out specific elements. A simple ranking is common, for example with hotel room ratings, while a more complex comparative ranking may be used with products. Individual products may be compared against a group of peers that have similar features or that are top sellers. The data that you extracted in earlier stages can then be combined into the final result.
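The difference between a simple ranking and a comparison within a group of peers can be sketched as follows. The product names, groups, and ratings are hypothetical, and sorting by average rating is just one possible ranking criterion.

```python
from collections import defaultdict

# Hypothetical products: (name, group, average rating).
products = [
    ("Kettle A", "kettles", 4.6),
    ("Kettle B", "kettles", 3.9),
    ("Toaster A", "toasters", 4.1),
    ("Toaster B", "toasters", 4.8),
    ("Kettle C", "kettles", 4.2),
]

def rank_within_groups(items):
    """Rank product names by rating, highest first, inside each group."""
    groups = defaultdict(list)
    for name, group, rating in items:
        groups[group].append((name, rating))
    return {
        group: [name for name, _ in sorted(members, key=lambda m: m[1], reverse=True)]
        for group, members in groups.items()
    }

# Simple ranking across all products.
overall = [name for name, _, _ in sorted(products, key=lambda p: p[2], reverse=True)]

# Comparative ranking against peers in the same group.
rankings = rank_within_groups(products)

print(overall)
print(rankings)
```

A product that sits mid-table overall can still lead its own group, which is why the two views are often reported together.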

Data mining is not a simple process, and it relies on approaching the data in a systematic and mathematical fashion. But it also relies on being flexible, and on working with data that might not fit into a nicely organized and sequential format.

**About the Author**

*Martin ‘MC’ Brown is an author and contributor to over 26 books covering an array of topics, including the recently published **Getting Started with CouchDB**. His expertise spans myriad development languages and platforms: Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Microsoft WP, Mac OS and more. Martin currently works as the Director of Documentation for Continuent and can be reached at **about.me/mcmcslp**.*

*The books highlighted in this post are all available on Safari Books Online. If you aren’t currently a member, a 10-day free trial is available here.*