Five Low-tech Data Mining Tips
In the data mining world, many would have you believe that refining the technical aspects of the analytic process is the key to improving the performance of mining exercises and the insight gained from them. While there is some truth to this, the critical area for progress lies in the non-technical procedures that are so vital to a powerful outcome.
Here are five low-tech ways to improve your data mining exercises, with a view to preventing all-too-common errors that can keep marketers from optimizing customer relationships.
1. Take time to prepare your data.
We’ve all heard the “garbage in, garbage out” slogan, which holds that the quality of the final output depends on the quality of the data input. Indeed, experience shows that securing the appropriate input data and preparing it for analysis consumes about two-thirds of the total time necessary to complete a data mining exercise. It is time well invested.
Consider the anecdote of the analyst who discovered that 39 percent of his client’s customers owned nine vehicles. When he went looking for a logical explanation, a somewhat embarrassed systems analyst volunteered that the number “9” was used as a default to represent missing data. The same convention was used in much of the demographic data housed by the marketer and, of course, the analyst had used much of that data as is.
A little more data preparation and a few questions in advance would have prevented this unfortunate scenario.
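One way to ask those questions of the data itself is to look for any single value that accounts for an implausibly large share of a field. The sketch below is only an illustration: the function names, the 35 percent threshold, and the toy column are assumptions made for this example, not part of the anecdote. A flagged value is a prompt to ask the data owner what it means, not proof of a missing-data code.

```python
from collections import Counter


def value_shares(values):
    """Return (value, share of records) pairs, most frequent value first."""
    counts = Counter(values)
    total = len(values)
    return [(value, count / total) for value, count in counts.most_common()]


def flag_suspect_values(values, threshold=0.35):
    """List values whose share meets the threshold (an arbitrary cutoff chosen here)."""
    return [(value, share) for value, share in value_shares(values) if share >= threshold]


def mask_sentinel(values, sentinel):
    """Replace a confirmed missing-data code with None so it is not analyzed as real data."""
    return [None if value == sentinel else value for value in values]


# Toy column invented for illustration: vehicles owned per customer.
vehicles_owned = [1, 2, 9, 1, 9, 2, 9, 1, 9, 3]

suspects = flag_suspect_values(vehicles_owned)
print(suspects)

# Only after the data owner confirms that 9 means "missing":
cleaned = mask_sentinel(vehicles_owned, sentinel=9)
print(cleaned)
```

Running a check like this on every field before modeling takes minutes, and it surfaces the kind of default-value convention that otherwise shows up as a startling "finding" later on.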
2. Select a valid sample.
Sampling is a key component of analysis, and only a valid sample provides reliable results. What constitutes a valid sample? Consider the following scenario: A communications firm with a customer base of 2.5 million selected a 10,000-name customer sample by choosing every hundredth record until it secured the desired quantity. Using this procedure, the firm stopped counting once it reached the 1 millionth record (10,000 names at 100 records per name). Sounds pretty straightforward, doesn’t it? But what about the remaining 1.5 million customers? Is it a problem that none of these “bottom” 1.5 million names were selected? As it turns out, this firm sorts its files not alphabetically, but by how long ago a customer made his or her first purchase. Consequently, the result was a biased sample that overrepresented long-tenured customers and excluded “newer” customers entirely. As you can see, selecting a valid sample makes a big difference.
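The arithmetic behind this scenario can be sketched in a few lines. The code below is a hedged illustration, not the firm's actual procedure: records are represented only by their position in the sorted file (0 is the first record), and the function names and the fixed random seed are assumptions made for the example. It contrasts the "every hundredth record until full" approach with two alternatives that reach the whole file: a systematic sample whose interval is the file size divided by the sample size, and a simple random sample.

```python
import random

POPULATION = 2_500_000  # customer records in the sorted file
SAMPLE_SIZE = 10_000


def every_kth_until_full(population, sample_size, k):
    """Take every k-th record position and stop as soon as the sample is full."""
    return list(range(k - 1, population, k)[:sample_size])


def systematic_full_span(population, sample_size, rng):
    """Systematic sample with interval population // sample_size and a random start."""
    step = population // sample_size
    start = rng.randrange(step)
    return list(range(start, population, step)[:sample_size])


def simple_random(population, sample_size, rng):
    """Simple random sample of record positions, drawn without replacement."""
    return rng.sample(range(population), sample_size)


rng = random.Random(0)  # fixed seed only so the illustration is repeatable

biased = every_kth_until_full(POPULATION, SAMPLE_SIZE, k=100)
spanning = systematic_full_span(POPULATION, SAMPLE_SIZE, rng)
randomized = simple_random(POPULATION, SAMPLE_SIZE, rng)

# Highest record position reached by each method (0-based).
print(max(biased), max(spanning), max(randomized))
```

With an interval of 100, the last position selected is the 1 millionth record, so nothing after it can enter the sample. With an interval of 250 (2,500,000 divided by 10,000), the same 10,000 names are spread across the entire file regardless of how it is sorted.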