Data Mining: Where to Dig First?
In the age of abundant data, extracting insights from mounds of data can overwhelm even seasoned analysts. In the data-mining business, more than half of the struggle is about determining “where to dig first.”
The main job of a modern data scientist is to answer business questions for decision-makers. To do that, they have to be translators between the business world and the technology world. This in-between position often creates a great deal of confusion for aspiring data scientists, as the gaps between business challenges and the elements that make up the answers are very wide, even with all of the toolsets that are supposedly “easy to use.” That’s because insights do not come out of the toolsets automatically.
Business questions are often very high-level or even obscure, such as:
- Let’s try this new feature with the “best” customers
- How do we improve customer “experience”?
- We did lots of marketing campaigns; what worked?
When someone mentions “best” customers, statistically trained analysts jump into the mode of “Yeah! Let’s build some models!” If you are holding a hammer, everything may look like nails. But we are not supposed to build models just because we can. Why should we build a model and, if we do, whom are we going after? What does that word “best” mean to you?
Breaking that word down in mathematically representable terms is indeed the first step for the analyst (along with the decision-makers). That’s because “best” can mean lots of different things.
If the users of the information are in the retail business, in a classical sense, it could mean:
- Frequently Visiting Customers: Expressed in terms of “Number of transactions past 12 months,” “Life-to-date number of transactions,” “Average days between transactions,” “Number of Web visits,” etc.
- Big Spenders: Expressed in terms of “Average amount per transaction,” “Average amount per customer for past four years,” “Lifetime total amount,” etc.
- Recent Customers: Expressed in terms of “Days or weeks since last transaction.”
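As a rough illustration of how these three definitions become machine-readable variables, here is a minimal Python sketch. The transaction records, the `summarize` function and the variable names are hypothetical, made up for this example; a real customer database would have its own layout and its own rules for things like returns and canceled orders.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction records: (customer_id, transaction_date, amount)
transactions = [
    ("A", date(2023, 1, 5), 120.0),
    ("A", date(2023, 6, 20), 80.0),
    ("A", date(2023, 11, 2), 100.0),
    ("B", date(2022, 3, 14), 900.0),
    ("C", date(2023, 12, 1), 40.0),
    ("C", date(2023, 12, 15), 60.0),
]


def summarize(transactions, as_of, window_days=365):
    """Roll transactions up to one row of summary variables per customer."""
    by_customer = defaultdict(list)
    for customer_id, txn_date, amount in transactions:
        by_customer[customer_id].append((txn_date, amount))

    summary = {}
    for customer_id, rows in by_customer.items():
        rows.sort()  # oldest transaction first
        dates = [d for d, _ in rows]
        amounts = [a for _, a in rows]
        # Days between consecutive transactions (empty for one-time buyers)
        gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
        summary[customer_id] = {
            # Frequency
            "txn_count_12m": sum(
                1 for d in dates if 0 <= (as_of - d).days < window_days
            ),
            "txn_count_ltd": len(rows),
            "avg_days_between_txn": sum(gaps) / len(gaps) if gaps else None,
            # Monetary
            "avg_amount_per_txn": sum(amounts) / len(amounts),
            "lifetime_amount": sum(amounts),
            # Recency
            "days_since_last_txn": (as_of - dates[-1]).days,
        }
    return summary


summary = summarize(transactions, as_of=date(2023, 12, 31))
for customer_id, variables in sorted(summary.items()):
    print(customer_id, variables)
```

Note that customer B is the biggest lifetime spender in this toy data, yet has no transactions in the past 12 months. Which of these variables defines “best” is exactly the question the analyst and the decision-makers have to settle together.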
I am sure most young analysts would want requesters to express these terms as I did, using actual variable names, but translating business terms into expressions that machines can understand is the analyst’s job. Also, even when these terms are agreed upon, exactly how high is high enough to be called the “best”? Top 10 percent? Top 100,000 customers? In terms of what, exactly? A cutoff based on some arbitrary dollar amount, like $10,000 per year? Just dollars, or frequency on top of it, too?
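To see how much the answer depends on the cutoff rule, consider this small Python sketch. The customer figures and the three helper functions are hypothetical; only the $10,000-per-year threshold comes from the example above, and the minimum of four transactions is an arbitrary assumption.

```python
import math

# Hypothetical per-customer annual figures: customer_id -> (annual_spend, txn_count)
customers = {
    "C01": (15200.0, 4),
    "C02": (9800.0, 30),
    "C03": (12100.0, 12),
    "C04": (450.0, 2),
    "C05": (10500.0, 1),
    "C06": (3200.0, 18),
    "C07": (700.0, 5),
    "C08": (2500.0, 9),
    "C09": (60.0, 1),
    "C10": (5100.0, 7),
}


def top_percent(customers, pct):
    """Relative rule: the top pct percent of customers ranked by annual spend."""
    n = max(1, math.ceil(len(customers) * pct / 100))
    ranked = sorted(customers, key=lambda c: customers[c][0], reverse=True)
    return set(ranked[:n])


def above_cutoff(customers, min_spend):
    """Absolute rule: everyone at or above a fixed dollar amount."""
    return {c for c, (spend, _) in customers.items() if spend >= min_spend}


def spend_and_frequency(customers, min_spend, min_txns):
    """Combined rule: dollars with a frequency requirement on top."""
    return {
        c
        for c, (spend, txns) in customers.items()
        if spend >= min_spend and txns >= min_txns
    }


best_by_rank = top_percent(customers, 10)
best_by_dollars = above_cutoff(customers, 10000)
best_by_both = spend_and_frequency(customers, 10000, min_txns=4)

print(sorted(best_by_rank))
print(sorted(best_by_dollars))
print(sorted(best_by_both))
```

The three rules select one, three and two customers, respectively, from the same ten records. C02, the most frequent buyer, misses all three lists by $200 in spending. None of these rules is the “right” one; the point is that the choice has to be made explicitly.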
Stephen H. Yu is a world-class database marketer. He has a proven track record in comprehensive strategic planning and tactical execution, effectively bridging the gap between the marketing and technology world with a balanced view obtained from more than 30 years of experience in best practices of database marketing. Currently, Yu is principal and chief product officer at BuyerGenomics. Previously, Yu was the head of analytics and insights at eClerx, and VP, Data Strategy & Analytics at Infogroup. Prior to that, he was the founding CTO of I-Behavior Inc., which pioneered the use of SKU-level behavioral data. “As a long-time data player with plenty of battle experiences, I would like to share my thoughts and knowledge that I obtained from being a bridge person between the marketing world and the technology world. In the end, data and analytics are just tools for decision-makers; let’s think about what we should be (or shouldn’t be) doing with them first. And the tools must be wielded properly to meet the goals, so let me share some useful tricks in database design, data refinement process and analytics.” Reach him at firstname.lastname@example.org.