Data Mining: Where to Dig First?
In the age of abundant data, extracting insights from the mounds of it often becomes overwhelming, even for seasoned analysts. In the data-mining business, more than half of the struggle is about determining “where to dig first.”
The main job of a modern data scientist is to answer business questions for decision-makers. To do that, they have to be translators between the business world and the technology world. This in-between position often creates a great amount of confusion for aspiring data scientists, as the gaps between business challenges and the elements that make up the answers are very wide, even with all of the toolsets that are supposedly “easy to use.” That’s because insights do not come out of the toolsets automatically.
Business questions are often very high-level or even obscure, such as:
- Let’s try this new feature with the “best” customers
- How do we improve customer “experience”?
- We did lots of marketing campaigns; what worked?
When someone mentions “best” customers, statistically trained analysts jump into the mode of “Yeah! Let’s build some models!” If you are holding a hammer, everything may look like nails. But we are not supposed to build models just because we can. Why should we build a model and, if we do, whom are we going after? What does that word “best” mean to you?
Breaking that word down in mathematically representable terms is indeed the first step for the analyst (along with the decision-makers). That’s because “best” can mean lots of different things.
If the users of the information are in the retail business, in a classical sense, it could mean:
- Frequently Visiting Customers: Expressed in terms of “Number of transactions past 12 months,” “Life-to-date number of transactions,” “Average days between transactions,” “Number of Web visits,” etc.
- Big Spenders: Expressed in terms of “Average amount per transaction,” “Average amount per customer for past four years,” “Lifetime total amount,” etc.
- Recent Customers: Expressed in terms of “Days or weeks since last transaction.”
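To make that translation concrete, here is a minimal sketch of how a few of the descriptors above could be derived from raw transaction records. The record layout, the sample data, the output field names, and the `summarize_customers` helper are all assumptions made for illustration, as is treating “past 12 months” as a 365-day window; a real project would follow whatever definitions the analyst and the decision-makers agree on.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical transaction records: (customer_id, transaction_date, amount).
transactions = [
    ("A", date(2023, 1, 15), 120.0),
    ("A", date(2023, 11, 2), 80.0),
    ("A", date(2024, 5, 20), 100.0),
    ("B", date(2021, 7, 4), 900.0),
    ("C", date(2024, 6, 1), 25.0),
    ("C", date(2024, 6, 15), 35.0),
]


def summarize_customers(transactions, as_of):
    """Collapse transaction rows into one set of descriptors per customer."""
    by_customer = defaultdict(list)
    for customer_id, tx_date, amount in transactions:
        by_customer[customer_id].append((tx_date, amount))

    # Assumption: "past 12 months" means the 365 days ending on `as_of`.
    window_start = as_of - timedelta(days=365)

    summary = {}
    for customer_id, rows in by_customer.items():
        dates = [d for d, _ in rows]
        amounts = [a for _, a in rows]
        summary[customer_id] = {
            # Frequency
            "num_transactions_12m": sum(
                1 for d in dates if window_start < d <= as_of
            ),
            "num_transactions_ltd": len(rows),
            # Spending
            "avg_amount_per_transaction": sum(amounts) / len(amounts),
            "lifetime_total_amount": sum(amounts),
            # Recency
            "days_since_last_transaction": (as_of - max(dates)).days,
        }
    return summary


summary = summarize_customers(transactions, as_of=date(2024, 6, 30))
for customer_id, row in sorted(summary.items()):
    print(customer_id, row)
```

Even in this toy data, the three views disagree: customer B has the largest lifetime total but has not bought anything in years, while customer C is the most recent buyer and spends the least.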
I am sure most young analysts would want requesters to express these terms the way I did, using actual variable names, but translating business terms into expressions that machines can understand is indeed the analyst’s job. Also, even when these terms are agreed upon, exactly how high is high enough to be called the “best”? Top 10 percent? Top 100,000 customers? In terms of what, exactly? A cutoff based on some arbitrary dollar amount, like $10,000 per year? Just dollars, or frequency on top of it, too?
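Those questions are not rhetorical; each answer selects a different group of people. The sketch below applies a percentage rule, a fixed-count rule, a dollar cutoff, and a dollars-plus-frequency rule to the same customers. The customer data, the helper names, and the specific thresholds (top five, 12 transactions) are made up for illustration only.

```python
import math

# Hypothetical annual spending per customer, in dollars.
annual_spend = {
    "A": 15200.0, "B": 9800.0, "C": 12000.0, "D": 450.0, "E": 3100.0,
    "F": 700.0, "G": 10500.0, "H": 60.0, "I": 2200.0, "J": 5400.0,
}

# Hypothetical number of transactions in the same year.
annual_transactions = {
    "A": 3, "B": 40, "C": 14, "D": 2, "E": 25,
    "F": 1, "G": 18, "H": 1, "I": 9, "J": 30,
}


def top_n(values, n):
    """The n customers with the highest values."""
    ranked = sorted(values, key=values.get, reverse=True)
    return set(ranked[:n])


def top_fraction(values, fraction):
    """Customers in the top `fraction` (e.g. 0.10) when ranked by value."""
    n = max(1, math.ceil(len(values) * fraction))
    return top_n(values, n)


def at_or_above(values, cutoff):
    """Customers whose value meets or exceeds a fixed cutoff."""
    return {customer for customer, v in values.items() if v >= cutoff}


top_decile = top_fraction(annual_spend, 0.10)    # "Top 10 percent"
top_five = top_n(annual_spend, 5)                # "Top N customers"
big_spenders = at_or_above(annual_spend, 10000)  # "$10,000 per year"
# "Just dollars, or frequency on top of it, too?"
spend_and_frequency = big_spenders & at_or_above(annual_transactions, 12)

print("Top 10 percent by spend:", sorted(top_decile))
print("Top 5 by spend:", sorted(top_five))
print("At least $10,000:", sorted(big_spenders))
print("At least $10,000 and 12+ transactions:", sorted(spend_and_frequency))
```

With this sample data, the four rules return one, five, three, and two customers respectively, and the biggest spender drops out as soon as frequency is required. Which list counts as the “best” depends on what the business intends to do with it.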