Chicken or the Egg? Data or Analytics?
When I speak at marketing conferences, talking about this subject of this "model-ready" environment, I always ask if there are statisticians and analysts in the audience. Then I ask what percentage of their time goes into non-statistical activities, such as data preparation and remedying data errors. The absolute majority of them say they spend of 80 percent to 90 percent of their time fixing the data, devoting the rest to the model development work. You don't need me to tell you that something is terribly wrong with this picture. And I am pretty sure that none of those analysts got their PhDs and master's degrees in statistics to spend most of their waking hours fixing the data. Yeah, I know from experience that, in this data business, the last guy who happens to touch the dataset always ends up being responsible for all errors made to the file thus far, but still. No wonder it is often quoted that one of the key elements of being a successful data scientist is the programming skill.
When you provide datasets filled with unstructured, incomplete and/or missing data, diligent analysts will devote their time to remedying the situation and making the best out of what they have received. I myself often tell newcomers that analytics is really about making the best of what you've got. The trouble is that such data preparation work calls for a different set of skills that have nothing to do with statistics or analytics, and most analysts are not that great at programming, nor are they trained for it.
Even if they were able to create a set of sensible variables to play with, here comes the bigger trouble; what they have just fixed is just a "sample" of the database, when the models must be applied to the whole thing later. Modern databases often contain hundreds of millions of records, and no analyst in his or her right mind uses the whole base to develop any models. Even if the sample is as large as a few million records (an overkill, for sure) that would hardly be the entire picture. The real trouble is that no model is useful unless the resultant model scores are available on every record in the database. It is one thing to fix a sample of a few hundred thousand records. Now try to apply that model algorithm to 200 million entries. You see all those interesting variables that analysts created and fixed in the sample universe? All that should be redone in the real database with hundreds of millions of lines.
Stephen H. Yu is a world-class database marketer. He has a proven track record in comprehensive strategic planning and tactical execution, effectively bridging the gap between the marketing and technology world with a balanced view obtained from more than 30 years of experience in best practices of database marketing. Currently, Yu is president and chief consultant at Willow Data Strategy. Previously, he was the head of analytics and insights at eClerx, and VP, Data Strategy & Analytics at Infogroup. Prior to that, Yu was the founding CTO of I-Behavior Inc., which pioneered the use of SKU-level behavioral data. “As a long-time data player with plenty of battle experiences, I would like to share my thoughts and knowledge that I obtained from being a bridge person between the marketing world and the technology world. In the end, data and analytics are just tools for decision-makers; let’s think about what we should be (or shouldn’t be) doing with them first. And the tools must be wielded properly to meet the goals, so let me share some useful tricks in database design, data refinement process and analytics.” Reach him at email@example.com.