Missing Data Can Be Meaningful
The important matter is not the rules or methodologies, but the consistency of them throughout the organization and the databases. That way, all users and analysts will have the same starting point, no matter what the analytical purposes are. There could be a long debate in terms of what methodology should be employed and deployed. But once the dust settles, all data fields should be treated by pre-determined rules during the database update processes, avoiding costly errors in the downstream. All too often, inconsistent imputation methods lead to inconsistent results.
If, by some chance, individual statisticians end up with freedom to come up with their own ways to fill in the blanks, then the model-scoring code in question must include missing value imputation algorithms without an exception, granted that such practice will elongate the model application processes and significantly increase chances for errors. It is also important that non-statistical users should be educated about the basics of missing data and associated imputation methods, so that everyone who has access to the database shares a common understanding of what they are dealing with. That list includes external data providers and partners, and it is strongly recommended that data dictionaries must include employed imputation rules wherever applicable.
Keep an Eye on the Missing Rate
Often, we get to find out that the missing rate of certain variables is going out of control because models become ineffective and campaigns start to yield disappointing results. Conversely, it can be stated that fluctuations in missing data ratios greatly affect the predictive power of models or any related statistical works. It goes without saying that a consistent influx of fresh data matters more than the construction and the quality of models and algorithms. It is a classic case of a garbage-in-garbage-out scenario, and that is why good data governance practices must include a time-series comparison of the missing rate of every critical variable in the database. If, all of a sudden, an important predictor's fill-rate drops below a certain point, no analyst in this world can sustain the predictive power of the model algorithm, unless it is rebuilt with a whole new set of variables. The shelf life of models is definitely finite, but nothing deteriorates effectiveness of models faster than inconsistent data. And a fluctuating missing rate is a good indicator of such an inconsistency.
Stephen H. Yu is a world-class database marketer. He has a proven track record in comprehensive strategic planning and tactical execution, effectively bridging the gap between the marketing and technology world with a balanced view obtained from more than 30 years of experience in best practices of database marketing. Currently, Yu is president and chief consultant at Willow Data Strategy. Previously, he was the head of analytics and insights at eClerx, and VP, Data Strategy & Analytics at Infogroup. Prior to that, Yu was the founding CTO of I-Behavior Inc., which pioneered the use of SKU-level behavioral data. “As a long-time data player with plenty of battle experiences, I would like to share my thoughts and knowledge that I obtained from being a bridge person between the marketing world and the technology world. In the end, data and analytics are just tools for decision-makers; let’s think about what we should be (or shouldn’t be) doing with them first. And the tools must be wielded properly to meet the goals, so let me share some useful tricks in database design, data refinement process and analytics.” Reach him at email@example.com.