Missing Data Can Be Meaningful
And what if the data compiler in question boldly inputs zeros for unknowns? Take a deep breath, fire the vendor, and don't deal with the company again; it is a sign that its representatives do not know what they are doing in the data business. I have done so in the past, and you can do it, too. (More on how to shop for external data in future articles.)
For non-numeric categorical data, similar rules apply. Some values could be truly "blank," and those should be treated separately from "Unknown" or "Not Available." As an exercise, let's list all the kinds of missing values that can show up in codes, texts or other character fields:
- " "—blank or "null"
- "N/A," "Not Available," or "Not Applicable"
- "Other"—if it originates from some type of multiple-choice survey or pull-down menu
- "Not Answered" or "Not Provided"—This indicates that the subjects were asked, but they refused to answer. Very different from "Unknown."
- "0"—In this case, the answer can be expressed in numbers. Again, only for known zeros.
- "Non-match"—Not matched to other internal or external data sources
It is entirely possible that all these values are highly correlated with each other and move in the same predictive direction. However, there are many cases where they do not. And if they are combined into just one value, such as zero or blank, we will never be able to detect such nuances. In fact, I've seen many cases where one or more of these missing indicators move together with other "known" values in models. Again, missing data have meanings, too.
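To make the point concrete, here is a minimal Python sketch of keeping each kind of missing value as its own category instead of collapsing them all into zero or blank. The label names (`BLANK`, `NOT_ANSWERED` and so on), the `classify` helper and the sample responses are all hypothetical, made up for illustration; a real system would use whatever codes its data dictionary defines.

```python
from collections import Counter

# Hypothetical labels for each kind of missing value (illustrative only).
# "N/A", "Not Available" and "Not Applicable" share one label here, simply
# because the list above groups them together.
MISSING_KINDS = {
    "": "BLANK",
    "null": "BLANK",
    "n/a": "NOT_AVAILABLE",
    "not available": "NOT_AVAILABLE",
    "not applicable": "NOT_AVAILABLE",
    "other": "OTHER",
    "not answered": "NOT_ANSWERED",
    "not provided": "NOT_ANSWERED",
    "unknown": "UNKNOWN",
    "non-match": "NON_MATCH",
}


def classify(raw):
    """Return (kind, value), where kind is 'KNOWN' or a missing-value label."""
    if raw is None:
        return ("BLANK", None)
    text = str(raw).strip()
    kind = MISSING_KINDS.get(text.lower())
    if kind is not None:
        return (kind, None)
    # Anything else, including a real "0", is treated as a known value.
    return ("KNOWN", text)


# Made-up survey responses for illustration.
responses = ["2", "0", " ", "N/A", "Not Answered", "Unknown", "Non-match", None, "Other"]
classified = [classify(r) for r in responses]

# Count each kind separately so that no nuance is lost before modeling.
counts = Counter(kind for kind, _ in classified)
for kind, n in sorted(counts.items()):
    print(kind, n)
```

Because a known "0" stays in the `KNOWN` bucket and every missing kind keeps its own label, a model can later test whether, say, "Not Answered" behaves differently from "Unknown."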
Filling in the Gaps
Nonetheless, missing data do not have to be left as missing, blank or unknown all the time. With statistical modeling techniques, we can fill in the gaps with projected values. You didn't think that all those data compilers really knew the income level of every household in the country, did you? It is not a big secret that many of those figures are modeled from other available data.
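As a rough illustration of filling gaps with projected values, the sketch below fits a one-variable least-squares line on the records where income is known and uses it to project income where it is missing. This is only a toy stand-in for the far richer models that data compilers actually use; the field names (`home_value`, `income`, `income_imputed`) and the numbers (in thousands) are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up household records; None marks an unknown income.
records = [
    {"home_value": 100, "income": 40},
    {"home_value": 200, "income": 60},
    {"home_value": 300, "income": 80},
    {"home_value": 250, "income": None},
]

# Fit only on the records where income is actually known.
known = [r for r in records if r["income"] is not None]
slope, intercept = fit_line(
    [r["home_value"] for r in known],
    [r["income"] for r in known],
)

# Fill the gaps, and flag which values are modeled rather than observed.
filled = []
for r in records:
    row = dict(r)
    row["income_imputed"] = r["income"] is None
    if row["income_imputed"]:
        row["income"] = slope * r["home_value"] + intercept
    filled.append(row)

for row in filled:
    print(row)
```

Keeping the `income_imputed` flag alongside the projected value preserves the fact that the original was missing, so the "missing" signal itself remains available to later models.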
Stephen H. Yu is a world-class database marketer. He has a proven track record in comprehensive strategic planning and tactical execution, effectively bridging the gap between the marketing and technology world with a balanced view obtained from more than 30 years of experience in best practices of database marketing. Currently, Yu is president and chief consultant at Willow Data Strategy. Previously, he was the head of analytics and insights at eClerx, and VP, Data Strategy & Analytics at Infogroup. Prior to that, Yu was the founding CTO of I-Behavior Inc., which pioneered the use of SKU-level behavioral data. “As a long-time data player with plenty of battle experiences, I would like to share my thoughts and knowledge that I obtained from being a bridge person between the marketing world and the technology world. In the end, data and analytics are just tools for decision-makers; let’s think about what we should be (or shouldn’t be) doing with them first. And the tools must be wielded properly to meet the goals, so let me share some useful tricks in database design, data refinement process and analytics.” Reach him at email@example.com.