Missing Data Can Be Meaningful
I have seen too many cases where missing numeric values are filled with zeros, and I must say that such a practice is definitely frowned upon. If you have to pick just one takeaway from this article, that's it. As I emphasized, not all missing values are the same, and zero is not the way to record them. Zeros should never represent a lack of information.
Take the example of a popular demographic variable, "Number of Children in the Household." This is a very predictive variable, not just for purchase behavior of children's products, but for many other things. It is a simple number, but it should never be treated as a simple variable, because in this case lack of information is not evidence of non-existence. Let's say that you are purchasing this data from a third-party data compiler (or a data broker). If you don't see a positive number in that field, it could be because:
- The household in question really does not have a child;
- Even the data collector doesn't have the information; or
- The data collector has the information, but the household record did not match the vendor's record for some reason.
If that field contains a number like 1, 2 or 3, that's easy, as it represents the number of children in that household. But zero should be reserved for cases where the data collector has positive confirmation that the household in question indeed does not have any children. If the value is unknown, it should be marked as blank or as "." (many statistical software packages, such as SAS, record missing values this way). Another option is "U," though an alpha character should not be in a numeric field.
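To make the difference concrete, here is a minimal Python sketch. The sample values and the function names (`mean_known`, `mean_zero_filled`) are made up for illustration, and `None` stands in for the blank or "." marker. It compares an average computed over known values only with one computed after unknowns have been filled with zeros.

```python
from statistics import mean

# Hypothetical sample: None means "unknown", 0 means "confirmed no children".
children_counts = [2, 0, None, 1, None, 3]

def mean_known(values):
    """Average over known values only; unknowns (None) are skipped."""
    known = [v for v in values if v is not None]
    return mean(known) if known else None

def mean_zero_filled(values):
    """The practice warned against: unknowns are silently turned into 0."""
    return mean([0 if v is None else v for v in values])

avg_known = mean_known(children_counts)              # averages 2, 0, 1, 3
avg_zero_filled = mean_zero_filled(children_counts)  # averages 2, 0, 0, 1, 0, 3

print(avg_known, avg_zero_filled)
```

With this sample, the zero-filled average comes out lower than the average of known values, because two households with unknown status are counted as childless.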
If it is a case of non-match to the external data source, then there should be a separate indicator for it. The fact that the record did not match a professional data compiler's list may mean something in itself. I've seen cases where such non-match indicators make it into model algorithms alongside other valid data, for instance where a missing-income indicator displays the same directional tendency as high-income households.
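One way to keep the three situations apart is to carry the match flag as its own field and derive a status from it. The following is only a sketch under assumed names: `HouseholdRecord`, `matched_to_vendor`, `children_status`, and the status labels are hypothetical, not taken from any vendor's layout.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HouseholdRecord:
    household_id: str
    matched_to_vendor: bool      # did the record match the compiler's file?
    num_children: Optional[int]  # None = unknown, 0 = confirmed none

def children_status(rec: HouseholdRecord) -> str:
    """Classify a record without conflating non-match, unknown, and zero."""
    if not rec.matched_to_vendor:
        return "no_match"
    if rec.num_children is None:
        return "unknown"
    if rec.num_children == 0:
        return "confirmed_none"
    return "has_children"

def status_indicators(rec: HouseholdRecord) -> dict:
    """One 0/1 indicator per status, so a model can weigh each case separately."""
    status = children_status(rec)
    labels = ("no_match", "unknown", "confirmed_none", "has_children")
    return {label: int(status == label) for label in labels}

# Hypothetical records covering each case.
records = [
    HouseholdRecord("A", True, 2),
    HouseholdRecord("B", True, 0),
    HouseholdRecord("C", True, None),
    HouseholdRecord("D", False, None),
]
statuses = [children_status(r) for r in records]
print(statuses)
```

Because the non-match and unknown cases get their own indicators, an analyst can test whether either one carries signal, rather than having both disappear into a zero.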
Stephen H. Yu is a world-class database marketer. He has a proven track record in comprehensive strategic planning and tactical execution, effectively bridging the gap between the marketing and technology world with a balanced view obtained from more than 30 years of experience in best practices of database marketing. Currently, Yu is president and chief consultant at Willow Data Strategy. Previously, he was the head of analytics and insights at eClerx, and VP, Data Strategy & Analytics at Infogroup. Prior to that, Yu was the founding CTO of I-Behavior Inc., which pioneered the use of SKU-level behavioral data. “As a long-time data player with plenty of battle experiences, I would like to share my thoughts and knowledge that I obtained from being a bridge person between the marketing world and the technology world. In the end, data and analytics are just tools for decision-makers; let’s think about what we should be (or shouldn’t be) doing with them first. And the tools must be wielded properly to meet the goals, so let me share some useful tricks in database design, data refinement process and analytics.” Reach him at firstname.lastname@example.org.