Even in this age of ubiquitous computing, where all kinds of data constantly flow around all of us through every conceivable electronic device, knowing everything about everyone all the time is just not possible. Some say that marketers now collect more data in an hour than they did in a year in the '70s. But linking all those data points to a known individual (or even to an anonymous match key) remains a challenge, thanks to privacy concerns, data ownership boundaries and the frequent lack of a common key by which data can be combined. Statisticians always want more variables for better predictability, but, as in the olden days, modeling is still about "making the best of what we know."
Then, what to do with the "unknowns"? Do we simply dismiss them and move on? Treating missing data properly can boost targeting efficiency, because not all missing data are created equal, and missing values often have interesting stories behind them. For example, certain variables may be missing only for the very rich and the very poor, whose residences may not be as exposed to data collectors as everyone else's. That in itself is a story. Some data may be missing in certain geographic regions or for certain age groups. "Not" having access to broadband may mean something interesting, too.
Filling in the Blanks
Like other targeting challenges, missing-data management starts with proper database design. Even at the data collection stage, the reasons why certain data points are missing should not be ignored. If you are dealing with numeric data, such as dollars, frequency counts or dates, why are they missing? Is it because they are truly unknown and incalculable (no transaction to deal with), or is it simply a matter of mismatches among different data marts and sources? Database managers may not always know the actual reason a value is missing, but they should never blindly fill missing values with "0"s. Zeros must be reserved for known and verified zeros.
Users may agree, for instance, that "true" missing values will be stored as ".". If a variable such as "number of children in the household" is missing, data managers should never record it as zero unless it is confirmed that the household includes no children. Further, one should assign separate codes for "missing due to a non-match to the external data source" (i.e., a matching issue) vs. "matched to the external source but still missing" (i.e., even your data vendor doesn't know). After all, not matching to a professional data compiler's list may mean something, and the missing-value denotation itself may act as an independent predictor in models.
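As a minimal sketch of the idea, here is how such reason codes might be carried in a Python/pandas workflow. The column names, the compiler-match flag and the code labels below are purely illustrative assumptions, not any particular database's convention: verified zeros stay zeros, and the reason a value is missing lives in a companion status field.

```python
import pandas as pd
import numpy as np

# Hypothetical household file; column names are illustrative only.
df = pd.DataFrame({
    "household_id": [101, 102, 103, 104],
    "num_children": [2, 0, np.nan, np.nan],            # raw value from the source feed
    "matched_to_compiler": [True, True, False, True],   # did the record match the external list?
})

# Keep verified zeros as zeros, and tag the *reason* a value is missing
# in a companion column instead of overwriting the value itself.
def missing_reason(row):
    if pd.notna(row["num_children"]):
        return "KNOWN"            # includes verified zeros
    if not row["matched_to_compiler"]:
        return "NO_MATCH"         # never matched to the external data source
    return "MATCHED_BUT_MISSING"  # matched, but the vendor does not know either

df["num_children_status"] = df.apply(missing_reason, axis=1)
print(df)
```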
For categorical (i.e., non-numeric) data, similar rules apply. Values such as a plain blank, "N/A", "0" or "." may be used to represent the different reasons values are missing. Once coded separately, these values often end up playing distinct roles in subsequent models, moving together with other known values.
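A hedged illustration of the same idea for a categorical field, again with made-up values and labels: each flavor of "missing" is mapped to its own explicit level, so it can enter a model as a candidate predictor like any other value.

```python
import pandas as pd

# Illustrative categorical field with distinct codes for different kinds of "missing".
occupation = pd.Series(
    ["engineer", "teacher", "N/A", ".", "", "teacher"],
    name="occupation",
)

# Map each flavor of missing to its own explicit level so it survives into the model.
clean = occupation.replace({"": "BLANK", ".": "TRUE_MISSING", "N/A": "NOT_APPLICABLE"})

# One-hot encode; the missing-reason levels become candidate predictors like any other value.
dummies = pd.get_dummies(clean, prefix="occ")
print(dummies.head())
```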
Accounting for Known Unknowns
Modelers typically impute values when they encounter missing data, and there are many methods for doing so; indeed, hardly any two statisticians completely agree on imputation methodology. Nevertheless, it is important for an organization to have a unified rule for each variable regarding its imputation method. Will it be a simple average of the non-missing values? If so, what is the minimum required fill rate? Or will the variable be populated with some type of predictive model score? Once the dust settles, all data fields must be treated with pre-defined rules during the database update process, so that every analyst works from a common starting point. Inconsistent imputation methods often lead to inconsistent results.
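One way such a pre-defined rule might look, sketched in Python with assumed values: the 70-percent minimum fill rate and the constant fallback are purely illustrative placeholders, and a real rule book would define them per variable.

```python
import pandas as pd
import numpy as np

# Hypothetical pre-defined rule: impute with the mean of non-missing values
# only if the fill rate clears a minimum threshold; otherwise fall back to a
# constant (or, in practice, a model-based estimate).
MIN_FILL_RATE = 0.70
FALLBACK_VALUE = -1  # placeholder; a real rule book would define this per variable

def impute_numeric(series: pd.Series) -> pd.Series:
    fill_rate = series.notna().mean()
    if fill_rate >= MIN_FILL_RATE:
        return series.fillna(series.mean())
    return series.fillna(FALLBACK_VALUE)

df = pd.DataFrame({"avg_order_value": [120.0, np.nan, 80.0, np.nan, 95.0]})
df["avg_order_value_imputed"] = impute_numeric(df["avg_order_value"])
print(df)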
If, by any chance, individual statisticians end up with the freedom to devise their own ways to fill in the blanks, then their model scoring code must include the missing-value imputation algorithms as well. It is also important that non-statistical staff be educated about the imputation methods, so that everyone with access to the database shares a common understanding. That list may include external data providers.
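For example, a scoring routine might carry its own imputation rules alongside the model coefficients, so that anyone who runs it treats missing values exactly as the model builder did. The rules, coefficients and logistic form below are hypothetical, meant only to show the packaging.

```python
import pandas as pd
import numpy as np

# Hypothetical artifacts saved alongside the model by the statistician who built it.
IMPUTATION_RULES = {"income": 52000.0, "num_orders": 0.0}    # value to use when missing
COEFFICIENTS = {"intercept": -1.2, "income": 0.00003, "num_orders": 0.15}

def score(records: pd.DataFrame) -> pd.Series:
    """Apply the model's own imputation rules before computing the score."""
    x = records.copy()
    for col, fill_value in IMPUTATION_RULES.items():
        x[col] = x[col].fillna(fill_value)
    linear = COEFFICIENTS["intercept"]
    for col in IMPUTATION_RULES:
        linear = linear + COEFFICIENTS[col] * x[col]
    return 1.0 / (1.0 + np.exp(-linear))   # logistic score, as an example

new_records = pd.DataFrame({"income": [60000, np.nan], "num_orders": [np.nan, 4]})
print(score(new_records))
```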
In any case, database managers should constantly monitor the fill rate of each variable, and those figures must be compared with the ones from previous updates. Model shelf life is often greatly affected by fluctuations in missing rates. Conversely, it is prudent to check the missing percentage of each model variable whenever sudden changes in model group distribution are observed.
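A simple way to operationalize that check, sketched with made-up figures and an arbitrary alert threshold, is to compare current fill rates against those saved from the previous update and flag any variable that moved too far.

```python
import pandas as pd
import numpy as np

# Hypothetical fill rates saved from the previous database update, kept for comparison.
previous_fill_rates = pd.Series({"income": 0.82, "num_children": 0.55, "occupation": 0.90})

current = pd.DataFrame({
    "income": [50000, np.nan, 72000, np.nan],
    "num_children": [1, np.nan, np.nan, np.nan],
    "occupation": ["teacher", "engineer", None, "nurse"],
})

current_fill_rates = current.notna().mean()
drift = (current_fill_rates - previous_fill_rates).abs()

# Flag variables whose fill rate moved more than an arbitrary 10-point threshold.
ALERT_THRESHOLD = 0.10
print(drift[drift > ALERT_THRESHOLD])
```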
These few guidelines regarding missing data may add more flavor to statistical models and, in turn, prolong their predictive power. After all, missing data can be very meaningful when treated properly.
Stephen H. Yu is a world-class database marketer. He has a proven track record in comprehensive strategic planning and tactical execution, effectively bridging the gap between the marketing and technology world with a balanced view obtained from more than 30 years of experience in best practices of database marketing. Currently, Yu is president and chief consultant at Willow Data Strategy. Previously, he was the head of analytics and insights at eClerx, and VP, Data Strategy & Analytics at Infogroup. Prior to that, Yu was the founding CTO of I-Behavior Inc., which pioneered the use of SKU-level behavioral data. “As a long-time data player with plenty of battle experiences, I would like to share my thoughts and knowledge that I obtained from being a bridge person between the marketing world and the technology world. In the end, data and analytics are just tools for decision-makers; let’s think about what we should be (or shouldn’t be) doing with them first. And the tools must be wielded properly to meet the goals, so let me share some useful tricks in database design, data refinement process and analytics.” Reach him at firstname.lastname@example.org.