Even AI Needs Clean Data in Order to Be the Shiny Object
Users are quickly realizing that investing in AI is not the end of the road. Then again, in this analytics journey, there really is no end anyway; much like the scientific journey, it is a constant series of hypotheses, tests, and course corrections. And now, I’ll explain why that means even AI needs clean data.
If there were a definitive book on this subject — many have asked me about it — it would look more like a long series of case studies than a one-size-fits-all roadmap. Why? Because prescribing analytics is much like a doctor’s work. It depends as much on the unique situation of the patient as on the list of available remedies.
That is the main reason why one cannot just install AI and call it a day. Who’d give it a purpose, guide it, and constantly fine-tune it? Not itself, for sure.
Then there is a question about what goes into it. AI — or any type of analytics tool, for that matter — depends on clean and error-free data. If the data are dirty and unusable, you may end up automating inadequate decision-making processes, getting wrong answers really fast. I’d say that would be worse than not having any answer at all.
So far, you may say I am just stating the obvious here. Of course, AI or machine learning requires clean and error-free data. The real trouble is that such data preparation often takes up as much as 80% (if not more) of the whole process of applying data-based intelligence to decision-making. In fact, users are finding out that the algorithmic part of the equation is the simplest to automate. The data refinement process is far more complicated than that, as it really depends on the shape of the available data. And some data sets are really messy (hence the title of my series in this fine publication, “Big Data, Small Data, Clean Data, Messy Data”).
So, why aren’t data readily usable?
- Data Are in Silos: This is so common that “siloed data” is a term we routinely use in meeting rooms. Simply put, if the data are locked up somewhere, they won’t be of much use to anyone. Worse, each silo may sit on its own platform, with data formats incompatible with the others.
- Data Are in One Place, But Not Connected: Putting the data in one place isn’t enough if they are not properly connected. Let’s say an organization is pursuing the coveted “Customer 360” (or, more properly, a “360-degree view of a customer”) for personalized marketing. The first thing to do is to define what a “person” means in the eyes of the machine and the algorithms. It could be any form of PII, or even biometric data, through which all related data would be matched and consolidated. If the online and offline shopping histories of a person aren’t connected properly, algorithms will treat them as two separate entities, devaluing the target customer. This is just one example; all kinds of analytics — whether forecasting, segmentation, or product analysis — perform better with more than one type of data, and those data should be in one place to be useful.
- Data Are Connected, But Many Fields Are Wrong or Empty: So what if the data are merged in one place? If the data are mostly empty or incorrect, they will be worse than having none at all. Good luck forecasting or predicting anything with data fields that have really low fill rates. Unfortunately, we encounter tons of missing values in “Customer 360” work. What we call Big Data turns out to have lots of holes in it once everything is lined up around the target (i.e., it is nearly impossible to know everything about everyone). Plus, remember that most modern databases record and maintain only what is available; in predictive analytics, what we don’t know is equally important.
- Data Are There, But They Are Not Readily Usable, as They Are in Free-Form Formats: You may have the data, but they may need serious standardization, refinement, categorization, and transformation to be useful. Many times, I have encountered hundreds, at times over a thousand, offer and promotion codes. To find out “what marketing efforts worked,” we would have to go through some serious data categorization to make them useful. (Refer to “The Art of Data Categorization.”) This is just one example of many. Too often, analytics work gets stuck in the middle of too much free-form, unstructured data.
- Data Are Usable, But They Are One-Dimensional: Bits and pieces of data, even if clean and accurate, do not provide a holistic portrait of target individuals (if the job is about 1:1 marketing). Most predictive analytics work requires diverse data of different natures, and only after proper data consolidation and summarization can we obtain a multi-dimensional view. So-called relational databases and unstructured databases do not provide such a perspective without data summarization (or de-normalization) processes, as the entities of such databases are just lists of events and transactions (e.g., on such and such date, this individual clicked some email link and bought a particular item for a certain amount).
- Data Are Cleaned, Consolidated, and Summarized, But There Is No Built-in Intelligence: To predict what a target individual is interested in, data players must rearrange the data to describe the person, not just events or transactions. Why do you think even large retailers, like Amazon, treat you as if you are only about the very last transaction, sending “likes” of the last item you purchased while ignoring years of interaction history? Because their data are not describing “you” as a target. And you are not just a sum of past transactions, either. For instance, your days between purchases in the home electronics category may be far greater than those in the apparel category, yet show higher average spending than apparel does. This type of insight only comes out when the data are summarized properly to describe the buyer, not each transaction. Further, summarized data should be in the form of answers to questions, acting as building blocks of predictive scores. Intelligent variables always increase the predictive power of models, machine-based or not.
- Data Variables Include Intelligence, But It Is Still Difficult to Derive Insights: Lists of intelligent variables are just the basic necessities for advanced analytics, which would then lead us to deeper, actionable insights. Even statisticians and analysts require a long training period to derive meaning from seemingly beautiful charts and effectively develop stories around them. Yes, we can see that certain product sales went down, even with heavy promotion. But what does that really mean, and what should we do about it? For a machine to catch up with that level of storytelling, the data had best be served on a silver platter, in pristine condition, first. Changing assumptions based on “what is not there” or “what looks suspicious” is still in the realm of human intuition. Machines, for now, will read the results as if every bit of input data is correct and carries equal weight.
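To make the summarization and intelligent-variable points above concrete, here is a minimal sketch in Python (standard library only; the record layout, field names, and values are hypothetical, purely for illustration). It rolls transaction-level rows for one buyer up into a buyer-level profile, deriving average spend and average days between purchases per category:

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical transaction-level records: each row describes an event,
# not a person. In practice you would first group rows by buyer ID.
transactions = [
    {"buyer_id": "A", "category": "electronics", "date": date(2023, 1, 5),  "amount": 400.0},
    {"buyer_id": "A", "category": "electronics", "date": date(2023, 7, 5),  "amount": 600.0},
    {"buyer_id": "A", "category": "apparel",     "date": date(2023, 3, 1),  "amount": 40.0},
    {"buyer_id": "A", "category": "apparel",     "date": date(2023, 3, 15), "amount": 60.0},
]

def summarize_buyer(rows):
    """Roll event-level rows up into buyer-level 'intelligent variables'."""
    by_cat = defaultdict(list)
    for r in rows:
        by_cat[r["category"]].append(r)
    profile = {}
    for cat, items in by_cat.items():
        dates = sorted(r["date"] for r in items)
        gaps = [(b - a).days for a, b in zip(dates, dates[1:])]
        profile[cat] = {
            "avg_spend": mean(r["amount"] for r in items),
            "avg_days_between_purchases": mean(gaps) if gaps else None,
        }
    return profile

profile = summarize_buyer(transactions)
```

With this toy data, the profile shows the pattern described above: electronics purchases are far less frequent than apparel purchases, yet carry a much higher average spend. That statement describes the buyer, which is exactly what a raw transaction table cannot show on its own.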
There are schools of thought that machines should be able to take raw data in any form and somehow spit out answers for us mortals. But I do not subscribe to such a brute-force approach. Even if there is no human intervention in the data refinement process, machines will have to clean the data in steps, as we have been doing. Simply put, a machine that is really good at identifying target individuals will be trained separately from one designed for prediction of any kind.
So, what does clean and useful data mean? Just reverse the list above. In summary, good data must be:
- Free from silos
- Properly connected, if coming from disparate sources
- Free from errors and too many missing values (i.e., must have good coverage)
- Readily usable by non-specialists without having to manipulate them extensively
- Multi-dimensional as a result of proper data summarization
- In forms of variables with built-in intelligence
- Presented in ways that provide insights, beyond a simple list of data points
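As a quick sanity check on the “good coverage” point in the list above, fill rates are easy to measure. A minimal Python sketch, with hypothetical field names and placeholder values:

```python
def fill_rates(records, fields):
    """Share of records carrying a usable (non-empty) value per field."""
    n = len(records)
    return {
        f: sum(1 for r in records if r.get(f) not in (None, "", "N/A")) / n
        for f in fields
    }

# Hypothetical customer records with gaps, as real data always have.
customers = [
    {"email": "a@example.com", "age": 34},
    {"email": "", "age": None},
    {"email": "b@example.com", "age": "N/A"},
]
rates = fill_rates(customers, ["email", "age"])
```

A field with a fill rate this low (one-third for `age` here) is a warning sign before any modeling begins; what counts as an “empty” marker varies by source, so the exclusion list is an assumption to adjust per data set.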
Then, what are the steps of the data refinement process? Again, if I may summarize the key steps out of the list above:
- Data collection (from various sources)
- Data consolidation (around the key object, such as individual target)
- Data hygiene and standardization
- Data categorization
- Data summarization
- Creation of intelligent variables
- Data visualization and/or modeling for business insights
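The steps above can be sketched as a simple pipeline. Everything below — the match key (a lowercased email), the promo-code rules, the record layouts — is a hypothetical illustration of the order of operations, not a prescription:

```python
import re
from collections import defaultdict

# Hypothetical raw records collected from two silos.
online = [{"email": "jane@example.com", "promo": "SPRING20-EMAIL", "amount": 120.0}]
offline = [{"email": "JANE@EXAMPLE.COM ", "promo": "spring 20 instore", "amount": 80.0}]

def standardize(rec):
    """Hygiene/standardization: trim and lowercase the match key."""
    rec = dict(rec)
    rec["email"] = rec["email"].strip().lower()
    return rec

def categorize_promo(code):
    """Categorization: collapse free-form promo codes into a few buckets."""
    if re.search(r"spring\s*-?\s*20", code, re.IGNORECASE):
        return "spring_sale"
    return "other"

def consolidate(*sources):
    """Consolidation and summarization around the individual target."""
    profiles = defaultdict(lambda: {"total_spend": 0.0, "promos": set()})
    for source in sources:
        for rec in map(standardize, source):
            p = profiles[rec["email"]]
            p["total_spend"] += rec["amount"]
            p["promos"].add(categorize_promo(rec["promo"]))
    return dict(profiles)

profiles = consolidate(online, offline)
```

Real identity resolution takes far more than a cleaned-up email, of course, but the sequence is the point: standardize first, categorize the free-form fields, then consolidate everything around the individual before any modeling or visualization.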
I have covered all of these steps in detail through this column over the years. Nevertheless, I wanted to share them again at a high level, as the list will serve as a checklist of sorts. Why? Because I see too many organizations — even advanced ones — that miss whole categories of necessary activities. How many times have I seen unstructured and uncategorized data, and how many times have I seen very clean data, but only at an event and transaction level? How can anyone predict a target individual’s future behavior that way, with or without the help of machines?
The No. 1 reason why AI or machine learning does not reach its full potential is inadequate input data. Imagine pouring unrefined oil into a brand-new Porsche as fuel or lubricant. If the engine stalls, is that the car’s fault? To that point, please remember that even machines require clean and organized data. And if you are about to have machines do the clean-up, remember, too, that machines are not that smart (yet), and they work better when trained for a specific task, such as pattern recognition (for data categorization).
One last parting thought: I am not at all saying that one must wait for a perfect set of data. Such a day will never come. Errors are inevitable, and some data will be missing. There will be all kinds of collection problems, and the limitations of data collection mechanisms cannot be fully overcome, thanks to those annoying humans who don’t comply well with the system. Or it could be that the target individual simply has not created an event in the category yet (i.e., data will be missing for the Home Electronics category if the buyer in question simply did not do anything in that category).
So, collect and clean the data as much as possible, but don’t pursue 100%, either. Analytics — with or without machines — has always been about making the most of what we have. Leave it at “good enough,” though a machine wouldn’t understand what that means.
Stephen H. Yu is a world-class database marketer. He has a proven track record in comprehensive strategic planning and tactical execution, effectively bridging the gap between the marketing and technology world with a balanced view obtained from more than 30 years of experience in best practices of database marketing. Currently, Yu is president and chief consultant at Willow Data Strategy. Previously, he was the head of analytics and insights at eClerx, and VP, Data Strategy & Analytics at Infogroup. Prior to that, Yu was the founding CTO of I-Behavior Inc., which pioneered the use of SKU-level behavioral data. “As a long-time data player with plenty of battle experiences, I would like to share my thoughts and knowledge that I obtained from being a bridge person between the marketing world and the technology world. In the end, data and analytics are just tools for decision-makers; let’s think about what we should be (or shouldn’t be) doing with them first. And the tools must be wielded properly to meet the goals, so let me share some useful tricks in database design, data refinement process and analytics.” Reach him at email@example.com.