Big Data Must Get Smaller
Like many folks who worked in the data business for a long time, I don't even like the words "Big Data." Yeah, data is big now, I get it. But so what? Faster and bigger have been the theme in the computing business since the first calculator was invented. In fact, I don't appreciate the common definition of Big Data that is often expressed in the three Vs: volume, velocity and variety. So, if any kind of data are big and fast, it's all good? I don't think so. If you have lots of "dumb" data all over the place, how does that help you? Well, as much as all the clutter that's been piled on in your basement since 1971. It may yield some profit on an online auction site one day. Who knows? Maybe some collector will pay good money for some obscure Coltrane or Moody Blues albums that you never even touched since your last turntable (Ooh, what is that?) died on you. Those oversized album jackets were really cool though, weren't they?
Seriously, the word "Big" only emphasizes the size element, and that is a sure way to miss the essence of the data business. And many folks are missing even that little point by calling all decision-making activities that involve even small-sized data "Big Data." It is entirely possible that this data stuff seems all new to someone, but the data-based decision-making process has been with us for a very long time. If you use that "B" word to differentiate old-fashioned data analytics of yesteryear from ridiculously large datasets of the present day, yes, that is a proper usage of it. But we all know most people do not mean it that way. One side benefit of this bloated and hyped-up buzzword is that data professionals like myself no longer have to spend 20 minutes explaining what we do for a living; simply uttering the words "Big Data" does the job, though that is a lot like a grandmother declaring that all her grandchildren work on computers for a living. Better yet, that magic "B" word sometimes opens doors to new business opportunities (or at least a chance to grab a microphone in non-data-related meetings and conferences) that data geeks of the past never dreamed of.
So, I guess it is not all that bad. But lest we forget, all hypes lead to overinvestments, and all overinvestments lead to disappointments, and all disappointments lead to purging of related personnel and vendors that bear that hyped-up dirty word in their titles or division names. If this Big Data stuff does not yield significant profit (or reduction in cost), I am certain that those investment bubbles will burst soon enough. Yes, some data folks may be lucky enough to milk it for another two or three years, but brace for impact if all those collected data do not lead to some serious dollar signs. I know that storage and processing costs have decreased significantly in recent years, but they ain't totally free, and related man-hours aren't exactly cheap, either. Also, if this whole data business is a new concept to an organization, any money spent on the promise of Big Data easily becomes a liability for the reluctant bunch.
This is why I open up my speeches and lectures with this question: "Have you made any money with this Big Data stuff yet?" Surely, you didn't spend all that money to provide faster toys and nicer playgrounds to IT folks? Maybe the head of IT had some fun with it, but let's put that question to CFOs, not CTOs, CIOs or CDOs. I know some colleagues (i.e., fellow data geeks) who are already thinking about a new name for this (something like "decision-making activities, based on data and analytics"), because many of us will still be doing that "data stuff" even after Big Data ceases to be cool after the judgment day. Yeah, that Gangnam Style dance was fun for a while, but who still jumps around like a horse?
Now, if you ask me (though nobody did yet), I'd say Big Data should have been "Smart Data," "Intelligent Data" or something to that effect. Because data must provide insights. Answers to questions. Guidance to decision-makers. To data professionals, piles of data are just a good start, especially the ones that are fragmented, unstructured and unformatted, no matter what kind of fancy names the operating system and underlying database technology may bear. For non-data-professionals, unrefined data, whether they are big or small, would remain distant and obscure. Offering mounds of raw data to end-users is like providing a painting kit when someone wants a picture on the wall. Bragging about the size of the data with impressive-sounding new measurements that end with "bytes" is like counting grains of rice in California in front of a hungry man.
Big Data must get smaller. People want yes/no answers to their specific questions. If such clarity is not possible, probability figures to such questions should be provided; as in, "There's an 80 percent chance of thunderstorms on the day of the company golf outing," "An above-average chance to close a deal with a certain prospect" or "Potential value of a customer who is repeatedly complaining about something on the phone." It is about easy-to-understand answers to business questions, not a quintillion bytes of data stored in some obscure cloud somewhere. As I stated at the end of my last column, the Big Data movement should be about (1) Getting rid of the noise, and (2) Providing simple answers to decision-makers. And getting to such answers is indeed the process of making data smaller and smaller.
In my past columns, I talked about the benefits of statistical models in the age of Big Data, as they are the best way to compact big and complex information into the form of simple answers (refer to "Why Model?"). Models built to predict (or point out) who is more likely to be into outdoor sports, to be a risk-averse investor, to go on a cruise vacation, to be a member of a discount club, to buy children's products, to be a big-time donor or to be a NASCAR fan are all providing specific answers to specific questions, while each model score is the result of a serious reduction of information, often compressing thousands of variables into one answer. That simplification process in itself provides incredible value to decision-makers, as most wouldn't know where to cut out unnecessary information to answer specific questions. Using mathematical techniques, we can cut down the noise with conviction.
In model development, "Variable Reduction" is the first major step after the target variable is determined (refer to "The Art of Targeting"). It is often the most rigorous and laborious exercise in the whole model development process, where the characteristics of models are often determined as each statistician has his or her unique approach to it. Now, I am not about to initiate a debate about the best statistical method for variable reduction (I haven't met two statisticians who completely agree with each other in terms of methodologies), but I happen to know that many effective statistical analysts separate variables in terms of data types and treat them differently. In other words, not all data variables are created equal. So, what are the major types of data that database designers and decision-makers (i.e., non-mathematical types) should be aware of?
In the business of predictive analytics for marketing, the following three types of data make up three dimensions of a target individual's portrait:
- Descriptive Data
- Transaction Data / Behavioral Data
- Attitudinal Data
In other words, if we get to know all three aspects of a person, it will be much easier to predict what the person is about and/or what the person will do. Why do we need these three dimensions? If an individual has a high income and is living in a highly valued home (demographic element, which is descriptive); and if he is an avid golfer (behavioral element often derived from his purchase history), can we just assume that he is politically conservative (attitudinal element)? Well, not really, and not all the time. Sometimes we have to stop and ask what the person's attitude and outlook on life is all about. Now, because it is not practical to ask everyone in the country about every subject, we often build models to predict the attitudinal aspect with available data. If you got a phone call from a political party that "assumes" your political stance, that incident was probably not random or accidental. Like I emphasized many times, analytics is about making the best of what is available, as there is no such thing as a complete dataset, even in this age of ubiquitous data. Nonetheless, these three dimensions of the data spectrum occupy a unique and distinct place in the business of predictive analytics.
So, in the interest of obtaining, maintaining and utilizing all possible types of data (or, conversely, reducing the size of data with conviction by knowing what to ignore), let us dig a little deeper:
Descriptive Data
Generally, demographic data (such as people's income, age, number of children, housing size, dwelling type, occupation, etc.) fall under this category. For B-to-B applications, "Firmographic" data (such as number of employees, sales volume, year started, industry type, etc.) would be considered descriptive data. It is about what the targets "look like" and, generally, they are frozen in the present time. Many prominent data compilers (or data brokers, as the U.S. government calls them) collect, compile and refine the data and make hundreds of variables available to users in various industry sectors. They also fill in the blanks using predictive modeling techniques. In other words, the compilers may not know the income range of every household, but using statistical techniques and other available data (such as age, home ownership, housing value and many other variables), they provide their best estimates in case of missing values. People often have an allergic reaction to such data compilation practices, citing privacy concerns, but these types of data are not about looking up one person at a time, but about analyzing and targeting groups (or segments) of individuals and households. In terms of predictive power, they are quite effective and results are very consistent. The best part is that most of the variables are available for every household in the country, whether they are actual or inferred.
Other types of descriptive data include geo-demographic data, and the Census Data by the U.S. Census Bureau falls under this category. These datasets are organized by geographic denominations such as Census Block Group, Census Tract, County or ZIP Code Tabulation Area (ZCTA, much like postal ZIP codes, but not exactly the same). Although they are not available on an individual or a household level, the Census data are very useful in predictive modeling, as every target record can be enhanced with them, even when name and address are not available, and the data themselves are very stable. The downside is that while the datasets are free through the Census Bureau, the raw datasets contain more than 40,000 variables. Plus, due to budget cuts and changes in survey methods during the past decade, the sample size (yes, they sample) decreased significantly, rendering some variables useless at lower geographic denominations, such as Census Block Group. There are professional data companies that narrowed down the list of variables to manageable sizes (300 to 400 variables) and filled in the missing values. Because they are geo-level data, variables are in the forms of percentages, averages or median values of elements, such as gender, race, age, language, occupation, education level, real estate value, etc. (as in, percent male, percent Asian, percent white-collar professionals, average income, median school years, median rent, etc.).
There are many instances where marketers cannot pinpoint the identity of a person due to privacy issues or challenges in data collection, and the Census Data play the role of an effective substitute for individual- or household-level demographic data. In predictive analytics, duller variables that are available nearly all the time are often more valuable than precise information with limited availability.
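As a rough illustration of that substitution, here is a minimal Python sketch. The record layout, the field names and every number in it are invented for the example; it only shows the idea of appending geo-level variables to each record by ZCTA and labeling whether a value is actual or a geo-level stand-in.

```python
# Hypothetical household records; None marks a value the marketer could not collect.
households = [
    {"id": 1, "zcta": "07030", "income": 98000},
    {"id": 2, "zcta": "07030", "income": None},
    {"id": 3, "zcta": "10001", "income": None},
]

# Hypothetical geo-level summary keyed by ZCTA (all values invented for illustration).
census_by_zcta = {
    "07030": {"median_income": 120000, "pct_white_collar": 0.71},
    "10001": {"median_income": 90000, "pct_white_collar": 0.64},
}

def enhance(record, geo_table):
    """Append geo-level variables and fall back to them when individual data are missing."""
    geo = geo_table.get(record["zcta"], {})
    enhanced = dict(record)
    enhanced["geo_median_income"] = geo.get("median_income")
    enhanced["geo_pct_white_collar"] = geo.get("pct_white_collar")
    # Record whether the income is actual or a geo-level stand-in,
    # so an analyst never mistakes one for the other.
    if record["income"] is not None:
        enhanced["income_best"] = record["income"]
        enhanced["income_source"] = "actual"
    else:
        enhanced["income_best"] = geo.get("median_income")
        enhanced["income_source"] = "geo" if geo else "missing"
    return enhanced

enhanced_households = [enhance(h, census_by_zcta) for h in households]
for row in enhanced_households:
    print(row["id"], row["income_best"], row["income_source"])
```

Every record ends up with a usable income figure, which is the "available nearly all the time" property at work; the source label keeps the duller geo-level value from passing as precise information.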
Transaction Data/Behavioral Data
While descriptive data are about what the targets look like, behavioral data are about what they actually did. Often, behavioral data are in the form of transactions, so many just call them transaction data. What marketers commonly refer to as RFM (Recency, Frequency and Monetary) data fall under this category. In terms of predictive power, they are truly at the top of the food chain. Yes, we can build models to guess who potential golfers are with demographic data, such as age, gender, income, occupation, housing value and other neighborhood-level information, but if you get to "know" that someone is a buyer of a box of golf balls every six weeks or so, why guess? Further, models built with transaction data can even predict the nature of future purchases, in terms of monetary value and frequency intervals. Unfortunately, many who have access to RFM data are using them only in rudimentary filtering, as in "select everyone who spent more than $200 in a gift category during the past 12 months," or something like that. But we can do so much more with rich transaction data in every stage of the marketing life cycle for prospecting, cultivating, retaining and winning back.
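To make the RFM idea concrete, here is a small Python sketch. The transaction log, the customer IDs and the function names are all hypothetical. It collapses raw transactions into one Recency/Frequency/Monetary row per customer, and then shows the "rudimentary filter" quoted above for comparison.

```python
from datetime import date

# Hypothetical transaction log: (customer_id, purchase_date, category, amount).
transactions = [
    ("A", date(2023, 1, 15), "gift", 120.0),
    ("A", date(2023, 11, 2), "gift", 95.0),
    ("A", date(2023, 12, 20), "golf", 40.0),
    ("B", date(2022, 3, 9), "gift", 300.0),
    ("C", date(2023, 12, 1), "golf", 25.0),
]

def rfm_summary(rows, as_of):
    """Collapse raw transactions into one Recency/Frequency/Monetary row per customer."""
    summary = {}
    for customer, day, _category, amount in rows:
        rec = summary.setdefault(
            customer, {"recency_days": None, "frequency": 0, "monetary": 0.0}
        )
        days_ago = (as_of - day).days
        # Recency is the number of days since the most recent purchase.
        if rec["recency_days"] is None or days_ago < rec["recency_days"]:
            rec["recency_days"] = days_ago
        rec["frequency"] += 1
        rec["monetary"] += amount
    return summary

def gift_spenders(rows, as_of, min_spend=200.0, window_days=365):
    """The rudimentary filter: gift spend above a threshold within the window."""
    totals = {}
    for customer, day, category, amount in rows:
        if category == "gift" and 0 <= (as_of - day).days <= window_days:
            totals[customer] = totals.get(customer, 0.0) + amount
    return sorted(c for c, total in totals.items() if total > min_spend)

as_of = date(2023, 12, 31)
rfm = rfm_summary(transactions, as_of)
selected = gift_spenders(transactions, as_of)
print(rfm["A"])
print(selected)
```

The filter returns only a list of names, while the summary keeps recency, frequency and monetary value as separate variables that a model can weigh against each other.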
Other types of behavioral data include non-transaction data, such as click data, page views, abandoned shopping baskets or movement data. This type of behavioral data is getting a lot of attention as it is truly "big." The data have been out of reach for many decision-makers before the emergence of new technology to capture and store them. In terms of predictability, nevertheless, they are not as powerful as real transaction data. These non-transaction data may provide directional guidance, as they are what some data geeks call "a-camera-on-everyone's-shoulder" type of data. But we all know that there is a clear dividing line between people's intentions and their commitments. And it can be very costly to follow every breath you take, every move you make, and every step you take. Due to their distinct characteristics, transaction data and non-transaction data must be managed separately. And if used together in models, they should be clearly labeled, so the analysts will never treat them the same way by accident. You really don't want to mix intentions and commitments.
The trouble with behavioral data is that (1) they are difficult to compile and manage, (2) they get big, sometimes really big, (3) they are generally confined within divisions or companies, and (4) they are not easy to analyze. In fact, most of the examples that I used in this series are about transaction data. Now, No. 3 here could be really troublesome, as it equates to availability (or lack thereof). Yes, you may know everything that happened with your customers, but do you know where else they are shopping? Fortunately, there are co-op companies that can answer that question, as they are compilers of transaction data across multiple merchants and sources. And combined data can be exponentially more powerful than data in silos. Now, because transaction data are not always available for every person in databases, analysts often combine behavioral data and descriptive data in their models. Transaction data usually become the dominant predictors in such cases, while descriptive data play the supporting roles, filling in the gaps and smoothing out the predictive curves.
As I stated repeatedly, predictive analytics in marketing is all about finding out (1) whom to engage, and (2) if you decided to engage someone, what to offer to that person. Using carefully collected transaction data for most of their customers, there are supermarket chains that achieved 100 percent customization rates for their coupon books. That means no two coupon books are exactly the same, which is quite an impressive accomplishment. And that is all transaction data in action, and it is a great example of "Big Data" (or rather, "Smart Data").
Attitudinal Data
In the past, attitudinal data came from surveys, primary research and focus groups. Now, basically all social media channels function as gigantic focus groups. Through virtual places, such as Facebook, Twitter or other social media networks, people are freely volunteering what they think and feel about certain products and services, and many marketers are learning how to "listen" to them. Sentiment analysis falls under that category of analytics, and many automatically think of this type of analytics when they hear "Big Data."
The trouble with social data is:
- We often do not know who's behind the statements in question, and
- They are in silos, and it is not easy to combine such data with transaction or demographic data, due to lack of identity of their sources.
Yes, we can see that a certain political candidate is trending high after an impressive speech, but how would we connect that piece of information to who will actually donate money for the candidate's causes? If we can find out "where" the target is via an IP address and related ZIP codes, we may be able to connect the voter to geo-demographic data, such as the Census. But, generally, personally identifiable information (PII) is only accessible by the data compilers, if they even bothered to collect it.
Therefore, most such studies are on a macro level, citing trends and directions, and types of analysts in that field are quite different from the micro-level analysts who deal with behavioral data and descriptive data. Now, the former provide important insights regarding the "why" part of the equation, which is often the hardest thing to predict; while the latter provide answers to "who, what, where and when." ("Who" is the easiest to answer, and "when" is the hardest.) That "why" part may dictate a product development part of the decision-making process at the conceptual stage (as in, "Why would customers care for a new type of dishwasher?"), while "who, what, where and when" are more about selling the developed products (as in "Let's sell those dishwashers in the most effective ways."). So, it can be argued that these different types of data call for different types of analytics for different cycles in the decision-making processes.
Obviously, there are more types of data out there. But for marketing applications dealing with humans, these three types of data complete the buyers' portraits. Now, depending on what marketers are trying to do with the data, they can prioritize where to invest first and what to ignore (for now). If they are early in the marketing cycle trying to develop a new product for the future, they need to understand why people want something and behave in certain ways. If signing up as many new customers as possible is the immediate goal, finding out who and where the ideal prospects are becomes the most imminent task. If maximizing the customer value is the ongoing objective, then you'd better start analyzing transaction data more seriously. If preventing attrition is the goal, then you will have to line up the transaction data in time series format for further analysis.
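For readers wondering what "time series format" might look like in practice, here is one possible layout, sketched in Python. The transactions, the month list and the helper names are made up for illustration, and this is only one of many ways to arrange the data: one row per customer with one spend figure per month, so that a drop-off becomes visible.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (customer_id, purchase_date, amount).
transactions = [
    ("A", date(2023, 1, 5), 50.0),
    ("A", date(2023, 2, 11), 45.0),
    ("A", date(2023, 2, 20), 30.0),
    ("B", date(2023, 1, 9), 80.0),
    ("B", date(2023, 4, 2), 20.0),
]

# The observation window, oldest month first.
months = ["2023-01", "2023-02", "2023-03", "2023-04"]

def monthly_spend(rows, months):
    """Line up each customer's spend as one value per month, zero-filled."""
    totals = defaultdict(lambda: dict.fromkeys(months, 0.0))
    for customer, day, amount in rows:
        key = day.strftime("%Y-%m")
        if key in totals[customer]:  # ignore purchases outside the window
            totals[customer][key] += amount
    return {c: [by_month[m] for m in months] for c, by_month in totals.items()}

def months_since_last_purchase(spend):
    """Count trailing zero-spend months; None if there was no purchase in the window."""
    for gap, amount in enumerate(reversed(spend)):
        if amount > 0:
            return gap
    return None

series = monthly_spend(transactions, months)
for customer, spend in sorted(series.items()):
    print(customer, spend, months_since_last_purchase(spend))
```

In this made-up data, customer A has gone quiet for two straight months while customer B bought again in the latest month, a difference that a single 12-month total would hide.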
The business goals must dictate the analytics, and the analytics call for specific types of data to meet the goals, and the supporting datasets should be in "analytics-ready" formats. Not the other way around, where businesses are dictated by the limitations of analytics, and analytics are hampered by inadequate data clutter. That type of business-oriented hierarchy should be the main theme of effective data management, and with clear goals and proper data strategy, you will know where to invest first and what data to ignore as a decision-maker, not necessarily as a mathematical analyst. And that is the first step toward making Big Data smaller. Don't be impressed by the size of the data, as size often blurs the big picture and not all data are created equal.
Stephen H. Yu is a world-class database marketer. He has a proven track record in comprehensive strategic planning and tactical execution, effectively bridging the gap between the marketing and technology world with a balanced view obtained from more than 30 years of experience in best practices of database marketing. Currently, Yu is president and chief consultant at Willow Data Strategy. Previously, he was the head of analytics and insights at eClerx, and VP, Data Strategy & Analytics at Infogroup. Prior to that, Yu was the founding CTO of I-Behavior Inc., which pioneered the use of SKU-level behavioral data. “As a long-time data player with plenty of battle experiences, I would like to share my thoughts and knowledge that I obtained from being a bridge person between the marketing world and the technology world. In the end, data and analytics are just tools for decision-makers; let’s think about what we should be (or shouldn’t be) doing with them first. And the tools must be wielded properly to meet the goals, so let me share some useful tricks in database design, data refinement process and analytics.” Reach him at email@example.com.