Freeform Data Are Not Exactly Free
Whenever "Big Data" is mentioned, there follows this sick stat that 2.5 quintillion bytes of data are being collected every day. That number is so bloated because literally everything that is digitized is considered data now. The data could be coming from simple tweets, Facebook postings, emails, blogs, videos, audio files, Web pages, mobile apps and program downloads. Imagine combining those with every click that you made, every page you viewed, every breath you took, literally. If you are connected to a medical device, that includes every heartbeat your heart generated. The same goes for when you are wearing one of those fancy devices while jogging through your neighborhood. You can see why they say (though I always wonder who "they" are) we are living in the sea of data. At the dawn of the Internet of Things (or the beginning of Skynet), we should also accept that human collectives will not be the dominant generators of massive amounts of data in the near future. Hopefully, we will be in control of the machines, and the collected data will remain mostly about us humans. Relatively speaking, we have only recently become the dominant species on this little planet at the far corner of the Milky Way Galaxy, and might as well enjoy being on the top of the food chain a little while longer.
Now, some of those massive data are in the form of numbers that we can add or subtract. That type of data is expressed in terms of dollars, cents, shillings, Euros or Yuans in the transactional world. Or, if they are about countable human behavior, they can be expressed in days, hours, minutes, seconds, clicks, views, downloads, meters, yards, miles, heartbeats, breaths, gallons, kilometers per hour, etc. Heck, during the last World Cup, they were already measuring the exact running distance for each player through some wearable device. (They displayed impressive figures: around six to seven miles per player per game without counting overtime, a few times more than a running back would cover in a typical football match.)
And those numbers and figures are the easy part. Most of the data we've been dealing with so far (through computers, at least) have been in the form of numbers, anyway. Other than binning or transforming them and dealing with inevitable missing values (more on that in future articles), numbers are generally in ready-for-analytics form. The real trouble is that most of the so-called Big Data are not in such numeric shapes at all. It gets worse, as most of the data that we are collecting nowadays are unstructured, unorganized, unedited, uncategorized and unrefined. In other words, they are freeform raw data. That means someone (generally the last person who touches the data for reporting to some big shots) has to make sense out of them. And how are we going to do that when Mr. Data of the 24th Century has a hard time understanding sarcasm? (But then, one humanoid, Sheldon Cooper of this century, has a hard time with sarcasm too. So let's not be too harsh on machines for such shortcomings.) Most big shots are bottom line-oriented folks, and all they really care about are short bursts of answers like "percentage increase of unfavorable sentiments toward a new product that was just released with lots of development chagrins and marketing fanfare."
One thing is for sure. Filtering and categorizing non-numeric data requires hard work, whether through computer-based analysis, mostly through manual labor, or both, but that hard work certainly pays off. I have said numerous times that the Big Data movement must be about:
1. Cutting down the noise; and
2. Providing decision-makers with simple answers (refer to "Big Data Must Get Smaller").
Combing through the freeform data and throwing away the unnecessary bits is the beginning of such a data reduction process. Here again, the first part is defining the questions to be answered. Once the goals are set, we can start throwing things away with conviction. If we find patterns in such activities, we can then start automating the hygiene and data categorization processes.
Let's start with simple examples instead of worrying about the CIA having to comb through billions of lines of emails and phone records to determine who has the intention to blow places up. Even when we just stick to marketing applications, there are plenty of examples. Product descriptions, product labels, service categories, offer types, channels, data sources, Web pages, surveys or business titles are often in freeform (yes, surveys, too), and they are definitely not ready to be used in advanced analytics. It is not just about having to dissect everyone's tweets and determine their sentiments toward certain products. Useful data are hidden in the most obvious and immediate places. And in this world of uncategorized data, they are the lowest-hanging fruit and the most potent predictors in modeling and analytics.
Take, for instance, a simple data field called "Professional Title." If you have a stack of business cards in your drawer, pull them out and see if you can find any two titles that are completely identical to each other. And no, "SVP, Finance" is not the same as "Sr. Vice President/Chief Finance Officer." While you as a human being may assume that those two titles mean about the same level and function, try to explain that to your computer. It is common to find tens of thousands of variations of business titles in a relatively small B-to-B list. As a result, even the most seasoned marketers often give up on that field, and use it only to address the contact.
To use it in analytics and reporting for marketing and sales, let us try to break the professional title field into two separate variables: one for the ranks within an organization, and the other by functionality. As I mentioned, deciding "what to define" is the first step. Then, set up the details of the final categories and rules regarding what should go into which category.
For "Title Rank," we may put all professional titles into the following categories:
- SVP, VP, and other "Chief" or "Officer" level titles under CEO
- Sr. Director and Director
- Manager, and other middle management titles
- Tactical titles, such as Account Manager, Programmer, Designer, Engineer, Writer, Editor, etc.
- Administrative Assistant, Secretary and other admin titles
- Rank-free independent titles for doctors, lawyers, consultants, etc.
- Etc., etc.
You may come up with other lists depending on your purpose. Similarly, we can set up "Title Function" for functional categories as follows:
- Research & Development
- Human Resources
- Etc., etc.
Again, depending on the purpose, we may expand or reduce this list. The key is to come up with ideal categories for specific purposes (in this case, for sales and marketing), that are not too general and not too specific. Such a Goldilocks zone lies at around 20 or fewer categories per variable. If you need more details, break the variable apart, like in this example.
Now, if you just sit down and go through 10,000 to 20,000 business titles and put them into these two categories, it won't be easy. But it won't be impossible, either. Granted that one person may categorize three to four titles in a minute, it can be estimated that one full-time person can go through 10,000 titles in six to seven working days or so, with some coffee breaks. If that person doesn't get suicidal doing it, the result would be quite rewarding and useful. Or, put eight interns on it and finish the task in one day, if you have that option.
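For the skeptics, the arithmetic behind that estimate can be checked in a few lines. This is only a back-of-the-envelope sketch; the eight productive hours per working day is my assumption, and coffee breaks are not included.

```python
# Back-of-the-envelope estimate of manual categorization effort.
# Assumption (not a measured figure): 8 productive hours per working day.
TITLES = 10_000
HOURS_PER_DAY = 8


def working_days(titles, titles_per_minute, people=1):
    """Working days needed for `people` to categorize `titles` by hand."""
    minutes = titles / titles_per_minute
    return minutes / 60 / HOURS_PER_DAY / people


slow = working_days(TITLES, 3)            # three titles per minute
fast = working_days(TITLES, 4)            # four titles per minute
team = working_days(TITLES, 3, people=8)  # eight interns at the slower pace

print(f"One person: {fast:.1f} to {slow:.1f} working days")
print(f"Eight people: {team:.1f} working days")
```

At three to four titles per minute, one person needs a little over five to just under seven working days before breaks, and eight people finish within a single day.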
The better and saner way is to set up a program that recognizes patterns of words, and let it assign the values to predetermined categories. There is an old saying in the programming field that "A lazy programmer turns out to be a good one," meaning that developers who hate manual work would create more automated modules and macros. Some editing and tweaking would be inevitable in an exercise like this, but looking for exceptions would be a lot easier than looking through the whole list. Plus, we might as well get used to that idea—as some freeform data may have a few billion variations in one field. In any case, auditing is very important, as words like "secretary" could be coming from "Secretary of the Treasury" or it could mean an administrative position. The same is true for "Manager, Accounting" vs. "Account Manager." As the order of operation becomes important, I recommend employing the "More Specific the Better" rule, where more specific strings of words are categorized first, and the general ones later (in this example, categorize the "Account Manager" first).
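To show what such a program might look like, here is a minimal sketch in Python. The rule tables, the category labels, the "Finance" function and the "Needs review" bucket are all hypothetical choices of mine for illustration; a real rule set would be far longer and would come out of the auditing described above. The one thing to notice is the ordering: the more specific patterns sit above the general ones.

```python
import re

# Hypothetical rule tables: (pattern, category), ordered from most specific
# to most general, so "account manager" is tested before plain "manager".
RANK_RULES = [
    (r"\bsecretary of\b", "Needs review"),  # e.g. Secretary of the Treasury
    (r"\baccount manager\b", "Tactical"),
    (r"\b(administrative assistant|secretary)\b", "Admin"),
    (r"\b(svp|vp|vice president|chief|officer)\b", "VP / Officer level"),
    (r"\bdirector\b", "Director"),
    (r"\bmanager\b", "Manager"),
    (r"\b(programmer|designer|engineer|writer|editor)\b", "Tactical"),
]

FUNCTION_RULES = [
    (r"\b(r&d|research)\b", "Research & Development"),
    (r"\b(hr|human resources)\b", "Human Resources"),
    (r"\b(finance|financial|accounting)\b", "Finance"),  # hypothetical category
]


def categorize(title, rules, default="Uncategorized"):
    """Return the category of the first rule whose pattern matches the title."""
    text = title.lower()
    for pattern, category in rules:
        if re.search(pattern, text):
            return category
    return default


for title in ["SVP, Finance", "Sr. Vice President/Chief Finance Officer",
              "Account Manager", "Manager, Accounting"]:
    print(title, "|", categorize(title, RANK_RULES), "|",
          categorize(title, FUNCTION_RULES))
```

With these rules, the two finance titles from the business-card example land in the same rank and function, while "Account Manager" and "Manager, Accounting" are kept apart. Anything that falls into "Uncategorized" or "Needs review" becomes the exception list for a human to audit.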
The Professional Title categories may be considered as optional by some marketers (especially when it is about B-to-C marketing). But I have found that an important variable such as Offer Type is in freeform in many databases. That is a shame, as it is a result of lack of planning or sheer laziness. If a field called "Offer" is recorded like "15% discount offer for 2014 Labor Day weekend blowout sale for total purchase amount over $250," it is hardly a data variable. Try to find out which customer would prefer flat dollar discount vs. free shipping with it. Unfortunately, such freeform data are not uncommon and, at times, I have seen literally thousands of offer types in one database. The first thing that I do in a situation like that is ask about the company's HR policy regarding hiring interns. Someone has to comb through that mess and make sensible variables out of it. And I'd rather train the interns than wait for the marketing department to come up with a plan to fix its practice spontaneously. That would be like waiting for clinically diagnosed hoarders to cure themselves on their own.
But I would still ask the marketers about their long-term and immediate marketing goals and their channel preferences, and collaborate with them to come up with a uniform Offer Code. As there are only so many ways to lure customers into stores and websites, thousands of offer types can be categorized into the following groups:
- Flat Dollar Discount
- % Discount
- Buy 1, Get 1 Free
- Free Shipping
- No Payment Until …
- Interest-Free Loan
- 3 Easy Payments of …
- Free Gifts
Indeed, the word "Free" is used freely here, and I heard that it is one of the most powerful words on the Internet. Too bad that there is no free lunch, and that "no-payment-until" date will come around soon enough.
Now, you may think that this table is "too" simple. If a marketer is concerned about the terms and conditions attached to these offers, then I'd recommend creating sub-categories under the main Offer Code. It is a good idea to keep the number of variations in one code to a manageable size, anyway. We may create an "Offer Condition Code" to capture details, such as:
- Minimum required purchase amount
- Period/season specific
- Coupons required
- Store cards only
- Specific products only
- Limited to 1 gift per customer for 3 months
- Students only
Now, the combination of these two codes will produce all kinds of variations. Going further with it, if capturing the seasonal element is critical, then another sub-category called "Offer Season" could be assigned to:
- Memorial Day
- Fourth of July
- Labor Day
- Black Friday
- White Monday
If you feel bad for the U.S. presidents or Columbus for leaving them out, then you may include those holidays, as well. But you get the point. The whole idea is to avoid freeform data as much as possible from the data collection stage and on. It is unbelievable how many so-called surveys result in unusable freeform data, and we should have a word with the survey designer in such cases.
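To make the three codes concrete, here is a rough sketch of how a freeform offer description could be broken into an Offer Code, an Offer Season and one Offer Condition (minimum purchase). The keyword patterns are my own illustrative guesses, not a tested rule set, and a real database would need many more of them.

```python
import re

# Hypothetical keyword rules; the first matching pattern wins.
OFFER_CODE_RULES = [
    (r"buy\s*(1|one).*get\s*(1|one)", "Buy 1, Get 1 Free"),
    (r"free shipping", "Free Shipping"),
    (r"free gift", "Free Gifts"),
    (r"no payment until", "No Payment Until"),
    (r"interest[- ]free", "Interest-Free Loan"),
    (r"easy payments", "Easy Payments"),
    (r"\d+\s*%", "% Discount"),
    (r"\$\s*\d+\s*off", "Flat Dollar Discount"),
]

SEASON_RULES = [
    (r"memorial day", "Memorial Day"),
    (r"(fourth|4th) of july", "Fourth of July"),
    (r"labor day", "Labor Day"),
    (r"black friday", "Black Friday"),
    (r"white monday", "White Monday"),
]


def first_match(text, rules, default=None):
    """Return the label of the first rule that matches, else the default."""
    for pattern, label in rules:
        if re.search(pattern, text):
            return label
    return default


def parse_offer(description):
    """Split one freeform offer description into three coded variables."""
    text = description.lower()
    minimum = re.search(r"(?:over|minimum of|at least)\s*\$\s*(\d[\d,]*)", text)
    return {
        "offer_code": first_match(text, OFFER_CODE_RULES, "Uncategorized"),
        "offer_season": first_match(text, SEASON_RULES),
        "min_purchase": int(minimum.group(1).replace(",", "")) if minimum else None,
    }


print(parse_offer("15% discount offer for 2014 Labor Day weekend blowout sale "
                  "for total purchase amount over $250"))
```

The long Labor Day description from earlier comes out as a "% Discount" offer for "Labor Day" with a minimum purchase of 250, which is something an analyst can actually count and compare.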
In the world of analytics, categories and tags are your friends. Have you wondered how music services like iTunes or Pandora auto-magically (I apologize for using this cliché) pick related songs like a personal DJ for you? I am certain they all rely on wonderful algorithms that calculate the distances among millions of songs. But the starting point of such a calculation is setting up useful categories and tags for each song, such as musical genre, artists, artist category, main instrument type, year released, year composed, original/remake, band type, band members, lead singers, composer, arranger, conductor, length, album, album type, song sequence in an album, collaborating artists, featured artists, etc. Going deeper, one can imagine obscure tags such as "One-hit-wonder of the 80s," "Guitar heroes of the 70s," "Girl groups of K-pop," etc. I don't care if they wrote computer code to create such tags, or had a farm of young and hip interns go crazy with it. The point is that building a mathematical model is a stepwise exercise, and categorization is an important part of it.
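I have no idea what those services actually run, so take the following only as a toy illustration of why tags come first. It uses the Jaccard distance, one simple way to measure how different two sets of tags are; the songs and tags are made up.

```python
def jaccard_distance(tags_a, tags_b):
    """1 minus (shared tags / all tags); 0.0 means identical, 1.0 means nothing in common."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)


# Hypothetical catalog: song name -> tags assigned during categorization.
catalog = {
    "song_a": {"rock", "1970s", "guitar", "band"},
    "song_b": {"rock", "1980s", "guitar", "band"},
    "song_c": {"k-pop", "2010s", "girl group", "dance"},
}


def nearest(song, songs):
    """Name of the other song whose tags are closest to the given song's tags."""
    others = (name for name in songs if name != song)
    return min(others, key=lambda name: jaccard_distance(songs[song], songs[name]))


print(nearest("song_a", catalog))
```

Without the tags, there is nothing for the distance calculation to work on, no matter how clever the algorithm is.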
When everything is digitized, even the food labels can be used to profile consumers and predict future behaviors. Predicting "why" is the most difficult part in predictive analytics. But if some household is buying an unusual proportion of products labeled as "Sugar Free," do we really need to know the "why" part? It could be that there is a diabetic in the household, or someone is in a weight loss program. But once such a correlation is found, we can personalize the offers to such households (without, of course, being too creepy about it).
Here again, the whole exercise starts with creating new variables and categories. In the spirit of going nuts about it, let's start by putting down all the things that we can find from simple food labels. The goal here is to describe the buyers, not the product itself. So, let's imagine the buyers of products labeled as:
- Organic (Though I wonder what that means at times. The opposite of Synthetic?)
- Diet (as in "Diet Coke," or "Coke Light" in Europe)
- Low calorie/no calorie
- Low sugar/no sugar
- Low fat/fat free
- Low sodium/sodium free
- Gluten free
- Lactose free
- Peanut free
- Energy (not for bunnies, but for drinks)
- Family size/value pack
- Fun size/small packages (Though I personally believe the "Fun Size" should mean something crazy-big, like an 8x11 size chocolate bar that's ¾-inches thick.)
- Etc., etc.
Once you break down the labels this way, it is entirely possible to build models targeting "Cooking from scratch for a family," "Health-conscious organic," "Weight watchers," "Busy parents with young kids," "Energetic on-the-gos," "Buyers with dietary restrictions," etc. This is how to convert monotonous product labels into descriptors of buyers. Through categories and tags, then with statistical models.
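As a first step before any statistical model, the labels can be rolled up into household-level variables. The sketch below assumes a hypothetical mapping from label tags to buyer descriptors (the mapping is mine, for illustration only) and computes what share of a household's purchased items points to each descriptor.

```python
from collections import Counter

# Hypothetical mapping from food-label tags to buyer descriptors.
LABEL_TO_SEGMENT = {
    "organic": "Health-conscious organic",
    "diet": "Weight watchers",
    "low calorie": "Weight watchers",
    "low fat": "Weight watchers",
    "gluten free": "Buyers with dietary restrictions",
    "lactose free": "Buyers with dietary restrictions",
    "peanut free": "Buyers with dietary restrictions",
    "energy": "Energetic on-the-gos",
    "family size": "Busy parents with young kids",
}


def segment_shares(purchased_labels):
    """Share of a household's purchased items that carry labels of each segment."""
    counts = Counter()
    for labels in purchased_labels:
        # Count each segment at most once per item, even if several labels match.
        segments = {LABEL_TO_SEGMENT[l] for l in labels if l in LABEL_TO_SEGMENT}
        counts.update(segments)
    total = len(purchased_labels)
    return {seg: n / total for seg, n in counts.items()} if total else {}


# One household's items, each described by the set of labels on the package.
household = [{"organic"}, {"organic", "gluten free"}, {"family size"}, set()]
print(segment_shares(household))
```

Shares like these become the input variables for the models, and an unusually high share is exactly the kind of pattern discussed in the "Sugar Free" example.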
Now you get the idea, so let's continue with ridiculously large data. I have some personal experience with it, as I led a team to create the first large-scale consumer co-op database in the U.S. that fully incorporated SKU-level item data into an individual-level targeting engine at the turn of this century. That may sound easy nowadays (though very few people are doing it right, even now). But being the first in the industry attempting such a thing, it was a borderline crazy idea at the time. And like anything in the age of Big Data, the data collection was the easiest part. In fact, after we created the SKU-level co-op database, item-level data became the price of entry in the co-op and list industry, and every follower started collecting the data at that level. But at the risk of sounding too much like Jerry Seinfeld, hey, anybody can just collect the data. The important part is holding on to them. All the way to modeling, targeting and selection.
Because we were collecting data from more than 1,200 sources then (now, that company has over 2,000 sources, I hear) and participating co-op members each had a number of SKUs ranging from 50,000 to 500,000, by the time we had more than 150 million buyers in our database, we had literally billions of item-level transaction details. Well, that is pretty big—even by today's standards. So how did we make sense out of it?
First, we created a multi-level product category table into which all SKU data would be assigned. It was multi-level, as we set up 20 to 30 major categories (from apparel to video entertainment), and broke each major category down to more specific categories. For instance, Apparel would be broken into women, men, children, large size, petite size, big and tall, etc. And then women's apparel would be further broken down into fashion, formalwear, eveningwear, casualwear, underwear, loungewear, footwear, swimwear, bride wear, etc., for example. I now see that Google's product categories took that type of multi-level structure, and if you visit any major shopping sites, such as Amazon, and drill down into their product categories, you will recognize such layers there, as well. The major difference? We did not categorize products, but we categorized buyers who bought those items.
And that is the punch line. Buyers, not the product. Why? It was because our goal was to predict individuals' future behaviors. We were not doing this for inventory management or website efficiency. So, when in doubt, it was perfectly OK for us to assign different categories to the same product, depending on the context. An easy example of that would be baking soda. Buyers of baking soda could be buying it for baking, deodorizing, dental hygiene, household cleaning and the list goes on. And we must recognize such differences. Similarly, let's take an example of a fancy weather station that tells time, temperature, atmospheric pressure, humidity, etc. One can buy that item from a nautical catalog or website, or from an executive gift catalog. If you force that item into a nautical category regardless of the context, you may end up sending nautical product offers to a gift buyer in the future. Not the end of the world, but not ideal, for sure. Again, buyers before products.
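In data terms, "buyers before products" simply means the category lookup is keyed by the product and the context together, not by the product alone. A minimal sketch, with hypothetical keys and category names:

```python
# Hypothetical lookup keyed by (product keyword, selling context).
BUYER_CATEGORY = {
    ("weather station", "nautical catalog"): "Nautical",
    ("weather station", "executive gift catalog"): "Gift",
}


def buyer_category(product, source, table=BUYER_CATEGORY):
    """Category describing the buyer, given what was bought and where."""
    return table.get((product.lower(), source.lower()), "Uncategorized")


print(buyer_category("Weather Station", "Nautical Catalog"))
print(buyer_category("Weather Station", "Executive Gift Catalog"))
```

The same item yields two different buyer categories, which is the whole point.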
That type of "Buyer-centric" mindset has been the main theme of this whole series (refer to "It's All About Ranking"). And for this daunting task of having to categorize millions of SKUs, that single-mindedness also became our savior. Simply, why categorize any item that did not sell? In fact, we could explain the majority of the transactions by categorizing the most popular items first and ignoring the unpopular ones.
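The "popular items first" idea is easy to sketch: count how often each SKU appears in the transactions, then work down the list until enough of the transaction volume is covered. The 80 percent target below is an arbitrary number I picked for illustration, not the threshold we used.

```python
from collections import Counter


def items_to_categorize(transactions, target_coverage=0.8):
    """Most popular SKUs that together cover `target_coverage` of all transaction lines."""
    counts = Counter(transactions)
    total = sum(counts.values())
    if not total:
        return [], 0.0
    selected, covered = [], 0
    for sku, n in counts.most_common():
        if covered / total >= target_coverage:
            break
        selected.append(sku)
        covered += n
    return selected, covered / total


# Hypothetical transaction log: one SKU per transaction line.
log = ["A"] * 50 + ["B"] * 30 + ["C"] * 10 + ["D"] * 5 + ["E"] * 5
skus, coverage = items_to_categorize(log)
print(skus, coverage)
```

In this made-up log, categorizing just two of the five SKUs explains 80 percent of the transaction lines, and the three slow sellers can wait.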
If we were doing it today, we would put more emphasis on crowdsourcing, pattern recognition and machine learning. In fact, what we did even back then was a combination of small-scale crowdsourcing (with lots of part-time moms who were highly educated and informed consumers), plus pattern-recognition techniques. We tried machine learning in the beginning, but quickly realized that it would be humans who would have to teach the machine to begin with. And, in those days, such software was cost-prohibitive, with not-so-great results. It was all right with long strings of text from emails and messages, but with a short burst of product description like "Disney's Tarzan," it had no idea where to go and couldn't be taught, either.
That means we would still need some type of human intervention at the beginning stage, and I think it would be beneficial to share some of the major rules of categorization, regardless of employed technologies or techniques:
1. Define the Categories First: The key is to set up categories that fit your goals, granted that you've set the goals. Be specific, as we can combine categories in analytical steps, but analysts cannot break apart the ones that are lumped in together.
2. Categorize as the Data Are Being Collected: It is not always possible, but we should try to categorize data at earlier stages of data collection. For example, inadequate surveys and data-entry forms on Web pages are the main source of unusable freeform data. And be consistent about it during the journey through data, starting with data collection forms to database design and analytics.
3. Buyers, Not Product: As explained already, for marketing purposes, when in doubt, buyer categorization must be the primary goal. Buyer categories are definitely not the same as product categories. Product taxonomy designed for inventory and website management is better than nothing, but it isn't suitable for target marketing.
4. The More Specific, the Better: During categorization efforts, the most specific category in the master table should be considered first. Do not get lazy and just assign an item to "Home Electronics," when it could be under "Home Electronics > Home Theater System > Audio Equipment > Speakers."
5. Cut Out the Noise: Even in short product descriptions, there is a lot of noise. For instance, is it really necessary to categorize every color of shoes a woman bought? Great, you now know that she bought a pair of red shoes in the spring of 2014. Types of shoes, designer, brand and the price range are important, but the "red" part? You don't need it, unless you will have a "red color only" sale one day. And even so, she may never respond to it, as she has a pair of red shoes already. No matter how interesting the category or tag may sound, cut it out if it doesn't help sell more products.
6. Consistency Over Accuracy: Do not forget that we will be using these categorized data in model development, and an important part of that exercise is recognizing patterns in consumer behavior. If you keep changing your mind about the category for the same item, it will mess it all up, for sure. People are often confused because of the English version of "Category Descriptions" (as in "That handbag should not be in women's accessory, as it is a fashion brand!"). But once it is categorized, it is just a set of assigned numbers that provide patterns for the analytical programs in later stages. The worst thing to do is to put in conflicting categories or tags for the same item from different transactions. Also, if the brand or merchant name is so important, it should be a separate variable, like in the earlier examples for professional titles and offer types.
7. Automate as Much as Possible: No matter how expensive computing time may be, it is just a fraction of the cost of human labor. Once the patterns and rules are set, employ all available technologies to automate the process. That will also ensure consistency, right or wrong. But do not forget that there is no software that can just create categories and groups that are suitable for your goals on its own.
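Rule 6 in particular lends itself to an automated check. The sketch below flags any item that has received more than one category. Because the same product may legitimately fall into different buyer categories in different contexts (rule 3), the key here is the item plus its source; that keying, and the sample rows, are my assumptions for illustration.

```python
from collections import defaultdict


def conflicting_items(assignments):
    """Given (key, category) pairs, return the keys assigned more than one category."""
    seen = defaultdict(set)
    for key, category in assignments:
        seen[key].add(category)
    return {key: sorted(cats) for key, cats in seen.items() if len(cats) > 1}


# Hypothetical assignments, keyed by (item, source).
assignments = [
    (("handbag-123", "merchant A"), "Women's Accessories"),
    (("handbag-123", "merchant A"), "Fashion Brands"),
    (("weather-station", "nautical catalog"), "Nautical"),
    (("weather-station", "executive gift catalog"), "Gift"),
]
print(conflicting_items(assignments))
```

Only the handbag is reported, since it was given two categories for the same item from the same source. The weather station passes, because its two categories come from two different contexts.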
Last month, I discussed how we can create hundreds of meaningful statistics out of simple RFM data by combining them with other categorical elements, such as products and channels (refer to "Beyond RFM Data," where the concept of RFM, P & C was introduced). And such categorical data are abundant. We talked about products, services, channels, offer types and business titles. But we may also dig into markets, regions, member status, payment types, data sources, Web pages, call-center logs and any type of action that may happen at websites or stores. Create a categorization template suitable for specific goals and lay out proper categorization rules. Then you will be able to make any analytical dataset immensely more colorful. After all, that is what I meant by "Beyond RFM Data." The trick is to combine different types of data at the variable creation stage in preparation for the analytical steps. But if you give up on the freeform data, none of it will be possible, even with a simple field, such as Professional Title.
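As a small illustration of combining RFM data with categorical elements at the variable creation stage, the sketch below summarizes recency (last order date), frequency (number of orders) and monetary value (total amount) per buyer, sliced by either product category or channel. The transaction rows and field layout are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction rows: (buyer_id, order_date, category, channel, amount)
transactions = [
    ("b1", date(2014, 3, 1), "Apparel", "Web", 120.0),
    ("b1", date(2014, 6, 15), "Apparel", "Store", 80.0),
    ("b1", date(2014, 7, 4), "Home Electronics", "Web", 300.0),
    ("b2", date(2014, 5, 20), "Apparel", "Web", 45.0),
]


def rfm_by(rows, key_index):
    """Recency, frequency and monetary summary per (buyer, value at key_index)."""
    summary = defaultdict(lambda: {"last_date": None, "orders": 0, "amount": 0.0})
    for row in rows:
        buyer, order_date, amount = row[0], row[1], row[4]
        cell = summary[(buyer, row[key_index])]
        cell["orders"] += 1
        cell["amount"] += amount
        if cell["last_date"] is None or order_date > cell["last_date"]:
            cell["last_date"] = order_date
    return dict(summary)


by_category = rfm_by(transactions, 2)  # index 2 holds the product category
by_channel = rfm_by(transactions, 3)   # index 3 holds the channel
print(by_category[("b1", "Apparel")])
print(by_channel[("b1", "Web")])
```

Each added categorical field multiplies the number of variables that can be created this way, which is why clean categories matter so much before any modeling starts.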
So, the final lesson is that you should never give up, never surrender. Making sense of seemingly impossible amounts and variations of data is the essence of the Big Data movement, anyway. Blessed are the ones who are innovative, committed, persistent and consistent. You didn't think that this whole thing would be that easy, did you? (Silly rabbit, Trix are for kids … ) But, like the crossing of the Atlantic Ocean, any challenge can be turned into a routine if you get to know the proven steps. The sea of data should be looked at the same way.
Stephen H. Yu is a world-class database marketer. He has a proven track record in comprehensive strategic planning and tactical execution, effectively bridging the gap between the marketing and technology world with a balanced view obtained from more than 30 years of experience in best practices of database marketing. Currently, Yu is president and chief consultant at Willow Data Strategy. Previously, he was the head of analytics and insights at eClerx, and VP, Data Strategy & Analytics at Infogroup. Prior to that, Yu was the founding CTO of I-Behavior Inc., which pioneered the use of SKU-level behavioral data. “As a long-time data player with plenty of battle experiences, I would like to share my thoughts and knowledge that I obtained from being a bridge person between the marketing world and the technology world. In the end, data and analytics are just tools for decision-makers; let’s think about what we should be (or shouldn’t be) doing with them first. And the tools must be wielded properly to meet the goals, so let me share some useful tricks in database design, data refinement process and analytics.” Reach him at firstname.lastname@example.org.