Catch Bad Data Before It Wrecks Your Business
During the past two decades, conventional wisdom surrounding data quality has drastically changed. As decision support systems, data warehousing and business intelligence have triggered greater scrutiny of the data used to measure and monitor corporate performance, a changing attitude has gradually altered our perception of what is meant by "quality information."
Instead of focusing on specific uses of data in the context of how data sets support operation of transactional systems, we have started to consider data reuse and repurposing, noting the data's inherent value, which goes beyond its ability to make functional applications work. And, as opposed to a knee-jerk reaction to data errors, the industry now focuses on evaluating conformance to business rules that are indicative of a data set's fitness for its (potentially numerous and varied) purposes.
Even so, the fundamental aspects of data quality improvement have generally remained the same and center on a virtuous cycle:
- Evaluate data to identify any critical errors or issues that are impacting the business
- Assess the severity of the errors and prioritize their remediation
- Develop and deploy mitigation strategies
- Measure improvement to the business
- Go back to step one
This virtuous cycle requires effective tools and techniques at each step: tools that uncover data errors, evaluate the severity of each problem, eliminate root causes, correct the data, and inspect and monitor ongoing data quality activities. The real challenge lies in understanding when data values are or are not valid and correct, how incorrect values can be fixed, and how data cleansing services can be integrated into the environment. By focusing on five key aspects of data quality management (data cleansing, address data quality, address standardization, data enhancement, and record linkage and matching), businesses can take a practical and proactive approach to managing data quality.
1. Data Cleansing
Data cleansing pairs the definition of business rules with software designed to execute those rules. Yet there are some idiosyncrasies in building an effective rule set for data standardization and, particularly, for data cleansing.
At first blush, the process seems relatively straightforward: We have a data value in a character string that we believe to be incorrect and we'd like to use the automated transformative capability of a business rule to correct that incorrect string. For example, a rule might transform the abbreviation "ST" into the word "street." It is a simple data cleansing process; but, in reality, it is too simple to provide the best results. Without further controls, an address such as "St. Charles St." would be transformed to "Street Charles Street."
To transform the data correctly, more control is required over how, where and when each rule is applied. One approach to resolving rule conflicts is to introduce contextual constraints on when a rule fires; this is more complex, but it differentiates the application of otherwise conflicting rules. Another approach adjusts the rule set so abbreviations remain distinguishable and then phases the application of the rules. Most importantly, businesses must evaluate the ways data cleansing tools define rules in order to determine the best option for their particular data set.
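To make the contextual approach concrete, here is a minimal sketch in Python, assuming a toy two-rule set (the patterns and the cleanse_street_line helper are illustrative inventions, not any particular vendor's rule syntax):

```python
import re

# A naive rule that expands every "St" would turn "St. Charles St."
# into "Street Charles Street". Contextual constraints prevent that:
# "St." expands to "Street" only in the trailing (suffix) position,
# and to "Saint" only when it precedes a capitalized name.
CONTEXTUAL_RULES = [
    (re.compile(r"\bSt\.?$"), "Street"),           # suffix position only
    (re.compile(r"\bSt\.(?=\s+[A-Z])"), "Saint"),  # before a proper name
]

def cleanse_street_line(line: str) -> str:
    for pattern, replacement in CONTEXTUAL_RULES:
        line = pattern.sub(replacement, line)
    return line

print(cleanse_street_line("St. Charles St."))  # -> Saint Charles Street
```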
2. Address Data Quality
One aspect of managing the quality of master address and location data involves reviewing much of the existing documentation collected from a number of different operational systems, as well as reviewing the business processes to see where location data is created, modified or read. There are likely to be many references to operations or transformations performed on addresses, mostly with the intent of improving the quality of the address.
Curiously, there are often a number of different terms used to refer to these different transformations: validation, verification, standardization, cleansing and correction. But what do all these things mean? And why are these different terms used if they mean the same thing?
Setting aside the variety of terms used to describe address quality, the following core assertions must hold to deliver a parcel accurately:
- The item must be directed to a specific recipient party (either an individual or an organization).
- The address must be a deliverable address.
- The intended recipient must be associated with the deliverable address.
- The delivery address must conform to the USPS standard.
These assertions translate to specific steps for transforming a provided address into a complete, verified, validated and standardized address.
Based on USPS guidelines, an address is by definition complete when it can be matched against the current Postal Service ZIP+4 and city/state files. Verification means that a complete address matches the USPS files and also carries the correct ZIP+4 code. Validated addresses are consistent with the postal standard in terms of valid and invalid values; for example, a street address cannot have a number outside the range of recognized numbers (if the USPS file says Main Street runs from 1 to 104, an address of 109 Main St. is invalid). Lastly, standardization means the address is spelled out using USPS standard abbreviations.
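As a sketch of what the validation step checks, consider the street-number range test from that example. The reference table below is invented for illustration; real tools match against licensed USPS ZIP+4 and city/state files:

```python
# Toy stand-in for the USPS reference data (hypothetical ZIP code).
REFERENCE = {
    ("MAIN ST", "62701"): range(1, 105),  # Main Street runs from 1 to 104
}

def street_number_is_valid(number: int, street: str, zip5: str) -> bool:
    """Check whether the street number falls in the recognized range."""
    valid_numbers = REFERENCE.get((street.upper(), zip5))
    return valid_numbers is not None and number in valid_numbers

print(street_number_is_valid(9, "Main St", "62701"))    # True: within 1..104
print(street_number_is_valid(109, "Main St", "62701"))  # False: out of range
```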
These are all essential address quality steps to be managed by your data quality tools, but they underlie the most important element of address correctness. The address may be complete, all the elements may be valid, the ZIP-plus-four is accurate and all values conform to standardized abbreviations ... but business value is not ensured unless the right recipient is associated with the right address. There are many aspects of assessing and assuring the quality and correctness of addresses, and it is a worthwhile business commitment to review the ways your organization verifies, validates, standardizes and corrects its location and customer data.
3. Address Standardization
Here is a simple scenario, followed by what should be a simple question: You have an item you would like to have delivered to a specific individual at a particular location, and you plan to engage an agent to deliver the item on your behalf. How can you communicate to the agent where the item is to be delivered? From the modern-day perspective, it should be obvious: you only need to provide the street address and can expect the agent to figure out the rest.
We expect the delivery agent will be able to figure out how to get to a location, because the standard address format contains a hierarchical breakdown for refining the location at finer levels of precision. In the U.S., an address contains a street name and number, as well as a city, state and a postal code. This process works in the U.S., because there is a postal standard and, in fact, the driving force behind addressing standards is the need for accuracy in delivery. Ultimately, delivery accuracy saves money because it reduces the amount of effort to find the location and it eliminates the rework and extra costs of failed delivery.
Problems occur when, for one reason or another, the address does not conform to the standard. If the address is slightly malformed (e.g. it is missing a postal code), the chances are still good the location can be identified. If the address has serious problems (e.g. the street number is missing, there is no street, the postal code is inconsistent with the city and state, or other components are omitted), resolving the location becomes much more difficult, and therefore, costly.
The primary way of dealing with this problem is to treat each non-standard address as an exception, forcing the delivery agent to deal with it. The other approach attempts to fix the problem earlier in the process by using data tools to transform a non-standard address into one that conforms to the standard.
Address standardization is actually not very difficult, especially when you have access to a proven standard. At the highest level, the process is first to determine where the address does not conform to the standard, then to standardize the parts that do not conform. One can define a set of rules to check whether the address has all the right pieces, whether they are in the right place and whether they use the officially sanctioned abbreviations. Rules can also move address parts around, map commonly used terms to standard ones and use lookup tables to fill in the blanks when data is missing. In many cases, it is straightforward to rely on tools and methods to automatically transform non-standard addresses into standardized ones.
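A minimal sketch of such rules, assuming small illustrative lookup tables (a production tool would carry the full USPS Publication 28 abbreviation set and ZIP-based lookups):

```python
# Abbreviated lookup tables for illustration only.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD", "ROAD": "RD"}
STATES = {"CALIFORNIA": "CA", "TEXAS": "TX", "ILLINOIS": "IL"}

def standardize(address: str) -> str:
    """Uppercase the address, drop punctuation and map terms to the
    officially sanctioned abbreviations."""
    tokens = address.upper().replace(",", " ").split()
    out = []
    for token in tokens:
        token = token.rstrip(".")
        out.append(SUFFIXES.get(token, STATES.get(token, token)))
    return " ".join(out)

print(standardize("123 Main Street, Springfield, Illinois 62704"))
# -> 123 MAIN ST SPRINGFIELD IL 62704
```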
4. Data Enhancement
Most business applications are originally designed to serve a specific purpose and, consequently, the amount of data either collected or created by any specific application is typically just enough to get the specific job done. In this case, the data is utilized for the specific intent, and we'd say that the "degree of utility" is limited to that single business application.
On the other hand, businesses often use data created by one application to support another application. As a simple example, customer location data (such as ZIP codes) collected at many retail points of sale is used later by the retail business to analyze customer profiles and characteristics by geographical region. In these kinds of scenarios, the degree of utility of the data is increased, because the data values are used for more than one purpose.
There are a number of different ways data sets can be enhanced, including adapting values to meet defined standards, applying data corrections and adding additional attributes. We can start with a very common use of enhancement: postal standardization and address correction. Another common example involves individuals' names, which appear in data records in more than a thousand different forms: first name followed by last name; last name with a comma, followed by first name; with or without titles such as "Mr." or "Professor;" or perhaps different generational suffixes.
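As an illustration of taming those name variants, here is a sketch that parses a few common forms into components; the title and suffix lists are abbreviated inventions for the example:

```python
TITLES = {"MR", "MRS", "MS", "DR", "PROF", "PROFESSOR"}
SUFFIXES = {"JR", "SR", "II", "III", "IV"}

def parse_name(raw: str) -> dict:
    """Split a raw name into title, first, last and suffix parts,
    handling both 'Last, First' and 'First Last' orderings."""
    if "," in raw:  # normalize "Last, First" to "First Last"
        last, first = [part.strip() for part in raw.split(",", 1)]
        raw = f"{first} {last}"
    tokens = [t.strip(".") for t in raw.split()]
    title = tokens.pop(0) if tokens and tokens[0].upper() in TITLES else None
    suffix = tokens.pop() if tokens and tokens[-1].upper() in SUFFIXES else None
    return {"title": title,
            "first": tokens[0] if tokens else None,
            "last": tokens[-1] if len(tokens) > 1 else None,
            "suffix": suffix}

print(parse_name("Smith, John"))         # first=John, last=Smith
print(parse_name("Mr. John Smith Jr."))  # same individual, plus title/suffix
```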
The value of data enhancement is not limited to insertion in specific operational workflows, because enhancement is often performed to provide additional detail for reporting and analysis purposes. In these cases, enhancement goes beyond data standardization and correction; the enhancement process can add more information by linking one data set to another. The appended data can augment an analytical process to include extra information in generated reports and interactive visualizations.
Consider collecting ZIP code values at a point of sale. A retail company can take sales data that includes this geographic information and then enhance the data with demographic profiles provided by the U.S. Census Bureau to look for correlation between purchasing patterns and documented demographics about the specific locations (including sex, age, race, Hispanic or Latino origin, household relationship, household type, group quarters population, housing occupancy and housing tenure). Geographic data enhancement also adds value for analysis. Given a pair of addresses, an enhancement process can evaluate different types of distances (direct distance and driving distance are two examples) between those two points. This can be useful in a number of analytical applications, such as site location planning which compares properties based on a variety of criteria (possibly including the median driving distance for local customers for a bank branch or average driving time for delivering pizza to frequent customers).
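For the distance enhancement, the direct ("great-circle") distance between two geocoded addresses can be computed with the haversine formula; driving distance, by contrast, requires a road-network service. A small sketch with hypothetical coordinates:

```python
from math import asin, cos, radians, sin, sqrt

def direct_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two geocoded points (haversine)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius of ~6,371 km

# Hypothetical geocodes for a customer and a candidate branch site:
print(f"{direct_distance_km(41.8781, -87.6298, 41.8827, -87.6233):.2f} km")
```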
Standardizing names and addresses is the first step; linking those records to reference data collections then enables enhancement at levels ranging from gross-level linkage (say, at the county level) down to the individual level (such as the names of the magazines to which a customer subscribes). These qualitative enhancements augment the business intelligence and analytics processes to help companies make more sales, increase revenues and improve profitability.
With some thought, you and your data quality vendor can come up with many of these hybrid scenarios—ones in which data enhancement is used for both analytical and operational purposes, including data standardization and cleansing—all with a focus on improved business functions.
5. Record Linkage and Matching
You might not realize how broad your electronic footprint really is. Do you have any idea how many data sets contain information about any specific individual? These days, any interaction you have with any organization is likely to be documented electronically. And, any time you fill out a form or respond to one survey or another, more information about you is captured. Remember that registration card you filled out for the toaster you bought? The survey you filled out to get that free subscription? Didn't you subscribe to some magazine about fishing and other outdoor activities? How about that contest you entered at the county fair?
Actually, you are not the only one collecting your information. Did you buy a house? Home sales are reported to the state and the data is made available, including address, sales price and often the amount of your mortgage. Wedding announcements, birth announcements and obituaries log lifecycle events.
Every single one of these artifacts captures more than just some information about an individual: it also captures the time and place that information was collected, sometimes with high precision (such as the time of an online order) and sometimes with less (such as the day a contest entry was collected from the box). There are many distributed sources of information about customers, and each individual piece of collected data holds a little bit of value. But when these distributed pieces of data are merged together, they can be used to reconstruct an incredibly insightful profile of the customer.
Many distributed pieces of data about a single individual can be combined to form a deep profile about that individual. But how are different data records from disparate data sets combined to formulate insightful profiles?
These records are connected through a process called "record linkage." This process searches through one or more data sets for records that refer to the same unique entity, based on identifying characteristics that can distinguish one entity from all others, such as names, addresses or telephone numbers. When two records are found to share the same pieces of identifying information, you might assume that those records can be linked together. It sounds simple in the scheme of a well-established data quality process, but there are still a number of challenges with linking records across more than one data set (a matching sketch follows the list below):
- The records from the different data sets don't necessarily share the same identifying attributes (one might have a phone number but the other one does not).
- The values in one data set use a different structure or format than the data in another data set (such as using hyphens for social security numbers in one data set but not in the other).
- The values in one data set are slightly different than the ones in the other data set (such as using nicknames instead of given names).
- One data set has the values broken out into separate data elements while the other does not (such as titles and name suffixes).
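One common way to compensate for these structural and formatting differences is to normalize each record into a match key before comparing. A minimal sketch, with an invented nickname table:

```python
import re

NICKNAMES = {"BOB": "ROBERT", "BILL": "WILLIAM", "LIZ": "ELIZABETH"}

def link_key(record: dict) -> tuple:
    """Build a normalized key so format differences (hyphens in Social
    Security numbers, nicknames, letter case) don't block a link."""
    first = record.get("first", "").upper()
    first = NICKNAMES.get(first, first)             # map nickname to given name
    ssn = re.sub(r"\D", "", record.get("ssn", ""))  # strip hyphens and spaces
    return (first, record.get("last", "").upper(), ssn)

a = {"first": "Bob", "last": "Smith", "ssn": "123-45-6789"}
b = {"first": "Robert", "last": "SMITH", "ssn": "123456789"}
print(link_key(a) == link_key(b))  # True: the records link
```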
There are many variations on these themes. For example, merge/purge can be used for combining customer data sets following a corporate acquisition; enrichment can be used to institute a taxonomic hierarchy for customer classification and segmentation. Loosening the matching rules for merge/purge can help with a process called "householding," which attempts to identify individuals with some shared characteristics (such as "living in the same house").
Basically, merging a number of data sets together enriches all of the records as a byproduct of exposed transitive relationships. We can add another tool to this: approximate matching, a technique that compares two values and produces a numeric score indicating the degree to which they are similar. This is particularly valuable because the exposed embedded knowledge can, in turn, feed our other enhancement techniques for cleansing, enrichment and merge/purge, ultimately improving business value.
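As a sketch of approximate matching, Python's standard difflib can score string similarity on a 0-to-1 scale; the 0.85 threshold is an arbitrary illustration, and production matchers typically use more specialized measures (Jaro-Winkler, for example):

```python
from difflib import SequenceMatcher

THRESHOLD = 0.85  # pairs scoring above this become candidate matches

def similarity(a: str, b: str) -> float:
    """Score two values on [0, 1], where 1.0 means identical."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

print(similarity("Jonathon Smith", "Jonathan Smith"))  # ~0.93, likely match
print(similarity("Jonathon Smith", "Mary Jones"))      # far below threshold
```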
6. Achieving Proactive Data Quality Management
Standardizing the approaches and methods used for reviewing data errors, performing root cause analysis, and designing and applying corrective or remedial measures all help ratchet up an organization's data quality maturity a notch or two. This is particularly effective when it fixes the processes that allow data errors to be introduced in the first place, eliminating those errors altogether.
When the root cause cannot feasibly be addressed, there is still another standardized approach: defining data validity rules that can be incorporated into probe points to monitor compliance with expectations and alert a data steward as early as possible when invalid data is recognized. This certainly reduces the "reactive culture" and better governs data stewardship activities. In fact, many organizations consider this level of maturity proactive data quality management, because they are anticipating the need to address new issues on an ongoing basis.
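A minimal sketch of such a probe point, with two invented validity rules; a real deployment would route alerts into the steward's work queue rather than print them:

```python
from datetime import datetime, timezone

# Illustrative validity rules; each returns True when the record conforms.
RULES = {
    "zip_is_5_digits": lambda r: str(r.get("zip", "")).isdigit()
                                 and len(str(r.get("zip", ""))) == 5,
    "age_in_range":    lambda r: 0 <= r.get("age", -1) <= 120,
}

def probe(record: dict) -> None:
    """Check one record at the probe point; alert the steward on failure."""
    failures = [name for name, rule in RULES.items() if not rule(record)]
    if failures:
        print(f"[{datetime.now(timezone.utc).isoformat()}] ALERT: "
              f"record {record} failed {failures}")

probe({"zip": "9021", "age": 34})   # triggers zip_is_5_digits
probe({"zip": "90210", "age": 34})  # passes silently
```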
Many organizations are looking at drastically increasing their consumption of information with "big data" analytics programs. At the same time, people are exploring many different ways to reuse and repurpose data for both operational and strategic benefit. However, to truly be proactive, companies must anticipate the types of errors they don't already know about. Instead of only using profiling tools to look for existing patterns and errors, they might use these analytical tools to understand the methods and channels through which potential errors could be introduced. The true proactive win is to control the introduction of flawed data before it ever leads to any material impact.
Greg Brown is vice president of Melissa, provider of global contact data quality and identity verification solutions that span the entire data quality lifecycle and integrate into CRM, e-commerce, master data management and Big Data platforms. Connect with Greg at email@example.com or via LinkedIn.