The Cost of Perfection
Why Does This Happen? The Wrong Conceptual Model
The organization had a conceptual model that did not serve the business. Its priorities sounded perfectly reasonable, but they were misplaced.
Why? Because whether data is “clean” or “not clean” depends heavily on the specific purpose that data serves. The same applies to almost any other description of data (and of many other things) made without meaningful context.
One example is sentiment analysis. Here, Bayesian methods applied to unstructured data, such as reviews on Yelp, can work with limited cleaning or even none at all.
Granted, not every review would contribute equally to the final determination of consumer sentiment about that brand, but does it have to? Of course not.
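To make that concrete, here is a minimal sketch of a naive Bayes sentiment classifier that does almost no cleaning: it lowercases the text and splits on whitespace, nothing more. The reviews, the labels, and the function names (`tokenize`, `train`, `classify`) are all invented for illustration. They are not real Yelp data, and this is not the method any particular organization used. Add-one smoothing is my own assumption, there to stop a single unseen word from zeroing out a score.

```python
import math
from collections import Counter

# Hypothetical, deliberately messy reviews (invented for this sketch).
reviews = [
    ("LOVED it!!! best tacos everrr", "pos"),
    ("great service, great food :)", "pos"),
    ("sooo good, will come back", "pos"),
    ("terrible. cold food & rude staff", "neg"),
    ("worst experience, never again!!", "neg"),
    ("food was bad and slow service", "neg"),
]

def tokenize(text):
    # The only "cleaning": lowercase and split on whitespace.
    # Punctuation, typos, and emoticons stay exactly as written.
    return text.lower().split()

def train(labeled_reviews):
    word_counts = {}          # label -> Counter of tokens seen under that label
    doc_counts = Counter()    # label -> number of reviews with that label
    for text, label in labeled_reviews:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, doc_counts, vocab

def classify(text, model):
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        # Log prior: how common this label is among the training reviews.
        score = math.log(doc_counts[label] / total_docs)
        for token in tokenize(text):
            if token not in vocab:
                continue  # tokens never seen in training carry no evidence
            # Add-one (Laplace) smoothed log likelihood of the token.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(reviews)
print(classify("great tacos, will come back", model))
print(classify("rude staff and cold food", model))
```

Notice that the token "tacos," (with its trailing comma) never matches "tacos" from the training data, so it contributes nothing. The classifier can still lean on the other words in the review, which is the point: each messy token adds a little evidence or none, and no single one has to be perfect.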
Your House is Dirty.
How did that statement make you feel? I wrote it, and for me it is really uncomfortable. I’m not a total neat freak, but it gives me an emotional, visceral reaction.
Why? Because I like “clean.”
But if someone wanted to come in with a white glove, I guess that person would be able to get a smudge on it somewhere. Now maybe I could clean up the place to pass even the most stringent white-glove test ...
But what if we came in with a microscope? Regrettably, we’d find microscopic organisms in every home, mine included.
Even so, we know that our homes aren’t any less livable or enjoyable for it.
Let’s say we irradiated it. Like the perfect “cleaning” you may be envisioning for your data. Let’s just say we applied ultraviolet radiation to every surface. Now it’s as clean as we can get it ...
Is a home that is anything short of irradiated better than being homeless?
The corollary: is data that is good, even if ‘not perfect,’ any better than being data-less?
The reality: “clean,” in data and elsewhere, really is in the eye of the beholder.