Text mining has an illustrious history in the world of analytics. Investigators have used text mining to fight fraud, conduct anti-terrorist surveillance and analyze police interrogations in criminal investigations. Matters of national security, public safety, patent protection, trade secrets and other high-minded topics have all made use of text-mining tools and the wise minds of the analysts behind them. But there's a very important role for text mining to play in direct marketing, as well.
The following illustration is based on about 19,000 e-mails received by a major digital camera manufacturer. All these communications were manually reviewed and classified as being either a technical question or a customer service inquiry. Approximately 17,100 of the e-mails were categorized as customer service, with the remainder assigned to the technical category.
Analysts were charged with developing a set of rules that would allow an automatic process to assign each incoming e-mail to one of the two categories of interest. Once an accurate classification algorithm was developed, the manual review process could be automated, and the savings in personnel costs would be significant.
The e-mails were subjected to a text-mining algorithm, with the results showing the frequency of keywords and the relationship among the words. Counts were available by document for each word or set of words. Below is a sample of data that was organized after the text-mining portion of the exercise was completed.
E-mail #  Category            'Understand'  'Problem'  'Attempt'  'Years'  'Angry'
1         Technical Question  2             3          0          0        1
2         Customer Service    1             0          0          0        0
3         Customer Service    0             0          1          1        0
4         Technical Question  0             0          0          0        2
5         Customer Service    0             1          2          0        0
6         Customer Service    1             0          0          1        0
7         Customer Service    0             0          0          0        2
The first column identifies the e-mail. “Category,” in column two, is the classification assigned by an analyst during the manual review of the 19,000 documents. The next five columns are examples of actual words that appeared in these 19,000 communications. Each cell shows the number of times the word appeared in that particular e-mail. So, for example, the word “attempt” appeared twice in e-mail No. 5, while the word “angry” did not appear at all in e-mail No. 6.
The full data set had 19,000 such rows, one per e-mail, and 162 columns covering various words and phrases.
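For readers who want to see how a row of such a table comes about, here is a minimal sketch in Python. It is not the tool the analysts used, which the project description does not name. The two e-mail texts are invented for illustration, the function name keyword_counts is hypothetical, and the sketch counts exact word matches only, so "problems" would not be counted as "problem."

```python
import re
from collections import Counter

# The five example words from the sample table (lowercased for matching).
KEYWORDS = ["understand", "problem", "attempt", "years", "angry"]

def keyword_counts(text, keywords=KEYWORDS):
    """Return how often each keyword appears in one e-mail (exact matches only)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for words that never appear.
    return [counts[word] for word in keywords]

# Invented e-mail texts, used only to show the shape of the output.
emails = [
    "I do not understand the problem. The problem started after one attempt.",
    "Angry does not cover it. I have been angry for years.",
]

for number, text in enumerate(emails, start=1):
    print(number, keyword_counts(text))
```

Each printed list is one row of word counts, in the same column order as the sample table. A production process would add a column for every tracked word or phrase and would likely group word variants together.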
The next part of the project was to use data modeling tools to construct a set of rules that would optimize correct classification. Below are the results:
Actual              Technical Question  Customer Service
Technical Question  90.21%              9.79%
Customer Service    6.09%               93.91%
The analysis showed that, of all the known technical e-mails, the model classified 90.2 percent correctly. An equally impressive 93.9 percent of customer service communications were assigned correctly. Further refinements improved on both of these success rates, allowing subsequent business rules for marketing automation to be applied and prompting proper handling by the manufacturer.
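The two per-category rates can be combined into a single overall figure by weighting each rate by the size of its category. The sketch below does that arithmetic. It assumes the split was exactly 17,100 customer service and 1,900 technical e-mails (the article gives these only as approximate) and that the results table was computed on all 19,000 e-mails; neither assumption is confirmed by the project description.

```python
# Category sizes, assumed exact for this sketch: about 17,100 of the
# 19,000 e-mails were customer service, leaving about 1,900 technical.
class_sizes = {"Technical Question": 1_900, "Customer Service": 17_100}

# Share of each actual category classified correctly (from the results table).
correct_rate = {"Technical Question": 0.9021, "Customer Service": 0.9391}

def overall_accuracy(sizes, rates):
    """Weight each category's correct rate by its share of all e-mails."""
    total = sum(sizes.values())
    return sum(sizes[name] * rates[name] for name in sizes) / total

accuracy = overall_accuracy(class_sizes, correct_rate)
print(f"Overall share classified correctly: {accuracy:.1%}")
```

Under those assumptions the overall figure sits close to the customer service rate, because that category accounts for roughly nine of every 10 e-mails.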
Sam Koslowsky is vice president of modeling solutions for Harte-Hanks, a worldwide, direct and targeted marketing company that provides direct marketing services and shopper advertising opportunities to consumer and B-to-B marketers. Contact Koslowsky at (212) 520-3259 or via e-mail at firstname.lastname@example.org.