A New Approach to Predictive Analytics Model Evaluation
Predictive analytics has become routine in a wide variety of disciplines. While models have become standard in many organizations, I am not convinced that analysts are always evaluating the results of their efforts appropriately.
Many, including novice analysts, believe that with gains or decile analysis readily available, the evaluation standards are obvious. Users overvalue the apparent reasonableness of the gains table: Are more responders identified in the top decile segments, fewer in the middle, and the fewest at the bottom? While this is important, it does not always lead to selecting the "best" model for a given situation.
Consider the table below, which reports the percent of responders identified by decile. Looks good, doesn't it?
This may be referred to as a monotonically decreasing relationship: at each succeeding decile, fewer responders are found. This is precisely what a marketer wants. But there are additional elements that influence model evaluation and are frequently ignored. These include:
- Lift Variations
- Choppiness
- Stability
- Predictability
- Variety
- Parsimonious Parameterization
- Explainability
Lift Variations: The percentage of all responders identified at a specific depth plays a significant role in selecting which model to deploy. Let's look at the following chart:
The blue and red bars above represent the results of two different models. Upon closer examination, we discover a few things, visible in this table:
The original model identified over 27 percent of responders at the 10 percent depth. That appeared to satisfy the marketer; it performed better than the alternative scenario. However, if we proceed to the fifth decile, the alternative option becomes the winner, finding slightly over 80 percent. Which model should be deployed?
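To make such depth comparisons concrete, here is a minimal sketch of computing the cumulative percent of responders captured at each decile depth. It assumes a pandas DataFrame with a model score column and a 0/1 response flag; the column names "score" and "responded" are hypothetical placeholders.

```python
import pandas as pd

def cumulative_gains(df, score_col, response_col, n_bins=10):
    """Cumulative % of all responders captured at each decile depth, top scores first."""
    ranked = df.sort_values(score_col, ascending=False).reset_index(drop=True)
    # Decile 1 holds the top-scored 10% of records, decile 10 the bottom 10%.
    ranked["decile"] = pd.qcut(ranked.index, n_bins, labels=list(range(1, n_bins + 1)))
    responders_by_decile = ranked.groupby("decile")[response_col].sum()
    return 100 * responders_by_decile.cumsum() / ranked[response_col].sum()

# Hypothetical usage: compare two scored files at the 10% and 50% depths.
# gains_a = cumulative_gains(scored_a, "score", "responded")
# gains_b = cumulative_gains(scored_b, "score", "responded")
# print(gains_a.loc[1], gains_b.loc[1])   # depth = top decile
# print(gains_a.loc[5], gains_b.loc[5])   # depth = top five deciles
```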
Choppiness: Let's look at the following results:
Chart 2 displays the gains results for two different models. Note that the blue bars exhibit the monotonically decreasing relationship (no choppiness), while the red bars present erratic, or choppy, behavior. Observe, for example, the change from decile 4 to decile 5 in the choppiness scenario. Yet although "choppiness" is evident in this scenario, that model nevertheless identifies more responders at the 50th percentile (79.58 vs. 77.31). Which model should our marketers adopt?
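One simple way to surface choppiness is to check whether the per-decile responder counts (or rates) decline monotonically. A small sketch, with the input being any sequence ordered from decile 1 to decile 10 and the example counts purely hypothetical:

```python
def flag_choppiness(responders_by_decile):
    """Return the decile numbers where responders rise instead of falling."""
    breaks = []
    for i in range(1, len(responders_by_decile)):
        if responders_by_decile[i] > responders_by_decile[i - 1]:
            breaks.append(i + 1)  # decile numbering starts at 1
    return breaks

# Hypothetical per-decile responder counts for a "choppy" model:
# flag_choppiness([310, 240, 255, 180, 210, 120, 95, 60, 40, 20])  -> [3, 5]
```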
Stability: During analysis, the modeler tests the preliminary result on a development file. If satisfied with what is observed, the analyst then applies the algorithm to the holdout sample. The outcome is typically presented as the evaluation of model performance.
If the validation file result conflicts with the training file result, we may very well have an unstable model. Look at the model on the left in Chart 3, where the horizontal axis shows the deciles for model 1 on the left side and for model 2 on the right side. The heights of the bars for model 1 are almost identical, indicating a close match between the analysis and validation files. The bars on the right for the second model have slightly different heights, suggesting a less stable model.
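A simple stability check is to line up the two decile reports and measure how far apart they are. This is only a sketch; it assumes the cumulative_gains helper sketched earlier and that smaller gaps indicate a more stable model.

```python
def stability_gap(train_gains, valid_gains):
    """Largest absolute gap between development and holdout gains, deciles 1..10."""
    gaps = [abs(t - v) for t, v in zip(train_gains, valid_gains)]
    return max(gaps)

# Hypothetical usage with the cumulative_gains helper sketched above:
# train_g = cumulative_gains(train_df, "score", "responded")
# valid_g = cumulative_gains(valid_df, "score", "responded")
# print(stability_gap(train_g.tolist(), valid_g.tolist()))
```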
Predictability: As we know, the final product of a model is a prediction for each individual. When these records are grouped appropriately, deciles are formed. We may compute for each of these segments:
- Actual response rate
- Predicted response rate
The actual rate for a decile is calculated by adding up its responders and dividing by the number of records in that decile. The predicted rate is determined by using the model to assign each individual a probability of responding, then averaging those probabilities within the decile. Large differences between predicted and actual are a cause for concern.
Analyses must assess how close actual is to predicted. This is an important dimension in evaluating model strength. There are statistical tests that help determine the 'closeness' of the distributions.
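To illustrate, here is a sketch of a per-decile actual-versus-predicted report, again assuming a pandas DataFrame with hypothetical column names for the predicted probability and the 0/1 response. The closeness statistic shown is one common choice, a Hosmer-Lemeshow-style chi-square, and it assumes every decile's predicted rate falls strictly between 0 and 1.

```python
import pandas as pd

def actual_vs_predicted(df, prob_col, response_col, n_bins=10):
    """Per-decile actual vs. predicted response rates, top scores first."""
    ranked = df.sort_values(prob_col, ascending=False).reset_index(drop=True)
    ranked["decile"] = pd.qcut(ranked.index, n_bins, labels=list(range(1, n_bins + 1)))
    grouped = ranked.groupby("decile")
    report = pd.DataFrame({
        "actual_rate": grouped[response_col].mean(),     # responders / records in decile
        "predicted_rate": grouped[prob_col].mean(),      # average predicted probability
        "records": grouped[response_col].size(),
    })
    report["gap"] = (report["actual_rate"] - report["predicted_rate"]).abs()
    return report

def hl_style_statistic(report):
    """Hosmer-Lemeshow-style chi-square across the deciles (larger means worse fit)."""
    n = report["records"]
    p = report["predicted_rate"]          # assumed to be strictly between 0 and 1
    observed = report["actual_rate"] * n
    expected = p * n
    return float(((observed - expected) ** 2 / (n * p * (1 - p))).sum())
```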
Variety: It is necessary to try several modeling approaches in order to determine which delivers the best results. Many analysts do not.
Parsimonious Parameterization: Generally speaking, fewer model predictors are preferable to more. There is nothing wrong with checking whether the model developer "peeled" off variables; the peeling should stop when model results begin to suffer.
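To illustrate the "peeling" idea, here is a minimal, hypothetical sketch of backward elimination that drops one candidate predictor at a time and stops once performance falls by more than a chosen tolerance. The fit_and_score function is an assumed stand-in for whatever modeling and evaluation routine is actually in use.

```python
def peel_predictors(predictors, fit_and_score, tolerance=0.005):
    """Drop predictors one at a time until performance falls by more than `tolerance`.

    `fit_and_score(subset)` is assumed to refit the model on the given predictor
    subset and return a validation performance measure (higher is better).
    """
    current = list(predictors)
    best_score = fit_and_score(current)
    while len(current) > 1:
        # Try removing each remaining predictor and keep the least damaging drop.
        trials = [(fit_and_score([p for p in current if p != cand]), cand)
                  for cand in current]
        trial_score, to_drop = max(trials)
        if best_score - trial_score > tolerance:
            break  # peeling further would hurt model results
        current.remove(to_drop)
        best_score = trial_score
    return current
```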
Explainability: Would you accept the results of a model if there is a predictor that you find difficult to explain? Generally, if I don't understand it, I won't use it. It is imperative that there be a comfort level with how the model arrives at its results.
So, these are the criteria for considering a model to be good:
- Adequate segmentation
- No choppiness
- Maximum stability
- Optimal predictability
- Multiple approaches attempted
- Optimal number of predictors
- Effortless explainability
A weight may be assigned to each of these dimensions. Coupled with the decile report, these added conditions further validate model results. Data miners produce a better product and managers design more successful programs.
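As a closing illustration, here is one way such a weighted score might be assembled. The ratings and weights below are hypothetical placeholders, not recommended values; both would be set by the analyst and the manager for their own situation.

```python
# Hypothetical ratings (0-10) for a candidate model on each dimension,
# and illustrative weights that sum to 1.0.
ratings = {
    "segmentation": 8, "choppiness": 7, "stability": 9, "predictability": 8,
    "variety": 6, "parsimony": 7, "explainability": 9,
}
weights = {
    "segmentation": 0.20, "choppiness": 0.10, "stability": 0.20, "predictability": 0.20,
    "variety": 0.05, "parsimony": 0.10, "explainability": 0.15,
}

overall = sum(ratings[dim] * weights[dim] for dim in ratings)
print(f"Weighted model score: {overall:.2f} out of 10")
```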
Sam Koslowsky is VP of modeling solutions at Harte Hanks, a targeted marketing services company offering a wide array of integrated, multichannel and data-driven solutions. Reach him at email@example.com.