If you haven’t been following along since the beginning, I recommend starting with the first part and reading through.
Continuous improvement is better than delayed perfection
—Mark Twain
You found your north, sketched out the big features in your map, avoided the big traps, and completed your journey. You can immediately forget everything you’ve learned about navigation, er, data analysis, and be happy with your conclusions forever, right?
Not quite…
Once you have found where you think the treasure is, you still must dig down and confirm that the chest is actually there and that the gold inside is real.
What that means in the world of data analysis is that you need to continue to ingest new data to continuously validate your findings.
For our example analysis, we need to make sure that our conclusions translate into true, real-world relationships for our call center reps to leverage for their calls.
Two issues to be wary of, which can arise from completely valid data analysis and often are only discovered upon further inspection, are spurious correlations and mis-assigned causation.
Spurious Correlations
We found that contacting customers via cell phone yielded a meaningfully larger chance of them making a deposit. Great!
Now we need to double-check our analysis for spurious correlations: apparent relationships produced not by any causal link, but by random chance.
Many software services use a single measure, the p-value, to flag statistical significance. The p-value answers a question about a tested change, like offering a new discount to a customer segment: if the change actually did nothing, how likely is it that random variation alone would produce an effect at least as large as the one you observed? It’s a confusing metric, but remember that the smaller the p-value, the less likely your result is a fluke of chance.
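To make that concrete, here is a minimal sketch of a permutation test on made-up call counts (not the real campaign data): shuffle the contact-method labels many times and count how often chance alone produces a lift as large as the one observed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical outcomes: 1 = deposit made, 0 = no deposit.
# These counts are illustrative only, not from the real campaign data.
cell = np.array([1] * 60 + [0] * 140)      # 30% success over cell phone
landline = np.array([1] * 40 + [0] * 160)  # 20% success over landline

observed_diff = cell.mean() - landline.mean()

# Permutation test: shuffle the group labels and see how often chance
# alone produces a difference at least as large as the observed one.
pooled = np.concatenate([cell, landline])
n_iters = 10_000
hits = 0
for _ in range(n_iters):
    rng.shuffle(pooled)
    diff = pooled[:len(cell)].mean() - pooled[len(cell):].mean()
    if diff >= observed_diff:
        hits += 1

p_value = hits / n_iters
print(f"observed lift: {observed_diff:.1%}, p-value: {p_value:.4f}")
```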
The issue of spurious correlations arises because many software providers call something ‘significant’ whenever the p-value is less than 0.05, i.e., less than a 5% chance that an effect this large would appear by chance alone.
95% sounds like a good confidence level, but it erodes quickly under repeated testing: compare just 6 variables pairwise and you are running 15 separate tests, which gives you a greater than 50% chance of a spurious correlation appearing between at least two of them.
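The arithmetic is easy to check, assuming each pairwise test is judged independently at the usual 0.05 threshold:

```python
from math import comb

alpha = 0.05                 # the usual 'significant' threshold
n_items = 6
n_tests = comb(n_items, 2)   # 6 items compared pairwise = 15 tests

# Chance that at least one test comes up 'significant' by luck alone
p_any_spurious = 1 - (1 - alpha) ** n_tests
print(f"{n_tests} tests -> {p_any_spurious:.1%} chance of a spurious hit")
# 15 tests -> 53.7% chance of a spurious hit
```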
In the case of our analysis, we should ensure that the variables we found to be related could plausibly be related to each other. Could people reasonably respond differently to calls on a cell phone versus a landline? Although there’s no clear causal direction or link between them, it would not be unreasonable to see a difference in responses between those methods of contact.
To avoid this issue in future analysis as well, pare down your initial exploration to only variables that could plausibly be related, pay close attention to the level of statistical significance shown, and continue testing future data to see if that relationship continues to hold.
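As a sketch of that ongoing check, suppose each new batch of calls lands in a file with hypothetical contact and deposit (1/0) columns; a two-proportion z-test can then confirm whether the cell-phone lift still holds on fresh data.

```python
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical file and column names ('contact', 'deposit'); adjust to
# your own schema before running.
new_calls = pd.read_csv("new_campaign_results.csv")

cell = new_calls.loc[new_calls["contact"] == "cellular", "deposit"]
landline = new_calls.loc[new_calls["contact"] == "telephone", "deposit"]

# One-sided test: does the cell phone success rate still beat landline?
stat, p_value = proportions_ztest(
    count=[cell.sum(), landline.sum()],
    nobs=[len(cell), len(landline)],
    alternative="larger",
)
print(f"z = {stat:.2f}, p = {p_value:.4f}")  # small p: the lift persists
```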
Mis-assigned Causation
Even if you’ve done your analysis perfectly well and discovered legitimate relationships between variables, ensuring that you’ve assigned causality in the right direction is important—wet streets don’t cause rain.
One way to test causation rather than just correlation is to run randomized trials across homogeneous test groups. In marketing, that usually takes the form of A/B or multi-armed bandit testing.
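A minimal sketch of the A/B version, assuming a hypothetical list of customers who have both a cell and a landline number on file: assign each one a contact method at random, run the campaign, then compare success rates between the two arms (for instance with the z-test above).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical customer list; assumes everyone has both numbers on file.
customers = pd.DataFrame({"customer_id": range(1_000)})

# Random assignment is what turns a correlation into a causal test:
# the two groups now differ only in how we contact them.
customers["arm"] = rng.choice(["cellular", "telephone"], size=len(customers))

print(customers["arm"].value_counts())
```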
Another potential cause of this issue is the ‘third-variable problem’, where an unconsidered third variable drives both of the observed variables, creating the appearance of a direct relationship between them.
In this case, a potential confounder could be age: maybe younger people are both more likely to make a deposit and more likely to own a cell phone.
To get an intuitive understanding of the relationships between age, success in securing a deposit, and cell phone prevalence, we will graph them together and see how they relate.
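A sketch of one way to build that picture, assuming the calls live in a DataFrame with hypothetical age, contact, and deposit (1/0) columns: bin by age and plot the deposit rate next to the cell phone share.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names ('age', 'contact', 'deposit');
# adjust to your dataset's schema.
calls = pd.read_csv("campaign_calls.csv")

calls["age_group"] = pd.cut(calls["age"], bins=[18, 30, 40, 50, 60, 100])

by_age = calls.groupby("age_group", observed=True).agg(
    deposit_rate=("deposit", "mean"),                            # share who deposited
    cell_share=("contact", lambda s: (s == "cellular").mean()),  # share on cell
)

by_age.plot(kind="bar")
plt.ylabel("rate")
plt.title("Deposit rate and cell phone share by age group")
plt.tight_layout()
plt.show()
```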
It looks like the customers called who are over the age of 60 are much more likely to have a cell phone than those between the ages of 30 and 60 (at least in 2013) AND are more likely to make a bank deposit when asked.
Although it seems unintuitive that the oldest members of the bank show some of the highest deposit and cell phone rates, the number of calls made to members over the age of 60 is significantly lower than the number made to members between the ages of 25 and 60, so those rates are estimated from a much smaller, noisier sample.
It’s possible that older members are more likely to respond to a telemarketing call, but without a more evenly distributed data set, we can’t draw that conclusion.
Checks Out. Now What?
Now that you’ve done your due diligence on your data, you should share your findings. If they’re well received, design a follow-up campaign that lets you further test your results with new business data; it will help confirm that you’ve found the true source of the phenomenon while steering you away from potential dead ends.
In summary, double-checking your analysis against common sense and real-life business results will improve your department’s performance, deepen your understanding of your customers, and open up opportunities that you may not have seen before.
Be careful about your initial conclusions, but don’t be afraid to put your analytics muscles to the test and learn more about your customers.
Check out the Appendix for a PDF version of this whole exercise and code snippets used to generate the calculations and graphs. Thanks for following along!
There are more statistically rigorous methods for this, such as factor analysis to test for predictive power or variance inflation factors to test for multicollinearity, but we won’t get into them in this exercise.
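For the curious, here is a minimal sketch of the variance inflation factor check with statsmodels, run on synthetic data rather than the campaign set; as a rule of thumb, a VIF above roughly 5 to 10 flags multicollinearity.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)

# Synthetic predictors: 'tenure' is deliberately tied to 'age' so that
# both show an inflated VIF.
age = rng.normal(45, 12, 500)
predictors = pd.DataFrame({
    "age": age,
    "tenure": age * 0.5 + rng.normal(0, 2, 500),
    "balance": rng.normal(1_000, 300, 500),
})

X = add_constant(predictors)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))  # values above ~5-10 flag multicollinearity
```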