Links to Other Parts
Link to Part 4 [Current Article]
Other News
Welcome to Part 4 of our Marketing Data Analytics series.
Mixed in as the final few parts of this series are released, I’ll be shifting gears and going back to cover the basics with a series titled “MarTech Basics,” where we start from the beginning and get you up to speed.
Check out the first post of that series here:
Following that series, I’ll be putting one together called “Fluent MarTech” where we discuss how to put the separate pieces together into a functional tech stack.
Introduction
Have you ever watched someone dive into something new without knowing which mistakes were common? How often do they get tripped up by something that even a little research would have foreseen?
In college, I tried to trade biotech stocks right before they released trial news. This is a notoriously volatile (read: bad) strategy.
If I had done any actual research into the risk vs reward and potential outcomes of trading like this, I wouldn’t have bothered starting down this road. But to the uninformed (me), the huge potential jumps in price were exciting and an easy way to make money.
Obviously, I didn’t do the math, and after getting badly burned the first time, I went back and did some reading on the actual odds of success in that sector.
Don’t Do The Same For Your Approach To Analytics!
This article will guide you away from the big mistakes that would stop a presentation to your boss in its tracks. Keep a lookout for these mistakes, because if you don’t catch them, someone in your audience will.
This part shows you how to begin analyzing data like a pro (Python code will be in the Appendix section for those interested), what to look for, and what the results mean.
Avoid: Ignoring Measurement Artifacts and Outliers
Westley: "A few more steps and we'll be safe in the fire swamp."
Princess: "We'll never survive!"
Westley: "Nonsense! You're only saying that because no one ever has!"
—Rob Reiner, “The Princess Bride”
When hiking through an unfamiliar area, it’s important to understand the dangers you’re likely to encounter. Hiking in the Montana forests requires a plan to keep bears away, while a bigger hazard in the Louisiana bayou is failing to route around the deepest swamps.
Similarly, analysis pitfalls are likely to be different for each business and measurement method, but most risks can be anticipated and mitigated in advance of any analysis.
Measurement Artifacts
Referring to Figure 3.1 in Part 3 (reprinted below), note the strange-looking 25th, 50th, and 75th percentile values for “pdays” that we mentioned earlier.
Looking through the provided dataset description, we can see that the value ‘999’ is what the call center reps entered to indicate that a customer had not been previously contacted.
When we plot this series, it’s obvious that ‘999’ does not carry meaningful numerical information and is a measurement artifact.
Unless the bank had run a single day of extreme campaigning almost three years ago and nothing since, we would not expect to see a distribution like this without the instruction to enter 999 for customers who had never been contacted.
Most of the customers contacted had not been part of a previous marketing campaign, so the ‘999’ value dominates the series.
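If you’d like to follow along, here is a minimal sketch of how that check might look in pandas. It assumes the publicly available version of this dataset, a semicolon-separated file named bank-additional-full.csv; adjust the file name and separator to match your own copy.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the campaign data. The file name and ';' separator follow the public
# UCI bank-marketing CSV layout and may differ for your copy.
df = pd.read_csv("bank-additional-full.csv", sep=";")

# The summary statistics show 999 dominating the upper percentiles of 'pdays'.
print(df["pdays"].describe())

# A histogram makes the artifact obvious: one large spike at 999, far away
# from the genuine "days since last contact" values.
df["pdays"].hist(bins=50)
plt.xlabel("pdays (days since last contact)")
plt.ylabel("Number of customers")
plt.title("Raw 'pdays' distribution, including the 999 placeholder")
plt.show()
```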
By removing the measurement artifact, we can see more reasonable distributions within the subgroup. Most calls would not happen on a weekend, and people who had been contacted more recently might be more likely to answer the phone.
With the 999 values removed, we see a more reasonable distribution for the days since a customer had been previously contacted.
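Continuing the sketch above (same df and imports), filtering out the placeholder and noting how many rows were dropped keeps the step reproducible:

```python
# Drop the 999 placeholder so only genuinely re-contacted customers remain,
# and record how many rows were removed so the step can be replicated later.
contacted = df[df["pdays"] != 999]
print(f"Removed {len(df) - len(contacted)} rows flagged with the 999 placeholder")

contacted["pdays"].hist(bins=30)
plt.xlabel("Days since previous contact")
plt.ylabel("Number of customers")
plt.title("'pdays' after removing the 999 measurement artifact")
plt.show()
```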
Identifying and removing measurement artifacts from your data series keeps you from falling into statistical traps and producing analysis that never changes any business outcomes.
Outliers
Similarly, outliers—data that fall far outside the normal distribution of a dataset—need to be identified and dealt with before performing any analysis.
Outliers are any values in a dataset that vary significantly from the rest of the data. These could be from measurement artifacts as mentioned above or from real-life events that prevent a good measurement.
In the context of this dataset, an example of an outlier would be a call where the line was accidentally left open, leaving an extremely long duration in our dataset.
Obviously, there is no predictive power gained by including an accidental phone call data point. But because that outlier can be so far from the rest of the distribution, it could disproportionately weight a regression run on the dataset.
By graphing out the distribution, we should be able to confirm that we do not have any significant outliers¹.
What we’re looking for here are call durations many times longer than a person could reasonably stay on the phone, or secondary peaks sitting outside the main distribution.
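A rough sketch of that check, reusing df from the earlier snippet and assuming the public version of this dataset, where call length is stored in a duration column measured in seconds:

```python
# Plot the call-duration distribution to eyeball it for outliers. In the
# public version of this dataset the call length is the 'duration' column,
# recorded in seconds; adjust the name and units if your copy differs.
minutes = df["duration"] / 60
minutes.hist(bins=60)
plt.xlabel("Call duration (minutes)")
plt.ylabel("Number of calls")
plt.title("Distribution of call durations")
plt.show()

# Quick sanity checks for the kinds of outliers described above:
# calls far longer than anyone would plausibly stay on the line.
print(f"Calls over 60 minutes: {(minutes > 60).sum()}")
print(f"Calls over 90 minutes: {(minutes > 90).sum()}")
```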
After graphing the series, it looks like we do not have any obvious outliers!
Although we do see a fairly long tail on the right side of the graph, there are no calls over 90 minutes and extremely few over 1 hour—still a reasonable, if unpleasant, amount of time to be on the customer service line with a bank.
In conclusion, stay on the lookout for measurement artifacts and outliers that could affect your analysis. Record how you determined which data points were outliers and how they were dealt with. Future analysis will be much easier if replicating the original work requires no guesswork and the common pitfalls have already been avoided.
We’ve now found our north, taken a first pass at our map, and marked out the potential pitfalls. We’re ready to head to where X marks the spot.
Up Next
In the next post, get ready to see how we run the analysis for our particular question and dataset, along with linked resources for other common statistical explorations.
If you want to stay in the loop for future parts, make sure to subscribe to the newsletter.
Share this with a friend who might find it useful, who might have made these mistakes, or who would simply be happy to see a Princess Bride reference.
Links to Other Parts
Link to Part 4 [Current Article]
If you know anyone who could benefit from learning how to manage their own analytics, feel free to send them a link to this article.

¹ On a normally distributed dataset, this is usually done by defining outliers as data falling some multiple of the interquartile range outside it. However, for this heavy-tailed data series, that would be inappropriate.
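For anyone curious what that standard rule looks like in code, here is a minimal sketch of the 1.5×IQR approach, reusing the assumed duration column from the earlier snippets. As the footnote says, it would over-flag this heavy-tailed series, so treat it as an illustration of the rule rather than a recommendation for this dataset.

```python
# The standard 1.5x-IQR rule for roughly normal data: flag anything outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. As noted in the footnote, this would over-flag
# a heavy-tailed series like call duration; it is shown only as an illustration.
q1, q3 = df["duration"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = df[(df["duration"] < lower) | (df["duration"] > upper)]
print(f"The IQR rule would flag {len(flagged)} of {len(df)} calls as outliers")
```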