Links to Other Parts
Link to Part 3 [Current Article]
Introduction
Welcome to Part 3.
This part shows you how to begin analyzing data like a pro (Python code is in the Appendix section), what to look for, and what the results mean.
Do not feel the need to understand every part of this article. It can be a reference that you come back to for specific information as needed.
If you get stuck on a section, just move on and come back later. It’s very easy to get bogged down in the details, but I promise the process is more important than specific math tools.
1. Observe: Run Descriptive Statistics on Relevant Data
You’ve got to be very careful if you don’t know where you are going, because you might not get there.
—Yogi Berra
Knowing which way is north may be the first step, but you still need to plot out the best path to get to your result. Determining where hills and valleys lie and routing around them will reduce the effort it takes to hike to the treasure.
Once you have a direction to navigate towards, mapping out the high-level view of the data landscape means running descriptive summary statistics on relevant data sets.
In our Portuguese dataset, that means running the descriptive statistics on the numerical values, and plotting histograms of categorical, AKA nominal, variables (variables that have no inherent order).
Descriptive Statistics
For the numerical columns, we can run a descriptive analysis on each series (column of data). A summary of each data series is shown below.
There’s no need to spend a lot of time reading every value here. We’re going to be primarily looking at the mean and standard deviation (std) values for each series to get an idea of which series are tightly clustered around a single value and which are very spread out.
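As a rough sketch of how a summary like this can be produced, pandas has a `describe()` method that reports the count, mean, std, min, quartiles, and max for each numeric column. The rows below are made up for illustration; only the column names are borrowed from the dataset discussed here.

```python
import pandas as pd

# Made-up rows for illustration only; column names mirror the article's dataset.
df = pd.DataFrame({
    "age": [30, 41, 56, 37, 29, 45],
    "duration": [120, 85, 640, 210, 95, 1500],
    "pdays": [999, 999, 6, 999, 999, 999],
    "marital": ["married", "single", "married", "divorced", "single", "married"],
})

# describe() summarizes numeric columns by default, so "marital" is skipped.
summary = df.describe()

# Keep only the two rows we care most about: mean and standard deviation.
spread = summary.loc[["mean", "std"]]
print(spread.round(1))
```

A series whose std is small relative to its mean is tightly clustered; one whose std is as large as (or larger than) its mean is very spread out.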
Wow, the number of days since previous contact (pdays) category has a mean value of 962, more than 2 ½ years, but the standard deviation is only about 6 months. Something fishy is going on with that column. The 25th, 50th, and 75th percentiles (shown by rows 4, 5, and 6) are all the same value, 999 days, so we need to explore this data series in more detail later.
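When all three quartiles land on one value, a quick first check is how much of the column that single value takes up, and what the numbers look like without it. Here is a minimal sketch on made-up values; treating 999 as a suspected placeholder is an assumption to verify against the dataset's documentation, not something this check proves.

```python
import pandas as pd

# Made-up values for illustration only.
pdays = pd.Series([999, 999, 3, 999, 6, 999, 999, 999, 12, 999], name="pdays")

SUSPECT_VALUE = 999  # assumption: the value the quartiles keep landing on

# Share of rows holding the suspect value.
share_suspect = (pdays == SUSPECT_VALUE).mean()

# Statistics for the remaining rows only.
real_contacts = pdays[pdays != SUSPECT_VALUE]

print(f"Share equal to {SUSPECT_VALUE}: {share_suspect:.0%}")
print(f"Mean with the value included: {pdays.mean():.1f}")
print(f"Mean with it excluded: {real_contacts.mean():.1f}")
```

If one value dominates the column, the mean and standard deviation mostly describe that value rather than the rest of the data.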
Looking through the standard deviations of each series, we can see that the call duration (duration) category has a very large standard deviation (row 2), and the maximum value of 4918 seconds is far above the 75th percentile of call time.[1]
Hopefully, the longer calls are people spending time putting in deposit orders!
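A maximum that sits far above the 75th percentile usually points to a long right tail. One hedged way to confirm that is to compare the mean with the median and look at the sample skewness. The durations below are made up for illustration, apart from reusing the 4918-second maximum mentioned above.

```python
import pandas as pd

# Made-up call durations in seconds, for illustration only.
duration = pd.Series(
    [40, 75, 90, 120, 150, 180, 240, 320, 900, 4918], name="duration"
)

q75 = duration.quantile(0.75)
median = duration.median()
mean = duration.mean()

# In a right-skewed series the long tail pulls the mean above the median.
print(f"median={median:.0f}, mean={mean:.0f}, 75th pct={q75:.0f}, max={duration.max()}")

# skew() returns the sample skewness; positive values mean a longer right tail.
print(f"skewness={duration.skew():.2f}")
```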
Categorical Variable Graphing
It would make no sense to find the average value of someone’s marital status.
To learn more about the dataset columns that do not have a numbering system for their possible values, we will graph all of them to see how often each option appears.
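Before graphing, it can help to see the raw counts behind each graph. This is a small sketch using pandas `value_counts()` on made-up rows; picking out the non-numeric columns with `select_dtypes` assumes the categorical fields are stored as text rather than as number codes.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "marital": ["married", "single", "married", "divorced", "single", "married"],
    "contact": ["cellular", "cellular", "telephone", "cellular", "telephone", "cellular"],
    "age": [30, 41, 56, 37, 29, 45],
})

# Assumption: any column that is not numeric is categorical.
categorical_cols = df.select_dtypes(exclude="number").columns

# Count how often each option appears, one table per categorical column.
counts = {col: df[col].value_counts() for col in categorical_cols}

for col, series in counts.items():
    print(f"\n{col}")
    print(series)
```

Each table of counts maps directly onto one graph: one bar per option, with the count as the bar height.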
The descriptive statistics and histograms above show the general shape of your data. Although it’s unlikely that any of these numbers or graphs alone are useful to your business, understanding the bumps and edges of any dataset helps inform which way to go for future analysis.
By looking at these additional graphs in Figure 3.2, we should note that the marketers called a lot more in May and the summer months than in all of the fall and winter, and that there are no entries for January or February.[2]
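Because call volume varies so much by month, raw deposit counts can be misleading when comparing months. The sketch below, on made-up records, puts counts and rates side by side. The `y` column with "yes"/"no" values is an assumed name for the deposit outcome.

```python
import pandas as pd

# Made-up call records for illustration; "y" is an assumed yes/no outcome column.
calls = pd.DataFrame({
    "month": ["may"] * 6 + ["oct"] * 2,
    "y": ["no", "no", "yes", "no", "no", "yes", "yes", "no"],
})

# Turn the yes/no outcome into True/False so it can be summed and averaged.
calls["success"] = calls["y"].eq("yes")

# Per month: number of calls, number of deposits, and the success rate.
by_month = calls.groupby("month")["success"].agg(
    n_calls="count", deposits="sum", rate="mean"
)
print(by_month)
```

In this toy data May has more deposits in total, but October has the higher success rate, which is exactly the kind of reversal that counts alone would hide.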
Now that we’ve gotten a bird’s eye view of our data, we can map out some potential problem areas to avoid.
Coming Up
In the next post, get ready to see how to preemptively avoid common mistakes in stats!
If you know anyone who could benefit from learning how to manage their own analytics, feel free to send them a link to the article.
If you want to stay in the loop for future parts, make sure to subscribe to the newsletter.
[1] For the more math minded: even without knowing more about the success rates at each quartile, we can intuitively tell that this is a right-skewed, long-tail distribution.
[2] If you’re planning on comparing apples to apples, compare months by success rate (deposits divided by calls made) rather than by total number of deposits.