I’ve always been “the math kid”. I obsessed over multiplication tables as a young girl, and I LOVED learning functions and how to graph them. I was always especially enamored with probability and statistics, even from a young age. This started with the introduction of simple concepts like the mean and the median.
Since I was introduced to these terms so young, I always wrote them off as “the baby stuff”. I assumed there was NO WAY that these concepts could be useful as I furthered my knowledge in mathematics and statistics.
Spoiler alert: oh boy, was I wrong.
So I got to college, took some more statistics and probability theory, and promptly realized that a) the mean and median can tell an extremely powerful story if used correctly and b) the mean and median can tell an extremely deceptive story if used incorrectly… or maliciously.
If you haven’t read Naked Statistics by Charles Wheelan yet, I highly highly recommend you check it out. His explanation of why the mean can be misleading was a game changer. For those who haven’t read it, I’ll give you a quick rundown, with a modern twist.
A story of misleading means: 3 Soundcloud rappers and Lil Uzi Vert
Suppose that there are three “up and coming” Soundcloud rappers hanging out at a coffee shop (because that’s what Soundcloud rappers do, duh). To better understand the rappers’ degree of fame, we could determine how many followers each rapper has on the platform. Then, we could take the mean of those values to get an idea of the average number of followers among the rappers. Let’s say that average is 2,496.
Suddenly, the talented and well-known rapper Lil Uzi Vert bursts into the coffee shop to get his oat milk latte (full disclosure: not sure of his coffee order… but oat milk lattes are delish). Anyway, now there are 4 rappers in the coffee shop. Since our population has changed, we must recalculate the mean of their follower counts.
In just a few seconds, the average Soundcloud follower count in the coffee shop increased from 2,496 to 481,997! Whoa, did those 3 up and coming rappers just blow up and go viral? No! Lil Uzi Vert just showed up and his 2.4 million followers pulled the mean way up from its previous value.
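To see the arithmetic for yourself, here is a minimal Python sketch. The individual follower counts are made up: the three small counts are chosen so that their mean is 2,496, and Lil Uzi Vert's count is set to 1,920,500, which is simply the figure that makes the four-person mean come out to 481,997. Treat it as an assumption, not his real follower count.

```python
from statistics import mean, median

# Hypothetical follower counts for the three up-and-coming rappers,
# chosen so that their mean is 2,496.
up_and_coming = [1_500, 2_071, 3_917]

# Assumed count for the fourth rapper (not a real figure): the value
# that makes the mean of all four counts equal 481,997.
lil_uzi_vert = 1_920_500

everyone = up_and_coming + [lil_uzi_vert]

mean_before = mean(up_and_coming)
mean_after = mean(everyone)
median_before = median(up_and_coming)
median_after = median(everyone)

print(f"mean:   {mean_before:,.0f} -> {mean_after:,.0f}")
print(f"median: {median_before:,.0f} -> {median_after:,.0f}")
```

With these made-up numbers, the mean jumps from 2,496 to 481,997, while the median only moves from 2,071 to 2,994. One very large value drags the mean a long way but barely touches the median.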
The mean is sensitive to outliers… so should we even use it?
Clearly, the mean can be heavily influenced by any outliers present in the data set. By the time I graduated college, I swear this was borderline indoctrinated into my mind. So much so, in fact, that until recently, I truly thought that means could not be trusted. I believed that the median was the superior summary statistic.
Spoiler alert #2: oh boy, was I wrong.
Once again, I have Mr. Wheelan to thank. His book revealed to me that means are not only useful, but in some cases, they are vital forms of measurement to better understand a data set.
Yes, means are susceptible to outliers. And yes, the median is not susceptible to outliers. The median gives each observation in the data set an equal amount of weight, no matter its magnitude.
However, this is actually the median’s ultimate downfall. *cue dramatic music*
“Not all data points are created equal”
In our rapper example, there is no denying that the median was the better descriptive statistic. But this is because we wanted to treat each rapper equally when investigating their fame.
When dealing with data sets where not all data points are created equal, the mean triumphs over the median. Charles Wheelan did an amazing job illustrating this in his book, by using a medical trial as an example.
Since my brain thrives off of seeing examples, I took the time after my daily reading to develop my own scenario where the mean would beat out the median as the best descriptive statistic of choice.
An example where the mean might get you a raise (and the median might get you fired)
Suppose that you are an analyst for an e-commerce business. The marketing team is planning a campaign to reach out to your existing customers in hopes of boosting revenue in a slow quarter. They build a beautiful email template with brilliant copy, and blast that email out to about 500,000 of your loyal customers.
10 days go by, and everyone is dyyiiiinnng to see the results of this campaign that the team worked so hard for. You start pulling the data, and naturally, you think to first summarize the data with some descriptive statistics. One option is to find the central tendency of the dollars that were spent by each customer who received the campaign email.
Since you’ve heard that the mean is sensitive to outliers, you decide to use the median as your central tendency measurement. You run your calculations, and your result is a big, fat, terrifying ZERO. Zilch. What? But there were thousands of customers who used the promo code!!! There’s no way in heck that you can deliver these results to the marketing team (at least without putting some people in hot water).
Then, it hits you: most of the people who got the email never bought anything, so of course the middle value is $0. What about the mean? You rerun your calculations and release a huge sigh of relief. The average amount spent per customer who received the campaign email is $1.30. This sounds much better, especially since the cost per email was about $0.008. The mean is more appropriate to capture this specific metric in this specific scenario. Plus, you can deliver these results to your team with confidence.
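Here is a rough sketch of how a campaign like this could produce a median of $0 and a mean of $1.30. Only the 500,000 recipients, the $1.30 mean and the $0.008 cost per email come from the scenario. The 5% purchase rate and the flat $26 order value are assumptions I picked so that the numbers line up.

```python
from statistics import fmean, median

# Figures from the scenario.
recipients = 500_000       # emails sent
cost_per_email = 0.008     # dollars per email

# Assumed figures, chosen only so the mean comes out to $1.30.
buyers = 25_000            # 5% of recipients make a purchase
order_value = 26.00        # dollars spent by each buyer

# One entry per recipient: buyers spend order_value, everyone else spends 0.
spend = [order_value] * buyers + [0.0] * (recipients - buyers)

median_spend = median(spend)   # middle value of the sorted spend list
mean_spend = fmean(spend)      # total dollars divided by all recipients
total_revenue = sum(spend)
total_cost = recipients * cost_per_email

print(f"median spend per recipient: ${median_spend:.2f}")
print(f"mean spend per recipient:   ${mean_spend:.2f}")
print(f"revenue ${total_revenue:,.0f} vs email cost ${total_cost:,.0f}")
```

Since 95% of the entries are zero, the middle value is zero no matter how much the buyers spent. The mean, on the other hand, spreads the buyers' dollars across every recipient, which is exactly the number you want to compare against the cost per email.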
Scenarios where the mean could be the better option
As I mentioned before, the idea of utilizing the median for skewed data has been burned into my brain. But, before reading Naked Statistics, I never knew any rules of thumb for when to choose the mean as your measure of central tendency. Here are two population characteristics that often signify that the mean could be the better choice:
1. We do not lose much for failure.
2. A small portion of the population (say, less than half) shows success. In fact, we may have even expected a low success rate.
In our email example, both 1. and 2. are present. We do not lose much for failure. If they ignore the email, the customers will not be in danger. The price to send out an email is extremely small. Also, it is typical for a small portion of the population to respond to email marketing campaigns. Some say that “good response rates” fall around 5-10%.
When we try to use the median for these types of scenarios, we can miss out on hidden insights. But when we use the mean, we may uncover some secrets about our successes.