In this introductory statistics article, we will explore the mean (formally the arithmetic mean, informally the average) and how it’s used and abused. In later articles we will look at other measures of central tendency, such as the median, the mode, and some others.
When can the mean be calculated?
There are various ways to classify variables. One useful way is to distinguish between continuous and categorical data. Data are continuous if they can (at least in theory) take on any numeric value within a range. Data are categorical if they can only fall into one of a limited set of categories. For example, weight, income, age and IQ are continuous. Choice of whom to vote for (e.g. McCain or Obama), political party, hair color, and marital status are categorical. We will discuss this more in a later article.
When you have continuous data, two things that you often want to know are “What values are likely?” and “How spread out are the values?” Today, we will look at the first question, which, in statisticians’ language, is a question about central tendency. The most common measure of central tendency is the mean, more formally the arithmetic mean, and less formally the average. (To see why the mean makes no sense for categorical data – well, what’s the average of McCain and Obama? Or of married and single? Perhaps the latter is “engaged”?)
How to calculate the mean
The mean is probably familiar, even if you only know it as the average. Add up the numbers, divide by how many numbers there are, and you’ve got the mean. So, for example, if the IQs of the people in your family are
155 (that would be you)
135 (your sister)
70 (her husband)
then the mean is (155 + 135 + 70) / 3 = 120.
Or, suppose the heights of the students in introductory psychology are (in inches, rounded to the nearest inch)
64 65 64 67 64 67 66 70 66 66
66 64 69 69 62 67 64 59 66 67
65 71 67 68 59 69 67 65 68 66
68 67 75 67 69 70 67 76 67 70
68 67 78 67 73 64 75 65 70 68.
The arithmetic mean of the above is 67.36 inches.
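The two calculations above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `arithmetic_mean` is made up for illustration, and the standard library's `statistics.mean` is used as a cross-check.

```python
from statistics import mean

# IQs from the family example above
iqs = [155, 135, 70]

# Heights (inches) of the 50 introductory psychology students
heights = [
    64, 65, 64, 67, 64, 67, 66, 70, 66, 66,
    66, 64, 69, 69, 62, 67, 64, 59, 66, 67,
    65, 71, 67, 68, 59, 69, 67, 65, 68, 66,
    68, 67, 75, 67, 69, 70, 67, 76, 67, 70,
    68, 67, 78, 67, 73, 64, 75, 65, 70, 68,
]

def arithmetic_mean(values):
    """Add up the numbers and divide by how many there are."""
    return sum(values) / len(values)

iq_mean = arithmetic_mean(iqs)          # 360 / 3 = 120.0
height_mean = arithmetic_mean(heights)  # 3368 / 50 = 67.36

# The standard library gives the same answers
library_height_mean = mean(heights)

print(iq_mean, height_mean, library_height_mean)
```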
The mean: When not to use it
The mean is a bad choice if the data are skewed, which means that there is a ‘tail’ to the distribution on one side, but not the other. One common example of this is income. Some people make a whole lot more than the average person, but no one makes that much less. For instance, if the average income in the USA is $30,000 per year (I made that up), then there are some people who make millions more than that, but even the poorest people make at most $30,000 less. When the data are skewed, the median and the trimmed or Winsorized mean are good choices. (You don’t see the trimmed mean much, but it can be very useful.) I will cover these in later articles.
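To see how a single long tail drags the mean around, here is a rough sketch in Python. The incomes are invented for illustration (they are not real data), and `trimmed_mean` is a hypothetical helper that simply drops the k smallest and k largest values before averaging; other definitions of the trimmed mean exist.

```python
from statistics import mean, median

# Hypothetical yearly incomes in dollars (made up for illustration):
# nine modest incomes and one very large one, giving a long right tail.
incomes = [18_000, 22_000, 25_000, 28_000, 30_000,
           32_000, 35_000, 40_000, 45_000, 2_000_000]

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

income_mean = mean(incomes)                # 227500: pulled up by one income
income_median = median(incomes)            # 31000: middle of the sorted values
income_trimmed = trimmed_mean(incomes, 1)  # 32125.0: one value cut from each end

print(income_mean, income_median, income_trimmed)
```

In this made-up sample, nine of the ten people earn far less than the mean, while the median and the trimmed mean both land among the typical incomes.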
The mean is also a bad choice if the data are multimodal, which means they have two or more “humps”. For instance, if you had data on the heights of basketball players and jockeys, taking the overall mean would not be very informative.
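A small sketch of the two-hump problem, again in Python. The heights below are invented for illustration and are not measurements of real jockeys or basketball players.

```python
from statistics import mean

# Hypothetical heights in inches (made up for illustration)
jockeys = [61, 62, 63, 64, 65]
basketball_players = [77, 78, 79, 80, 81]
everyone = jockeys + basketball_players

overall_mean = mean(everyone)            # 71
jockey_mean = mean(jockeys)              # 63
basketball_mean = mean(basketball_players)  # 79

# Distance from the overall mean to the nearest actual person
closest_gap = min(abs(h - overall_mean) for h in everyone)  # 6

print(overall_mean, jockey_mean, basketball_mean, closest_gap)
```

The overall mean of 71 inches falls in the gap between the two groups: nobody in this sample is within 5 inches of it, so it describes neither hump. The two group means are far more informative.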
The mean: What can go wrong
People sometimes try to average things that shouldn’t be averaged. The most common mistake is to average percentages. This is a bad idea. Here are some data from the 2008 presidential election (I will use just 4 states, to keep it simple; the same thing applies with all 50):
State Obama McCain
CA 61% 37%
NY 63% 36%
WY 33% 65%
UT 34% 63%
If one averages the percentages, one gets about 48% for Obama, (61 + 63 + 33 + 34)/4, and about 50% for McCain, (37 + 36 + 65 + 63)/4, but that isn’t right: it treats Wyoming as if it had as many voters as California. A percentage is a form of fraction, and you have to add the numerators and the denominators and then form a new percentage; that is, add up the NUMBER voting Democratic and Republican, and then get the percentage from the total. Here are the vote totals, in millions of people:
State Obama McCain
CA 8.2 5.0
NY 4.8 2.8
WY 0.1 0.2
UT 0.3 0.6
Tot 13.4 8.6
In these four states, Obama got about 61% of the votes cast for the two candidates (13.4 million out of 22.0 million), not 48%.
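The difference between the two approaches can be sketched in Python using the numbers from the two tables. One assumption to note: the denominator here counts only votes for Obama and McCain, since those are the only totals given.

```python
# Vote totals in millions, from the second table
votes = {
    "CA": {"Obama": 8.2, "McCain": 5.0},
    "NY": {"Obama": 4.8, "McCain": 2.8},
    "WY": {"Obama": 0.1, "McCain": 0.2},
    "UT": {"Obama": 0.3, "McCain": 0.6},
}

# Obama's percentage in each state, from the first table
obama_pct = {"CA": 61, "NY": 63, "WY": 33, "UT": 34}

# Wrong: average the percentages, as if every state had the same number of voters
naive_average = sum(obama_pct.values()) / len(obama_pct)  # 47.75

# Right: add the numerators and the denominators, then form the percentage
obama_total = sum(v["Obama"] for v in votes.values())                  # about 13.4
all_total = sum(v["Obama"] + v["McCain"] for v in votes.values())      # about 22.0
pooled_pct = 100 * obama_total / all_total                             # about 60.9

print(naive_average, round(pooled_pct, 1))
```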
As another example, suppose I ask the following:
In September, Joe’s average gas mileage was 30 mpg. In October, it was 20 mpg. What was his average gas mileage for September and October? You might think: 20 + 30 = 50, divided by 2 = 25. But that’s not necessarily the mean, because he might have driven different distances in the two months. If he drove 2000 miles in September, and 500 in October, then in September he used 2000/30 ≈ 66.7 gallons, and in October he used 500/20 = 25 gallons. So, in total, he used about 91.7 gallons to drive 2500 miles, and the mean is 2500/91.7 ≈ 27.3 mpg.
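The same reasoning can be written as a short Python sketch: total the miles, total the gallons, and only then divide. The dictionary layout is just one convenient way to hold the example's numbers.

```python
# (miles driven, miles per gallon) for each month, from the example above
months = {"September": (2000, 30), "October": (500, 20)}

# Wrong: average the two mpg figures directly
naive_mpg = sum(mpg for _, mpg in months.values()) / len(months)  # 25.0

# Right: total miles divided by total gallons
total_miles = sum(miles for miles, _ in months.values())             # 2500
total_gallons = sum(miles / mpg for miles, mpg in months.values())   # about 91.67
overall_mpg = total_miles / total_gallons

print(naive_mpg, round(overall_mpg, 1))
```

The overall figure comes out at about 27.3 mpg when the gallons are not rounded along the way. It sits closer to 30 than to 20 because most of the miles were driven in September.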