Chapter 4
Fitting models (central tendency)
Puzzle 1
What does the variance measure?
The variance measures the fit of the mean: it is the average squared error between the mean and the observed scores.
Puzzle 2
Why does the variance in a sample underestimate the variance in the population?
The variance in the sample will underestimate the variance in the population because the sample is likely to be narrower than the population from which it comes. This is because scores at the extremes of the population occur less often and so are less likely to appear in the sample.
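The underestimation can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the population `[1, 2, 3, 4, 5, 6]` and the sample size of 2 are hypothetical (they do not come from the chapter), and samples are drawn with replacement. It lists every possible sample, computes each sample's variance around its own mean, and averages the results. Dividing by the sample size gives an average that falls short of the population variance, whereas dividing by N − 1 (the degrees of freedom used in the later answers) recovers it in this example.

```python
from fractions import Fraction
from itertools import product

def variance(scores, divisor):
    """Sum of squared deviations from the scores' own mean, divided by `divisor`."""
    mean = Fraction(sum(scores), len(scores))
    return sum((x - mean) ** 2 for x in scores) / divisor

population = [1, 2, 3, 4, 5, 6]  # hypothetical population, not from the chapter
n = 2                            # hypothetical size of each sample

# Variance of the whole population (divide by the number of scores).
population_variance = variance(population, len(population))

# Every possible ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

# Average sample variance under the two choices of divisor.
avg_dividing_by_n = sum(variance(s, n) for s in samples) / len(samples)
avg_dividing_by_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(population_variance)        # 35/12
print(avg_dividing_by_n)          # 35/24
print(avg_dividing_by_n_minus_1)  # 35/12
```

Exact fractions are used so that the comparison is not blurred by rounding.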
Puzzle 3
What does a small standard deviation relative to the mean tell us about our data?
It tells us that most of the scores cluster close to the mean and so the mean is a good ‘fit’ to the data.
Puzzle 4
Milton recruited a group of nine cats and recorded how many lactose-free lattes they drank in a week: 7, 9, 16, 20, 21, 28, 26, 32, 45. Calculate the mean, median and mode of these data.
To compute the mean
$$ \begin{aligned} \bar{X} &= \frac{\sum_{i = 1}^n x_i}{n} \\ &= \frac{7+9+16+20+21+28+26+32+45}{9} \\ &= \frac{204}{9} \\ &= 22.67 \end{aligned} $$
To calculate the median, we first arrange the scores in ascending order: 7, 9, 16, 20, 21, 26, 28, 32, 45.
Next, we find the position of the middle score by counting the number of scores we have collected (n), adding 1 to this value, and then dividing by 2. With 9 scores, this gives us (n + 1)/2 = (9 + 1)/2 = 10/2 = 5. Then we find the score that is positioned at the location we have just calculated. So in this example we find the 5th score, which is 21. Therefore, the median is 21.
The mode is the most frequent score. However, in this dataset all the scores are different so there is no mode (or lots of modes).
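The three calculations above can be checked with a short Python sketch. The variable names are illustrative, and the median step follows the position rule described above, which works as written only when the number of scores is odd.

```python
import statistics

lattes = [7, 9, 16, 20, 21, 28, 26, 32, 45]

# Mean: sum of the scores divided by the number of scores.
mean = sum(lattes) / len(lattes)

# Median: sort, then take the score at position (n + 1)/2.
ordered = sorted(lattes)
position = (len(ordered) + 1) // 2   # 5 for nine scores (n is odd here)
median = ordered[position - 1]       # Python lists are indexed from 0

# Mode: multimode returns every value tied for most frequent.
modes = statistics.multimode(lattes)

print(round(mean, 2))  # 22.67
print(median)          # 21
print(len(modes))      # 9, because every score occurs exactly once
```

Because all nine scores are tied at one occurrence each, `multimode` returns all of them, which matches the "no mode (or lots of modes)" conclusion.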
Puzzle 5
It seems that people are spending more and more time on their electronic devices. Zach asked a group of 20 people how long (in minutes) they spend on their Proteus each day: 65, 125, 34, 90, 45, 25, 10, 22, 22, 24, 30, 50, 60, 65, 34, 90, 100, 15, 20, 35. Calculate the sum of squares, variance and standard deviation of these data.
First we need to compute the mean. Let’s start by summing the scores:
$$ \begin{aligned} \sum_{i = 1}^n x_i &= 65 + 125 + 34 + 90 + 45 + 25 + 10 + 22 + 22 + 24 + \dots \\ \ &+ 30 + 50 + 60 + 65 + 34 + 90 + 100 + 15 + 20 + 35 \\ &= 961 \end{aligned} $$
This gives a mean of
$$ \begin{aligned} \bar{X} &= \frac{\sum_{i = 1}^n x_i}{n} \\ &= \frac{961}{20} \\ &= 48.05. \end{aligned} $$
To calculate the sum of squares, subtract the mean from each score, then square each deviance. Finally, add up these squared deviances (Table 1).
Table 1: Calculating the sums of squared errors

Time ($x$) | Mean ($\bar{X}$) | Deviance ($x-\bar{X}$) | $\text{Deviance}^2$
---|---|---|---
65 | 48.05 | 16.95 | 287.3025
125 | 48.05 | 76.95 | 5921.3025
34 | 48.05 | -14.05 | 197.4025
90 | 48.05 | 41.95 | 1759.8025
45 | 48.05 | -3.05 | 9.3025
25 | 48.05 | -23.05 | 531.3025
10 | 48.05 | -38.05 | 1447.8025
22 | 48.05 | -26.05 | 678.6025
22 | 48.05 | -26.05 | 678.6025
24 | 48.05 | -24.05 | 578.4025
30 | 48.05 | -18.05 | 325.8025
50 | 48.05 | 1.95 | 3.8025
60 | 48.05 | 11.95 | 142.8025
65 | 48.05 | 16.95 | 287.3025
34 | 48.05 | -14.05 | 197.4025
90 | 48.05 | 41.95 | 1759.8025
100 | 48.05 | 51.95 | 2698.8025
15 | 48.05 | -33.05 | 1092.3025
20 | 48.05 | -28.05 | 786.8025
35 | 48.05 | -13.05 | 170.3025
Sum | | | 19,554.95
So, the sum of squared errors is a massive 19,554.95.
The variance is the sum of squared errors divided by the degrees of freedom ($ N - 1 $). There were 20 scores and so the degrees of freedom were 19. The variance is, therefore,
$$ s^2 = \frac{19554.95}{19} = 1029.21. $$
The standard deviation is the square root of the variance
$$ s = \sqrt{1029.21} = 32.08. $$
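The steps above (mean, deviances, sum of squares, variance, standard deviation) can be reproduced with the following sketch. The variable names are illustrative; the divisor N − 1 is the one used in the answer.

```python
import math

minutes = [65, 125, 34, 90, 45, 25, 10, 22, 22, 24,
           30, 50, 60, 65, 34, 90, 100, 15, 20, 35]

n = len(minutes)
mean = sum(minutes) / n

# Deviance of each score from the mean, then the sum of squared deviances.
deviances = [x - mean for x in minutes]
sum_of_squares = sum(d ** 2 for d in deviances)

# Variance divides by the degrees of freedom (N - 1), not by N.
variance = sum_of_squares / (n - 1)
standard_deviation = math.sqrt(variance)

print(round(mean, 2))                # 48.05
print(round(sum_of_squares, 2))      # 19554.95
print(round(variance, 2))            # 1029.21
print(round(standard_deviation, 2))  # 32.08
```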
Puzzle 6
Would you say that the mean in Puzzle 5 ‘fits’ the data well? Explain your answer.
The standard deviation in Puzzle 5 was 32.08 and the mean was 48.05. This is a huge standard deviation relative to the mean: over half its size! This suggests that the data points are distant from the mean and so the mean is not a good ‘fit’ to the data.
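As a rough check on ‘over half its size’, the ratio of the standard deviation to the mean can be worked out from the two values in Puzzle 5 (this ratio is sometimes called the coefficient of variation, a term the original answer does not use):

$$ \frac{s}{\bar{X}} = \frac{32.08}{48.05} \approx 0.67 $$

So the standard deviation is roughly two-thirds the size of the mean.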
Puzzle 7
While Zach was worrying about whether Alice had left him, he ruminated about how successful couples often seem to divorce. Alice is a brilliant scientist and he is a brilliant musician, so perhaps their relationship is doomed. To see if his observation might be true he got The Head to check the (approximate) length in days of some celebrity marriages from before the revolution: 240 (J-Lo and Cris Judd), 144 (Charlie Sheen and Donna Peele), 143 (Pamela Anderson and Kid Rock), 72 (Kim Kardashian, if you can call her a celebrity, and Chris Humphries), 30 (Drew Barrymore and Jeremy Thomas), 26 (Axl Rose and Erin Everly), 2 (Britney Spears and Jason Alexander), 150 (Drew Barrymore again, but this time with Tom Green), 14 (Eddie Murphy and Tracy Edmonds), 150 (Renee Zellweger and Kenny Chesney), 1657 (Jennifer Aniston and Brad Pitt). Compute the mean, median, standard deviation, range and interquartile range for these lengths of celebrity marriages.
First we need to compute the mean:
$$ \begin{aligned} \bar{X} &= \frac{\sum_{i = 1}^n x_i}{n} \\ &= \frac{240+144+143+72+30+26+2+150+14+150+1657}{11} \\ &= \frac{2628}{11} \\ &= 238.91 \end{aligned} $$
Then we compute the standard deviation starting with the sum of squared errors, which is 2,268,540.91 (Table 2).
Table 2: Calculating the sums of squared errors for celebrity marriages

Length of marriage ($x$) | Mean ($\bar{X}$) | Deviance ($x-\bar{X}$) | $\text{Deviance}^2$
---|---|---|---
240 | 238.9091 | 1.091 | 1.190
144 | 238.9091 | -94.909 | 9007.718
143 | 238.9091 | -95.909 | 9198.536
72 | 238.9091 | -166.909 | 27858.614
30 | 238.9091 | -208.909 | 43642.970
26 | 238.9091 | -212.909 | 45330.242
2 | 238.9091 | -236.909 | 56125.874
150 | 238.9091 | -88.909 | 7904.810
14 | 238.9091 | -224.909 | 50584.058
150 | 238.9091 | -88.909 | 7904.810
1657 | 238.9091 | 1418.091 | 2010982.084
Sum | | | 2,268,540.91
The variance is the sum of squared errors divided by the degrees of freedom ($ N - 1 $). There were 11 scores and so the degrees of freedom were 10. The variance is, therefore,
$$ s^2 = \frac{2{,}268{,}540.91}{10} = 226{,}854.09. $$
The standard deviation is the square root of the variance
$$ s = \sqrt{226{,}854.09} = 476.29 \ \text{days}. $$
To calculate the median, range and interquartile range, first arrange the scores in ascending order:
2, 14, 26, 30, 72, 143, 144, 150, 150, 240, 1657.
The median: The median will be the score in position (n + 1)/2. There are 11 scores, so this will be the 12/2 = 6th score. The 6th score in our ordered list is 143. The median length of these celebrity marriages is therefore 143 days.
The lower quartile: This is the median of the lower half of scores. If we split the data at 143 (the 6th score), there are 5 scores below this value. The median of these 5 will be the 6/2 = 3rd score. The 3rd score is 26. The lower quartile is therefore 26 days.
The upper quartile: This is the median of the upper half of scores. If we split the data at 143 again (not including this score), there are 5 scores above this value. The median of these 5 will be the 6/2 = 3rd score above the median. The 3rd score above the median is 150; the upper quartile is therefore 150 days.
The range: This is the highest score (1657) minus the lowest (2), i.e., 1655 days.
The interquartile range: This is the difference between the upper and lower quartile: 150 − 26 = 124 days.
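The quartile procedure described above (take the median, then the medians of the scores below and above it) can be sketched as follows. The helper names `median_of_sorted` and `quartiles` are hypothetical. The rule is coded by hand because library routines may use other conventions, such as interpolating between scores, and can give different quartiles for some data sets.

```python
def median_of_sorted(ordered):
    """Median of an already sorted list: the middle score, or the mean of the middle two."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(scores):
    """Lower quartile, median and upper quartile using the halves either side of the median."""
    ordered = sorted(scores)
    half = len(ordered) // 2             # the middle score is left out when n is odd
    lower_half = ordered[:half]
    upper_half = ordered[len(ordered) - half:]
    return (median_of_sorted(lower_half),
            median_of_sorted(ordered),
            median_of_sorted(upper_half))

days = [240, 144, 143, 72, 30, 26, 2, 150, 14, 150, 1657]

lower_q, median, upper_q = quartiles(days)
data_range = max(days) - min(days)
iqr = upper_q - lower_q

print(lower_q, median, upper_q)  # 26 143 150
print(data_range)                # 1655
print(iqr)                       # 124
```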
Puzzle 8
Repeat Puzzle 7 but excluding Jennifer Aniston and Brad Pitt’s marriage. How does this affect the mean, median, range, interquartile range and standard deviation? What do the differences in values between Puzzles 7 and 8 tell us about the influence of unusual scores on these measures?
First let’s compute the new mean
$$ \begin{aligned} \bar{X} &= \frac{\sum_{i = 1}^n x_i}{n} \\ &= \frac{240+144+143+72+30+26+2+150+14+150}{10} \\ &= \frac{971}{10} \\ &= 97.1. \end{aligned} $$
The mean length of celebrity marriages is now 97.1 days, compared to 238.91 days when Jennifer Aniston and Brad Pitt’s marriage was included. This demonstrates that the mean is greatly influenced by extreme scores. Let’s now calculate the standard deviation excluding Jennifer Aniston and Brad Pitt’s marriage, starting with the sum of squared errors, which is 56,460.9 (Table 3).
Table 3: Calculating the sums of squared errors for celebrity marriages (excluding Jennifer Aniston and Brad Pitt)

Length of marriage ($x$) | Mean ($\bar{X}$) | Deviance ($x-\bar{X}$) | $\text{Deviance}^2$
---|---|---|---
240 | 97.1 | 142.9 | 20420.41
144 | 97.1 | 46.9 | 2199.61
143 | 97.1 | 45.9 | 2106.81
72 | 97.1 | -25.1 | 630.01
30 | 97.1 | -67.1 | 4502.41
26 | 97.1 | -71.1 | 5055.21
2 | 97.1 | -95.1 | 9044.01
150 | 97.1 | 52.9 | 2798.41
14 | 97.1 | -83.1 | 6905.61
150 | 97.1 | 52.9 | 2798.41
Sum | | | 56,460.90
The variance is the sum of squared errors divided by the degrees of freedom ($ N - 1 $). There were 10 scores and so the degrees of freedom were 9. The variance is, therefore,
$$ s^2 = \frac{56{,}460.9}{9} = 6273.43. $$
The standard deviation is the square root of the variance
$$ s = \sqrt{6273.43} = 79.21 \ \text{days}. $$
From these calculations we can see that the variance and standard deviation, like the mean, are both greatly influenced by extreme scores. When Jennifer Aniston and Brad Pitt’s marriage was included in the calculations (see Puzzle 7), the variance and standard deviation were much larger: 226,854.09 and 476.29, respectively. Therefore, with the outlier (Jen and Brad’s marriage) removed, the mean is a better ‘fit’ to the data.
To calculate the median, range and interquartile range, first, arrange the scores in ascending order but this time excluding Jennifer Aniston and Brad Pitt’s marriage:
2, 14, 26, 30, 72, 143, 144, 150, 150, 240.
The median: The median will be the score in position (n + 1)/2. There are now 10 scores, so this will be the 11/2 = 5.5th score. Therefore, we take the average of the 5th score and the 6th score. The 5th score is 72, and the 6th is 143; the median is therefore 107.5 days. Like the mean, the median has reduced because the extreme score has been omitted; however, the reduction is less dramatic than it was for the mean, showing how the median is relatively less affected by extreme scores than the mean.
The lower quartile: This is the median of the lower half of scores. If we split the data at 107.5 (this score is not actually present in the data set), there are 5 scores below this value. The median of these 5 will be the 6/2 = 3rd score. The 3rd score is 26; the lower quartile is therefore 26 days.
The upper quartile: This is the median of the upper half of scores. If we split the data at 107.5 (this score is not actually present in the data set), there are 5 scores above this value. The median of these 5 will be the 6/2 = 3rd score above the median. The 3rd score above the median is 150; the upper quartile is therefore 150 days.
The range: This is the highest score (240) minus the lowest (2), i.e., 238 days. You’ll notice that without the extreme score the range drops dramatically from 1655 to 238, around a seventh of the size.
The interquartile range: This is the difference between the upper and lower quartile: 150 − 26 = 124 days of marriage. This is the same as the value we got when Jennifer Aniston and Brad Pitt’s marriage was included. This demonstrates the advantage of the interquartile range over the range: it isn’t affected by extreme scores at either end of the distribution.
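The comparison between Puzzles 7 and 8 can be laid out side by side with a short sketch. The function name `summarise` is hypothetical, and the quartiles are taken as the medians of the lower and upper halves, as in the worked answers, rather than with a library quantile routine.

```python
import statistics

with_outlier = [240, 144, 143, 72, 30, 26, 2, 150, 14, 150, 1657]
without_outlier = [x for x in with_outlier if x != 1657]

def summarise(scores):
    """Mean, sample SD (N - 1 divisor), median, range and IQR of the scores."""
    ordered = sorted(scores)
    half = len(ordered) // 2
    lower_q = statistics.median(ordered[:half])
    upper_q = statistics.median(ordered[len(ordered) - half:])
    return {
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),
        "median": statistics.median(scores),
        "range": ordered[-1] - ordered[0],
        "iqr": upper_q - lower_q,
    }

before = summarise(with_outlier)
after = summarise(without_outlier)

# Print each measure with and without the extreme score.
for name in before:
    print(f"{name:>6}: {before[name]:10.2f} {after[name]:10.2f}")
```

The mean, standard deviation and range change substantially, the median changes moderately, and the interquartile range is identical in both columns.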
Puzzle 9
Zach asked Nick to get 15 of their fans on memoryBank to rate his new song, ‘The Gene Mixer’, on a scale ranging from 0 (the worst thing you’ve ever recorded) to 10 (the best thing you’ve ever recorded). The ratings were: 3, 5, 7, 8, 2, 4, 10, 8, 8, 5, 5, 7, 9, 10, 6. Calculate the mean, standard deviation, median, range and interquartile range for these ratings of the song.
First we need to compute the mean:
$$ \begin{aligned} \bar{X} &= \frac{\sum_{i = 1}^n x_i}{n} \\ &= \frac{3+5+7+8+2+4+10+8+8+5+5+7+9+10+6}{15} \\ &= \frac{97}{15} \\ &= 6.47. \end{aligned} $$
Then we compute the standard deviation, starting with the sum of squared errors, which is 83.73 (Table 4).
Table 4: Calculating the sums of squared errors for ratings of Zach's song

Rating of song ($x$) | Mean ($\bar{X}$) | Deviance ($x-\bar{X}$) | $\text{Deviance}^2$
---|---|---|---
3 | 6.466667 | -3.4666667 | 12.0177778
5 | 6.466667 | -1.4666667 | 2.1511111
7 | 6.466667 | 0.5333333 | 0.2844444
8 | 6.466667 | 1.5333333 | 2.3511111
2 | 6.466667 | -4.4666667 | 19.9511111
4 | 6.466667 | -2.4666667 | 6.0844444
10 | 6.466667 | 3.5333333 | 12.4844444
8 | 6.466667 | 1.5333333 | 2.3511111
8 | 6.466667 | 1.5333333 | 2.3511111
5 | 6.466667 | -1.4666667 | 2.1511111
5 | 6.466667 | -1.4666667 | 2.1511111
7 | 6.466667 | 0.5333333 | 0.2844444
9 | 6.466667 | 2.5333333 | 6.4177778
10 | 6.466667 | 3.5333333 | 12.4844444
6 | 6.466667 | -0.4666667 | 0.2177778
Sum | | | 83.73
The variance is the sum of squared errors divided by the degrees of freedom ($ N - 1 $). There were 15 scores and so the degrees of freedom were 14. The variance is, therefore,
$$ s^2 = \frac{83.73}{14} = 5.98. $$
The standard deviation is the square root of the variance
$$ s = \sqrt{5.98} = 2.45. $$
To calculate the median, range and interquartile range, first arrange the scores in ascending order:
2, 3, 4, 5, 5, 5, 6, 7, 7, 8, 8, 8, 9, 10, 10.
The median: The median will be the score in position (n + 1)/2. There are 15 scores, so this will be the 16/2 = 8th score, which in our ordered list is 7.
The lower quartile: This is the median of the lower half of scores. If we split the data at 7 (the 8th score), there are 7 scores below this value. The median of these 7 will be the 8/2 = 4th score. The 4th score is 5. The lower quartile is therefore 5.
The upper quartile: This is the median of the upper half of scores. If we split the data at the 8th score again, there are 7 scores above this value. The median of these 7 will be the 8/2 = 4th score above the median. The 4th score above the median is 8; the upper quartile is therefore 8.
The range: This is the highest score (10) minus the lowest (2), i.e., 8.
The interquartile range: This is the difference between the upper and lower quartile: 8 − 5 = 3.
Puzzle 10
Is the mean in Puzzle 9 a good ‘fit’ to the data? Explain your answer.
The mean in Puzzle 9 was 6.47 and the standard deviation was 2.45. A standard deviation of 2.45 is very large relative to a mean of 6.47, suggesting that the mean is not a particularly good ‘fit’ to the data.