Standard Deviation
What is the standard deviation?
The standard deviation measures the spread of a set of data values. A high standard deviation indicates a wide spread of data values, while a low standard deviation indicates a narrow spread of values clustered around the mean of the data set.
How is the standard deviation used?
The standard deviation is used to investigate variability in a set of data values. It is also used in conjunction with the mean for calculating statistical intervals, hypothesis test statistics, and control chart limits.
What are some issues to think about regarding the standard deviation?
The standard deviation can be affected by extreme values and/or small data sets. Be sure to consider how outliers may be affecting your analysis. Also, the standard deviation is only relevant for continuous data.
The standard deviation describes the spread of a set of data.
Suppose you have a set of data values and plot them as in the graphs below. The horizontal axis shows your data values. The vertical axis measures the frequency of each data value. In statistical terms, this is a histogram, or distribution, of data values. The standard deviation is a single number that estimates the spread, or width, of the data.
What is the population standard deviation?
In statistics, the population is the entire set of data that you are trying to understand and draw some conclusions about. In many cases, due to the sheer size of the population, it is impossible to collect data about every element of a population. In these situations, the population standard deviation measures the spread of the theoretical population and is almost always unknown.
Let’s think about an example where you do know the population. Suppose you want to know the spread of wind speeds at landfall for Atlantic hurricanes since 1950. This is a relatively small population. Since data is readily available for all Atlantic hurricanes since 1950 that have made landfall, you can calculate the population standard deviation.
What is the sample standard deviation?
To estimate the unknown population standard deviation, you collect a sample of data. Then you calculate the standard deviation of that sample. The sample standard deviation measures the spread of the data in your sample. This is an estimate of the population standard deviation.
What is the difference between the standard deviation and the variance?
The standard deviation is the square root of the variance. Both standard deviation and variance are measures of spread. The standard deviation is in the same units as your data. For example, if you measure age in years, the standard deviation is also in years, which is one reason that people use the standard deviation instead of the variance. “Age in years” is simpler to think about than “squared age in years.”
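As a minimal sketch of this relationship, the snippet below uses Python's standard `statistics` module on a small set of ages. The ages are made up for illustration and are not part of the original article.

```python
import math
import statistics

# Hypothetical ages in years (illustrative values only).
ages_years = [23, 31, 35, 42, 58]

variance = statistics.variance(ages_years)  # sample variance, in squared years
std_dev = statistics.stdev(ages_years)      # sample standard deviation, in years

# The standard deviation is the square root of the variance.
print(math.isclose(std_dev, math.sqrt(variance)))  # True
print(f"variance = {variance:.1f} years^2, standard deviation = {std_dev:.1f} years")
```

The variance is reported in squared years, while the standard deviation is back in years, the same units as the data.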
What is the difference between the standard deviation and the coefficient of variation (CV)?
The coefficient of variation, or CV, is the standard deviation divided by the mean. Because it expresses spread relative to the mean, the CV lets you compare the variability of data sets on a common scale. It is also used as an indicator of the precision of a measurement system.
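The sketch below illustrates the "common scale" idea under some assumptions: the helper name `coefficient_of_variation` and the weight measurements are hypothetical, and the mean is assumed to be nonzero. The same measurements recorded in kilograms and in grams have very different standard deviations but the same CV.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean (mean must be nonzero)."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical repeated weight measurements, recorded on two scales.
weights_kg = [70.0, 72.0, 68.0, 71.0, 69.0]
weights_g = [w * 1000 for w in weights_kg]

print(statistics.stdev(weights_kg), statistics.stdev(weights_g))  # differ by a factor of 1000
print(coefficient_of_variation(weights_kg))
print(coefficient_of_variation(weights_g))  # same CV, up to floating-point rounding
```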
What are the possible values of the standard deviation?
The standard deviation is never negative, and it is almost always positive. One exception: if all of the values in your data set are the same, then the standard deviation is zero. There is no variability or spread in the data.
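A quick check of the zero case, using Python's `statistics` module and two made-up data sets:

```python
import statistics

identical = [72, 72, 72, 72]  # hypothetical: every value is the same
varied = [70, 72, 74, 76]     # hypothetical: values differ

print(statistics.stdev(identical))   # 0.0, no spread at all
print(statistics.stdev(varied) > 0)  # True
```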
How to calculate the standard deviation
To calculate the sample standard deviation, first calculate the sample mean. Then for each data value, find the difference between the value and the sample mean. Next, square these differences and sum them. Then divide that sum by the number of data values minus one to get the sample variance. Finally, take the square root of the sample variance to obtain the sample standard deviation. The standard deviation is in the same units as the data.
Let’s explore this calculation with a simple example. Suppose you measure the resting heart rate of six people. Most people have a resting heart rate between 60 and 100 beats per minute (BPM). Athletes can have a healthy resting heart rate as low as 40. High heart rates can be a health concern or simply a result of measuring the heart rate during exercise.
Suppose your data values are:
Heart rate (BPM) |
---|
55 |
60 |
65 |
75 |
80 |
85 |
First, calculate the sample mean by adding up the data values and dividing by the number of values:
$\frac{(55+60+65+75+80+85)}{6} = \frac{420}{6} = 70$
Next, calculate the difference between each data value and the sample mean:
Difference from mean |
---|
55-70 = -15 |
60-70 = -10 |
65-70 = -5 |
75-70 = 5 |
80-70 = 10 |
85-70 = 15 |
By calculating the differences, you get an idea of how far each data value is from the sample mean.
Next, square the differences. If you simply added up the differences, you would get zero, suggesting that there was no spread in the data. That’s not true. By squaring the differences before adding them up, you get a positive measure for distance from the mean for both the points above and below the sample mean.
Difference from mean | Squared difference |
---|---|
55-70 = -15 | 225 |
60-70 = -10 | 100 |
65-70 = -5 | 25 |
75-70 = 5 | 25 |
80-70 = 10 | 100 |
85-70 = 15 | 225 |
Next, take the sum of the squared differences:
$225+100+25+25+100+225=700$
Since there are six data values, divide the sum above by 6 – 1 = 5:
$\frac{700}{5} = 140$
Why not divide by 6? The simple answer is that the sample mean was used in these calculations. If you know the sample mean and five of the data values, you can calculate the sixth data value, so only five of the six differences are free to vary. In statistical terms, calculating the sample mean uses up one degree of freedom. Statistically, when you divide by n-1, you obtain an unbiased estimate of the variance.
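The following sketch shows the constraint behind the lost degree of freedom, using the six heart rate values from this example. Variable names are illustrative.

```python
heart_rates = [55, 60, 65, 75, 80, 85]  # BPM, from the example
n = len(heart_rates)
sample_mean = sum(heart_rates) / n  # 70.0

# The differences from the sample mean always sum to zero.
differences = [x - sample_mean for x in heart_rates]
print(sum(differences))  # 0.0

# Knowing the sample mean and any five values pins down the sixth.
first_five = heart_rates[:5]
recovered_sixth = n * sample_mean - sum(first_five)
print(recovered_sixth)  # 85.0
```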
At this point, you have determined the sample variance. It is in the units of "squared beats per minute," which is difficult to interpret. So the final step is to take the square root to get the sample standard deviation:
$\sqrt{140} \approx 11.8$
Based on the sample of six people, the sample mean is 70 BPM, with a sample standard deviation of 11.8 BPM. This is plausible, since the data values lie between 15 BPM below and 15 BPM above the mean.
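The steps of this worked example can be sketched in plain Python as follows. The variable names are illustrative; the arithmetic mirrors the calculation above.

```python
import math

heart_rates = [55, 60, 65, 75, 80, 85]  # BPM, from the example

n = len(heart_rates)
sample_mean = sum(heart_rates) / n                     # 420 / 6 = 70.0
differences = [x - sample_mean for x in heart_rates]   # -15.0, -10.0, -5.0, 5.0, 10.0, 15.0
squared_differences = [d ** 2 for d in differences]    # 225.0, 100.0, 25.0, 25.0, 100.0, 225.0
sum_of_squares = sum(squared_differences)              # 700.0
sample_variance = sum_of_squares / (n - 1)             # 700 / 5 = 140.0
sample_std_dev = math.sqrt(sample_variance)            # square root of 140

print(round(sample_std_dev, 1))  # 11.8
```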
Typically you will use software to calculate the sample standard deviation. The formula for the sample standard deviation is:
$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \overline{x})^2}{n-1}}$
In the formula above, the sample has n data values. Each data value is represented by an x. The symbol x̅ represents the sample mean. The Σ symbol is the summation symbol; in this formula, it means that each of the squared differences between a data value and the sample mean should be added up, just as in the example.
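As one example of letting software do the work, Python's standard `statistics` module provides `stdev`, which applies the sample formula with the n-1 divisor. This is only one of many possible tools; the article does not prescribe a particular package.

```python
import statistics

heart_rates = [55, 60, 65, 75, 80, 85]  # BPM, from the example

sample_mean = statistics.mean(heart_rates)
sample_sd = statistics.stdev(heart_rates)  # sample standard deviation (divides by n - 1)

print(sample_mean)
print(round(sample_sd, 1))  # 11.8
```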
Population standard deviation
In the rare situations where you have data for the entire population, the calculation of the standard deviation differs slightly from the calculation for a sample: you divide by the population size instead of by the sample size minus one. For the entire population, the size of the population is denoted with a capital N. The formula is:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$
The formula above uses the population size (N) and the population mean (μ). The idea behind the formula is the same as the formula for the sample standard deviation.
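To see how the divisor changes the result, the sketch below applies both formulas to the six heart rate values. Treating those six values as a complete population is an assumption made purely for illustration. In Python's `statistics` module, `pstdev` divides by N and `stdev` divides by n-1.

```python
import statistics

heart_rates = [55, 60, 65, 75, 80, 85]  # BPM; treated as a whole population for illustration only

sample_sd = statistics.stdev(heart_rates)       # divides the sum of squares by n - 1 = 5
population_sd = statistics.pstdev(heart_rates)  # divides the sum of squares by N = 6

print(round(sample_sd, 1))      # 11.8
print(round(population_sd, 1))  # 10.8
```

If you use NumPy instead, note that `numpy.std` divides by N by default; passing `ddof=1` gives the sample formula.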
Understanding the standard deviation
Visualizing the standard deviation
Figure 3 below illustrates how the standard deviation is an estimate of the spread of your data values. The center line shows the sample mean (70) of the six heart rate data values from the previous example. For two of the values (65 and 80), the plot highlights the calculation of the difference from the mean.
You can see that differences are negative when the data value is lower than the mean and positive when the data value is higher than the mean. By squaring the differences, the positive and negative differences don’t cancel each other out.
By adding up all the squared differences, you get the combined spreads between each data value and the mean. Smaller sums indicate that there is a smaller spread of data values; larger sums mean there's a larger spread of data values.
Interpreting the standard deviation
Most of the time, you report both the mean and standard deviation. This helps put the standard deviation in context.
Smaller standard deviations tell you that more of your data values are close to the sample mean. Larger standard deviations tell you that your data values are more spread out and that some values are further away from the sample mean.
For example, in Figure 4 below, suppose the sample mean for your data is 13. When the sample standard deviation is 3, represented by the solid orange line, you can see that more of the data is close to the sample mean. When the sample standard deviation is 6, represented by the dotted blue line, then the data is more spread out. Some values are farther from the sample mean.
How do extreme data values affect the sample standard deviation?
Extreme data values can have a significant impact on the sample standard deviation. Let’s continue our heart rate example.
Earlier, our data values for heart rate were:
Heart rate (BPM) |
---|
55 |
60 |
65 |
75 |
80 |
85 |
We found a sample mean of 70 BPM and a sample standard deviation of 11.8 BPM.
Suppose we now have the heart rate for one additional person:
Heart rate (BPM) |
---|
55 |
60 |
65 |
75 |
80 |
85 |
140 |
We won't walk through all of the calculations again, but we now have a sample mean of 80 BPM and a sample standard deviation of 28.6 BPM. This single extreme value had a significant effect on both the sample mean and the sample standard deviation.
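For readers who want to check these numbers, here is a short sketch using Python's `statistics` module. Variable names are illustrative.

```python
import statistics

heart_rates = [55, 60, 65, 75, 80, 85]  # original six values, in BPM
with_extreme = heart_rates + [140]      # add the seventh, extreme value

for label, data in [("without 140", heart_rates), ("with 140", with_extreme)]:
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    print(f"{label}: mean = {mean:.1f}, sd = {sd:.1f}")

# without 140: mean = 70.0, sd = 11.8
# with 140: mean = 80.0, sd = 28.6
```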
CAUTION! Don’t delete an extreme data value just because it doesn't look right. First try to find out if the extreme data value is due to an error of some kind. If it is the result of an error, then you should try to find the correct value. If you cannot determine that an error occurred, then you should not omit the extreme data value. In this situation, you may decide to report your analysis both with and without the questionable data point.
For the heart rate data, the extreme value could be the real resting heart rate for a person. In that case, you would want to keep it in the data. Or the extreme value could be the heart rate from someone immediately after exercising, which is different from the other data values that measure resting heart rate. The point is you need to investigate further before deciding how to handle extreme data values.
Using statistical symbols
Population standard deviation and variance
The population standard deviation is shown in formulas by the Greek letter “sigma.” The symbol is σ.
The population variance is shown as σ².
Many statistical formulas use σ when defining hypothesis tests or in formulas for analyses.
Remember that almost all of the time, you will not know the population standard deviation or population variance.
Sample standard deviation and variance
The sample standard deviation is shown in formulas by an italic lowercase s.
The sample variance is shown in formulas as s².
When to use the standard deviation
Continuous data: YES
The standard deviation makes sense for continuous data. This data is measured on a scale with many possible values. Some examples of continuous data are:
- Age
- Blood pressure
- Weight
- Temperature
- Speed
For all of these examples, it makes sense to calculate the standard deviation.
Ordinal or nominal data: NO
As defined here, the standard deviation does not make sense for ordinal or nominal data. This data is measured on a scale with only a few possible values. There are other statistics that estimate the spread of a set of ordinal or nominal data values.
Ordinal data is typically divided into groups with a specific ordering. For example, suppose you take a survey where you are asked to give your opinion on a scale from “Strongly Disagree” to “Strongly Agree.” Your responses are ordinal – see Figure 6 below.
Nominal data also divides the sample into groups but doesn’t have any particular ordering. Two examples are biological sex and country of residence (Figure 7). You can use M for Male and F for Female in your sample, or you can use 0 and 1. For country, you can use the country abbreviation, or you can use numbers to code the country name. If you use numbers for this data, you can calculate the sample standard deviation, but it won’t make any sense.
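The sketch below shows why such a result is meaningless. The country sample and both numeric codings are hypothetical; the point is that the "standard deviation" changes when you change an arbitrary coding, even though the data itself is unchanged.

```python
import statistics

# Hypothetical sample of countries of residence.
countries = ["US", "FR", "JP", "FR", "US", "JP"]

# Two equally arbitrary ways to code the same categories as numbers.
coding_a = {"US": 1, "FR": 2, "JP": 3}
coding_b = {"US": 10, "FR": 200, "JP": 3}

sd_a = statistics.stdev([coding_a[c] for c in countries])
sd_b = statistics.stdev([coding_b[c] for c in countries])

# Same sample, different codings, different results.
print(sd_a, sd_b)
```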
Other measures of variability
The standard deviation is one way to estimate the spread of your data. The range and interquartile range (IQR) also estimate spread. Unlike the standard deviation, neither of these statistics involves the center of the data. These statistics can be used with small data sets (the range) or skewed data sets (IQR).
Range
The range is the difference between the lowest value and the highest value in your data.
Interquartile range (IQR)
The interquartile range is the difference between the 25th and 75th percentiles in your data. Because it ignores the lowest quarter and the highest quarter of the data values, the IQR is less affected by extreme values than either the range or the standard deviation. If your data has extreme values or is skewed, then the IQR may be a good choice to describe the variability in your data set.
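As a rough sketch, the range and IQR can be computed for the heart rate data with and without the extreme value of 140. The helper name `iqr` is illustrative. Software packages use different conventions for calculating percentiles, so the IQR from another tool may differ slightly from what `statistics.quantiles` with `method="inclusive"` returns here.

```python
import statistics

heart_rates = [55, 60, 65, 75, 80, 85]  # BPM
with_extreme = heart_rates + [140]

def iqr(values):
    """75th percentile minus 25th percentile, using one of several common conventions."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

range_without = max(heart_rates) - min(heart_rates)   # 85 - 55 = 30
range_with = max(with_extreme) - min(with_extreme)    # 140 - 55 = 85

iqr_without = iqr(heart_rates)
iqr_with = iqr(with_extreme)

print(f"range: {range_without} -> {range_with}")
print(f"IQR:   {iqr_without} -> {iqr_with}")
```

The single extreme value changes the range far more than it changes the IQR.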