Scatter Plot
What is a scatter plot?
A scatter plot shows the relationship between two continuous variables, x and y. The values for each variable correspond to positions on the x- and y-axis respectively. A dot or some other symbol is placed at the (x, y) coordinates for each pair of values. The pattern of the dots can provide clues regarding how the two variables are related.
How are scatter plots used?
Scatter plots are used to show relationships. For correlation, scatter plots help show the strength of the linear relationship between two variables. For regression, scatter plots often add a fitted line. In quality control, scatter plots often include specification limits or reference lines.
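To make "strength of the linear relationship" concrete, the Pearson correlation coefficient can be computed directly from the paired values. The following is a minimal sketch in plain Python. The employee and profit numbers are invented for illustration and are not the data behind any figure in this article.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of squares and cross-products around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: number of employees and profit (arbitrary units).
employees = [10, 25, 40, 60, 85, 120]
profit = [1.2, 2.0, 2.9, 4.1, 5.0, 7.3]

r = pearson_r(employees, profit)
print(f"r = {r:.3f}")  # values near +1 suggest a strong increasing linear relationship
```

Values of r near +1 or -1 correspond to points that fall close to a straight line, while values near 0 correspond to a cloud with no linear trend.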
Scatter plots show relationships
Scatter plots show how two continuous variables are related by putting one variable on the x-axis and a second variable on the y-axis.
A scatter plot for regression includes the response variable on the y-axis and the input variable on the x-axis.
Scatter plot examples
Example 1: Increasing relationship
The scatter plot in Figure 1 shows an increasing relationship. The x-axis shows the number of employees in a company, while the y-axis shows the profits for the company. The scatter plot shows that as the number of employees increases, the profit increases. Companies with fewer employees (at the left side of the graph) have lower profits, and companies with more employees have higher profits. This is a deliberately simplified example, since many other variables can affect a company’s profits.
Example 2: Decreasing relationship
The scatter plot in Figure 2 shows a decreasing relationship. The x-axis shows the grams of sodium for a type of processed meat; the y-axis shows the cost per kilogram of protein. The scatter plot reveals that as sodium increases, the protein cost decreases. Meat with lower sodium amounts (at the left side of the graph) has higher protein costs, while higher-sodium meat has lower protein costs. This makes sense, since salt can be added to lower-quality (thus, lower-cost) meat, improving its taste, yet increasing the sodium amount.
Example 3: No relationship
The scatter plot in Figure 3 shows no relationship between two variables. The x-axis shows the size of a load for prewashing denim fabric; the y-axis shows the measured thread wear. The scatter plot shows a random cloud of points. While some might see a slight decrease in thread wear as the load size increases along the right side of the graph, we can use simple linear regression to check this idea.
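As a rough sketch of the simple linear regression check mentioned above, the code below fits a least-squares line and reports the slope and R-squared. The load sizes and thread-wear values are hypothetical, not the data in Figure 3. A slope close to zero together with a small R-squared is consistent with "no relationship"; a formal check would also look at the standard error of the slope, which this sketch leaves out.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variation in y explained by the fitted line.
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_resid / ss_total
    return slope, intercept, r_squared

# Hypothetical load sizes and thread-wear measurements.
load_size = [10, 12, 14, 16, 18, 20, 22, 24]
thread_wear = [5.1, 4.8, 5.3, 4.9, 5.2, 5.0, 4.7, 5.0]

slope, intercept, r_squared = fit_line(load_size, thread_wear)
print(f"slope = {slope:.4f}, intercept = {intercept:.3f}, R^2 = {r_squared:.3f}")
```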
Example 4: Curved relationship
The scatter plot in Figure 4 shows a curved relationship between two variables. The x-axis shows the birth rate for a group of countries; the y-axis shows the death rate. The scatter plot shows a decreasing relationship up to a birth rate between 25 and 30. After that point, the relationship changes to increasing.
Example 5: Outliers in scatter plots
Unusual points, or outliers, in the data stand out in scatter plots.
Figure 5 shows a scatter plot with an outlier, while Figure 6 shows the same data without the outlier. The single outlier in the upper right corner stretches both axes, which compresses the remaining points and makes their pattern harder to see. When there is an unusual data point on a scatter plot, you can investigate to find out the reason for the outlier. You may want to display the data with and without the outlier.
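One way to prepare "with" and "without" views is to flag candidate outliers first. The sketch below uses the common 1.5 × IQR rule on each axis. This rule is an assumption chosen for illustration, not necessarily how the point in Figure 5 was identified, and the data values are hypothetical.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Lower and upper fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def split_outliers(points, k=1.5):
    """Split (x, y) points into (kept, flagged) using fences on each axis."""
    x_lo, x_hi = iqr_fences([x for x, _ in points], k)
    y_lo, y_hi = iqr_fences([y for _, y in points], k)
    kept, flagged = [], []
    for x, y in points:
        inside = x_lo <= x <= x_hi and y_lo <= y <= y_hi
        (kept if inside else flagged).append((x, y))
    return kept, flagged

# Hypothetical data with one point far in the upper right corner.
points = [(1, 2.1), (2, 2.9), (3, 4.2), (4, 4.8),
          (5, 6.1), (6, 7.0), (7, 7.9), (30, 40.0)]

kept, flagged = split_outliers(points)
print("flagged:", flagged)
# The x-axis range each plot would need, with and without the flagged points.
print("x range with outlier:   ", min(x for x, _ in points), "to", max(x for x, _ in points))
print("x range without outlier:", min(x for x, _ in kept), "to", max(x for x, _ in kept))
```

Flagged points are candidates to investigate, not points to delete automatically.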
Customizing scatter plots
Colors and markers can be used to show additional variables on a scatter plot. Reference lines can be added to indicate such things as specification limits.
Using colors and markers
Figure 7 shows a scatter plot of weight versus horsepower for 116 models of cars.
From the basic plot, we see an increasing relationship. Heavier cars have more horsepower; lighter cars have less.
The country of origin for the cars is specified as the United States, Japan, or other, and the types of car are specified as sporty, compact, small, medium, or large. The basic scatter plot can be enhanced by using colors and markers for these two variables.
The scatter plot in Figure 8 uses colors to distinguish the data points for the three values for country of origin.
It's easy to see that cars with horsepower above 225 are from Japan or the US. The lowest horsepower cars do not include any cars from the US.
Different markers for the different types of cars can also be added, as shown in Figure 9.
Cars with horsepower of 200 or greater are either medium or sporty, as shown by the squares and circles. The cars with the lowest horsepower are all small cars, as shown by the upward triangles. The heaviest car of all is a large car made in the US, as shown by the green diamond near the top of the chart, but this car has average horsepower.
With your data, explore the options of using colors, markers, or both to add dimensions to a scatter plot.
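Outside of JMP, the same idea can be sketched with Python and matplotlib: group the points by the two extra variables, then draw each group with its own color and marker. The records below are hypothetical, and the color and marker assignments are assumptions that only loosely follow the description above.

```python
from collections import defaultdict

# Hypothetical records: (horsepower, weight in lbs, country, car type).
cars = [
    (95, 2300, "Japan", "small"),
    (110, 2600, "Other", "small"),
    (140, 3000, "USA", "compact"),
    (160, 3300, "Japan", "medium"),
    (210, 3500, "USA", "medium"),
    (255, 3400, "Japan", "sporty"),
    (170, 4100, "USA", "large"),
]

# Assumed style mappings: color encodes country, marker encodes car type.
COLORS = {"USA": "tab:green", "Japan": "tab:blue", "Other": "tab:red"}
MARKERS = {"sporty": "o", "compact": "v", "small": "^", "medium": "s", "large": "D"}

def group_by_style(records):
    """Group (horsepower, weight) points by their (country, car type) pair."""
    groups = defaultdict(list)
    for hp, weight, country, car_type in records:
        groups[(country, car_type)].append((hp, weight))
    return dict(groups)

def plot_cars(records, path="cars.png"):
    """Draw one scatter series per (country, car type) group and save it."""
    # Imported here so group_by_style() works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for (country, car_type), pts in group_by_style(records).items():
        xs, ys = zip(*pts)
        ax.scatter(xs, ys, color=COLORS[country], marker=MARKERS[car_type],
                   label=f"{country}, {car_type}")
    ax.set_xlabel("Horsepower")
    ax.set_ylabel("Weight (lbs)")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
```

Calling `plot_cars(cars)` writes the figure to `cars.png`. Drawing one series per group keeps the legend in step with the colors and markers.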
Adding reference lines
Reference lines can be a useful addition to a scatter plot. Suppose we need to know which cars can't go across an old wooden bridge that has a weight limit of 4,000 lbs. The scatter plot in Figure 10 now has a reference line with an annotation explaining its relevance.
Figure 11 shows the same scatter plot with labels for the four cars that can't go across the old bridge.
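The bridge example can be sketched in Python as a filter plus a horizontal reference line. The car labels and values are hypothetical. The 4,000 lb limit comes from the example above, and weight is assumed to be on the y-axis, as in the earlier figures.

```python
WEIGHT_LIMIT_LBS = 4000  # bridge limit from the example above

# Hypothetical records: (label, horsepower, weight in lbs).
cars = [
    ("Car A", 110, 2500),
    ("Car B", 150, 3200),
    ("Car C", 170, 3900),
    ("Car D", 190, 4050),
    ("Car E", 210, 4200),
    ("Car F", 300, 3600),
]

def over_limit(records, limit=WEIGHT_LIMIT_LBS):
    """Return the records whose weight exceeds the limit."""
    return [rec for rec in records if rec[2] > limit]

def plot_with_reference_line(records, limit=WEIGHT_LIMIT_LBS, path="bridge.png"):
    """Scatter plot with a horizontal limit line and labels on the heavy cars."""
    # Imported here so over_limit() works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter([hp for _, hp, _ in records], [w for _, _, w in records])
    ax.axhline(limit, color="red", linestyle="--")
    ax.annotate(f"Bridge limit: {limit:,} lbs",
                xy=(min(hp for _, hp, _ in records), limit),
                xytext=(0, 5), textcoords="offset points")
    # Label only the cars that cannot cross the bridge.
    for label, hp, weight in over_limit(records, limit):
        ax.annotate(label, xy=(hp, weight), xytext=(5, 5), textcoords="offset points")
    ax.set_xlabel("Horsepower")
    ax.set_ylabel("Weight (lbs)")
    fig.savefig(path)
    plt.close(fig)

print([label for label, _, _ in over_limit(cars)])
```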
Adding specification limits
Many situations have specification limits for variables. Consider the meat data from Figure 2. A buyer for school cafeterias is required to purchase meat with a minimum of 300 grams of sodium, a target of 450 grams, and a maximum sodium limit of 600 grams. Figure 12 shows a scatter plot with these specification limits.
With these lines added, it's now easy to see that there are four types of processed meat that can't be purchased for the school cafeteria. Labels and colors for these points, as shown in Figure 13, can be added to provide additional details. The buyer can share this graph to show why some meats are not an option.
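The check behind such a graph is a simple comparison against the limits. In the sketch below, the limits are the ones stated above, while the product names and sodium values are hypothetical. Treating values exactly at a limit as acceptable is an assumption.

```python
LOWER, TARGET, UPPER = 300, 450, 600  # sodium limits from the example above

def check_spec(value, lower=LOWER, upper=UPPER):
    """Classify a value against the limits (limits themselves count as within)."""
    if value < lower:
        return "below lower limit"
    if value > upper:
        return "above upper limit"
    return "within limits"

# Hypothetical products and sodium amounts.
sodium = {
    "Product A": 280,
    "Product B": 300,
    "Product C": 455,
    "Product D": 610,
    "Product E": 590,
}

for name, value in sodium.items():
    status = check_spec(value)
    print(f"{name}: {value} ({status}, {value - TARGET:+d} from target)")

# Products the buyer cannot purchase.
rejected = [name for name, value in sodium.items() if check_spec(value) != "within limits"]
print("rejected:", rejected)
```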
Scatter plot matrix
A scatter plot matrix can show how multiple variables are related. After plotting all the two-way combinations of the variables, the matrix can show relationships between variables to highlight which relationships are likely to be important. The matrix can also identify outliers in multiple scatter plots.
Figure 14 shows a scatter plot matrix for the data on different models of cars. The scatter plots use the same colors and markers from Figures 9-11. The first scatter plot in the leftmost column shows the relationship between Weight and Turning Circle. The upper and lower triangles of the matrix are mirrors of each other.
The matrix shows that all the two-way combinations of variables have an increasing relationship.
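A scatter plot matrix is built from every two-way combination of the variables. The sketch below enumerates those pairs and computes the correlation for each one, which is the number that could replace a panel in the upper triangle. The variable names match the ones used in this section, but the values are hypothetical.

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for six cars.
data = {
    "Weight": [2200, 2600, 3000, 3400, 3800, 4100],
    "Turning Circle": [34, 36, 38, 40, 42, 43],
    "Displacement": [90, 120, 150, 200, 280, 300],
    "Horsepower": [80, 100, 130, 160, 200, 190],
}

def pairwise_correlations(table):
    """Correlation for every unordered pair of columns in the table."""
    return {(a, b): pearson_r(table[a], table[b]) for a, b in combinations(table, 2)}

correlations = pairwise_correlations(data)
for (a, b), r in correlations.items():
    print(f"{a} vs {b}: r = {r:.2f}")
```

Four variables give six distinct pairs, which is why the upper and lower triangles of the matrix carry the same information.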
With JMP, it's possible to add more information to the scatter plot matrix, including histograms for each variable along the diagonal. It's also possible to replace the scatter plots in the upper triangle with the correlation between each pair of variables. The scatter plot matrix in Figure 15 shows these customizations. The legend at the right has a heatmap for the correlations, with dark red indicating a strong positive relationship between the two-way combinations of variables.
This matrix also shows possible outliers in the histogram for Displacement.
With JMP, even more information can be added to the matrix, such as density ellipses for each scatter plot to see outliers in multiple dimensions. Figure 16 demonstrates how selecting an outlier in one scatter plot highlights it in all the other scatter plots.
The scatter plot matrix in Figure 16 shows density ellipses in each individual scatter plot. The red ellipses contain about 95% of the data. It's possible to explore the points outside the ellipses to see if they are multivariate outliers. In Figure 16, the single blue circle that is an outlier in the Weight by Turning Circle scatter plot has been selected. This point is also an outlier in some of the other scatter plots but not all of them. In the Displacement by Horsepower plot, this point is highlighted in the middle of the density ellipse.
When the point is deselected, all points appear with the same brightness, as shown in Figure 17. The density ellipse for the Displacement by Horsepower scatter plot shows why possible outliers appear in the histogram for Displacement: there are several points outside the ellipse at the right side of the scatter plot. The colors reveal that all these points are from cars made in the US, while the markers reveal that the cars are sporty, medium, or large. Annotations explaining the colors and markers could further enhance the matrix.
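The idea behind a density ellipse can be sketched with the squared Mahalanobis distance. The sketch assumes a bivariate normal ellipse, which may differ in detail from what JMP draws. Under that assumption, a point lies outside the 95% ellipse when its squared distance exceeds the chi-square cutoff with 2 degrees of freedom, which equals -2 ln(1 - 0.95), or about 5.99. The data points are hypothetical.

```python
import math

def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample variances and covariance (n - 1 in the denominator).
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # Quadratic form using the closed-form inverse of the 2 x 2 covariance matrix.
        distances.append((dx * dx * syy - 2 * dx * dy * sxy + dy * dy * sxx) / det)
    return distances

def outside_ellipse(points, coverage=0.95):
    """Points outside the bivariate normal ellipse with the given coverage."""
    threshold = -2 * math.log(1 - coverage)  # chi-square quantile, 2 degrees of freedom
    return [p for p, d2 in zip(points, mahalanobis_sq(points)) if d2 > threshold]

# Hypothetical data: ten points near the line y = x, plus one point far from it.
points = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5),
          (7, 8), (8, 7), (9, 10), (10, 9), (3, 12)]

print("outside the 95% ellipse:", outside_ellipse(points))
```

The point (3, 12) is not extreme on the x-axis alone, which is the sense in which a multivariate outlier can be missed when each variable is viewed separately.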
For your data, you can use a scatter plot matrix to explore many variables at the same time.
Scatter plots and types of data
Continuous data: appropriate for scatter plots
Scatter plots make sense for continuous data since these data are measured on a scale with many possible values. Some examples of continuous data are:
- Age
- Blood pressure
- Weight
- Temperature
- Speed
Categorical or nominal data: use bar charts
Scatter plots are not a good option for categorical or nominal data, since these data are measured on a scale with specific values. Use bar charts instead.
With categorical data, the sample is divided into groups and the responses might have a defined order. (Categories with a defined order are often called ordinal data.) For example, in a survey where you are asked to give your opinion on a scale from “Strongly Disagree” to “Strongly Agree,” your responses are categorical.
For nominal data, the sample is also divided into groups but there is no particular order. Country of residence is an example of a nominal variable. You can use the country abbreviation, or you can use numbers to code the country name. Either way, you are simply naming the different groups of data.
You can use categorical or nominal variables to customize a scatter plot. You can assign different colors or markers to the levels of these variables.