Scatter diagrams: a scattered approach? Steve Daum shows how this simple tool establishes support for understanding the correlations (and non-correlations) among factors.

In recent work, I’ve been thinking about the use and application of scatter diagrams. You have probably seen these. Here are some examples:

When you look at a scatter diagram, you are testing a theory. Statisticians call this *testing a hypothesis*. These scatter diagrams compare two variables: one on the horizontal axis (x-axis) and another on the vertical axis (y-axis). The **theory** you are testing is that **there is no significant correlation** between these two variables.

The quick answer to the question *Is the theory correct?* can be found by looking at the slope of the line. The flatter or more horizontal the line is, the more comfortable you can be that your theory is correct; that is, there is no significant correlation between these two variables. A steeper slope, either downward or upward, indicates that your hypothesis is **not** correct; that is, there **does** appear to be a correlation between these variables. However, like almost everything in statistics, the *quick* answer does not tell the whole story.

There are many websites that provide access to interesting data. When learning a new statistical or analytical tool, it can be enlightening to use real data for experimenting as you learn. The Zillow website provides data about real estate. Listings of properties for sale and properties that have sold contain a rich collection of features to be studied. Since many people can relate to real estate, I’ve recently been using data from Zillow to study scatter diagrams.

Here is data from Zillow about recent properties sold in the area:

I used 25 recent home sales in the area and selected the columns, or *features*, shown above to study with scatter diagrams. Here is my first diagram, comparing selling price to square feet:

In terms of these two variables, the *theory* we are testing is this: **there is no correlation between square feet and selling price.** Since there **is** a clear slope to this line, we can be comfortable that this theory is wrong. In other words, there **is** a correlation between these variables. Also, because the line slopes up and to the right, the correlation is positive. As square feet increase, selling price increases, which makes sense intuitively.

The next scatter diagram compares selling price to number of bedrooms:

This looks a little different because number of bedrooms is a whole number, but the diagram is testing the same theory: that there is no correlation between selling price and number of bedrooms. Once again, the theory looks wrong, because we see a slope to the line, indicating that there is *some* correlation. As the number of bedrooms increases, so does the selling price. This is why the line slopes up and to the right.

Imagine for a minute that we are looking instead at **number of expensive repairs needed.** In that case, as the number of repairs needed increases, the selling price might decrease, and the line would slope downhill to the right. There would still be a correlation, but it would be a **negative** correlation.

Both our real estate scatter diagrams have shown a positive correlation:

- A positive correlation between square feet and selling price
- A positive correlation between number of bedrooms and selling price

A natural follow-up question is this: **which** of these two variables has the **stronger** correlation with selling price? There are different ways to answer the question, but these diagrams use the **Correlation Coefficient.** In statistical references this is often referred to as the *r* statistic. It is calculated such that the answer is always between -1 and 1. Values near zero show little or no linear correlation. Values near 1 suggest a strong positive correlation. Values near -1 suggest a strong negative correlation.
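The *r* statistic described above can be computed by hand in a few lines. The following is a minimal sketch, not the method SQCpack uses internally; the function name `pearson_r` and all of the sample numbers are made up for illustration and are not the Zillow sales behind the diagrams.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration data (NOT the sales used in the article)
square_feet = [900, 1100, 1275, 1500, 1800, 2100]
price = [70000, 95000, 98000, 140000, 150000, 205000]
repairs_needed = [5, 4, 4, 2, 1, 0]

r_size = pearson_r(square_feet, price)        # positive: price rises with size
r_repairs = pearson_r(repairs_needed, price)  # negative: price falls as repairs rise
print(f"square feet vs. price: r = {r_size:.2f}")
print(f"repairs vs. price:     r = {r_repairs:.2f}")
```

The sign of the result matches the direction of the line on the scatter diagram, and the magnitude says how tightly the points cluster around it.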

Based on the Correlation Coefficient, **square feet** is more strongly correlated with selling price than **number of bedrooms** is, because it has the higher Correlation Coefficient of 0.84 (closer to 1).
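One common way to check the "no significant correlation" theory more formally is to convert *r* into a *t* statistic, t = r·√(n − 2) / √(1 − r²), and compare it with a critical value. The sketch below uses the article's r = 0.84 and n = 25; the 5% two-sided significance level and the use of this particular test are assumptions on my part, and the test itself assumes the data are roughly bivariate normal.

```python
import math

def t_statistic(r, n):
    """t statistic for testing 'no correlation', given r and sample size n."""
    if n < 3 or not -1 < r < 1:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

r = 0.84  # from the article: square feet vs. selling price
n = 25    # from the article: number of home sales

# Assumption: two-sided test at the 5% level. 2.069 is the standard
# t-table critical value for n - 2 = 23 degrees of freedom.
T_CRITICAL = 2.069

t = t_statistic(r, n)
reject = abs(t) > T_CRITICAL
print(f"t = {t:.2f}")
print("reject 'no correlation'" if reject else "cannot reject 'no correlation'")
```

With these inputs the statistic comes out at about 7.4, well above the critical value, which agrees with the visual impression from the slope.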

What about that line? As you look at these diagrams you might be curious about how the line is calculated and plotted. The line is called the **line of best fit.** It is calculated using simple linear regression, which minimizes the sum of the squared vertical distances between the plotted points and the line. For any data set you can plot a line of best fit. Best does not always denote *good* in this case. When many of the plotted points are a large distance from the line, it suggests that the line is not a **great** fit; it is just the *best* fit that can be calculated with simple linear regression.

In the real estate scatter diagrams above, the equation for the line of best fit is shown above the chart. You can use this to *estimate* or *predict* selling price based on arbitrary values on the x-axis. For example, say your house is 1275 square feet. Using the regression equation, you can estimate a selling price as follows:

y = 114.79(x) – 45418

or

y = 114.79(**1275**) – 45418

which is

y ≈ 100,939

This is equivalent to drawing a line up from **1275** on the x-axis until you get to the line of best fit and then drawing a horizontal line left over to the y-axis.
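The estimate above, and the least-squares calculation behind a line of best fit, can be sketched as follows. The slope and intercept constants are the ones printed in the article's equation; the helper names `fit_line` and `predict` and the tiny data set passed to `fit_line` are hypothetical and only there to show the mechanics.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2 points)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Read a y value off the fitted line for a given x."""
    return slope * x + intercept

# Coefficients as printed in the article's regression equation
ARTICLE_SLOPE = 114.79
ARTICLE_INTERCEPT = -45418

estimate = predict(1275, ARTICLE_SLOPE, ARTICLE_INTERCEPT)
print(f"estimated selling price: {estimate:,.0f}")  # about 100,939

# Hypothetical points lying exactly on y = 2x + 1, to show the fit itself
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

Keep in mind that a prediction is only as trustworthy as the fit: for x values far outside the range of the 25 sales, the line may not describe real prices at all.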

I used real estate examples because most people have good intuition about how real estate works. However, in your analysis for quality control and quality improvement, there are many applications for scatter diagrams. Here are a few examples:

- Is there a correlation between line speed and number of defects?
- Is there a correlation between part supplier and number of failures?
- Is there a correlation between call center and customer satisfaction scores?
- Is there a correlation between product feature selection and repeat purchases?

In most articles about scatter diagrams you will be reminded of this: **correlation is not causation.** This is true and you should be careful with the tool as it is possible to show correlation between totally unrelated variables. However, the scatter diagram is a simple and useful tool that can help you understand systems that lead to quality.

The scatter diagrams for this article were produced using *SQCpack*.

Good example for understanding the scatter diagram and its use.

Badri

Good examples, but I like the conclusion more: “correlation between totally unrelated variables”.

I look forward to examples of related correlations.

Tatyana.