Data is flowing into spreadsheets and database tables at an astonishing rate. It is stacking up in personal computers, laptops, tablet computers, database servers, in the cloud, and even in smart phones. In these growing mountains of data there is insight to be found–insight that can lead to gains in quality and ultimately in profitability. It is increasingly important when we ask questions of this data, and that we do so carefully and with knowledge about how that answers get generated and how they come back to us from the data source.
The big picture workflow of extracting value from a set of data might be described like this:
- Formulate a query
- Execute the query
- Get the results, typically a set of data rows
- Ponder these results using analytical and charting techniques
- Form a theory about how to make things better
- Implement/test your theory
- Repeat starting at step 1 to see if your theory worked
Typically, in step 3 we are getting a subset of the entire data set. Often, the data set will be too large to analyze en masse. In some cases, it is easy to know when we get back the correct data. For example, assume that our query asks for all the records for a specific customer. Just by looking at the results, we can determine if we got what we expected. In other cases, it can be more difficult to know if the results will allow us to analyze the data the way we intend.
The rest of this article will focus on one data analysis problem: the handling of zeros. We have seen this pattern surface among many of our customers who analyze data on a regular basis.
Imagine that you are responsible for analyzing and reporting on your customer satisfaction survey data. Each customer survey is tallied into a final score where 100 is really great and zero is really bad. These satisfaction surveys arrive intermittently throughout the month and flow into a database table. Here is a look at part of this table:
Management leaders have requested that you track a metric for number of satisfied customers per day. Their operational definition for a satisfied customer is a survey score of 80 or higher. The highlighted rows below show a few surveys that qualify.
If you were to do a query to aggregate these few rows into a daily count, you might end up with this data:
You might be tempted to conclude that we are averaging 1.5 satisfied customers per day. Before you present this to your stakeholders, step back and think about the average. Looking at the table above, how many satisfied customers did we have on July 15? You can immediately see that our average is overstated because we are not counting the days when we have zero satisfied customers.
In this case the results we got back were accurate. Each row does represent what happened on that particular day. However, the way we queried the data does not take into account the zero days and this can affect our analysis.
Typically, when we do an aggregation query on a table like this, the query will look something like this:
SELECT Min(Date) as Date, Count(Date) as NumSatisfied FROM SatisfactionScores
WHERE SurveyScore >= 80 GROUP BY Date
This is a single table query that uses only the SatisfactionScores table.
One way to solve the problem of not counting zero days is to do a special type of query where two tables are joined together to create a result. One-to-many relationships are common in many databases. Think of a database containing invoices. There might be one table for invoices and another table for invoice-details. One record in the invoice table might be related to many records in the invoice-details table. Often, these two tables will be queried using some for of join query.
Here is how we might use a join query to solve our problem of not representing the zero days. Imagine that we create a table named AllDates. It will contain a single date column. We will fill the table with a row for date from say January 1, 2000 up to December 31, 2050. Here is just a section of that table:
Next, we will join this table to our original data table using a query. When two tables are joined in a query, a column must be specified that will be used to join the two tables. In this case, we use the date column. Here is a visual representation of what we want to do:
The table on the left is the AllDates table. The table on the right is our SatisfactionScores table. We want a query that will return one row for each date in the table on the left. We want this row even if there are no rows for that date in the table on the right. In structured query language (SQL) this type of query is sometimes referred to as a left-join; but don’t worry about the geek speak, it sounds more complex than it really is.
Often, when doing a query like this it is a good idea to create an intermediate query to make things simpler. In our example, we will create a query named SatisfiedCustomers that looks like this:
SELECT * FROM SatisfactionScores WHERE SurveyScore>=80
This just makes our final query a less complicated. Next, we want to create a query that joins together the AllDates table and this SatisfiedCustomers query. It looks like this:
MIN(AllDates.Date) as TheDate,
IS NULL,0,COUNT(SatisfiedCustomers.SurveyDate))AS NumSatisfied
LEFT JOIN SatisfiedCustomers ON SatisfiedCustomers.SurveyDate=AllDates.Date
GROUP BY AllDates.Date
The final line of the query (the group by clause) ensures that we will get one row returned for each date. Since we use group by, we also have to use aggregate functions such as MIN and COUNT in our SELECT statement.
Here are the partial results of this query:
As you can see, we now have zero for July 15. If you were to show this data on a run chart or control chart where the mean line is represented, it would accurately reflect the average number of satisfied customers per day because it accounts for days with zero satisfied customers. We hope that this is not a frequent occurrence!
The query above does look complicated. However, if you understand this general pattern, of joining your data to an AllDates table, you can easily apply this to your analysis scenarios. In many cases, you should be able to use the query above just substituting your specific table names and column names where appropriate.