A Primer on Spreadsheets and Visualization
This is part 1 in a series about spreadsheets, data, and visualization. In this part, we’ll cover the basics of spreadsheet applications and an introduction to selecting dataviz methods.
I’ve yet to run into a business that doesn’t use spreadsheets extensively for managing information. Over the years, Microsoft Excel has become the utility belt application of choice not only for working with data, but even for developing applications and automating business processes (for good or for ill).
Desktop spreadsheet applications have been around for decades, with Excel being the current giant of the industry. Long-time computer users might also remember VisiCalc, Quattro, or Lotus 1-2-3. People have been using them since the beginning of computing. As a result, they’ve become a de facto tool for doing anything and everything with data.
For some spreadsheet users, formulas and functions for statistics and financial calculations are as deep as they get into working with tabular data. Many others take their spreadsheets to another level with multiple cross-referenced sheets, pivot tables, complex calculations, macros, and more. As useful as spreadsheets can be for manipulating data, it’s hard to look at thousands of rows and columns of numbers without extracting additional meaning from the data through visuals. Let’s explore the concept of charting as a basic method of data visualization.
Making Visuals of Data
Converting tabular data into visuals is one of the best ways to surface the underlying meaning out of dense information. Effective communication of data through visualization is far from trivial. Every day we all see charts and graphs that butcher raw data or miscommunicate (or undercommunicate) the meaning of data. You can go to an entire graduate study program just to learn how to effectively summarize and communicate with data. It’s that complicated.
A few tips and tricks can help you decide what data to include in your charts and graphs, types of charts to use, and what combination of techniques clearly communicates to your consumer. Based on what you’re looking to accomplish, you can start with your end goals in mind to work backwards in determining the best method for visualization. To start with, you should think about the primary focus of your chart:
- Trends - do you want to show change over a time series?
- Distribution - do you want to see patterns, common occurrences, and outliers?
- Relative comparison - do your data fall into a discrete number of categories you want to see side-by-side?
- Composition - do you need to view a breakdown of percentages?
Take a look at this excellent flow chart for determining which style is best:
Occasionally, you’ll have multiple of these objectives, in which case it’s sometimes best for clarity to have two separate charts, but also some chart styles are good at achieving multiple communication goals. Let’s look a little deeper into some of the common types and review some examples.
Column and Bar Charts
When you need simple side-by-side item comparisons, this is a simple approach to showing quantities of discrete items. Bar charts (horizontal) are good when you’ve got more than 10 or 15 items to display because they can be stacked in vertical space. Column charts are good if you want to see relative comparison with time as a variable. My example shows cities in China ranked by total population. With a bar chart, I could even expand to show the top 50 while keeping it readable:
Pie charts are the most common way to display composition of a whole. Sections of a circle visualize relative percentages nicely, assuming you don’t have too many categories. More than 15 or so categories can make it difficult to properly label your pie chart and convey the total composition of items. This example shows the 115 active volcanoes in the United States aggregated by type:
Highlighting trends over time is best done with simple line graphs, and you’ll often see things like month-to-month revenue charts, rainfall by year, or website visits by day using a line format to show the pattern of movement over time periods. You can also include multiple variables along common time scales to compare relative magnitude between several lines.
Showing distribution is a job for a histogram, particularly useful with large amounts of data. This type of graph helps not only to show distribution but also frequency and where you might have outliers in your data. They work for well distributions of a single variable. My example here uses the previous dataset, this time with over 1500 volcanoes to see distribution of volcanoes by elevation:
If each data point has two related variables you’re trying to connect visually, a scatter plot does this by showing the two values on separate axes, with the data scattered across the plane to visualize distribution in two dimensions. They’re great for large datasets with many records. You can see the correlation between the two variables spatially. In this example, we can look at how time spent studying for exams affects test scores in a classroom:
These are just a handful of charting techniques you can use to display your data. Each variety of chart can support other more complex additions to throw in additional data or related variables. Bar and column charts can be “stacked” to add a compositional dimension, trend lines can be drawn within to see trajectories or averages, and bubble charts let you visualize an additional variable with points on scatter plots. The optimal method for your chart needs isn’t always obvious — usually you’ll need to weigh the tradeoffs between different methods to find the best fit. This chart chooser tool can help you get ideas about the different ways to communicate with your given dataset.
Stay tuned for part 2 later in the week where we’ll introduce Google Sheets (our favorite spreadsheet tool) and the myriad of things you can do with it.